Spelling Checkers

James L. Peterson

This paper has been published. It should be cited as

James L. Peterson, ``Spelling Programs'', Encyclopedia of Computer Science, Third Edition, Van Nostrand Reinhold Company, New York, 1993, pages 1266-1268.

Computers are very useful for word processing (q.v.). They are used to input, store, edit, format, and print text files ranging in size from short notes to multi-volume books. In addition to these conventional forms of word processing, computer programs can provide more advanced word processing assistance, such as spelling checking, spelling correction, and grammar checking.

The general operation of a spelling checker is simple: it checks each word in a document or file for correct spelling. Allegedly incorrect spellings are reported to the human user who can then correct the errors. A spelling corrector checks each word (just like a spelling checker) and in addition will try to suggest the correct spelling for each misspelled word that is found.

Note that spelling checkers only guarantee that each word is some correctly spelled word, not necessarily the one you meant. If ``or'' is mistyped as ``of'' or ``affect'' is misspelled as ``effect'', a spelling checker will not report an error. Detecting these kinds of errors would require a much more complicated program, called a grammar checker.

Spelling checkers are used in a number of different ways. Possibly the simplest is the stand-alone batch program. A batch program reads the (input) document and produces an (output) list of any possible spelling errors. The user must find and correct these errors in the document. An interactive spelling checker reads a document and presents each spelling error to the user as it is found. The user can see each error in context and either change it immediately or leave it alone. A word processing system may incorporate spelling checking (or correction) directly into its processing, either in response to a menu option or continually as text is entered.

How does a spelling checker know if a word is correctly spelled? Most checkers use a word list to define the set of correctly spelled words. The word list may be the list of all words in a dictionary (without the definitions) or may be accumulated from existing documents. Most spelling programs call their word lists ``dictionaries''. The list of words is generally stored in a file. A word is considered correctly spelled if and only if it is found in the speller's word list. The main technical problem, particularly for an interactive program, is to search the list as quickly as possible.

Differing search and data structure techniques are used for differing environments. A batch checker, for example, may form an alphabetically-sorted list of all words in an input document and then make one pass over a similarly sorted word list to check for misspelled words. A speller with large amounts of main memory can keep the entire word list in that fast memory in the form of a hash table and use standard hash table search algorithms (see Searching). A system with a fast disk might keep its word lists on disk with an in-core index and a cache of the most frequently referenced disk blocks.

The correctness of a spelling checker is determined by its word list. As long as its word list has no incorrectly spelled words, a checker will never ``miss'' an incorrectly spelled word that is not, coincidentally, some other legal word. On the other hand, spellers often report correctly spelled words as possible spelling errors. These may be proper names, technical terms, or uncommon words which are not in the system word list. Most systems allow a user to augment its main word list with local auxiliary word lists for special subjects, authors, or documents. Doctors and lawyers, for example, generally use extensive auxiliary word lists designed for the specialized vocabularies of their fields.

Some systems, upon flagging a suspect word, allow the user several options including: (a) correct the word, (b) add the word to the main word list, (c) add the word to an auxiliary word list, or (d) add the word to a transient word list that endures only for the duration of the document being checked.

A very large word list might seem desirable to avoid having a spelling checker incorrectly report correctly spelled words as possible errors. However, a very large word list tends to include unusual and infrequently used words. This increases the chance that a word will be misspelled as some other word and not be caught by the checker. The appearance of ``dhow'' in text might indicate that the author is writing about an Arab boat, but more likely signals a typographical error for ``show''. In general, word lists should be kept reasonably small, in the range of 20,000 to 100,000 words even though there are over a half-million English words (counting inflections).

One approach to keeping the word list short is to notice that many words are derived from a base word by the addition of common suffixes and prefixes. Some checkers keep only the base words in their word lists. If a suspect word is not in the word list, an attempt is made to remove suffixes and prefixes to find the base word. If the base word is in the word list, the suspect word is accepted as correctly spelled. Note that this approach may allow incorrect spellings to escape detection, such as if ``design'' is misspelled ``desing'' (which can be processed as ``de + s + ing''

A spelling corrector is invoked when an incorrectly spelled word is found. Its problem is to produce a list of possible correct spellings for the error. For correction, the set of correctly spelled words is thought of as a set of points in a multi-dimensional space. The corrector tries to find the nearest neighbor or neighbors of the spelling error in that space. If an error produces one candidate correction which is much closer to the error than other possible corrections, the speller may suggest an automatic correction.

The success of a spelling corrector depends largely upon the source of the spelling errors and the methods used to find nearest neighbors. For example, many systems assume that spelling errors occur because of one of the following four types of errors:

a. one extra letter in the word (``feeel'').
b. one missing letter in the word (``fel'').
c. one wrong letter in the word (``feal'').
d. two adjacent letters are transposed (``fele'').

These types of errors may account for 80% to 90% of the typing errors in a document.

Another source of spelling errors is the difference between spelling and pronunciation -- a word like ``tough'' may be spelled ``tuff''. The most common approach for correcting these types of errors is to map the error onto a sound-based encoding. Each word in the word list is also mapped and candidate corrections with the same sound as the error are generated. The proper use of appropriate data structures and search algorithms is particularly important in this case to provide adequate performance.

Another common typing error is to repeat an entire word, commonly at the end of one line and the beginning of the next line, creating such obvious errors as ``a a'' and ``the the''. Some checkers check for duplicate adjacent words, but would then report spurious errors in those sentences with repeated words, such as ``I knew that that boy had had the measles.''

Note that this type of error is not a spelling error, since each word is individually a correct word. In addition to spelling and typing errors, there are errors of grammar. Grammatical errors are defined by incorrect groups of words, not individual words. Grammar checkers try to find errors in sentences or phrases rather than separate words. While it is possible to look for certain simple errors (such as two identical words in a row, incorrect use of `a' or `an', capitalization errors, or use of any of a list of incorrect word combinations), the general problem of detecting true errors of grammar is still a research problem. We do not yet have computer programs that ``understand'' the structure of sentences. Understanding the structure of sentences would allow the detection of errors such as a sentence with no verb or a plural subject with a singular verb (``they is'').