A Syntax Checker for Written English




James L. Peterson and Elaine Rich




Department of Computer Sciences
The University of Texas at Austin
Austin, Texas 78712




December, 1980


Increasing use is being made of computer-based systems for the preparation of written material. Word and text processing systems allow easy entry of text, which can be modified, corrected, and updated with a text editor. Reports and papers can be formatted by a text formatter, which introduces line justification, spacing, and paging. Filing systems allow documents to be stored and retrieved with ease.

A more sophisticated use of computers in text processing is the use of a spelling checker. A spelling checker is a program that searches a document and helps to identify incorrectly spelled words. This raises the quality of the document, making it easier to read and understand, by eliminating misleading and annoying spelling (or keyboarding) errors.

The technology of a spelling checker is relatively simple: a large dictionary of correctly spelled words is kept in computer memory. Each word in a document is then searched for in the dictionary. If it is found, it is correctly spelled; if it is not found, the user is given the opportunity to decide whether the word should be added to the dictionary or whether it is an error.
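The lookup just described can be sketched in a few lines (here in Python); the word list and test sentences are illustrative stand-ins, not a real dictionary:

```python
# Minimal sketch of the dictionary-lookup spelling check described above.
# The dictionary contents are an illustrative stand-in.
import re

dictionary = {"he", "went", "to", "too", "the", "house"}

def check_spelling(text):
    """Return the words of `text` not found in the dictionary."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in dictionary]

print(check_spelling("He went too the house"))   # every word is in the dictionary
print(check_spelling("He wnet to the house"))    # "wnet" is flagged
```

Note that the first sentence passes cleanly, which is exactly the limitation taken up below.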

This process is an example of work-sharing between a human user and a computer, in which the computer does those tasks that it does best (fastest, most accurately, most consistently) and the person does the tasks he or she can do better. As computer technology improves, it becomes feasible to shift more and more work onto the computer. This proposal consists of one more step in this direction, a step motivated by the limitations of current spelling checkers.

A straightforward spelling checker can only assure that every word in a document is some correctly spelled word, not necessarily the desired correctly spelled word. For example, the following sentence is perfectly acceptable to a spelling checker:

He went too the house.

Each word is correctly spelled, although we clearly see that "too" should be "to". This sort of error must still be caught by a human reader. The reason a spelling checker cannot detect errors such as this is that it looks at each word in isolation; it has neither context nor semantics. To catch such errors will require more knowledge about English and a more complex program.

We propose to attempt to create such a program: a syntax checker. This program would read through a document, flagging possible syntax errors. Like a spelling checker, the syntax checker would need access to a dictionary. It will proceed by attempting to parse the input sentences one word at a time, looking up words in the dictionary as necessary to determine their parts of speech. This will require that the dictionary contain not only a list of words but also their possible parts of speech. Syntax checking and spelling checking can go on in tandem, thus allowing a single dictionary access per word.
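Such a dictionary might be sketched as follows; the entries and tag names are our own illustrative choices. A single access both checks spelling and yields the word's possible parts of speech:

```python
# Sketch of a dictionary serving spelling and syntax checking in tandem:
# one lookup per word yields its possible parts of speech, or flags a
# possible misspelling. Entries and tag names are illustrative.
dictionary = {
    "he":    {"pronoun"},
    "went":  {"intransitive-verb"},
    "to":    {"preposition"},
    "too":   {"adverb"},
    "the":   {"article"},
    "house": {"noun", "transitive-verb"},
}

def lookup(word):
    """One dictionary access: spelling check plus parts of speech."""
    tags = dictionary.get(word.lower())
    if tags is None:
        return None          # possible spelling error; ask the user
    return tags              # hand the tag set to the parser

print(sorted(lookup("house")))   # ['noun', 'transitive-verb']
print(lookup("huose"))           # None -> possible misspelling
```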

A syntax checker such as this could detect such errors as the example above by reducing it to a statement of the form:

pronoun intransitive-verb adverb the noun

or

pronoun intransitive-verb too the noun

neither of which would be a legal syntactic construct. Exactly what the form would be is a topic for research. Some words like "the" and "which" are so special in the roles they can play in sentences that they must be treated as special cases by any parser. Words like "too" are on the borderline between words like "the" and words like "cat", which, except for unusual cases, can simply be regarded as members of a large class like "noun".
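The reduction above can be carried out mechanically: take the cross product of each word's possible tags and test every resulting tag sequence. A sketch, with a hand-made tag table and a single legal pattern standing in for a real grammar:

```python
# Enumerate every part-of-speech interpretation of a sentence and test
# each against a set of legal tag sequences. The tag table and the
# pattern set are illustrative, not a real grammar.
from itertools import product

tags = {
    "he":    {"pronoun"},
    "went":  {"intransitive-verb"},
    "to":    {"preposition"},
    "too":   {"adverb"},
    "the":   {"article"},
    "house": {"noun"},
}

legal = {
    ("pronoun", "intransitive-verb", "preposition", "article", "noun"),
}

def acceptable(sentence):
    """True if ANY interpretation of the sentence is a legal construct."""
    words = sentence.lower().rstrip(".").split()
    for interpretation in product(*(tags[w] for w in words)):
        if interpretation in legal:
            return True
    return False

print(acceptable("He went too the house"))   # False: no interpretation is legal
print(acceptable("He went to the house"))    # True
```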




Table 1. A first classification of words into parts of speech.

noun
verb
    transitive
    intransitive
adjective
adverb

article
conjunction
preposition
pronoun
interjection

The parse may be very difficult. One problem is simple combinatorics: each word may serve as any of several parts of speech. The checker must choose a correct interpretation of each word. We realize that, without understanding the meaning or semantics of each word, we cannot make the "correct" choice for each word. However, we can determine (by exhaustively trying all possible combinations if necessary) whether there exists any interpretation which is correct syntax. We will accept a statement as syntactically correct if any acceptable parse exists. We thus bypass many of the problems that have troubled previous English parsers. For example, we would not care whether the correct parse of

I saw the boy in the park with the telescope.

should be
I saw [the boy [in the park] [with the telescope]]

or
I saw [the boy [in the park [with the telescope]]]

or
I saw [the boy [in the park]] [with the telescope]

We realize that this may allow some incorrect statements to be accepted. However, to do more would require a program which "understood" the meaning of the sentence, a most difficult problem currently under investigation by many AI researchers. We expect, however, that our approach will detect a significant number of typical errors. The amount of success such an attempt achieves will depend both on the overall accuracy of the grammar rules we develop and on the way the rules apply to the sentences people actually write.

Several problems must be addressed in the design of such a program: human interface, dictionary structure, parsing method, and especially grammar definition. Some of these are classical systems problems (human interface, dictionary), while others (grammar definition) involve research areas in linguistics and artificial intelligence. The major problem would seem to be the definition of an acceptable grammar: there is no complete, universally accepted definition of English grammar.

The definition of the grammar will be crucial to the construction of the syntax checker. At one extreme, one could allow any sequence of words to be a sentence (sentence ::= word*); at the other extreme, one could define separate productions for each existing English sentence. In between lies the more reasonable definition in terms of a few (on the order of 100) productions concerning the definitions and use of common syntactic constructs (noun phrase, verb phrase, ...). These rules will need revision as experience is obtained. Thus it seems that an easily changed, perhaps table-driven, parser is desirable. Augmented Transition Networks (ATNs) seem to be a reasonable choice.
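A table-driven recognizer over such productions might look like the following CYK-style sketch. The handful of rules and the pre-tagged input are illustrative stand-ins for the larger production set envisioned here; note that the recognizer accepts a sentence as soon as any parse exists, without choosing among them:

```python
# CYK-style recognizer: accepts a tag sequence if ANY parse exists under
# a small table of productions. Rules and tags are illustrative stand-ins
# for the ~100 productions the proposal envisions.
binary = {                        # productions A -> B C, keyed by (B, C)
    ("NP", "VP"):        {"S"},
    ("article", "noun"): {"NP"},
    ("verb", "NP"):      {"VP"},
    ("prep", "NP"):      {"PP"},
    ("NP", "PP"):        {"NP"},
    ("VP", "PP"):        {"VP"},
}
unary = {                         # promotions A -> B
    "pronoun": {"NP"},
    "verb":    {"VP"},
}

def close(cats):
    """Apply unary promotions until no new category appears."""
    cats = set(cats)
    changed = True
    while changed:
        changed = False
        for c in list(cats):
            for p in unary.get(c, ()):
                if p not in cats:
                    cats.add(p)
                    changed = True
    return cats

def accepts(tag_seq):
    """True if some parse of the tag sequence derives a sentence S."""
    n = len(tag_seq)
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, t in enumerate(tag_seq):
        table[i][i + 1] = close({t})
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for b in table[i][k]:
                    for c in table[k][j]:
                        table[i][j] |= binary.get((b, c), set())
            table[i][j] = close(table[i][j])
    return "S" in table[0][n]

# "I saw the boy in the park with the telescope", pre-tagged:
sent = ["pronoun", "verb", "article", "noun",
        "prep", "article", "noun", "prep", "article", "noun"]
print(accepts(sent))                         # True: some parse exists
print(accepts(["article", "verb", "noun"]))  # False
```

The ambiguous prepositional-phrase attachments in the example all succeed, so the sentence is accepted without the checker ever committing to one parse.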

One advantage of a definable grammar is the ability to produce grammars which are progressively more sophisticated and complex, in order to check for more and more errors. For example, many of the problems in written English involve the use of mismatched types. Nouns and verbs can be typed according to many features: gender (male/female), number (singular/plural), tense (past, present, future), case (nominative, objective), and so on. These types can be combined only in defined ways. One common problem is number agreement: use of a singular noun with a plural verb (or vice versa). An interesting empirical question to be addressed by this research is that of determining the size of the grammar required to catch most of the errors people actually make. We will have to experiment with these and other features to determine when additional rules cease to provide significant increases in accuracy.
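Number agreement, for instance, can be expressed as a feature check attached to the subject-verb production; the feature scheme and the tiny lexicon below are our own illustration:

```python
# Feature-agreement sketch: flag a singular noun paired with a plural
# verb (or vice versa). The lexicon entries are illustrative.
lexicon = {
    "dog":  ("noun", "singular"),
    "dogs": ("noun", "plural"),
    "runs": ("verb", "singular"),
    "run":  ("verb", "plural"),
}

def number_agrees(subject, verb):
    """True if the subject noun and the verb agree in number."""
    _, subj_num = lexicon[subject]
    _, verb_num = lexicon[verb]
    return subj_num == verb_num

print(number_agrees("dog", "runs"))   # True
print(number_agrees("dogs", "runs"))  # False: plural noun, singular verb
```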

A syntax checker such as the one we are describing would be useful for correcting the ordinary typing and grammar mistakes that we all make every day. It could be of even more benefit, however, to authors whose native language is not English, particularly when the native language does not include certain constructs (such as articles). The problems of these authors are evident in the University environment in the papers and theses of foreign students, but they also arise in papers presented at international conferences and in multinational business environments.

A couple of possible obstacles to the feasibility of a practical syntax checker are not as serious as they might at first appear. One problem, particularly with technical writing, is that new words appear considerably more often than do new dictionaries. So the syntax checker may look up a word to determine its part of speech, only to discover that it is not in the dictionary. But of the major parts of speech, only four (noun, verb, adjective, and adverb) ever acquire new words. So in the worst case, only four possibilities would have to be considered. Often a simple morphological analysis will narrow the options even further. So the problem of unknown words is not likely to be as serious as it might at first seem.
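Such a morphological analysis can be sketched as a suffix test over the four open classes; the suffix table below is a rough illustration, and its job is only to narrow the candidate set, never to decide outright:

```python
# For a word missing from the dictionary, guess which of the four open
# classes (noun, verb, adjective, adverb) it could belong to from its
# suffix. The suffix table is a rough illustration, not real morphology.
OPEN_CLASSES = {"noun", "verb", "adjective", "adverb"}

SUFFIX_HINTS = [
    ("ly",   {"adverb"}),
    ("tion", {"noun"}),
    ("ness", {"noun"}),
    ("ize",  {"verb"}),
    ("ous",  {"adjective"}),
    ("able", {"adjective"}),
]

def guess_classes(word):
    """Narrow an unknown word to a subset of the open classes."""
    for suffix, classes in SUFFIX_HINTS:
        if word.endswith(suffix):
            return classes
    return OPEN_CLASSES        # worst case: all four remain possible

print(guess_classes("reentrantly"))   # {'adverb'}
print(sorted(guess_classes("frob")))  # all four classes remain possible
```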

Another difficulty with technical text is that it often contains many non-sentences in such structures as enumerations of things, section headings, and figure captions. Here we may be aided by one recent trend in the development of text processing systems. In at least one such system (Scribe) all such segments would be demarcated by their function (such as heading or list) so that they can be correctly formatted by the system. Using that same information, the syntax checker could discover that those segments should not be analyzed as sentences.

We believe that a syntax checker is now feasible and would be of great use in improving written material, catching syntax errors which would slow down or confuse the reader and hinder effective communication. In addition, once the syntax checker is working properly, it should be possible to extend it to check certain stylistic constructs of proper usage, such as those found in Strunk and White's Elements of Style.

We are currently in the process of creating a dictionary, complete with parts of speech information, based upon a tape of Webster's Seventh Collegiate Dictionary. We expect to begin shortly to construct the parser and to experiment with various grammars.