Copyright © 1982 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org.
ACM did not prepare this copy and does not guarantee that it is an accurate copy of the author's original work.
Computer-based document preparation systems provide many aids to the production of quality documents. A text editor allows arbitrary text to be entered and modified. A text formatter then imposes defined rules on the form of the text. A spelling checker ensures that each word is correctly spelled. None of these aids, however, affect the meaning of the text; the document may be a well-formatted, correctly spelled set of gobbledygook.
The general problem of checking correct syntax and semantics is still very much a research problem. However, one form of assistance in producing a readable document can be very easily provided now: a readability index. A readability index is a measure of the ease (or difficulty) of reading and understanding a piece of text. Several readability formulas to compute readability indexes have been defined and are in fairly wide use.
Readability formulas were originally developed mainly by educators and reading specialists. Their primary application was in defining the appropriate reading level for text books for elementary and secondary schools. More generally, readability indexes can be used to assist a writer by pointing out possible grammatical and stylistic problems and by helping to maintain a consistent audience level throughout a document.
Defining and selecting a readability formula requires some attention to the underlying question: What constitutes a readable document? Specifically, what features of text play an important role in determining readability? A survey of the features used in various readability formulas reveals the following list of features which have been suggested and used:
Klare [1968, 1974] and others [McLaughlin 1966] have shown that the two most common variables in a readability formula are (1) a measure of word difficulty and (2) a measure of sentence difficulty. Clearly, a sentence with a number of unusual words will be more difficult to understand than a sentence with simpler, more common words. Similarly, a sentence with a contorted and complex syntactic structure is more difficult to read than a sentence with a simple structure. The problem is mainly how to measure and combine these two variables.
The most direct measure of the difficulty of a word is its frequency in normal use. Words that are frequently used are easy to read and understand; words that are uncommon are more difficult to read. Analyses of English-language text have shown that a small number of words occur very frequently while many words are quite uncommon. In the Brown Corpus [Kucera and Francis, 1967], for example, a body of 1,014,232 words yielded only 50,406 unique words. Of these, the 168 most frequently occurring words accounted for half of all word occurrences, while about half the unique words were used only once. This skewness in frequency of use means that the best measure of the commonness of a word is the logarithm of its frequency of use.
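A minimal sketch of the logarithmic commonness measure follows. It assumes a word list is already available as a Python sequence; the function name `log_frequencies` and the tiny sample are illustrative only, and a real application would take its counts from a large corpus such as the one cited above.

```python
import math
from collections import Counter

def log_frequencies(words):
    """Map each distinct word to log10 of its relative frequency in `words`."""
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    return {w: math.log10(c / total) for w, c in counts.items()}

# Illustrative sample only; far too small to be a meaningful corpus.
sample = "the cat sat on the mat and the dog sat too".split()
scores = log_frequencies(sample)

# More common words get values closer to zero; rare words are more negative.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Taking the logarithm compresses the very large gap between the few high-frequency words and the long tail of words seen only once.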
However, since computing this measure requires a dictionary of words together with their commonness, it is generally not used directly. Rather, indirect measures, such as features 1 to 6 above, are used to approximate word difficulty. Most of these measures are based on Zipf's law [Zipf 1935]: common words tend to be short, while uncommon words tend to be longer. Thus, measures such as (1) the number of characters per word or (3) the number of syllables per word serve as indirect measures of the frequency of a word in the language.
The difficulty of a sentence would seem to depend mainly upon its syntactic structure. Again, however, this is not readily computable or quantifiable. Thus, more easily computed measures, such as (7) sentence length, are more commonly used, on the assumption that long sentences are more complex than short sentences.
Once the features to be measured are selected, they must be combined to produce a composite readability value. The existing formulas are generally derived in the following manner:
Many readability formulas have been developed. A survey paper by Klare in 1963 listed over 30 different formulas for determining readability; an update in 1974 listed over 30 more new or updated formulas. Many of these vary only slightly from others and all are highly correlated.
One of the earliest (1948) and most popular formulas was the Flesch formula [Flesch 1948]. Designed for general adult reading matter, it was based upon the McCall-Crabbs Lessons. The formula yields a readability index in the range 0 (hard) to 100 (easy).
R = 206.835 - 84.6 * S/W - 1.015 * W/T
where,
S = total number of syllables
W = total number of words
T = total number of sentences
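The Flesch formula translates directly into a short function. This is a sketch under the definitions of S, W, and T given above; the function name `flesch` and the sample counts are hypothetical.

```python
def flesch(syllables, words, sentences):
    """Flesch index R from total syllables (S), words (W), and sentences (T)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    return 206.835 - 84.6 * syllables / words - 1.015 * words / sentences

# Hypothetical passage: 150 syllables, 100 words, 5 sentences.
# R = 206.835 - 84.6 * 1.5 - 1.015 * 20 = 59.635
r = flesch(150, 100, 5)
```

Both terms are subtracted, so more syllables per word and more words per sentence each push the index toward the hard end of the scale.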
Shortly after the Flesch formula was published, a variant of it was published by Farr, Jenkins and Paterson [1951]. They felt that computing the number of syllables per word was too difficult and so suggested instead that the number of one-syllable words be counted. The more one-syllable words there are, the simpler the text.
R = -31.517 + 159.9 * M/W - 1.015 * W/T
where,
M = total number of one-syllable words
W = total number of words
T = total number of sentences
Coke and Rothkopf [1970] had a similar problem with the Flesch formula, but, unaware of the Farr-Jenkins-Paterson formula, investigated the possibility of estimating the syllable count with other features. They determined that a reasonable approximation to the number of syllables (S) is possible by computing the average number of vowels (V) per word (W). The following formula relates the syllable count to the number of vowels per word.
S = 99.81 * V/W - 34.32
This formula was determined from a set of samples using linear least squares to relate the actual syllable count to the average number of vowels per word.
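Substituting this estimate into the Flesch formula gives a version that needs only vowel counts. The derivation below is a reconstruction, not taken from Coke and Rothkopf: it assumes that S in the estimate is measured in syllables per 100 words, so that it is divided by 100 before substitution. That reading is an inference, but it reproduces the coefficients 235.87 and 84.44 listed for Coke-Rothkopf in the summary table.

```latex
\begin{align*}
R &= 206.835 - 84.6\,\frac{S}{W} - 1.015\,\frac{W}{T} \\
  &\approx 206.835 - 0.846\left(99.81\,\frac{V}{W} - 34.32\right) - 1.015\,\frac{W}{T} \\
  &= 206.835 + 29.03 - 84.44\,\frac{V}{W} - 1.015\,\frac{W}{T} \\
  &= 235.87 - 84.44\,\frac{V}{W} - 1.015\,\frac{W}{T}
\end{align*}
```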
At about the same time as the publication of the Flesch formula, the Dale-Chall formula [Dale and Chall, 1948] was published. It is one of the more accurate general purpose formulas. There are two major differences between the Dale-Chall formula and the Flesch formula.
The original formula, published in 1948, was based upon the 1925 McCall-Crabbs Lessons.
G = 19.4265 - 15.79 * D/W + .0496 * W/T
where,
D = total number of words on the Dale Long List
W = total number of words
T = total number of sentences
In 1950, the McCall-Crabbs Lessons were revised, and Powers, Sumner, and Kearl [1958] recalculated the Dale-Chall formula for the new McCall-Crabbs Lessons. This produced the following formula.
G = 14.8172 - 11.55 * D/W + .0596 * W/T
Again in 1961, the McCall-Crabbs Lessons were revised, and again the Dale-Chall formula was recalculated [Holquist 1968]. This time the formula was,
G = 14.862 - 11.42 * D/W + .0512 * W/T
The relative stability of the coefficients in the Dale-Chall formula lends credence to its validity.
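The three calibrations can be compared side by side with a small sketch. The coefficients are those given above; the dictionary keys (citation years), the function name `dale_chall`, and the sample counts are hypothetical choices made for illustration.

```python
# (constant, coefficient of D/W, coefficient of W/T), keyed by citation year.
DALE_CHALL_VARIANTS = {
    "1948": (19.4265, 15.79, 0.0496),  # Dale and Chall, 1925 lessons
    "1958": (14.8172, 11.55, 0.0596),  # Powers, Sumner, and Kearl, 1950 lessons
    "1968": (14.862, 11.42, 0.0512),   # Holquist, 1961 lessons
}

def dale_chall(dale_words, words, sentences, variant="1968"):
    """Grade level G from D (words on the Dale list), W, and T."""
    c0, c_word, c_sentence = DALE_CHALL_VARIANTS[variant]
    return c0 - c_word * dale_words / words + c_sentence * words / sentences

# Hypothetical passage: 80 of 100 words on the Dale list, 5 sentences.
grades = {v: dale_chall(80, 100, 5, v) for v in DALE_CHALL_VARIANTS}
```

For this sample the two recalculated versions give nearly the same grade (about 6.77 and 6.75), while the original 1948 calibration gives about 7.79.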
The Fog Index [Gunning 1952] considered "hard" words to be those of three syllables or more. It was recalculated based on the 1950 McCall-Crabbs Lessons [Powers, Sumner, and Kearl, 1958].
G = 3.0680 + 9.84 * P/W + .0877 * W/T
where,
P = total number of words with 3 syllables or more
W = total number of words
T = total number of sentences
Research by Coleman [1965] produced the following formula based upon the percent of correct cloze completions. This results in a scale from 0 (hard) to 100 (easy).
R = -37.95 + 116.0 * M/W + 148.0 * T/W
where,
M = total number of one-syllable words
W = total number of words
T = total number of sentences
Bormuth [1968] investigated 169 different variables for 330 passages using cloze procedures. He developed 24 different formulas with up to 20 variables. However, most of the extra variables added little to the accuracy of the readability index. One of the formulas uses the Dale Long List, producing an index from 0 (hard) to 100 (easy).
R = .886593 - .08364 * L/W + 1.61911 * (D/W)^3 - .021401 * W/T + .000577 * (W/T)^2 - .000005 * (W/T)^3
where
L = total number of letters
D = total number of words on the Dale Long List
W = total number of words
T = total number of sentences
Irving and Arnold [1979] suggest that the number of words on the Dale Long List (D) can be approximated by,
D = 1.16 * W - .05 * L
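As an illustration only (this substitution is not taken from Irving and Arnold), dividing the approximation by W and inserting it into the 1961-based Dale-Chall recalculation shows how a word-list formula reduces to one that needs only letter, word, and sentence counts.

```latex
\begin{align*}
\frac{D}{W} &\approx 1.16 - 0.05\,\frac{L}{W} \\
G &= 14.862 - 11.42\,\frac{D}{W} + 0.0512\,\frac{W}{T} \\
  &\approx 14.862 - 11.42\left(1.16 - 0.05\,\frac{L}{W}\right) + 0.0512\,\frac{W}{T} \\
  &= 1.6148 + 0.571\,\frac{L}{W} + 0.0512\,\frac{W}{T}
\end{align*}
```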
The SMOG Index was developed by McLaughlin [1969] who argued that word difficulty and sentence difficulty are not independent, so that their product, rather than their (weighted) sum, is a more accurate indication of readability. This resulted, eventually, in the following formula, based on the McCall-Crabbs Lessons, and giving the grade level (1 to 12) for reading:
G' = 3.1291 + 5.7127 * SQRT(P/T)
where
P = total number of words with 3 syllables or more
T = total number of sentences
Unfortunately, this grade level (G') is not directly comparable with the grade level (G) of the Dale-Chall formula, since G is the grade level to understand half the material while G' is the grade level to fully understand the material.
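The product structure of the SMOG Index can be made explicit in code. The sketch below assumes only the formula above; the function names and sample counts are hypothetical. It relies on the identity P/T = (P/W) * (W/T), that is, the proportion of hard words multiplied by the average sentence length.

```python
import math

def smog(polysyllables, sentences):
    """SMOG grade G' from P (words of 3+ syllables) and T (sentences)."""
    if sentences <= 0:
        raise ValueError("need at least one sentence")
    return 3.1291 + 5.7127 * math.sqrt(polysyllables / sentences)

def smog_from_ratios(hard_word_ratio, words_per_sentence):
    """Same grade, written as the product (P/W) * (W/T) = P/T."""
    return 3.1291 + 5.7127 * math.sqrt(hard_word_ratio * words_per_sentence)

# Hypothetical sample: 30 polysyllabic words among 200 words in 10 sentences.
direct = smog(30, 10)
via_product = smog_from_ratios(30 / 200, 200 / 10)
```

Both calls give the same grade, about 13.02 for this sample, which shows that the word count W cancels out of the SMOG formula.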
Common Readability Formulas and Their Variables
Formula | Equation |
---|---|
Flesch | R = 206.835 - 84.6 * S/W - 1.015 * W/T |
Farr-Jenkins-Paterson | R = -31.517 + 159.9 * M/W - 1.015 * W/T |
Coke-Rothkopf | R = 235.87 - 84.44 * V/W - 1.015 * W/T |
Coleman | R = -37.95 + 116.0 * M/W + 148.0 * T/W |
Dale-Chall | G = 14.862 - 11.42 * D/W + 0.0512 * W/T |
Fog | G = 3.068 + 9.84 * P/W + 0.0877 * W/T |
Automated Readability Index | G = -21.43 + 4.71 * L/W + 0.50 * W/T |
Coleman-Liau | G = -15.8 + 5.88 * L/W - 29.59 * T/W |
Kincaid | G = -15.59 + 11.8 * S/W + 0.39 * W/T |
Variable | Meaning |
---|---|
W | total number of words |
T | total number of sentences |
L | total number of letters |
V | total number of vowels |
D | total number of words on the Dale Long List |
S | total number of syllables |
M | total number of one-syllable words |
P | total number of words with 3 syllables or more |
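Because most of the formulas share one linear shape (a constant plus a word-level ratio plus a sentence-level ratio), they can be evaluated from a single table. The sketch below covers only the formulas that need no word list; the names `LINEAR_FORMULAS`, `ratio`, and `evaluate`, and the sample counts, are hypothetical.

```python
# name: (constant, coefficient, word-level ratio, coefficient, sentence-level ratio)
LINEAR_FORMULAS = {
    "Flesch": (206.835, -84.6, "S/W", -1.015, "W/T"),
    "Farr-Jenkins-Paterson": (-31.517, 159.9, "M/W", -1.015, "W/T"),
    "Coleman": (-37.95, 116.0, "M/W", 148.0, "T/W"),
    "Fog": (3.068, 9.84, "P/W", 0.0877, "W/T"),
    "Automated Readability Index": (-21.43, 4.71, "L/W", 0.50, "W/T"),
    "Kincaid": (-15.59, 11.8, "S/W", 0.39, "W/T"),
}

def ratio(spec, counts):
    """Evaluate a ratio such as 'S/W' against a dict of totals."""
    numerator, denominator = spec.split("/")
    return counts[numerator] / counts[denominator]

def evaluate(name, counts):
    """Compute the named index from totals keyed by the variable letters."""
    c0, c1, r1, c2, r2 = LINEAR_FORMULAS[name]
    return c0 + c1 * ratio(r1, counts) + c2 * ratio(r2, counts)

# Hypothetical totals for a 100-word, 5-sentence passage.
counts = {"W": 100, "T": 5, "L": 450, "S": 150, "M": 65, "P": 10}
results = {name: evaluate(name, counts) for name in LINEAR_FORMULAS}
```

Keeping the coefficients in data rather than in code makes it easy to add a recalibrated variant without writing a new function.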
Constructing a program to compute a readability index is fairly straightforward. First, one (or more) of the readability formulas is selected for computation. Next, the text of the input file is read and the necessary statistics are accumulated. Finally, the readability formula is used to compute and print an index for the specific input file.
The statistics needed for computing most formulas are easily accumulated in one pass through the document. Of the formulas given above, we see that the following statistics are needed:
The most difficult statistics are probably those dealing with the number of syllables per word. These can be exactly determined by a dictionary look-up, or they can be approximated by the approach of Coke and Rothkopf [1970] or by use of techniques for determining where a word may be hyphenated [Rich and Stone, 1965].
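A minimal sketch of the accumulation step follows. The syllable estimate here is a simple vowel-group heuristic chosen for brevity; it is an assumption of this example and is neither the Coke-Rothkopf regression nor a hyphenation method. Word and sentence boundaries are found with hypothetical regular expressions that will miscount abbreviations and numbers.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")
SENTENCE_END = re.compile(r"[.!?]+")

def estimate_syllables(word):
    """Count groups of consecutive vowels, discounting a final silent 'e'."""
    w = word.lower()
    groups = len(VOWEL_GROUP.findall(w))
    if w.endswith("e") and not w.endswith("le") and groups > 1:
        groups -= 1
    return max(groups, 1)

def accumulate(text):
    """Return totals keyed by the variable letters W, T, L, V, S, M, P."""
    counts = dict.fromkeys("WTLVSMP", 0)
    for word in WORD.findall(text):
        letters = [c for c in word if c.isalpha()]
        syllables = estimate_syllables(word)
        counts["W"] += 1
        counts["L"] += len(letters)
        counts["V"] += sum(c in "aeiouAEIOU" for c in letters)
        counts["S"] += syllables
        counts["M"] += syllables == 1   # one-syllable words
        counts["P"] += syllables >= 3   # words of 3 or more syllables
    # Treat each run of terminators as one sentence; assume at least one.
    counts["T"] = max(len(SENTENCE_END.findall(text)), 1)
    return counts

stats = accumulate("The cat sat. It made a simple sound!")
```

The resulting dictionary holds every total used by the formulas above except D, which requires the Dale word list.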
One other point in implementation concerns the amount of text over which the index is computed. While we certainly want an index for the entire input file, we probably also want readability measures for each small piece of the file. Thus, we may want to compute a separate index for each section, each page, or each paragraph. This allows an author to quickly scan a document looking for sections which are out of line for the intended audience. These portions may be rewritten to bring them more into line with the author's intentions.
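Per-paragraph reporting can be sketched as below. The example uses the Automated Readability Index from the summary table because it needs only letters, words, and sentences. Splitting paragraphs on blank lines, the function names, and the default tolerance of two grade levels are all assumptions of this sketch.

```python
import re

def ari(text):
    """Automated Readability Index for one piece of text, or None if empty."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return None
    letters = sum(len(w) for w in words)
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    return -21.43 + 4.71 * letters / len(words) + 0.50 * len(words) / sentences

def index_by_paragraph(document):
    """Return (paragraph number, index) pairs; paragraphs are blank-line separated."""
    paragraphs = [p for p in re.split(r"\n\s*\n", document) if p.strip()]
    return [(i + 1, ari(p)) for i, p in enumerate(paragraphs)]

def out_of_line(document, target, tolerance=2.0):
    """Paragraphs whose index differs from the target grade by more than tolerance."""
    return [(n, g) for n, g in index_by_paragraph(document)
            if g is not None and abs(g - target) > tolerance]

doc = ("The cat sat. The dog ran.\n\n"
       "Notwithstanding considerable institutional complications, "
       "implementation proceeded.")
per_paragraph = index_by_paragraph(doc)
flagged = out_of_line(doc, target=8.0)
```

Very short passages give unstable values, as the negative grade for the first sample paragraph shows, so in practice a minimum passage length would be sensible before flagging anything.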
A very simple program can be written to compute a readability index for a document. Readability formulas have been developed by reading specialists to allow easy determination of the reading level of a document. With the new ability of computers to store large dictionaries of words and their properties on-line, we expect that even better readability indexes can be produced and can help to improve the quality of documents produced with the aid of a computer system.
As an example, applying the formulas listed above to this paper results in the following readability indexes:
Readability: 0 (hard) to 100 (easy) | |
---|---|
81.7 | Coke-Rothkopf |
55.2 | Farr-Jenkins-Paterson |
53.0 | Flesch |
47.3 | Coleman |
Grade Level: 1 (easy) to 12 (hard) | |
---|---|
6.5 | Fog |
8.0 | Coleman-Liau |
9.1 | Automated Readability Index |
10.6 | Dale-Chall |
11.0 | Kincaid |
13.6 | SMOG |
Acknowledgements: We would like to thank Carol Engelhardt for her assistance in this work.