Computer-Based Readability Indexes




Douglas R. McCallum and James L. Peterson




Department of Computer Sciences
The University of Texas at Austin
Austin, Texas 78712




24 November 1980




This paper has been published. It should be cited as

Douglas R. McCallum and James L. Peterson, ``Computer-Based Readability Indexes'', Proceedings of the ACM '82 Conference, (October 1982), pages 44-48.

ACM Copyright Notice

Copyright © 1982 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org.

ACM did not prepare this copy and does not guarantee that it is an accurate copy of the author's original work.


1. Introduction

Computer-based document preparation systems provide many aids to the production of quality documents. A text editor allows arbitrary text to be entered and modified. A text formatter then imposes defined rules on the form of the text. A spelling checker ensures that each word is correctly spelled. None of these aids, however, affects the meaning of the text; the document may be a well-formatted, correctly spelled piece of gobbledygook.

The general problem of checking correct syntax and semantics is still very much a research problem. However, one form of assistance in producing a readable document can be very easily provided now: a readability index. A readability index is a measure of the ease (or difficulty) of reading and understanding a piece of text. Several readability formulas to compute readability indexes have been defined and are in fairly wide use.

Readability formulas were originally developed mainly by educators and reading specialists. Their primary application was in defining the appropriate reading level for text books for elementary and secondary schools. More generally, readability indexes can be used to assist a writer by pointing out possible grammatical and stylistic problems and by helping to maintain a consistent audience level throughout a document.

2. Underlying Principles

Defining and selecting a readability formula requires some attention to the underlying question: what constitutes a readable document? Specifically, what features of text play an important role in determining readability? A survey of various readability formulas reveals the following features which have been suggested and used:

  1. length of words (in characters)

  2. number of words of 6 or more letters

  3. number of syllables

  4. number of words which are monosyllables

  5. number of words of 3 or more syllables

  6. number of affixes (prefixes or suffixes)

  7. number of words per sentence

  8. number of sentences

  9. number of pronouns

  10. number of prepositions

Klare [1968, 1974] and others [McLaughlin 1966] have shown that the two most common variables in a readability formula are (1) a measure of word difficulty and (2) a measure of sentence difficulty. Clearly, a sentence with a number of unusual and uncommon words will be more difficult to understand than a sentence with simpler, more common words. Similarly, a sentence with a contorted and complex syntactic structure is more difficult to read than a sentence with a simple structure. The problem is mainly how to measure and combine these two variables.

2.1. Measures of Word Difficulty

The most direct measure of the difficulty of a word is its frequency in normal use. Words which are frequently used are easy to read and understand; words which are uncommon are more difficult to read. Analyses of English-language text have shown that a small number of words occur very frequently while many words are quite uncommon. In the Brown Corpus [Kucera and Francis, 1967], for example, a body of 1,014,232 words yielded only 50,406 unique words. Of these, the 168 most frequently occurring words accounted for half of all word occurrences, while about half the unique words were used only once. This skewness in frequency of use means that the best measure of the commonness of a word is the logarithm of its frequency of use.

However, since the computation of this measure requires a dictionary of words together with their frequencies, this measure is generally not used directly. Rather, indirect measures, such as features 1 to 6 above, are used to approximate word difficulty. Most of these measures rest on the fact that common words tend to be short, while uncommon words tend to be longer, a relationship known as Zipf's law [Zipf 1935]. Thus, measures such as (1) the number of characters per word or (3) the number of syllables per word serve as indirect measures of the frequency of a word in the language.

2.2. Measures of Sentence Difficulty

The difficulty of a sentence would seem to depend mainly upon its syntactic structure. However, again, this is not readily computable or quantifiable. Thus, more easily computed measures, such as (7) sentence length are more commonly used, based upon the assumption that long sentences are more complex than short sentences.

2.3. Combining the Features

Once the features to be measured are selected, they must be combined to produce a composite readability value. The existing formulas are generally derived in the following manner:

  1. An independent assessment of the reading difficulty of a collection of texts is made, either by a panel of judges, by a standard set of texts (such as the McCall-Crabbs [1925] Standard Test Lessons in Reading), or by cloze tests. A cloze test replaces every fifth word of a passage with a blank space and then determines the difficulty of the passage by the percent of deleted words which can be correctly guessed by a reader. (The more difficult the passage, the fewer words can be guessed.)

  2. The values of the chosen features are computed for each of the texts in the collection.

  3. (Linear) regression analysis is applied to produce the coefficients of a linear equation combining the features to compute the readability score which was independently derived.

Thus, the coefficients in readability formulas are empirically determined.
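
The regression step can be illustrated with a short sketch in Python. The passage data below are invented for illustration, not taken from any actual assessment, and the sketch fits only a single feature; real formulas fit two or more features in the same way.

```python
# Sketch of step 3 for a single feature: fit y = a + b*x by ordinary
# least squares, the method used to derive readability coefficients.

def fit_linear(xs, ys):
    """Return (intercept, slope) minimizing the squared error of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Invented data: average words per sentence vs. independently assessed grade.
words_per_sentence = [8.0, 12.0, 16.0, 20.0, 24.0]
grade_level = [3.0, 5.0, 7.0, 9.0, 11.0]

a, b = fit_linear(words_per_sentence, grade_level)
```

The fitted intercept and slope then become the constants of the readability formula.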

3. Example Readability Formulas

Many readability formulas have been developed. A survey by Klare in 1963 listed over 30 different formulas for determining readability; an update in 1974 listed over 30 more new or updated formulas. Many of these vary only slightly from one another, and all are highly correlated.

3.1. The Flesch Formula

One of the earliest (1948) and most popular formulas was the Flesch formula [Flesch 1948]. Designed for general adult reading matter, it was based upon the McCall-Crabbs Lessons. The formula yields a readability index in the range 0 (hard) to 100 (easy).


R = 206.835 - 84.6 * S/W - 1.015 * W/T

where,
S = total number of syllables
W = total number of words
T = total number of sentences

This formula is based upon the average number of syllables per word (S/W), a measure of word difficulty, and the average number of words per sentence (W/T), a measure of sentence difficulty.
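
As a sketch in Python (the counts in the example are invented), the Flesch formula is computed directly from the three totals:

```python
def flesch(syllables, words, sentences):
    """Flesch Reading Ease: 206.835 - 84.6 * S/W - 1.015 * W/T."""
    return 206.835 - 84.6 * syllables / words - 1.015 * words / sentences

# For example, 100 words containing 140 syllables in 6 sentences:
score = flesch(syllables=140, words=100, sentences=6)
```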

3.2. The Farr-Jenkins-Paterson Formula

Shortly after the Flesch formula was published, a variant of it was published by Farr, Jenkins and Paterson [1951]. They felt that computing the number of syllables per word was too difficult and so suggested instead that the number of one-syllable words be counted. The more one-syllable words there are, the simpler the text.


R = -31.517 + 159.9 * M/W - 1.015 * W/T

where,
M = total number of one-syllable words
W = total number of words
T = total number of sentences

This formula is slightly less accurate than the Flesch formula but may be easier to compute.
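
A corresponding sketch in Python, substituting the count of one-syllable words (the example counts are invented):

```python
def farr_jenkins_paterson(monosyllables, words, sentences):
    """Farr-Jenkins-Paterson: -31.517 + 159.9 * M/W - 1.015 * W/T."""
    return -31.517 + 159.9 * monosyllables / words - 1.015 * words / sentences

# For example, 100 words of which 70 are monosyllables, in 6 sentences:
score = farr_jenkins_paterson(monosyllables=70, words=100, sentences=6)
```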

Coke and Rothkopf [1970] had a similar problem with the Flesch formula, but, unaware of the Farr-Jenkins-Paterson formula, investigated the possibility of estimating the syllable count with other features. They determined that a reasonable approximation to the number of syllables (S) is possible by computing the average number of vowels (V) per word (W). The following formula relates the syllable count to the number of vowels per word.


S = 99.81 * V/W - 34.32

This formula was determined from a set of samples using linear least squares to relate the actual syllable count to the average number of vowels per word.
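
A sketch of this approximation in Python. Here S is read as syllables per 100 words, which is consistent with the Coke-Rothkopf entry in the table of formulas below: substituting S/W = (99.81 * V/W - 34.32)/100 into the Flesch formula yields R = 235.87 - 84.44 * V/W - 1.015 * W/T. The simple vowel counter is our own illustration, not the authors' procedure:

```python
def vowels_per_word(text):
    """Average number of vowel letters (a, e, i, o, u) per word."""
    words = [w.strip(".,;:!?") for w in text.lower().split()]
    words = [w for w in words if w]
    vowels = sum(ch in "aeiou" for w in words for ch in w)
    return vowels / len(words)

def syllables_per_100_words(v_per_w):
    """Coke-Rothkopf estimate: S = 99.81 * V/W - 34.32."""
    return 99.81 * v_per_w - 34.32
```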

3.3. The Dale-Chall Formula

At about the same time as the publication of the Flesch formula, the Dale-Chall formula [Dale and Chall, 1948] was published. It is one of the more accurate general purpose formulas. There are two major differences between the Dale-Chall formula and the Flesch formula.

  1. The Dale-Chall formula does not compute a number from 0 (hard) to 100 (easy), but computes the grade level (1 through 12) of a pupil who can answer correctly at least half of the test questions asked about a text passage. Thus, the range of the readability index is from 1 (easy) to 12 (high school graduate) to perhaps 16 (college graduate).

  2. The measure of word difficulty is computed directly from the Dale Long List [Dale and Chall, 1948]. The Dale Long List is a list of 3000 general words known to 80 percent of fourth grade children (in 1948, of course). Words on the Dale Long List are considered easy; words not on the Dale Long List are considered hard.

The original formula, published in 1948, was based upon the 1925 McCall-Crabbs Lessons.


G = 19.4265 - 15.79 * D/W + .0496 * W/T

where,
D = total number of words on the Dale Long List
W = total number of words
T = total number of sentences

In 1950, the McCall-Crabbs Lessons were revised, and Powers, Sumner, and Kearl [1958] recalculated the Dale-Chall formula for the new McCall-Crabbs Lessons. This produced the following formula.


G = 14.8172 - 11.55 * D/W + .0596 * W/T

Again in 1961, the McCall-Crabbs Lessons were revised, and again the Dale-Chall formula was recalculated [Holquist 1968]. This time the formula was,


G = 14.862 - 11.42 * D/W + .0512 * W/T

The relative stability of the coefficients in the Dale-Chall formula lends credence to its validity.
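
A sketch of the 1961 form of the formula in Python. The tiny set of easy words below merely stands in for the 3000-word Dale Long List; D counts words that appear on the list (the easy words), matching the formula's negative coefficient:

```python
EASY_WORDS = {"the", "cat", "sat", "on", "mat"}  # toy stand-in for the Dale Long List

def dale_chall(words, sentences):
    """1961 recalculation: 14.862 - 11.42 * D/W + .0512 * W/T."""
    d = sum(1 for word in words if word.lower() in EASY_WORDS)
    w = len(words)
    return 14.862 - 11.42 * d / w + 0.0512 * w / sentences

# All six words are on the toy list, so the grade level comes out low.
grade = dale_chall("The cat sat on the mat".split(), sentences=1)
```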

3.4. Fog Formula

The Fog Index [Gunning 1952] considered "hard" words to be those of three syllables or more. It was recalculated based on the 1950 McCall-Crabbs Lessons [Powers, Sumner, and Kearl, 1958].


G = 3.0680 + 9.84 * P/W + .0877 * W/T

where,
P = total number of words with 3 syllables or more
W = total number of words
T = total number of sentences

3.5. Coleman Formula

Research by Coleman [1965] produced the following formula based upon the percent of correct cloze completions. This results in a scale from 0 (hard) to 100 (easy).


R = -37.95 + 116.0 * M/W + 148.0 * T/W

where,
M = total number of one-syllable words
W = total number of words
T = total number of sentences

3.6. Bormuth Formula

Bormuth [1968] investigated 169 different variables for 330 passages using cloze procedures. He developed 24 different formulas with up to 20 variables. However, most of the extra variables added little to the accuracy of the readability index. One of the formulas uses the Dale Long List, producing an index from 0 (hard) to 100 (easy).


R = .886593 - .08364 * L/W + 1.61911 * (D/W)^3
  - .021401 * W/T + .000577 * (W/T)^2 - .000005 * (W/T)^3

where
L = total number of letters
D = total number of words on the Dale Long List
W = total number of words
T = total number of sentences

Irving and Arnold [1979] suggest that the number of words on the Dale Long List (D) can be approximated by,


D = 1.16 * W - .05 * L

3.7. SMOG Index

The SMOG Index was developed by McLaughlin [1969] who argued that word difficulty and sentence difficulty are not independent, so that their product, rather than their (weighted) sum, is a more accurate indication of readability. This resulted, eventually, in the following formula, based on the McCall-Crabbs Lessons, and giving the grade level (1 to 12) for reading:


G' = 3.1291 + 5.7127 * SQRT(P/T)

where
P = total number of words with 3 syllables or more
T = total number of sentences

Unfortunately, this grade level (G') is not directly comparable with the grade level (G) of the Dale-Chall formula since G is the grade level to understand half the material while G' is the grade level to fully understand the material.
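
A sketch of the SMOG computation in Python (the example counts are invented):

```python
import math

def smog(polysyllables, sentences):
    """SMOG grade: 3.1291 + 5.7127 * SQRT(P/T)."""
    return 3.1291 + 5.7127 * math.sqrt(polysyllables / sentences)

# For example, 30 words of three or more syllables in 30 sentences:
grade = smog(polysyllables=30, sentences=30)
```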


Common Readability Formulas and Their Variables

Flesch R = 206.835 - 84.6 * S/W - 1.015 * W/T
Farr-Jenkins-Paterson R = -31.517 + 159.9 * M/W - 1.015 * W/T
Coke-Rothkopf R = 235.87 - 84.44 * V/W - 1.015 * W/T
Coleman R = -37.95 + 116.0 * M/W + 148.0 * T/W
Dale-Chall G = 14.862 - 11.42 * D/W + 0.0512 * W/T
Fog G = 3.068 + 9.84 * P/W + .0877 * W/T
Automated Readability Index G = -21.43 + 4.71 * L/W + 0.50 * W/T
Coleman-Liau G = -15.8 + 5.88 * L/W - 29.59 * T/W
Kincaid G = -15.59 + 11.8 * S/W + 0.39 * W/T


W = total number of words
T = total number of sentences
L = total number of letters
V = total number of vowels
D = total number of words on the Dale Long List
S = total number of syllables
M = total number of one-syllable words
P = total number of words with 3 syllables or more
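
Since all of the tabled formulas combine the same few counts, they can be driven from a single statistics record. A sketch in Python, with coefficients taken from the table above and invented sample counts:

```python
# Each formula maps a dictionary of counts to a score.
FORMULAS = {
    "Flesch":  lambda c: 206.835 - 84.6 * c["S"] / c["W"] - 1.015 * c["W"] / c["T"],
    "Fog":     lambda c: 3.068 + 9.84 * c["P"] / c["W"] + 0.0877 * c["W"] / c["T"],
    "ARI":     lambda c: -21.43 + 4.71 * c["L"] / c["W"] + 0.50 * c["W"] / c["T"],
    "Kincaid": lambda c: -15.59 + 11.8 * c["S"] / c["W"] + 0.39 * c["W"] / c["T"],
}

# Invented counts: letters, syllables, polysyllabic words, words, sentences.
counts = {"L": 450, "S": 140, "P": 10, "W": 100, "T": 6}
scores = {name: round(f(counts), 1) for name, f in FORMULAS.items()}
```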



4. Implementation

Constructing a program to compute a readability index is fairly straightforward. First, one (or more) of the readability formulas is selected for computation. The text of the input file is then read, accumulating the necessary statistics. Finally, the readability formula is used to compute and print an index for the input file.

The statistics needed for computing most formulas are easily accumulated in one pass through the document. Of the formulas given above, we see that the following statistics are needed:

  1. W = total number of words

  2. T = total number of sentences

  3. L = total number of letters

  4. V = total number of vowels

  5. D = total number of words on the Dale Long List

  6. S = total number of syllables

  7. M = total number of one-syllable words

  8. P = total number of words with 3 syllables or more

The number of letters, vowels, words (sequences of letters separated by blanks or punctuation), and sentences (sequences of words separated by a period, exclamation point, or question mark) are easy to compute. The number of words on the Dale Long List can be computed directly by reading in a copy of the list and searching it for each word, or it can be approximated as suggested by Irving and Arnold [1979].
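
The easy counts can be gathered in a single pass over the characters. A sketch in Python, simplifying the rules above: a word is a maximal run of letters, and a sentence ends at a period, exclamation point, or question mark:

```python
def text_statistics(text):
    """One pass accumulating letters (L), vowels (V), words (W), sentences (T)."""
    stats = {"letters": 0, "vowels": 0, "words": 0, "sentences": 0}
    in_word = False
    for ch in text:
        if ch.isalpha():
            stats["letters"] += 1
            if ch.lower() in "aeiou":
                stats["vowels"] += 1
            if not in_word:          # a new word begins
                stats["words"] += 1
                in_word = True
        else:
            in_word = False
            if ch in ".!?":          # end of a sentence
                stats["sentences"] += 1
    return stats
```

Abbreviations and decimal points would be miscounted as sentence ends; a production version would need additional rules.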

The most difficult statistics are probably those dealing with the number of syllables per word. These can be exactly determined by a dictionary look-up, or they can be approximated by the approach of Coke and Rothkopf [1970] or by use of techniques for determining where a word may be hyphenated [Rich and Stone, 1965].
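
In the same spirit as these approximations, syllables can be estimated by counting vowel groups. The silent-e adjustment below is a common heuristic of our own choosing, not a method from the papers cited:

```python
def count_syllables(word):
    """Estimate syllables as vowel groups, with a silent final 'e' dropped."""
    word = word.lower()
    groups = 0
    prev_vowel = False
    for ch in word:
        is_vowel = ch in "aeiouy"
        if is_vowel and not prev_vowel:   # start of a new vowel group
            groups += 1
        prev_vowel = is_vowel
    if word.endswith("e") and groups > 1:  # assume a silent final e
        groups -= 1
    return max(groups, 1)
```

The heuristic fails on words such as "readable"; a dictionary look-up remains the accurate method.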

One other point in implementation concerns the amount of text over which the index is computed. While we certainly want an index for the entire input file, we probably also want readability measures for each small piece of the file. Thus, we may want to compute a separate index for each section, each page, or each paragraph. This allows an author to quickly scan a document looking for sections which are out of line for the intended audience. These portions may be rewritten to bring them more into line with the author's intentions.
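
As an illustration in Python, the Automated Readability Index from the table above, which needs only letter, word, and sentence counts, can be applied to each blank-line-separated paragraph. The splitting rules here are simplifications of those discussed earlier:

```python
def ari(letters, words, sentences):
    """Automated Readability Index, from the table of formulas above."""
    return -21.43 + 4.71 * letters / words + 0.50 * words / sentences

def paragraph_grades(text):
    """Grade level for each blank-line-separated paragraph."""
    grades = []
    for para in text.split("\n\n"):
        words = [w.strip(".,;:!?") for w in para.split()]
        words = [w for w in words if w]
        if not words:
            continue
        letters = sum(len(w) for w in words)
        sentences = max(sum(para.count(c) for c in ".!?"), 1)
        grades.append(round(ari(letters, len(words), sentences), 1))
    return grades
```

An author can then scan the list of per-paragraph grades for sections that are out of line with the intended audience.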

5. Conclusions

A very simple program can be written to compute a readability index for a document. Readability formulas have been developed by reading specialists to allow easy determination of the reading level of a document. With the ability of computers to store large dictionaries of words and their properties on-line, we expect that even better readability indexes can be produced, helping to improve the quality of documents produced with the aid of a computer system.

As an example, applying the formulas listed above to this paper results in the following readability indexes:

Readability: 0 (hard) to 100 (easy)
81.7 Coke-Rothkopf
55.2 Farr-Jenkins-Paterson
53.0 Flesch
47.3 Coleman

Grade Level: 1 (easy) to 12 (hard)
6.5 Fog
8.0 Coleman-Liau
9.1 Automated Readability Index
10.6 Dale-Chall
11.0 Kincaid
13.6 SMOG

Acknowledgements: We would like to thank Carol Engelhardt for her assistance in this work.

6. References

  1. John R. Bormuth, "Development of Readability Analyses", Final Report, Project No. 7-0052, Contract No. OEC-3-7-070052-0326, Department of Health, Education and Welfare, (March 1969).

  2. Ester U. Coke and Ernst Z. Rothkopf, "Note on a Simple Algorithm for a Computer-Produced Reading Ease Score", Journal of Applied Psychology, Volume 54, Number 3, (1970), pages 208-210.

  3. Edmund B. Coleman, "On Understanding Prose: Some Determiners of its Complexity", NSF Final Report GB-2604, (1965).

  4. Edgar Dale and Jeanne S. Chall, "A Formula for Predicting Readability", Educational Research Bulletin, Volume 27, (February 1948), pages 11-20, 37-54.

  5. James N. Farr, James J. Jenkins, and Donald G. Paterson, "Simplification of Flesch Reading Ease Formula", Journal of Applied Psychology, Volume 35, Number 5, (October 1951), pages 333-337.

  6. Rudolf F. Flesch, "A New Readability Yardstick", Journal of Applied Psychology, Volume 32, (June 1948), pages 221-233.

  7. Robert Gunning, The Technique of Clear Writing, McGraw-Hill, (1952).

  8. John B. Holquist, "A Determination of Whether the Dale-Chall Readability Formula may be Revised to Evaluate More Validly the Readability of High School Science Materials", Ph.D. Thesis, Colorado State University, (1968).

  9. Steve Irving and Bill Arnold, "Measuring Readability of Text", Personal Computing, (September 1979), pages 34-36.

  10. George R. Klare, The Measurement of Readability, Iowa State University Press, Ames, Iowa, (1963).

  11. George R. Klare, "The Role of Word Frequency in Readability", Elementary English, Volume 45, (January 1968), pages 12-22.

  12. George R. Klare, "Assessing Readability", Reading Research Quarterly, Number 1, (1974-1975), pages 62-102.

  13. H. Kucera and W. N. Francis, Computational Analysis of Present-Day American English, Brown University Press, (1967), 424 pages.

  14. G. Harry McLaughlin, "What Makes Prose Understandable", Ph.D. Thesis, University College, London, (1966).

  15. G. Harry McLaughlin, "SMOG Grading -- a New Readability Formula", Journal of Reading, Volume 12, (May 1969), pages 639-646.

  16. R. D. Powers, W. A. Sumner, and B. E. Kearl, "A Recalculation of Four Readability Formulas", Journal of Educational Psychology, Volume 49, (April 1958), pages 99-105.

  17. R. P. Rich and A. G. Stone, "Method for Hyphenating at the End of a Printed Line", Communications of the ACM, Volume 8, Number 7, (July 1965), pages 444-445.

  18. George K. Zipf, The Psycho-Biology of Language, Houghton Mifflin Co., Boston, (1935).