Research Plans in Text Manipulation

Communication occurs in many forms, but especially in the form of text. As computers and computer networks become more common, more and more communication is supported by computers. Computers are replacing typewriters for creating documents, the postal system for mailing them, and filing cabinets for storing them.

The focus of my research is text manipulation algorithms. This research can be further divided into 6 areas:

  1. Input. Keyboard design, scanners, and eventually, speech.
  2. Storage. File systems, data bases, character codes, compression.
  3. Editing. Character, line, and screen editors, design, implementation, user interface.
  4. Formatting. Design, implementation, user interface, hyphenation algorithms, page make-up algorithms.
  5. Printing and Display. Printer design, fonts, data streams.
  6. Analysis. Spelling checking, syntax checking, content analysis.
Various different aspects of these areas are of interest to me. Let me discuss some of the current areas of interest and specific research/development projects which I wish to consider.

Spelling

The most common and practical form of Analysis is a spelling checking and correcting program. Although Unix has a spelling program (spell), it is a simple batch program. I want to write a full-screen interactive spelling program. Such a program should be able to run both with terminals described in the TERMCAP database and under the ITC window manager for base-editor documents.

Spelling checking is generally a simple look-up in a word list. This makes it useful to collect various word lists. I currently have about 15 word lists, allowing studies of common words. Spelling correction, on the other hand, can be improved by the availability of more information about the words. For example, the most common method of correcting a spelling error is to assume that the error is a typing error, and attempt to invert the four most common typing errors.

An alternative correction technique is to look for alternative spellings for possible pronunciations of a word. For example, a word with a 'C' might be pronouned as 'K' or 'S'. Alternate spellings for the 'K' and 'S' sounds are 'C', 'K' and 'S', 'C'. Thus, it might be reasonable to consider replacing the 'C' with either a 'K' or an 'S'. The table of possible replacements, based upon pronunciations, can be derived from a list that I have obtained from Martin Kay at Xerox. This list was created for the American Heritage Dictionary, however. I have a list of words and their pronunciation encodings for Webster's 7th Collegiate. The list needs to be revised and used to correlate the written and spoken forms of the words in W7, which unfortunately are quite abbreviated and must be completed for useful work.

Formatting

Text formating is currently driven by the three major formatters: troff, Tex, and Scribe. Just as with programming languages, substantial improvements in the formatting problem can be obtained by the use of abstract formatting specifications. This is best shown in Scribe, where one defines a part of the document to be a "heading", but does not indicate, in the document, what formatting is appropriate for a heading. A separate implementation file specifies how a heading is to be implemented in various environments.

The failure of troff and Tex is that they fail to adequatedly enforce this separation of declaration and implementation. The failure of Scribe is that it does not allow sufficient computational power to be used in the implementation of the abstract formating items. I propose to marry these two approaches: presenting a Scribe-like user interface with a troff-like back-end for implementation. In addition, troff will be re-written to properly separate the actual formatting algorithms. This will allow better algorithms, such as Knuth's line-breaking algorithm and variant hyphenation algorithms to used with troff.

Hyphenation

The complete pronunciation list would also allow me to compare the syllabification of the pronunciation with the printed hyphenation of the word. The best source I can find on hyphenation indicates that American hyphenation is based on pronunciation. I would like to test this, as well as numerically compare the various hyphenation algorithms for accuracy. Frank Liang and Don Knuth, at Stanford, have been working on the hyphenation problem, but have not compared the various different hyphenation algorithms, but have only worked with their own. It is also interesting to consider means, such as Knuth's line breaking algorithm, to decrease the need for hyphenation altogether. I would like to run a large sample of text both with and without hyphenation for varying line lengths to determine the cost (in space for the final printed version) of hyphenation. I suspect that for reasonable line lengths (6 inches or more), hyphenation is of minimal impact, and can be simply done without.

A Data-Base of Lexical Properties

Spelling correction can also make use of other properties of words, such as their frequency. In general, algorithms involving words, such as a hyphenation study, will be more accurate if they consider the relative frequency of the words involved. Frequency of words has been the subject of at least 3 studies and the results seem to be rather unstable. I suspect that this is the result of the small (less than 10 million) sample size; a much larger sample study (more than 100 million words) is probably necessary for meaningful results.

Most additional algorithms of interest, such as syntactic or grammatical error detectors and correctors, depend upon additional properties of words. Some of these properties can be determined from available word-lists-with-properties, such as frequency lists or hyphenation lists, or part-of-speech lists. My copy of Webster's Seventh provides many such properties. However, a more complete set or properties, from various sources, would be preferable. The problem is obtaining these materials in a computer readable form.

Scanning

Keying in large reference documents, such as additional dictionaries and encyclopedias would seem unreasonable. The preferred approach would be to scan the documents directly from a printed image to a file. The Kurzweil Reading Machine is an example of the type of system which would be useful for text scanning. However, it is a closed system which can not be investigated, modified, or used in other situations. Thus, I propose to develop algorithms to allow the automatic scanning of text into a computer system from a printed form. I will need a scanner to input the page images. Algorithms will then identify the individual letters and their properties (fonts, facecodes, position), A certain amount of error may be introduced by this process. Higher-level algorithms (such as spelling checking and correcting) will be employed to further reduce this error.

I believe that properly developed algorithms and software, combined with the developing set of inexpensive scanners based on copier technology, will allow scanning and scanners to be an important part of modern computer systems, just as full-page high-resolution printers and screens are becoming a standard part of modern systems.

James L. Peterson
20 April 1985