A Compiler for a Pascal-like Language

A COMPILER FOR A PASCAL-LIKE LANGUAGE

James L. Peterson

Systems Programming Laboratory Note 1
SPL/1

Department of Computer Sciences
The University of Texas at Austin
Austin, Texas 78712

August, 1977

ABSTRACT

The development of major software systems for the Nova computer system can benefit greatly from the existence of a systems programming language. The development of such a language, and its supporting compiler is currently underway. This note reports on the language definition and the mechanics of the compiler.

1. The Language

The basis for the Nova language is the programming language Pascal. The Nova language is basically a subset of Pascal which is necessary for systems programming and can be supported by the Nova computers. The language design was not influenced by the target machine, but is aimed at producing a compiler for a reasonable language for a small computer.

The language is heavily typed, with predefined types of INTEGER, BOOLEAN, and CHAR. Users may define subranges, arrays, and records.

Control structures include compound statements (BEGIN/END), IF-THEN-ELSE, WHILE-DO, REPEAT-UNTIL, procedure calls, and assignment statements. Expressions include the standard arithmetic and logical statements.

The basic program structure is a sequence of five parts:

constant declarations,
type definitions,
variable declarations,
procedure and function definitions, and
the main program.

The language syntax is defined in Figures 1 - 27 by syntax diagrams. All symbols must be properly defined and of appropriate types when used. Symbols are strings of non-special characters which cannot be mistaken for an integer number. Constants are integer numbers and quoted characters.

Notice that the language does not include:

floating point numbers,
user-defined scalar types,
pointers,
FOR statements,
CASE statements, or
Input/Output.

2. The Compiler

The objective in writing this compiler is to produce a working compiler quickly which generates reasonable code. The compiler is written in (a subset of) Pascal and is to be executed on the DEC-10 until is is capable of compiling itself, at which time it will be moved to the Nova system.

The compiler is written as a recursive, top-down parser, based upon the syntax diagrams of Figures 1 - 27. Each syntax diagram corresponds to one procedure in the compiler. This procedure processes the syntactic entity with which it is associated. For example, a statement-list is parsed by a call to the statement procedure. Then as long as the next token is a semicolon, the semicolon is skipped, and another statement is parsed by calling the statement procedure again. Similar procedures have been written for each of the syntax diagrams.

2.1. Parsing and Lexical Analysis

The syntax procedures are supported by another set of routines which perform lexical analysis, reducing the input stream to a sequence of tokens. The tokens correspond to the nodes of the syntax diagram. Tokens are defined for all special characters and reserved words, leaving only symbols and constants.

2.2. The Symbol Table

Constants are treated as unnamed, self-defining symbols, so we concentrate on symbols. A symbol is a symbolic representation of a program entity with several attributes: a name, a kind (variable, constant, type, procedure, function, parameter, field of a record), a value (addresses for variables, procedures and functions, value for constants), and possibly a type (for variables, constants and functions). This information is stored in the symbol table. Every entry of the symbol table has 6 fields whose interpretation varies according to the kind of symbol. Figure 28 defines the interpretations of the symbol table entries for the different kinds of symbols.

The lexical analysis routines are responsible for the maintenance of the symbol table. Every symbol is represented by a symbol table entry. When a symbol is encountered by the lexical analysis procedure, a symbol table entry is immediately defined describing this symbol as a variable of undefined kind, type, and value. Constants are treated as unnamed symbols with defined values and types. Symbol table maintenance routines allow the syntax and semantic routines to defined symbol table entries, search the symbol table, link entries into the search structure, and get and free symbol table entry blocks. Typically, a symbol is either being defined, in which case it will be entered into the symbol table as a new symbol (by NEWSYMBOL), or it is being used in parsing, in which case the symbol table will be searched for the new symbol.

The symbol table is searched by a hash table structure. Each entry of the hash table points to a chain of symbols, linked in a stack. Searching the symbol table requires computing the hash function and searching only those symbols with the same hash value. The block structure naming conventions are automatically enforced by the stack nature of the hash chain.

Not all symbol table entries will be linked into the search structure -- only named symbols. For example, each symbol has a pointer to a symbol table entry defining the symbol's type. For named types, the type entry will be linked into the search structure and can be referred to by name. However implicitly defined types (such as in x: array [1..10] of integer where the index type 1..10 and the "array [index-type] of integer" type are both nameless) are not linked into the search structure, but do have symbol table entries.

Symbol names are a sequence of characters and a length. Only the first m characters of each symbol are stored. However, the actual length of the symbol is stored. Two symbols are equal only if they have the same first m characters and are of the same length.

Pascal is heavily typed and the Nova language also has this characteristic. Types are represented by symbol table entries. Types are either: Boolean, character, subrange, array, or record. Operations between variables are allowed only when the types of the variables are compatible. Two types are compatible if they are (1) identical, (2) subranges with overlapping ranges, (3) arrays of the same index range and element type, or (4) records with the same number, type and order of fields.

2.3. Expressions

Much of the actual work of the compiler involves manipulating symbols and expressions. Symbols are adequately handled by the symbol table. Expressions similarly have an expression table. An expression is described by the following fields. As expressions are parsed, code may be generated to effect the desired operations.

WHAT: A value or an address expression?
WHERE: Is this expression in a register (inreg), on the stack (onstack), or a constant (here) whose value is in another field?
VALUE: The value of a constant (if where=here), or the number of the register with the value (if where=inreg).
TYPEINDEX: a pointer to the type of this expression.

2.4. Code Generation

Code generation is handled by a collection of code generation routines. The most important routines include GENLOAD, which loads an expression into a register, and GENOP which generates the code to perform an operation between two expressions and construct a new expression (of the form expr1 := expr1 op expr2). Code generation will be described in a later note.

2.5. Error handling

Error handling is a big weakness in this compiler. Errors which are encountered are listed in Figure 29. Most important errors are detected, but recovery is poor, at best.
Home Comments?