Home Up Questions?




Design and Implementation for a PASCAL Compiler on Nova 3/D Minicomputer




Tzong-Yu Paul Lee




Department of Computer Sciences
The University of Texas at Austin
Austin, Texas 78712




April 28, 1978


1. Introduction

The design and implementation work for a full-scale PASCAL compiler on Nova 3/D minicomputer began in January 1978. This report summarizes the efforts as of April 28, 1978. Data structure and implementation notes are discussed in details while the future design and implementation work are roughly sketched. The language implemented is formally presented in section Three. The differences between this implementation and the standard PASCAL are outlined and justified.

2. Design Considerations

A one-pass compiler in general requires significant amount of memory to hold the compiler code and the relevant data structure such as string table and symbol table in core. For most mini- and microcomputer system, this is not practical and may not even be feasible. A one-pass compiler also tends to produce rather poor code which is not acceptable to most mini- and microcomputer applications. For instances, real-time control applications and system programming language. A minimum of two-pass compiler is proposed and designed which has the following advantages:

  1. less amount of memory is required for each pass

  2. modularization and maintenance of the compiler code is easier

  3. extra passes can be added to further optimize the codes

  4. interpreter for the intermediate language is feasible

Note that the first pass of the compiler does not even include memory allocation which makes itself completely target machine independent. The symbol table will be passed to the second pass in a condensed form. The compiler could be constructed to produce code on a procedure by procedure basis. However, this requires some kind of self-overlaying structure which might not be feasible. Therefore, the variables and labels for all procedures are mapped and passed to the second pass.

The second pass will generate assembly language code which makes the code generation task easier to handle. If MACRO assembly language is supported by the target machine, it should even further reduce the amount of efforts. Some kind of machine independent optimizations is expected in the second pass.

3. Language Definition

3.1. <identifier>


			letter
	letter
			digit

3.2. <unsigned integer>


		digit

3.3. <constant>


		+		constant identifier

		-		unsigned integer

		'	character      '

3.4. <unsigned constant>


		unsigned integer

		     NIL


		constant identifier

	   '	    character       '


3.5. <variable>


	    variable identifier			    ,

	    field identifier		[      expression     ]

					.     field identifier




3.6. <simple type>


		type identifier

	    (       identifier	       )

			,

		constant    ..     constant

3.7. <type>


	   	    type identifier

		simple type

	  packed	file    OF	simple type

			RECORD	    field list	    END

			ARRAY	 [    simple type   ]   OF   type

					   ,


3.8. <field list>


			;

		,

	   identifier	   :	 type


	  CASE 	  identifier    :    type identifier     OF


		  ,					;

	    constant    :     (    field list      )

3.9. <formal parm list>


		VAR              ,

	(		  identifier	 :    type	     )

		PROCEDURE   identifier     formal parm list

		FUNCTION    identifier     formal parm list

				:     type

				     ;


3.10. <actual parm list>


			,

	   (	   expression		)

		procedure identifier

		function identifier

3.11. <factor>


		variable

		unsigned constant

		function identifier       actual parm list

		(	expression	 )

		NOT	factor

3.12. <term>


			factor

			  *

			  /

			  DIV

			  MOD

			  AND

3.13. <simple expr>


	    +

			  term

	    -		    +

			    -

			    OR

3.14. <expression>


		simple expr	>

				<

				>=

				<=

				<>

				=		simple expr

3.15. <assign stmt>


		variable

				        :=	expression

		function identifier

3.16. <if stmt>


	   IF	   expression	 THEN	  stmt

				ELSE	stmt

3.17. <case stmt>



	   CASE	    expression    OF


		constant    :    stmt		END

		    ,

			   ;

3.18. <for stmt>


		FOR	variable identifier	:=    expression

			DOWNTO

			TO	  expression    DO   stmt

3.19. <while stmt>


	   WHILE	expression    DO    stmt

3.20. <repeat stmt>


	   REPEAT      stmt	  UNTIL	     expression

			 ;

3.21. <with stmt>


	   WITH	      record variable	 DO	stmt

			     ,

3.22. <compound stmt>


	     BEGIN	stmt	     END

			  ;

3.23. <call stmt>


	     procedure identifier	actual parm list

3.24. <goto stmt>


	     GOTO	unsigned integer

3.25. <stmt>


	      unsigned integer	   :

					assign stmt

					if stmt

					case stmt

					for stmt

					while stmt

					repeat stmt

					with stmt

					compound stmt

					call stmt

					goto stmt


3.26. <label part>


	   LABEL	unsigned integer        ;

				,


3.27. <constant part>


	   CONST	identifier	=     expression     ;


3.28. <type part>


	   TYPE		identifier	=	type      ;


3.29. <variable part>


	   VAR		identifier	:    type     ;

			     ,




3.30. <procedure part>


	   PROCEDURE	 identifier     formal parm list

	   FUNCTION	 identifier	formal parm list

					type   :

			EXTERNAL

		;	block	       ;

			FORWARD


3.31. <block>


	   label part	     constant part	type part

		variable part	    procedure part	stmt

3.32. <program>


		block		.

The differences between this implementation and standard PASCAL is listed and discussed briefly :

  1. REAL type and relevant operations are not supported at this stage.

  2. SET type and relevant operations are not supported, in particular, for limited bit manipulation machine.

  3. FORWARD and EXTERNAL procedure/function are supported in this implementation to facilitate compilation process and environmental interfacing.

  4. CONST declaration allows compile-time evaluatable expression in this implementation.

  5. PROGRAM heading is not required and not implemented in this compiler.

  6. BLOCK statement specification does not require Compound Statement, instead, Statement is implemented.

  7. Formal Parameter List requires explicit parameters specifications if it is formal procedure/function name. On the other hand, actual parameter list does not allow actual procedure/function indentifier to include parameters.

For convenience, the reserve words are listed as shown below :

	AND ARRAY BEGIN CASE CONST DIV DO DOWNTO
	ELSE END FILE FOR FORWARD FUNCTION GOTO
	IF LABEL MOD NIL NOT OF OR PACKED PROCEDURE
	RECORD REPEAT THEN TO TYPE UNTIL VAR WHILE
	WITH EXTERNAL

Also, the standard identifiers are listed as shown below :

  1. Constants : TRUE, FALSE, MAXINT.

  2. TYPES : INTEGER, BOOLEAN, CHAR, TEXT.

  3. PROGRAM PARAMETERS : INPUT, OUTPUT.

  4. FUNCTIONS : ABS, CHR, EOF, EOLN, ODD, ORD, PRED, SUCC, SQR.

  5. PROCEDURES : READ, READLN, WRITE, WRITELN, GET, NEW, PACK, PAGE, PUT, RESET, UNPACK.

4. Scanner

The major data structure for the scanner is the string table(STRTABLE). There is no practical limit on the length of an identifier. Complete character string will be stored in the string table and the length and a pointer will be returned to the paser.

The scanner returns four global parameters each time being called:

  1. TOKENLENGTH : the length of the token

  2. TOKENTYPE : reserved word, identifier, number, character string ,special character and EOF (end of file)

  3. TOKENVALUE : numerical value for number

  4. TOKENPTR : pointer to the string table for reserve word or identifier

Several other variables such as line number(LINENUMBER) and column pointers(LINEPTR and LASTLINEPTR) are maintained for listing purposes. Reserve words are organized by hashing technique. Collisions are resolved by chaining them together and ended by a pointer which points to the last element itself. Several issues in the scanning phase are treated in the following ways:

  1. 154SYMBOL : warning message will be given for this technical violation. Two tokens will be returned,i.e., 154 and SYMBOL.

  2. (*) : invalid comment.

  3. iden : integer; is the same as IDENT : INTEGER;

  4. CON(*TEST*)XYZ : two tokens, CON and XYZ, will be returned.

  5. '''' and 'don''t' will be recognized as ' and don't.

  6. '' will cause warning message

5. Parser

Recursive descent parsing algorithm is employed in this compiler. The 'IF..THEN IF..THEN..ELSE...' ambiguity is resolved in such a way that the ELSE statement is always associated with the closest IF.

In a few cases, continuation of the parsing algorithm requires some kind of semantic information:

  1. ID.FIRST := 12; in contrast with IDFUNC := 12;

  2. TYPE NEW = OLD; in contrast with TYPE NEW = OLD..100;

  3. RECORD CASE BOOLEAN OF ... END; in contrast with RECORD BOOL : BOOLEAN OF ... END;
However, experiences showed that one-token lookahead approach would be nicer in terms of less dependency with other modules.

6. Semantic Process I (Symbol Table Setup)

Symbol table is maintained in a stack(SYMTABLE). Therefore, symbol table entries can be reused due to the block structure nature of PASCAL language. Global(level 1) and intrinsic(level 0) declarations are the major memory-consuming factors. This concept also applies to the string table. However, a static display(DISPLAY) is necessary to maintain information for the nested blocks. Each symbol table entry will be associated with a unique number and will be mapped out along with the intermediate code segments. This mapping facilitates the symbol table accessing in the second pass since there is no search required.

Each symbol table entry has four common parts:

  1. LEVEL : block nesting information

  2. LENGTH : token length

  3. STRINGPTR : pointer to the string table

  4. NEXTENTRY : pointer for the search structure
If the entry has no name, the LENGTH and the STRINGPTR are both zero. Depending on different kinds of identifiers, the symbol table entry contains the following information:
  1. LABEL : value of the label

  2. CONSTANT : constant value and constant type

  3. PARAMETER : next parameter pointer, parameter type, pass by value or by reference

  4. FIELD IDENTIFIER : next field element pointer and field type

  5. TAG : pointer to the first case and type pointer

  6. CASE VARIANT : corresponding variant constant, next case pointer, pointer to the first field element and pointer to the tag

  7. TYPE IDENTIFIER : further classification is shown as follows:
    1. SCALAR : largest constant value(with smallest value equal to 0)

    2. SUBRANGE : maximum and minimum values, subrange type

    3. POINTER :type pointer

    4. PROCEDURE : pointer to the first parameter, procedure status(forward, external or in-line)

    5. FUNCTION : pointer to the first parameter, function status , function type

    6. ARRAY : pack information, subrange pointer, array element type pointer

    7. RECORD : pack information, pointer to the first field element, and pointer to the tag(variant)

    8. FILE : pack information, pointer to the type of file

Hashing technique is used for symbol table search. An array of bucket pointers is maintained for the symbol table search purpose. Not all the symbol table entries are linked into the search structure. Field identifiers are not linked into the search structure. However, separate display information for WITH statement processing and RECORD type symbol table construction is necessary. Unnamed symbol table entry will not be linked into the search structure either. For example :

	BINX : RECORD
		PARTA, PARTB : INTEGER;
	       END;
The variable identifier BINX will point to a record entry which does not have a name.

Type declaration parts allow forward reference except for the type of tag identifier which is necessary for VARIANT processing. For examples :

	LINK = PTR;
	LINKX = PTR;
	NEW = OLD;
	PTR = RECORD X,Y : OLD END;
	OLD = 0..100;

Label value is stored in the symbol table entry and the entry is linked for search purpose. The hashing value for the unnamed label is the value module bucket size. These external labels will be combined with the internal program control labels and mapped out along with intermediate code segments.

Multi-dimensional array is conceptually broken down into nested one-dimensional arrays for simplicity. For example:

	DATA = ARRAY [1..10, RED..BLUE] of BOOLEAN;
is the same as
	DATA = ARRAY [1..10] of ARRAY [RED..BLUE] of BOOLEAN;

Procedure and function declaration is conceptually treated as a procedure/function identifier pointing to a TYPE which contains the parameter information. In other words, procedure and function are types. The scope of procedure/function declaration is better illustrated by the following example :

	...
		(* level n *)
	P1(3, TRUE, FUNCID);
	...
Procedure P1(X1 : integer; Y1 : BOOLEAN; FUNCTION Z1(A1 :
	     INTEGER; A1 : BOOLEAN) : INTEGER);
		(* level n+1 *)
BEGIN
	IF Y1
	   THEN ...
	   ELSE IF Z1(10,TRUE) <> 0
		   THEN ...
END;
	...
		(* level n *)
	...

The following aspects need particular attention :

  1. P1 belongs to level n.

  2. X1, Y1, Z1 belongs to level n+1.

  3. X1, Y1, Z1 will be unlinked from the search structure when exits from level n+1, but the symbol table entries will be retained for parameter check.

  4. two A1's are acceptable since they only serve as dummy names.

7. Semantic Processing II ( intermediate Code Generation)

The second phase of semantic processing will involve at least the following tasks :

  1. compatibility test

  2. compile-time expression evaluation and/or result prediction

  3. reverse polish form will be returned as the result of expression parsing

  4. control structure breaks down to GOTO-like form

  5. variable and label mapping and generate output file

  6. generate file of intermediate code

8. Error Handling and Others

There is no easy and clean way to handle errors. However, for practical reasons, this is so essential a feature for the users to get as much information as possible in each compilation process.

Warning messages and Error messages are implemented at this stage. In general, only warning message will be given as long as the compilation process could go on. Error handling routine is treated in its most primitive way. That is, proper flag will be set and error number will be given. No recovery is attempted but proper termination is achieved.

There are two useful approaches to deal with errors in the future :

  1. Forcing expression to terminate by setting flags and skipping to the end of the statement

  2. Looking for a set of relevant key words and then keep processing ,particularly useful in control structure
Extra error messages can be suppressed by proper falg communication.

There are two more relevant issues:

  1. The trace facility should not be taken out. They could be deactivated by enclosing them as comments.

  2. All the initialization routines can be later on manually coded if so desired.

9. Acknowledgement

The experiences gained from this project are rewarding. Continuing efforts along this direction are expected.

My special thanks go to Dr. James L. Peterson for his helpful discussion and great patience on many unexpected interruptions. Without such support this project could not proceed.
Home   Comments?