SYSTEMS PROGRAMS

The loaders and assemblers of the last two chapters are only a few of the many systems programs which are used on most computers. Like most systems programs, their main purpose is to make the writing and running of other programs easier. However, they are far from being the best programming tools. As more and more programs are written, the usefulness of other system programs becomes evident.

Historically, as the desire for more sophisticated systems programs grew, so too did the capability of the computer systems available to support these programs. These larger computers are more capable of supporting the larger and more complex systems programs and their data structures. Today almost all large computers have at least one system program of each of the types described in this chapter, and many smaller computers do also. However, many of the smaller computer systems may not have sufficient memory, primary or secondary, to support some of these programs. This is generally the case for MIX machines. Since most programmers use many different machines in their careers, it is probable though that these systems will be available to you at some point in your programming.

9.1 MACRO ASSEMBLERS

When writing a large assembly language program, like the assembler of Chapter 8, it is not uncommon to encounter sections of repetitive coding. Consider, as an example, the error subroutines. Six small subroutines (MULDEF, BADSYM, UNOP, ILLFOR, EXPOV, ILLSYN) were all four statements long, of the form

STJ *+3 JMP ERROR ALF x JMP *
The letter x in the ALF statement varied according to the kind of error. Other instances of short pieces of repetitive code can be found in the assembler. It would ease the programming task if we could simply say that the section of code listed above is a schema, or form, called ERR. When we write ERR U, it would be the same as writing
STJ *+3 JMP ERROR ALF U SUBSTITUTE THE U FOR x JMP *
This sort of a feature would allow us to write programs faster.

The standard means of eliminating repetitive code is to use subroutines. But often, as in the example above, the repetitive code is too short for subroutines to be effective in reducing the amount of code to be written. Remember also that calling a subroutine with p parameters will require at least p + 1 instructions, so if the code segment is short, a subroutine call may take more code than the code itself. In many cases, in fact, the repetitive code is itself a subroutine call, the repetition being caused by the use of a standard calling sequence.

The execution time for using a subroutine call to replace repetitive code must also be considered. From our simple analysis of the cost of using a subroutine (Section 6.4), it was obvious that although a subroutine can sometimes save space, it always takes more time to execute a subroutine than to just write the code out. In some cases this additional time is of crucial importance, as when the code will be executed millions of times.

Thus, what is needed is a means of reducing the repetitive writing of similar sections of code without introducing the cost of calling a subroutine. A macro assembler provides this facility. A macro is a named sequence of instructions. The macro assembler generates the sequence of instructions whenever it encounters the macro name. A macro name is thus an abbreviation for the sequence of instructions. (The name "macro" comes from thinking of the sequence of instructions as a macro-instruction.)

Macros are sometimes called in-line or open subroutines to distinguish them from the standard closed subroutine concepts of Chapter 6. When a subroutine is used, the code for the subroutine is located out of the way of the main line of code. Hence the subroutine is out of line. Since a subroutine is only to be entered at its entry point and its internal structure is of no concern to the rest of the program, a subroutine is closed, like a black box. A macro, on the other hand, places its code right where it is called, in-line, and it is part of the routine in which it occurs. Thus, its internal structure is part of that routine and open to that routine.

Macro usage

As with a subroutine, macro usage consists of two things: a macro definition and a macro call. The macro definition defines the name of the macro and the sequence of statements for which that name stands. The macro call causes macro expansion to occur, with the sequence of instructions for which the macro name stands replacing the macro name in the program. Also, as with subroutines, macros may have parameters. Actual parameters, given in each call, replace the formal parameters used in the macro definition during macro expansion.

MIXAL does not provide for macros. To illustrate how macros typically are used, we extend MIXAL to MACRO-MIXAL, a macro assembly language for the MIX computers, based on MIXAL. MACRO-MIXAL is upwards compatible with MIXAL; that is, any MIXAL program is also a legal MACRO-MIXAL program. Thus, all of the features and facilities of MIXAL are included in MACRO-MIXAL, and the syntax is almost the same.

The extensions to MIXAL in MACRO-MIXAL are to allow the definition and use of macros. These require the introduction of two new pseudo-instructions, MACR and ENDM. The MACR pseudo-instruction indicates the start of a macro definition, and the ENDM pseudo-instruction indicates the end of the macro definition.

A macro definition consists of three parts: a macro header, a macro body, and a macro trailer. The macro header is one assembly language instruction whose opcode field is the MACR pseudo-instruction. The label field of the macro header has the name of the macro and the operand field has a (possibly empty) list of formal parameters for the macro, separated by commas.

name MACR p1,p2,…,pn <macro body> ENDM
The macro body is composed of those assembly language statements which are to be assembled whenever the macro is called. The formal parameters may be used in the macro body wherever a symbol may appear. This includes the label field, the opcode field, or the operand field. The macro trailer is simply the ENDM statement.

Once a macro has been defined, it may be called by simply writing its name in the opcode field of an assembly language statement. Actual parameters are specified by listing them in the operand field, separated by commas. Any label is processed by entering it in the symbol table with a value equal to the value of the location counter at the time that the macro call was encountered.

As a simple example, assume that the value of the variable C must be incremented at several places in a program. The macro definition for this function might be

UPC MACR LDA C INCA 1 STA C ENDM
After this definition, the three lines of the macro body can be generated simply by writing UPC as an opcode. This macro has no parameters. A more general macro, which could be used for the same purpose, could make the name of the variable to be incremented a parameter.
UP MACR N LDA N INCA 1 STA N ENDM
A call of this macro might look like
UP C
or
LOOP UP COLUMN
This latter call is equivalent to writing the lines
LOOP LDA COLUMN INCA 1 STA COLUMN

Macros can be used to introduce new instructions. For example, the BSS pseudo-instruction can be used with a macro assembler which provides only the ORIG pseudo-instruction but does not provide the BSS, by defining the macro

BSS MACR N ORIG N+* ENDM

Macro calls can be nested; that is, a macro may call another macro. A macro can be defined to save all registers upon entry to a subroutine by

ENTRY MACR NAME BSS 8 SPACE FOR REGISTERS NAME STA *-8 STX *-8 ST1 *-8 ST2 *-8 ST3 *-8 ST4 *-8 ST5 *-8 ST6 *-8 ENDM
When this macro is called, it immediately calls another macro, the BSS macro defined above. After the BSS macro is expanded, expansion of the ENTRY macro continues.

In addition to the introduction of two new pseudo-instructions, macros require one other change to the MIX assembly language. Notice that the name of a macro is defined by placing it in the label field. Hence it is possible to define a macro with a name of up to 10 characters. A macro call, on the other hand, requires putting the name of the macro in the opcode field of the assembly language instruction. Since the opcode field is only four characters wide, problems may arise. These problems may be solved in either of two ways. One method is to restrict macro names to four characters or less. This can make it difficult to use meaningful macro names, and so the alternative solution is more common: change to a free-format input. Most macro assemblers are free-format assemblers. This allows the length of macro names to be longer than the typically short opcode mnemonics. Assembly language statements are still composed of four fields: a label field, an opcode field, an operand field, and a comment field. A label, if present, begins in column 1 and continues to the first blank. Fields are separated by one or more blanks.

Parameter passing for macros is by name. The character string which defines the actual parameter is substituted for each occurrence of the formal parameter. The actual parameter in the macro call is not evaluated in any way until the expanded macro is assembled. This allows parameters to be used which would appear ill-formed if they were evaluated before substitution, as in the following example.

WEIRD MACR P1,P2 P1 P2) ENDM
 
With this definition, the macro call
WEIRD MOVE,ARR(7
will yield the expansion
MOVE ARR(7)
which is perfectly correct, although neither the sequence ARR(7 nor the sequence P2) is a correct syntactic entity by itself.

Occasionally it is convenient to include commas or blanks within parameters. Extra measures must be taken if this is the case since commas normally separate parameters and blanks terminate a line. In these cases, the parameter is enclosed in parenthesis. When the parameter substitution is done, the outermost parentheses are removed before any substitution occurs. For example, consider the macro, XCH, defined by

XCH MACR P1,P2 LDA P1 LDX P2 STA P2 STX P1 ENDM
If we wish to use this macro to generate the assembly lines
LDA X,1 LDX Y,2 STA Y,2 STX X,1
we must call the macro as
XCH (X,1),(Y,2)
If we were to write XCH X,1,Y,2, we would appear to be calling a macro of two parameters with four actual parameters.

The inclusion of blanks in a parameter is illustrated by the following macro, which performs an operation on each of the elements of the 100-element array X.

ALLX MACR OP ENT1 99 2H LDA X,1 OP STA X,1 DEC1 1 J1NN 2B ENDM
To add one to all the elements of the X array, we simply write
ALLX (INCA 1)
which expands to
ENT1 99 2H LDA X,1 INCA 1 STA X,1 DEC1 1 J1NN 2B

To perform more complicated operations, we can define a macro and pass the name of that macro to ALLX as a parameter. To insure that all X are within the range from LOW to HIGH, we could define the macro

TSTR MACR CMPA HIGH JLE *+2 LDA HIGH CMPA LOW JGE *+2 LDA LOW ENDM
and call the ALLX macro by
ALLX TSTR
This expands to
ENT1 99 2H LDA X,1 TSTR STA X,1 DEC1 1 J1NN 2B
This in turn expands to
ENT1 99 2H LDA X,1 CMPA HIGH JLE *+2 LDA HIGH CMPA LOW JGE *+2 LDA LOW STA X,1 DEC1 1 J1NN 2B

Some care must be taken with nested macros and local symbols. Consider what would have happened to ALLX TSTR, if TSTR had been defined as

TSTR MACR CMPA HIGH JLE 2F LDA HIGH 2H CMPA LOW JLE 2F LDA LOW 2H EQU * ENDM
When the ALLX macro and the TSTR macro are both expanded, the 2B in ALLX does not refer to the 2H in the ALLX macro, but rather to the second 2H in the expanded TSTR macro.

Macros are also very useful in defining data structures. The opcode table in the assembler, for example, had most of the entries of the form

ALF op op

Thus, the number of lines needed to define the opcode table could be reduced by half, with the definition of a macro such as

MACHOP MACR OP ALF OP OP ENDM

Macro implementation

How are macros implemented? What changes to the assembler are necessary to allow macro processing? Several changes are needed, including both the modification of some existing code and the introduction of some new data structures.

The major new data structure is a macro table. The macro table contains the symbolic name of each defined macro and additional information to allow the body of the macro to be found. This can be done in several ways. One of the most straightforward is to have each entry of the macro table include the macro body. Thus, each macro table entry consists of the name of the macro, followed by the macro body. Macro bodies may differ in size, and so each entry must also include length information. This allows the macro table to be searched for a given macro name by using the length information to skip over macro bodies, looking only at macro names.

The variable length of macro table entries in the above scheme causes problems in searching the macro table. (The macro table must be linearly searched.) To allow more efficient search techniques, fixed-size macro table entries are desired. To this end, an alternative approach utilizes two data structures: a macro name table, and a macro body table. The macro name table contains entries which specify the macro name, its length, number of parameters, and so on, and a pointer to the beginning of the macro body in the macro body table. The macro body table is a large array in which macro bodies are stored as they are defined. This approach separates the fixed-length macro information (in the macro name table) from the variable-length macro information (in the macro body table).

Parameters can similarly be handled in several ways. The formal parameter names can be stored in the macro body table as the first N symbols of each macro body table entry, before the macro body itself. The macro name table entry would include N, the number of parameters. During expansion of the macro, every symbol in the macro would be compared against the list of formal parameters. If one is found, the actual parameter would be used instead. This approach minimizes the amount of work at macro definition time, but increases the work done at expansion time.

Another approach to parameters increases the work at definition time, in exchange for more efficient macro expansion. In this approach, each parameter is assigned a number, according to its position in the list of parameters in the macro header. When the macro body is being stored, during macro definition, the macro is scanned for formal parameters. If a formal parameter is found, it is replaced by a special character which is not otherwise allowed in the assembly language, followed by the number, of the parameter. In MACRO-MIXAL, for example, notice that the character "$" is not used for any special purpose. If we simply declare that the $ is an illegal character (and check all input to be sure that a $ does not occur), then we can replace all occurrences of the first parameter by $A (the character code for A is 1), all occurrences of the second parameter by $B, etc.

As the macro is expanded, the assembler need only check each character in the macro to see if it is a $. When a $ is found, the next character can be used to index into the list of actual parameters for substitution. Using this convention, the macro definition

CALL2 MACR SUB,PARM1,PARM2 JMP SUB NOP PARM1 NOP PARM2 ENDM
would be stored internally as
JMP $A NOP $B NOP $C

Notice that different methods of handling the substitution of actual parameters for formal parameters may result in subtly different results, in much the same way that the different subroutine calling conventions (call by value, call by reference, call by name) could give different results in some cases. Some of these differences may show up at the programmer level, such as not being allowed to use the character $, or not being allowed over 64 parameters (the maximum number which can be held in the one byte following a $). Other differences are more subtle. Consider the two macros XPL1 and XPL2, defined by

XPL1 MACR A,B,C A *+1(0:2) ENT1 * XPL2 C STA B ENDM XPL2 MACR N LDA N INC1 C ENDM
If the search for parameters is done at execution time, then the expansion of the macro call
XPL1 STA,X,Y
would result in substituting the actual parameters STA, X, Y for the formal parameters A, B, C. The expansion of the inner macro call to XPL2 would result in the C of the definition of XPL2 being recognized as a formal parameter of the macro XPL1. The assembly code generated by the macro expansion would thus be
STA *+1(0:2) ENT1 * LDA Y INC1 Y STA X

If the parameter substitution is done when the macro is defined, replacing an occurrence of the ith parameter by $i, then the two macros would be stored as

XPL1 $A *+1(0:2) ENT1 * XPL2 $C STA $B XPL2 LDA $A INC1 C
Thus, a macro call XPL1 STA,X,Y would result in the expanded assembly code
STA *+1(0:2) ENT1 * LDA Y INC1 C STA X
The symbol C in the XPL2 macro would not be substituted for, since it was not known to be a parameter at macro definition time when parameters were identified and replaced by the $i representation.

In addition to these basic concerns, other programming techniques can be used to improve the efficiency of a macro assembler. One common technique is the use of compressed text. When a macro definition is stored for later expansion, not all of the input text is stored. Remember that for a standard assembly language input statement, most of the 80-character input is comments or blank. A typical input card may have only 10 to 20 useful characters on it. Thus, when the card is stored in the macro body table there is no need to store the entire card. Only the label, opcode, and operand fields need be saved, and multiple blanks between fields can be reduced to only one blank between fields.

The use of compressed text has several results. First, it increases the number or size of macros which can be stored in the fixed amount of memory available for the macro body table by 4 to 10 times. Second, it increases the complexity of macro expansion. Consider that the card images that are to be manipulated now are no longer of fixed length, but rather are of variable length. This requires special programming techniques for the representation and manipulation of variable-length strings of characters. Typically, this is done by appending the length of the string to the front of it (just as the length of the loader blocks was contained in a header word at the first of each block), or the use of a special end-of-line character or characters at the end of the line. Still, the extra efficiency in the use of memory by the assembler is generally considered to be worth the additional cost of using variable-length strings.

A rough idea of how macros are implemented should now be apparent to you. As with everything else in programming, there are many ways of actually writing a macro assembler. One of the simplest is to add another pass to an existing assembler to expand all macros. Thus, an existing two-pass assembler can be converted into a three-pass macro assembler. The first pass expands all macros, the second pass defines the symbol table, and the third pass generates the loader output.

The macro expansion pass copies all assembly language statements except macro definitions and macro calls. A macro definition is entered into the macro name table, and the assembly language statements which follow are stored, in compressed text form, in the macro body table until an ENDM statement is encountered. When a macro call is found the body of the macro is copied out onto the secondary storage device which holds the copy of the program for input to pass 2 of the assembler. Parameter substitution is performed as the macro body is copied out for pass 2. Nested macros require some additional programming care.

A little thought shows that there is nothing in the second pass of a three-pass assembler (symbol table definition) which cannot be done in the first pass along with the macro expansion. Similarly, it is possible to create a one-pass assembler which assembles a macro assembly language program in one pass. Because of the complexity of these assemblers, however, they must be very carefully designed, written, and documented.

The concept of macros need not be applied only to assembly languages, but can be applied more generally to any arbitrary file of characters. PL/I programs have a limited form of macros. Some computer systems have a program called a general purpose macro processor, which will input a text file, possibly with macro definitions and macro calls, and will output the text file which results from expanding all macro calls. (In business offices, macros are generally called form letters.)

The ideas behind a general purpose macro processor are discussed in Strachey (1965), while Brown (1969) gives a survey of macro processors. Kent (1969) emphasizes the characteristics of the macro assembly language for the IBM System 360/370.

9.2 CONDITIONAL ASSEMBLY

A professional systems programmer often writes a program in such a way that it may be usable in a wider set of circumstances than is necessary. He knows that certain aspects of the environment in which the program is to be run may change. These changes may include device numbers or types, the amount of memory available, and so forth. In writing a program, this type of information is provided to the program in such a way that the program can be easily changed, if necessary. The EQU statement provides one mechanism for this type of programming. Unit numbers (like CR, LP, or TAPE in the assembler) or the size of large tables (like the symbol table of the assembler) are defined in terms of symbolic constants, rather than their (current) numeric values. This allows these values to be changed easily if the configuration of the computer changes.

Motivation for conditional assembly

The EQU statement often suffices for small changes, but more drastic changes may require complete changes in the way in which a program is written. Consider the assembler of Chapter 7. This assembler was written with a binary MIX computer in mind. This is reflected in the use of an octal representation for the listing of assembled instructions, and the limitations on the ranges of numbers which can be represented in one byte, two bytes, or an entire word. By just changing these few values and the code for generating listing output, we can have an assembler for a decimal MIX computer.

Typically, making these changes will result in two separate programs, one an assembler for a MIX 1009B, the other for a MIX 1009D. This requires twice as much storage space, and any changes or modifications to the assembler must be made in both programs. Experience with this arrangement indicates that after a while small changes will be made in one assembler but not in the other, so that the two supposedly equivalent programs become different.

Another problem may deal with the use of different algorithms depending upon some property of a variable. For example, in our discussion of the opcode table, we indicated that either a linear or a binary search can be used; the choice between these two algorithms is made on the basis of the size of the opcode table. When the search routine was written, however, we did not know the exact size of the opcode table, except that it had NUMBEROPS opcodes. It would be possible to write our search routine so that, at execution time, the value of NUMBEROPS would be tested, and if greater than 32 (say), a jump would be made to a binary search; if less than 32, a jump would be made to a linear search. However, the value of NUMBEROPS is fixed at assembly time and hence for any particular assembly the non-selected search algorithm would never be used; it would only take up valuable space.

Both of these problems, and others, can be solved using conditional assembly. Conditional assembly refers to the ability, during assembly, to have the assembler test a condition. On the basis of the results of that test, assembly language statements may either be assembled (as normal) or not assembled (treated as comments). The important concept in conditional assembly is that the test is done during assembly, not during execution.

A conditional assembly feature is added to an assembly language by the introduction of additional pseudo-instructions. Two things need to be specified: the test to be performed and the assembly language statements to be conditionally assembled or skipped. The complexity and sophistication of these types of statements varies widely.

A simple conditional assembly feature

Probably the simplest form of conditional assembly would be the provision of a single new pseudo-instruction of the form

IF expression,expression
This pseudo-instruction, when encountered by the assembler, evaluates the first expression in the operand field. If this expression is zero, then the next n lines are skipped, where n is the value of the second expression. If the expression is non-zero, the n lines following the IF pseudo-instruction are assembled normally.

To illustrate the use of this new pseudo-instruction, consider the problem of modifying the assembler for both binary and decimal MIX computers. We define a variable BINARY which is 0 for a binary machine and 1 for a decimal machine

BINARY EQU 0 BINARY ASSEMBLER BINARY EQU 1 DECIMAL ASSEMBLER
Then we can write
HLBYTE CON HIGH AND LOW FOR BYTE IF BINARY-1,1 CON 63 BINARY MACHINE IF BINARY,1 CON 99 DECIMAL MACHINE
If BINARY is zero, then BINARY-1 is non-zero, so the CON 63 is assembled. The second IF has a zero expression, however, so it skips 1 line, the CON 99 statement. Thus, if BINARY is zero, the above code is identical to
HLBYTE CON 0 CON 63
On the other hand, if BINARY is 1, the CON 63 is skipped and the CON 99 is assembled. Thus, the assembled code would be
HLBYTE CON 0 CON 99
Similar conditional code could be used in the other few places where an assembler for a binary MIX machine would differ from an assembler for a decimal MIX machine.

The implementation of a conditional assembly feature such as the above is very simple. To the one-pass assembler of Chapter 8, it would be necessary only to,

  1. Add the new pseudo-instruction to the opcode and give it an opcode type of 7.
  2. Add a jump to IFOP to the jump table of the main loop which separates out opcode types.
  3. Add code to interpret the IF pseudo-instruction by first calling the EXPRESSION routine. If the value of the expression is non-zero, return control back to the end of the main loop. Otherwise evaluate the second expression and skip that many cards by calling READCARD repetitively. Then return to the end of the main loop.
IFOP ENT1 HLWORD FIRST EXPRESSION ANY VALUE JMP EXPRESSION JANZ ENDCASE IF NONZERO, CONTINUE * * IF EXPRESSION WAS ZERO, SKIP N CARDS. * FIRST DETERMINE N, PUT IN I1 * ENT1 HLADDR SKIP MAX OF 4000 CARDS. JMP EXPRESSION STA *+1(0:2) ENT1 * MOVE A TO I1 SKIPCARDS JMP READCARD DEC1 1 J1P SKIPCARDS JMP ENDCASE *

More sophisticated conditional assembly

The simple form of conditional assembly introduced in the previous Section can easily be introduced into any assembler. However, it is unsatisfactory for a number of reasons. First, the requirement of providing the number of input cards to skip is, while simple for the assembler, troublesome for the programmer. The programmer must carefully count the number of cards to skip. If any changes are made in code which is conditionally assembled which may increase or decrease the number of cards, the programmer must remember to change the skip counts on the IF instructions also.

This problem is generally eliminated by introducing a new pseudo-instruction, ENDI. When an IF pseudo-instruction is encountered and the assembler decides to skip, it skips cards until it encounters an ENDI pseudo-instruction. If an ENDI is encountered when the assembler is not skipping due to conditional assembly, it is simply treated as a comment and ignored.

In addition, the form of the IF pseudo-instruction is generally more complex. Rather than allow only a test for zero or non-zero, conditional assembly often allows tests for zero, non-zero, positive, negative, non-positive, or non-negative of an expression. These could be written as

IF Z,expression IF NZ,expression IF P,expression IF N,expression IF NP,expression IF NN,expression
In each case, the expression is evaluated. If the condition is true, then the following lines are assembled; if the condition is false, the lines which follow, up to and including the next ENDI, are skipped and treated as comments.

Even more complex conditional assembly forms allow the comparison of two expressions for equal, not equal, less than, less than or equal, greater than, and greater than or equal. In addition, it is sometimes possible to test for equality or nonequality of character strings (generally passed as parameters to macros).

Another type of conditional assembly allows a symbol to be tested for a defined or undefined characteristic. This is easily done by a simple search of the symbol table, and is most useful in macros, when a parameter may or may not be defined. For example, consider a macro to exchange the values of the A and X registers. This can be written as

AXCH MACR STA TEMPA STX TEMPX LDA TEMPX LDX TEMPA JMP *+3 TEMPA CON 0 TEMPORARY SPACE TEMPX CON 0 ENDM
However, notice that if this macro is ever called twice, the second call will result in the labels TEMPA and TEMPX being doubly defined. To avoid this we can write, assuming that the IFD pseudo-instruction will skip until an ENDI if the symbol is defined
AXCH MACR STA TEMPA STX TEMPX LDA TEMPX LDX TEMPA IFD TEMPA JMP *+3 TEMPA CON 0 TEMPX CON 0 ENDI ENDM
This macro will not generate the last three lines of code (the JMP and two CONs) if the symbol TEMPA is defined. This prevents multiply-defined labels from multiple uses of the macro.

Even more sophisticated pseudo-instructions can be added to an assembler to control what instructions are generated. These pseudo-instructions can result in only a minor change to the assembler, or can require additional passes for correct processing. Although each of these additional features may be very useful, and even necessary in some cases, they often make the use of such a sophisticated assembler much more expensive (in both time and memory) than a simpler assembler. The important concept is that a certain amount of control over what code is generated can be exercised at assembly time. The assembler itself can make decisions to include or exclude blocks of assembly language statements in order to generate more efficient, compact, or useful machine language programs.

9.3 COMPILERS AND HIGHER LEVEL LANGUAGES

Assemblers are used because programming in machine language is too boring, dull, and error-prone to be fun or cost-effective. Assemblers allow the programmer to express the specification of a program in a symbolic rather than a numeric form. This allows the programmer to create a program in a more convenient form, one closer to the way in which the algorithm for solving the problem was conceived.

Assembly language is still a very primitive, computer-oriented means of writing a program. To ease the programmer's task even more, higher-level languages have been defined. Algol, Cobol, PL/I, Fortran, Pascal, and Basic are all examples of higher-level languages. These languages attempt to be more human-oriented than computer-oriented, and, compared to assembly languages, succeed.

Still, all computer programs must be expressed in machine language in order to be executed. A compiler is a program which translates from a higher-level programming language into machine language. (Actually, of course, a compiler, like an assembler, translates into loader code and then a loader translates this loader code into machine language.) A compiler is a translator.

A higher-level language program is composed of a sequence of statements. Each statement corresponds to possibly many machine language (or assembly language) instructions. This is the major difference between the definition of an assembler (which is basically a one assembly language statement to one machine language instruction translator) and a compiler. A compiler may generate many machine language instructions for a statement in a higher-level language. Also, a program written in a higher-level language is relatively machine independent. It does not deal with bits, registers, or addresses, but rather with variables and statements which operate on these and more complex data structures.

The statements of a higher-level language are specified in two ways. First, each statement has its own syntax. The syntax of a statement defines the form or forms that a statement can take. For example, in Fortran the syntax of an assignment statement is
<variable name> = <expression>
The same type of statement in Algol or Pascal has a different form
<variable name> := <expression>
The meaning of these two statements is the same, but the syntax is different. The items in brackets are syntactic entities, items whose syntax is also defined in the syntactic definition of a higher-level language.

In addition to syntax, a higher-level language defines for each statement its semantics, or meaning. The semantics of a statement or other syntactic unit define what that statement means when it is written in a program. The semantics of the assignment statement in Fortran are that the value of the expression on the right of the equal sign should be evaluated and the value of the variable named on the left of the equal sign is set equal to the value of the expression. (In fact, the semantics of the assignment statement are considerably more complex due to such things as type conversions between different types of variables and expressions.)

The problem for a compiler is to generate code which, when executed, correctly reflects the semantics of a program with correct syntax. To do so, a compiler has great latitude in the type of code it generates. A compiler has complete control over (and responsibility for) how storage is allocated for variables, how registers are used, subroutine calling sequences, code generation, and so on.

To compile a program, each statement passes through several phases. First, there is a lexical phase. This phase groups together characters from the input statement into variable names, operators, separators, and other lexical entities. One of the important classes of lexical entities is the class of reserved words or keywords. Keywords are used in the next phase of a compiler: the syntactic analysis. Syntactic analysis identifies the type and components of a statement and checks for errors in the form of the statement. This is also called parsing.

The result of the syntactic analysis determines the input to the semantic phase. In this phase further error checking occurs to insure that proper types of variables and expressions are used in that statement. Then the machine code to be executed is generated, and the compiler continues with the next statement. Code optimization routines may go back over the generated code and try to improve it by better register allocation and use of instructions.

Many different techniques are used to compile programs. The older languages, like Cobol and Fortran, generally use ad-hoc techniques which make heavy use of the keywords in the language. In Fortran, for example, every statement except the assignment statement starts with a keyword. Thus, a compiler need only check the first few characters of a statement. If it is a keyword, then the type of the statement has been identified; if it is not a keyword, then the statement is an assignment statement, and again its type has been identified. Once the type of the statement is identified, the expected syntax of the remainder of the statement is known, and can be used to direct the parsing. Basic is another language which requires each statement to begin with a keyword.

Other languages have been designed in such a way that special compiling techniques can be used with them. The compilers for these languages determine what code to generate from the sequence of lexical entities which the compiler sees. These compilers are called syntax directed compilers. Algol and Pascal are examples of these languages.

Compiling techniques and programming languages are subjects for study in themselves. We do not consider them in this book. The basic concepts are the same as with assemblers: the input program is read and an equivalent program in loader format is output, along with a listing of the input program. Some compilers are one-pass, load-and-go systems, while others take multiple passes to produce their output loader code. (Rumor has it that there exists a twenty-pass Cobol compiler.) Compilers allow programmers to concentrate more on the problem to be solved, instead of having to constantly consider the idiosyncrasies of the computer on which the problem is to be solved.

The books by Gries (1971) and Aho and Ullman (1973) are excellent and comprehensive treatments of how to write compilers. Less formal treatments are in Donovan (1972) and Graham (1977).

9.4 INTERPRETERS

All of the systems programs we have examined so far could be classified as translators, in that they only transform a program from one form to another; they do not add anything to the program which was not there before. A different class of programs is the class of interpreters. Where a translator only translates from one language to another, an interpreter actually executes the program. Most often an interpreter executes the machine language for some computer, or sometimes executes a higher-level language directly.

There are many reasons for using an interpreter. For a machine language interpreter, the machine which is being simulated may not have been built yet. Having a simulator or interpreter available allows the basic software (loaders, assemblers, compilers) to be written at the same time that the machine is being built. This allows the computer to be useful for programming months earlier than if software development had to wait until the hardware was available.

Another use for interpreters is to allow the evaluation of a proposed computer design before committing the resources needed to actually build one. If the design is not easy to program, or does not perform well on typical programs, it may need to be redesigned or discarded altogether.

Perhaps the largest use of interpreters, however, is for education. Since, as of this writing, there are no real MIX machines, the "computer" which you have been programming is most likely a simulator which interprets the MIX instruction set. The MIX machine is used because it is typical of a number of real machines (as we shall see in Chapter 10), but does not have the hard-to-explain properties of real computers that are caused by the realities of engineering.

In addition it is possible to add to an interpreter features which make the debugging of programs much easier than debugging on a real machine. Routines which dump simulated memory when a error is found, that trace the execution of certain instructions by printing the program counter, instruction being executed, and contents of the affected registers and memory locations, that stop the program when certain conditions occur (program counter equal to one of a set of values, a given memory location is read from or written to, a certain opcode is encountered), or that can even print the instructions which were executed just before any of the above conditions occurred; all these features can be programmed into an interpreter, but would be difficult to add to the hardware of a real computer. Also, these features are really only needed when a program is being debugged, not when it is being run in a production environment.

Writing an interpreter is relatively simple. Knuth (1968), in Volume 1 of his The Art of Computer Programming presents the basic idea by presenting an interpreter for MIX, written in MIXAL. The basic data structures are an array which is used to simulate MIX memory, and variables which simulate the MIX registers. The general flow of the interpreter is to read an instruction from memory, decode the instruction according to its opcode, and then branch to short code segments which simulate each type of instruction in the MIX instruction set. If you have an opportunity, review Knuth's program. It is an example of the work of a master of the art of computer programming.

9.5 OPERATING SYSTEMS

You may have noticed that the number of programs which have been discussed is getting large, and we have not begun to discuss programs for solving differential equations, for playing chess, for computing your tax returns, for learning French, for compiling a list of all distinct words in Shakespeare's plays, and so on. As more and more programs are written and the procedure for executing them becomes more and more varied, it is necessary to write a program to organize and control the use of the computer. This program is called an operating system. An operating system is a large program, or collection of programs, whose only purpose is to make the use of the computer more convenient and more efficient.

The convenience occurs in many ways. The operating system performs services for the user of the computer. Most operating systems do all the I/O for their users. The I/O is automatically buffered and blocked to keep the I/O devices as busy as possible while relieving the user from having to write the code for this task for each new program.

In addition, this I/O is device independent. User programs perform I/O on named collections of data called files. A file generally looks to the user like a magnetic tape; it can be read, written, or rewound. The physical implementation may be a tape, or a card reader (for an input only file), or a line printer (for an output only file), or magnetic disk or drum. The user need not worry about which device his information is stored on, nor (for disk and drum) where on the device it is. The operating system maintains a symbol table, called a file directory, which maps symbolic file names onto numeric device unit numbers and (for disk and drum) track and sector numbers.

Another convenience offered to the user of an operating system is control cards. Without an operating system, an assembly language programmer must first load the bootstrap loader, then load the absolute loader, then the assembler, then the relocatable loader, then the assembled program. This can take a lot of work even for very simple programs. An operating system relieves the programmer of most of this work by defining a control card language which allows the most commonly requested uses of the computer to be specified with only one (or a few) control cards. For example, the control card

MIXAL.
might cause the MIXAL assembler to be loaded and begin reading a program from the card reader, printing a listing on the line printer and putting loader code on a file called RUN. Then the control card
RUN.
could load and begin execution of the assembled program. The control card interpreter is the part of the operating system which reads the control cards and causes the computer to perform the correct actions.

For small computers, this may be all the operating system which is needed. The operating system may be either a batch system, if it accepts its input from a card reader, or an interactive system, if it accepts its input from a terminal like a typewriter or CRT. For large computers, however, the operating systems become much more complex in an effort to use the computer as efficiently as possible. Typically, the cost of a large computer is $200 to $500 an hour. (Cost of the purchase price of the system divided by its expected lifetime plus operating expenses of power, paper, programmers, and people.)

To justify such large costs, the computer should be busy all the time. To increase the efficiency, such machines are often multiprogrammed; that is, several programs are executed together for short periods of time, to share the computer. The main idea is to execute one program until it has to wait for I/O, then, rather than sitting idly by, the computer begins to execute another program, until the I/O for the first program is complete; then it switches back to the first program. Later it will continue the second program where it was interrupted. The computer is so fast (and I/O is so slow) that often four to eight programs can be executed this way without any one program being executed any slower than if it had the computer to itself. An extreme case of this is a time-sharing system, where possibly 50 people, sitting at typewriter or CRT terminals, each give the computer commands to execute and each think that they have the computer all to themselves.

Like the subject of compilers, the subject of operating systems is a field of study in itself. A great deal of work has been and is being done concerning the services which an operating system should provide, how it should be designed, and how it should be implemented. The introductory texts by Madnick and Donovan (1974) and Tsichritzis and Bernstein (1974) have good treatments of general operating system problems and solutions. Wilkes (1968) concentrates on time-sharing systems.

9.6 OTHER SYSTEMS PROGRAMS

The major systems for a computer system are the loaders, assemblers, compilers and operating systems, but there are more.

A text editor is a system program for manipulating files of text. The text files may be programs, data, output, or any other set of textual material which is machine-readable. Typically, these programs are written for time-sharing systems, although a few have been written for batch systems. The objective of a text editor is to allow a user of the computer who has some text file to modify that file easily without the necessity of using cards.

A text editor is an interpreter. It inputs commands from the user and executes them, modifying the text file as instructed. Text editors allow characters and lines to be inserted, deleted, replaced, searched for, substituted for, and moved about in the file. Some editors allow macros of the basic editing commands to be defined and called allowing very complex editting functions to be done by a single macro command. A good text editor in a time-sharing system will allow programs to be written, compiled, debugged, corrected, and documented entirely in the computer.

Much of the use of text editors is for the writing of programs, but some uses are for simple text such as papers and books. This book was written, revised, and edited entirely on a computer. This allowed changes to be easily made as necessary. To assure an attractive appearance, another program, a text formatter was used. The text formatter read an input text file and produced a nicely formatted output file. The output file had proper indention and spacing, centered titles, automatic line counting for paging, and right-justification. Right-justification is the property of having all complete lines ending evenly at the right margin. This is fairly difficult to do on a typewriter but can be easily coded in a computer program with a little character manipulation.

Text formatters are generally driven by commands embedded in the text to be formatted. These commands control spacing, margins, paging, paragraphing, indenting, centering, underlining, and so on. The more sophisticated text formatters allow macros of commands to be defined and used and some allow conditional formatting.

There are many more, useful systems programs. Most of these are not often treated in printed sources, but mainly are developed as needed for each new system. The book by Kernighan and Plauger (1976) is probably the best published treatment of systems programs and how they should be designed and coded.

9.7 SYSTEMS PROGRAMMING LANGUAGES

The vast majority of systems programs are written in assembly language. There are many reasons for this. One reason is simply historical -- systems programs have always been written in assembly language -- but the primary reasons are function and speed. A higher-level language, such as Fortran, generally does not allow the programmer to express easily those functions which are common to systems programs, such as character manipulation or the use of absolute addresses. Higher-level languages are generally meant to be machine independent, preventing the systems programmer from exploiting the particular features of the particular computer being used. This is especially critical in interrupt handling and I/O instructions.

Just as important is the fact that no compiler can generate code which is better than the code that the best assembly language programmer can write. Consider simply that any code which a compiler can produce can also be written by a programmer, but the programmer may be able to apply local or global optimizations to improve the speed of the code or to reduce its size. On many small computers, like MIX, the problem of limited memory space may be quite severe, and this is a constraint that compilers tend to be unable to consider.

There are also arguments against using assembly language. First, assembly language demands great attention to detail, which complicates the programming process. This makes assembly language programs difficult to write, debug, and understand. It is particularly difficult to try to read an assembly language program which has been written by another programmer. Second, assembly language is specific to a particular machine, which means that programs written in assembly language for one computer cannot be transported to a different computer; they must be completely rewritten.

Finally, although compiler-generated code may not be better than the best assembly language code produced by the best assembly language programmers, all programmers are not assembly language experts; most are average programmers writing average code. Compilers can often generate reasonably good, average code. Further, the real determining factor in efficient programming is not the coding but the algorithm design. A good algorithm, with average code, will far out perform a poor algorithm, with excellent code. For example, a binary search on a large table will take far less time than a linear search, no matter how well the linear search is coded.

These considerations have resulted in the development of a number of new languages, called systems programming languages (or machine-oriented languages, or systems implementation languages). These languages lie between the low-level assembly languages and the higher-level procedure-oriented languages. One early language was PL360 for the IBM 360 (and IBM 370) computers. It was designed by Wirth (1968) to include the advantages of both assembly language (in terms of control over generated code) and higher-level languages (in terms of ease of reading and coding). PL360 allows a program to be written in a syntax similar to PL/I or Algol, but allows anything which can be written in assembly language to be written in PL360.

Many other systems programming languages have been developed, including some, like the SIMPL language of Basili and Turner (1975) which are reasonably transportable; that is, they can be used on several different computers. A systems programming language exists for many current computers, so that many systems programs need no longer be written in assembly language. Knuth (1968) mentions PL/MIX, a systems programming language for the MIX computer which will be described in Chapter 9 of his set of books, The Art of Computer Programming.

9.8 SUMMARY

The first pieces of system software which are developed for a computer are the loaders and assemblers. As the use of a computer grows, the sophistication of the software generally does also. A macro assembler with conditional assembly features can make assembly language programming easier. Text editors, compilers, text formatters, and operating systems are programs which help the user of the computer to accomplish useful work without being forced to write assembly language programs. Systems programming languages can give the programmer the power of an assembly language but with the syntax of a higher-level language.

EXERCISES

  1. What is the primary difference between a subroutine and a macro? Give an advantage of subroutines over macros and an advantage of macros over subroutines.
  2. A macro
    1. is an open subroutine.
    2. is an inline subroutine.
    3. is a closed subroutine.
    4. is a reentrant subroutine.
    5. passes parameters by value.
    6. passes parameters by name.
    7. passes parameters by address.
  3. Assume that we have a three-pass macro assembler. What is the most likely purpose for each pass?
  4. Why isn't a macro checked for errors when it is first defined?
  5. Some macro generators permit assembly language operations to be redefined as macros. How could this feature be used advantageously in debugging?
  6. If an assembler allows recursive macro calls, what else must it allow? (For example: Nested macro definitions? Conditional assembly? Forward references in expressions? Default parameters?)
  7. Explain how macros are implemented in an assembler. What new data structures are needed?
  8. What is the name of the system software which,
    1. makes one computer act like another?
    2. translates Fortran into machine language?
    3. translates MIXAL into machine language?
    4. coordinates and controls the entire computer and its use?
  9. What is an interpreter?
  10. Some systems have a form of conditional loading (similar to conditional assembly) that works like this: Input to the loader is a collection of segments with defined entry points and externals (as discussed in Chapter 7). In addition, associated with each segment is a special bit. If this bit is on, then loading proceeds as normal. If this bit is off, then this segment is loaded only if there is a reference to an entry point of this segment by some other segment which is loaded.

    This allows library subroutines to be included, and if they are never used, they will not be loaded into memory. For example, if the segment for a SQRT function is marked to be conditionally loaded, it will be loaded only if it is used by a segment which is loaded.

    Notice that loading a segment which was marked for conditional loading may require that other conditionally loaded segments be loaded also.

    How many passes would a loader need to be to implement conditional loading?

  11. What are some advantages of higher-level languages over assembly languages? Of assembly languages over higher-level languages? How do systems programming languages fit in with these other languages?
  12. Are there reserved words in MIXAL?
  13. Why would programs be run on an interpreter rather than a real computer?
  14. What is the purpose of control cards?
  15. What is device independent I/O? What is a file?
  16. Why can't a compiler generate better code than an assembly language programmer? How does a systems programming language approach this problem?