A Critique of the Programming Language CLU

James L. Peterson

Department of Computer Sciences
The University of Texas at Austin
Austin, Texas 78712

December 1978

This paper has been published. It should be cited as

James L. Peterson, ``A Critique of the Programming Language CLU'', CLU Design Note 78, Laboratory for Computer Science, MIT, (April 1979), 33 pages.

1. Introduction

The programming language CLU [4,5] was designed in 1974 and 1975 at the Massachusetts Institute of Technology by B. Liskov and her students and has since been implemented on at least two computer systems with other implementations planned. The CLU language is the result of an effort to produce a programming language which reflects and supports modern programming concepts and methodologies. CLU, along with Alphard [7], and perhaps Euclid [3], Gypsy [1], Mesa [2] and Modula [6], are the most commonly referenced recent language designs. The concepts of these languages have, and will, influence all future language design.

In this paper, we begin by discussing some of the major new features of CLU which are not found in more conventional modern languages (such as Pascal). Then we discuss some problems we have found in CLU. These problems can be grouped into four areas: structural, semantic, functionality, and syntactic.

2. A Brief Introduction to CLU

CLU is a modern programming language descended from the tradition of Algol and Pascal, but with several major new features.

The most important feature is the cluster. A cluster implements a data abstraction by defining an abstract type, its representation (in terms of lower level data types), and the operations which are possible on objects of this abstract type. The cluster heading gives the name of the new type and the set of operations which are allowed on that type. Then the representation of the type is defined, and a set of procedures given. Each operation for this new type corresponds to one of the procedures in the cluster. For example, a new type complex_number might be defined by a cluster of the form,

     complex_number = cluster is add, subtract, multiply, ....

rep = record [ real_part: real, imag_part: real ]
add = proc ... end add;
subtract = proc ... end subtract;
multiply = proc ... end multiply;
end complex_number;

Two types of routines are available in CLU. Procedures are essentially the same as in Algol, while iterators are a coroutine form which is used to generate, one after another, a sequence of data objects. Iterators solve the problem of processing each of the elements of a user defined type which is a conglomerate of simpler objects (such as a list, array, or set). Iterators are used only in for statements allowing the user to iterate through the components of an abstract type.

Another distinctive feature of CLU is its view of the program data space as a set of objects which are denoted by the variables of the program. A variable does not contain an object but rather a description of (think: pointer to) the object. Several variables may thus denote (point to) the same object. (In fact, assignment is simply a matter of copying pointers.)

In addition, objects are of two kinds: mutable and immutable. Immutable objects (like integers, characters, strings) cannot be changed, rather the result of an operation on an immutable object produces a new object different from the original object. Mutable objects (such as records and arrays), on the other hand, can be changed by modifying their internal components. This means that if two variables, x and y, denote (point to) the same array (as the result of x:=y for example), and the value of x[i] is changed, then the value of y[i] is also changed.

This sort of sharing occurs with parameters also, producing a parameter passing mechanism called call-by-sharing.

CLU is strongly typed; all variables must be declared and have a type. This type may be either a predefined type (integer, string, array, ...) or a user-defined type, defined by a cluster. In either case, only the operations which are declared for that type may be applied to the variable.

(The definition of type varies some from Pascal. One problem which has been mentioned for Pascal is that an array type is defined by both the component type and the array bounds. Thus an array of integers subscripted from 0 to 10 is a different type from an array of integers subscripted from 0 to 9. This makes it difficult to write procedures to operate on arrays of arbitrary lengths. In CLU, all arrays are dynamic and no subscript range is given to define the array type. The type of the array is defined by the component type only.)

The correct operation of a program requires detection and appropriate recovery from errors. Error handling is integrated into the CLU language by provision for exceptions. When an error or special condition is noticed, either by the system or by a user routine, a signal is raised. This interrupts the current execution and transfers control to an exception clause for the specific signal raised. Errors may be handled by the user, if desired.

A more complete introduction to CLU is provided by [4] or [5], but we hope that the basic nature of CLU is now apparent. Appendix I is a CLU program, illustrating most of the features of CLU. The program in Appendix I is a substitution program; it first reads a set of rules giving patterns (left hand sides) which are to be replaced by corresponding new patterns (right hand sides) throughout an input file to produce a new output file. The program reads the rules and constructs from them a deterministic finite state machine which is then used to determine when a pattern has been matched in the input and a replacement must occur. This program was selected to be written in CLU because of the variety of its data structures (rules, sets of rules, states, ...). It can be used to illustrate many of the problems of CLU.

3. Problems

In evaluating CLU, we must keep in mind the goals with which CLU was designed. This can best be defined by quoting from [5, page 565]:

"The motivation for the design of the CLU programming language was to provide programmers with a tool that would enhance their effectiveness in constructing programs of high quality -- programs that are reliable and reasonably easy to understand, modify, and maintain."

It is with respect to this criteria that we see, in CLU, four areas of problems: structural, semantic, functional, and syntactic. We discuss each of these areas and the problems in each below.

3.1. Structural

One of the basic tenets of modern programming methodology theory is structure. The structure of a program is crucial to its quality. Therefore, a modern programming language provides mechanisms for the structuring of a program. Most programming languages provide only procedures as a structuring mechanism; CLU provides both procedures and clusters. Thus it would seem that CLU has better facilities for structuring than most languages. But strangely, the designers of CLU have left out a major part of structuring: nesting. Nesting of procedures within procedures, clusters within clusters, or clusters within procedures is not allowed.

We view this lack of nesting as a serious problem in the construction of large programs. Nesting can directly reflect the structure of a top-down design. In addition, the names of nested procedures and clusters can not be used externally to the module in which they are nested. This structure-hiding property increases the modularity of the system by hiding the details of the construction of the containing module.

An analogy can be made here to the implementation hiding structure of clusters. A cluster is an explicit mechanism to hide the representation of the abstract data type of the cluster. Thus, the implementation of a data structure is hidden by the cluster. The implementation of a procedural abstraction cannot be so hidden, however. Procedures which are written only for use in a specific context, to assist some higher level procedure, cannot be hidden from either the view or the use of other procedures. Similarly, abstract data structures created solely for use within another abstract data structure cannot be hidden.

For example, in [5] the wordtree cluster was created for the sole purpose of implementing the wordbag cluster; wordtree objects and operations are not meant to be used except by the wordbag cluster. However, this cannot be enforced. An arbitrary routine at any place may suddenly create and operate on wordtree objects. A similar problem occurs in the substitution program of Appendix I. The following top-down structure should exist for this program:

     1. substitute
          1.1 get_files
          1.2 input_rules
               1.2.1 get_rule_part
          1.3 construct_state_machine
               1.3.1 first_state
               1.3.2 construct_all_successors
          1.4 sub
               1.4.1 buffer

This structure was defined by the top-down decomposition of the original problem, but it cannot be enforced. (In fact, this decomposition is violated by the use of first_state in define_successor, a violation of the intended structure of the program.)

Thus, CLU has created a global name space for clusters and procedures. While it can be argued that conventions can be used to prevent problems due to this global name space, the designers of CLU themselves provide the best answer to this argument [5, page 576]:

"Conventions are no substitute for enforced constraints."

Arguments can also be made that these problems can be resolved by proper use of the CLU library. We would respond, however, that this is an extra-linguistic mechanism which is necessary only because of the lack of an appropriate mechanism within the CLU language itself. Further, it implies that programs are not self-contained, leading to more complex maintenance problems.

The global name space of procedures and clusters is in sharp distinction to the strictly local nature of variable names. All variables are local only to the procedure in which they are declared and cannot be referenced from any other routine. This means that all communication between procedures must be by parameters. While this has the advantage of clearly indicating what communication is going on, it has the disadvantage of leading to very long parameter lists, and very long parameter lists are not conducive to programs which are easy to read and understand.

In addition, this constraint may result in parameters which are necessary, but hard to explain without describing the entire internal structure and algorithms of the procedure. Consider the find procedure of Appendix I. This procedure searches all existing states for a given state, and if found, returns the existing state; otherwise, it returns the new state, after adding it to the list of existing states. However, since the list of existing states can be neither global nor own (see below), it must be passed in to find from define_successor which never uses the list, but gets it from construct_all_successors, which also only passes it on from construct_state_machine, which creates it. Thus two procedures are involved in the communication of an object from construct_state_machine to find; two procedures whose abstract description (documentation) has no reason to mention the list of states (they never need to use it) and which should not be allowed to use it, but which cannot be prevented from using it if they wanted.

For higher-level procedures, implementing relatively complete functions on high-level objects, communication solely by parameters is quite reasonable. However, one of the purposes of many low-level procedures is specifically to manipulate structures in their global environment (symbol tables, output files, and so on). Also many procedures at the lowest level may be merely abbreviation mechanisms (macros) and a complete list of parameters may be longer than the procedure itself, destroying the usefulness of these procedures.

The problem of long parameter lists is compounded by the lack of own variables. In some languages without own variables (such as Pascal), global variables are used to simulate own variables, relying on conventions to prevent other routines from disturbing the "own" variables. However, this is not possible in CLU, meaning that such simulated own variable must be passed as parameters at all levels. Consider the case of a program which at the lowest level uses a pseudo-random number generator; the seed for the generator may need to be created by the top-most program and passed as a parameter through all programs which may someday need to use the pseudo-random number generator. Another program which wished to count the number of input characters and output characters would need to pass these counters as parameters to all programs which may do input or output (or may in the future need to do input or output).

The problem of own variables shows up in a different form in the first_state procedure in Appendix I. This procedure creates and returns the initial state of the state machine. This same value is needed later when a new state is created, to serve as the base value of the new state. In a Pascal version of this program, a global flag determined if the first state was already calculated, and if so, simply copied a global first state variable, or if not, created the global first state variable. Thus the first state computation was necessary only once; this value was simply reused on subsequent calls. To accomplish the same optimization in the CLU program was deemed to so complicated the program that it was not attempted.

The lack of own variables in clusters is particularly annoying. Consider the writing of a compiler. The symbol table is a global object; it must be used throughout the compiler. Thus either it will end up on nearly all parameter lists, or it should be global. The cluster concept provides the mechanism for restricting access to the symbol table to those operations allowed on it; it would be most logical to place the symbol table itself in the symbol table type cluster, as an own variable.

As another example of the need for own variables in clusters, consider the problem of instrumenting an existing program to determine the number of times that each of the operations on an abstract data type are invoked. The necessary changes are obvious and straightforward with global variables or cluster own variables, but would involve major modifications to a program without such a feature.

Notice that iterators, as coroutines, do have own variables of a form, producing a disparity between iterators and procedures. Similarly, files can be considered as either global variables or cluser own variables.

3.2. Semantics

The major semantic concept of CLU which differs from conventional languages is the concept of objects. This concept leads directly to the distinction between mutable and immutable objects and to the call-by-sharing parameter passing mechanism. These concepts interact to create serious problems for procedures and parameters.

CLU has serious potential problems with aliasing. Since variables simply denote objects, it is entirely possible for two variables to denote the same object. If the object is mutable and the two variables are passed into a procedure, then a change to one parameter will also change the other parameter. This aliasing can be extremely difficult to detect, and in fact would require a run-time detection method.

A further problem is caused by the mutable/immutable nature of objects. It is not possible to determine in a program whether an object of an abstract type is mutable or immutable since this depends upon the representation of the abstract type, which is (correctly) hidden from the users of the abstract type by the cluster mechanism. An abstract type can be changed from mutable to immutable (or vice versa) by merely changing its defining cluster and although no compile-time error will result, the program using that abstract type may then fail.

As a case in point, consider the abstract type state in Appendix I. A state in a deterministic state machine consists of the list of positions in the left hand sides being recognized plus a list of the next states which are entered under possible input characters. A state may also be a final state, in which case the name of the rule whose left-hand side has been recognized must be known. This was originally defined as a record,

     record [ position_list: array[position],
                 transition_list: array[transition],
                 final: oneof [no: null, yes: rule]

One of the operations on a state is to declare it to be final. As a new state is being constructed, we may discover that the end of a left-hand-side has been reached, and at that point, the existing state is declared to be final, by modifying the final field of the state record.

After the program was successfully working, it was noticed that a final state has no need for the lists of positions and transitions since the action associated with a final state is to perform the appropriate replacement and then start over at the start state of the state machine on the next input character. Thus the definition of a state was changed to:

     oneof [final: rule,
               notfinal: record [ position_list: array[position],
                                       transition_list: array[transition]

This changed the primary structure from a record to a oneof construct and was beautifully made by modifying only the state cluster. It appeared that the change in representation could be made simply by changing the state cluster.

At this point, however, the program ceased to work. The problem was that in the add_next_position procedure where it might be discovered that the successor state currently being constructed was final, the state was declared to be final by invoking the state operation make_final. Record objects are mutable and so this change was originally possible. Oneof objects are immutable, however, and so it is not possible to make an existing state final; a new state must be created. The change to the representation of the state type required that add_next_position be changed to allow for the possibility that the make_final operation return a new object. This meant that the result of add_next_position might be a new state, and so it had to return a possibly different result state, requiring the define_successor procedure to be modified.

One of the basic tenets of data abstraction is that a change of representation can be made by changing the defining cluster only. Quoting from [5, page 575],

"Once it has been determined that an abstraction must be reimplemented, CLU guarantees that the code of all modules using that abstraction will be unaffected by the change."

This is not true, however, as illustrated by the above example.

(It would have been possible to make the state object mutable by defining the above oneof definition as the sole field of a record. However, this sort of "trick" to get around the language is certainly not conducive to producing programs that are "easy to understand, modify and maintain.")

3.3. Functionality

CLU is a high-level general-purpose programing language with a particular emphasis on data abstraction and types. In this regard, it follows from the success of the strongly typed nature of Pascal. We therefore find it surprising that two important type classes in Pascal are missing: enumerated types and subranges.

An enumerated type is a user defined type where symbolic names define the possible values of the type. The classic example of an enumerated type is color = (red, green, blue, yellow). A new type is defined and variables of this type can take on the values: red, green, blue, or yellow. Another enumerated type might be suit = (clubs, hearts, diamonds, spades). The representation of these types is typically by integers (0,1,2,3 in the examples given) and so the symbolic names can be considered as symbolic constants. However it is important to note that the type is not integer even though the representation is integer. Thus one cannot compare clubs (represented by a 0) with red (also represented by 0) because they are of different types. This is an important difference from the mere use of symbolic constants.

CLU provides no reasonable mechanism for the definition of enumerated types. (The oneof construct could be twisted to serve this purpose, but we would suggest that this is not reasonable.) Thus, those situations which are most naturally represented with an enumerated type will force the CLU programmer to explicitly represent the type as an integer. With suitable conventions, this can be successful, but places an unnecessary additional burden on the programmer and leaves open the possibility of accidental or deliberate misuse. Again, "conventions are no substitute for enforced constraints".

Subranges provide another example of a missing feature of CLU. A subrange type a..b is restricted to those integers from a to b (inclusive). Our experience with Pascal indicates that subranges are of enormous value, serving two purposes: error checking and documentation. The first purpose, error checking, is of great importance in the initial programming and debugging phases of a project. Unfortunately, only a limited amount of this checking can be done at compile time; most of it must be done at execution time. This can be expensive, and so most Pascal compilers allow run-time range checks to be suppressed, if desired, by compiler directives.

However, we would suggest that the problem of the cost of run-time checking of subranges is a temporary problem -- the remains of excessive concern with small-scale cpu efficiency coupled with the temporary inability of most compilers to prove that assignments will be in the correct range, eliminating the need for run-time checks. The technology for compile-time verification of most range checks is being developed and will eventually eliminate the need for most run-time checks.

Even in the absence of compiler enforcement of subranges, the ability to specify, in a natural way, the intended range of the value of a variable is extremely useful as an aid to understanding a program. Knowing that a value should only be positive, or only in the range from 1 to 10 can ease reading and following the logic of a program.

A separate lack of function occurs in the use of iterators. Some abstract data types are composed of lists or sets of items; it is occasionally desirable to process each of these items one after another. The iteration mechanism supplies this function. For example, to process all rules on a rule list, as in first_state in Appendix I, we can say simply,

     for i: rule in list_of_rules$all_rules (rule_list}
      do   <process rule i>   end;

This feature of CLU is very well done, extremely useful and very pleasing; it is one of the major new features which should definitely be in any new language, certainly one with data abstractions.

However, even with this concept problems can arise. At one point in the construction of the program of Appendix I, it was felt necessary to compare the position lists of two states to see if they were equal. A program-external proof showed that the positions in a list would always be sorted. Hence, to compare the elements of each list it was simply necessary to simultaneously iterate through the two separate lists. An iterator for iterating through a single list existed. However, it is not possible in CLU to use an existing iterator to iterate through two sequences simultaneously, processing corresponding elements.

Another problem illustrating the limitations of the CLU form of the iterator construct can be easily shown. Assume that we are given an iterator and wish to find the maximum element of the generated sequence. Typical code to find the maximum would look like:

     current_max := <first element>;
     for item in <all elements except the first>
      do if item > current_max then current_max := item;
          end; end;

The problem here is that we want to separately generate the first element and the remaining elements. This is not possible with the CLU iterator. The result would either be code similar to:
     first := true;
     for item in <all elements>
      do if first
          then current_max := item; first := false;
          else if item > current_max
               then current_max := item;;
         end; end; end;

which has an extra test (if first) in every cycle of the loop (and therefore could double the execution time of the inner loop), or of the form,
     for item in <all elements>
      do current_max := item; break; end;

for item in <all elements> do if item > current_max then current_max :_ item; end; end;
which misuses the break statement (exit from the enclosing loop) to generate just the first element and needs an extra compare of this element with itself in the main loop. Neither of these two approaches to finding the maximum is as clear and understandable as would be desired. (Try to write a type-parameterized procedure to compute the maximum of two elements of arbitrary type.) The problem is in the constrained nature of the allowed uses of the iterator mechanism.

3.4. Syntax

Questions of syntax are often a matter of style and so criticisms of syntactic issues are difficult to justify. No one can deny however that some syntax is more or less pleasing and this effect is important to program writers and readers. Thus, however unjustifiably, we would make the following comments about syntactic issues in CLU.

Variables must all be declared in CLU, but no separate declaration part exists in a routine. Rather, a variable can be declared when it is first used in a program. While this freedom may be appreciated by the programmer who writes the original code, it is unlikely to be appreciated by those who must later read this code for purposes of debugging, correcting, or enhancing. First-use declaration requires a reader to scan the entire program looking for the first use of the variable to determine its type. While a cross reference list may assist with this problem, a much better solution is to require all declarations at the beginning of a routine, making them easier to find and explicitly indicating the amount of local storage which will be needed by this routine.

CLU has self-delimiting statements. A separate statement delimiter (such as a semicolon) is unnecessary. This is most obvious with structured statements such as the if, for, and while statements, each of which must be followed by an explicit end terminator. The traditional approach is to allow exactly one statement in these constructs; the end of the component statement (determined recursively if the component statement is itself a structured statement) is then the end of the structured statement.

What are the advantages and disadvantages of these two approaches? The motivation for the different approaches is the expected frequency of multiple component statements. The perceived problem with the traditional approach is that when several statements are to be the components of a structured statement, it is necessary to surround the several statements with a begin-end pair to form a single (compound) statement. This is not necessary with the self-delimiting approach. However, if the component is exactly one statement, the self-delimiting approach requires an additional end over the traditional approach.

The question then is whether components of structured statements are more typically single or multiple statements. We would contend that a component consisting of many statements is in fact an abstract operation which is best expressed as a separate procedure. Using this style, we find that most structured statements have only a single component statement -- a procedure call. Thus, the self-delimiting approach leads to long chains of " end; end; end; ..." which are annoyingly unnecessary. One can argue however that two and three statement components are not unusual. For this case, we would suggest that a statement connector, rather than a delimiter is needed. For example define "amp;" (read as "and") as a statement connector so that S1 amp; S2 amp; ... amp; Sn is semantically equivalent to begin S1; S2; ...; Sn end. Now consider the following equivalent statements.

                     while i < size(s)
                        do begin
                               i := i + 1

Self-delimiting while i < size(s) do process(s[i]); i := i + 1 end;
With amp; Connector while i < size(s) do process(s[i]) amp; i := i + 1;

(The current use of amp; for logical conjunction can be reasonably replaced by the use of the word "and" which is, in our opinion, more readable anyway.)

Another syntactic annoyance which can be quite frustrating when writing a program is the requirement that a complete compound name be given for operations on abstract data structures. The name is of the form: type_name$operation_name. This is necessary since many types may have operations of the same name (such as create). If the complete name was not given, ambiguity can arise over which of the potentially many operations is meant. As was pointed out in [5, page 567], "the ambiguity [of operation names] could in most cases be resolved by context" by examining the (declared) types of the parameters and selecting the correct unique operation. The (completely valid) objection to this is that "we have found in using CLU that the compound form enhances the readability of programs" [5, page 567]. This is most likely true. However, we would suggest that the appropriate solution is that the compiler accept incompletely defined names, resolve any ambiguities and replace the incomplete name with a completely defined name for future use. The program which is written need not be the program which is read, since the compiler can change many aspects of a program (such as indentation) to make the program easier to read.

Also, it can be argued that ambiguity (which can be resolved by the compiler) is not always harder to read. This is demonstrated by the ambiguity introduced by CLU's syntactic sugar in the form of array references, record references and infix operators. In all of these cases, these operations are converted into completely defined type$operation names by compiler look-up in the symbol table of the types of the parameters. Thus the mechanism to resolve incompletely specified names already (partially) exists and was introduced specifically to make programs easier to read.

CLU lacks an in-line comment capability such as Pascal's (* *) or PL/I's /* */. All comments are terminated by a newline which seems to us to mistakenly confuse the display functions of newline with the orthogonal need to terminate comments.

Some built-in names, specifically int (integer), bool (boolean), proc (procedure), and iter (iterator) have been shortened, to the detriment of readability; apparently for the sole purpose of saving some key strokes. This seems unnecessary and arbitrary. (Why shorten boolean and not continue?)

The use of '$' as the separator for compound names is understandable from the point of view that there were probably few unused special characters from which to choose. However, it is an unfortunate choice as it is too dense graphically to separate the two components. Psychological studies could probably prove that punctuation characters (. , : ; ...) are typically small with lots of space to make them more obvious as separators. We would suggest therefore that a less dense character be used, like the colon (type : operation), period (type . operation), vertical bar (type | operation) or such.

4. Conclusions

This paper is an attempt to evaluate the design of CLU as a practical programming tool. This is based upon the author's recent experiences with CLU as well as more conventional languages, particularly Pascal. Our intention is to bring to light some of the problems which, we believe, exist in the CLU language and may limit its widespread use. Our intentions are benign; we would hope that this discussion might lead to an improved design (CLU/2) or prevent similar problems with other new language designs which may be influenced by CLU. We also admit that the perceived problems may not prevent the widespread adoption of CLU since the adoption of a language seems to be largely determined by non-technical decisions (for example, consider PL/I).

Whatever the eventual fate of CLU, the language has already had a substantial impact on the field of language design by virtue of its many new features and concepts. We do not believe that the problems listed in this paper, (with the exception of the call-by-sharing semantics and problems with mutable and immutable objects) are inherent to the current CLU design. Thus, we feel that they could be easily corrected while retaining the basic flavor and integrity of the CLU language. CLU is a great step forward in language design, but like many projects, it may need minor mid-course corrections.

5. References

  1. A. L. Ambler, D. I. Good, J. C. Browne, W. F. Burger, R. M. Cohen, C. G. Hoch, and R. E. Wells, "Gypsy: A Language for Specification and Implementation of Verifiable Programs", Proceedings of an ACM Conference on Language Design for Reliable Software, Sigplan Notices, Volume 12, Number 3, (March 1977), 1-10.

  2. C. M. Geschke, J. H. Morris, and E. H. Satterthwaite, "Early Experience with Mesa", Communications of the ACM, Volume 20, Number 8, (August 1977), 540-552.

  3. B. W. Lampson, J. J. Horning, R. L. London, J. G. Mitchell, and G. J. Popek, "Report on the Programming Languge Euclid", Sigplan Notices, Volume 12, Number 2, (February 1977), 1-79.

  4. B. Liskov, E. Moss, C. Schaffert, B. Scheifler, A. Snyder, "CLU Reference Manual", Computation Structures Group Memo 161, Laboratory for Computer Science, Massachusetts Institute of Technology, (July 1978), 138 pages.

  5. B. Liskov, A. Snyder, R. Atkinson, C. Schaffert, "Abstraction Mechanisms in CLU", Communications of the ACM, Volume 20, Number 8, (August 1977), 564-575.

  6. N. Wirth, "Modula: A Language for Modular Multiprogramming", Software -- Practice and Experience, Volume 7, Number 1, (January 1977), 3-35.

  7. W. A. Wulf, R. L. London, and M. Shaw, "An Introduction to the Construction and Verification of Alphard Programs", IEEE Transactions on Software Engineering, Volume SE-2, Number 4, (December 1976), 253-265.

Appendix I

An Example CLU Program

The following CLU program illustrates some of the problems which we believe exist in the design of the CLU programming language.
%  The following clusters and procedures are a substitution
%  program for simultaneously substituting a set of "new" patterns
%  for their corresponding "old" patterns throughout an input
%  file, producing a new output file.  When started, the program
%  asks the user for the names of three files:
%  	The input file - an existing text file in which
%  	the substitutions should occur.
%  	The output file - the name of the file which is
%  	to be produced by the substitution program.
%  	The rule file - a file specifying the "old" and
%  	"new" patterns to be used in the substitution.
%  This program can be used as a simple (parameter-less) macro
%  processor, or as an aid in renaming variables in programs,
%  expanding abbreviations, correcting misspellings in text, or
%  even as a file copy routine (empty rule file).
%  All substitutions are done simultaneously -- thus it is
%  possible to switch all a and b by giving the rule to
%  substitute a for b and b for a.  Substitution is not
%  performed on the results of a substitution, but only on the
%  original input text.
%  The rule file specifies the substitutions to be performed.
%  Each separate substitution is on a separate line.  Each line
%  consists of a left-hand-side (the "old" part)  and a
%  right-hand-side (the "new" part) separated by the < character.
%  Thus the rule file to substitute a for b, b for a, new
%  for old and perhaps for prehaps would be,
%  	b>a
%  	a>b
%  	old>new
%  	prehaps>perhaps
%  Non-printing and control characters may be specified by giving
%  their (decimal) numeric character code, enclosed in brackets
%  []. Thus, to substitute two line-feeds (ASCII 10) for each line
%  feed after a carriage return (ASCII 13), we would write,
%  	[13][10]>[13][10][10]
%  Finally to allow > and [ to be used as old or new characters
%  freely, a quote character is defined to be either the single
%  quote ' or the double quote ".  Thus to substitute John for
%  "<name>", we could write a rule like,
%  	'"<name'>'">John
%  where we used the single quote as the quote character to quote
%  both the " and > characters.
%  Summary of special characters
%  >			separates left hand side from right hand side
%  ' and "		quote characters
%  [number]		decimal numeric specification of a character
%  newline		terminates right hand side
%  The basic algorithm of the program is to read the rules in and
%  from these rules construct a deterministic finite state machine
%  which is then used for the actual substitution.  For this, we
%  define the following abstract data types,
%  rule:	a substitution rule consisting of a left hand
%  		side (left: string) and a right hand side
%  		(right: string).
%  list_of_rules:  an array of rules
%  state:	a state for the finite state machine.  Most states
%  		are defined by the set of positions (position_list)
%  		in each of the rules that they represent.  The start
%  		state starts at the far left of each rule.  A list of
%  		transitions (transition_list) for each state indicates
%  		what the next characters may be and the successor states
%  		for each.  If the end of (the left hand side of) a rule
%  		is reached, the state is a final state, and only the
%  		rule (rule) need be remembered.
%  position:	a position in a rule, specified by a rule (r) and an
%  		index (n) into that rule.  0 <= n < size(r.left).
%  		The major item of interest is the character (c) at
%  		(following) that position.
%  transition:  specifies a character (c) and the resulting state
%  		(next_state).
%  list_of_states:  a list of all the states.  Used to identify that newly
%  		constructed states are the same as previous states.
%  		Two states are equal if they are both final, with the
%  		same rule, or if they represent the same set of positions
%  		in the rules.
%  For each abstract type, a cluster is defined with the
%  appropriate operations. The clusters are defined first.
% * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
%  rule:	a substitution rule consisting of a left hand
%  		side (left: string) and a right hand side
%  		(right: string).

rule = cluster is create, get_left, get_right, similar;
rep = record [left: string, right: string];
create = proc (l: string, r: string) returns (cvt); return ( rep${left: l, right: r} ); end create;

get_left = proc (r: cvt) returns (string); return (r.left); end get_left;

get_right = proc (r: cvt) returns (string); return (r.right); end get_right;

similar = proc (s: cvt, t: cvt) returns (bool); return (s.left = t.left); end similar; %used to determine if equal states

end rule;

% % * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * % % list_of_rules: an array of rules %
list_of_rules = cluster is create, define_rule, all_rules, fetch;
rep = array [rule];

create = proc () returns (cvt); return ( rep$create(1) ); end create;

define_rule = proc (list: cvt, l: string, r: string); rep$addh (list, rule$create(l,r)); end define_rule;

all_rules = iter (list: cvt) yields (rule); for i: rule in rep$elements(list) do yield (i) end; end all_rules;

fetch = proc (list: cvt, i: int) returns (rule); return (list[i]); end fetch;

end list_of_rules;
% % * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * % % % transition: specifies a character (c) and the resulting % state (next_state). %
transition = cluster is create, get_c, get_next_state, similar;

rep = record [ c: char, next_state: state];

create = proc (ch: char, s: state) returns (cvt); return ( rep${c: ch, next_state: s} ); end create;

get_c = proc (t: cvt) returns (char); return (t.c); end get_c;

get_next_state = proc (t: cvt) returns (state); return (t.next_state); end get_next_state;

similar = proc (s:cvt, t:cvt) returns (bool); return (rep$equal(s,t)); end similar;

end transition;

% % * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * % % position: a position in a rule, specified by a rule (r) and an % index (n) into that rule. 0 <= n < size(r.left). % The major item of interest is the character (c) at % (following) that position. %
position = cluster is create, get_rule, get_n, get_c, similar;

rep = record [r: rule, n: int];

create = proc (r: rule, n: int) returns (cvt); return ( rep${r: r, n: n} ); end create;

get_rule = proc (p: cvt) returns (rule); return (p.r); end get_rule;

get_n = proc (p: cvt) returns (int); % index into a rule starts at 0 return (p.n); end get_n;

get_c = proc (p: cvt) returns (char) signals (end_of_rule); if p.n >= string$size(p.r.left) then signal end_of_rule else return (p.r.left[p.n+1]); % index starts at 0, string starts at 1 end; end get_c;
similar = proc (s: cvt, t: cvt) returns (bool); % similar positions if same index in same rule return ((s.n=t.n) cand rule$similar(s.r, t.r)); end similar; %used to determine if equal states

end position;
% % * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * % % state: a state for the finite state machine. Most states % are defined by the set of positions (position_list) % in each of the rules that they represent. The start % state starts at the far left of each rule. A list of % transitions (transition_list) for each state indicates % what the next characters may be and the successor states % for each. If the end of (the left hand side of) a rule % is reached, the state is a final state, and only the % rule (rule) need be remembered. %

state = cluster is create, equal, add_to_positions, add_to_transitions, all_positions, all_transitions, make_final, final, get_rule;
nonfinaltype = record [ position_list: array[position], transition_list: array[transition] ];
rep = oneof [ notfinal: nonfinaltype, final: rule ];

create = proc () returns (cvt); return (rep$make_notfinal(nonfinaltype$ { position_list: array[position]$new(), transition_list: array[transition]$new() })); end create;

equal = proc (s: cvt, t: cvt) returns(bool) return (rep$similar(s, t)); end equal;

add_to_positions = proc (s: cvt, r: rule, n: int) signals (final_state); tagcase s tag final: signal final_state; tag notfinal(slist: nonfinaltype): array[position]$addh (slist.position_list, position$create(r, n)); end; end add_to_positions;

add_to_transitions = proc (s: cvt, c: char, k: state) signals (final_state); tagcase s tag final: signal final_state; tag notfinal(slist: nonfinaltype): array[transition]$addh (slist.transition_list, transition$create(c, k)); end; end add_to_transitions;

all_positions = iter (s: cvt) yields (position) signals (final_state); tagcase s tag final: signal final_state; tag notfinal(slist: nonfinaltype): for j: position in array[position]$elements (slist.position_list) do yield(j) end; end; end all_positions;

all_transitions = iter (s: cvt) yields (transition) signals (final_state); tagcase s tag final: signal final_state; tag notfinal(slist: nonfinaltype): for j: transition in array[transition]$elements(slist.transition_list) do yield(j) end; end; end all_transitions;

make_final = proc (s: state, r: rule) returns (cvt); % make the partially constructed state s final return ( rep$make_final(r) ); end make_final;

final = proc (s: cvt) returns (bool); return (rep$is_final(s)); end final;
get_rule = proc (s: cvt) returns (rule) signals (not_final); tagcase s tag final (r: rule): return (r); tag notfinal: signal not_final; end; end get_rule;
end state;

% % * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * % % % list_of_states: a list of al the states. Used to identify that newly % constructed states are the same as previous states. % Two states are equal if they are both final, with the % same rule, or if they represent the same set of positions % in the rules. %
list_of_states = cluster is create, add_to, all;

rep = array [state];
create = proc () returns (cvt); return (rep$new()); end create;

add_to = proc (list: cvt, s: state); rep$addh(list, s); end add_to;

all = iter (list: cvt) yields (state); i: int := 1; while i <= rep$high(list) do yield(list[i]); i := i + 1; end; end all;
end list_of_states;
% % * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * % % The substitute main program. % % get the three files to be used by this substitution then read % all of the rules, and construct the state machine for the % substitution. Finally, carry out the substitution requested. % The maximum_rule_length is used to buffer the output. %
substitute = proc (); in_file: stream; out_file: stream; rule_file: stream; root: state;
rule_list: list_of_rules; maximum_rule_length: int; in_file, out_file, rule_file := get_files();
rule_list, maximum_rule_length := input_rules(rule_file);
root := construct_state_machine (rule_list);
sub (in_file, out_file, root, maximum_rule_length);
end substitute;
% % * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * % % get_files -- a procedure to get the three files needed % % for each file, ask the user for its name, parse the name, % and open the file. % % note that this procedure could be changed to interact % with the user in a different way if desired. For example, % to ask for an input file name and construct the output and % rule file names from this.
get_files = proc () returns (stream, stream, stream)
inname: string; filename: file_name;
ttyi: stream := stream$primary_input(); ttyo: stream := stream$primary_output();
stream$puts(ttyo, "input file name: "); inname := stream$getl(ttyi); filename := file_name$parse(inname); in_file: stream := stream$open(filename, "read");
stream$puts(ttyo, "output file name: "); inname := stream$getl(ttyi); filename := file_name$parse(inname); out_file: stream := stream$open(filename, "write");
stream$puts(ttyo, "rule file name: "); inname := stream$getl(ttyi); filename := file_name$parse(inname); rule_file: stream := stream$open(filename, "read");
return (in_file, out_file, rule_file);
end get_files;
% % * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * % % input_rules -- the procedure to read the rules from the rule % file. % % read the input rules one at a time and put them in the list % of rules. The left hand part is delimited by a > while the % right hand part is delimited by a newline. The maximum rule % length is needed when the output file is buffered during % actual substitution. %
input_rules = proc (rule_file: stream) returns (list_of_rules, int);
l: string; r: string;
list: list_of_rules := list_of_rules$create(); maximum_rule_length: int := 0;
while ~stream$empty(rule_file) do l := get_rule_part(rule_file, '>'); r := get_rule_part(rule_file, 'n'); list_of_rules$define_rule(list, l, r); if string$size(l) > maximum_rule_length then maximum_rule_length := string$size(l) end; end;
return (list, maximum_rule_length);
end input_rules;
% % % get_rule_part -- get the next part up to the delimiter and return it % % the string desired is up to the delimiter. Here we check for the % delimiter, quotes, and the left bracket which indicates a numeric % representation of the character. The characters are read one at a % time and appended to the growing string in the variable part. %
get_rule_part = proc (rule_file: stream, delimiter: char) returns (string);
c: char; part: string := ""; c := stream$getc(rule_file);
while c ~= delimiter cand ~stream$empty(rule_file) do if c = '"' | c = ''' then c := stream$getc(rule_file) elseif c = '[' then c := character_read_as_number(rule_file); end;
part := string$append(part, c); c := stream$getc(rule_file); end; return (part);
end get_rule_part;
% % character_read_as_number -- read number and return as character % % the number is decimal but can be easily changed to octal or % hexadecimal if desired by changing this procedure. % % Note that we don't check that the number is properly terminated % by a left bracket. %

character_read_as_number = proc (rule_file: stream) returns (char);
c: char; n: int := 0;
c := stream$getc(rule_file); while '0' <= c cand c <= '9' do n := n * 10 + char$c2i(c) - char$c2i('0'); c := stream$getc(rule_file); end;
return (char$i2c(n));
end character_read_as_number;
% % end of input routines % % * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * % % construct the finite state machine from the rules % % the initial state is defined as the zero position in each rule % Then we construct the successors to this state, the successors % to these states, etc., until all successors to all states have % been generated. %
construct_state_machine = proc (rule_list: list_of_rules) returns (state); state_list: list_of_states := list_of_states$create();
root: state := first_state(rule_list); list_of_states$add_to (state_list, root);
for s: state in list_of_states$all(state_list) do construct_all_successors(s, rule_list, state_list) end;
return (root);
end construct_state_machine;
% % construct the first state (zero position of all rules) %
first_state = proc (rule_list: list_of_rules) returns (state);
first: state := state$create();
for i: rule in list_of_rules$all_rules (rule_list) do state$add_to_positions(first, i, 0); end;
return (first);
end first_state;
% % * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * % % We need to construct all successors to the state s. If the % state is final, there are not successors. Otherwise, for each % position in the state, we see if the successor state is % already defined. If it is not, we define it. % % The check for an existing successor is caused by a state with % (for example) 6 positions of (in order) a,b,c,a,c,d. We want % to define the successor of this state under a, then under b, % then under c. At this point, the successors under the next % two positions (a and c) have already been defined. Finally we % define the successor under d. %

construct_all_successors = proc (s: state, rule_list: list_of_rules, state_list: list_of_states);
if ~state$final(s) then for j: position in state$all_positions(s) do successor(s, j.c) except when no_successor: define_successor(s, j.c, rule_list, state_list); end; end; end;
end construct_all_successors;
% % find the successor of state s under the character c. % Check all transitions for a transition under c. If there % are no transitions, signal that there is no successor. %

successor = proc (s: state, c: char) returns (state) signals (no_successor);
for j: transition in state$all_transitions(s) do if j.c = c then return (j.next_state); end; end; signal no_successor;
end successor;
% % Define the successor of state s under c % % The successor is the set of positions which result from all % current positions with c as the next character. Once this % state is constructed, we need to check if this is a new % state or an existing one (by using find). Finally, we add % this new transition to the list of transitions for s. % % remember that at any time we may start over at the start % state, so that every (successor) state includes the start % state. %

define_successor = proc (s: state, c: char, rule_list: list_of_rules, state_list: list_of_states);
successor: state := first_state(rule_list);
for j: position in state$all_positions(s) do if j.c = c then successor := add_next_position(j, successor); end; end; k: state := find(successor, state_list);
state$add_to_transitions (s, c, k);
end define_successor;
% % we want the state to move over the next character in the % position p. This will either be a new position at p.n+1 % or if we go off the end of the rule, a final state (bingo!) %

add_next_position = proc (p: position, s: state) returns (state);
if p.n+1 < string$size(p.rule.left) then state$add_to_positions(s, p.rule, p.n+1) else s := state$make_final(s,p.rule); end; return (s);
end add_next_position;
% % search the list for this (possibly) new state. If found, return % the name of the existing state; if not, add this new state to % the list of states and return it. %

find = proc (s: state, state_list: list_of_states) returns (state);
for i: state in list_of_states$all(state_list) do if i=s then return (i) end; end;
list_of_states$add_to (state_list, s);
return (s);
end find;

% end of routines to construct the states % % % * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * % % The state machine is built and we are now ready to do the % actual substitution. The basic strategy is easy. We start % at the start state (root), and read characters one at a time, % following the state machine transitions as determined by the % input character. The characters are output as they are read. % If the new state is final, then we delete the last n characters % (where n is the length of the left hand side of the rule) and % output the right hand side of the rule. The deletion is % possible by creating a buffer of length equal to the length % of the longest left hand side. The buffer is an abstract type. % % % the buffer is defined by the output file, the length of the % buffer and an array of characters %
buffer = cluster is create, out, replace, flush;
rep = record [ f: stream, length: int, buff: array[char] ];
create = proc (f: stream, length: int) returns (cvt); return (rep${f: f, length: length, buff: array[char]$new()});
end create;

out = proc (b: cvt, c: char);
% a character is output. put it in the buffer % if the buffer is full, print the oldest
array[char]$addh(b.buff, c); if array[char]$size(b.buff) > b.length then stream$putc(b.f, array[char]$reml(b.buff)) end;
end out;

replace = proc (b: cvt, r: rule);
% replace the last n characters in the buffer % by deleting them and then adding the most % right hand side of the rule
array[char]$trim( b.buff, array[char]$low(b.buff), array[char]$size(b.buff)-string$size(r.left) );
for c: char in string$chars(r.right) do buffer$out(up(b), c) end;
end replace;

flush = proc (b: cvt);
% empty the buffer. the program is over.
for c: char in array[char]$elements(b.buff) do stream$putc(b.f, array[char]$reml(b.buff)) end; stream$close(b.f);
end flush;
end buffer;

% % * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * % % The substitution program. % % read each character and follow the state machine around. If % a final state is found, replace according to the applicable % rule. If no successor state exists, start over at the root. %
sub = proc ( in_file: stream, out_file: stream, root: state, maximum_rule_length: int);
current: state; c: char;
b: buffer := buffer$create(out_file, maximum_rule_length);
current := root;
while ~ stream$empty(in_file) do c := stream$getc(in_file); current := successor(current, c); except when no_successor: current := root end; buffer$out(b, c);
if state$final(current) then buffer$replace (b, current.rule); current := root; end;
end sub;
% % end of program %