Translating Java Programs into C++




James Peterson1,2,4, Glenn Downing2,3, and Ron Rockhold1,2




1IBM Austin Research Laboratory
Austin, Texas 78758


2Department of Computer Sciences
The University of Texas at Austin
Austin, Texas 78712


3Have OOPL, Will Travel
15707 Racine Cove
Austin, Texas 78717

4Netpliance
7600A Capital of Texas Highway
Austin, Texas 78731





July 1998


Abstract

Java has become a very popular system for creating object-oriented applications. C++ is the more traditional language for object-oriented programs. We have investigated the problem of converting programs from Java to C++. While the two languages are in many ways similar, there are some interesting semantic problems in translating from Java to C++.

A translator has been written to automatically convert Java programs into C++. Once this is done, a runtime system must be provided. The provision of a runtime system allows programs written in Java to be converted to C++ and executed.


Introduction

Java [1,2] has become very popular in a short period of time. With its combination of object-orientedness and portability across platforms, it would seem to provide a solution to the code reuse problem. Being able to effectively reuse code should result in higher programmer productivity.

The portability of Java across platforms is accomplished by compiling Java programs into a machine independent byte code which is stored in a class file. The resulting code is then interpreted by a Java Virtual Machine (JVM). While there are many applications that may execute satisfactorily in an interpreted mode, computationally intensive applications tend to suffer poor performance when they are interpreted. In this case, a Just-In-Time (JIT) compiler can be used to translate, on the fly, the Java byte code instructions into the native instruction set of the current processor. Compiling the Java program into native code should result in faster execution.

The quality of the native code, however, depends upon the algorithms, memory, and time spent first in the Java compiler, and then the algorithms, memory, and time spent in the JIT. While more sophisticated Java and JIT compilers may produce better native code, the total execution time (which includes both the time for running the JIT compiler as well as the time to run the native code) may increase or decrease relative to the time for the JVM to interpret the Java byte codes, or as compared to a different JIT compiler.

But while Java was designed to allow portability across many platforms, there are still significant platforms for which Java is not available: no JVM has been written nor is a JIT available. In addition, some systems are not designed to run a JVM, but only more conventional programs.

For a language such as Java, an alternative target language would seem to be C or C++ [3]. C is a mature language with compilers which can produce very good performance. However, given the object-oriented nature of Java, it would seem that C++ might be a closer match. Programs in Java are designed as object-oriented programs; C++ is also an object-oriented programming language.

Indeed, it has been suggested that Java is a simpler and cleaner version of C++, leaving out complex or troublesome portions of the language (such as pointers, multiple inheritance, and operator overloading) while providing useful features which are not in C++ (such as garbage collection). For example, C++ allows multiple inheritance, while Java restricts itself to single inheritance of classes. Seeing Java as a well-designed subset of C++, it would seem reasonable to believe that a Java program could be easily translated into C++, while a more complex and sophisticated C++ program would be difficult to translate directly into Java. It seems plausible that a Java program can be automatically converted into C++.

Several other projects have attempted to convert Java into C or C++ [4]. Toba [5] and j2c [6] convert Java into C. However both start, not with the Java source file, but with the Java class files. In some sense then, they are more like JIT compilers, converting a Java byte code sequence into a C program which is then compiled to produce native code. The C program is not a useful program for its own sake, but only as a means of creating native code. Similarly jcc [7] converts Java source code to C source code, but since C does not support objects directly, includes code to effectively interpret the Java program in C.

Our goal is to convert Java programs into equivalent C++ programs which closely match the original Java program at the source level: to translate (not to compile) a Java program into a semantically equivalent C++ program. Translation requires examining the various language features of Java and determining equivalent language features in C++.

Translating from Java to C++ would allow an application programmer to begin developing in Java and, if necessary, convert the Java program to C++ and continue development from then on in C++. Since the C++ program is a faithful replica of the Java program, we can continue with its implementation in C++ to determine how much the various features of Java (as represented in C++) cost in performance, space or complexity. For example, for a correctly executing program, we could disable the runtime check for array subscript errors or the check for dereferencing a NULL pointer (both required parts of Java), allowing us to quantify the cost of such a runtime check.

In the interests of full discussion, we would note that conversion of Java to C++ would not be interesting for Java applets which are remotely downloaded over a network. In fact, C++ systems generally do not support the dynamic loading capabilities of Java systems. In addition, translating programs which use reflection and introspection would probably not yield the expected result. Java programs which use reflection and introspection examine their own class files to determine their own capabilities; the translated C++ program would also examine the Java class files to determine the capability of the Java classes -- functionally equivalent, but not what you might expect. None of these are really aspects of the Java language, but rather of the Java runtime classes. We are interested only in the Java and C++ languages.

Language Translation

To translate from Java to C++, we must consider each syntactic and semantic part of the Java language and define the equivalent C++ construct. We begin by examining the Java data types, then the control structures, and finally, the class structure.

Data Types

The basic data types in Java are a set of builtin types (byte, short, int, long, float, double, boolean, char), objects, and arrays. The builtin types would seem to mostly translate directly to C++, but there are important differences. Java has well-defined statements of the number of bits and the representation of these types, while C++, following C, is more vague.

In Java, an int is exactly a 32-bit two's complement integer. In most C++ implementations, an int is 32 bits, but it could be only 16 bits or maybe 64. We know only that an int in C (and C++) is at least as large as a short (at least 16 bits) and no larger than a long (at least 32 bits). While most machines use a two's complement representation of integers, C or C++ could also use ones' complement.

Accordingly, it is not, in general, possible to translate a Java int into a C++ int. However while it may not be universally possible to translate directly from Java ints to C++ ints, it is practically possible for each platform to translate a Java int into some basic C++ type which is a guaranteed 32-bit integer for that platform. The specific mapping of a Java int may vary from platform to platform, but there should be some mapping for each platform. To allow the C++ definitions to be varied as necessary, we translate the Java builtin types to named types in C++, and assume that every C++ file will #include a general header file (JavaSupport.h) which will define them as necessary.

Java Type    C++ Type      Typical definition
byte         _j_byte       typedef signed char _j_byte;
short        _j_short      typedef short _j_short;
int          _j_int        typedef int _j_int;
long         _j_long       typedef long long _j_long;
float        _j_float      typedef float _j_float;
double       _j_double     typedef double _j_double;
boolean      _j_boolean    typedef bool _j_boolean;
char         _j_char       typedef unsigned short _j_char;

The inclusion of JavaSupport.h also gives us a place to define bool if it is not supported by the compiler, as well as other declarations that may be needed to support the translation from Java to C++.

The major extensions in Java are the long (64-bit) type and the char (16-bit Unicode), but these are easily supported on most C++ compilers. In the worst case, it might be necessary to provide a class, say for 64-bit arithmetic, and overload the C++ operators for that class for a system which did not provide direct support for the needed type.
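As a minimal sketch of what JavaSupport.h might contain, the listing below uses the fixed-width types of <cstdint> and compile-time checks. Both are assumptions: they require a C++11 compiler, which postdates the typedefs shown in the table, and the static_assert lines are an illustration rather than part of the translator's output.

```cpp
// JavaSupport.h (sketch): Java builtin types mapped to fixed-width C++ types.
// Assumes a C++11 compiler that provides <cstdint>.
#include <climits>
#include <cstdint>

typedef std::int8_t   _j_byte;     // 8-bit signed integer
typedef std::int16_t  _j_short;    // 16-bit signed integer
typedef std::int32_t  _j_int;      // 32-bit signed integer
typedef std::int64_t  _j_long;     // 64-bit signed integer
typedef float         _j_float;    // assumed to be IEEE 754 single precision
typedef double        _j_double;   // assumed to be IEEE 754 double precision
typedef bool          _j_boolean;
typedef std::uint16_t _j_char;     // 16-bit unsigned Unicode code unit

// Fail the build, rather than misbehave at runtime, on a platform where
// the mapping does not hold.
static_assert(CHAR_BIT == 8, "Java types assume 8-bit bytes");
static_assert(sizeof(_j_int) == 4, "Java int must be 32 bits");
static_assert(sizeof(_j_long) == 8, "Java long must be 64 bits");
static_assert(sizeof(_j_char) == 2, "Java char must be 16 bits");
static_assert(sizeof(_j_float) == 4, "Java float must be 32 bits");
static_assert(sizeof(_j_double) == 8, "Java double must be 64 bits");
```

On a platform without these fixed-width types, the same names would be defined by hand, as in the table.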

Objects

Object variables in Java are not really objects, but are always references to objects. Since all Java object variables are references, Java automatically dereferences as necessary. In C++, a variable can hold either an object or a pointer to an object, so dereferencing must be written explicitly. So Java object variables are translated to C++ pointers to objects:

Java                 C++
money m;             money *m;
loan mortgage;       loan *mortgage;
mortgage.m.value     mortgage->m->value
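The mapping can be sketched in compilable form as follows. The class names money and loan come from the table; the fields and the helper function valueOf are illustrative assumptions.

```cpp
// Sketch: Java object variables become C++ pointers, and "." becomes "->".
class money {
public:
    int value;
    money() : value(0) {}
};

class loan {
public:
    money *m;              // Java: money m;
    loan() : m(0) {}       // Java references default to null
};

// Java: return mortgage.m.value;
int valueOf(loan *mortgage) {
    return mortgage->m->value;
}
```

The sketch never frees the objects it allocates, which mirrors Java's reliance on garbage collection provided by the runtime.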

Names

Most objects in both Java and C++ are referenced through variables. The names of the variables have both type and scope. In general we want to keep the same names in C++ as are used in Java.

Class names in Java are identified by a complete package name. Alternatively, using the Java import statement, it is possible to refer to classes with partial names. For example, since an import java.lang.* is implicit to each Java program, the java.lang.Object and java.lang.String classes can be referred to in Java as simply Object and String.

C++ has no corresponding notion of a package, and so we always use the complete package name for all types in our translation from Java to C++. Since we cannot use the period to delimit parts of the name, for syntactic reasons, we transliterate the period separator to an underbar. Thus, a reference to Object in Java is first translated to the complete java.lang.Object and then to java_lang_Object in C++.

Although C++ has no concept of a package, it does have namespaces. We use a separate namespace for each class to provide appropriate isolation of names between the classes.

The use of namespaces also shows an interesting difference between the overloaded use of the period (or dot) as syntactic separator in Java and the differing delimiters used in C++. The Java name java.lang.System.out.println is composed of three parts: the class name (java.lang.System), a static field of that class (out), and a method of the object to which that field refers (println).

Although the C++ equivalent has the same parts, it uses different delimiters to separate them.

Java                            C++
java.lang.System.out.println    java_lang_System::out->println

Java has another troublesome situation with names. In Java, the compiler uses the syntactic type of the name to allow different syntactic tokens to use the same name. Thus, in Java, you can use the same name for a local variable, a method, a field, and a label. In C++ each of these names must be unique. From a programmer's point of view, the use of the same name for different purposes serves only to make programs more difficult to understand, edit, and maintain. However, since it is not necessary to write bad programs in Java, the ability to overload names is not commonly used. (It is deliberately used in some Java obfuscation programs, to deter reverse engineering.)

It would be easy to prepend a "type of name" identifier to a Java name when converting it to C++. All labels could be of the form "l_xxx", all local variables "v_xxx", all methods "m_xxx", and so on. This would assure that each use of a name is matched only with the correct declaration of the name, but it would make the resulting C++ program more obscure. Accordingly, we try to transfer Java names directly to C++ unless there is a clear conflict.

Notice that, in general, we cannot translate names only when a conflict arises unless we have global knowledge of all uses of a class. For example, assume a Java class A has a field X while an interface B has a method X. Neither of these has a name conflict, since each name is local to its declaring class. However, if a class C then extends A and implements B (which is to say, becomes a sub-class of both the original class and the interface), it inherits both the method X (from B) and the field X (from A). At least one of these will need to be renamed both for C and, to maintain the inheritance, for either A or B.

For complete correctness, it would be necessary to translate names into unique names before any name conflict arises. In our translator, we rename only when an actual conflict arises, and expect the programmer to rename things at the Java level, if need be, to avoid conflict. This allows names in Java and names in C++ to be the same.

We also must rename those Java names which are reserved words in C++, such as extern, auto, inline, union, and so on.

Classes and Interfaces

Java requires, and C++ supports, programs structured as a set of classes. Inheritance among classes defines a class hierarchy. In addition, Java also defines the concept of an interface which defines a set of constants and methods, but provides no implementation for the methods. Classes which inherit (implement) the interface are required to provide the actual code for the methods.

Java allows a class to inherit from only one superclass and implement any number of interfaces, while C++ allows multiple inheritance. Java interfaces can only inherit from other interfaces, with one exception. All classes, interfaces and arrays effectively inherit from java.lang.Object.

An initial possibility in translating Java to C++ is to completely ignore interfaces. Interfaces provide no code to translate or run, but only a compile time requirement that those classes which implement an interface must provide the bodies for the interface methods. If we assume the class is complete and correct in Java (and use the Java compiler to check or enforce this requirement), then the methods are provided by the classes; the interfaces merely create a requirement for the existence of the methods, and the requirement is met.

However, since we are converting from Java source to C++ source, we would like to maintain the equivalent structure and understanding in the C++ source. Discarding interfaces might create significant holes in the class structure in the C++ code.

In addition, there are two properties of interfaces which require that they be part of the C++ code:

Accordingly both classes and interfaces in Java are converted to classes in C++. Interfaces, since they have methods with no code, are abstract and contain pure virtual functions. A class in C++ is defined to inherit from all of the classes corresponding to its Java superclass and Java superinterfaces. If a class or interface in Java does not explicitly declare a superclass or a superinterface, it implicitly inherits from java.lang.Object. We make this inheritance explicit when we translate from Java to C++. So every class in C++ (except java_lang_Object of course) either directly or indirectly inherits from java_lang_Object.

Since C++ supports multiple inheritance, you might expect that inheritance from both classes and interfaces would be no problem in C++. However, there is one difficulty. Since interfaces can inherit from other interfaces, a class may end up inheriting from a given interface, either directly or indirectly, multiple times. In particular, every class inherits from java_lang_Object. Classes which inherit from both a superclass and some superinterface, inherit from java_lang_Object multiple times. This is resolved by making java_lang_Object and all interfaces virtual base classes [3, page 396].

The use of virtual base classes resolves the difficulties with multiple inheritance (especially of java_lang_Object). It also means that explicit casts may be necessary in C++ to give the compiler the ability to coerce types. Static casting can be used to cast up the class hierarchy; dynamic casts are needed to cast down the class hierarchy.
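The inheritance structure described above can be sketched as follows. The names A, B, C, x, asObject and asC are hypothetical, and the listing is an illustration under those assumptions rather than actual translator output.

```cpp
// Sketch: a Java class C that extends A and implements B.
class java_lang_Object {
public:
    virtual ~java_lang_Object() {}
};

// Java: interface B { int x(); }
class B : public virtual java_lang_Object {
public:
    virtual int x() = 0;          // pure virtual: interfaces carry no code
};

// Java: class A { int a_field = 7; }
class A : public virtual java_lang_Object {
public:
    int a_field;
    A() : a_field(7) {}
};

// Java: class C extends A implements B { public int x() { return a_field; } }
// java_lang_Object is reached through both A and B, but because it is a
// virtual base class there is only one java_lang_Object sub-object in a C.
class C : public A, public virtual B {
public:
    virtual int x() { return a_field; }
};

// Casting up the hierarchy: an implicit (or static) conversion suffices.
java_lang_Object *asObject(C *c) { return c; }

// Casting down from a virtual base class: a dynamic_cast is required, and
// it yields a null pointer if the object is not actually a C.
C *asC(java_lang_Object *o) { return dynamic_cast<C *>(o); }
```

Without the virtual keyword on the base classes, the conversion in asObject would be ambiguous, since a C would contain two java_lang_Object sub-objects.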

Arrays

Arrays in Java are a unique construct having the properties of both classes and built-in types. Arrays have a field (length) as well as memory for the array elements themselves. Arrays can be represented in C++ by a class containing two fields: the array length and a pointer to the dynamically allocated memory for the array.

Java arrays are thus translated into a class in C++. The array class inherits from java_lang_Object, although java_lang_Object need not be a virtual base class in this case, since arrays inherit from no other class. Since the type of the array element can vary, a template class [3, Chapter 13] is used in C++, creating a separate C++ array class for each Java array type. The template class defines constructors and overloads the subscript operator. To correctly reflect Java semantics, the subscript operator checks the subscript index against the length of the array for every reference to an element of the array, and throws an exception if the index is out of bounds.

Multiply-dimensioned arrays are implemented as an array of pointers to arrays (a dope vector). Since each array has its own length, we can support non-regular multiple dimensioned arrays, just as in Java.
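A minimal sketch of such a template class is shown below. The names _j_array and IndexOutOfBounds are hypothetical stand-ins (the latter for java.lang.ArrayIndexOutOfBoundsException), and the check for a negative array size that Java also performs is omitted.

```cpp
// Sketch: a Java array as a C++ template class with a checked subscript.
class java_lang_Object {
public:
    virtual ~java_lang_Object() {}
};

// Hypothetical stand-in for java.lang.ArrayIndexOutOfBoundsException.
class IndexOutOfBounds : public java_lang_Object {
public:
    int index;
    explicit IndexOutOfBounds(int i) : index(i) {}
};

template <class T>
class _j_array : public java_lang_Object {
public:
    const int length;                         // Java: array.length

    // new T[n]() value-initializes the elements, giving Java's zero/null
    // default for every element.
    explicit _j_array(int n) : length(n), data(new T[n]()) {}
    ~_j_array() { delete[] data; }

    // Every element access is checked against the array length.
    T &operator[](int i) {
        if (i < 0 || i >= length) throw new IndexOutOfBounds(i);
        return data[i];
    }

private:
    T *data;
    _j_array(const _j_array &);               // copying is not supported
    _j_array &operator=(const _j_array &);
};
```

A two-dimensional Java array such as int[][] then corresponds to _j_array<_j_array<int> *>, where each row is allocated separately and may have its own length.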

Java Strings

Literal Java strings create their own difficulties in C++. The overriding problem is that literal strings in Java are Unicode, while literal strings in C++ are ASCII. In converting from Java to C++, we simply wrap all literal strings with a function (or macro) call. The function is to take a literal ASCII string and return a pointer to a Java String, an instance of class java_lang_String. At the moment this is supported by a conversion function at runtime, although this has a minor negative performance impact.

A complete solution to this problem requires either C++ compiler support for literal Unicode strings, or moving the conversion from an ASCII string to a Java String to be part of program initialization rather than during actual execution. The use of a tagging wrapper function on all literal strings makes either of these possible.
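The runtime conversion might look roughly like the sketch below. The wrapper name _j_toString is taken from the translated code shown later in this paper; the java_lang_String class here is a deliberately minimal stand-in, and its members are assumptions.

```cpp
// Sketch: widening an ASCII literal into a string of 16-bit characters.
#include <vector>

typedef unsigned short _j_char;   // 16-bit Unicode code unit

// Minimal stand-in for the runtime's java_lang_String class.
class java_lang_String {
public:
    explicit java_lang_String(const std::vector<_j_char> &chars) : value(chars) {}
    int length() const { return static_cast<int>(value.size()); }
    _j_char charAt(int i) const { return value[i]; }
private:
    std::vector<_j_char> value;
};

// Wrapper placed around every string literal by the translator.
// Each 8-bit character is widened to one 16-bit character.
java_lang_String *_j_toString(const char *ascii) {
    std::vector<_j_char> wide;
    for (const char *p = ascii; *p != '\0'; ++p)
        wide.push_back(static_cast<unsigned char>(*p));
    return new java_lang_String(wide);
}
```

Because the conversion runs each time the wrapped expression is evaluated, a literal inside a loop is converted on every iteration, which is the performance cost noted above.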

A developer of a Java program which is translated to C++ must decide if the translated program is to continue to use Unicode or should be implemented with simple ASCII. This would allow the cost of supporting Unicode to be explicitly defined.

Another difficulty in translation is the use of the plus (+) operator for string concatenation. Java defines the actual meaning of a + b, where a and b are Strings, to be:

	new StringBuffer().append(a).append(b).toString()

Converting this to C++ then produces:

	(new java_lang_StringBuffer())->append(a)->append(b)->toString()

By converting the "syntactic sugar" of the plus operator for strings to its explicit implementation, in terms of appending to a StringBuffer, the normal conversion to C++ can handle this too.

A runtime switch in the translation program leaves the C++ code with a plus operator for strings. This would allow the C++ programmer to overload the plus operator on strings to be string concatenation. Java does not allow programmers to overload operators.

Operators and Statements

Most of the operators and statement types of Java are directly equivalent to the corresponding operators in C++, and so can be translated without difficulty. The similarity between Java and C++ in operators and statements is what led to the idea of automatic translation, at the source level, from Java to C++.

The >>> and instanceof operators require translation.

Java              C++
a instanceof t    (dynamic_cast<t *>(a) != NULL)
a >>> b           ((unsigned int)a) >> b
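The two translations can be written as small helper functions, as in the sketch below. The helper names _j_instanceof and _j_ushr and the classes shape and circle are hypothetical, and the fixed-width types assume a C++11 compiler with <cstdint>.

```cpp
// Sketch: helpers for Java's instanceof and >>> operators.
#include <cstddef>
#include <cstdint>

typedef std::int32_t _j_int;

class java_lang_Object {
public:
    virtual ~java_lang_Object() {}
};
class shape : public virtual java_lang_Object {};
class circle : public shape {};

// Java: a instanceof T
// dynamic_cast returns a null pointer when a is null or is not a T,
// which matches Java's rule that null is an instance of nothing.
template <class T>
bool _j_instanceof(java_lang_Object *a) {
    return dynamic_cast<T *>(a) != NULL;
}

// Java: a >>> b for int operands.
// The shift is done on the unsigned bit pattern so that zeros are shifted
// in, and the count is masked to five bits as Java specifies for int.
// Converting an out-of-range unsigned value back to _j_int (possible only
// when the masked count is zero) is assumed to wrap, as on two's
// complement machines.
_j_int _j_ushr(_j_int a, _j_int b) {
    return static_cast<_j_int>(static_cast<std::uint32_t>(a) >> (b & 31));
}
```

For example, _j_ushr(-1, 28) yields 15, where an ordinary signed shift would typically yield -1.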

The use of super to identify methods in Java is resolved by substituting the actual class name, resulting in a minor loss in generality.

Statements in Java also match statements in C++ pretty well. The labeled break and continue statements can be easily replaced by a goto to an appropriately placed (and uniquely named) label. An extra semicolon may be needed at the end of the body of a switch statement if there are trailing switch labels.
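As an illustration of the goto replacement, consider a hypothetical Java loop labeled outer that contains both "continue outer" and "break outer". The function name, label names and data below are assumptions made for the sketch.

```cpp
// Sketch: labeled continue and break translated to goto.
//
// Java source being translated:
//   outer:
//   for (int i = 0; i < rows; i++) {
//       for (int j = 0; j < 3; j++) {
//           if (m[i][j] < 0)  continue outer;   // skip the rest of this row
//           if (m[i][j] == 0) break outer;      // stop altogether
//           sum += m[i][j];
//       }
//   }
int sumUntilZero(const int m[][3], int rows) {
    int sum = 0;
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < 3; j++) {
            if (m[i][j] < 0)  goto outer_continue;
            if (m[i][j] == 0) goto outer_break;
            sum += m[i][j];
        }
    outer_continue: ;   // a label must precede a statement, hence the ";"
    }
outer_break:
    return sum;
}
```

The empty statement after outer_continue is needed for the same reason as the extra semicolon mentioned above: a C++ label cannot stand directly before a closing brace.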

A synchronized block is converted to an explicit call to _j_enter_monitor, followed by the block contents, and a call to _j_exit_monitor. Synchronized methods are handled the same way. Both monitor calls have to pass the object on which to synchronize as a parameter. Threads and synchronization have no impact on the syntax or semantics of the Java and C++ programs. The influence is in the runtime support; the programs simply call functions and procedures, which do the actual work. No compiler or language support is needed in C++ for threads or synchronization (only runtime support).
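The shape of the translated code can be sketched as follows. The exact signatures of _j_enter_monitor and _j_exit_monitor are not given here, so the sketch assumes that each takes a java_lang_Object pointer; it also assumes, purely for illustration, that every object carries a std::recursive_mutex (C++11) as its monitor. The counter class is hypothetical.

```cpp
// Sketch: a synchronized block as explicit monitor calls.
#include <mutex>

class java_lang_Object {
public:
    virtual ~java_lang_Object() {}
    std::recursive_mutex _j_monitor;   // assumed per-object monitor
};

// Assumed runtime entry points. A recursive mutex is used because a Java
// thread may re-enter a monitor that it already holds.
void _j_enter_monitor(java_lang_Object *o) { o->_j_monitor.lock(); }
void _j_exit_monitor(java_lang_Object *o)  { o->_j_monitor.unlock(); }

class counter : public java_lang_Object {
public:
    int count;
    counter() : count(0) {}

    // Java: void add(int n) { synchronized (this) { count += n; } }
    void add(int n) {
        _j_enter_monitor(this);
        {
            count += n;
        }
        _j_exit_monitor(this);
    }
};
```

An exception thrown inside the block would skip the exit call in this simple form, so a full translation would have to release the monitor on every exit path, much as for the finally clause discussed below.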

Exception handling in Java is much the same as exception handling in C++ with the addition of the finally clause. The finally clause in a try statement is always to be executed after all other processing, either normal execution or if an exception is caught. Since there is no equivalent in C++, it becomes necessary to determine how to construct the equivalent C++ code from the Java code.

One difficulty is that the finally clause must be executed before every exit from the try block. There are a surprising number of potential exits from a general try block: a normal exit, throwing an exception, a return statement, a break statement, a continue statement, or failing to catch an exception. The finally clause must be executed before each of these (if they are used in the try block or a catch block).

For example, the following Java code shows most of the ways to exit from a try block:

	try
	  {
	    switch (x)
	      {
	      case 0: continue next;
	      case 1: break out;
	      case 2: throw new TException();
	      case 3: return(0);
	      }
	  }
	catch (Error e)
	  {
	    throw(e);
	  }
	finally
	  {
	    System.out.println("finally");
	  }

We considered making the finally clause a procedure (or technically, a method). However, the execution context of the finally block (the set of variables it can access and the meaning of names) is the execution context of the try block. Moving the finally block to a procedure would require either that there be no use of the try block execution context, or that that context be passed to the method as a parameter.

Another alternative would have been to set a local variable followed by a goto statement to transfer control to the finally block, followed by a switch statement to select a goto to transfer control back to the right place after the finally block.

Our solution then is to simply duplicate the text of the finally block in all places where it must be executed. If the finally block is significant in size, this might suggest that the Java programmer might want to make it a method, in order to minimize the code repetition in C++.

The following C++ code shows how the finally block must be replicated for all the exits from a try block:

    try
    {
      switch(x)
        {
        case 0:
          /* finally clause, continue from try block */
          java_lang_System::out->println(_j_toString("finally"));
          goto next;

        case 1:
          /* finally clause, break from try block */
          java_lang_System::out->println(_j_toString("finally"));
          goto out;

        case 2:
          /* finally clause, throw exception from try block */
          java_lang_System::out->println(_j_toString("finally"));
          throw(new TException());

        case 3:
          /* finally clause, return from try block */
          java_lang_System::out->println(_j_toString("finally"));
          return(0);
        }
    }
    catch(java_lang_Error *e)
    {
      /* finally clause, caught exception from try block */
      java_lang_System::out->println(_j_toString("finally"));
      throw(e);
    }
    catch(...)
    {
      /* finally clause, uncaught exception from try block */
      java_lang_System::out->println(_j_toString("finally"));
      throw;
    }
    /* finally clause, normal exit from try block */
    java_lang_System::out->println(_j_toString("finally"));

Constructors

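A major difference between Java and C++ is the specification of constructors. Like C++, Java constructors are methods with the name of the class. However, the body of a Java constructor can take three different forms: it can begin with an implicit call to the superclass constructor, with an explicit call to a superclass constructor (super), or with an explicit call to another constructor of the same class (this).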

The first two types of constructors in Java can be easily converted to C++, but the last is more difficult.

Our solution is to separate the memory allocation aspect of construction from the initialization of those values. All constructors in Java are converted to initialization methods in C++. Thus a Java class foo, with constructors foo(int) and foo(float), is translated into a C++ class foo with methods foo_init(int) and foo_init(float). All Java constructors are invoked by the new operator. A call to new foo is translated to (1) a call to allocate memory (using an undefined, and hence default, constructor for foo), followed by (2) an explicit call to initialize this object with the current parameters.

Java          C++
new foo(p)    (new foo())->foo_init(p)

The init methods explicitly return this.

With this construction, each constructor for foo, including one that begins with a call to this, is converted into a simple init method in C++. Init methods, being just methods to the C++ compiler, can freely call other init methods, either of their superclass or of the same class.

In addition, since Java explicitly defines default values (zero) for all otherwise uninitialized fields, we add a method to initialize all fields to their default (zero) values.

Notice that the separation of constructors into memory allocation and initialization portions is possible only because Java classes contain only references to objects, not objects themselves. The inclusion of objects would have required the use of the member initialization list in C++ as part of the constructor.
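The scheme can be sketched for the class foo used above. The fields i and f, the constructor bodies, and the name _j_zero for the method that sets the default values are assumptions made for the illustration.

```cpp
// Sketch: Java constructors translated to init methods.
class foo {
public:
    int   i;
    float f;

    // The C++ constructor does no initialization of its own; it exists
    // only so that "new foo()" allocates the object.
    foo() {}

    // Hypothetical name for the method that gives every field its Java
    // default (zero) value.
    foo *_j_zero() { i = 0; f = 0.0f; return this; }

    // Java: foo(int v) { i = v; }
    foo *foo_init(int v) {
        _j_zero();
        i = v;
        return this;
    }

    // Java: foo(float v) { this((int) v); f = v; }
    // The call to this(...) becomes an ordinary call to another init method.
    foo *foo_init(float v) {
        foo_init(static_cast<int>(v));
        f = v;
        return this;
    }
};
```

Because each init method returns this, the translated expression (new foo())->foo_init(p) has the same value as the Java expression new foo(p).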

Class Initialization

A final problem involves the issue of when classes are initialized. Classes in Java may have initialized static fields, static blocks, and other code to initialize the class (as distinct from the initialization of the instances of the class by the constructors and instance initializers). C++ also allows class initialization, but the approaches of the two are quite different.

C++ gives no specification of the order in which multiple classes are initialized, only that all class initialization will be complete before the main program is started. This is useless in implementing class initialization for translated Java programs, since Java defines and expects a very specific initialization sequence.

Java initialization is defined for and by the order of loading. A class in Java is dynamically loaded when the first reference is made to it, and the class is initialized when it is loaded. Since in our implementation the equivalent C++ program will be statically bound and loaded, initialization must be programmed to happen at a time different from the loading and linking time.

Our solution is to create in C++ a specific class initialization method for each Java (and C++) class (or interface). All static field initialization and static blocks are collected in the order given, into the class initialization method. This collected code is called the core initialization code for this class. The problem then remains to get this code executed at the right time.

One approach would be to add a test in every method (actually only in every static method and constructor and before every reference to a static field from a class different than the current one), to see if the class has been initialized. If it has not been initialized, a call would be made to the class initialization method. While this would be a simple test of a class static boolean variable, we felt that the runtime performance impact of this constant testing would be a problem.

Depending upon the initialization only at load time produces very fragile Java programs. If the order of initialization of two unrelated classes is implicitly expected to be in a particular order, a small change in the program, an additional reference to a class or class static variable can change the order of loading and hence the initialization without warning. It would be dangerous to write Java programs whose class initialization depended on other than the explicit class dependencies.

In the case where the order of initialization is only determined by explicit dependencies, a different initialization strategy is possible. Basically, Java requires that:

Our preferred approach is to add a call to the class initialization method of the main class (the class with the main method which is the entry point for the Java program) before we call the main method. This one call must initialize not only the main class but any classes that it needs to initialize, in the correct order.

Looking at the class initialization method, we can examine the core initialization code. The superclass of this class and any classes used by the core initialization code must be initialized before we execute the core initialization code. Thus, if the core initialization code uses the java.lang.Math.sqrt function, we must call the class initialization method for java_lang_Math before we can execute the core initialization code for this class. A scan of the core initialization code is used to create a list of all classes used in that core code. Calls to the class initialization methods for these classes are inserted into the class initialization method before the core initialization code itself.

After the core, we also need to be sure that any other class used by any method of this class is initialized. That is, if some other method of this class will want to call a constructor or reference a static variable of java_lang_String, then the initialization of this class is not complete unless the java_lang_String is initialized. Accordingly, we form another list of all the classes used by any method in this class that were not initialized before the core initialization code. Calls to the class initialization methods for this second list of used classes are added to the class initialization method after the core class initialization code.

Thus, the class initialization method for the main class calls the class initialization methods of anything which it uses. These classes in turn call the class initialization methods of anything they use, flooding the entire C++ class structure until every reachable class has been initialized. If a class is not initialized, it will be because it is unreachable, and so does not need to be initialized.

A private class static boolean variable is used to prevent initializing a class twice.
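The resulting structure can be sketched with three classes. The method name _j_class_init, the classes app and helper, and the log used to record the order of initialization are all hypothetical; only the pattern (guard variable, calls before the core code, calls after the core code) follows the description above.

```cpp
// Sketch: class initialization methods with a guard variable.
#include <cmath>
#include <string>
#include <vector>

std::vector<std::string> _j_init_log;   // records the order, for illustration

class java_lang_Math {
public:
    static double sqrt(double x) { return std::sqrt(x); }
    static void _j_class_init() {
        if (initialized) return;        // never initialize a class twice
        initialized = true;
        _j_init_log.push_back("java_lang_Math");
    }
private:
    static bool initialized;
};
bool java_lang_Math::initialized = false;

class helper {
public:
    static int scale;
    static void _j_class_init() {
        if (initialized) return;
        initialized = true;
        scale = 10;                     // core code: static int scale = 10;
        _j_init_log.push_back("helper");
    }
private:
    static bool initialized;
};
int helper::scale = 0;
bool helper::initialized = false;

class app {
public:
    static int root;
    static void _j_class_init() {
        if (initialized) return;
        initialized = true;             // set first, so cycles terminate
        java_lang_Math::_j_class_init();    // used by the core code below
        // core code: static int root = (int) Math.sqrt(16);
        root = static_cast<int>(java_lang_Math::sqrt(16.0));
        _j_init_log.push_back("app");
        helper::_j_class_init();        // used only by other methods of app
    }
    static int run() { return root * helper::scale; }
private:
    static bool initialized;
};
int app::root = 0;
bool app::initialized = false;
```

A single call to app::_j_class_init() before the main method initializes java_lang_Math, then app, then helper; any later call returns immediately.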

The Translator

With this analysis of the Java programming language, we have written a program to automatically translate Java source application programs to C++ source programs. The translator assumes its input is a correctly compiling Java file, freeing us from detecting syntax errors and from enforcing compile time constraints on the input Java file.

The Java source for each Java class is converted into 3 C++ files: a *.G, *.H, and *.C file. The declaration of a class (or interface) foo is used to generate a foo.H header file, while the actual code for the methods is put in a foo.C code file. The foo.C code file #includes the foo.H header file to get the class declaration.

The foo.H file must #include the definition of the classes which are used in its definition. For example, if a class has a field which is a string, it might have a line like java_lang_String *name. This means that the foo.H file must also provide the compiler with a declaration for java_lang_String. However, circularity soon arises, since, for example, everything must include java_lang_Object, which includes java_lang_Class and java_lang_String which include java_lang_Object.

Our solution is to notice that, other than for superclasses (where we must #include the *.H file, but there is known to be no circularity), we do not need the complete definition of other classes that we use; we need only the fact that they are classes. Thus, we produce for every class foo, a file foo.G (a "pre-foo.H" file) which simply declares that foo is a class:

	foo.G:	class foo;

The class declaration header files (*.H) then #include the *.G files for all classes that they reference. This is sufficient to be able to declare the classes.

In addition, the foo.C code file #includes the class definition header files (the *.H files) for every class that it uses. In some sense, these #include lines replace the import statements in the Java file. Including the *.H declaration files gives a complete definition of each class to the compiler before it generates any actual code.

So the *.C files include the *.H files, which include the *.G files. This structure avoids circularity in the include files. (Once this structure was in place and working, we dropped the #include of the foo.G files and instead emit their contents directly, that is, the declaration of foo as a class: class foo;, since it takes fewer characters and is somewhat simpler to compile.)
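
The effect of this layering can be seen in a single-file sketch. The classes Owner and Node are hypothetical, and the comments mark which of the three generated files each part would belong to; the point is only that a forward declaration is enough to declare pointer fields, while the method bodies need the full definitions.

```cpp
#include <cassert>

// Contents of the hypothetical Node.G and Owner.G files.
class Node;
class Owner;

// Contents of a hypothetical Owner.H: only a pointer to Node is declared,
// so the forward declaration is sufficient here.
class Owner {
public:
  Node *first = nullptr;
  int count() const;
};

// Contents of a hypothetical Node.H: refers back to Owner by pointer.
class Node {
public:
  Owner *owner = nullptr;
  Node *next = nullptr;
};

// Contents of a hypothetical Owner.C: both complete definitions are now
// visible, so members can be accessed through the pointers.
int Owner::count() const {
  int n = 0;
  for (const Node *p = first; p != nullptr; p = p->next) ++n;
  return n;
}
```

Neither class declaration needs the complete definition of the other, so the two headers can be processed in either order.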

The translator program uses the LALR(1) grammar for Java of Chapter 19 of The Java Language Specification[2], suitably rewritten as an input to yacc. Combined with a lexical analyzer from lex, we are able to parse Java programs.

The output of the lex/yacc code is a complete parse tree, including, for each token, information about comments and line breaks, to allow the output C++ code to reflect the comments and line breaks of the input Java code. The parse tree is normalized to eliminate some Java language variations. For example, the archaic declaration type name[] is converted to the more general type[] name.

The parse tree is then scanned to produce a symbol table, and to identify and propagate type information, producing a type-annotated parse tree, which is used to generate the output C++ files.

While the parse tree is being scanned, we may encounter references to other classes. Since we need the symbol table information for these classes, we recursively parse and scan these classes as they are encountered. The symbol table information can be obtained either from Java source code or from the class file form of these classes.

When the output C++ is generated, we attempt to accommodate the peculiarities of various C++ compilers. For example, the xlC compiler (from IBM) cannot handle repeated declarations of the same index in for-loops, so we generate an extra set of braces to create a block for each for-loop.

	{
	  for (int i = 0; i < n; i++)
		process(i);
	}
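
A slightly larger sketch shows why the braces help. We assume here that the difficulty arises when a compiler keeps the loop variable in scope after the loop ends; the function sum_twice is hypothetical. With each loop in its own block, the two declarations of i can never collide, whichever scope rule the compiler applies.

```cpp
#include <cassert>

// Two consecutive loops reuse the index name i. Each loop sits in its own
// block, so each declaration of i is confined to that block.
int sum_twice(int n) {
  int total = 0;
  {
    for (int i = 0; i < n; i++)
      total += i;
  }
  {
    for (int i = 0; i < n; i++)  // second declaration of i, in a new block
      total += i;
  }
  return total;
}
```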

We are able to successfully translate Java source programs into C++ and compile these C++ programs without error with the IBM xlC compiler, the GNU g++ compiler, and Microsoft's Visual C++ compiler.

An Example

As an example, consider the simple class Point from page 120 of The Java Handbook [1]:

class Point {
  int x, y;
  Point(int x, int y) {
    this.x = x;
    this.y = y;
  }
  double distance(int x, int y) {
    int dx = this.x - x;
    int dy = this.y - y;
    return Math.sqrt(dx*dx + dy*dy);
  }
  double distance(Point p) {
    return distance(p.x, p.y);
  }
}


This is converted into the following Point.G, Point.H, and Point.C files:

Point.G:

class Point;

Point.H:

#ifndef __POINT_H_
#define __POINT_H_

/* **************************************************** */
/*							*/
/*	Header file Point.H created from Point.java	*/
/*							*/
/* **************************************************** */

#include "JavaSupport.h"

#include "Point.G"

/* Import #include's */
#include "java/lang/Object.G"
#include "java/lang/Math.G"
#include "java/lang/Class.G"
#include "java/lang/String.G"

/* Inheritance #include's */
#include "java/lang/Object.H"

class Point
: public virtual java_lang_Object
{
  public: static void _j_class_initialization(_j_int level);
  private: void _j_instance_initialization();
  public: static java_lang_Class *TYPE;
  _j_int x;
  _j_int y;
  Point *Point_init(_j_int x, _j_int y);
  virtual _j_double distance(_j_int x, _j_int y);
  virtual _j_double distance(Point *p);
};
#endif

Point.C:


/* **************************************************** */
/*							*/
/*	C++ file Point.C created from Point.java	*/
/*							*/
/* **************************************************** */

#include "JavaSupport.h"

#include "Point.H"

/* Import #include's */
#include "java/lang/Object.H"
#include "java/lang/Math.H"
#include "java/lang/Class.H"
#include "java/lang/String.H"

java_lang_Class *Point::TYPE;

void Point::_j_class_initialization(_j_int level)
{
  static _j_int _j_class_init_level = 0;
  while (_j_class_init_level < level)
  {
    _j_class_init_level += 1;

    /* Initialize super class */
    java_lang_Object::_j_class_initialization(_j_class_init_level);

    /* Initialize classes used in class initialization. */
    java_lang_Class::_j_class_initialization(_j_class_init_level);
    java_lang_String::_j_class_initialization(_j_class_init_level);

    if (_j_class_init_level == 1)
    {
      TYPE = java_lang_Class::forName(_j_toString("Point"));
    }
    else
    {
      /* Initialize classes used in this class. */
      java_lang_Math::_j_class_initialization(_j_class_init_level);
    }
  }
}

void Point::_j_instance_initialization()
{
  this->x = 0;
  this->y = 0;
}

Point *Point::Point_init(_j_int x, _j_int y)
{
  java_lang_Object::java_lang_Object_init();
  this->_j_instance_initialization();
  this->x = x;
  this->y = y;
  return(this);
}

_j_double Point::distance(_j_int x, _j_int y)
{
  _j_int dx = this->x - x;
  _j_int dy = this->y - y;
  return(java_lang_Math::sqrt((_j_double )(dx * dx + dy * dy)));
}

_j_double Point::distance(Point *p)
{
  return(this->distance(p->x, p->y));
}

Runtime Support

To test the translation, we chose the _210_si test that was considered during the preliminary definition of the SPECjvm98 benchmark [8]. This test is a "small interpreter" written in Java which reads a small program to compute an approximation of pi and a mortgage amortization table. The translation and compilation of _210_si went without error, converting 3 Java files (with 6 classes) into 6 *.G files, 6 *.H files, and 6 *.C files. We added a small main program to call the class initialization method, then to convert the normal argc, argv parameters to a Java array of strings, and finally to call the main method of class Si.
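
The argument conversion in such a main program might look like the following sketch. The function to_java_args is hypothetical, and std::vector<std::string> stands in for the translated Java array of strings. The one semantic detail to respect is that Java's argument array does not contain the program name, so argv[0] is skipped.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: copy the C-style arguments into a container that
// stands in for a Java String[]; argv[0] (the program name) is dropped.
std::vector<std::string> to_java_args(int argc, char *argv[]) {
  std::vector<std::string> args;
  for (int i = 1; i < argc; ++i)
    args.push_back(argv[i]);
  return args;
}
```

A C++ main would call the class initialization method first, then pass the result of to_java_args(argc, argv) to the translated main method.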

The _210_si test uses other SPECjvm98 support routines, increasing the test from 6 classes to 22 classes.

Once these classes were translated from Java to C++, and compiled without error, we linked them together to create an executable. This link step produced a list of 394 undefined symbols. These 394 undefined symbols were easily recognized as being from the standard Java library. References were made to 35 different classes, including java_lang_Object, java_lang_String, java_lang_Math, and so on.

These link errors made very clear that a Java application consists not only of the code for the application itself, but also assumes the existence of a large supporting library. It is reasonable, in fact, to argue about the best way to convert Java to C++ with respect to library support. For example, the Java statement:

  System.out.println("Principal: " + Principal_Amount.CommaFormat());

will be translated by our converter to

  java_lang_System::out->println(
	((new java_lang_StringBuffer())->java_lang_StringBuffer_init())
	->append(_j_toString("Principal: "))
	->append(this->Principal_Amount->CommaFormat())
	->toString());

while a C++ programmer would probably expect it to be translated to:

  cout << "Principal: " << Principal_Amount.CommaFormat() << endl;
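
The chained form produced by the converter relies on each call returning a pointer to the buffer itself. A miniature sketch of that idea follows; the class StringBuffer and the function describe are hypothetical simplifications, using std::string where the real library would use java_lang_String.

```cpp
#include <cassert>
#include <string>

// Hypothetical miniature of a translated string buffer: init and append
// return the object itself so that calls can be chained through pointers.
class StringBuffer {
  std::string text;
public:
  StringBuffer *init() { text.clear(); return this; }
  StringBuffer *append(const std::string &s) { text += s; return this; }
  std::string toString() const { return text; }
};

// Builds the same kind of message as the println example.
std::string describe(const std::string &amount) {
  StringBuffer buffer;  // on the stack here; the translator allocates with new
  return buffer.init()->append("Principal: ")->append(amount)->toString();
}
```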

Converting a Java program to C++ thus raises two possible approaches:

  1. The calls to the Java library can be converted to calls to an equivalent C++ library, or

  2. The calls to the Java library can be left as they are and a C++ implementation of the Java library can be provided.

The Java library consists of about 1500 different Java classes. To do (1) means we must find (or write) semantically equivalent standard C++ library classes and methods for all the methods of these classes. For the problem at hand, just running _210_si, we need to find the C++ library calls for the 394 undefined symbols that are used in _210_si.

On the other hand, most of the classes and methods for the Java library are written in Java, and with our Java to C++ translator, we can easily convert the Java library source code into C++ source code to provide the support needed for (2). Thus, we have decided to use approach (2) to provide the library support needed as a result of converting a Java program into a C++ program.

The original 22 SPECjvm98 classes for _210_si referenced 35 classes from the Java library. These 35 classes, in turn, referenced other Java Library classes, and so on, until closure is reached with 221 classes from the Java Library. (One can look at this increase from 22 classes to 22 classes plus 221 supporting library classes as either a triumph of code reuse or an example of code bloat.)

Native Methods

With the mechanical translation of the Java Library from Java to C++, we have almost everything that is needed to run our test case. We are still missing native methods. Java allows methods to be implemented by code written in a language other than Java (typically C). These are called native methods. Native methods are required to provide the interface from Java to the underlying operating system or platform. There are 409 native methods in 54 classes of the Java Library. Providing C++ code for these 409 native methods will provide a complete support library for converting Java programs to C++.

Writing these 409 native methods is essentially a port of Java to a C++ platform. The implementation of them will depend, in part, on the target C++ system. For example, the support of threads will depend upon the underlying operating system more than on the fact that it is written in C or C++.

For the specific _210_si test of SPECjvm98, only 110 native methods are referenced and only 42 of these are actually called during execution. Implementing these 42 native methods allows the C++ version of _210_si to be executed. As expected, the printed outputs of _210_si in Java and in C++ are identical, which is consistent with the two implementations being semantically equivalent.

Garbage Collection

One additional aspect of supporting the C++ translation of Java programs is the problem of garbage collection. Wilson [9] has written about the difficulties of providing garbage collection for C++. For this environment, however, all objects inherit from java_lang_Object. By adding a variant of the Boehm-Demers-Weiser conservative garbage collector [10] as a superclass of java_lang_Object, garbage collection can be added to the C++ code. This reduced the memory needed to run _210_si from about 110 M-bytes of malloced memory to about 500 K-bytes of garbage collected memory.
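
The superclass approach can be sketched as a base class that overloads operator new, so that every object derived from it is allocated through the collector. Everything named here is hypothetical: collector_alloc merely counts requests and calls malloc where a real collector would hand out managed memory, and Object and Point stand in for the translated classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical stand-in for a collector's allocation entry point. A real
// collector would also track and later reclaim the memory it hands out.
static std::size_t collected_heap_requests = 0;

inline void *collector_alloc(std::size_t bytes) {
  ++collected_heap_requests;
  void *p = std::malloc(bytes);
  if (p == nullptr) throw std::bad_alloc();
  return p;
}

// Base class that routes every `new` of a derived object to the collector.
class Collected {
public:
  static void *operator new(std::size_t bytes) { return collector_alloc(bytes); }
  static void operator delete(void *p) { std::free(p); }
  virtual ~Collected() {}
};

// Hypothetical root of the translated hierarchy and one derived class.
class Object : public Collected {};
class Point : public Object {
public:
  int x = 0;
  int y = 0;
};
```

Because the overloaded operator is inherited, the translated code can keep writing new Point without mentioning the collector.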

Runtime Comparisons

After converting the _210_si SPECjvm98 program from Java to C++, we converted and compared a larger subset of the preliminary SPECjvm98 programs. The following times are the result of running Java 1.1.0 on a RISC System/6000, model 560F running AIX 4.1.2 with version 3.6 of xlC as the C++ compiler. The specific times are not of great interest, since they will vary from system to system with processors, compilers, compile options, and runtime systems.

Running the _210_si test in Java took 101 seconds under Java 1.1.0. Moving to Java 1.1.4 with a JIT reduced the time to 68 seconds. Converting the test to C++ and compiling -g reduced the time to 57 seconds. Compiling -O reduced the time to 41 seconds.

Examining the memory usage of the test, we find that 110,231,136 bytes are being dynamically allocated. In particular, 190,339 of these allocations are one particular array of 512 bytes which is allocated inside a parsing method of the test, which is called 190,339 times. This particular array is allocated every time the parsing method is called, used strictly locally as a temporary buffer. The reference to it is lost when the parsing method exits.

Java has no local objects or arrays -- all memory is allocated from the heap and garbage collected. C++, on the other hand, supports both heap-based memory management and automatic stack-based memory management. If we change the C++ version to allocate this array, whose use is strictly local to this method, on the stack as a local automatic variable, we reduce the total heap memory used to approximately 8 M-bytes (instead of 110 M-bytes), and execution time (with -O) drops to 24 seconds. This Java program spends a lot of its time allocating memory which is then immediately thrown away and later collected for reuse.
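
The difference between the two versions can be sketched as follows. The functions parse_heap and parse_stack are hypothetical and do only trivial work; they illustrate a 512-byte temporary that is either allocated from the heap on every call, as in the direct translation, or declared as an automatic variable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Direct translation: the temporary buffer comes from the heap on every call.
std::size_t parse_heap(const char *line) {
  char *buffer = new char[512];
  std::strncpy(buffer, line, 511);
  buffer[511] = '\0';
  std::size_t n = std::strlen(buffer);
  delete[] buffer;  // without a collector the buffer must be freed by hand
  return n;
}

// Hand-modified version: the buffer is an automatic variable on the stack
// and disappears when the function returns.
std::size_t parse_stack(const char *line) {
  char buffer[512];
  std::strncpy(buffer, line, 511);
  buffer[511] = '\0';
  return std::strlen(buffer);
}
```

Both functions compute the same result; only the second avoids a heap allocation per call.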

Use of advanced data flow techniques might be able to detect that Java objects are effectively local and put them on the stack rather than on the heap.

Another performance problem is caused by the use of virtual base classes for java_lang_Object. Changing the translator to not use virtual base classes requires that interfaces not inherit from java_lang_Object. This in turn requires that objects declared to be of an interface type be explicitly cast to a java_lang_Object when we execute methods from the java_lang_Object class. With this change, java_lang_Object is no longer a virtual base class. Since interfaces may have multiple inheritance, it is still necessary, at times, to use virtual base classes when the same interface is inherited in multiple ways. Removing the virtual base class for java_lang_Object provides some speed-up. In the best case, the C++ code dropped from 223 seconds to 86 seconds.
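
One way to picture the revised layout is sketched below. The classes Object, Runnable, and Task are hypothetical, and the use of dynamic_cast for the explicit cast is our assumption for this illustration. The interface has no Object base, so a value typed as the interface must be cast across the hierarchy before an Object method can be called on it.

```cpp
#include <cassert>

// Hypothetical root class with one method inherited by every class.
class Object {
public:
  virtual ~Object() {}
  virtual int hashCode() { return 42; }
};

// Hypothetical interface: it does not inherit from Object.
class Runnable {
public:
  virtual ~Runnable() {}
  virtual void run() = 0;
};

// A class inherits Object once, non-virtually, plus its interfaces.
class Task : public Object, public Runnable {
public:
  int runs = 0;
  void run() override { ++runs; }
};

// Calling an Object method through an interface-typed pointer needs an
// explicit cast, since Runnable itself knows nothing about Object.
int hash_of(Runnable *r) {
  Object *o = dynamic_cast<Object *>(r);
  return o != nullptr ? o->hashCode() : 0;
}
```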

Looking at another test case, a profile of the C++ code shows that array references use the bulk of the time. Java arrays, when translated to C++, are implemented as a template class and the array subscript operator is not being in-lined. Thus, each array reference in C++ is a method call.
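
A template class of this kind might look like the following sketch. The name JavaArray is hypothetical, std::vector provides the storage for brevity, and std::out_of_range stands in for the translated ArrayIndexOutOfBoundsException. Every subscript goes through operator[], which is the call that needs to be in-lined for good performance.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the translated Java array type.
template <typename T>
class JavaArray {
  std::vector<T> elements;
public:
  // Elements are value-initialized, matching Java's default zero values.
  explicit JavaArray(std::size_t n) : elements(n) {}

  int length() const { return static_cast<int>(elements.size()); }

  // Bounds-checked subscript: each array reference is a call to this method.
  T &operator[](int index) {
    if (index < 0 || index >= length())
      throw std::out_of_range("ArrayIndexOutOfBoundsException");
    return elements[static_cast<std::size_t>(index)];
  }
};
```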

Replacing the use of the template array subscript operator with special macros to in-line the array references in one file drops the time from 334 seconds to 118 seconds. Replacing the array macros with real C++ arrays (with no array bounds checking) reduces the time to 42 seconds, compared to the Java time of 54 seconds.

Another test case has a different problem. Profiling shows that its most time-consuming activity is the dynamic cast. There are some 12,797,913 dynamic casts in total. One of these, inside a loop in an important method, is unnecessary -- it casts a return value of type Constraint to the same type.

public Constraint constraintAt(int index) ...

public void execute()
{
  for (int i = 0; i < size(); ++i)
  {
    Constraint c = (Constraint) constraintAt(i);
    c.execute();
  }
}

This is translated to C++ as

void spec_benchmarks__214_deltablue_Plan::execute()
{
  for (_j_int i = 0; i < this->size(); i++)
  {
    spec_benchmarks__214_deltablue_Constraint *c =
      _j_dynamic_cast(spec_benchmarks__214_deltablue_Constraint *,
		      (this->constraintAt(i)));
    c->execute();
  }
}

Removing this dynamic cast drops the total number of dynamic casts to 7,390,713, and drops the execution time of the C++ version from 42 seconds to 26 seconds. This compares with the Java time of 21 seconds. The Java time does not drop with the removal of the unneeded cast; the Java compiler has already removed the unneeded cast from the Java execution.

All of these changes suggest the difficulty of defining an equivalent C++ and Java program. Starting from a Java program, it is necessary to support the complete semantics of Java in a translated C++ program. Thus, converting an array reference in Java to an array reference in C++ requires that a bounds check be made and an exception thrown if the array reference is out of bounds. Since the array index may be either self-modifying (such as i++), or dynamically evaluated (such as a call to a method), the C++ program must capture the value of the index and evaluate it only once. This can be done by a method or operator call, which passes the value as a parameter, or, in a macro, by storing the value in a temporary location. However, temporary values cannot be created on-the-fly within C++ macros, nor can exceptions be thrown within an expression (since throwing an exception is a statement and not an expression, thus you have to make a procedure call to throw an exception from within an expression). These constraints tend to prevent a C++ compiler from in-lining the array reference.
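
The single-evaluation requirement can be illustrated with a small sketch. The helper checked_at and the macro NAIVE_AT are hypothetical; the macro is deliberately incomplete (it checks only the upper bound and falls back to element 0) and exists only to show how a textually repeated index such as i++ gets evaluated twice, while a by-value parameter captures it once.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical helper: the index is passed by value, so an argument such as
// i++ is evaluated exactly once, before the check and the access.
inline int &checked_at(int *data, int length, int index) {
  if (index < 0 || index >= length)
    throw std::out_of_range("ArrayIndexOutOfBoundsException");
  return data[index];
}

// Naive macro: the index text appears twice, so a self-modifying index is
// evaluated once for the test and again for the access.
#define NAIVE_AT(data, length, index) \
  (data)[((index) < (length)) ? (index) : 0]
```

With i starting at 0, checked_at(d, 3, i++) reads d[0] and leaves i at 1, whereas NAIVE_AT(d, 3, i++) reads d[1] and leaves i at 2.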

A further problem in optimizing array references is the need, in some cases, to dynamically check the type of an assigned value. Java programs commonly pass an array to a method which types it only as an array of Object. Assignment to this array requires a run-time check that the type of the assigned value is compatible with the actual run-time type of the array parameter.
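
Such a store check might be sketched as follows. All names are hypothetical: ObjectArray records its run-time element type as a test function, is_instance implements that test with dynamic_cast, and std::runtime_error stands in for the translated ArrayStoreException. A null reference may be stored in any array, so it bypasses the type test.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

class Object { public: virtual ~Object() {} };
class Shape : public Object {};
class Label : public Object {};

// Run-time type test used as the array's element-type tag.
template <typename T>
bool is_instance(Object *value) {
  return dynamic_cast<T *>(value) != nullptr;
}

// Hypothetical array of references that remembers its actual element type,
// even when a method sees it only as an array of Object.
class ObjectArray {
  std::vector<Object *> slots;
  bool (*accepts)(Object *);
public:
  ObjectArray(std::size_t n, bool (*test)(Object *))
    : slots(n), accepts(test) {}

  void store(std::size_t index, Object *value) {
    if (index >= slots.size())
      throw std::out_of_range("ArrayIndexOutOfBoundsException");
    if (value != nullptr && !accepts(value))
      throw std::runtime_error("ArrayStoreException");
    slots[index] = value;
  }

  Object *load(std::size_t index) const { return slots.at(index); }
};
```

The extra test on every store is work that a plain C++ array of pointers would never perform, which is part of why translated array code is hard to optimize.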

In addition, in many cases, a global review of the use of the array can show that the array bounds checking is unnecessary -- the array is allocated with a size n, and all indexing is clearly in the range 0 to n-1. This level of optimization is possible for a Java compiler or JIT compiler which knows that the purpose is to safely reference an array element, but is unlikely for the equivalent C++ code where the various checks and exceptions have had to be made explicit, and the "intent" of the code is not as obvious to the compiler.

It would be possible to do these optimizations when converting from Java to C++, but this raises the issue of translating from Java to C++ versus compiling from Java into C++. We have chosen to constrain our translation to a relatively simple source level translation from Java to C++. This leaves room to create a more optimized version of the C++ program by data and control flow analysis and other compilation optimizations, the same sorts of optimizations that are needed from a JIT compiler.

Summary

By analyzing the Java programming language, we have been able to develop C++ programming constructs which are equivalent to those in Java. From this research, we then developed a translator which automatically converts Java source files into equivalent C++ source files. This shows that Java programs can be converted into equivalent C++ programs; Java is programmatically a subset of C++.

By providing suitable runtime support, we can then run both Java programs and equivalent C++ versions of these same programs, allowing us to investigate the performance differences between the two implementations of the same programs, with the aim of improving Java performance to the level of C++.

Our early results suggest that some aspects of Java programs (such as array references and casting) can be very expensive in performance, suggesting that these are places to apply optimizations to improve performance.

References

  1. Patrick Naughton, The Java Handbook, Osborne McGraw-Hill, 1996, 424 pages.
  2. James Gosling, Bill Joy, and Guy Steele, The Java Language Specification, Addison-Wesley, 1996, 825 pages.
  3. Bjarne Stroustrup, The C++ Programming Language, Third Edition, Addison-Wesley, 1997, 910 pages.
  4. The Java Oasis Team, Java to C Compilers, http://www.oasis.leo.org/java/development/java-to-c/00-index.html, June, 1998.
  5. The Sumatra Project, Toba: A Java-to-C Translator, http://www.cs.arizona.edu/sumatra/toba/, April, 1998.
  6. Yukio Andoh, j2c/CafeBabe java .class to C translator, http://www.webcity.co.jp/info/andoh/java/j2c.html, Oct, 1996.
  7. Nik Shaylor, JCC - A Java to C converter, http://www.geocities.com/CapeCanaveral/Hangar/4040/jcc.html, May, 1997.
  8. SPEC JVM98 Benchmarks, http://www.spec.org/osg/jvm98/, August, 1998.
  9. Paul R. Wilson, Uniprocessor Garbage Collection Techniques, ftp://ftp.cs.utexas.edu/pub/garbage/bigsurv.ps.
  10. Hans-J. Boehm, A garbage collector for C and C++, http://reality.sgi.com/employees/boehm_mti/gc.html, April, 1998.