A Discussion of Inverted Page Tables

James L. Peterson

10 Feb 1990

Almost all modern computer systems provide virtual memory. Virtual memory is typically provided by paging. The virtual address space of a process is broken up into fixed size pages. A page is typically 1K, 2K, or 4K bytes or words. The pages are then allocated to frames in the physical memory of the computer. Each frame holds one page.

Page Tables

A page table (or page map) is used to translate each virtual address into a physical address. A page table is an array of page frame numbers. The virtual address is separated into two parts: the page number and an offset into that page. The page number is used to index into the page table. The frame number is recombined with the offset to create the physical address.

	          virtual address
         -------------------------------------
         |    page number     |    offset    |
         -------------------------------------
              |                            \
              |                             \
              |				     \
              |				      \
              |      ------------	       \
	      |	     |          |	        \
              |      ------------		 \
	      |	         ...			  \
              |      ------------     --------------------------
	      ------>| frame    |---->|   frame   |   offset   |
                     ------------     --------------------------
		     |          |        physical address
                     ------------
		         ...
                     ------------
		     |          |
                     ------------

		     Page Table

The size of the page number and size of the frame number need not be the same size. Originally, the page number was much smaller than the frame number, allowing a system to have more physical memory than any given process could address. For example, a 16-bit virtual address (with an 8-bit page number and an 8-bit offset) might be mapped to a 18-bit physical address (with a 10-bit frame number and the 8-bit offset), allowing 4 times as much physical memory as could be addressed by a given process. The additional memory was used to allow more processes to be resident in memory at one time, or special system calls allowed the user program to change the mapping, allowing the user process to have larger memory spaces, not all of which was visible at a time.

With 32-bit virtual addresses (allowing 4 gigabytes of address space), however, it became more common for the page and frame number to be the same size. In this case, a 32-bit virtual address was mapped to a 32-bit physical address. It is generally not considered necessary for a machine to have more than a 32-bit virtual or physical address space, although this assumption is beginning to be questioned.

The RT/RIOS architecture varies from the norm in this respect. An effective address of 32 bits is mapped (via the segment registers) into a 52-bit virtual address. The 52-bit virtual address is then mapped back to a 32-bit physical address.

Size of Page Tables

With the larger 32-bit (or 52-bit) virtual addresses, the size of the page table becomes a concern. If, for example, a 32-bit virtual address is broken into a 20-bit page number and a 12-bit offset, the page table would need to have 1,000,000 entries. This is too large to be held in registers, and so it would be held in memory. A register (the page table base register) would point to the page table in memory. Techniques are then used to limit the size of the page table. For example, in addition to a base register, a page table limit register allows the page table to be limited to a much smaller size. For example, a program that addresses only pages 0 to 40 would need only a 41-entry page table. On the other hand, a program that addressed pages 0 to 40 and pages 4000 to 4002 would need a 4003-entry page table, even though it was using only 44 of those pages. And a program which put program code at low memory addresses and data structures, such as a stack, at high addresses would need a complete page table.

Segmentation, and support for a segmented address space in the system and compiler routines, can be used to further limit the size of the page table. For example, with a 32-bit virtual address, the top 4 bits might be a segment number, the next 16 bits a page number within the segment, and the last 12 bits an offset in the page. The segment number defines a page table base register and limit register for that segment. Memory is allocated in contiguous sections within segments. Thus, for example, user code might be allocated in low addresses, starting at location 0 and continuing up as far as necessary. Local variables are allocated starting at the next segment boundary. Heap space starts at the next segment boundary. The user stack is placed in another segment. These allocations of memory are enforced by the compiler and linker/loader. A small program then would need 4 segments, each with its own small page table. A larger program might need more segments, but the memory in each segment would be allocated contiguously, so there would be no holes in the page tables, and the space needed for the page tables is strictly related to the amount of virtual address space used.

Inverted Page Tables

Still, since some programs may have very large virtual address spaces, resulting in (relatively) large page tables. A different approach, currently adopted on the RT/RIOS, is an inverted page table. The basic function that we want to accomplish for paging is to map a virtual address into a physical address, by mapping a page number into a frame number. The page table stores the frame number into an array indexed by page number: given the page number, the frame number is achieved by a simple indexed load. An inverted page table stores the page number in an array indexed by frame number. Given the page number, you search for the entry with that page number: the entry number specifies the frame.

	          virtual address
         -------------------------------------
         |    page number     |    offset    |
         -------------------------------------
              |                            \
              |                             \
              |				     \
              |				      \
              |          -----------	       \
	      v	         |         |	        \
                         -----------		 \
	      	            ...			  \
                         -----------   ------------------------
	          frame: | page #  |-->| frame   |   offset   |
                         -----------   ------------------------
		         |         |        physical address
                         -----------
		            ...
                         -----------
		         |         |
                         -----------

		       Inverted Page Table

An inverted page table has one major advantage: it need never be larger than the size of physical memory. Assuming that a page table entry is 8 bytes, and pages are 4K bytes, then an inverted page table always take 8 bytes per frame, or 8 bytes for every 4K bytes of physical memory -- about 0.5 per cent. By contrast, the size of a page table is determined by the amount of virtual memory used. If only small programs are run, the page table sizes will be small; if large programs (in terms of their virtual memory size) are run, the page tables will be large (by about the same factor: a page table entry is probably about 4 or 8 bytes, and each 4K page requires one page table entry). Since each process requires its own page tables (the page tables are swapped as part of a context switch), if several large virtual memory programs are run, the total amount of page table space needed is the sum of the page table space for each process.

An inverted page table has one major disadvantage: it is more complex and expensive to use (in terms of time) for mapping the virtual address to a physical address. For a small inverted page table, it would be possible to keep the entire thing in a set of associative registers, but for a larger inverted page table (corresponding to a larger physical memory), it will need to be kept in memory and searched. For example, a 4M memory with 4K pages, requires an inverted page table of 1K entries (8K bytes at 8 bytes per entry).

The search of the inverted page table can be improved by using a hashed search. The virtual address is hashed to get an index into a hash table. The hash table indicates the first address of a chain of values which with the same hash value; the chain is searched to determine the frame associated with the given virtual address. With a large enough hash table and a good hash function, the length of each hash chain would be small, and the search would be quick. The current guess is that, for the RT, the hash chain is normally only 1 or 2 entries long.

This approach uses memory for the hash table and the hash chain pointers in each table entry, to allow a faster search of the inverted page table. In addition, if the size of the hash table is fixed, adding (and using) more physical memory would tend to lengthen the hash chains and slow the search. Thus the size of the hash table for the RIOS is set as a function of the size of physical memory, varying from 4K to 1M entries.

Speed problems can be statistically improved by the careful use of cacheing. For example, a Translation Look-aside Buffer (TLB) is commonly used to remember recent virtual-to-physical translations in an associative memory. This can be particularly useful for inverted page tables, since it avoids both the cost of the hash function and the search down the hash chain.

So, on the RT/RIOS, concern about the size of page tables led to the use of an inverted page table rather than a page table. This sacrifices the time to translate a virtual address to a physical address (a hashed search of variable length for an inverted page table versus exactly one indexed access for a page table) for the more predictable memory requirements (the memory required for the inverted page table is more predictable than those for page table). The use of TLB's (which can be used for either addressing scheme) was felt to minimize the performance impact.

Sharing

For more complex systems, however, other problems arise with the inverted page table. The primary property of the inverted page table is that it records, for each frame, the specific virtual address page that is in that frame. This then implies that the virtual address associated with each page must be unique. This restriction is not true for a page table: several page table entries can map to the same frame, allowing several different virtual addresses to map to the same memory. This is known as aliasing. Aliasing is easily supported for page tables; it is more difficult to support with an inverted page table.

Aliasing is a special case of the more general concept of shared memory. As systems are developed, memory sharing between different processes has developed for two reasons: efficiency and parallel processing. The efficiency issues are the result of wanting to reduce the total amount of memory needed for a system. Even simple systems, for example, try to share read-only code of programs with other invocations of the same code. Thus, if there are three compilations at the same time, only one copy of the compiler code is in memory -- the three invocations share the code by mapping it into different virtual address spaces. The code, however, is most likely at the same virtual address in the three address spaces, so the same mapping is applied to all three. The local data used for the three computations, on the other hand, must map to different physical locations, since the values stored will vary for each computation.

If the three computations have their local data at the same virtual addresses, an inverted page table cannot store all three at the same time. (If there are three different frames in memory, each at virtual address 0xffaf000, for each of three different processes, and the inverted page table then has three different entries with the same content, 0xffaf000, how can we search the inverted page table to find the 'right' frame? The interpretation of 'right' varies for each process.). There are two solutions to this problem. One is to change the inverted page table when the cpu is switched from one process to another; thus only one frame has the aliased virtual address at a time.

The other approach is to modify the virtual address so that the addresses are unique, adding a value which is unique to the process. On the RT/RIOS for example, this situation is handled by selecting the values of the segment registers so that the virtual addresses for the different processes are different. The 4-bit segment field maps to a 24-bit segment number. 24 bits allows a large number of segment numbers, so that the allocation of segment numbers for RT/RIOS allows the virtual addresses (which include the segment id) to be unique to a process. Alternatively, if the same segment id is used, different processes can share memory. Memory sharing is limited in RT/RIOS (because the sharing is based on segments) to being on segment boundaries (the granularity of sharing is 256M), and to no more than 16 shared segments between processes (since there are only 16 segment registers).

Copy-on-Write

Other efficiency considerations also introduce sharing. For example, in Mach, the fork system command will create a new process with a duplicate address space. Rather than actually allocate physical memory and copy the address space, however, Copy-on-Write is used. With Copy-on-Write, the page tables for the two processes are initially the same: two virtual addresses point to the same physical frame (which is set to be read-only). If neither process ever writes into the page, the page will be shared, eliminating the need to copy the page, saving both the memory to store the copy and the time to create it. If either process attempts to write into the frame, a copy of that page is made in a different frame and the two processes now have different physical frames associated with their pages. Mach uses Copy-on-Write for other operations (such as message transfers) to reduce copying pages.

Copy-on-Write is a general technique that can be used to improve the efficiency of sharing at the page level. While there are several ways to change the segment numbers assigned to the processes, it is not possible to avoid two identical virtual addresses mapping to the same physical address if sharing is attempted on a page basis, as assumed by Copy-on-Write.

It is possible to use Copy-on-Reference, rather than Copy-on-Write and avoid aliasing, but at an additional cost. With Copy-on-Reference, a copy of a shared page is made upon the first reference (read or write) rather than the first write. There are currently no numbers showing how many pages that are subject to Copy-on-Write are referenced but not written. In all of these cases, it is possible to avoid aliasing by simply making additional copies of pages (possibly copies that are never used). The introduction of sharing in this way is a result of an attempt to introduce efficiency by avoiding both the allocation of memory and the copying of data into it unless and until it becomes necessary.

Parallelism

The other use of shared memory is for parallel processing. In this situation, the sharing of memory is not for efficiency, but because of the basic nature of the problem. The problem either dictates or is posed in terms of multiple processes sharing at least part of a common memory. Threads are one way to create processes sharing a common memory. Shared mapped files (using the mmap system call) is another way to share memory. Unless this sharing is restricted to being on segment boundaries (so that the same segment can be mapped to different segment ids and hence to different virtual addresses), aliasing will result: we will have a situation where two different virtual addresses must be mapped to the same physical address. Sharing on segment boundaries restricts the number of shared areas of memory and the granularity of sharing. For the size of segments currently supported, this would not be generally acceptable.

In this situation, copying is not an acceptable alternative. The changing values in one copy of the shared memory must be reflected immediately in the other copy. However, since only one thread or one process will be active at a time (since there is only one cpu), it is possible to alter the inverted page table as each process is dispatched to hold only the addresses that it sees in its address space. Hence it is possible to allow sharing and aliasing between processes on a single processor.

It is possible, but unlikely, that the same frame might show up in two different places in the address space of a single process. In this case, on a single processor, it is possible to "ping-pong" the aliased virtual addresses in an inverted page table. One virtual address is put in the inverted page table. If a reference is made to the other virtual address, a page fault will result. The operating system can then recognize that the aliased page is being addressed and change the inverted page table, replacing the first address with the second, and restarting the faulted instruction. As long as both addresses cannot be generated in the same instruction, the computation will proceed, although possibly slowly.

In a multi-processor system, however, this would not be possible. If memory is shared between processors, then there must exist simultaneously different mappings to map two different virtual addresses in the different processors to different physical addresses; each processor must have a different mapping. This would argue then that one of three approaches is necessary for a multi-processor RT/RIOS system:

Each processor will need its own inverted page table. The total memory needed is then a product of the number of processors and the number of frames.
A many-to-one inverted page table would be needed that can find several virtual addresses associated with a given physical address.
The inverted page table should be discarded and a more conventional page table used. Sharing and aliasing is easy with a conventional page table.

The first and second approaches have been considered in a note by Bryant, Fitzgerald, and Knight of IBM Research ("Multiprocessor Memory Mapping for RIOS2: Straw-Individual", 26 Jan 1990). Their proposed solution was to modify the IPT to allow for aliasing. The current IPT is constructed for a one-to-one mapping function (one virtual address maps to one physical address); aliasing requires a many-to-one mapping function (many virtual addresses map to the same physical address). Logically what is needed is a list of aliased pages for each frame. When the IPT is searched, it is necessary to search the list of aliased pages. (When the mapping is one-to-one, there will be no more than one page on the list for each frame, and the current IPT results.)

The general problem then for multiprocessor memory mapping with aliasing is as follows:

	       		  ----------------------
			  |  (vp[1], frame #)  |
			  |  (vp[2], frame #)  |
virtual address page  ->  |       ...	       |  -> physical frame #
			  |  (vp[i], frame #)  |
			  |       ...          |
			  ----------------------

This is a general lookup problem, common to many situations. A large collection of solutions have been suggested in the computer science literature (consider, for example, Knuth, Volume 3: "Sorting and Searching"). Typical solutions include indexing (which is basically what a page table is), or various searches (including hashed searches).

Bryant, et al, suggest continuing on the hashed search approach to supporting the mapping. The straight IPT kept only one virtual address and "computed" the frame from the index of the IPT that was found to hold the virtual address; the frame was not explicitly stored in the table. With aliasing, two different virtual addresses cannot be in the same entry in a table, and hence will not have the same table index. If the two addresses are to be aliases, additional information will be needed to indicate the frame number explicitly, rather than implicitly. The suggestion is to expand the memory provided for the Hash Table, and to change the information stored in each entry. The Hash Table is composed of two parts: the Hash Anchor Table, which is addressed by the initial hash, and the Hash Overflow Table, which is used to store the hash chains.

The new Hash Table entries would store (1) the virtual address, (2) the frame that it occupies, and (3) a pointer to the next hash table entry on this chain. Protection information would also need to be kept in the hash table entry, since the protection (read-only, read-write, ...) might not be the same for two different virtual address aliases. The frame number would be used to access a Page Frame Table. Each entry in the Page Frame Table would contain information which is associated with the physical frame (as opposed to the virtual page) such as reference and change bits and possibly locking information.

The hardware would hash the virtual address. The contents of this location would be compared with the virtual address. If the virtual address is different, the search would continue at the next hash table entry for this chain. If the virtual address is the same, the frame would be used to access and update the page frame information (reference and change bits) and to construct the physical address for this access. If the search terminates at the end of the chain without finding the desired frame, a page fault occurs.

There are two main problems with this solution: (1) the size of the Hash Table, and (2) the length of the search.

The size of the Hash Table is no longer predictable, but rather is the sum of the total number of virtual address in use in memory at any given time. Because each alias requires a duplicate entry in the table, the length of the table depends upon the number of aliases.

It is possible that the number of aliases is large. Consider a program that reads an m x n matrix into memory and then creates m*n processes, each computing on one element of the matrix, based upon the values of its neighbors. Assume that the matrix elements are large, each occupying one page of memory; so the entire matrix requires m*n pages of memory. When each process is forked, it will get a shared copy of the matrix. When it first stores in its element, a separate copy of that element must be created (Unix fork semantics), but the remaining m*n pages can be shared. On the RIOS, the different processes will each need to assigned a different segment id for the shared matrix. (Otherwise we would have the same virtual address for two different processes mapping to different frames, at least for their unique elements, which is not possible.) Thus, each of the other m*n-1 pages will have different virtual addresses for their frames; that is we will have m*n-1 aliases per process or a total of (m*n-1)*m*n aliases in the system.

The number of aliases, for a different job set, may also be quite small. As a result, the size of the overall Hash Table is unpredictable. This implies that the size cannot be fixed (in the way that the Hash Anchor Table is fixed at IPL time). Rather the Hash Overflow Table must be of variable size, changeable by the operating system as necessary. This makes the management of this table more complex. In particular, we must consider the problems associated with managing this table, and the memory that it occupies. When a new entry (or set of entries) must be added to the table, the table may need to grow. If the table must be contiguous, how is space provided to allow it to grow? If the table later becomes nearly empty, can it be shrunk?

These same problems impact the search time. As the size of the hash table grows, given that the Hash Anchor Table is fixed, the average length of the hash chains will also grow. Thus the time to search the table and determine the physical address for a virtual address will vary considerably, depending upon the extent of aliasing.

An Alternate Proposal

As a result of these considerations, we suggest (1) that the search time for an IPT or a modified IPT as proposed by Bryant, et al, is a major problem for a machine which is striving to provide very high performance. In addition, (2) the supposed memory advantage of an IPT is lost when it is modified to support aliasing, both because the number of hash table entries is no longer bounded and the size of the hash table entries is larger (since the frame must be explicitly kept). We suggest then that RIOS2 use a conventional page table. This will provide very well defined performance (exactly one memory access in all cases). Memory requirements are uncertain, but probably similar to those of the modified IPT.

We provide below a specific page table proposal. This is to allow a discussion to focus on the detail necessary to completely understand the relative advantages and disadvantages of the two general approaches; it is certainly not the case that the details cannot or should not be changed; there are a number of places where there are several alternative solutions which would work at least as well, if not better than what is proposed here.

Starting with a 32-bit virtual address, the high order bits are taken to indicate a segment. For now, let us assume 4-bit segment indicators, although we may want to increase the number of bits in a segment indicator (to 6 or 8 bits). The segment indicator is used to access a segment table entry. Since there are only 16 segment table entries (but eventually possibly up to 256), we assume the segment table is kept in registers in the cpu; a memory access is not needed to access the segment table. The segment table entry has three values: (1) a pointer to the base of the page table for this segment, (2) a lower bounds, and (3) an upper bounds. The page number within the segment is compared with the lower and upper bounds and must be between them (else a fault). The page number is added to the page table base to provide the memory address of the page table index for this page. The page table index contains the frame number for this page. The frame number is combined with the byte offset from the virtual address to form the physical address.

       	          virtual address
         -------------------------------------
         | seg  |  page number   | offset    |
         -------------------------------------
              |
              |
              |
              |
              |
	      |
              |
	      |	    ---------------------------------------
              |	    |              ...                    |
	      ----->| PT Base | Lower Limit | Upper Limit |
		    |              ...                    |
                    ---------------------------------------
		     		Segment Table



             page number > Upper Limit   ---> fault

	     page number < Lower Limit   ---> fault


	     page number + PT Base  --
				     |
				     |
				     |       ---------
				     |	     |	...  |      ------------------
				     ------->| frame |  --> | frame | offset |
					     |  ...  |      ------------------
					     ---------
					     page table
					    for this segment

With this approach, translating a virtual address into a physical address requires exactly one memory reference (to get the frame number from the page table for the segment). A TLB can eliminate that reference in most cases, but in the worst case, only one memory reference is needed.

Page tables are kept for each segment. Accordingly, page table space is not wasted if memory within segments is kept contiguous. This can certainly be true for memory that is allocated by the loader/linker, user stack segments, heap space, and user defined common blocks. The upper and lower bounds are kept for each segment to allow space to grow from either end (or the middle). Specifically, it allows a stack to grow down with other memory allocation being up. It is possible to use other approaches to require only one limit value.

Aliasing and memory sharing is easy to accomplish with this approach. Protection can be provided on either the segment, the page, or the frame (or any combination of these). A fork causes only the duplication of the segment table; Copy-on-Write can then be used to make copies of the page tables and pages as necessary.

The segment table is part of the cpu process state. A context switch requires switching the segment table.

The memory requirements are for segment and page tables to support the virtual address space of each process.

Summary

Shared memory is an increasingly important aspect of modern computer systems; it will be particularly important for multiprocessor systems. Shared memory, at the page level, is used both explicitly, for parallel algorithms, and implicitly, for improved performance. Shared memory at the page level creates aliasing, more than one virtual address mapping to the same physical address.
A (classical) inverted page table cannot handle aliasing, since if allows only one virtual address per frame. Modifying the inverted page table concept to support aliasing results in a memory mapping structure with variable, unpredictable memory requirements and variable, unpredictable time to translate.
An alternate proposal, to use a conventional page table, easily supports aliasing. The time to translate is fixed and predictable. The memory requirements are variable and unpredictable.

Notice also that other IBM systems (AIX ESA for the 370 architecture and AIX PS2 for the IBM PS/2 architecture) do not use inverted page tables. Switching to a conventional page table would improve the commonality of code among our own family of systems.

Also, the advantage of inverted page tables in memory requirements may be less real than stated. The operating system will need to keep a number of tables and data structures of its own to manage both processes and memory. It is likely that independent of the hardware support for paging, the operating system will need both page tables (to manage processes and their memory), and a frame table (effectively an inverted page table) to manage memory. Thus the total amount of space needed by the operating system for a paged system may include both the equivalent of page tables and inverted page tables. If the operating system data structures cannot simultaneously serve as the hardware data structures, then additional memory will be needed for whatever is used by the hardware. Total memory may then be either (p + i) + p or (p + i) + i. If the hardware and software could use the same data structures, then total memory would be reduced to just (p + i).

It thus seems that the original premise upon which inverted page tables where selected, memory requirements, is unlikely to be an important factor in total system structure. The total space required for data structures to support virtual memory is much larger than just the hardware data structures; the hardware data structures are a small part of the memory needed for data structures, algorithms, and code to support virtual memory. As an intuitive measure of the insignificance of the page table memory requirements, consider that the memory requirements for an RT and a comparable Sun (M68000 based) system, both running variants of Unix, for the same workload, do not vary because of the fact that the RT has an inverted page table and the Sun a conventional page table. It is not the case than the Sun system would require 4M bytes of memory while the RT would need only 3M bytes. The use of an inverted page table in the RT has not resulted in lower overall memory sizes for RT systems.

A more subtle problem arises because of the use of an memory scheme that varies from the conventional approach. Aliasing, in my belief, is based upon concepts that are natural for conventional page tables. The concept of aliasing would never have arisen in systems based on inverted page tables, since it is impossible to support. The bulk of the computer science community conceives of paging as being done by a page table, not an inverted page table. Hence, the solutions and techniques that they arrive at for problems are based upon and work with a conventional page table. While we may be able to continue to enhance the inverted page table to work, I believe that it is likely that additional techniques will arise in the future which again will be based on conventional page tables, not inverted page tables. If we are to support systems which use these techniques, we will need to either continuously modify our inverted page table hardware as necessary. This puts us in the position of always playing catch-up, with no opportunity to lead.
Home Comments?