



Software Cache Consistency




James L. Peterson
James W. Van Fleet




Advanced Workstation Division
IBM
11400 Burnet Road
Austin, Texas 78758




26 Sept 1990




This paper was submitted to the International Symposium on Shared Memory Multiprocessing, held in April 1991 in Tokyo.


Abstract

We examine the problem of cache consistency for multiprocessor systems. Hardware cache consistency solutions reduce the overall performance of a system by imposing additional overhead on each processor. We suggest that cache consistency can also be achieved in software, at least for ``embarrassingly parallel'' user problems. Within the operating system, ``lock and flush'' techniques can be used to modify existing multiprocessor systems to run on systems without hardware cache consistency. The operating system can also be restructured as a collection of ``active monitors'' which use RPC for communication and are scheduled with processor affinity.

Introduction

Performance is an important aspect of computer systems. As computing systems have increased the speed with which they can solve problems, more problems have been defined, leading to a demand for even faster computers. Steadily improving electronics has provided faster computers.

In addition, new processor architectures, such as Reduced Instruction Set Computers (RISC), have allowed faster and smaller computers. Organizational concepts, such as caching, can be applied to both hardware and software to improve average system performance. The IBM RISC System/6000, for example, achieves high performance from its RISC architecture and extensive use of memory caching [1].

Increased performance allows both a single problem to be solved faster, and more problems to be solved in a given amount of time. Doubling the speed of a system means that each problem takes only half the time and twice as many problems can be solved. As a rule of thumb, it is estimated that peak computer performance doubles every 12 to 18 months.

Multiprocessing

Another approach to increased system throughput is the introduction of multiprocessing. Multiprocessing allows several processors to be applied to a set of problems, and can result in significant performance gains. For independent problems, the number of problems solved in a given time can be increased by using multiple processors, even though the amount of time needed to solve each individual problem remains the same. By modifying the algorithms used to solve the problem, it may be possible to reduce the time to solve a given problem by using multiple processors.

The concept of multiprocessing can be applied to a wide range of computing systems.

Networks of computers are created by linking together conventional uniprocessor computer systems with a communication network. Each processor is completely independent, with its own I/O devices. Because of the time required to move information from processor to processor over the communication network, only coarse-grained parallelism is possible.

A cluster of computers allows a smaller grain of parallelism. While a network may be physically distributed over a large area, the processors in a cluster are physically close together. The closeness allows faster communication between processors and the possible sharing of I/O devices, such as disk and tape units. The software for a cluster can provide a single system image, with a shared file system and common accounts and passwords on each processor. Still, sharing information between processors requires explicit use of the communication subsystem, typically with I/O instructions. Clusters can be called loosely coupled multiprocessor systems.

A tightly coupled multiprocessor system (TCMP) allows communication through shared memory. Sharing memory between processors allows very high speed communication. Immediately after one processor stores information into memory another processor can read it, allowing communication at memory bandwidths. The faster communication between processors allows fine-grain parallelism to be used in the design of algorithms and the solution of problems.

The architectural difference between clusters (or networks) and tightly coupled multiprocessors has given rise to two different design methodologies for parallel programs: message passing and shared memory. While it has been shown that the two are theoretically equivalent [2], shared memory systems should always be faster. Thus, tightly coupled multiprocessors are the preferred choice for high performance computing.

Cache Consistency

TCMP systems are not easily built, however. It is fairly easy to take an existing high performance processor and design a computer network or cluster using the processor. It is more difficult to design a TCMP using the processor. In particular, one of the features of modern high performance processors, the use of memory caching, can cause severe problems.

In modern systems, processor speeds are often held back by the speed of memory. While it is possible to build fast memory, it is more expensive than the commonly available memory systems. In a uniprocessor, caching improves the effective memory access time. A small high-speed cache stores the most recently used memory items. When the processor wants a memory item, it first checks the cache. If the item is available in the cache, the processor can proceed immediately. Otherwise the request is sent to memory while the processor waits. When the memory item is received from memory, the processor can continue. In addition, the item is put in the cache for later reuse [3].

A store into a memory item is put into the cache and either stored back in memory (store-through) or left in the cache for storage later (store-back). A store-back cache can reduce memory traffic over a store-through cache, since repeated stores to the same location (for example, the calculation of a running total) can affect only the cache until the final value is stored in memory.

Caches have two effects in a TCMP system. The good news is that by reducing memory traffic, caches reduce the amount of memory contention and memory interference between processors. Since each processor accesses memory less often (only when the item it wants is not in its cache), it is less frequent that memory is busy with one processor when it is needed by another processor. The bad news is that the information in the cache of one processor may become inconsistent with the information in the cache of a different processor, resulting in incorrect computations.

The cache consistency problem is fairly easy to see. Assume two processors both read the same memory unit, copying its value into their caches. After some processing one (or both) of the processors updates the value. With a store-through cache, the updated value is written into memory; for a store-back cache it is left in the processor's cache. In either case, the other processor, if it tries to access the shared value, will find and use the old value from its cache. It may even update this value, resulting in a new (and incorrect) value in its cache, and possibly in memory. (With a store-through cache, the first update will be lost; with a store-back cache it is nearly impossible to tell which updated value will eventually be in memory, or when.)
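
To make the scenario concrete, the following sketch shows a shared counter updated by two processors. It is illustrative only, assuming store-back caches and no hardware consistency; the comments trace one possible interleaving in which an update is lost.

    /* Illustration only: a shared counter updated by two processors with
     * store-back caches and no hardware cache consistency.  The comments
     * trace one interleaving in which processor B reads a stale value and
     * processor A's update is lost. */
    int shared_count = 0;              /* one copy in memory                 */

    void processor_A(void)
    {
        int t = shared_count;          /* miss: 0 flows into A's cache       */
        shared_count = t + 1;          /* A's cache holds 1; memory still 0  */
    }

    void processor_B(void)             /* runs after processor_A             */
    {
        int t = shared_count;          /* reads 0 from memory, or a stale
                                          copy in its own cache, not A's 1   */
        shared_count = t + 1;          /* B also writes 1; one update lost   */
    }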

Cache consistency problems can arise even without multiple processors. Accessing a memory item involves two steps: (1) calculating the physical address of the memory item from the virtual address, and (2) accessing the physical memory address. For maximum performance, a cache which stores virtual addresses, not physical addresses, allows both steps to be avoided for memory items that are in the cache. The processor can simultaneously begin the address translation and the cache access. If the desired memory item is in the cache, the address translation can be stopped (or at least ignored), and processing can continue. The difficulty with virtual address caches is that aliasing (where two virtual addresses map to the same physical address) could create two different entries in the cache under different virtual addresses for the same physical address. Aliasing in a virtual address cache can cause cache consistency problems.

Hardware Cache Consistency

One approach to designing a TCMP is to disable the cache completely. However, this would drastically reduce the performance of the resulting system, both from the additional time for each memory access and from the increased memory contention among the processors.

More commonly, additional hardware is used to provide cache consistency among the processors. Various schemes can be used to provide hardware cache consistency [4]. Most hardware cache consistency schemes allow multiple read-only copies of a memory item to flow from memory to processor caches as needed. When a processor changes its copy, however, all other copies must be either invalidated (write-invalidate) or updated with the new value (write-update). In addition, cache writes can be either write-back or write-through. For a write-through policy, the memory copy is kept consistent with the modified value, so subsequent reads can be directed to memory. With a write-back policy, the memory copy may be inconsistent with the updated cache copy of the writing processor, so future reads must be directed, not to memory, but to the cache with the most recent copy. The main problem in designing a hardware cache consistency algorithm is determining where current copies of information are, for either access or update.

One approach, for example, creates a hardware cache directory which records what memory items are stored in which caches. A write by a processor to its cache also sends a signal to the cache directory. Using the information in the cache directory, additional signals are sent to all caches which have copies of the written information; these copies are now out-of-date. The copies are either invalidated or updated with the new copy. If a copy is invalidated, a subsequent read attempt by that processor will cause a cache miss. The read then queries the cache directory to learn the location of the desired item (either in memory or in the cache of another processor), obtains a copy of the item, and registers the existence of the copy with the cache directory. Note that the size of the cache directory is affected by the number of processors and the size of the cache for each processor. This places a limit on the number of processors which can be supported by this approach: too many processors require too large a cache directory.

Another hardware cache consistency approach is the use of a snoopy bus. With a snoopy bus approach, memory access and cache consistency commands are broadcast to all caches. Each cache must watch the bus and process any action which affects the correctness of its cached values. A read request, for example, is broadcast on the bus and will be filled either by memory or from the cache of another processor. A write request requires each cache to either invalidate or update its cached value from the broadcast address and value. As with the cache directory approach, the performance of a snoopy bus design is limited by the number of processors. As the number of processors grows, the amount of traffic on the bus grows, both for memory accesses and updates and for broadcasts to keep the caches consistent.

Hardware cache consistency schemes require additional hardware, increasing the cost of the TCMP system over the cost of the processors and memory alone. In addition the processors must implement the cache consistency algorithm, slowing each memory access. It has been suggested that adding multiprocessor cache consistency support to a physical processor may slow the individual processor down by 10% to 20% from the performance of a processor designed with no consideration for use in a multiprocessor configuration.

Software Cache Consistency

Another approach to cache consistency in TCMP systems is to rely on software to maintain cache consistency. Instructions can be provided to flush cache items back to memory and to invalidate items in the cache. Such instructions allow software to be responsible for maintaining cache consistency. Software supported cache consistency would eliminate the need for complex and expensive hardware to provide cache consistency. It would remove the resulting limit on the number of processors that can be linked together in a TCMP system.

An example of software supported cache consistency is the TLB shootdown algorithm of [5]. The Translation Look-aside Buffer (TLB) is a hardware cache of virtual memory translations. On multiprocessor systems, each processor may have its own TLB. When software modifies the page map tables (for example, when access to a page changes because it is swapped out), each processor must invalidate any entries in its TLB which no longer reflect the information in the page map. Since software knows when it is changing the page map, it is not uncommon for software to have the responsibility for assuring consistency between the TLBs of the processors and the page map.
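
A much simplified sketch of the shootdown idea follows. It is not the exact algorithm of [5], and the interrupt, TLB, and atomic primitives it assumes are hypothetical.

    /* Simplified TLB shootdown sketch (not the exact algorithm of [5]).
     * interrupt_processor(), tlb_invalidate_entry(), and atomic_add() are
     * hypothetical kernel primitives. */
    extern int  ncpus, my_cpu;
    extern void interrupt_processor(int cpu);         /* send a shootdown IPI */
    extern void tlb_invalidate_entry(void *vaddr);     /* drop one TLB entry   */
    extern void atomic_add(volatile int *p, int v);

    void        *shootdown_vaddr;      /* page whose mapping is changing       */
    volatile int shootdown_acks;       /* processors that have responded       */

    void change_mapping(void *vaddr)   /* runs on the initiating processor     */
    {
        int cpu;

        shootdown_vaddr = vaddr;
        shootdown_acks  = 0;
        for (cpu = 0; cpu < ncpus; cpu++)
            if (cpu != my_cpu)
                interrupt_processor(cpu);

        /* ... change the page map table entry here ... */

        tlb_invalidate_entry(vaddr);                   /* our own TLB          */
        while (shootdown_acks < ncpus - 1)
            ;                                          /* wait for the others  */
    }

    void shootdown_interrupt(void)     /* runs on every other processor        */
    {
        tlb_invalidate_entry(shootdown_vaddr);
        atomic_add(&shootdown_acks, 1);
    }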

For systems without hardware cache consistency for main memory, two separate strategies are possible for software cache consistency, one for user processes and another for the operating system itself. User programs, operating in virtual memory, can be constrained by the virtual memory system. If a set of user processes is sharing memory, their initial accesses to that memory will cause page faults. Each page fault results in the page being added to the page map of the faulting processor, after which the faulting process is restarted. Associated with each page is protection information that allows either read-only or read-write access to the page. Initial faults to access a shared page will result in each processor mapping the shared page read-only. Items from that page can now flow into the cache of the processor.

An attempt to write into the shared page will cause a page fault, since the page is marked read-only by the memory protection hardware. The faulting processor, seeing from the operating system tables describing the page that writing is allowed but that the page is shared, must notify all other processors. Each notified processor must remove the page from its page map, so that further accesses are not possible, and invalidate any entries from that page in its cache. (The entries were read-only, so the cache entries duplicate the memory copy). When all processors have changed their page maps and invalidated any cache entries, the faulting processor can enable write access to the page and resume processing. The processor now has exclusive access to the page and can allow modified copies to flow into its cache.

Eventually another processor may attempt to read or write the page. It will suffer a page fault and from the operating system tables see that another processor has exclusive access to the page. If a write to the page was attempted, it asks the current owner of the page to release it. The owner must flush and invalidate any cache entries for the page and remove it from its page map. Then the faulting processor can assume exclusive access to the page, putting the page in its page map, giving itself write access to the page, and changing the operating system tables accordingly.

A read access by a processor other than the owning processor of a page is handled differently. The faulting processor notifies the current owner of the page that it wishes to read the page. This can be handled just as a write request, with the current owner giving up its access to the page, or the current owner can simply flush its changes back to memory and change its page map to allow read-only access (rather than its previous read-write access). In either case, the faulting processor can then map the page with read-only access, and multiple processors can again share the page.

This scheme allows multiple copies of memory items for read-only pages, but only one copy of a writeable page. Caches are explicitly flushed and invalidated as necessary by interprocessor requests to keep memory consistent in the face of cached shared memory without hardware consistency.
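
A sketch of the write-fault path described above might look like the following. The page descriptor, the locking primitives, and remote_unmap_and_invalidate are hypothetical names; a real implementation would live inside the virtual memory system.

    /* Sketch of the write-fault handling described above.  The page
     * descriptor, locking primitives, and remote_unmap_and_invalidate()
     * are hypothetical. */
    #define NCPUS 8

    typedef volatile int lock_t;
    extern void lock(lock_t *l);
    extern void unlock(lock_t *l);
    extern void remote_unmap_and_invalidate(int cpu, void *page); /* IPI + wait */
    extern void map_page(int cpu, void *page, int writable);

    struct shared_page {
        void  *addr;                   /* the page itself                      */
        int    owner;                  /* processor with exclusive access      */
        int    readers[NCPUS];         /* processors mapping it read-only      */
        lock_t lock;
    };

    void write_fault(struct shared_page *pg, int faulting_cpu)
    {
        int cpu;

        lock(&pg->lock);
        for (cpu = 0; cpu < NCPUS; cpu++) {
            if (cpu == faulting_cpu || !pg->readers[cpu])
                continue;
            /* the read-only copies match memory, so invalidation (not a
               flush) is all the other caches need */
            remote_unmap_and_invalidate(cpu, pg->addr);
            pg->readers[cpu] = 0;
        }
        pg->owner = faulting_cpu;
        map_page(faulting_cpu, pg->addr, 1);    /* now read-write, exclusively */
        unlock(&pg->lock);
    }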

How does this scheme perform? There are many situations where performance could be quite reasonable, particularly for ``embarrassingly parallel'' problems with little sharing. When unrelated problems are to be run, there is little or no sharing and performance should be very good. On problems with fine-grained parallelism, however, the performance of this scheme is likely to be poor. A major problem is the granularity of sharing: the page. Pages tend to be fairly large (1K bytes or more). Thus there may be much ``false'' sharing. False sharing results when two (or more) unrelated values happen to fall into the same page. Although each value is used only by one processor, it will appear that both are trying to use and change the same page. Hence, the processors will be constrained to sharing the page, with only one processor being able to write its value at any given time. Note that smaller pages result in less false sharing, but produce larger operating system tables for memory management.

In addition to the coarse granularity of page sharing, the time to gain access to a page that is owned by another processor is large. The faulting processor must interrupt the owning processor and wait for its request to be handled. The owning processor must accept the interrupt, determine the reason for the interrupt, flush and/or invalidate its cache entries, change its page map (including changing its TLB), acknowledge the completion of the request to the faulting processor, and return from the interrupt. Only when the owning processor has acknowledged the completion of the request can the requesting processor modify its page map and continue its computation. The overhead when two (or more) processors are actively accessing the same page at the same time would be overwhelming. Thus, this software cache consistency scheme is only reasonable for low or moderate levels of sharing between user-level processes.

A different approach to software cache consistency is possible for the operating system itself. Since the operating system code is written by the system provider, it can be constructed to consider the problems caused by the lack of hardware cache consistency. A multiprocessor operating system must be designed to prevent interference from the multiple processors. Thus, access to any shared data structures must be controlled by the use of locks. Particularly when shared information can be modified, locks are needed to prevent two or more processors from modifying the shared information at the same time, resulting in incorrect operation.

As a result, code in a multiprocessor operating system which modifies shared data structures is generally of the form:

  1. Obtain the lock for the shared data.
  2. Examine and/or modify the shared data.
  3. Release the lock.

In a system without hardware cache consistency, we can modify the above sequence as follows:

  1. Obtain the lock for the shared data.
  2. Examine and/or modify the shared data.
  3. Flush any cached data back to memory and invalidate the cache entry.
  4. Release the lock.

Since all accesses to the shared data are protected by the lock, the data will not be in the cache of any other processor when the lock is obtained. References to the protected shared data will cause it to flow into the cache of the processor holding the lock. That processor explicitly flushes the shared data back to memory before it releases the lock, thus assuring that the memory copy of shared data is correct when the next processor uses it.

For a multiprocessor operating system, the locks must already be in place; hence it is necessary only to identify the ``unlock'' operation and augment it to flush and invalidate the appropriate part of the cache. Providing the address and length of shared memory items with the lock operation allows the appropriate cache entries to be identified. If there are references to shared memory that are not protected by locks, they must be explicitly tagged to allow the cache to be invalidated after use. Invalidating the cache copies after each use assures that the next use will be to a new copy from memory (not an old copy from the cache).


As an example of this form, consider the following code from the OSF/1 operating system:

	simple_lock(rwl, sizeof(lock_t));

	rwl->read_count++;
	LSTATS(rwl->lock_nreads++);
	if (rwl->recursion_depth != 0)
	    rwl->recursion_depth--;
	else if (rwl->want_upgrade)
	    rwl->want_upgrade = FALSE;
	else
	    rwl->want_write = FALSE;

	if (rwl->waiting)
	   {
		rwl->waiting = FALSE;
		thread_wakeup((int) rwl);
	   }

	simple_unlock(rwl, sizeof(lock_t));

The shared memory item is the variable rwl of type lock_t. Accordingly, we need to lock the memory starting at rwl and continuing for length sizeof(lock_t). Within the bracketing calls to simple_lock and simple_unlock, the various fields of the shared memory item are referenced and modified. To work in a system without hardware cache consistency, the simple_unlock routine (or macro) is modified to flush and invalidate the cache entries for the address range from rwl to rwl+sizeof(lock_t)-1.
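
One possible form of the augmented simple_unlock is sketched below, assuming hypothetical dcache_flush and dcache_invalidate operations that act on a single cache line.

    /* One way simple_unlock might be augmented, assuming hypothetical
     * dcache_flush() and dcache_invalidate() operations that act on one
     * cache line at a time.  CACHE_LINE is the line size (128 bytes on
     * the RISC System/6000). */
    #define CACHE_LINE 128

    extern void dcache_flush(void *line);        /* write the line to memory */
    extern void dcache_invalidate(void *line);   /* drop the line from cache */
    extern void release_lock(void *addr);        /* the original unlock step */

    void simple_unlock(void *addr, unsigned len)
    {
        char *p   = (char *) ((unsigned long) addr &
                              ~((unsigned long) CACHE_LINE - 1));
        char *end = (char *) addr + len;

        for ( ; p < end; p += CACHE_LINE) {
            dcache_flush(p);           /* push this processor's updates out  */
            dcache_invalidate(p);      /* next reference will go to memory   */
        }
        release_lock(addr);
    }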

Comparing this software-based cache flushing and invalidation with a hardware cache consistency scheme, we see that the software scheme may lock, access, modify, flush, invalidate, and re-read the same data, even though no other processor is using it. To avoid the unnecessary flush/invalidate operations, we could establish both a lock and ownership for each piece of shared data. The ownership information records the last owner of the shared data. The owner may still have that data (modified) in its cache. In this case, the procedure for accessing shared data becomes:

  1. Obtain the lock for the shared data.
  2. Check the ownership for the shared data. If the owner is not the locking processor, notify the owner and request that the data be flushed from its cache. Wait for the owner processor to receive the interrupt, flush and invalidate its cache entry and acknowledge the request. Set the ownership to the locking processor.
  3. Examine and/or modify the shared data.
  4. Release the lock.

For this scheme, repeated uses of the same shared data, with no intervening uses by another processor, do not require cache invalidation and the consequent memory access. However, access to data which was last used by another processor requires a lengthy sequence of interprocessor communication to gain (exclusive) access to the data.
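
A sketch of this lock-plus-ownership scheme follows. my_cpu and remote_flush_and_invalidate are hypothetical primitives; the latter interrupts the previous owner and waits for its acknowledgement.

    /* Sketch of the lock-plus-ownership idea, with hypothetical primitives. */
    #define NO_OWNER (-1)

    extern int  my_cpu;
    extern void lock(volatile int *l);
    extern void unlock(volatile int *l);
    extern void remote_flush_and_invalidate(int cpu, void *data, unsigned len);

    struct owned_data {
        volatile int lock;
        int          owner;            /* last processor to cache this data   */
    };

    void acquire_owned(struct owned_data *d, void *data, unsigned len)
    {
        lock(&d->lock);
        if (d->owner != my_cpu && d->owner != NO_OWNER)
            /* the previous owner may still hold modified lines: ask it to
               flush and invalidate them, then wait for its acknowledgement */
            remote_flush_and_invalidate(d->owner, data, len);
        d->owner = my_cpu;             /* the data will live in our cache now */
    }

    void release_owned(struct owned_data *d)
    {
        unlock(&d->lock);              /* no flush: ownership is retained     */
    }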

Another problem with this ad hoc software support for cache consistency is caused by the granularity of cache entries. Just as the size of a user-level page may result in false sharing, so may the size of a cache entry. On the IBM RISC System/6000, for example, cache entries are 128 bytes. Thus, locking cannot be done only for the specific shared data, but must also cover all data in the cache entry containing the desired shared data, which is typically more than the shared data itself. A lock must be identified with the cache entry (or entries) that hold the shared data that the lock protects. Unless care is taken in the memory layout of shared data, several logically separate locks may be mapped onto the same cache entry, and hence become the same lock. This would reduce potential parallelism and open the possibility of deadlocks.

Also, as with the user-level page approach for providing cache consistency, false sharing can occur when unrelated data items used by different processors happen to fall into the same cache entry. Data items that are used by only one processor are not protected by locks. However, if data items from two different processors fall in the same cache entry, they will implicitly be shared between the two processors and their use must be controlled. The simplest solution to this problem is to allocate the data items used by one processor separately in memory from the data items that will be used by other processors. For a system programming language, such as C, this may affect how the compiler allocates memory.
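
For example, per-processor data might be padded to the cache entry size so that no two processors' private items share an entry; the line size and processor count below are illustrative values.

    /* One way to keep each processor's private data in its own cache
     * entries: pad the per-processor structure to the line size.
     * CACHE_LINE and NCPUS are illustrative values. */
    #define CACHE_LINE 128
    #define NCPUS      8

    struct per_cpu {
        long counter;                            /* used by one processor    */
        char pad[CACHE_LINE - sizeof(long)];     /* keep other processors'
                                                    data out of this line    */
    };

    struct per_cpu cpu_data[NCPUS];              /* one cache line per cpu;
                                                    the array itself should
                                                    start on a line boundary */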

Another approach towards providing cache consistency is to modify the code generated by the compiler. When the compiler sees an access to a shared memory location, either because it has been told that the memory is shared (by programmer declaration) or because it has itself determined that the memory is shared (through program analysis), explicit cache invalidate and flush instructions can be used to maintain cache consistency. Each read would be preceded by a cache-invalidate instruction, and each store would be followed by a cache-flush instruction. This assures that each read comes from memory, not cache, and that each store is reflected back into memory immediately, not left in the cache. The compiler would also need to lay out shared variables in memory to prevent false sharing. This compiler approach is effectively an automated version of the explicit software cache consistency algorithm proposed above for the operating system kernel. However, because the compiler does not understand the effect of locks, more flush and invalidate instructions will be generated than are actually necessary.
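
The code a compiler might emit for an update of a shared variable is sketched below, using hypothetical cache_invalidate and cache_flush intrinsics. These instructions only keep the cache and memory copies consistent; mutual exclusion, where it is needed, still comes from locks.

    /* What compiler-generated code might look like for an update of a
     * shared variable, using hypothetical cache_invalidate() and
     * cache_flush() intrinsics. */
    extern void cache_invalidate(void *addr);   /* drop the line before a read  */
    extern void cache_flush(void *addr);        /* write the line after a store */

    volatile long shared_total;                 /* known (or declared) shared   */

    void add_to_total(long x)
    {
        long t;

        cache_invalidate((void *) &shared_total);   /* read memory, not cache   */
        t = shared_total;
        t += x;
        shared_total = t;
        cache_flush((void *) &shared_total);        /* make the store visible   */
    }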

Notice that we have avoided discussion of the locks themselves. It will be necessary to provide the ability to lock and unlock regions of memory. Locks are often provided by special memory instructions. The code for locking and unlocking must be usable without cache consistency. Since locking is a well-defined function which is critical to the correct operation of a multiprocessor system, special hardware assistance may be provided (for example, special lock/unlock instructions which work on special non-cached lock data structures). Alternatively, the software lock/unlock code must be carefully written to account for the lack of cache consistency.
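
A sketch of such a lock follows, assuming a hypothetical atomic test_and_set operation and lock words placed in non-cached memory.

    /* Sketch of a lock usable without cache consistency, assuming a
     * hypothetical atomic test_and_set() and lock words placed in
     * non-cached memory. */
    typedef volatile int spinlock_t;

    extern int test_and_set(spinlock_t *l);    /* atomically set to 1 and
                                                  return the previous value */

    void spin_lock(spinlock_t *l)
    {
        while (test_and_set(l) != 0)
            ;                                  /* spin until it was free    */
    }

    void spin_unlock(spinlock_t *l)
    {
        *l = 0;                                /* store reaches the real
                                                  lock word, not just cache */
    }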

Active Monitors

An alternative approach to this lock-based software cache consistency is to restructure the operating system to reduce cache consistency problems. Both a shared data structure and the code that accesses it can be grouped together into a unit similar to a monitor [6]. This ``monitor'' would define the only allowed access and update code for the shared data. As defined by Hoare, a monitor is a passive entity, executed by a processor as needed, providing only mutually exclusive access to its protected data structures. We suggest, however, that an active monitor be defined which, like a monitor, encapsulates all accesses to its data. An active monitor is a schedulable process, however, and so is accessed either by a remote procedure call (RPC) or by messages.

An active monitor would normally be assigned to a processor and remain on that processor (processor affinity). This would allow the data of the monitor to flow into the cache of the processor. Rescheduling an active monitor to a different processor would require that the old processor's cache entries for the active monitor be flushed and invalidated before the active monitor is resumed on the new processor. The active monitor structuring concept is related to current object-oriented design concepts.
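
The following sketch suggests the shape of an active monitor: a single schedulable process owns the monitor data and serves requests that arrive as messages or RPCs. The request queue primitives are hypothetical.

    /* Sketch of an active monitor: one schedulable process owns the data
     * and serves requests arriving as messages or RPCs.  The queue
     * primitives are hypothetical. */
    struct request {
        int  op;                        /* which monitor operation            */
        long arg;
        long reply;
    };

    extern struct request *receive_request(void);    /* block on the queue   */
    extern void            send_reply(struct request *r);

    static long balance;                /* monitor data: touched only by this
                                           process, so only one processor's
                                           cache ever holds it               */

    void account_monitor(void)          /* scheduled with processor affinity  */
    {
        struct request *r;

        for (;;) {
            r = receive_request();
            switch (r->op) {
            case 0: balance += r->arg;  break;        /* deposit  */
            case 1: balance -= r->arg;  break;        /* withdraw */
            case 2: r->reply = balance; break;        /* inquire  */
            }
            send_reply(r);
        }
    }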

Notice that an operating system designed and implemented to support software cache consistency (by locating items in memory to avoid different shared items being in the same cache entry or by isolating the code and data in an active monitor) works by managing the way in which data is shared. Accordingly, systems designed to work with software cache consistency will also work well with hardware cache consistency by reducing the number of times hardware cache consistency algorithms are actually used.

Summary

In this paper, we discussed how tightly coupled multiprocessor systems provide the opportunity for increased system performance. We saw that shared memory for processors with caches introduced the problem of cache inconsistency. While it is possible to provide hardware solutions to cache consistency, the hardware solutions introduce additional expense and reduced performance. In addition, hardware solutions limit the number of processors which can be supported.

As an alternative, we propose that software can provide cache consistency for some TCMP systems. User level programs can provide cache consistency by use of the virtual memory page tables. Existing operating system code can be modified to explicitly flush and invalidate accesses to shared data in the cache, based upon the existing locking protection. Even better cache consistency support can be provided by restructuring the operating system into a collection of active monitors which can be scheduled onto specific processors. As long as the monitors are not rescheduled onto a different processor, no cache invalidation would be necessary.

References

  1. ``The IBM RISC System/6000 processor'', IBM Journal of Research and Development, Volume 34, Number 1, (January 1990), 136 pages.

  2. H. C. Lauer and R. M. Needham, ``On the Duality of Operating System Structures'', Second International Symposium on Operating Systems, IRIA, (October 1978).

  3. A. J. Smith, ``Cache Memories'', Computing Surveys, Volume 14, Number 3, (September 1982), pages 473-530.

  4. ``Cache Architectures in Tightly Coupled Multiprocessors'', Computer, Volume 23, Number 6, (June 1990), 128 pages.

  5. D. L. Black, R. F. Rashid, D. B. Golub, C. R. Hill, and R. V. Baron, ``Translation Lookaside Buffer Consistency: A Software Approach'', Third International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, Massachusetts, (April 1989), pages 113-122.

  6. C. A. R. Hoare, ``Monitors: An Operating System Structuring Concept'', Communications of the ACM, Volume 17, Number 10, (October 1974), pages 549-557.