Windows NT in a ccNUMA System

B.C. Brock, G.D. Carpenter, E. Chiprout, M.E. Dean, E.N. Elnozahy, D. Glasco, J.L. Peterson, R. Rajamony, F.L. Rawson, R.L. Rockhold

Austin, Texas 78758

March, 1999

This paper has been published. It should be cited as

Bishop Brock, Gary Carpenter, Eli Chiprout, Mark Dean, Elmootazbellah Elnozahy, David Glasco, James Peterson, Ramakrishnan Rajamony, Ron Rockhold and Andrew Zimmerman, "Windows NT in a ccNUMA System", 3rd USENIX Windows NT Symposium, Seattle, Washington, (July 1999).


We have built a 16-way multiprocessor system using Intel's 350MHz Xeon processors equipped with 4G of physical memory. Our implementation follows the ccNUMA paradigm. Such an environment poses several performance challenges to Windows NT, which was designed primarily to run in a small SMP environment in which memory is equidistant from all processors in the system. To overcome these problems and enable NT to run efficiently in this environment, we have implemented an abstraction called the Resource Set that allows threads to specify where memory should be allocated across the NUMA complex. Thus, threads can specify that memory should be allocated from banks that are close to the processors on which they run. This affinity in memory allocation can reap several performance benefits when running parallel applications. Our implementation largely disables the memory allocation policies of NT and replaces them with out-of-kernel policies and mechanisms, through an innovative use of device drivers and DLLs. Our results suggest that memory allocation affinity should become part of the standard API of Windows NT, as large-scale multiprocessors based on shared memory become the norm rather than the exception.

1 Introduction

There is an increasing need for powerful servers to meet the processing demands of modern distributed systems, transaction processing, and Internet data providers. Traditional server systems use proprietary mainframe computers or other large computer systems that are powerful and robust, though expensive. Recently, there has been an increasing demand for building servers out of commodity, off-the-shelf processors. In particular, symmetric multiprocessor (SMP) systems that use the Intel x86 processor line and run Windows NT are in increasing favor due to their low cost, the availability of a vast amount of software, and the success of this platform in the personal computer and workstation markets. Physical limits, however, impose restrictions on the size of SMP systems. Current SMP systems for the Intel Pentium II processor, for example, typically contain up to 4 processors.

To build larger servers that extend the system size beyond these limits, Cache Coherent Non Uniform Memory Access (ccNUMA) machines use a fast switch to connect several SMPs and provide a global, coherent shared memory. But these machines require an operating system that scales as the number of processors increases. The focus of this paper is on how Windows NT behaves in such a large-scale system. We have built a 16-way ccNUMA multiprocessor using Intel's 350MHz Xeon processors, and we report on our preliminary experience with running Windows NT 4.0. We have found that:

Our results suggest that significant performance improvements will be possible if Windows NT could be modified to directly support the non-uniform nature of the access to the resources that it manages.

2 Previous Work

Over the years several research ccNUMA machines have been constructed, such as Alewife[Ale], DASH[Len] and FLASH[Flash]. These systems have been built either by constructing processor and memory nodes from scratch or by combining pre-existing hardware using a combination of hardware modifications and interconnection fabrics. Recent years have seen the introduction of several affordable and general purpose ccNUMA or scalable shared-memory implementations including the Silicon Graphics Origin[SGI], the Sequent NUMA-Q[Sting], the Data General AViiON NUMALiiNE[DG] and the Unisys Cellular Multiprocessor[Uni]. All of these but the Origin use Intel IA-32 processors, and the Sequent and Data General systems are built using standard high-volume four-way SMP nodes. The Origin uses the R10000 processor and unique memory and I/O structures within as well as between its two-processor nodes. The Cellular MP also uses a unique arrangement of two-processor nodes and claims uniform access times to memory. Our hardware prototype is based on an extension of the Fujitsu Synfinity interconnect used in Fujitsu's teamserver [FJST].

Most of the work done on ccNUMA systems until very recently used a variant of UNIX as the operating system. For instance, SGI Irix 6.4 (or Cellular Irix) supports the ccNUMA features of the Origin while Sequent has historically had its own implementation of UNIX known as Dynix/ptx. Microsoft's direction for scaling Windows NT[NT] beyond standard SMPs has been to provide clustering, the Microsoft Cluster Service[MSCS], and for now NT lacks any explicit support for NUMA. However, some of the recent ccNUMA systems, including the Sequent NUMA-Q and the Fujitsu teamserver, have supported NT as either an alternative to UNIX or the primary operating system provided by the vendor. Our NT work has built on the implementation done for two-node systems by Fujitsu.

Although Splash[Spl] and Splash-2[Spl2] have been widely used in the academic community to measure MP performance ([Shrimp], [Len] and many others), most of the commercial server vendors have been more concerned about reporting numbers using the Transaction Processing Performance Council's TPC-C and TPC-D benchmarks[Gray], [TPC]. However, while the TPC benchmarks are good indicators of overall transaction processing performance, they do not offer very much insight into the reasons for the performance observed and require a large investment to set up and execute properly. For these reasons, we have chosen to study a subset of Splash-2 instead. Most of the Splash-2 numbers reported to date have been on UNIX although [Sch] used it to measure the performance of a distributed virtual memory system on Windows NT.

The value of affinity scheduling has been recognized even for SMP machines with caches[Vas], although in the context of placing threads together with their cache transients. Our affinity implementation may be viewed as an extension of this notion to placing threads with the main memory that they use. Although our implementation scheme is indirect since we do not have NT source code access, there is no reason to believe that Microsoft could not implement similar functionality directly in the NT executive. The work of the FLASH project[Ver] on improving data locality for ccNUMA machines used page migration and replication techniques rather than affinity allocation to improve memory reference locality. The Silicon Graphics Origin adds specialized hardware to support an efficient implementation of page migration. The Sequent, Data General and Unisys implementations all rely on splitting the larger machine into a set of communicating partitions, each of which runs a separate copy of the operating system, to deal with any scalability limitations that they encounter.

3 System Structure and Software Enhancements

3.1 Prototype Overview

We have constructed a 16-processor ccNUMA machine built out of four 4-processor SMP systems, each called a node. Each node contains four 350 MHz Intel Xeon processors with 1 gigabyte of RAM and a standard set of I/O peripherals. In addition, each node has a Mesh Coherence Unit (MCU). The MCU connects to a high-speed switch and provides coherent remote access to the memory and I/O devices of other nodes. The switch connects four nodes together to form a 16-processor system, providing a bandwidth of 1.6 gigabits per second per link per direction. The latency through the switch is approximately 30 nanoseconds, roughly 3 times the latency of access to local memory.

The MCU in each node snoops the node's memory bus and uses a directory-based cache coherence protocol to extend the coherence of memory across nodes. The MCUs exchange point-to-point messages over the switch to access remote memory and maintain cache coherence over the entire system. The MCU defines a standard 4-node memory map that effectively partitions a standard 4-gigabyte physical address space into four areas of 1 gigabyte each, one for each of the nodes in a 4-node system. In addition to memory, memory-mapped I/O and I/O space addresses are also remapped to allow processors to access the remote memory-mapped I/O and I/O space of other nodes.

3.2 Enabling NT to run on a ccNUMA System

Windows NT treats all memory in the system alike. Therefore we modified the lower layers of the system to give NT the illusion of a 16-way SMP system. To enable Windows NT to run on the 16-way system, we enhanced the BIOS and NT's Hardware Abstraction Layer (HAL) code. The 4-node system boots as four separate SMP systems. After the BIOS code on each completes execution, but before booting the operating system, the system executes a special extended BIOS (eBIOS), which reconfigures the four SMP nodes into one 16-way ccNUMA system. We modified the NT HAL to support remote inter-processor interrupts, and to provide access to remote I/O devices and I/O spaces by re-mapping them as necessary. The combination of the HAL and eBIOS code allows us to run Windows NT on what NT believes is a 16-way SMP with 4 gigabytes of memory.

The eBIOS also allows the system to be partitioned at boot time into smaller NUMA systems. For example, the eBIOS can partition the system into two 2-node systems, each with 8 processors, and 2 gigabytes of memory. Each partition runs a separate copy of Windows NT. Other configurations for partitioning the 16-way system into separate systems are also possible.

3.3 Binding and Affinity

Our 16-way ccNUMA system is an early glimpse at the shape of large multiprocessor systems to come in the foreseeable future. To scale well in such systems, an operating system must accommodate the variability in memory access times across the system. In particular, application threads must be bound to processors and allocated memory such that the majority of memory accesses by threads come from the node on which they run. Otherwise, application threads will not fully exploit the advantages of the high-performance hardware, or indeed, may suffer in performance if the majority of their memory accesses come from remote nodes. Currently, Windows NT's support is limited to binding threads to particular processors; there is no notion of allocating memory such that threads satisfy the majority of their memory accesses from a local node. Indeed, on a system like ours, NT does not recognize that not all memory frames are alike.

To overcome these problems, we have implemented a solution that works around NT's limitations. This solution provides the application with an extended Application Program Interface (API) that allows control over memory allocation, and disables NT's native memory allocation to a large extent. The resulting improvement on application performance suggests that this kind of support should become part of the standard API of Windows NT. We describe first the extended API and then show how it can be implemented currently in our system.

4 Adding Memory Affinity Support to Windows NT

Operating systems on SMP architectures optimize performance by constraining threads to execute on the smallest number of CPUs possible. Such "bindings" create an affinity between the threads and the cache footprints on the processors on which they run, resulting in good cache hit ratios. Windows NT supports process and thread affinity in the form of a bit mask, one bit per eligible CPU, which identifies where threads can execute. Nevertheless, having no notion of distance between CPUs and memory, NT does not support affinity in memory allocation. To exploit the large scale of a ccNUMA system, however, the binding must also take into account the memory allocation such that threads get the majority of their memory accesses from a local node. Therefore, we have designed an API that provides this service.

Our ccNUMA API is based on the Resource Set (RSet) abstraction. Intuitively, an RSet groups several resources in such a way that a thread bound to that resource set consumes resources exclusively from the set. For example, one could specify an RSet containing the processors and physical memory available to one node. A thread that is bound to such an RSet will run only on processors within the node and allocate memory from the pool of page frames within the node. Thus, the thread will not need to access remote resources.

RSets are flexible. They can combine resources in two different nodes, include resources spanning different nodes, contain a partial set of the resources on one node, or any other combination that suits the application's needs. Furthermore, they can be manipulated using union and intersection operations, and can also form hierarchies, where one large RSet is made to contain several RSets. To simplify the interface, our library provides a global RSet that contains all resources in the system. Thus, an application can build additional RSets by specifying subsets of the global one. The library implementation itself is supported by a call we added to the HAL that returns all the resources available in the system.

The RSet implementation provides fine-grained affinity control. Functions in the API fall into the following categories:

We have implemented the RSet abstraction using a combination of DLLs, backed by an NT pseudo device driver. Furthermore, we also provide a higher level API that provides a simplified interface to the RSet abstraction similar to traditional thread packages. Thus, an application programmer can use the RSet facility indirectly through the familiar interface of a thread library, or can access it directly for full control.

4.1 Allocating virtual memory according to an RSet

There is no mechanism in Windows NT to constrain what ranges of physical memory should back user's virtual memory. We have provided an interface similar to VirtualAlloc(), which supports the specification of an RSet. This interface allocates locked virtual memory that is backed by system memory as specified by the nodes identified in rset_for_mem's MemNodeSet:

void* NumaVirtualAllocLocked(void* start_addr, size_t *pages, pRSet rset_for_mem);

Despite the lack of directed memory allocation in NT, if one can ensure that the system memory backing a range of pages meets the requirements, then by locking the pages in memory, the mapping should remain constant as long as the application remains active. The challenge, then, is to coerce NT into backing the virtual pages with memory from the requested node(s). To accomplish this, we have implemented the following approach. First, the NumaVirtualAllocLocked routine allocates the number of pages requested, mapping the virtual addresses into the caller's address space. It then increases the calling process's Working Set Size by the number of pages requested and uses VirtualLock to lock the range into memory. It then passes the address and length of the virtual memory range to our NumaMem device driver, which returns a list of the nodes whose real memory backs each page. Each page not "correctly" backed is modified, released, and reallocated. This process repeats until all pages are correctly backed.

The NumaMem device driver translates a given virtual memory address to its real address, determines the node which "owns" that real address, and returns that node identifier to the caller. For improved efficiency, a virtual address range can be passed to the driver, and a list of node identifiers (one per virtual page) will be returned. To enable the mapping between a real address and a node identifier (as required by the NumaMem driver), we have exported from our HAL implementation an interface which provides not only the memory-range-to-node-id mapping, but also the system topology information (e.g. number of nodes, which processors are in which node, etc.). Note that we are not advocating this implementation as a permanent solution to the lack of memory allocation affinity in Windows NT. Instead, this implementation allows us to study the potential benefits of including such support in the system. Our results suggest that this support should become an integral part of Windows NT as it moves to scale up to large system configurations.

4.2 Affinity Policies

Using our Windows NT device driver, we have implemented two different classes of affinity policies, one for thread execution and one for memory allocation. The thread affinity policies are:

We have several memory affinity policies, including:

5 Experience

To test the effects of extending the RSet abstraction to include memory affinity support in Windows NT, we have run several experiments to study the performance of our prototype as it runs parallel applications. The application suite consists of three applications from Stanford's Splash-2 benchmark suite[Spl, Spl2], in addition to a parallel program for matrix multiplication. The three applications from the Splash-2 benchmark are:

Details of each of these applications are available in the references. Additionally, the matrix multiplication program multiplies two 1024x1024 matrices.

For each application we have modified the program to make use of the affinity support which we implemented. Due to time constraints, we outline only the major results of our study. We will present the entire evaluation in detail in the final version.

5.1 Effects of NT's Allocation Policies vs. Effects of Memory Allocation Affinity

Compared to the native memory allocation policy of NT, affinity in memory allocation reduced the running time of jacobi by a factor of 2 on 16 processors, and by a factor of 4 on 4 processors. Similarly, affinity in memory allocation reduced the running time of matrix multiply by about 30% on 16 processors. These results suggest that affinity of memory allocation should be part of NT's standard interface, especially as large-scale servers become the norm.

5.2 Effects of Local Bus Contention

Local bus contention is a serious problem as modern processors increase in speed. For jacobi, we ran two configurations:

The second configuration outperformed the first, suggesting that local bus saturation carries a penalty that far outweighs the latency of the interconnect. This problem persists with larger configurations. This result suggests that the traditional bus architecture for building SMP nodes is near the end of its usefulness, and that alternative architectures for building SMP nodes, such as crossbars, should receive serious consideration.

5.3 Scaling

An application such as 3Dfft (in its current implementation) does not scale well on an SMP machine. Not surprisingly, the same is true on a NUMA machine. Conversely, an application that scales well in an SMP environment, such as jacobi or matrix multiply, seems to scale well in the ccNUMA environment. We saw no surprises in these results, suggesting that the inherent algorithmic characteristics of applications will be the defining factor in how they scale up to NUMA.

5.4 Effects of Algorithmic Changes

Programs written to run in an SMP system will run without modification on a ccNUMA system. However, it has been argued that NUMA-aware programs could further exploit the performance advantages of a ccNUMA architecture, for instance by changing the algorithm used to exploit the characteristics of the NUMA environment (e.g. large amounts of physical memory). In our experiments, we have implemented a modified matrix multiplication algorithm in which the multiplicand matrix is replicated at each node. Interestingly, this is similar to distributed algorithms that solve the same problem using message passing. Perhaps this suggests that transforming message-passing programs to run on NUMA will be more fruitful than just running the existing SMP code. Our experiments have shown that the modified version reduces running time by a factor of 3 on 16 processors when compared to the SMP version. This result suggests that algorithmic changes will need to be considered when the best possible performance of the machine is required.

6 Conclusions

We have built a ccNUMA 16-way multiprocessor system using Intel's 350MHz Xeon processors, equipped with 4G of physical memory. We have found that the standard facilities available in Windows NT do not cope well with the demands of such an architecture. Windows NT was designed primarily to run on small SMP systems in which memory is equidistant from all processors, and therefore it does not perform optimally in an environment where the distance between processors and memory is non-uniform across the system. To overcome these problems and enable NT to run efficiently in this environment, we have implemented an abstraction called the Resource Set that allows threads to specify where memory should be allocated across the NUMA complex. Thus, threads can specify that memory should be allocated from banks that are close to the processors on which they run. This affinity in memory allocation can reap several performance benefits when running parallel applications. Our implementation largely disables the memory allocation policies of NT and replaces them with out-of-kernel policies and mechanisms, through an innovative use of device drivers and DLLs. Our results show that:

Our results suggest that memory allocation affinity should become a part of the standard API of Windows NT, as more and more large-scale multiprocessors based on shared memory are becoming the norm rather than the exception.