Windows NT in a ccNUMA System

B.C. Brock, G.D. Carpenter, E. Chiprout, M.E. Dean, E.N. Elnozahy, D. Glasco, J.L. Peterson, R. Rajamony, F.L. Rawson, R.L. Rockhold

IBM
Austin, Texas 78758

March, 1999

This paper has been published. It should be cited as

Bishop Brock, Gary Carpenter, Eli Chiprout, Mark Dean, Elmootazbellah Elnozahy, David Glasco, James Peterson, Ramakrishnan Rajamony, Ron Rockhold and Andrew Zimmerman, ``Windows NT in a ccNUMA System'', 3rd USENIX Windows NT Symposium, Seattle, Washington, (July 1999).

ABSTRACT

We have built a 16-way multiprocessor system using Intel's 350MHz Xeon processors equipped with 4G of physical memory. Our implementation follows the ccNUMA paradigm. Such an environment poses several performance challenges to Windows NT, which was designed primarily to run in a small SMP environment in which memory is equi-distant to all processors in the system. To overcome these problems and enable NT to run in this environment efficiently, we have implemented an abstraction called the Resource Set that allows threads to specify where memory could be allocated across the NUMA complex. Thus, threads can specify that memory should be allocated from banks that are close to the processors on which they run. This affinity in memory allocation can reap several performance benefits when running parallel applications. Our implementation largely disables the memory allocation policies of NT and substitutes them with out-of-kernel policies and mechanisms, through an innovative use of device drivers and DLLs. Our results suggest that memory allocation affinity should become a part of the standard API of Windows NT, as more and more large-scale multiprocessors based on shared memory are becoming the norm rather than the exception.

1 Introduction

There is an increasing need for powerful servers to meet the processing demands of modern distributed systems, transaction processing, and Internet data providers. Traditional server systems use proprietary mainframe computers or other large computer systems that are powerful and robust, though expensive. Recently, there has been an increasing demand for building servers out of commodity, off the shelf processors. In particular, symmetric- multiprocessor systems (SMP) that use the Intel's x 86 processor line and run Windows NT are in increasing favor due to their low cost, the availability of a vast amount of software, and the success of this platform in the personal computer and workstation markets. Physical limits, however, impose restrictions on the size of SMP systems. Current SMP systems for the Intel Pentium II processor, for example, typically contain up to 4 processors.

To build larger servers that extend the system size beyond these limits, Cache Coherent Non Uniform Memory Access (ccNUMA) machines use a fast switch to connect several SMPs and give a global, coherent shared memory. But these machines require an operating system that scales as the number of processors increases. The focus of this paper is on how Windows NT behaves in such a large-scale system. We have built a 16-way ccNUMA multiprocessor using Intel's Xeon^TM 400MHz processors, and we report on our preliminary experience with running Windows NT 4.0. We have found that:

Policies that ensure affinity in memory allocation are very effective in improving application performance.
For some applications, saturation of the local bus is the main performance impediment. This is somewhat surprising since one would expect the latency of the NUMA interconnect to be the major source of overhead in a ccNUMA system.
Algorithmic changes to existing SMP applications can yield substantial improvement.
Generally speaking, applications that scale well on an SMP seem to scale well also in the ccNUMA environment, and vice versa.

Our results suggest that significant performance improvements will be possible if Windows NT could be modified to directly support the non-uniform nature of the access to the resources that it manages.

2 Previous Work

Over the years several research ccNUMA machines have been constructed, such as Alewife[Ale], DASH[Len] and FLASH[Flash]. These systems have been built either by constructing processor and memory nodes from scratch or by combining pre-existing hardware using a combination of hardware modifications and interconnection fabrics. Recent years have seen the introduction of several affordable and general purpose ccNUMA or scalable shared- memory implementations including the Silicon Graphics Origin[SGI], the Sequent NUMA-Q[Sting], the Data General Avion NUMALiiNE[DG] and the Unisys Cellular Multiprocessor[Uni]. All of these but the Origin use Intel IA-32 processors, and the Sequent and Data General systems are built using standard high-volume four-way SMP nodes. The Origin uses the R10000 processor and unique memory and I/O structures within as well as between two processor nodes. The Cellular MP also uses a unique arrangement of two-processor nodes and claims uniform access times to memory. Our hardware prototype is based on an extension of the Fujitsu Synfinity interconnect used in Fujitsu's teamserver [FJST].

Most of the work done on ccNUMA systems until very recently used a variant of UNIX as the operating system. For instance, SGI Irix 6.4 (or Cellular Irix) supports the ccNUMA features of the Origin while Sequent has historically had its own implementation of UNIX known as Dynix/ptx. Microsoft's direction for scaling Windows NT[NT] beyond standard SMPs has been to provide clustering, the Microsoft Cluster Service[MSCS], and for now NT lacks any explicit support for NUMA. However, some of the recent ccNUMA systems, including the Sequent NUMA-Q and the Fujitsu teamserver, have supported NT as either an alternative to UNIX or the primary operating system provided by the vendor. Our NT work has built on the implementation done for two-node systems by Fujitsu.

Although Splash[Spl] and Splash-2[Spl2] have been widely used in the academic community to measure MP performance ([Shrimp], [Len] and many others), most of the commercial server vendors have been more concerned about reporting numbers using the Transaction Processing Performance Council's TPC-C and TPC-D benchmarks[Gray], [TPC]. However, while the TPC benchmarks are good indicators of overall transaction processing performance, they do not offer very much insight into the reasons for the performance observed and require a large investment to set up and execute properly. For these reasons, we have chosen to study a subset of Splash-2 instead. Most of the Splash-2 numbers reported to date have been on UNIX although [Sch] used it to measure the performance of a distributed virtual memory system on Windows NT.

The value of affinity scheduling has been recognized even for SMP machines with caches[Vas] although in the context of placing threads together with their cache transients. Our affinity implementation may be viewed as an extension of this notion into placing threads with the main memory that they use. Although our implementation scheme is indirect since we do not have NT source code access, there is no reason to believe that Microsoft could not implement similar function directly in the NT executive. The work of the FLASH project[Ver] on improving data locality for ccNUMA machines used page migration and replication techniques rather than affinity allocation to improve memory reference locality. The Silicon Graphics Origin adds specialized hardware to support an efficient implementation of page migration. The Sequent, Data General and Unisys implementations all rely on splitting the larger machine into a set of communicating partitions, each of which runs a separate copy of the operating system, to deal with any scalability limitations that they encounter.

3 System Structure and Software Enhancements

3.1 Prototype Overview

We have constructed a 16-processor ccNUMA machine built out of four 4-processor SMP systems, each called anode . Each node contains four 350 MHz Intel Xeon processors with 1 gigabyte of RAM and a standard set of I/O peripherals. In addition, each node has a Mesh Coherence Unit (MCU). The MCU connects to a high-speed switch and provides coherent remote access to the memory and I/O devices of other nodes. The switch connects four nodes together to form a 16-processor system, providing a bandwidth of 1.6 gigabits per second per link per direction. The latency through the switch is approximately 30 nanoseconds, 3 times slower than access to local memory.

The MCU in each node snoops the node's memory bus and uses a directory-based cache coherence protocol to extend the coherence of memory across nodes. The MCUs exchange point-to-point messages over the switch to access remote memory and maintain cache coherence over the entire system. The MCU defines a standard 4-node memory map that effectively partitions a standard, 4G bytes physical address space into 4 areas of 1 gigabyte each, one for each of the nodes in a 4-node system. In addition to memory, memory-mapped I/O and I/O space addresses are also remapped to allow processors to access the remote memory-mapped I/O and I/O space of other nodes.

3.2 Enabling NT to run on a ccNUMA System

Windows NT treats all memory in the system alike. Therefore we modified the lower layers of the system to give NT the illusion of a 16-way SMP system. To enable Windows NT to run on the 16-way system, we enhanced the BIOS and NT's Hardware Abstraction Layer (HAL) code. The 4-node system boots as four separate SMP systems. After the BIOS code on each completes execution, but before booting the operating system, the system executes a special extended BIOS (eBIOS), which reconfigures the four SMP nodes into one 16-way ccNUMA system. We modified the NT HAL to support remote inter-processor interrupts, and to provide access to remote I/O devices and I/O spaces by re-mapping them as necessary. The combination of the HAL and eBIOS code allow us to run Windows NT on what NT believes is a 16-way SMP with 4 gigabytes of memory.

The eBIOS also allows the system to be partitioned at boot time into smaller NUMA systems. For example, the eBIOS can partition the system into two 2-node systems, each with 8 processors, and 2 gigabytes of memory. Each partition runs a separate copy of Windows NT. Other configurations for partitioning the 16-way system into separate systems are also possible.

3.3 Binding and Affinity

Our 16-way ccNUMA system is an early glimpse at the shape of large multiprocessor systems to come in the foreseeable future. To scale well in such systems, an operating system must accommodate the variability in memory access times across the system. In particular, application threads must be bound to processors and allocated memory such that the majority of memory accesses by threads come from the node on which they run. Otherwise, application threads will not exploit fully the advantages of the high-performance hardware, or indeed, may suffer in performance if the majority of their memory accesses come from remote nodes. Currently, Windows NT's support is limited to binding threads to particular processors, but there is no notion of allocating the memory such that the threads would get the majority of their memory accesses from a local node. Indeed, on a system like ours, NT does not recognize that all memory frames are not alike.

To overcome these problems, we have implemented a solution that works around NT's limitations. This solution provides the application with an extended Application Program Interface (API) that allows control over memory allocation, and disables NT's native memory allocation to a large extent. The resulting improvement on application performance suggests that this kind of support should become part of the standard API of Windows NT. We describe first the extended API and then show how it can be implemented currently in our system.

4 Adding Memory Affinity Support to Windows NT

Operating systems on SMP architectures optimize performance by constraining threads to execute on the smallest number of CPUs possible. Such "bindings" create an affinity between the threads and the cache footprints on the processors on which they run, resulting in good cache hit ratios. Windows NT supports process and thread affinity in the form of a bit mask, one bit per eligible CPU, which identifies where threads can execute. Nevertheless, having no notion of distance between CPUs and memory, NT does not support affinity in memory allocation. To exploit the large scale of a ccNUMA system, however, the binding must also take into account the memory allocation such that threads get the majority of their memory accesses from a local node. Therefore, we have designed an API that provides this service.

Our ccNUMA API is based on the Resource Set (RSet) abstraction. Intuitively, an RSet groups several resources in such a way that a thread bound to that resource set consumes resources exclusively from the set. For example, one could specify an RSet containing the processors and physical memory available to one node. A thread that is bound to such an RSet will run only on processors within the node and allocate memory from the pool of page frames within the node. Thus, the thread will not need to access remote resources.

RSets are flexible. They can combine resources in two different nodes, include resources spanning different nodes, contain a partial set of the resources on one node, or any other combination that suits the application needs. Furthermore, they can be manipulated using union and intersection operations, and can also form hierarchies, where one large RSet is made to contain several RSets. To simplify the interface, our library provides a global RSet that contains all resources in the system. Thus, an application can build additional RSets by specifying subsets of the global one. The support for the library implementation itself takes place by adding a call to the HAL that returns all the resources available in the system.

The RSet implementation provides fine-grained affinity control. Functions in the API fall into the following categories:

Determining the system configuration.
Creating and manipulating RSets.
Allocating virtual memory according to an RSet.
Binding processes and threads by an RSet.

We have implemented the RSet abstraction using a combination of DLLs, backed by an NT pseudo device driver. Furthermore, we also provide a higher level API that provides a simplified interface to the RSet abstraction similar to traditional thread packages. Thus, an application programmer can use the RSet facility indirectly through the familiar interface of a thread library, or can access it directly for full control.

4.1 Allocating virtual memory according to an RSet

There is no mechanism in Windows NT to constrain what ranges of physical memory should back user's virtual memory. We have provided an interface similar to VirtualAlloc(), which supports the specification of an RSet. This interface allocates locked virtual memory that is backed by system memory as specified by the nodes identified in rset_for_mem's MemNodeSet:

void*  NumaVirtualAllocLocked(void* start_addr, size_t *pages, pRSet rset_for_mem);

Despite the lack of directed memory allocations in NT, if one can assure that the system memory backing a range of pages meets requirements, then by locking the page in memory, the mapping should remain constant as long as the application remains active. So the challenge, then, is to coerce NT into backing the virtual pages with memory from the requested node(s). To accomplish this, we have implemented the following approach. First, theNumaVirtualAllocLocked routine allocates the number of pages requested, mapping the virtual addresses into the caller's address space. Then it increases the calling process' Working Set Size by the number of pages requested and VirtualLocks the range into memory. It then passes the address and length of the virtual memory range to ourNumaMem device driver which returns a list of the nodes whose real memory backs each page. For each page not "correctly" backed, that page is modified, released, and reallocated. This process repeats until all pages are correctly backed.

The NumaMem device driver translates a given virtual memory address to its real address, determines the node which "owns" that real address, and returns that node identifier to the caller. For improved efficiency, a virtual address range can be passed to the driver, and a list of node identifiers (one per virtual page) will be returned. To enable the mapping between a real address and a node identifier (as required by the NumaMem driver), we have exported from our HAL implementation an interface which provides not only the memory range to node id mapping, but also the system topology information (e.g. number of nodes, which processors are in which node, etc.). Note that we are not advocating this implementation as a permanent solution of the lack of memory allocation affinity in Windows NT. Instead, this implementation allows us to study the potential benefits of including such support in the system. Our results suggest that this support should become an integral part of Windows NT as it moves to scale up to large system configurations.

4.2 Affinity Policies

Using our Windows NT device driver, we have implemented two different classes of affinity policies, one for thread execution and one for memory allocation. The thread affinity policies are:

Float. This is the default NT policy. Threads are eligible to run on any processor at any time.
Fill. In this policy, as many threads are bound to a node as there are processors before we continue to the next node.
Round Robin. In this policy, threads are allocated such that the first thread is allocated to the first node, the second thread allocated to the second node, and so on in a round robin fashion.

We have several memory affinity policies, including:

Local. Memory allocation is bound to be from the node on which the corresponding thread executes.
Any. Memory allocation can take place on any node.
Remote. Only memory that is remote to this thread is used. This allows us to determine an application's sensitivity to memory affinity.

5 Experience

To test the effects of adding the RSet abstraction to include memory affinity support in Windows NT, we have run several experiments to study the performance of our prototype as it runs parallel applications. The application suite consists of three applications from the Stanford's Splash-2 benchmark[spl, spl2], in addition to a parallel program for matrix multiplication. The three applications from the Splash-2 benchmark are:

Jacobi: 500 iterations
Ocean: 514x514
3DFFT

Details of each of these applications are available in the references. Additionally, the matrix multiplication program multiplies two 1024x1024 matrices.

For each application we have modified the program to make use of the affinity support which we implemented. Due to time constraints, we outline only the major results of our study. We will present the entire evaluation in detail in the final version.

5.1 Effects of NT's Allocation Policies vs. Effects of Memory Allocation Affinity

Compared to the native memory allocation policy of NT, affinity in memory allocation resulted in a reduction in running time by a factor of 2 for jacobi on 16 processors, and a factor of 4 on 4 processors. Similarly, affinity in memory allocation resulted in about a factor of 30% reduction in running time for matrix multiply on 16 processors. These results suggest that affinity of memory allocation in NT should be part of the standard interface, especially as more and more large-scale servers become the norm.

5.2 Effects of Local Bus Contention

Local bus contention is a serious problem as modern processors increase in speed. For jacobi, we ran two configurations:

In the first, four threads ran in the same SMP, and sharing the same bus, and allocating all their memory from the SMP node,
In the second, four threads were distributed across all the nodes in the ccNUMA complex, such that one thread was running per node and affinity was adjusted accordingly.

The second configuration outperformed the first, suggesting that local bus saturation had a penalty that far outweighs the latency of the interconnect. This problem persists with larger configurations. This result suggests that the traditional bus architecture for building SMP nodes is near the end of its usefulness, and that alternative architectures for building SMP nodes such as crossbar should receive serious consideration.

5.3 Scaling

An application such as 3Dfft (in its current implementation) does not scale up well for an SMP machine. Not surprisingly, the same results are true for a NUMA machine. Conversely, an application that scales well in an SMP environment such as jacobi or matrix multiply, seems to scale well in the ccNUMA environment. We did not see any surprises in these results, suggesting that inherent algorithmic characteristics of applications will be the defining factor in how they scale up to NUMA.

5.4 Effects of Algorithmic Changes

Programs written to run in an SMP system will run without modifications on a ccNUMA system. However, it has been argued that NUMA-aware programs could further exploit the performance advantages of a ccNUMA architectures, for instance by changing the algorithm used to exploit the characteristics of the NUMA environment (e.g. large amounts of physical memory). In our experiments, we have implemented a modified matrix multiplication algorithm in which the multiplicand matrix is replicated at each node. Interestingly, this is similar to distributed algorithms that solve the same problem using message passing. Perhaps this suggests that transforming message-passing programs to run on NUMA will be more fruitful than just running the existing SMP code. Our experiments have shown that the modified version results in a reduction in running time by a factor of 3 on 16 processors when compared to the SMP version. This result suggests that algorithmic changes will need to be considered when exploiting the best possible performance of the machine is necessary.

6 Conclusions

We have built a ccNUMA 16-way multiprocessor system using Intel's 350MHz Xeon processors and equipped with 4G of physical memory. We have found that the standard facilities available in Windows NT do not cope well with the demands of such architecture. Windows NT was designed primarily to run on small SMP environments in which memory is equi-distant to all processors in the system, and therefore would not perform optimally in an environment where the distance between processors and memory is not uniform across the system. To overcome these problems and enable NT to run in this environment efficiently, we have implemented an abstraction called the Resource Set that allows threads to specify where memory could be allocated across the NUMA complex. Thus, threads can specify that memory should be allocated from banks that are close to the processors on which they run. This affinity in memory allocation can reap several performance benefits when running parallel applications. Our implementation largely disables the memory allocation policies of NT and substitutes it with nout-of-kernel policies and mechanisms, through an innovative use of device drivers and DLLs. Our results show that:

Policies that ensure affinity in memory allocation are very effective in improving the performance of the applications. For the jacobi program from the Splash-2 benchmark, we obtain a 300 to a 400% reduction of the total running time of the application compared to the same run using the native memory allocation policy of NT.
For some applications, saturation of the local bus is the main performance impediment. This is somewhat surprising since one would expect the latency of the NUMA interconnect to be the major source of overhead in a ccNUMA system. We have verified this anomaly by running the jacobi program and matrix multiply on a 4- way SMP node sharing the same local bus, and by distributing the four threads of the program in a round-robin fashion among the four 4-way SMPs that form the NUMA complex. The second configuration does not have any local bus contention and is show to outperform the 4-way SMP version often by a factor of 2, even though it runs across the interconnect.
Algorithmic changes to existing SMP applications can yield substantial improvement. For instance, by replicating the read-only multiplicand matrix in a matrix multiplication algorithm, we can reduce the running time of the program by a factor of 3 compared to the vanilla SMP implementation when run over NUMA.
Generally speaking, applications that scale well on an SMP seem to scale well also in the ccNUMA environment, and vice versa.

Our results suggest that memory allocation affinity should become a part of the standard API of Windows NT, as more and more large-scale multiprocessors based on shared memory are becoming the norm rather than the exception.

References:

[Alewife] Anant Agarwal, Ricardo Bianchini, David Chaiken, Kirk L. Johnson, David Kranz, John Kubiatowicz, Beng-Hong Lim, Ken Mackenzie, and Donald Yeung. The MIT Alewife Machine: Architecture and Performance. Proceedings of the 22nd International Symposium on Computer Architecture , June 1995.
[DG] Data General's NUMALiiNE Technology: The Foundation for the AV 25000 Server, http://www.dg.com/about/html/av25000_foundation.htm.
[FJST] Wolf-Dietrich Weber, Stephen Gold, Pat Helland, Takeshi Shimizu, Thomas Wicki, and Winfried Wilcke, The Synfinity Interconnect Architecture: A Cost-Effective Infrastructure for High-Performance Servers, Proceedings of ISCA'97, the 24th Proceedings of the Annual International Symposium on Computer Architecture.
[Flash] Jeffrey Kuskin, David Ofelt, Mark Heinrich, John Heinlein, Richard Simoni, Kourosh Gharachorloo, John Chapin, David Nakahira, Joel Baxter, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy. The Stanford FLASH Multiprocessor. In Proceedings of the 21st International Symposium on Computer Architecture, pages 302-313, Chicago, IL, April 1994.
[Gray] Jim Gray, ed., The Benchmark Handbook, 2nd edition, Morgan Kaufman, 1993.
[Len] Daniel E. Lenoski and Wolf-Dietrich Weber, Scalable Shared-Memory Multiprocessing, Morgan Kaufman, 1995.
[MSCS] Vogels, W., Dumitriu, D., Birman, K. Gamache, R., Short, R., Vert, J., Massa, M., Barrera, J., and Gray, J., "The Design and Architecture of the Microsoft Cluster Service -- A Practical Approach to High-Availability and Scalability", Proceedings of the 28th symposium on Fault-Tolerant Computing, Munich, Germany, June 1998.
[NT] David Solomon, Inside Windows NT, 2nd edition, Microsoft Press, 1998.
[Sch] Martin Schulz and Hermann Hellwagner, Extending NT Virtual Memory by SCI-based Hardware DSM 2nd USENIX Windows NT Symposium, August 1998, Seattle WA, USA.
[SGI] James Laudon and Daniel Lenoski, The SGI Origin: A ccNUMA Highly Scalable Server, Proceedings of the 24th International Symposium on Computer Architecture , 1997, Pages 241 - 251.
[Shrimp] Yuanyuan Zhou, Liviu Iftode and Kai Li, Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems, Second Symposium on Operating System Design and Implementation, Seattle, WA, 1996.
[Spl] Jaswinder Pal Singh, Wolf-Dietrich Weber, and Anoop Gupta. SPLASH: Stanford Parallel Applications for Shared- Memory. In Computer Architecture News, vol. 20, no. 1, pages 5-44. A
[Spl2] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 24-36, Santa Margherita Ligure, Italy, June 1995.
[Sting] Tom Lovett and Russell Clapp, "STiNG: A CC-NUMA computer system for the commercial marketplace", Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996
[TPC] Transaction Processing Performance Council, http://www.tpc.org.
[Uni] Unisys, Cellular MultiProcessing Architecture, http://www.marketplace.unisys.com/ent/cmpwh.pdf.
[Vas] Raj Vaswani and John Zahorjan. The Implications of Cache Affinity on Processor Scheduling for Multiprogrammed, Shared Memory Multiprocessors. Proc. 13th Symposium on Operating System Principles, October, 1991.
[Ver] Ben Verghese, Scott Devine Anoop Gupta, and Mendel Rosenblum, Operating system support for improving data locality on CC-NUMA compute servers. Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996.