A Virtual Shared Disk (VSD) is a software abstraction that allows multiple nodes, running independent images of the operating system, to access a disk device physically attached to only one of the nodes as if the disk device were attached to all nodes. In this paper we consider the design and implementation of such a VSD on the AIX operating system. We also describe a reliable VSD, which in the case of twin-tailed disks allows transparent takeover of the VSD by a backup node, while applications only experience a delay.
The availability of powerful microprocessors has made clusters an attractive alternative to monolithic systems. Applications that can partition their computation among several nodes can take advantage of this architecture, which typically offers better price-performance than the monolithic systems. Such applications include large scientific computations, database and transaction processing systems, decision support systems, and so on.
A microprocessor cluster consists of a number of separate computing systems, coupled with an interprocessor communication mechanism, such as a network or communications switch. Each computing system has its own processor, memory, and I/O subsystem, and runs a separate instance of the operating system. For maximum benefit, however, it is desirable for an application to be able to abstract from the specific computing system, and treat all nodes in a cluster as equivalent. This ability is sometimes called a ``single system image.''
A useful aspect of single system image is the requirement that the same I/O device resources be available to all processors in the cluster equally. This allows processing tasks to be freely moved between processors. Furthermore, it facilitates the development of parallel applications that adopt a data sharing model for their computation.
Many different approaches can be taken to providing the same I/O resources to all processors.
The Virtual Shared Disk (VSD) is a software disk proxy. It is a layer of software that can make a disk attached to one processor appear as if it were directly attached to all processors. The interface presented to applications is the same as for local devices: applications can open, close, read, write and issue control operations against the VSD.
For a disk there are basically two operations to perform: read and write of a block of data at a given sector on the disk. The VSD software takes the disk address that it is given and maps it to a triple (cpu, device, block) specifying that the actual disk address for this part of the VSD is on device device on processor cpu as block block. It then redirects the I/O operation to the actual cpu with the desired disk block. For clarity, let us look at both read and write operations separately.
For a read operation, the VSD block number is mapped to a (cpu, device, block) triple. If the cpu is the same processor as we are currently running, the read operation is redirected to the device and block number on the current processor, and the read operation is completed.
If the cpu is a different processor, a message is sent from this processor to the remote processor to request the read operation. The message causes the remote machine to read the desired block from the desired device on the remote processor. After the read is complete on the remote machine, an acknowledgement message (ACK) and the requested block of information are returned by the remote processor to the local processor which originated the request. The local processor can then copy the requested data into the buffer originally specified for the read operation, and complete the read operation.
For a write operation, a similar sequence occurs. If the mapping indicates that the block should be written by the current processor, the request is simply redirected to the correct device and block on the local processor. If the supporting processor is a remote processor, the local processor sends a message to the remote processor, with the data to be written, asking it to write the data to its device and block. When the remote processor finishes the write, it sends an ACK to the local processor, which can then complete the write operation.
The result of this design is that remote disks become available as local devices on all processors. Disks are virtual, since they may be physically attached or backed by a disk on a remote server. Each processor can have an identical set of VSDs defined which are backed by a local or remote disk. Spreading the physical devices over multiple nodes allows load balancing.
In addition, it is possible to ``merge'' separate remote disks together into the appearance of one larger shared disk, or split a large disk into several smaller disks. For example, if we had four disks, one on each of four processors, each disk being 40 Megabytes, we could define a mapping function as follows. Each block is assumed to be 4 Kilobytes, so each disk has 10,240 blocks.
Each disk stores 40 Megabytes, but the VSD stores 40,960 blocks of 4 Kbytes each, for a capacity of 160 Megabytes. A request to read block 31750 would be mapped to cpu 3, device 13,0 (on cpu 3), block 1031 (= 31750 - 30719).
The illustrated mapping merges the individual remote disks together, one complete disk after another. Another mapping could interleave the blocks from the different disks, so that data would be striped across the disks (and processors). For sequential access to the disk blocks this could provide important parallelism and performance improvements.
The performance of the VSD depends upon usage characteristics. A read or write to a part of the VSD that is attached to the local processor suffers only a small overhead in going through the code to map the request to the local processor. A read or write to a VSD that is supported by a remote processor, however, requires two interprocessor messages (the request and the ACK), and a copy of the disk block between the two processors. The performance penalty will depend upon the actual I/O time, the time for an interprocessor message, and the time to transfer the disk block between processors.
The main advantage of the VSD is that all processors can access the disks on all processors, that all processors can share access to the disks (that is they can not only access the disks, but can access the disks at the same time that other processors are accessing the same disks), and that the shared disk image that is supported is the sum of the disk space on the separate processors. No special hardware is needed.
We have implemented the VSD concept in two different AIX systems: the POWER/4 shared memory cluster and the SP/1 switched cluster. A VSD pseudo-device driver was written for the AIX 3.2 operating system running on each processor.
The VSD software is packaged as a device driver that is loaded to AIX as a kernel extension. After a VSD is configured and started on the nodes in the cluster (see Section 6), it can be accessed by applications on any node as if it were attached locally. The VSD appears in the system as a block device, just like a hard disk or a logical volume. (In AIX, logical volumes are the functional analog of BSD disk partitions, although they can span more than one physical disk.) The usual naming conventions are adopted; there is an entry in the /dev directory on every node for each VSD device (e.g., /dev/vsd3). All VSD's have the same major device number, which is different from the major number of any other kind of device. Applications can open and close a device, read from and write to the device, and issue ioctl operations.
Both the block interface as well as the character (raw) interface are supported; for example, an application can use either /dev/vsd3 or /dev/rvsd3. Access through the block interface goes through the kernel buffer. Since multiple instances of the operating system have access to the device, data accessed through the block interface could be cached in kernel buffers on several nodes, thus creating consistency problems. For example, a node A may write a page, while a subsequent read by node B may see a stale copy of the page cached in B's kernel buffers. Even if B reads the page from disk, it may still obtain the old version, if A's write is cached in the kernel buffers and has not reached the disk yet. This problem is not specific to the VSD; it applies equally to physically shared disks.
Since there is no kernel buffer coherency scheme, applications that access a shared disk, whether real or virtual, need to evaluate the tradeoffs between the block interface and the raw interface. The convenience of implicit caching when using the block interface must be weighed against the performance implications of explicitly flushing the kernel buffer pool (via sync) when the application requires that the data really reach the disk, in addition to the extra copy of data from user space to the kernel buffer. All distributed applications of which we are aware use the raw interface. Read-only applications, or those which access the shared disk from only one node, are not exposed to these concerns.
The standard device switch interface is supported by the VSD device driver running on each processor. The VSD device driver supports config(), open(), close(), read(), write(), strategy(), and ioctl() interfaces. All read() and write() operations are redirected to the strategy routine, which performs the actual I/O.
The core logic of the driver is implemented in the strategy routine. We illustrate the control flow in the system with an example (see Figure 1).
Figure 1. Control flow for a remote request.
Assume that an application issues a read request against a raw VSD device. The operating system invokes the read entry point of the VSD device driver. The read routine validates the arguments and invokes uphysio, a generic routine used by raw device drivers. If necessary, the request is broken by uphysio into multiple, smaller requests, which are passed to the device strategy routine using the standard buf structure interface.
The strategy routine looks up the device in the driver's internal data; since there may be several hundred VSD's configured, a hash table is used to expedite the lookup. The strategy routine performs some further validation of the request with respect to device status and constructs a control block for the request. Then it determines which node is currently the server for the VSD to which the I/O was issued; the server is the node to which the device is physically attached. Even if a disk is twin-tailed for recovery (see Section 5), one of the nodes holding the tails is designated as server and can access the disk; the other tail is only activated in case of failure.
The VSD is completely interrupt driven, there is no process associated with it, which avoids the overheads of daemons, kernel processes, etc. The initiation of the disk I/O on the server upon receipt of a request from a client (step (3) in Figure 1) is performed by the interrupt service routine that processes the incoming message. Since network I/O has higher priority than disk I/O, the incoming request may have interrupted the disk driver in a non-reentrant state, so the disk request is scheduled to run at the appropriate interrupt level.
Similarly, the interrupt service routine invoked upon completion of the physical disk I/O invokes the communication code to send the response back to the client. Finally, the arrival of the response (and possibly of data) at the client triggers the copy of data into the user space (if necessary) and termination processing; again, there is no process scheduled.
The original version of the VSD was built for the POWER/4, a shared memory cluster prototype built at IBM Austin. The POWER/4 has four nodes, each running a separate instance of AIX; the nodes have access to a large amount of shared memory.
For a read operation for a disk block on a remote server, a description of the read request is created and sent to the remote server. Messages are sent by placing them on a queue of messages in shared memory, and interrupting the remote server. The remote server receives the message, and initiates the read from the disk directly into shared memory. When the read completes, the disk interrupts the remote server, which sends a message back to the local processor. The local processor copies the disk block from shared memory to the buffer specified in the original read request.
Since all data transfers are done through the shared memory, the POWER/4 VSD exploits the shared memory to keep a common LRU cache of VSD disk blocks for all VSD devices and all nodes, as well as driver data structures common to all nodes. This modifies a read operation to the VSD. The VSD driver first checks to see if the desired block is in the shared memory LRU cache. If so, it is immediately copied to the user buffer, and the read operation is complete. If the block is not in the shared memory LRU cache, it is read into the shared memory LRU cache, either locally or remotely as appropriate, and then copied to the user buffer from shared memory.
For write operations, the user buffer is immediately copied into the shared memory LRU cache. A configuration option allows the write operation to return immediately or requires that the write be pushed through to disk before the write completes. An ioctl() function is provided to allow all dirty blocks in the shared memory LRU cache to be flushed to disk when necessary.
The use of the shared memory LRU cache provides a substantial performance gain, since commonly used disk blocks are almost always in the shared memory LRU cache. Blocks which are in the shared memory LRU cache are immediately available to all processors (subject to locking during modification of course). However, the VSD approach can also be used when physical shared memory is not available.
The VSD design is in terms of messages passed between the local client node and the (possibly remote) server node. While the use of shared memory can provide faster transfer of information, a networked or switched cluster can also be used. VSDs have been implemented on the SP/1 switched cluster, using IP datagrams to send both requests and data between nodes. Temporary storage managed in a buddy fashion is used to hold the data at a remote server.
The strategy routine logic is implemented as a finite state machine. On the server, a request goes through a sequence of states, which can be divided into three main categories: resource acquisition, actual I/O operation and resource release. The communications mechanism used to exchange requests and responses between clients and servers is a thin datagram protocol built on top of IP. It manages communications buffers (mbuf's), inserts the appropriate headers and invokes ip_output. The interface to the communications layer is a very simple procedure call. The parameters are the destination node, the address and size of the request control block, the address and size of the data block (if any), a descriptor of the address (for the case the data block is in user space and the kernel needs addressability to it) and a pointer to a function that is invoked when the transmission completes.
On the client side, no action needs to be taken when transmission completes; the ACK received from the server upon I/O completion triggers further processing on the client. Thus, the function pointer passed to the communications layer is NULL. On the server things are different. When transmission of the response block (and possibly data) completes, the resources held by the request must be released, so a pointer to the function that performs these tasks is passed to the communication layer.
The fact that the communication protocol is built on top of IP lends the VSD device driver portability. The code runs on wire connected clusters (e.g., clusters connected via ethernet or token ring), as well as on IBM's SP family of supercomputers, which have a switch connecting the nodes in a frame. We are currently experimenting with other communication protocols.
Distributed applications built on a shared disks model typically employ a high-level protocol (e.g., a global lock manager) to maintain data coherency across nodes and to achieve serializability. A typical scenario is that some node writes a modified page back to the shared disk and another node reads the page from the disk. Thus, a write of a particular page is often followed by a read. To avoid performing two physical I/O operations in these cases, we have implemented an optional LRU cache feature.
To exploit the cache feature, a logical block size is associated with all of the cached VSD's at configuration time. The logical block size may be different from the physical block size of the device. The physical block size is the smallest unit of data that can be transferred to or from the device (typically a 512-byte sector); all requests must be multiples of the physical block size. The logical block size is the prevailing request size, for which the double I/O elimination should be targeted. For most applications, the logical block size should be the same as the page size of the application.
Every server maintains an LRU cache of logical blocks for the devices it is serving. The cache is common for all devices served by a particular node, and the size of the cache is determined at configuration time. The size of the cache can be increased dynamically while the driver is running, without suspending the operation of the driver. To avoid coherency problems, blocks are only cached on the server node. To preserve device semantics (e.g., write persistence in case of crashes), the cache is a write-through cache (data is always written to the physical device before the I/O operation is complete). Finally, use of the cache is optional; the caching feature can be turned off on a per VSD device basis at configuration time.
If the caching feature is turned on for a VSD and a request for that VSD is a write of the logical block size, then a copy of the data is maintained in the cache. For remote requests, this represents no additional overhead. At the remote server the data must be copied out of the mbuf's, anyway; instead of using temporary space, a frame in the LRU is used. For local write requests, there is an additional copy between the user space and the LRU cache. When a read request for a logical block of a cacheable device arrives at the server, the LRU cache is checked for potential presence of that block. A hashed index provides quick lookup of logical blocks in the cache.
If a request is not of the logical block size, or if it is not at a logical block boundary, the LRU cache is bypassed at the server. If the request originated locally, the I/O takes place directly between the physical device and the user space; if the request was from a remote client, the temporary copy from the network is used. The LRU cache is also bypassed for local read requests of logical block size and appropriate alignment that miss in the LRU cache, to avoid the extra copy between kernel and user space.
When a write request for a cacheable device bypasses the LRU cache, logical blocks that reside in the LRU cache and overlap the byte range written by the bypassing operation will contain obsolete data, since they will not reflect the actual data on the disk. If such blocks are left in the LRU cache, subsequent read operations for those blocks will return invalid data in case of cache hit. To avoid this problem, just before a bypassing write returns to the user, the LRU cache is checked for the presence of overlapping blocks. If any are found, they are invalidated.
Applications that have data bouncing between nodes through the VSD could benefit significantly from the LRU cache. However, if such bouncing is infrequent or does not occur at all, it behooves the application to define its VSD's as non-cacheable, to avoid the extra copies and the invalidation checks.
Although the VSD is intended for use in an enclosed cluster with reliable communications, the possibility of lost messages cannot be excluded. Such loss could cause applications to hang waiting indefinitely for the completion of an I/O operation that will never take place or will never be acknowledged. We address this issue in a way that tries to optimize for the case when messages arrive properly, since this will be the most frequent case.
Lost messages are handled by retransmitting requests. When a request is sent from a client to a server, a timer is set for the request on the client. If the request message or the completion acknowledgement is lost, the timer expires and the request is retransmitted. To avoid the overhead of setting and clearing a timer for each individual request, a count-down field is initialized in the request and the request is placed on a wait queue. Periodically, this wait queue is scanned and the timers of all requests are decremented. Those requests that have expired (their timer has reached 0) are removed from the wait queue and placed again on the ready queue, so that they will be retransmitted to the server. There is a limit to the number of times a request can be retransmitted. If this limit is ever exceeded, the request fails and an error is returned to the application. Exponential back-off is used between successive retransmissions to avoid overloading an already congested server.
Request retransmissions can introduce data integrity problems. In particular, stale I/O requests and/or responses can cause confusion and produce erroneous results. We illustrate with an example shown in Figure 2.
Figure 2. Problem with retransmissions.
There are two client nodes, A and B, both trying to write the same page on server S. We assume that some concurrency control scheme is employed by the application, so that B attempts to write the data only after A has completed its write. Thus, the copy of the page written by B should be the final copy on disk.
Node A sends the request a to the server, but the request or its acknowledgement is delayed, so A retransmits request a'. When the response to the first request, a, arrives, A notifies node B, which sends its own write request b to the server. After b has been executed on the server, request a' arrives at the server, so stale data overwrites more recent data, which is not acceptable.
To deal with the problem outlined above, each node maintains a set of counters, one for every other node. Every time a request is transmitted, it atomically fetches and increments the counter for the corresponding node. Retransmitted requests obtain a new value of the counter. The server includes the request's counter value in its response. At the client, the counter value of the response is checked against that of the request. If they match, the response is accepted; otherwise, it is discarded. This implies that a request does not complete until the response to its most recent retransmission has been received.
Figure 3. Using counters.
Figure 3 shows what will happen in our example. Assume that the request first obtains a counter of 67. After the timeout, the request obtains a counter of 74. (The counter values need not be consecutive, since there may have been other, unrelated requests to the same server in the meantime). When the first response arrives, it contains a counter value of 67, so it is discarded, since the request now has counter value 74. Thus, the request waits until the second response with counter 74 arrives. The request completes and node B is notified that it can proceed with its own request. Note that the counters used by nodes A and B for requests destined for the same server are totally unrelated to each other.
The processing rules specified above ensure that the client always waits for the response to its most recent transmission, so that stale responses do not get into its way. However, the server may get confused by stale requests, as the example in Figure 4 illustrates.
Figure 4. Stale requests at the server.
Nodes A and B are again trying to write the same page on server S. Node A sends request a with counter 67, but it gets delayed, so A times out and resends the request a' with counter 74. The second message arrives before the first one, gets serviced, and the response with counter 74 is sent to node A. The response matches the request, so the request completes, node B is notified and it sends its write request b to the server. After request b has finished at the server, the stale request a arrives at the server and overwrites more recent data.
The server eliminates the stale request problem by keeping track of the highest counter value received from every client and discarding requests with counters smaller than the highest value ever seen from that client. In the example above, the stale request a' with counter 67 would simply get discarded. Note that this scheme may unnecessarily discard valid requests, just because they happened to arrive out of order with some other, unrelated request(s). This is not a problem, however, because requests will very rarely get reordered by the network.
Critical applications require continuous operation and high availability of their data. In this section, we show how the VSD exploits twin-tailed disks to allow continuous access to the data even when the primary server fails. VSD recovery occurs transparently to applications running on nodes other than the failed one; such applications observe only delays in accessing the data that was being served by the failed node, their requests do not fail.
The recovery of the VSD relies on the availability of a processor membership mechanism that constantly monitors nodes, detects node failures and triggers recovery actions. Such a mechanism is usually available on clusters (e.g., IBM's HACMP and SP products). When the processor membership detects a node failure, it notifies all remaining nodes. A policy module on each node determines which VSD's are affected by the node failure and which node is the backup server (if any). Appropriate reconfiguration commands are issued to the VSD device driver to modify its internal routing tables and start issuing requests for the affected VSD's to their new server. Requests that had been sent to the failed node will eventually time out and be resent to the new server.
If the server takeover is performed with one round of messages, problems may arise, because node failures do not necessarily follow the fail-stop model [Schlichting and Schneider 1983]. In particular, there is no guarantee that the node presumed dead is really dead. For example, the node may just be unreachable from a subset of the other nodes due to a network problem. We illustrate with an example in Figure 5.
Figure 5. Problem with one-phase takeover.
The membership service has determined that the old server is dead; however, the old server is still reachable from client B. Client A has been notified and has switched to the new server. Thus, the write of page P goes to the new server, completes, and then client B is notified and tries to read the same page. However, client B has not yet been notified of the change and directs its read to the old server. If an old version of page P was stored in the old server's cache, client B will read stale data.
To address the above problem we have extended the semantics of the device, as shown in Figure 6, to allow for two-phase transitions.
Figure 6. State diagram for the VSD.
A device is normally in one of two states, stopped or available. We have split the available state into two states, active and suspended. In the active state the device is fully operational; requests are accepted and serviced immediately. In the suspended state, the future of the device is in doubt; it is a temporary state until the device becomes either active or stopped. In the suspended state, requests are still accepted from the applications; however, they are placed on a hold queue instead of being executed immediately. Accepting requests is necessary, so that applications will only experience delays; if the requests were not accepted, the applications would fail and recovery would not be transparent.
When a node failure is detected, a two-phase protocol [Gray 1979] is run. In the first phase, all nodes suspend the affected VSD's; the old server (if alive) also invalidates any pages cached from those VSD's and releases control of the underlying real devices (see Section 6). After all nodes have completed the first phase, the second phase starts. In the second phase, all nodes resume normal operation of the VSD's (return them into the active state) and record the identity of the new server in their routing tables. The new server also assumes control of the underlying real device.
Using a two-phase protocol ensures that two nodes can never assume two different servers for the same VSD at the same time. In the worst case, a subset of the nodes believes the VSD is being served by some node, while the remaining nodes believe the VSD is suspended (temporarily no server). The same technique can be used to switch server nodes gracefully (i.e., when no failure has occurred), which allows for an orderly shutdown of a node for software or hardware maintenance.
We have packaged both the client and the server part of the driver in a single module, which has advantages and disadvantages. On the positive side, code replication is avoided. On the negative side, a lot of logic may be packed within the same module, making the code less maintainable. However, the most important consequence of this packaging decision is the feasibility of recovery. If we had packaged the VSD software as two different drivers, the client and the server for a particular VSD would have to load different modules. This may not be a problem for startup, but it would introduce additional complexity and delays at takeover time. In particular, the node taking over as the server for the VSD would have to switch from the client code to the server code, which could entail unloading and loading of modules. Furthermore, takeover transparency for the application would be difficult. When the failure occurs at the former server, requests may be executing client code on the node that will take over. These requests would have to be taken out of the client module and be fed into the server module, which would require modifications to the base operating system. When client and server are packaged in a single module, the takeover occurs internally in the module, without operating system intervention.
The commands that load the VSD device driver and configure the individual VSD's take a lot of arguments, as indicated by the previous sections. A system with a few tens of nodes could easily have several hundred VSD's, all of which would have to be configured on every node. Furthermore, the configuration data must be defined and stored in some configuration database, to provide persistence across reboots. Thus, there is a need for automating the configuration of VSD's across the nodes of a cluster, and we have provided tools that perform this task. Although the low-level configuration commands could accept any kind of block device as the underlying physical device, our high-level tools are geared towards logical volumes, which are expected to be the predominant kind of underlying device.
The first configuration requirement is that all nodes should refer
to the same VSD by the same name, so that
The next step is to map the unique name (and minor number) for a VSD to a real device. This could be done by specifying for every VSD an ordered pair (node, device number) that defines the node that serves the VSD and device number within that node for the real underlying device. For twin-tailed disks, two such pairs would be specified, one for the primary node, one for the backup. Thus, the following configuration would be possible: the real underlying devices for vsd1 and vsd2 are logical volumes in the same volume group (In AIX, a volume group is a collection of physical disks that are treated as the functional equivalent of a single large disk with aggregate capacity.), which is twin-tailed between nodes A and B. For vsd1, node A is specified as primary and node B as backup, while for vsd2 node B is primary and node A is backup. Consequently, during normal operation, both nodes A and B must access the volume group, to serve vsd1 and vsd2, respectively. However, such an arrangement is infeasible, because all logical volumes in a volume group must be controlled by the same operating system instance at any point in time; moreover, the takeover operation can be performed only on a per volume group basis.
To avoid the above problem and to preserve the layering approach in the system, a VSD is mapped to a pair of the form (volume group name, logical volume minor number). The node location of the volume group and the preference order for its tails (primary/backup, if it is twin-tailed) are defined in a separate table; the node location and tail preference for a VSD are the same as those of the volume group on which it resides.
We are still one step away from the final solution. Volume group names are not unique across operating system instances; a common example is the rootvg, which exists on all systems. Thus, we introduce global volume groups, which are globally unique names for ordinary volume groups. VSD's are mapped to minor numbers (logical volumes) within a global volume group. Each volume group is in turn mapped to a triple (volume group name, primary node, backup node), which specifies the name by which the volume group is known locally on the node(s) that have physical access to it. If the volume group is not twin-tailed, the backup node entry is left blank.
At startup time, or when reconfiguration occurs as a result of node failure, the current cluster configuration (i.e., which nodes are active members of the cluster) is used to determine which node (if any) should be serving each global volume group. Hardware takeover operations are based on this data. In case of volume group takeover, the VSD mapping table is examined to find which VSD's are affected, and the appropriate commands (e.g., suspend/resume) are issued for those VSD's.
We have presented the design and implementation of a Virtual Shared Disk, which allows multiple nodes with independent instances of AIX to access a disk device attached to a single node as if it were locally attached to each node. In the case of twin-tailed disks, the server of a device can be switched from one node to another transparently to the applications.
The performance for local accesses is indistinguishable from that of direct accesses to the underlying physical device. For remote accesses, performance is only limited by the communications protocol; there is no overhead associated with kernel processes or daemons.
The VSD is currently used as an enabler layer for the Oracle Parallel Server on IBM's SP family of supercomputers. Potential future uses include video servers and cluster logical volumes.
The authors gratefully acknowledge the contribution of Larry Brenner at IBM Kingston, who worked on the cluster services that drive VSD configuration and recovery.