Because the components of the OSKit are intended to be usable in many different kernel and user-mode environments, it is important that their requirements be defined fully, not only in terms of interface dependencies, but also in terms of execution environment. A component’s execution environment involves many implicit and sometimes subtle factors, such as whether and in what cases the component is reentrant. A client using a component must either run it in an execution environment directly compatible with the one the component expects, or provide such an environment by adding appropriate glue code. For example, most of the OSKit’s components are not inherently thread- or multiprocessor-safe; although they can be used in multithreaded or multiprocessor environments, they rely on the client OS to provide the necessary locking as part of the “glue” the client OS uses to interface with the component.
In order to make it reasonably easy for the client OS to adapt to the execution environment requirements of OSKit components, the execution models used by the OSKit components are purposely kept as simple and easy to understand as possible without sacrificing significant functionality or performance. Another factor driving the OSKit components’ execution models is the goal of being able to integrate large amounts of code, such as device drivers and network protocol stacks, virtually unmodified from traditional kernels such as BSD and Linux; this requirement inevitably places some restrictions on the execution models of the OSKit components derived from these source bases. However, in general, even the execution models of these complex OSKit components are considerably simpler and better defined than the original execution environments of the legacy systems from which the components were adapted; this simplification is enabled by carefully designed OSKit glue code surrounding the legacy code, which emulates the subtler aspects of the legacy execution environments and hides them from the OSKit user.
Since the OSKit includes components with a wide range of sizes and complexity, and as a result different components naturally tend to have different levels of dependence on their execution environment, the OSKit defines a set of standard execution models arranged on a continuum from simplest and least restrictive to most complex and demanding on the client OS. This range of execution models allows the client OS to adopt the simpler OSKit components quickly and with minimal fuss, while still providing all the detailed environment information necessary for the client OS to incorporate the most demanding components. For example, the basic memory-management libraries, LMM and AMM, use an extremely simple execution model with very few restrictions, allowing them to be used immediately in almost any context. The device driver libraries, on the other hand, necessarily place much greater demands on the client since they must deal with interrupts, DMA, and other hardware facilities closely tied to the execution environment; however, these requirements are defined explicitly and generically so that, with a little effort, even these components can be used in many different contexts.
The remaining sections of this chapter describe each of the execution models used in the OSKit, in order from simplest to most complex. In general, each succeeding execution model builds on and extends the previous model.
The pure execution model is the most basic of the OSKit execution environments; it has the following properties:
Figure 2.1 illustrates the pure execution environment. Since pure functions and components contain no implicit global state, separate “instances” or uses of these components by the client can be treated as completely independent objects: although each individual instance of the component is single-threaded and non-reentrant, the client OS can manage synchronization between independent instances of the component in any way it chooses.
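As an illustration, the following sketch (not an actual OSKit component, just the idiom) shows what the pure model looks like in practice: all state lives in an instance structure supplied explicitly by the caller, so the client may drive two instances from different threads without any synchronization between them.

    #include <stddef.h>

    /* Hypothetical pure-model component: a fixed-size stack whose state
       lives entirely in the instance structure passed in by the caller.
       No global or static data is touched. */
    struct stack {
            int    items[64];
            size_t top;
    };

    void stack_init(struct stack *s)
    {
            s->top = 0;
    }

    int stack_push(struct stack *s, int v)
    {
            if (s->top == sizeof(s->items) / sizeof(s->items[0]))
                    return -1;              /* this instance is full */
            s->items[s->top++] = v;
            return 0;
    }

    /* Each individual instance is single-threaded and non-reentrant, but
       because the code touches only the instance it is given, one thread
       may use instance `a' while another uses instance `b' with no
       locking between them. */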
Components that use the impure execution model act just like those operating in the pure model, except that they may contain global shared state and therefore must be treated as a single “instance” for synchronization and reentrancy purposes. For example, many of the functions in liboskit_kern, the kernel support library, set up and access global processor registers and data structures, and are therefore impure. Similarly, some of the functions in the minimal C library, such as malloc and its relatives, inherently require the use of global state and therefore are impure.
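The following sketch (again hypothetical, not actual OSKit code) illustrates why such components are impure: a malloc-like allocator necessarily keeps its bookkeeping in component-wide static data, so every use shares a single implicit instance and the client OS must serialize all calls into it.

    #include <stddef.h>

    /* Hypothetical bump allocator: the pool and cursor are static, so
       the whole component behaves as one shared instance. */
    static char   pool[4096];
    static size_t next;

    void *bump_alloc(size_t size)
    {
            void *p;

            if (next + size > sizeof(pool))
                    return NULL;            /* out of memory */
            p = &pool[next];
            next += size;
            return p;
    }

    /* Two threads racing on `next' could hand out overlapping memory, so
       in a multithreaded client a single lock must protect every entry
       into this component. */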
The impure execution model has the following properties:
The blocking execution model extends the impure model to support non-preemptive multithreading; it is essentially the execution model used in traditional Unix-like kernels such as BSD and Linux. Components that use the blocking model have the same properties as those that use the impure model, except that they are re-entrant with respect to some callback functions; these functions are known as blocking functions. This means that, whenever the component makes a call to a blocking function, the client OS may re-enter the component in a different context, e.g., in the context of a different thread or processor. The set of callback functions that are assumed to be blocking is part of the component’s contract with the client OS; in general, a function is blocking unless it is explicitly stated to be nonblocking.
In order to use a blocking-model component in a fully preemptive, interruptible, or multiprocessor environment, the client OS must do essentially the same thing to adapt to the component’s execution model as it would for a pure or impure component: namely, it must surround the component with a lock which is taken before entry to the component and released on exit. However, because the component supports re-entrancy through callbacks that are defined to be blocking functions, the client OS’s implementations of these callback functions may release the component’s lock temporarily and re-acquire it before returning into the component, thereby allowing other concurrent uses of the component.
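The following sketch shows one way a client OS might implement this pattern, using POSIX threads purely as a stand-in for the client’s own synchronization primitives; component_read, client_sleep, and client_wakeup are hypothetical names invented for the example rather than actual OSKit interfaces.

    #include <pthread.h>

    static pthread_mutex_t component_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  wakeup         = PTHREAD_COND_INITIALIZER;

    extern int component_read(void *buf, int len);   /* placeholder component entry point */

    /* Client-side wrapper: the component's lock is taken on entry and
       released on exit, exactly as for a pure or impure component. */
    int client_read(void *buf, int len)
    {
            int rc;

            pthread_mutex_lock(&component_lock);
            rc = component_read(buf, len);
            pthread_mutex_unlock(&component_lock);
            return rc;
    }

    /* A blocking callback supplied by the client OS.  Because it is
       defined to be a blocking function, it may drop the component's
       lock while waiting; pthread_cond_wait releases the mutex
       atomically and re-acquires it before returning, which lets other
       threads enter the component in the meantime. */
    void client_sleep(void)
    {
            pthread_cond_wait(&wakeup, &component_lock);
    }

    /* Called from elsewhere in the client OS to end the wait. */
    void client_wakeup(void)
    {
            pthread_cond_broadcast(&wakeup);
    }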
The interruptible blocking execution model, unlike the other models, allows a component to be re-entered at arbitrary points under certain conditions. In the interrupt model, there are two “levels” in which a component’s code may execute: interrupt level and process level. (Note that we use the term “process” only because it is the traditional term in this context; the components in fact have no notion of an actual “process.”) The interrupt model also assumes a one-bit interrupt enable flag, which the component can control through well-defined callback functions which must be provided by the client OS. When the component is running at either level and interrupts are enabled, the component may be re-entered at interrupt level, typically to execute an interrupt handler of some kind. To be specific, the properties of the interruptible blocking model are as follows:
Although on the surface it may appear that these requirements place severe restrictions on the host OS, the required execution model can in fact be provided quite easily even in most kernels supporting other execution models. The following sections describe some example techniques for providing this execution model.
Global spin lock: The easiest way to provide the required execution model for interruptible, blocking components in a nonpreemptive, process-model, multiprocessor kernel such as Mach 3.0 is to place a single global spin lock around all code running in the device driver framework. A process must acquire this lock before entering driver code, and release it after the operation completes. (This includes both process-level entry through the component’s interface, and interrupt-level entry into the components’ interrupt handlers.) In addition, all blocking callback functions which the host OS supplies should release the global lock before blocking and acquire the lock again after being woken up. This way, other processors, and other processes on the same processor, can run code in the same or other drivers while the first operation is blocked. Note that this global lock must be handled carefully in order to avoid deadlock situations. A simple, “naive” non-reentrant spin lock will not work, because if an interrupt occurs on a processor that is already executing process-level driver code, and that interrupt tries to lock the global lock again, it will deadlock because the lock is already held by the process-level code. The typical solution to this problem is to implement the lock as a “reentrant” lock, so that the same processor can lock it twice (once at process level and once at interrupt level) without deadlocking.
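A minimal sketch of such a reentrant lock follows, written with C11 atomics in place of the kernel’s own spin-lock primitives; cur_cpu is a hypothetical routine returning the current processor’s identifier.

    #include <stdatomic.h>

    #define NO_OWNER (-1)

    extern int cur_cpu(void);            /* hypothetical: id of the executing processor */

    static atomic_int owner = NO_OWNER;  /* processor currently holding the lock */
    static int        depth;             /* nesting count; only touched while holding the lock */

    void driver_lock(void)
    {
            int me = cur_cpu();
            int expected = NO_OWNER;

            if (atomic_load(&owner) == me) {
                    depth++;             /* re-entry: e.g., an interrupt handler running on the
                                            processor already executing process-level driver code */
                    return;
            }
            while (!atomic_compare_exchange_weak(&owner, &expected, me))
                    expected = NO_OWNER; /* spin until the lock becomes free */
            depth = 1;
    }

    void driver_unlock(void)
    {
            if (--depth == 0)
                    atomic_store(&owner, NO_OWNER);
    }

Because the ownership check succeeds only on the processor that already holds the lock, an interrupt handler on that processor simply nests instead of deadlocking.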
Another strategy for handling the deadlock problem is for the host OS simply to disable interrupts before acquiring the global spin lock and enable interrupts after releasing it, so that interrupt handlers are only called while the process-level device driver code is blocked. (In this case, the osenv_intr_enable and osenv_intr_disable calls, provided by the OS to the drivers, would do nothing because interrupts are always disabled during process-level execution.) This strategy is not recommended, however, because it will increase interrupt latency and break some existing partially-compliant device drivers which busy-wait at process level for conditions set by interrupt handlers.
Spin lock per component: As a refinement to the approach described above, to achieve better parallelism, the host OS kernel may want to maintain a separate spin lock for each component. This way, for example, a network device driver can be run on one processor while a disk device driver is being run on another. This parallelism is allowed by the framework because components are fully independent and do not share data with each other directly. Of course, the client OS must be careful to maintain this independence in the way it uses the components: for example, if the client OS wants to have one component make calls to another (e.g., to connect a file system component to a disk device driver), and it wants the two components to be separately synchronized and use separate locks, the client OS must interpose some of its own code to release one lock and acquire the other during calls from one component to the other.
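The sketch below illustrates such interposed glue, again using POSIX threads as a stand-in for the client’s primitives; fs_lock, disk_lock, disk_read, and glue_read_block are hypothetical names, not actual OSKit interfaces.

    #include <pthread.h>

    static pthread_mutex_t fs_lock   = PTHREAD_MUTEX_INITIALIZER;  /* protects the file system component */
    static pthread_mutex_t disk_lock = PTHREAD_MUTEX_INITIALIZER;  /* protects the disk driver component  */

    extern int disk_read(unsigned blkno, void *buf);   /* placeholder disk driver entry point */

    /* Supplied by the client OS as the file system's block-I/O interface.
       It is called with fs_lock held; dropping that lock while the disk
       driver runs lets other threads keep using the file system. */
    int glue_read_block(unsigned blkno, void *buf)
    {
            int rc;

            pthread_mutex_unlock(&fs_lock);     /* leave the file system's locking domain */
            pthread_mutex_lock(&disk_lock);     /* enter the disk driver's locking domain */
            rc = disk_read(blkno, buf);
            pthread_mutex_unlock(&disk_lock);
            pthread_mutex_lock(&fs_lock);       /* back into the file system */
            return rc;
    }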
The issues and solutions for implementing the required execution model in preemptive kernels are similar to those for multiprocessor kernels: basically, locks are used to protect the component’s code. Again, the locking granularity can be global or per-component (or anything in between, as the OS desires). However, in this case, a blocking lock must be used rather than a simple spin lock because the lock must continue to be held if a process running the component’s code is preempted. (Note the distinction between OS-level “blocking,” which can occur at any time during execution of the component’s code but is made invisible to the component’s code through the use of locks; and component-level “blocking,” which only occurs when a component calls a blocking function.)
An alternative solution to the preemption problem is simply to disable preemption while running the component’s code. This solution is likely to be simpler in terms of implementation and to have less overhead, but it may greatly increase thread dispatch latency, possibly defeating the purpose of kernel preemption in the first place.
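A sketch of this approach, assuming hypothetical preempt_disable and preempt_enable primitives supplied by the host kernel:

    extern void preempt_disable(void);   /* hypothetical host-kernel primitives */
    extern void preempt_enable(void);
    extern int  component_op(void);      /* placeholder component entry point */

    int client_op(void)
    {
            int rc;

            preempt_disable();           /* component code cannot be preempted, so no
                                            blocking lock is needed around it */
            rc = component_op();
            preempt_enable();
            return rc;
    }

    /* As with the lock-based approach, a blocking callback supplied by
       the client OS would re-enable preemption while waiting and disable
       it again before returning into the component. */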
Many existing kernels, particularly those derived from Unix or BSD, implement a range of “interrupt priority levels,” typically assigning different levels to different classes of devices such as block, character, or network devices. In addition, some processor architectures, such as the 680x0, directly support and require the use of some kind of IPL-based scheme. Although the OSKit device drivers and other OSKit components do not directly support a notion of interrupt priority levels, such a scheme can be simulated fairly easily in IPL-based kernels by assigning a particular IPL to each component used by the kernel. In this case, the osenv_intr_disable routine provided by the kernel does not disable all interrupts, but instead disables interrupts only at the interrupt priority level that the client OS has assigned to the calling component, and at all lower priority levels. This way, although the code in each component is only aware of interrupts being “enabled” or “disabled,” the host OS can in effect enforce a general IPL-based scheme.
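A sketch of this arrangement follows. Here splraise and component_ipl are hypothetical glue routines (a real BSD-derived kernel would use its own spl primitives and would track the saved level per component or per invocation rather than in a single static variable), while osenv_intr_disable and osenv_intr_enable are the callbacks the client OS supplies to the component.

    extern int  splraise(int ipl);       /* hypothetical: raise to the given IPL, return the old one */
    extern void splx(int ipl);           /* BSD-style: restore a previously saved IPL */
    extern int  component_ipl(void);     /* hypothetical: IPL assigned to the calling component */

    static int saved_spl;                /* level to restore when the component "enables" interrupts */

    /* "Disable" interrupts: block only this component's IPL and all lower levels. */
    void osenv_intr_disable(void)
    {
            saved_spl = splraise(component_ipl());
    }

    /* "Enable" interrupts: drop back to the level in effect before the disable. */
    void osenv_intr_enable(void)
    {
            splx(saved_spl);
    }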
An obvious limitation, of course, is that all of the device drivers in a particular driver set must generally have the same IPL. However, this is usually not a problem, since the drivers in a set are usually closely related anyway.
Many small kernels use a pure interrupt model internally rather than a traditional process model; this basically means that there is only one kernel stack per processor rather than one kernel stack per process, and therefore kernel code can’t block without losing all of the state on its stack. This is probably the most difficult environment in which to use the framework, since the framework fundamentally assumes one stack per outstanding component invocation. Nevertheless, there are a number of reasonable ways to work around this mismatch of execution model, some of which are described briefly below as examples: