Extremely Reliable Operating System
Posted by : Unknown
Monday, July 1, 2013
ABSTRACT :-
EROS, the Extremely Reliable Operating System, addresses the
issues of reliability and security by combining two ideas from earlier
systems: capabilities and a persistent single-level store. Capabilities unify
object naming with access control. Persistence extends this naming and access
control uniformly across the memory hierarchy; main memory is viewed simply as
a cache of the single-level store. The combination simplifies application
design, allows programs to observe the "principle of least
privilege," and enables active objects to be constructed securely. EROS is
built on a simple object model that is easy to implement and understand.
INTRODUCTION :-
EROS: The Extremely
Reliable Operating System
EROS is a new
operating system originally implemented at the University of Pennsylvania. The
project has now migrated to Johns Hopkins University. EROS merges some very old
ideas in operating systems with some newer ideas about performance and resource
management. The result is a small, secure, real-time operating system that
provides orthogonal persistence.
HARDWARE REQUIREMENTS :-
What hardware does EROS run on?
EROS runs on x86-class PCs, 486 and higher. It does not support the 386,
because the 386 supervisor mode does not honour the write-protect bit in the page table entries.
Support for networking is limited: the only supported network interface
cards are the 3Com 3C509 and 3Com 3C905. IDE and EIDE drives are fully supported.
SCSI support is planned for the future, and there is no graphics support at present.
FEATURES :-
1. Pure Capability Architecture :
EROS is a pure capability system. A capability uniquely identifies
an object and a set of access rights. Processes holding a capability can
perform the operations permitted by those access rights on the named object.
Holding a capability is a necessary and sufficient condition for accessing the
associated object with the authority granted by that capability.
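As a rough illustration of the idea (not the kernel's actual key format), a capability can be pictured as a pair of an object identifier and a rights mask; all names below are hypothetical:

    /* A minimal sketch of the capability idea, assuming a hypothetical
     * layout; the real EROS key format is defined by the kernel, not here. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t oid_t;            /* names a page, node, or process */

    enum cap_rights {
        CAP_READ   = 1u << 0,
        CAP_WRITE  = 1u << 1,
        CAP_INVOKE = 1u << 2,
    };

    typedef struct capability {
        oid_t    object;               /* which object the key names     */
        uint32_t rights;               /* what the holder may do with it */
    } capability;

    /* Holding the key is necessary and sufficient: the only check made is
     * against the rights mask carried by the key itself. */
    static bool cap_permits(const capability *c, uint32_t requested)
    {
        return (c->rights & requested) == requested;
    }

Any operation a process attempts is checked only against the rights carried by the key it presents; there is no separate access-control list to consult.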
2. Orthogonal Global Persistence :
The basic idea of orthogonal global persistence is quite simple:
on a periodic basis, or when requested by an authorized application, a
consistent snapshot of the entire system state is taken. This consistent
snapshot includes the state of all running programs, the contents of main
memory, and any necessary supporting data structures.
Global persistence means that the state of all processes is
captured at one instant; in the event that the system is forced to recover from
the snapshot, all applications are consistent with respect to each other.
Orthogonal persistence means that applications do not need to take
any special action to be a consistent part of the snapshot.
EROS persistence is a kernel service. While this makes the EROS
kernel larger than non-persistent kernels such as QNX, non-persistent kernels
do not provide a comparable means of transparent recovery.
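To make the idea concrete, the checkpoint cycle can be pictured very loosely as a periodic kernel routine; every name below is hypothetical, and the real checkpoint machinery is considerably more sophisticated:

    /* A loose sketch of orthogonal global persistence. Every helper named
     * here is hypothetical, and the real EROS checkpointer is considerably
     * more sophisticated than a stop-and-copy loop. */
    #include <stdbool.h>

    extern bool checkpoint_interval_elapsed(void);
    extern bool checkpoint_requested_by_authorized_app(void);
    extern void freeze_all_processes(void);
    extern void capture_memory_and_kernel_structures(void);
    extern void resume_all_processes(void);
    extern void write_snapshot_to_disk(void);

    void checkpoint_tick(void)
    {
        if (!checkpoint_interval_elapsed() &&
            !checkpoint_requested_by_authorized_app())
            return;

        /* Capture a mutually consistent image of every running program,
         * main memory, and the supporting kernel data structures. */
        freeze_all_processes();
        capture_memory_and_kernel_structures();
        resume_all_processes();        /* applications take no special action */

        /* The captured image is then migrated to disk. */
        write_snapshot_to_disk();
    }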
3. Security :
Security experts worry greatly about confinement. Confinement is
the process of making sure that information does not leak to unauthorized
users.
While confinement remains a crucial problem in secure and
mission-critical applications, the confinement problem is not currently very
important for most applications. The on-going fall in hardware prices, coupled with the shift to
client-server technology, has made this problem fairly easy to solve. To build
a secure server, buy a dedicated machine, turn off all network services except
the one you intend to run, and let that service perform its own authorization.
Given this approach, the remaining security hazards come from
three sources:
1) Hardware bugs, some of which cannot be corrected by software.
2) Implementation bugs, such as failure to check input buffer lengths.
3) Misused authority, where a program operating with special privilege is made to use that privilege at an inappropriate time.
5. Stateless Supervisor :
Forthcoming.
6. Deadlock-Free Supervisor :
Forthcoming.
7. A Small Kernel :
Forthcoming.
x86 Bootstrap
This note describes how the new x86 bootstrap code works. The x86
bootstrap has been rewritten several times, and the code is difficult to read
because portions of it are in 16-bit assembler. It is written both as a
possible resource for users and as information for other x86 OS authors.
1. Sequence of Operation :
Here is a summary discussion of the sequence of events in booting
an x86 machine.
1.1 Load Master Boot Record
The BIOS first loads sector zero of the disk at (linear) 0x7c00,
and branches to it by jumping to the segmented 16-bit address 0:0x7c00. On
entry to the MBR, the following values are in registers:
- %CS:%EIP 0x0:0x7c00
- %DL BIOS identifier of the booted drive
No assumptions should be made about other registers. In particular,
I am not aware of any safe assumptions concerning the other segment registers.
If this is a floppy boot, then this is not an MBR, and execution
proceeds as described under boot record, below.
If this is in fact an MBR sector, the code embedded in the MBR
record moves itself out of the way (where to is not documented, and no safe
assumptions can be made) and then searches the partition table for the
``active'' partition. If an active partition is found, the MBR loads sector zero
of that partition at linear address 0x7c00 and branches to it. If no partition
is active, processing halts at this point.
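In outline, and ignoring real-mode segmentation details, the MBR's search of the partition table looks roughly like the C sketch below. The field offsets follow the standard PC partition-table layout; the helper routines are hypothetical stand-ins, since the real code is 16-bit assembler:

    /* Sketch of the MBR's active-partition search, written in C for
     * readability; the real code is 16-bit assembler and the helper
     * routines here are hypothetical stand-ins. */
    #include <stdint.h>

    #define MBR_PART_TABLE_OFFSET 0x1BE   /* standard PC partition table */
    #define PART_ACTIVE           0x80

    struct part_entry {                   /* 16 bytes each, 4 entries  */
        uint8_t  boot_flag;               /* 0x80 if active            */
        uint8_t  chs_start[3];
        uint8_t  type;
        uint8_t  chs_end[3];
        uint32_t lba_start;               /* first sector of partition */
        uint32_t n_sectors;
    } __attribute__((packed));

    extern void load_sector_at_7c00(uint32_t lba);  /* hypothetical BIOS wrapper */
    extern void jump_to_7c00(void);                 /* hypothetical              */
    extern void halt(void);                         /* hypothetical              */

    void mbr_boot(const uint8_t *mbr)     /* the relocated 512-byte MBR */
    {
        const struct part_entry *pt =
            (const struct part_entry *)(mbr + MBR_PART_TABLE_OFFSET);

        for (int i = 0; i < 4; i++) {
            if (pt[i].boot_flag == PART_ACTIVE) {
                load_sector_at_7c00(pt[i].lba_start); /* partition boot record */
                jump_to_7c00();                       /* does not return       */
            }
        }
        halt();                           /* no active partition: stop here */
    }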
1.2 Load Partition Boot Record :
The partition (or floppy) boot record is entered with the
following register environment:
- %CS:%EIP 0x0:0x7c00
- %DL BIOS identifier of the booted drive
- %ES:%SI Points to the partition table entry IFF this was a hard disk boot (hard disk boot only). This is the only RELIABLE way to know which image was booted, as there may be multiple instances of EROS installed, and some MBR multiboot loaders do not set the active flag on the in-memory partition table.
1.3 Load Extended Bootstrap :
From this point on, what happens is entirely the business of the
operating system-specific boot code. The initial boot sector has 512 bytes less
some tables in which to load a more general bootstrap.
A complication in reading the EROS bootstrap code is that it is
written using a 32-bit assembler. Some operations can be coded directly, but
don't mean what they say they mean (e.g. ``mov %eax,%ebx'' looks like a 32 bit
operation but is actually a 16 bit operation). Others must be prefixed by
extended opcode bytes because the linker only gives us 32-bit relocations.
Still others must be hand-coded as code bytes, bypassing the assembler
mechanisms almost entirely.
Suffice it to say that this is entirely unmaintainable, and an
utter pain in the ass to read. Bryan Ford did us all a great service by
introducing the code16 directive. Unfortunately, not all Linux assemblers
support this directive, there are bugs in the current implementation (different
in different assembler versions), and it does not resolve the linker's lack of
support for 16-bit relocations. For these reasons we do not use it.
For this reason, the EROS bootstrap logic loads the rest of the
primary bootstrap as fast as possible, gets into 32-bit code, and then does
its best to stay there.
Some OS bootstraps load the secondary bootstrap cleverly, making
use of the information in the partition table. The EROS bootstrap instead
follows the *BSD convention: the sectors to be loaded are precomputed by a
bootstrap installer utility. The EROS bootstrap code simply reads and applies
this precomputed table. The table originally compiled into the raw bootstrap
binary is suitable for use on a 1.44M floppy.
The extended bootstrap is loaded to a precompiled linear address
(presently 0x80000, controlled in boot.h).
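The precomputed table can be pictured as a small array compiled into the bootstrap image. The structure below is a hypothetical illustration of the idea, not the actual format used by the EROS bootstrap installer:

    /* Hypothetical sketch of the *BSD-style precomputed load table; the
     * actual on-disk format is defined by the EROS bootstrap installer. */
    #include <stdint.h>

    struct load_extent {
        uint32_t start_sector;     /* absolute sector on the boot device   */
        uint32_t n_sectors;        /* how many consecutive sectors to read */
    };

    #define EXT_BOOT_LOAD_ADDR 0x80000u   /* 0x80000, per boot.h */

    extern const struct load_extent boot_extents[];  /* patched by the installer */
    extern const uint32_t boot_extent_count;
    extern void bios_read_sectors(uint32_t lba, uint32_t count, void *dest);

    void load_extended_bootstrap(void)
    {
        uint8_t *dest = (uint8_t *)EXT_BOOT_LOAD_ADDR;

        /* Simply read and apply the table; no file-system knowledge needed. */
        for (uint32_t i = 0; i < boot_extent_count; i++) {
            bios_read_sectors(boot_extents[i].start_sector,
                              boot_extents[i].n_sectors, dest);
            dest += boot_extents[i].n_sectors * 512u;
        }
    }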
2. EROS Extended Bootstrap :
The goal of the extended bootstrap is to enter the operating system.
It proceeds in three steps (a rough sketch follows the list):
- Enter 32-bit protected mode.
- See if a ramdisk image is to be loaded. If so, load it. This may require decompression.
- Load the kernel from the disk (or ramdisk).
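Sketched roughly in C, with hypothetical routine names standing in for the real ones:

    /* Rough top-level flow of the extended bootstrap; every routine name
     * below is a hypothetical stand-in for the real code. */
    #include <stdbool.h>

    extern void enter_32bit_protected_mode(void);
    extern bool ramdisk_image_requested(void);
    extern void load_ramdisk_image(void);            /* may decompress  */
    extern void load_kernel_from_boot_device(void);  /* disk or ramdisk */
    extern void jump_to_kernel(void);

    void extended_bootstrap(void)
    {
        enter_32bit_protected_mode();       /* step 1 */

        if (ramdisk_image_requested())      /* step 2 */
            load_ramdisk_image();

        load_kernel_from_boot_device();     /* step 3 */
        jump_to_kernel();
    }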
2.1 Entering 32-bit Protected Mode :
On entry into the extended bootstrap code, the first thing done is
to enter 32-bit protected mode. This allows us to use the assembler
straightforwardly, and more importantly to use the C compiler to write much of
the remaining bootstrap.
For the most part, the only tricky aspect to entering protected
mode is the number of constraints on the problem. The global descriptor table
must contain the following:
- Code and data descriptors for the bootstrap code as 32-bit code
- Code and data descriptors for a 32-bit mode flat linear mapping of the entire address space (to be used by the newly-loaded kernel)
- A code descriptor for the bootstrap code as 16-bit code
The last is required because we will need to switch back into
16-bit mode in order to call the BIOS to load further disk sectors. For a
similar reason, the stack of the secondary bootstrap must fall within the low
1Mbyte of real memory (this is why we did not load the extended bootstrap above
the 1Mbyte boundary).
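As an illustration of these constraints, a descriptor table along the lines described might be declared as below. This is a hedged sketch using the standard x86 descriptor encoding and 0-based segments throughout for simplicity; the actual EROS bootstrap table is defined by its source, not here:

    /* Sketch of a GDT meeting the constraints above, using the standard
     * x86 descriptor encoding. 0-based flat segments are used throughout
     * for simplicity; the actual EROS bootstrap descriptors may differ. */
    #include <stdint.h>

    #define ACC_CODE 0x9A        /* present, ring 0, executable, readable */
    #define ACC_DATA 0x92        /* present, ring 0, data, writable       */

    #define FLG_32BIT_4K 0xC     /* G=1, D=1: 32-bit, 4 KiB granularity   */
    #define FLG_16BIT    0x0     /* G=0, D=0: 16-bit, byte granularity    */

    struct gdt_entry {
        uint16_t limit_lo;
        uint16_t base_lo;
        uint8_t  base_mid;
        uint8_t  access;
        uint8_t  limit_hi_flags; /* high 4 bits of limit, then flags */
        uint8_t  base_hi;
    } __attribute__((packed));

    #define DESC(base, limit, access, flags) {                       \
        (uint16_t)((limit) & 0xFFFF), (uint16_t)((base) & 0xFFFF),   \
        (uint8_t)(((base) >> 16) & 0xFF), (access),                  \
        (uint8_t)((((limit) >> 16) & 0x0F) | ((flags) << 4)),        \
        (uint8_t)(((base) >> 24) & 0xFF) }

    static struct gdt_entry gdt[] = {
        { 0 },                                     /* null descriptor            */
        DESC(0, 0xFFFFF, ACC_CODE, FLG_32BIT_4K),  /* 32-bit bootstrap code      */
        DESC(0, 0xFFFFF, ACC_DATA, FLG_32BIT_4K),  /* 32-bit bootstrap data      */
        DESC(0, 0xFFFFF, ACC_CODE, FLG_32BIT_4K),  /* 32-bit flat code (kernel)  */
        DESC(0, 0xFFFFF, ACC_DATA, FLG_32BIT_4K),  /* 32-bit flat data (kernel)  */
        DESC(0, 0x0FFFF, ACC_CODE, FLG_16BIT),     /* 16-bit code for BIOS calls */
    };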
Finally, the extended bootstrap places its heap at 1Mbyte (growing
upward). This allows the decompression library to run, but means that malloc'd
memory cannot be passed as a buffer to the BIOS, because the BIOS cannot
address such memory.
2.2 Balance of Extended Bootstrap :
The balance of the extended bootstrap is written in a mix of
C and C++. Experience suggests that the bootstrap code changes, and that the C code
serves as its own best documentation.
In fact, all of this code should be written in C. The only reason
that C++ was used was for compatibility with some structure declarations used
by the kernel proper. Note, in particular, that the program is linked without
the standard runtime environment, and that global constructors are not
executed. This isn't a problem, because there are no globals requiring
construction in the current bootstrap.
KERNEL
Kernel Virtual Memory
This note describes how EROS lays out kernel virtual memory at
present, and explores a new design in which most user state is not mapped by
the kernel.
1. Physical Memory Model :
The EROS kernel views memory as a list of contiguous ranges.
Because legacy DMA engines impose addressability constraints, each physical
memory range has an associated memory ``class,'' (an integer) indicating how
precious (i.e. how constrained) that memory is. When memory allocation is
performed, the allocating code specifies the required class. Allocation will be
performed from the least precious class matching the requirements.
Note: The importance
of this convention is declining. After a while, I concluded that it was better to
just integrate bounce buffers into the DMA management software. In the current
kernel, I believe that the only code which allocates memory under a constraint
is the DMA engine itself.
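For illustration, the class convention amounts to something like the sketch below (all names are hypothetical, and, as noted above, the mechanism has largely given way to bounce buffers in the DMA code):

    /* Illustrative sketch of class-based physical memory allocation; the
     * names and structures here are hypothetical, not the kernel's. */
    #include <stddef.h>
    #include <stdint.h>

    struct phys_range {
        uintptr_t base;
        size_t    bytes;
        int       mem_class;   /* higher = more precious (more constrained) */
    };

    extern struct phys_range phys_ranges[];
    extern int               n_phys_ranges;
    extern void *carve_from_range(struct phys_range *r, size_t bytes); /* hypothetical */

    /* Allocate from the LEAST precious class that still satisfies the
     * caller's constraint, preserving constrained (e.g. DMA-addressable)
     * memory for callers that actually need it. */
    void *alloc_physical(size_t bytes, int required_class)
    {
        struct phys_range *best = NULL;

        for (int i = 0; i < n_phys_ranges; i++) {
            struct phys_range *r = &phys_ranges[i];
            if (r->mem_class < required_class || r->bytes < bytes)
                continue;
            if (best == NULL || r->mem_class < best->mem_class)
                best = r;
        }
        return best ? carve_from_range(best, bytes) : NULL;
    }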
2. Current (before 5 Jan 1998) Kernel Memory Model :
The EROS kernel uses main memory for several purposes. In
approximate order of allocation, these are:
Kernel Mapping Table(s) :
The kernel mapping tables map all of the following space. The
process ``owns'' the low 3 gigabytes of the virtual address space, while the
kernel owns the upper 1 gigabyte. The kernel map in the current design includes
a redundant map of all of physical memory.
Interrupt Management Structures :
Any structures and tables necessary to handle interrupts.
Driver Buffers :
Drivers are free to allocate kernel pages during startup.
User Thread Table :
The master thread table, containing all of the running or sleeping
threads in the system.
Context Cache :
The cache for active processes, used in context switching and IPC.
Inverted Mapping Table :
Used to invalidate mappings when the capability slots they depend
on are changed. This data structure is referred to in the EROS kernel as the
Depend Table.
Core Table :
The per-object header structures associated with physical pages.
There is one core table entry for every page frame in main memory. In the
current implementation, this includes the page frames used for page directories
and page tables.
Node Space :
The memory frames for in-core nodes; this space must be generally
accessible to the kernel.
Page Space :
All user pages reside here. In the current implementation, page
space is mapped into the kernel's virtual memory map.
3. Proposed Kernel Memory Model :
There are two major problems with the model described above:
- It limits physical memory to slightly less than 1GB.
- It does not reserve any part of the address space for use as ``small spaces.''
The first point may seem insignificant, but at the University of
Pennsylvania we had a number of machines configured with 256 Mbytes of memory
in 1997. The price/density curve suggests that the 1GB boundary cannot be far
away. While the feature isn't widely used, current Pentium Pro processors (and
possibly Pentium II -- I'm not sure, but I am told that they are essentially
repackaged Pentium Pro dies) can support more than 4GB of physical memory
already, which we would someday like to leverage.
3.1. Kernel References to User Pages :
There are a small number of places where the supervisor makes
reference to the content of a data page:
1. From Software : There
are several cases in which the kernel makes a data reference from software:
- Neither Mapped: In the data page clone operation, neither source nor destination is necessarily mapped. Note, however, that this operation returns no data to the user.
- PIO: Programmed I/O moves data to or from a page that is not necessarily related to the current process in any way. This page must be temporarily mapped into the kernel before I/O can occur. Bounce buffer copies are also a form of PIO.
In the EROS kernel design, PIO occurs with interrupts disabled,
and never occurs from a nested kernel interrupt. This means that PIO does not
interfere with other copies that may be in progress.
2. Physical DMA: If
the DMA hardware uses physical addresses, the fact that the page frame is not
mapped into the kernel presents no special difficulties; the kernel merely
needs to know the physical address of the page frame.
3. Virtual DMA: If
the DMA operation uses virtual addresses, then a mapping must somehow be
constructed. In such architectures there is generally an efficient way to shoot
down stale mappings, or if necessary a technique similar to the one used for
programmed I/O can be employed.
3.2. Revised Memory Layout
Given that the page references are cleanly definable, the revised
kernel memory model excludes user pages from the kernel memory map. The new
address space layout is as follows:
+-----------------+------------------+-------------------+
|   Large Space   |   Small Spaces   |  Kernel Virtual   |
|    (unique)     |     (shared)     |     (shared)      |
+-----------------+------------------+-------------------+
Small spaces and kernel space are shared across all virtual maps,
enabling rapid transition to/from kernel and to/from small spaces. While nodes
are directly accessible within kernel space, page data is not unless it is
mapped into the user address space. Note that capability load and store
instructions always access pages that are read/write to the kernel but not to
the user, so this falls under the user-accessible pages scenario.
In this model, page table page frames are pre-reserved, and cannot
be taken from the general memory pool. More precisely: they could be taken from
the general page pool, but the need to translate addresses means that doing so
is not worthwhile, because page table access needs to be fast.
3.3. Kernel Virtual Memory
The following factors were not considered at design time:
- Page table update requires translation from physical addresses to virtual addresses for page tables.
- Depend table invalidation requires translation from physical addresses to core table entries.
- Growing the kernel address map, and adding mappings to it.
3.3.1 Page Table Update
When an address fault is satisfied, a page table entry must be
constructed. This page table entry must be written into the appropriate
physical page, so a kernel virtual address for the page frame must exist or
must already be established.
There are several ways to approach this problem. Listed in order
of performance:
Contrive for all user page tables to come out of a contiguously
allocated range of physical pages, and map these physical pages at a contiguous
range of virtual addresses. If this is done, the interconversion can be
accomplished by adding or subtracting a suitable constant.
This approach imposes a fixed limit on the number of page tables
available for user mappings. This number can be conditionalized on the amount
of available memory, but it is not obvious what a good heuristic would be.
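The constant-offset interconversion then amounts to nothing more than the following; the area symbols and their values are hypothetical placeholders:

    /* Sketch of the constant-offset interconversion for a reserved,
     * contiguous page-table area; the symbols and values are hypothetical. */
    #include <stdint.h>

    #define PT_AREA_PHYS  0x00400000u  /* hypothetical: start of reserved frames */
    #define PT_AREA_VIRT  0xC0400000u  /* hypothetical: where they are mapped    */
    #define PT_AREA_DELTA (PT_AREA_VIRT - PT_AREA_PHYS)

    static inline void *pt_phys_to_virt(uintptr_t pa)
    {
        return (void *)(pa + PT_AREA_DELTA);
    }

    static inline uintptr_t pt_virt_to_phys(const void *va)
    {
        return (uintptr_t)va - PT_AREA_DELTA;
    }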
3.3.2 Depend Table Invalidation
When a node slot is modified, the corresponding depend table
entries must be invalidated. To do this, the PTE entries must be updated and
the TLB must be flushed.
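In outline, and with hypothetical structure and helper names rather than the kernel's actual Depend Table code, the invalidation step looks like this:

    /* Sketch of depend-table invalidation; the structure and helpers are
     * hypothetical stand-ins for the kernel's actual Depend Table code. */
    #include <stdint.h>

    struct depend_entry {
        uint32_t *pte;                 /* PTE that was built from the node slot */
        struct depend_entry *next;
    };

    extern struct depend_entry *depend_lookup(void *node, int slot); /* hypothetical */
    extern void tlb_flush(void);                                     /* hypothetical */

    void depend_invalidate(void *node, int slot)
    {
        /* Every mapping constructed from this slot is now stale. */
        for (struct depend_entry *d = depend_lookup(node, slot); d; d = d->next)
            *d->pte = 0;               /* mark the PTE not-present */

        tlb_flush();                   /* discard any cached translations */
    }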
3.3.3 Kernel Address Map
A number of data structures must be mapped into the kernel virtual
address space. A few of these want to be mapped from contiguous physical
memory, but this doesn't really pose special problems to the kernel mapping
mechanism per se. The question is: how is the kernel map itself to be updated?
One can imagine three designs for the kernel virtual memory map:
Dedicated. A memory map covering the entire physical memory is
constructed at startup time. When the kernel must update its own mapping
structures, it switches temporarily to this map. The dedicated map maps
physical memory in what would otherwise be per-process virtual space, and
kernel memory according to the normal kernel mapping(s).
Reflexive. The kernel map appears in kernel virtual memory, and is
updated by making suitable read and write operations via the mapped virtual
addresses.
Non-reflexive. The kernel map does not appear in kernel virtual
memory.
One solution to the temporary mapping construction problem is to
hand-construct a sufficient mapping table to support building the temporary mapping,
and then do the rest by building temporary mappings.
A Programmer's Introduction to EROS
This document is a work in progress; its current contents
are mostly correct. This document provides an introduction to the EROS system
from a programmer's point of view, and is also relevant if you are building
programs that will run under an operating environment such as Dionysix.
This note does not try to provide an exhaustive description of the
EROS environment. Instead, it tries to provide enough information to get you
started, leaving details to be addressed by other documents.
1. Process
Applications in EROS are typically constructed out of several
independent components, which we call processes.
A process consists of:
- An address space.
- A set of registers, which change as the process executes.
- A set of 16 capabilities, more commonly referred to as keys.
- A keeper, which is another process that is invoked when a process takes an exception. The keeper is established when a process is first created. Most processes are unable to change their keeper.
Each process implements a specific, well-defined collection of
related services. Each process carries exactly the authority that it requires
to perform its function and no more.
While it is common to build several new processes for each application,
the EROS system comes with a toolkit of useful processes. Some of these provide
high-level facilities such as sorting and data management. Others implement
various supporting policies for such things as page fault handling. In addition
to constructing new components, an application designer will typically make use
of several such ``off the shelf'' processes in constructing an application.
Processes are more than just libraries. In addition to code and
data, each process has the ability to remember information from one session to
the next, and runs in its own protected environment. Services implemented
using processes can run in parallel with the application they serve. We refer to
such protected, independent services as active persistent agents. They are one
of the keys to EROS's reliability.
Placing a service in a separate process has several advantages:
- The service runs in its own address space.
- The service can be tested independently.
- The service can be used by non-trusted clients.
The primary disadvantage is
that invoking a process is somewhat more expensive than performing a library
call.
1.1. Persistence
All objects in EROS, including process, are persistent. If the
system is shut down, or for some reason crashes, the state of all running programs
is preserved by the checkpoint mechanism. The EROS system periodically saves a
consistent copy of the entire system state to disk. When the system restarts,
it resumes from the most recently saved system image. Typically, no more than 5
minutes work is lost in such a failure.
A consequence of persistence is that processes exist forever. Until
a process is explicitly dismantled, the services it implements remain available
to clients with suitable keys.
Persistent processes enable a whole new approach to application
design. Instead of building shared libraries, common services in EROS are
placed in separate, persistent processes. Because processes are persistent, there
is no need to recreate them each time an application is started, as would be
necessary in a non-persistent system. In addition, fault isolation and security
are greatly improved by persistence. For example, EROS programs can hold
authorities that are different from the user's authorities.
1.2. Views of Process
From the programmer's perspective, there are (at least) two useful
views of a process:
- A process is a running application program that invokes services.
- A process is the implementer of one or more related services.
1.3. Process States
In addition to providing services, processes provide a mechanism
for synchronization. A process can be occupied by at most one thread. If a
client invokes a process that is currently running, the client blocks until the
process becomes available. More precisely, a process can be in one of three
states:
Available : An available process is currently idle, and can be invoked by any holder of an appropriate start key. This is the state of most of the processes in the system. An available process is not occupied by any thread.
Running : Once it has been invoked, a process moves from the available state to the running state. A running process is busy servicing an invocation, and is occupied by a thread. If a process is invoked while it is running, the invoker will block until the process becomes available. A process becomes available when it performs a return operation.
Waiting : A process that has invoked a key and is waiting for a response moves from the running state to the waiting state. It remains in this state until it receives a reply to its invocation. If a process is invoked while it is waiting, the invoker will block until the process becomes available.
2. Keys and Objects
A key grants the authority to use a particular service or
manipulate an object. Keys can be loosely broken into a small number of
categories:
Object Keys : The primary objects in EROS are pages and nodes. All other objects in the system are constructed out of these primary object types. Each of these objects has an associated key type.
Primary objects are implemented directly by the EROS kernel. When
a program invokes a primary key, the action requested is performed directly by
the EROS supervisor.
Processes and address spaces are constructed out of nodes and pages,
and the keys that manipulate these objects are also implemented by the kernel.
Start Keys : Processes can publish services to their clients. The creator of a new process can fabricate a start key to that process. The start key includes:
- The identity of the process that implements the service
- A type field, known as the data byte, indicating which service is to be performed.
A program that holds a start key to a particular process can use
the service by invoking the start key. The message sent during the invocation
is sent directly to the implementing process.
Device Keys : Access to hardware devices is provided through device keys. Generally speaking, a device key is a thin layer over the raw hardware. The kernel implementation of the device key does only enough checking to make sure that a request will not damage the rest of the kernel.
A program that holds a device key has direct control over the
underlying hardware. Device keys are therefore closely held, which means that
they are given out only to the programs that are supposed to manage the
devices.
Most programs use only object and start keys. A few of the
more commonly used objects and their keys are described below. A discussion of
address spaces and start keys can be found in subsequent sections.
3. Address Spaces
An address space provides a mechanism for assembling pages into a
larger, addressable pool. The current EROS implementation supports address
space trees of up to 2^64 bytes, and 64-bit implementations will probably
increase this limit (the current address space structure can handle 2^128 byte
address spaces, but the number key size allows only 96 bits of offset to be
requested).
4. Invocation Types
EROS implements three types of invocations: call, return, and
send.
4.1 Call Invocations
The call invocation can send only three keys, because the kernel
generates a resume key and places it in the fourth key slot of the message. A
unique return key is generated for each call, and the kernel guarantees that
every call will receive at most one return. Return keys are, in effect,
self-consuming. If any copy of a given return key is invoked all others are
effectively invalidated.
If the recipient is running or waiting for another process, the
caller will block until the recipient becomes available. Along with the
caller's message, the recipient receives a copy of the data byte from the
service key, indicating which service was invoked.
In addition to specifying the location and size of the keys and
data to be sent, the caller specifies where any keys and data returned from the
service should be placed. When the service returns, its returned information
will be placed in the buffer and key slots specified by the caller.
When a call invocation is performed, the calling process ceases to
run, and is placed in the waiting state. It will eventually resume execution
when the resume key is invoked. The recipient, meanwhile, transitions from the
available state to the running state, and begins processing the service
request. We say that the thread which occupied the caller has migrated to the
service process.
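From the caller's side, a call invocation can be sketched roughly as below. The Message layout and the eros_call() entry point are illustrative assumptions, not the real EROS system-call interface:

    /* Illustrative sketch of a call invocation; the Message layout and the
     * eros_call() entry point are assumptions, not the real EROS ABI. */
    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t key_slot_t;       /* index into the process's 16 key slots */

    typedef struct Message {
        key_slot_t invoked_key;        /* the start key being called            */
        key_slot_t snd_keys[3];        /* at most three keys may be sent; the   */
                                       /* kernel fills the fourth slot with a   */
                                       /* resume key it generates               */
        const void *snd_data;
        size_t      snd_len;

        key_slot_t rcv_keys[4];        /* where returned keys should be placed  */
        void       *rcv_data;          /* buffer for returned data              */
        size_t      rcv_limit;
        uint8_t     data_byte;         /* data byte from the start key, seen by */
                                       /* the recipient                         */
        uint32_t    result_code;       /* filled in by the returner             */
    } Message;

    extern void eros_call(Message *msg);  /* hypothetical: places the caller in
                                             the waiting state until the resume
                                             key is invoked */

    /* Usage: call whatever service the start key in key slot 5 names. */
    static void example(void)
    {
        char reply[256];
        Message m = { 0 };

        m.invoked_key = 5;
        m.snd_data    = "request";
        m.snd_len     = 7;
        m.rcv_data    = reply;
        m.rcv_limit   = sizeof(reply);

        eros_call(&m);                 /* thread migrates to the service process */
    }

The kernel's generation of the resume key in the fourth slot is why at most three keys can be named explicitly by the caller.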
4.2. Return Invocations
The return invocation is the mirror to the call invocation. The
returner specifies the message and result code to be returned, and also
specifies the location and slots into which the next message should be
received. The returner then switches from the running state back to the
available state, and the caller (who performed the call that is being returned to)
switches from the waiting state back to the running state. The thread has
migrated back to the caller.
4.3. Send Invocation
It is sometimes necessary to send messages without blocking for a
reply. The send invocation does this. The caller sends a message as with the
call instruction, and blocks until the recipient has received it. Once the
message has been transmitted, both the sender and the receiver are in the
running state, and are able to execute instructions.
As an example, the send invocation is used when the user interface
wishes to initiate a new application.
Note that the send invocation has the effect of creating a new,
independent thread of execution. The invoking thread remains in its original
process, and a new thread is allocated for the recipient.
5. Not Just Client Server
The call and return invocations are named for their most common
usage, which is to handle client/server interactions between two processes, but
this is not the only way in which these invocations can be used. In particular,
a call can be performed on a resume key, and a return can be performed on a
start key. This has some very useful implications.