CS350

Note: I made a Docker image containing a full development environment, current as of Spring 2016. This is basically all of the tools you need to complete the CS350 coursework, in a format that will work without any fuss. Notably, this image fixes several issues with setting up compiler versions and ncurses incompatibilities.

4/5/16

There are 5 assignments, each one building on top of the previous ones. Assignments will be software that runs on top of the OS/161 operating system, on top of a MIPS VM, SYS/161. There are 5 slip days that can be used to extend assignment deadlines, and 3 or fewer can be used for any single assignment.

The role of the operating system is to both help applications do what they need to do, and to stop applications from doing what they shouldn't. For example, an operating system might expose a nice abstract filesystem for storing data in, but it might prevent applications from, say, writing to a file that was being written to by another application.

To applications, the OS provides an execution environment, with interfaces to various resources, like networking, storage, and graphics (most of the code in a modern operating system lives in the various drivers). The OS is also responsible for isolating different applications from each other, making sure they don't step on each other's toes. This is the application view of operating systems.

The OS is responsible for keeping all of the hardware in line, making sure that these resources get allocated among the programs in a fair way. This is the system view of operating systems.

The OS is also a concurrent, real-time program (at least, most modern OSs). It needs to be able to support multiple things going on at once, but also satisfy timing constraints, such as feeding audio data to the sound card to output uninterrupted sound. This is the implementation view of the operating system.

6/5/16

The OS kernel is the part of the OS that responds to system calls, interrupts, and exceptions. The rest of the operating system includes all of the utilities, shells, and programming libraries.

Applications run on top of the OS, which in turn runs on top of the hardware. Applications interact with the OS kernel via system calls, and the OS interacts with the hardware through commands, data, and interrupts. The user space is all of the parts above the kernel, and is isolated by the kernel from the actual resources - in the user space, you cannot interact directly with resources, only through the OS system calls. Note that system calls are very different from procedure calls.

In the 1940's, computers didn't use OSs - programmers would work directly on the hardware. By the 1950's, these huge, time-shared computers started getting monitoring and scheduling software, and critically, shared libraries and software - this was called batch processing. UNIX followed naturally from these shared libraries and utilities, first written in assembly, then C. Since UNIX was very accessible, and the source code was often available, improvements came rapidly and soon became dominant in the computing industry..

Concurrency

Concurrency is the ability for multiple jobs to be running at the same time. Generally, this happens through multitasking (quickly switching between different jobs on a single processor) and multiprocessing (running different jobs on different processors). Jobs are simply threads from any process - theydon't all have to be from the same program.

Processes represent full, independent programs. Threads represent different parts of the same program. The difference is that threads within a process share memory, while processes do not - all threads within a process see the same global data and code. Each process has its very own address space - they can pretend they have all of the computer's memory to themselves.

10/5/16 - Tutorial

In this course, we have apps running on top of OS/161 (our kernel), running in SYS/161 (the MIPS virtual machine for our kernel), on top of the host (our own machine). We will only really be working on the apps and OS/161. Specifically, we work mostly in the os161-1.99/kern folder.

Overview of Git. Overview of version control in general. GDB overview. Overview of the folder heirarchy in the OS/161 codebase.

11/5/16

The most important operation in threading is the thread switch - running one thread for, say, 20ms, then switching to another thread so it can run for another 20ms. When we switch threads, we need to keep track of and set the correct values of the program counter, registers, the stack pointer, the frame pointer, and so on. Each thread needs its own stack.

When a thread switch occurs, we need to back up the registers and other thread data on the thread's stack, and then load in the other thread's data from that thread's stack.

In OS/161, we have a particular register allocation when we're making function calls. This is actually a very common calling convention in MIPS, and a lot of OSs follow something that is compatible with it:

Threads are associated with the control state of a running program, known as the thread context/state. This contains the CPU state (program counter, stack pointer, registers, and whether it's user mode or kernel mode), and the stack state (the stack lives in the address space of the process). Basically, all the information we need to pause the running program and restart it from where it left off later.

Note that each processor on the computer can run one thread at a time. With true concurrency, we lose program determinism - the same program run twice with the same inputs may not behave the same way each time. The hard part of concurrency is ensuring both program safety (correctness) and liveness (efficiency).

In OS/161, we have a built in threading library in kern/include/thread.h. This exposes a simple way to create a new thread given a function (through forking), kill the current thread, and yield the current thread to allow the next one to run (voluntarily causing a context switch). There are a couple of usage examples in the slides.

13/5/16

Summary of threading interface. Overview of assignment 1 and how to go about doing it: traffic simulation where we make sure that vehicles, represented by threads, are simulated in a way such that they get around without colliding. Half of the assignment is implementing the synchronization primitives, and the other half is using those primitives to build applications.

Pausing one thread and resuming another is called a context switch. Basically, we decide which thread to switch to (scheduling), then save the current thread's context and restore the next thread's context (dispatching). Usually, dispatching has to be done in assembly, since the way it's done varies significantly based on the architecture.

A simple way to do scheduling is round robin scheduling: the scheduler maintains a list of threads, and then gives each one a time slice in turn. This is often implemented as a queue, where on each context switch, the first thread in the queue is given a time slice and then moved to the back of the queue.

Threads can be running (currently executing), ready (waiting in the queue to execute), or blocked/sleeping (waiting for something to happen or a resource to be available, such as blocking I/O, sleeping for a timer, or waiting for a lock to be released). In some cases, a thread can also go into a special zombie state, which means that the thread is done running, but is still kept around because it possibly has useful information (usually information about how it exited).

Dispatching turns a ready thread into a running thread, while yielding or other context switching converts a running thread into a ready thread. Blocking operations (namely, wchan_sleep) converts a running thread into a blocked/sleeping thread, and completion of that operation (namely, wchan_wakeone and wchan_wakeall) turns a blocked/sleeping thread into a ready thread.

A thread can voluntarily pause (by yielding - cooperatively), or involuntarily pause (by being preempted). The operating system gives each thread a certain amount of time, and if it takes too long, the operating system does a context switch (usually using interrupts). Preemption makes sure that all threads have fair access to the processor. When we're implementing the scheduler in OS/161, we can have an interrupt fire at regular intervals, check how long the current thread has been running, then force a context switch if it's longer than its assigned quantum.

In OS/161, thread_yield calls thread_switch, which calls switchframe_switch, which is written in assembly. This pushes a special kind of stack frame onto the thread's stack, known as a switch frame, that saves the thread's context (registers, etc.). When we switch back to that thread, we can just restore the context from that switchframe.

In OS/161, a preemptive context switch can (from the perspective of a thread) happen at any time, since it is triggered by an interrupt. When this happens, we push a special kind of stack frame into the thread's stack, known as the trap frame, that saves all of the thread registers and context. This is different from a switch frame because a thread frame is planned by the thread, so the switch frame doesn't need to save things like temporary variables, while the trap frame does.

The preemptive context switch is still done using thread_yield, just like a voluntary yield - when a thread is preempted, we push both a trap frame and a switch frame to the stack.

Why do we need both a trap frame and a switchframe? Well, the trap frame comes before the interrupt handler, which could call a bunch of other functions, so in general we won't know where that trap frame is relative to the stack. Plus, the thread_yield function probably needs to clobber things like the status register. Finally, it's possible to have a trap frame without a switchframe (when the interrupt is for non-thread-switching purposes such as PS2 keyboard input). The switchframe is convenient because we can then assume that the top of the stack s always a switchframe when we're dispatching a thread.

Round robin scheduling by itself is cooperative - there is no preemption. Preemptive round robin scheduling is round robin with preemption. The time slice that is allocated to a thread when it's dispatched is called a scheduling quantum. The time slice is often based on a multiple of the processor tick (a timer that regularly fires, usually at something like 1 ms, and context switching occurs at these points, usually at something like 20 ms).

The downside of round robin is that it doesn't work with priorities - we want important threads to be scheduled more often than less important ones.

18/5/16

To summarize: computer resources are shared using processes and threads. Threads can switch cooperatively by yielding, or by being preempted. A switch frame is pushed for any thread switch, while a trap frame only occurs for a thread preemption. Switch frames only save the usual function-call-persistent registers, while trap frames need to save all the registers.

A wait channel is an object that a thread can subscribe to and go to sleep. When the wait channel gets activated, any threads that are waiting on that wait channel get woken up and put in the ready queue. For example, if a thread wants to wait for a printing job to finish, it would wait on the printer wait channel and get woken up again when the printer is done. Wait channels have two operations: putting something on the wait channel, and taking something off the wait channel.

Threads need to store their name, the wait channel that it's listening on, their state, the CPU it's on, the process it belongs to, their call stack, and a few other things that the OS needs to keep track of threads.

Synchronization

Threads share access to system resources like the hard drive via the operating system, and also program data (global variables). Synchronization is a set of techniques for making concurrent operations on shared resources safe and ensures that programs remain correct.

If we have a global variable, and two threads each read it, add 1, and then write it back (x ++), it's possible for one thread to get preempted after it's read the variable but before it writes back the new value - the program would only update the global once rather than twice! Since thread execution is non-deterministic, the thread could get preempted at any time, and code from different threads can be interleaved arbitrarily. Synchornization is needed to ensure that instructions are executed in the correct order.

Mutual exclusion is the technique of only allowing one thread to access a shared resource at a time, ensuring that it doesn't get messed with by other threads.

In C and C++, the volatile keyword applied to a variable ensures that it's always loaded and and stored directly in RAM every single time it's read or written in the code, and that this property should never be optimized away by the compiler. Among other things, this ensures that variable accesses are not lifted out of loops (loop invariant code motion), and that the variable is never assigned to a register.

For example, int volatile x;. This should be used for any variable that can be changed by anything other than our own thread's code, such as the keyboard state (the OS can change this for us) or variables shared between threads (the other threads can change this for us).

In the two threads example earlier with a global variable, what we want is to signal that a thread is going into a critical region for a resource. If this signal is set already by the time we're trying to enter the critical section, then we want our thread to wait until the signal is stopped and the other thread is done its critical section. Synchronization is a technique for making sure different critical sections don't step on each other's toes.

20/5/16

Blocked threads can only be woken up by a signal - since they're not executing, they can't wake themselves up.

In modern OSs, when the CPU is idle, almost all of the hundreds or thousands of threads are blocked.

There is one wait channel for each lock instance - one for every critical section. Wait channels can have multiple things waiting on them. In OS/161, we can either wake up one thread waiting on the wait channel, or all of them. To implement thread priority, we could unblock threads on wait channels in order of their priority.

Every wait channel can be locked using a spinlock that's associated with it. Most wait channel functions require spinlocking the wait channel while they're actually updating things like lists of threads.

Best practices for locking is to be as granular as possible - ideally, there should be as little code that runs while locked as possible, and as few locks as possible. The usual pattern is to copy data structures that need to be shared into local, non-shared memory, then release the lock before processing the local copy - this is often useful for ensuring that the data structure stays locked for the minimum time possible.

wchan_sleep is very similar to thread_yield in that it gives up the thread's remaining quantum. However, thread_yield makes the thread become ready, while wchan_sleep makes the thread become blocked.

When we attempt to procure a semaphore, we spinlock the semaphore instance, and while the semaphore value is 0, we'll temporarily release the spinlock and block the current thread.

Semaphores are often used when we're trying to synchronize producers and consumers, to make sure that there is at least one item queued up before consumers try to read them. Producers push items to the queue and vacate a spot for a consumer, while consumers procure a spot and take an item from the queue. This ensures that consumers will properly block if there are no spots available.

If the queue must be a bounded size, we could also use a second semaphore to enforce that only so many spots (spaces for items) are available on the queue. Basically producers would procure a spot in the queue before adding it to the queue, and consumers would vacate a spot on the queue after it takes an item from the queue.

25/5/16

27/5/16

The spinlock holder (lk_holder) is a CPU rather than a thread, because the spinlock only runs on a single CPU (a spinlocked thread can't be preempted, and only runs on one CPU).

At the lowest level, we can enforce mutual exclusion (preventing two threads from being in the same critical section) by disabling interrupts (to prevent the current thread from being preempted), or by using special atomic instructions provided by the architecture.

Spinlocks enforce mutual exclusion on a single CPU (but not necessarily for all CPUs) by disabling interrupts. In OS/161, interrupts can be masked to a given threshold, so that only interrupts with a priority above that threshold will get through and call their handler.

While disabling interrupts can only enforce mutual exclusion on a single CPU, the atomic instructions work over all CPUs (this is generally implemented using a bus between all CPUs, plus some arbiter that ensures access to the bus is given out fairly).

One way to implement mutual exclusion using atomic instructions is to have a test-and-set instruction: atomically set the value of a memory location and return the original value at that location. To implement spinlocks with test-and-set, we just have to use:

boolean volatile locked = false;
void spinlock_acquire(void) {
    // disable interrupts to prevent us from getting pre-empted
    splraise(IPL_NONE, IPL_HIGH);

    // continuously try to atomically set the locked value to true
    // when the lock is already unlocked, the loop breaks and we also set the locked value from false to true at the same time, which means we acquired the lock and can continue
    // when the lock is already locked, setting it to true repeatedly does nothing since it was already true to begin with, which means we'll keep looping until it becomes unlocked
    // when it becomes unlocked, the locked flag gets sets to false elsewhere, and this loop sets it to true then breaks, which means we acquired the lock and can continue
    // this test-and-set has to be atomic because we don't want to risk the possibility of another thread acquiring the spinlock in between us finding that the value of the locked field is false, and us setting the locked field to true - if that happens, then two threads have acquired the spinlock, which is Very Bad
    while (test_and_set(&locked, true)) { }
    // the atomic instructions tend to slow everything down, since they need to stop all the other cores when they run
    // as an optimization, we could do `locked && test_and_set(&locked, true)` instead of just `test_and_set(&locked, true)` so multiple threads waiting on the lock to be released don't constantly spam slow atomic instructions
}
void spinlock_release(void) {
    // as soon as we set this, it's fair game for any waiting threads to set to true again
    // note that the threads won't necessarily be able to acquire the lock fairly, so starvation is possible this way
    locked = false;

    // re-enable interrupts, allowing us to be pre-empted again
    spllower(IPL_HIGH, IPL_NONE);
}

Another way is to have a compare-and-swap instruction: atomically set the value of a memory location to true if and only if its original value matches a given/expected value, then return the original value at that location. Implementing spinlocks with this is almost identical as for test-and-set, where the given/expected value would just be false.

There's also a different way implemented by MIPS: we have a load-linked instruction ll (atomically return the current value at a memory location) and a store-conditional instruction sc (atomically set the value of a memory location if and only if it has not been written to since the last load-linked instruction on the same CPU).

We can actually implement test-and-set using these two instructions: load-linked the original value at the memory location, store-conditional the new value, then return that new value if the store-conditional was successful, or the original value otherwise. Using that test-and-set primitive, we can implement spinlocks in OS/161 using something like the code above. Nowadays, the C compiler can generate these automatically thanks to atomic operations in C11.

Spinlocks are nice for short waits, because they only take a few instructions to set up. However, it wastes CPU time for longer waits, and it's also possible for some threads to starve - if multiple threads are trying to get a lock, it's possible for some of them to consistently acquire the lock, while others never get the chance to. With locks in OS/161, we ensure that locks are acquired in FIFO order, so that if a thread tries to acquire a lock, it is guaranteed that it eventually will.

Synchronization Pitfalls

Starvation can occur whenever some threads have a higher chance of getting a lock, while some other threads do not. The threads that are less likely to acquire the lock might never get it at all, in certain cases - those threads starve. For locks in OS/161, we ensure that starvation never occurs by using FIFO order for lock acquiring - if a thread attempts to acquire a lock, it is guaranteed that eventually it will manage to if all locks are eventually released.

A deadlock is a cycle of two or more locks, and results in indefinite waiting. For example, thread A waits for thread B to release something, while thread B is waiting for thread A to release something else - they depend on each other, and are both stuck.

One example of deadlock is two threads that each allocate 40% of the computer's memory, and then each try to allocate another 30%, but both get blocked by the kernel since there isn't enough memory. Each thread is waiting for the other thread to release its original 40% of the memory so it can allocate its additional 30%.

To prevent deadlocks, we can force all threads to request resources all at once (the "no hold and wait" strategy), take away resources from some of the threads (the "preemption strategy", which isn't usually feasible), or assign an order to resources, and force all threads to request resources in ascending order, which ensures that a cycle can't occur.

To detect deadlock, we can treat the locks as vertices of a directed graph, and a lock B acquired while lock A is held is a directed edge from A to B. Every so often, we can do a depth-first search to detect cycles. Usually, to recover from any detected deadlocks, we terminate one or more of the threads in the cycle to break it. This can be a pretty expensive operation, so we don't want to do it too often.

A race condition is when a computation's result unexpectedly depends on the order and interleaving that we execute the threads in.

A process, like a thread, also abstracts program execution, but they are somewhat more isolated from each other. Notably, different processes have different address spaces, so processes don't share global variables. Processes can contain multiple threads. Additionally, many different OS resources are often associated with the current process rather than the current thread, like files and sockets.

A process is essentially a collection of one or more threads, an address space, and a collection of associated resources. A process is concurrent if it has multiple threads, and is sequential otherwise.

While multiprocessing is the concept of programs using multiple processors/cores, multiprogramming is the concept of programs using multiple processes. Multiprogramming allows the OS to be robust against out of control programs - if the address spaces are isolated, programs can't mess up the OS's memory, and if resources are associated with processes, the OS can control and multiplex access to the actual resources to ensure no program can hog all of it.

For our purposes, the OS is the kernel plus a bunch of utility functions and libraries that use usable by user programs. The kernel alone runs in privileged mode, as a single program with possibly multiple threads, in order to control hardware and protect itself. Privileged mode can run additional, special instructions (like halt and the ability to access certain restricted regions of memory) that can't be run in user mode. The kernel exposes some of these through controlled, secure interfaces called system calls/syscalls.

Some examples of system calls are fork/exec, fopen/fclose, and mkdir/rm/mv. Generally, syscalls will be wrapped in a user-mode syscall library that does all the hard work of making the actual syscall from the user-mode side.

A syscall works by switching to kernel mode, saving the thread context (like the program counter) on the kernel's stack, and then jumping to a specific, fixed memory address specified in hardware in the kernel memory space. At this location we have a system call handler, which can read the cause register C0 to figure out which system call to perform. After performing the system call, the handler restores the thread context and switches the processor back into user mode to continue executing the program.

1/6/16

A syscall is generally rather expensive, since it involves a lot of context switching. However, it's the only way for user programs to safely call into kernel functionality.

3/6/16

8/6/16

10/6/16

;wip: catch up in everything - needed three weeks to recover from injury enough to get out of bed

15/6/16

The standard page size for almost all applications is 4096 bytes. In some server environments and similar, the page size may be increased to improve memory performance. If we have 0x123456 bytes of data, we need 0x123 + 1 pages to hold them.

With paging, address translation is an additional thing we need to do every single time we try to access RAM, which slows down all those memory operations that use it. To improve this, we have TLBs, which are basically much faster caches for page table lookups, usually with about 64 entries. The MMU uses the TLB to translate between virtual and physical addresses. If the address can't be translated using the TLB, the MMU will raise a VM exception/address exception, at which point OS/161 is supposed to catch it, generate a TLB entry for that address, and retry the memory operation.

OS/161 has a software-controlled TLB with 64 entries, providing operations like TLB Write (modify TLB entry), TLB read (get TLB entry), and TLB probe (look for a page number in the TLB).

Every application in OS/161 has three sections in its virtual address space, usually in the following order:

The MIPS virtual address space is laid out to have 4MB reserved (not readable/writeable/executable, only really for detecting null/small pointer errors), 252MB for code and read-only data from 0x00400000 (real-world OSs make this part sized dynamically to support larger binaries), and the rest for statically allocated data and the heap from 0x10000000, plus the stack, which grows from 0x7fffffff (2GB) downward for up to 12 pages. Segments are always mapped to contiguous runs of physical memory, and we keep track of how many pages they contain in order to do bounds checking.

Having different segments means that we can have some parts executable, some parts writeable, and some parts neither, to improve safety. When we try to write in a read-only section, or execute a non-executable section, we get a segmentation fault.

Each segment has its own virtual base address, and its own physical base address, To translate a virtual address to a physical address, we subtract the virtual base address stored for that segment, check if the offset is too large using the stored page count for that segment (and give an exception if so), and then add the stored physical base address for that segment to get the corresponding physical address. For the stack segment specifically, we have to compute the virtual base address and physical base address of the stack by subtracting 12 pages from the top of the stack, subtracting the virtual base address from the virtual address, check if the offset is negative (a stack overflow), and then add the physical base address.

The virtual address space essentially keeps track of all the virtual addresses that the thread can legally access. The kern/arch/mips/vm/dumbvm.c source file contains functionality for working with address spaces.

Other schemes are also widely used. Linux, for example, uses 29 segments! ELF segments include program code (.text), read-only global data (.rodata, like string constants), initialized global data (.data, like global variables that are initialized to a given value), uninitialized global data (.bss, like uninitialized arrays and structs), and uninitialized global data that is small in size (.sbss, like uninitialized integers and floats). Not all of the sections are required - unused sections can be left out. We often call the program code and read-only global data segments the "code section", and the initialized/uninitialized/small uninitialized global data segments the "data section".

17/6/16

OS/161 uses the ELF format for executables, just like Linux and OS X. Executables are files that describe program code and data. The ELF format also allows us to store additional useful information like debugging symbols, descriptors for each section, and so on.

When starting a program, OS/161 sets up the entire virtual address space, and then loads in the program code and data. Most modern OSs set up the pages in the address space on demand instead of pre-loading the whole thing, which makes it possible to avoid using memory for program features that we aren't using.

Even though lots of different OSs can use ELF, we generally can't expect that a given ELF file will work in all of them. The main thing determining compatibility is system calls being very different (for example, Windows has thousands, while OS/161 has about a hundred).

ELF begins with a header that describes the virtual addresses to use for the start and length for each segment, as well as where we can find an image of those segments in the ELF file. The image of a segment is generally smaller than the segment in memory itself, because any remaining space is filled with zeroes in memory while they're just left out in the image. The cs350-readelf -S SOME_EXECUTABLE utility can dump information about the segments of any given ELF file for OS/161.

While every process has its own address space, a kernel may or may not have one for itself. Some kernels disable address translation and use the physical address space, some use a virtual address, just like a process, and some use a portion of the address space of each process, enforcing the user/kernel using memory permissions.

OS/161 (and also Linux) uses the last one - using a portion of every process' address space. Specifically, in OS/161, all addresses from 0x80000000 and above refer to kernel memory, and refer to the same physical memory in all processes - the kernel memory. This makes it easier to share things between userspace and kernel, like when implementing syscalls, and makes it easy for the kernel to refer to virtual addresses. In userspace, an exception is raised if code attempts to access this kernel memory.

Basically, in OS/161 the userspace is the first 2GB of a process' virtual address space, and kernel space is the other 2 GB. The kernel space has 3 segments: a 0.5 GB segment for kernel code and data, a 0.5 GB segment for kernel devices, and 1 GB that's unused. Only the 1 GB unused kernel segment and the user segment are run through the TLB - addresses in the two 0.5 GB kernel segments are not passed through page translation.

We want to avoid wasting 2 GB of memory whenever loading a program - most sections are relatively small compared to how large they could be, and in the end, most programs only use a tiny fraction of their virtual address space. Dynamic relocation lets us map each valid segment to a different physical memory runs, while not mapping the rest of the address space.

One way to do dynamic relocation is segmentation: each program segment has its own page table (with its own physical base address and length in bytes), and we can map each segment separately. With segmentation, every virtual address has a segment ID, and an address within that segment. This allows us to put our segments anywhere, so the kernel doesn't need to allocate contiguous memory for the entire address space. With segmentation, each segment is contiguous memory, but segments can be put anywhere in the physical memory.

22/6/16

Segmentation example: suppose we have a 32-bit address with a 28-bit segment offset. Then the first 4 bits of every virtual address is the segment number, and the remaining 28 bits are the offset within the segment. When we want to translate a virtual address into a physical one, like 0x20002004, we first extract the segment number from the first 4 bits to get 0x2, and then extract the segment offset, 0x2004. Then, we can check that the segment number is within the bounds of the segment table, and that the offset is within the segment bounds. We can then add the segment offset to the physical base address to get the translated physical address.

An alternative to segmentation is to use multi-level paging - the single large, linear page table is replaced with a hierarchy of smaller page tables, and parent page tables can manage their child page tables to assign memory sparsely. In this scheme, each virtual address consists of bits for a page directory index (an index within the page directory), bits for a page table index (an index within the page table), and bits for the page offset, which is the offset within the page. Since page directories and page tables both store their own lengths, we can also do bounds checks at the page directory, page table, and offset levels. We can even have nested page directories, where directories point to other directories, and so on, until they finally point to page tables.

When we have large address spaces and small page sizes, our page tables become huge - an architecture with 64-bit virtual addresses and 4 kB byte pages, we need \frac{2^{64}}{2^12} = 2^{52} page table entries, which would take an impractical amount of memory to store. Instead, we can use a page directory, which is like a page table for page tables. Each entry in the page directory points to a page table elsewhere in memory. This allows us to only have page tables for the parts of memory we're actually using, which means that we don't have to have the entire page table in memory if we don't need it.

The multi-level page table needs multiple lookups just to translate an address. As a result, this is often makes memory operations quite a bit slower - we usually want to make the heirarchy as flat as possible.

One interesting thing we can do with virtual address spaces is mapping some pages to secondary storage - some pages can be made to resolve onto the hard drive. This allows us to put some pages in RAM, and some on disk, allowing us to pretend we have more RAM than we actually do and run programs that use more total memory than we physically have RAM for. On Linux, this is generally accomplished using the swap partition. To actually implement this, each entry in the page tables can have a bit that determines whether it's present in memory - the present bit.

Since secondary storage is an order of magnitude slower than RAM, we want to minimise the amount of secordary storage we're actually using. To do this, we have multiple paging strategies: for example, the OS can bring pages from secondary storage as needed (demand paging), or it can predict which ones will be retrieved and then fetch them pre-emptively (prefetch paging). There's also many possible replacement policies (how do we replace old pages in memory if no spaces are available?). For segmentation, we can also make use of secondary storage, but in addition to the considerations for paging, we also need to consider placement policies (where do we fetch the pages into memory if there are multiple possible spaces?).

If we try to access a page with the present bit not set, we get a page fault, block the thread requesting the memory operation, retrieve the page, and then set the present bit and unblock the thread, retrying to operation that caused the page fault. A page fault sounds like an error, but is a totally normal thing to happen and occurs quite often. Basically, when the MMU says that TLB lookup failed (the MMU isn't aware of the present bit or secondary storage at all), the kernel catches the exception, tries to load it from secondary storage, and retries the operation. The TLB is basically a popular subset of what's in RAM.

24/6/16

Assignment 2b overview and suggestions. execv is a system call that causes the current process to essentially be replaced by a different program, much like the runprogram function we've encountered in assignment 2a.

When we use secondary storage, the goal is to store the most commonly accessed and important things in memory to minimise the total amount of time we're spending on memory operations.

The optimal page replacement strategy is as follows: when all pages are being used, and we need to load a new page, the page that will not be used for the longest time in the future will be evicted so the new page can be loaded. However, we can't predict the future. Most strategies we use will try to approximate future access patterns by taking advantage of spatial (pages near recently used pages are more likely to be used) and temporal (recently used pages are more likely to be used) locality.

The simplest replacement policy is first-in-first-out - pages are loaded/replaced in a fixed order, generally starting at the first page and going up. This can be done as simply as keeping track of a single page number, but it doesn't account for how frequently each page gets used, how recently we last used the page, or whether we're changing pages (in which case we would also need to write those changes back to the hard drive later when we evict the page; unchanged pages can simply not write anything to disk).

Another replacement policy is least-recently-used - when all the pages are being used, and we need to load a new page, the least recently loaded page gets evicted, and the new page gets loaded in its place. This is almost as simple as the first-in-first-out policy, and we can implement it by incrementing a counter on each page every time we access them (or keep a linked list of pages that we update as we access memory), and is nice because recently used, frequently used pages are kept around. However, in real systems, LRU is difficult to implement, especially if we have to store and update a time for every single memory operation, so they often use approximations. One good approximation of LRU, known as the clock algorithm, relies the MMU set a "use" bit on pages when accessed - it's basically first-in-first-out, except it's skipping pages that have their use bit set (and also clearing the use bit while skipping them).

A modified/dirty page is more expensive to evict than a clean one, as mentioned before - when we evict a dirty page, we have to write it back to the disk, while we don't have to do this for clean pages. Doing these dirty page writes is often why leaving a computer with a modern OS idle will result in a lot of disk activity - the OS tries to do these expensive operations when the user isn't there. If we keep track of modified pages (say, by having the MMU set a "modified" bit), we can use this to improve our page replacement policies - by preferring to evict clean pages over dirty ones, we reduce unnecessary disk activity.

When we have multiple applications running, we have two possibilities for page replacement: local or global. Global page replacement means that any page can be replaced, which utilizes available memory very well, but isn't really fair since some applications can potentially have all of their pages evicted. Local page replacement means that only pages assigned to the faulting process in RAM can be replaced, which can be more efficient due to often having less pages to check in things like the clock algorithm, but can potentially waste or underutilize memory if the pages assigned to each process in RAM don't match up with how many pages they actually need. A good compromise is a mixed strategy that implements the local strategy, but occasionally reassigns pages according to how often each process is page faulting - if page faults for a process are rare, the OS might reduce its resident set size, and if page faults for a process are frequent, the OS might increase its resident set size.

A working set is the subset of the virtual address space that the application is actually using heavily, and can change a lot over time as the application does different things. The resident set is the subset of the virtual address space that is resident in memory (as opposed to being stored on secondary storage). If the working set is larger than the resident set, then there will be a lot of page faults, since not all of the heavily used pages are in memory. The virtual memory size/commit size is the memory (RAM and swap space) that's allocated for the process.

Thrashing is when a system is getting bogged down by too many page faults. This is generally resolved by suspending/killing some processes to free up some RAM - suspending a process frees up its resident set.

29/6/16

Midterm average was in the 60's after curving (IIRC, the curve was 2%). We're going over many of the questions in the midterm that a lot of people got wrong.

The Job Scheduling Problem is the problem of how to executing multiple jobs \set{J_1, \ldots, J_n} on a single server, where each job J_i stores the arrival time (when the job arrived and is available to run), and the run time (how long the job will take to run). For now, we'll assume that only one job can run at a time (so no multi-core processing), and switching between jobs can be done at any time (no scheduling quantums) and has no cost (instant context switching).

The scheduler accepts these pairs of arrival and run times (or estimates of run times), and outputs the times that we should start each of the inputted jobs at. We can characterize different scheduling algorithms by their response time characteristics (how long does it take from the job to arrive to when it starts?), and their turnaround time characteristics (how long does it take from the job to arrive to when it finishes?).

There are 4 basic scheduling algorithms: first-come-first-serve (run jobs in the order they arrive; pre-emptive version is round robin), shortest-runtime-first (run jobs that take the least time first; pre-emptive version is shortest-remaining-runtime-first).

Shortest-runtime-first has the disadvantage that long jobs can potentially starve if new, short jobs keep becoming available.

6/7/16

CPU scheduling is the job scheduling problem where the server is a single CPU in a single-core processor, and the jobs are ready threads. When a thread is added to the ready queue, it means that a new job is added. The runtime is the thread time slice, bounded by the quantum. We generally don't know what the true runtime will be.

The goal of CPU scheduling is to have responsiveness (low time between ready and running), fairness (every job gets the resources it actually needs), and efficiency (minimise context switching overhead).

Modern CPU schedulers accomplish this using process and thread priorities. For example, Linux has 99 different possible priorities, while Windows have 32.

One way to schedule with priorities is to always schedule the highest priority thread. This works well, but isn't fair to the lower priority threads. A better way is weighted fair sharing: scheduling different threads in proportion with its priority over the sum of all priorities - given two threads, with priorities 5 and 10 respectively, the one with priority 10 gets scheduled twice as often as the other.

A good way to implement priority scheduling is to use multi-level feedback queues. In this scheme, we distinguish between interactive applications (which tend to run for short bursts and sleep most of the time due to waiting for human input), and non-interactive applications (which tend to run for longer bursts and sleep less often). The idea is to gradually reduce priority of threads that behave like interactive applications.

To implement priorities, we have one round-robin ready queue for each priority level. To implement multi-level feedback queues, the thread has its priority raised if it blocks/yields before the quantum ends, and has its priority reduced if it gets preempted. This ensures that things like UIs and other small, fast tasks get scheduled with higher priority than long-running computations. The scheduler always runs the first thread from the non-empty ready queue with the highest priority. Each ready queue can even have its own quantum length - lower priority queues should generally have longer quantums to minimize time wasted on context switching, since lower priority threads under this scheme often use their entire quantum.

Although this works, it can starve lower-priority threads. One way to get around this is to occasionally migrate every thread into the highest priority.

Modern Linux does completely fair scheduling. Ideal processor sharing means that priority exactly determines the fraction of processor time it's given: c_1 \frac{\sum_{i = 1}^n p_i}{p_1} = \ldots = c_n \frac{\sum_{i = 1}^n p_i}{p_n}, where c_i is the amount of time thread i takes in its quantum, and p_i is the priority of thread i. Each term c_j \frac{\sum_{j = 1}^n p_j}{p_i} is called a virtual runtime, and the scheduler always schedules the thread with the lowest virtual runtime.

Simplifying, the scheduler tries to make it so that \frac{c_1}{p_1} = \ldots = \frac{c_n}{p_n}, by always scheduling the thread with the smallest \frac{c_i}{p_i} value.

Completely fair scheduling also ensures responsiveness, since all threads will get run regularly.

On multicore processors, we can either give each core its own set of ready queues, or have a single ready queue for all cores. While giving each core their own queues is fast since we don't need any synchronization, it might not use the cores in the best way since some cores might be heavily loaded while others might be lightly loaded. On the other hand, having a single set of ready queues ensures good resource utilization, but has a lot of overhead from ensuring that the queue is only modified by one core at a time using synchronization primitives.

Most modern cores have a few levels of per-core cache (L1 and L2, generally), and one level of cache shared between all cores (L3, generally). When we move a thread from one core to another, we lose all of the benefits of the already warmed-up cache from the original core. As a result, threads run a lot better if we keep them on the same core (this is known as CPU cache affinity). However, if a core is heavily loaded while another is lightly loaded, it might still be better to move it over. moving threads from heavily loaded cores to lightly loaded cores is called thread mogration.

Input/Output

A bus is a communication path between different devices. Most computers have internal buses (for communication between processor and RAM), and peripheral buses (for other devices like a network card, USB, or disk drives). Different buses can be connected with each other using a bus bridge.

Generally, the CPU is connected to the memory bus, which is connected to the RAM and a bridge to peripheral buses like PCI-E, USB, and SATA. Newer architectures try to speed up peripherals like GPUs by adding a GPU bus directly between the processor and GPU. We have multiple buses because faster buses are more expensive, need to be closer to the processor, and usually have smaller limits on the number of devices.

8/7/16

Overview of assignment 3. It's the hardest one after assignment 2a. Each TLB entry in OS/161 is two words, elo (frame number and TLB entry flags) and ehi (the corresponding page number). The main thing assignment 3 is about is to improve the dumbvm virtual memory system, specifically to handle a full TLB, implement support for read-only segments, add dynamic segmentation, and implement physical memory reclamation so free/kfree works.

In SYS/161, our bus architecture, LAMEbus will be a single bus that connects all the devices together, including processors, memory, disks drives, and other peripherals. SYS/161 supports hardware timers, disk drives, serial consoles, and network interfaces.

In SYS/161, we control each device by writing to their command/control registers (for controlling the device), status registers (for querying the device status), and data registers (for transferring data between the processor and the device).

A device driver is a kernel component that encapsulates interaction with devices. Generally, this makes up most of the OS - 70% of Linux, for example!

13/7/16

Drivers will generally work with the command/status/data registers by either polling (continually checking the status register for changes) or interrupts (don't do anything until the device gives us an interrupt). For example, PS/2 mice generate interrupts when the mouse moves, which are then handled by the mouse driver, while USB mice are directly polled by the mouse driver.

There are two main ways to transfer larger amounts of data between devices and memory. The oldest one is program-controlled I/O, where the driver software moves data in to or out from the device data registers one at a time. Nowadays, we use direct memory access (DMA), where a controller chip is told by the CPU to start transferring data between the device and RAM, and raises an interrupt when the transfer is done. DMA is used instead of program-controlled I/O, since it allows us to keep the CPU free while we're doing transfers. For example, the hard drive controller does DMA to leave the CPU free while reading or writing from the disk.

In the driver itself, we can either access those special device registers using special instructions, or memory-mapped I/O. Memory-mapped I/O is what most architectures use these days, since it's simpler to code for. In memory-mapped I/O, we assign each device register to a physical address, and then use the normal store-word and load-word instructions to access them. In OS/161, we map each device's registers to one of 32 possible 64kB slots of contiguous memory between 0x1FE00000 and 0x1FFFFFFF.

When the CPU puts out a load/store on the SYS/161 bus, each device receives the request and checks if the address is within its address range. If it is, it will accept the request and process it, and if it's not, it will simply ignore the request. This ensures that each request is automatically given to the correct component.

A disk platter is divided into many concentric tracks (generally around 300000 per inch nowadays), and divided radially into several sectors (generally 64). A single sector always contains the same amount of bits, even though sectors are smaller near the center of the platter. As a result, the bits near the edge of the platter are spaced farther apart, and therefore are less likely to be corrupted - we generally put the OS at the edge since it's the most important part of the system.

We used to refer to blocks of storage with cylinder number, head number, sector number. Nowadays, we use logical block addressing to abstract storage devices - each block is the same size in bytes, we transfer data between disk and memory in blocks, and the disk controller handles how to map it to the physical disk. The disk controller controls how to move the heads and platters to minimise service time.

The service time is the sum of the time needed for the head to seek to the correct cylinder (this is generally the slowest part, taking tens of milliseconds), rotating the platter to the right sector (second slowest, taking possibly multiple milliseconds), and reading the sector (third slowest, taking hundreds of microseconds). Most modern hard drives spin at about 7200rpm (120 revolutions per second), and current material science gives us an upper bound of about 20000rpm. The rotational time depends on this spin speed and how far the sector is from the read/write heads, the read time depends on this spin speed and the number of sectors we're reading, and the

Therefore, for hard drives, large bulk data transfers are faster than multiple small transfers (since the read/write head only needs to move across the platter once), and sequential access is much faster than random access (since the read/write head can stay in the same cylinder for longer). Modern OSs defragment the disk to try to keep the most important data all in the same track and reduce the number of seeks needed. Older OSs could do this manually with a special defragmentation utility, but newer ones do this transparently in the background.

Modern disk controllers try to optimise for sequential access by doing things like track buffering (buffering sectors adjacent to the ones we requested, since we're likely to access those sectors soon after). They can also optimize the order of disk access requests to minimize read/write head movements, which has a huge impact for real-world situations. The slowest way is FIFO, since the read/write head might move a large distance for every single request, but it's simple and fair. Another strategy is shortest-seek-time-first, which greedily chooses the request on the cylinder closest to the current request, but this may starve far-away requests.

The elevator algorithm is a compromise between FIFO and shortest-seek-time-first, which maintains fairness while also reducing seek times a lot. The algorithm is as follows: the read/write heads move in one direction until there are no more requests in that direction, reverses, and moves in the other direction until there are no more requests in that direction, repeating this process until there are no more requests.

Solid state drives are increasingly popular nowadays. While SSDs are a lot faster, mechanically durable, and more power efficient, they're more expensive per byte of storage, support only limited numbers of writes, and have worse data longevity (after leaving it alone for about ten years, the data is likely to be lost). For applications like personal computing, SSDs tend to be better. For most datacenters and archival applications, hard drives still tend to be better.

15/7/16

For our purposes, files are persistent, named, ordered sequences of bytes. Files are mutable and can change content and size. Each file is associated with a set of metadata, including things like the file's owner, permissions, and creation timestamps.

A filesystem is a namespace and organizational structure for files. The filesystem provides operations for creating/deleting/moving/etc. different files within it. For our purposes, a filesystem stores files heirarchically using folders as internal nodes, and each file/folder gets a unique ID known as an inode, as well as a path name that takes us from the filesystem root to the file (this isn't necessarily unique). The filesystem is generally thought of as a tree, but hard links (arbitrary associations between paths and inodes) mean it might not be strictly heirarchical. However, most filesystems disallow hard links for directories, in order to make sure the filesystem can never contain cycles and that path names can be canonicalized. The filesystem makes sure files referred to by each inode actually exist, and ensures that files are removed when no hard links are left to their inode.

The filesystem is also responsible for breaking up the contiguous byte sequences that make up each file into blocks that can be stored on a hard drive. There are multiple different ways to break files into blocks on the hard drive, each with their own tradeoffs.

One strategy is contiguous allocation, where the file is stored in a sort of contiguous array of blocks, and the file table keeps track of the start location and length. This is great for sequential access of single files, and we can efficiently access a random index in the file, but large files might be hard to find enough contiguous blocks for, and it quickly becomes fragmented when files change size, so we need to regularly run a compaction algorithm or leave lots of space for files to grow/shrink.

Another strategy is chained allocation, where the file is stored in a sort of linked list of blocks, where each block stores the location of the next block at the end, and the file table keeps track of the first and last block in the file. While this allows files to easily grow and shrink arbitrarily, and never gets fragmented, we lose the ability to do efficient random access within files since we have to go through each block until we get to the desired offset.

A variation of chained allocation is the file allocation table (FAT) strategy, in which all block pointers for files are stored together in the beginning of the filesystem, so we can cache it and reduce the number of seeks needed. This is actually used in things like USBs where cross-OS interoperability is important.

Linux and most modern OSs use an indexed allocation strategy - each file is stored in a set of blocks indexed by its own index (which can be used to map file offsets to blocks), and the file table keeps track of where each file's index is. This per-file index is generally stored in its own block. While indexed allocation supports sequential and random access, and also allows files to easily grow and shrink, we still need to occasionally defragment to reduce the number of seeks we need when sequentially accessing large files. Also, since the index is a fixed size, this means the file size has a fixed upper bound (since it can only have a fixed maximum number of blocks).

20/7/16

Files are read in a stateful manner - each open file keeps track of a file position that is an offset within the file used for reads and writes, and reading/writing 100 bytes will move the pointer ahead 100 bytes. It is also possible to set the file position by using a seek operation.

A soft link behaves a lot like a hard link, but it links to file names rather than inodes. A soft link in Linux is a file with the special "link" bit set (under ls -l, this appears as something like lrwxrwxrwx rather than -rwxrwxrwx), and the content is the path of the file it's linked to (on Windows, soft links are known as shortcuts, and their contents are a special binary format). The soft link can be made to point to either a relative path or an absolute path. The most visible differences between soft links and hard links are that a soft link can link to a non-existant file, that a soft link can link to a file on a different file system, and soft links still point to their target file after deleting and recreating it.

Many OSs support having multiple different filesystems. Windows, for example, divides every path into two parts: the drive letter (which filesystem to use), and the interior path (where in the filesystem we are). Linux uses a single namespace, where we mount different filesystems at different mount points within this namespace - all filesystems are accessed in the same way, regardless of what filesystem they're on. For example, if a filesystem is mounted at /media/USER, we can access files on it like /media/USER/file.txt.

Note that inodes are only unique within a single filesystem, so there can be duplicates between different filesystems. This is the reason we can't have hard links that point to files on different filesystems.

Filesystems need to store file data, file metadata, directories, links, and filsystem metadata, all persistently. In addition, filesystems need to keep track of open files (including their file positions) and cached file data. One way we might lay this out on a disk is:

22/7/16

Filesystems tend to get slower when they're nearly full. In the example filesystem layout above, this would be because each time we need a block, we have to do a linear search over the entire inode bitmap and block bitmap to find a free one, and this search takes longer if most of the bitmap is filled.

Directories are implemented as special files that store entries containing inodes and filenames for the files that they contain. The files don't store their own name; filenames are only stored in directory entries.

Hard links for a given file are implemented by adding a new entry with the desired filename and the same inode into the hard link's directory. Soft links are implemented by adding a new file with the link bit set (in the file metadata), where the contents are a path to the original file.

When translating a file path into an inode, we need to start at the root inode, then repeatedly read each directory or soft link in the path for each path component, until we get to the end of the path and have a file inode. Note that we must do a read for every single level in the path.

The inode table is a prime candidate for caching, because we need it for every single file/directory read. Modern filesystems also tend to split up the inode table over several parts of the disk to help reduce the distance the read/write head needs to move to get to an inode table.

In addition to just what's on disk, the filesystem is responsible for things like tracking file descriptors for each process (including file descriptor inodes and file positions). Filesystems generally will store an open file table in RAM for these descriptors. Filesystems often also cache commonly used inodes and block contents in RAM to speed up accesses.

One of the downsides of having all the inodes together in one table is that corruption tends to happen locally - if the read/write head crashes onto the part of the disk storing the inode table, the entire inode table could be corrupted. Also, it's bad for SSDs since we do a ton of writes in a small amount of space, so the block storing the inode table will tend to wear out a lot quicker if the SSD doesn't automatically mitigate this with wear levelling.

A log-structured filesystem (LFS) is a different filesystem layout that works better for devices where sequential access is much faster than random access and we have lots of available memory for caches. Modern filesystems have enough RAM available that most of our reads are taken from the caches, so generally disk I/O will mostly be writes. To optimize for this case, a log-structured filesystem makes all writes as sequential as possible.

Log-structured filesystems are also nice for wear-levelling, because it tends to spread out writes over all the blocks. Therefore, log-structures filesystems are often used for SSDs and other flash-based memory technologies.

Basically, whenever we do a write, we start right after the end of the previous write, write all the data blocks we need to, then write an inode entry for those data blocks. When we need to overwrite any data, we just write the new data, and then add a new, duplicate inode entry to the block after that. Each inode keeps track of the previous inode block.

26/7/16

Whenever we do a read, to find the inode of a file, we need to find the last instance of that inode in the log (since there might be many duplicate instances if the data was rewritten many times).

How do we find the last instance of each inode? For this purpose we can use an inode map (also known as an imap). This is a mapping from inodes to disk addresses where we can find those inodes on the disk. Every time we write a new inode, we write the new, changed portion of the inode map. To keep track of the latest inode map, we also have a fixed location on the disk called the checkpoint region (CR) with pointers to the latest inode map. Since the checkpoint region is relatively small, it can usually be fully cached in RAM (as opposed to the entire inode table, which is generally too large), flushing it out to disk periodically and upon a graceful shutdown.

What happens when the system suddenly loses power? When this happens, we lose all in-memory data structures, while everything that is already on disk is kept. One of the goals of a robust filesystem is to be crash consistent - even if we lose some data, the filesystem should be a consistent state when we boot up again.

For example, if our filesystem deleted files by marking the file data blocks as free, removing the inode entry, then removing the filename from the directory entry, and the computer lost power in between the first and second steps, we might have an inode that points to a free block! A good filesystem should guard against this by doing things in a different order: removing the filename from the directory entry, removing the inode entry, then marking the file data blocks as free. This ensures that even if the computer crashes between any of the steps, we still have the filesystem in a consistent state.

A lot of old filesystems also have a checking utility that makes sure that the filesystem is consistent. For example, you can check for inodes that don't exist in any directory entries, or space not used by any inodes that are not marked as free. Many of these issues can even be corrected to recover the files. However, this is a very slow operation, and therefore can't be done too often.

Modern filesystems like NTFS and EXT3 do journaling instead - any filesystem metadata changes are written to a journal before actually making the changes, so that changes are all written to disk atomically (assuming we can write blocks atomically). After the operation is done, we write to the journal saying that those changes completed successfully. If the system crashes in the middle, we simply go through the journal, and redo any operations that were not marked as completing successfully.

Interprocess Communication

Threads can communicate by sharing memory, but processes cannot. Interprocess communication (IPC) is an OS feature that allows processes to communicate using OS-controlled, safe mechanisms. IPC is generally done with mechanisms like shared virtual memory/files, and message-passing systems like sockets, pipes, and signals.

Message-passing systems can either be indirect (the OS buffers the messages) or direct (the sending process blocks while the receiving process is copying data from the sending process). Both of these are done via the OS, which enforces that the messages are only sent between processes that are indeed trying to communicate.

Since IPC resources are virtual, we don't have many physical restrictions on mechanisms. We can have simplex (one direction only, like pipes) half-duplex (one direction at a time, like wifi packets) and full-duplex communication (two directions at the same time, like sockets). We can have messages be datagrams (known/bounded message sizes, like email) or streams (unbounded message sizes, like telephone calls). We can have connection-oriented (receipient specified when making connection, like telephone calls) or connectionless (recipient specified for each send operation, like email).

An unreliable protocol sends messages without ensuring it's received correctly. A reliable protocol will check if the message was delivered successfully, and possibly try to correct things like message order or corruption. This is slower and requires more resources than unreliable protocols, and is usually implemented with acknowledgement messages with checksums, message ordering values, and retry strategies for resending messages.

A socket is an endpoint for communication - each end of a two-way communication needs to have their own socket. There are stream sockets (connection-oriented, reliable, duplex stream communication) and datagram sockets (connectionless, unreliable, duplex datagram communication).

Unix domain sockets are used for communication between processes on the same machine, while INET domain sockets are used for communicating between processes on different machines over the internet protocol (this forms the basis of the internet). Both domains support stream and datagram sockets.

Datagram sockets can send/receive at any time, and block on receiving if there are no incoming messages/block on sending if the system buffer is full. Datagram sockets tend to be unreliable, and each datagram might arrive out of order. For example, UDP is an unreliable datagram socket protocol.

Streaming sockets rely on opening/closing connections (processes can block while waiting for a connection request to come in, or request a connection themselves), and once established, allow full-duplex sending/receiving on those sockets. For example, TCP is a reliable stream socket protocol.

When the two ends of a socket are on different machines, the kernel is responsible for controlling access to the networking hardware, and multiplexing the hardware resources to all the processes requiring them.

Lecture Notes by Anthony Zhang.