Lecture Notes by Anthony Zhang.


Distributed Systems.

Khuzaima Daudjee
Section 001 (technically section 2 somehow, but it shows up as section 1 on Quest)
Website: https://cs.uwaterloo.ca/~kdaudjee/courses/cs454/
Tuesdays/Thursdays 10:00am


Introduction to the fundamentals of distributed systems given computers with facilities for data transmission - distributed algorithms, file systems, databases, and so on, as well as related topics.

Course has 4 assignments (worth 50%), no midterms, and one final (worth 50%). Textbook "Distributed Systems: Principles and Paradigms (2nd edition)" is freely available online from the publisher (download page), though it's not strictly required.

To pass the course, you must pass both the assignments and exams. There's also a Piazza class. Also, there is a lot of information that is only going to be mentioned in class, so the class notes on LEARN are not comprehensive.

After an assignment deadline, there is a 48 hour grace period afterwards in which instructors won't answer questions on Piazza, but you can still submit your assignment with no penalty.


A distributed system is a collection of independent computers that appear to users as a single computer - a system that acts like one computer. In other words, a number of computers that all appear as a single machine. This includes things like popular web applications, to multicore processors.

There are a few common fallacies of distributed systems, which often prevent distributed systems from working well:


The main reasons we use distributed systems over centralized systems are economy (lots of small computers is often cheaper than an equivalent single big computer), performance (distributed systems can scale beyond what's possible to do with a single machine, and having multiple computers means the I/O capabilities far exceed single machine performance), suitability for inherently distributed problems (for example, serving web applications from servers that are as close to the client as possible, in many different countries), reliability (distributed systems can be designed so that machines can fail without the whole thing going down), and scalability (distributed systems can be designed to grow as needed if requirements change).

Most distributed systems are designed so that individual machines are executing their programs concurrently, operate as independently as possible, and tolerate failures.

Most distributed systems avoid using a global clock - each machine can potentially have its own idea of what time it is. While it is possible to do synchronization (confirmation requests, TrueTime, etc.), it unavoidably adds complexity and overhead.

Well-designed distributed systems are designed to be transparent, which means the user just sees the system as a single machine. Although two visitors to gmail.com in different countries are likely talking to two entirely different machines, both visitors see the same application. If it was not transparent, it would expose internal details of the distributed system, like having to visit "us-west1.gmail.com" for western north american users, or "us-east1.gmail.com" for eastern ones.


Transparency in a distributed system means that resources can be accessed without knowing about the internal details of the distributed system. There are several kinds of transparency in distributed systems:

Real-world systems often try to implement many of these kinds of transparency, but sometimes it is useful to avoid making things transparent in return for more control, which can be useful for things like performance.

The distributed system fallacies mentioned earlier touched on some of the challenges facing the building of transparent systributed systems. There are different networks, hardware, operating systems, programming languages, and protocol implementations.


A well-designed distributed system should behave properly if the number of users and components increases. For example, the system shouldn't slow down when it gets more users, and should become faster as we add more resources to the system. A scalable system should also maintain its behaviour when it becomes more complex, and when more functionality is added. Nowadays, a common concern is also data scalability - the system shouldn't slow down when the amount of data being processed and stored increases.

In other words, scalability generally means preserving correct behaviour when increasing: system resources, number of users, software complexity, and quantity of data.

Scalability is often made difficult by the need for distributed ways to store data (we often need all machines to see the same or generally very similar data), implement services (it's a lot easier to write a centralized service since we don't need to think about communication with resources as much), and run algorithms (many algorithms are much more complicated to implement when it can't see all of its input at once).

Consider a situation where we have a popular web application. As requests come in, how do we ensure that they're distributed to our servers in a way that no single server doesn't get overloaded? Turns out, perfect load balancing is NP-hard, but we can get pretty close. For example, round-robin load balancing works pretty well if requests have relatively consistent resource requirements.


Scalability is also made difficult by partial failures - working nodes need to keep working, and failed nodes need to try and recover. Detecting failure, hiding failure from the end user (failure masking), continuing to work properly (failure tolerance), and failure recovery are all often important goals of distributed systems. This is often implemented using different schemes for replication. For example, the space shuttle has 3 redundant onboard computers, which compute the exact same thing multiple times in order to prevent errors from showing up.


Concurrency is made difficult by using shared resources - how do yoou ensure that the shared resource is changed correctly if there are multiple entities trying to change it at the smae time? If you have a banking system, and a client is depositing money from one location into an empty bank account, and then withdrawing from another location at the same time, does the withdrawal go through?

System Architecture

The architecture of a system defined the components of that system, their functionality, and the relationships/interactions between them.

A reference model is a conceptual framework that breaks the system into manageable pieces and relates them to each other in a high-level way. Reference models can be component-based (relate components to each other) or function-based (relate individual functions to each other).

Generally, distributed systems are implemented in layers:

There are two main dimensions to hardware organization we care about: shared/private memory (whether the memory is shared between processors or not), and bus/switch memory routing (whether processors/memory are connected via a bus, or everything is routed through a switch). For shared memory, the bus/switch connects processors and memory. For private memory, the bus switch only connects processors. Private memory with buses tend to be the cheapest, which is why they're often used in modern real-world distributed systems. Note that a processor or memory unit might be an entire machine by itself.

NUMA (non-uniform memory access) is a type of architecture in which different processors might need different amounts of time to access a given piece of memory. It's making a comeback in large-scale computing because they're cheaper, as it's a lot harder to match memory access patterns across a lot of connected machines.

A distributed operating system (DOS) is one that transparently manages hardware resources for multiprocess and homogenous collections of computers. A networked operating system (NOS) is one that isn't transparent, but provides communication facilities between different, heterogenous computers. Middleware runs on top of NOSs to implement distributed transparency. Most real-world systems use NOSs with middleware, since that allows us to use heterogenous hardware.

Different types of system architectures include client-server (including multi-client-single-server, multi-client-multi-server, and so on), multitier systems. Real-world large-scale systems are almost always client-server, because it allows us to scale services horizontally and vertically, allows the client to potentially use really cheap hardware, allows a huge variety of different clients, and overall just provides better price for a given performance. We usually use multi-client-multi-server, because with multi-client-single-server, the server forms a bottleneck and a single points of failure.


Vertical scaling in terms of software is partitioning of a service into multiple parts. For example, one part of the service might implement the business logic, another might implement the database, and another might implement the web frontend. The downside of vertical scaling is the service is harder to develop, it doesn't make sense to split up a service beyond a certain point (so the amount that we can scale up is limited), and some parts of the service might end up as bottlenecks anyways (for example, the frontend server might get bogged down).

Horizontal scaling in terms of software is replication of a service across different machines - each one of the servers is fully capable of completing the request from start to finish. The upside is you can scale pretty much arbitrarily, but the downside of this is that the service needs a lot more effort to develop - we need to consider synchronization, and similar issues. One example of horisontal scaling is a load balancer that distributes requests to a server in a round-robin fashion.

Real-world systems often use a mix of these two. For example, a web service might have a horizontally-scaled set of servers for the web frontend (horizontal scaling), that then make use of a single storage server (vertical scaling).

A thin client is sort of like a client-server application, but the entire application runs on the server, only the user interface running on the client. For example, X11 servers that we connect to over the internet.

A multitier system is one in which servers are clients of other servers. For example, a search engine might have a user interface, processing, and data tier, each of which makes requests to lower tiers. Multitier architectures are a superset of client-server architectures; a client-server architecture can be thought of as a 2-tier multitier architecture. Note that each tier incurs additional costs (latency, resource usage, etc.) for making requests and returning data on the network, compared to just combining the tiers, but the upside is that each tier can be scaled in a more granular way.

Peer-to-peer systems are decentralized - each peer has the application, which has the ability to coordinate with other peers running the application. We might think of this as a system in which every member is both a client and a server. When properly designed, a peer-to-peer system can arbitrarily scale, though designing robust peer-to-peer systems is much more difficult than the other systems we mentioned.

Popular examples of peer-to-peer systems are Napster (centralized server connecting peers to other peers) and GNUtella (nodes autoamtically organized in a tree, accesses start at root node and propagate downward). While these are peer-to-peer, they have poor fault tolerance overall - Napster can get shut down just by taking out the centralized server, and GNUtella is susceptible to partitioning nodes when interior nodes fail.

RPC is any mechanism by which calls are made to remote services in the same way as local procedure calls. CORBA is a specific RPC mechanism that can be checkpointed, implemented using an object request broker that routes requests and responses. CORBA was meant to be the middleware to replace all middleware, but it isn't used much these days because it's very specific to object-oriented applications.

Request time is the time between the request being made on the client, until the response is done being received. Service time is the time between the request being received on the server, and when it starts being sent out. Throughput is the number of jobs per unit time (like requests per second). We also care about the system resource utilization and the network capacity utilization.


In the past, the main source of messaging overhead was hardware and network overhead - software overhead was relatively low in comparison. Nowadays, with gigibit ethernet, infiniband, and fiberchannel, the main source of messaging ovehead is from the software itself.

The two kinds of parallelism are intra-process parallelism (threads, or using more resources within a process), and inter-process (multiple processes, possibly on different machines).

Most things can be made parallel, but some things can be made more parallel than others. Some things that usually prevent parallelism are centralized components (e.g., a single mail server), centralized data/tables (e.g., a single database server, a single DNS server), and centralized algorithms (e.g., a routing algorithm that needs complete system state to work). Usually we get around these by using caching and data replication.

The quality of service is how well it satisfies the requirements. Functional QoS is whether the service can perform its specified functions. Non-functional QoS is how well it performs those functions with respect to performance, reliability (probability of system not experiencing failure in a particular time period, and the characteristics of potential failures), and availability (the fraction of the time the system is operational, even if experiencing failures). For example, a video game might have a QoS guarantee that the framerate is at least 60fps, or a database might have a QoS guarantee that it will be up 99.99999% of the time.

Usually we improve dependability by using hardware and software redundancy, and masking failures in individual instances. The system should be designed to tolerate as many simultaneous failures as possible. Usually, we have to balance this with performance, since more redundancy means more synchronization overhead. For example, we might have a setup where we have two servers are running for each application, one acting as a backup for the other. There are also more advanced schemes like replicating only some parts, or using the backup system for non-critical activities. Redundancy also includes adding backup routes in the network, if network failures are expected.

Redundancy can be divided into process redundancy (processes can take over for other, failing processes), and data redundancy (permanent data can be recovered or rolled back when faults/inconsistencies detected).

Networks (in the context of distributed systems)

Networks in our context route requests/responses between nodes (network entities) in an undirected graph. We will look at connection-oriented networks (where messages are sent over a connection made specifically between two nodes, like how words are sent over a telephone call connection), and connection-less networks (where messages are simply sent, no explicit connection required, like how packets are sent over the internet).

Network protocols determine the order/format of messages sent between network entities, as well as the semantics of those messages. On the internet, we have 7 layers of protocols, according to the ISO OSI layered network architecture (physical, data link, network, transport, session, presentation, application). Each layer is responsible for one specific function to allow messages to be sent across the network. This allows us to focus on the upper layers to simplify our mental model of how the network works.

Each layer adds its own wrapper/framing around the data from the layer above it. For example, a packet might have a header with the packet length, synchronization (makes packet match up with the network), packet number, destination address, and source address, and have a footer with a CRC checksum.


We usually simplify the 7-layer OSI architecture for the internet to only 5 layers:

The application layer can contain its own layers, like an RPC mechanism over HTTP. The application runs on a set of network hosts, exchanging messages between hosts to implement the app.

The API defines the interface between the application and the transport layer. For example, POSIX sockets constitute an internet API.

A common way to make latencies and bandwidth more predictable is by using buffering. For example, video buffering allows video playback to remain unaffected by short dips in latency/bandwidth.

Circuit-switched networks are those that connect nodes directly using dedicated links, like telephone networks - this gives us bandwidth and latency guarantees, but it's relatively inflexible because it depends on whether dedicated resources are available (providers share resources by multiplexing and providing a lower but still fixed bandwidth).

In contrast, packet-switched networks route packets around using whatever links are available, like the internet - this gives us a lot of flexibility and reliability (since broken links can be routed around), but we can't get bandwidth, latency, or congestion/contention guarantees. Plus, each packet gets the max bandwidth available for each link, rather than being multiplexed. This is an improved version of message-switching networks, which is the same as before but entire messages are routed around instead of individual packets.

Packet switching can support a lot more users than circuit switching, and uses available resources more efficiently, which is why it's used for the internet.

One way to get circuit-switching-like latency and bandwidth guarantees on packet-switched networks is to have routers remember a particular route. This at least ensures somewhat consistent timing, and packets are more likely to arrive in order this way.

Packet switching networks have multiple kinds of packet delays when processing packets on nodes (time spent reading addresses, computing route, etc.), queueing delays (time spent waiting for packet to go through transmission queue), transmission packets (time to send packets into the link), and propagation delay (time on the wire). Routers have their own routing tables storing the addresses of the next hop for every reachable destination (and the cost of each link), and by making local decisions about where to route packets to next, result in a globally efficient path to the destination.


There are multiple ways to implement IPC:


;wip: catch up in this


;wip: missed due to interviews


Name/directory services are used to look up network resources by their names. For example, DNS, LDAP, and X.500 (e.g., WatIAm).

We can think of them as geographically distributed databases, generally implemented as a heirarchy of many name servers. Name services have to be distributed in order to provide availability (some parts can fail without bringing the system down), performance (servers should be close to users around the world), and ease of maintenance (we can take down some parts of the system as needed).

Name/directory services have multiple possible issues:

There are two schemes for name server resolution: iterative and recursive. Iterative resolution has the client request a lookup from a server, and if the server doesn't have it, it tells the client so and the client makes a request to that server's parent server, and so on, until the record is found. Recursive resolution has the client request the lookup from a server, and if the server doesn't have it, it requests it from its parent, and so on, until the record is found, returning them down the chain back to the client. Recursive resolution is generally faster, because servers are generally geographically arranged heirarchically as well - the server making the request to its parent is generally a lot closer to its parent than the client. However, iterative resolution is easier on the server, because they don't need to make any requests themselves.

One of the most common name/directory services is DNS, which maps domain names to IPs at the application layer.

A resolver is a DNS client, generally a library distributed as part of the OS. It communicates with name servers using a UDP request/reply protocol (each message can have multiple requests/replies), and implements timeouts, retry, and so on. The resolver has a hardcoded list of name server IPs to start with, generally provided by the ISP, known as the local DNS server. The resolver first contacts the local DNS server, and if it doesn't have the record, it'll contact the root DNS server, and work its way down the heirarchy until it gets to the authoritative DNS server for that host - the one that contains the record the host themselves publish.

DNS supports iterative and recursive navigation, and specifies which one to use when making requests (the server doesn't have to obey, though).

DNS records are called resource records, containing the following fields:

LDAP is a lightweight version of X.500, widely used for intranets within companies. It's also organized as a tree, mapping names to values, but the big difference is that you can query with boolean operators on directory attributes. Jini is another lookup service, intended to hold information about available/online services.


Distributed Filesystems

Filesystems are the OS's interface to storage. A distributed filesystem emulates the behaviour of a normal filesystem, using a physically distributed set of files.

Usually, these should provide access transparency, location transparency, mobility transparency, performance transparency, and scaling transparency.

Distributed filesystems should also allow concurrent access, replicating files, and handle heterogenous hardware. Often, we also want them to have security metadata and access control.

It should be fault tolerant, so at-most-once semantics are preserved for file operations, and at-least-once semantics for idempotent file operations. Replication helps with fault tolerance, because after failure we can just restart the servers (assuming the servers themselves are stateless).

Consistency is another important property of filesystems. All the clients should see the contents of the file identically as if there was only one copy of the file. Often, we settle for eventual consistency, where consistency may be temporarily broken but is guaranteed to occur eventually.

We also care a lot about efficiency - file accesses should be fast, and capabilities should scale up with more resources. The ability to scale up is why we would use a distributed filesystem over a normal one, even though a distributed filesystem's single-computer performance and storage capacity would probably be worse than a normal filesystem.

Distributed filesystem architectures usually contain a client module that runs on the application's server, which makes requests to a filesystem server running a directory service and a flat file service. The flat file service identifies file contents by UUID, and exposes them via an RPC mechanism. The directory service then maps file paths to those UUIDs. The client module exposes a filesystem-like interface to the distributed filesystem server, and might perform caching to improve performance for commonly used files.

Two models under this architecture are remote access (each file operation is done by requesting that the server perform the operation) and download/upload (every file operation is done by downloading the file from the server, performing the operation, and sending it back to the server). The advantage of remote access is that it makes implementing concurrent access easier (because the server knows and controls the ordering of operations), while the advantage of download/upload is higher performance when making many changes to a small set of files (because we can do them all locally, distributing the load of file operations to the clients).

One well-known distributed filesystem is NFS - network file system. It actually exposes a full virtual filesystem to the client using a kernel module, integrating both NFS servers and many other types of filesystems. This allows us to put EXT4, NTFS, and NFS filesystems all together into what appears to the application as a single filesystem.

The virtual filesystem uses v-nodes (analogous to v-nodes) for open files, for each mounted filesystem. For local files, the v-node just stores the i-node. For remote files, the v-node can store the file handle on the NFS server. This makes it a lot easier to provide access transparency.

NFS requests are done over RPC, originally unauthenticated - this was very bad for cases where access control was required. Security was added later on with Kerberos-based authentication and symmetric encryption (specifically, DES).

NFS uses the remote access model, so every file operation is done on the server. NFS version 4 supports requesting multiple file operations in a single NFS request, while earlier versions required one request per file operation. This is somewhat analogous to recursive vs. iterative name resolution - one requires the client to make many requests, while the other allows the server to do more of it.

Also, for recursive mounts, where we have a client that mounts a folder from a server that in turn mounts a folder from another server, NFS clients use iterative resolution - the client has to make requests to both servers, not just the original one it mounted. This makes security easier (since you only need to think about whether a given client can access something, not whether a client accessing some chain of servers can access something), and is simpler to implement overall.

NFS supports soft-mounting and hard-mounting. Hard-mounting means that applications that are accessing files are suspended and the operation is retried until it completes successfully, while soft-mounting means that applications that are accessing files are blocked for a while, and eventually get an error if it fails after a couple retries. Hard mounting means if the server goes down, applications might be blocked indefinitely, but soft mounting means that data corruption is possible when writes are reported as succeeding when they actually failed.

Mounting allows distributed filesystems to appear as a single filesystem. By adding another filesystem's namespace into our own namespace, it appears as a unified, local filesystem, when it might actually be many different computers providing the files.

Assignment 2 is supposed to be pretty straightforward, intended as a warm-up for assignment 3, which is the big one. Assignment 3 is a 2-person project, while assignment 2 is individual.


For NFS, it's important to use caching on the client and server to make the performance acceptable.

On the server, we can do write-through caching: when we write to a file that's cached, we update the cached copy, then make sure it's written to disk until replying to the client, and when we read a file, we just read from the cache. However, this makes write operations slower than if we were just writing/reading to the cache instead of the cache and the disk.

One alternative to write-through caching is to add a commit operation, which forces cached data to be written to disk, and otherwise just read/write from the cache.

On the client for read operations, we can cache files we're working on, especially for read, write, attribute, and directory listing operations. To maintain cache consistency, NFS uses a freshness timestamp associated with each file (and assumes a global clock), and periodically polls the server to check whether it has an updated copy. If it does, we update the data from the server, as well as the freshness timestamp. The freshness interval is configurable and can be set per-directory.

On the client for write operations, we can mark the modified pages as dirty, and schedule them to be sent to the server asynchronously, like when we flush or close the file.

On a single processor, when we write a file and then read it in again, we expect to get the value we just wrote. However, on a distributed filesystem with caching, it is possible that we will see obselete values.

One way to get around this is to make applications lock files before they start changing them, so only one client can change a file at a time. However, this would be a big performance hit.

In fact, the unix-style one-copy update semantics can't be implemented in distributed systems in general. Distributed filesystems instead often relax that requirement and implement unix session semantics instead - the application opens a session, and within the session the application expect that all of its reads/writes are executed in order, and read its own writes, but other clients might not see any of those operations until the session ends. In other words, changes aren't visible until the file is closed.

So four ways to implement file sharing in a distributed filesystem are unix one-copy update semantics (all clients see all changes), unix session semantics (all clients see changes after file is closed), immutable (files can't be changed in the first place), and transaction-based (changes are batched in atomic changesets).

NFS supports unix session semantics, and also shared reservations, which can be used as read-locks and write-locks.

The Andrew File System (AFS) is the basis for the OSF's distributed computing environment, and is a more ground-up redesign of the filesystem. It takes advantage of the fact that infrequently updated shared files (like binaries, configuration, etc.) will remain valid for long periods of time, and that most of the files used by an application are usually a small set (so we can hae a reasonably-sized cache and expect most files to hit the cache). It's optimized for the common case where files are small, reads are more common than writes, sequential access is common, recently used files are likely to be used again soon, and most files are only accessed by one user.

AFS consists of a client called Venus, and a server called Vice. AFS uses callback promises to represent locally cached files - a token that means the server will inform them if the cached file has been invalidated (each client can have up to one callback promise per file). Whenever a file is updated, the server informs all other clients holding a callback promise for that file that it's been updated, and the client can decide whether to they want to continue using their cached copy or update their cache.

When you open a file in AFS, Venus checks if the file is cached and the file's callback promise is valid, and uses the cached file if both are true, or updated the local cache otherwise. Also, after a failure, it's possible that we've missed callback promise callbacks while we were offline - Venus solves this by periodically checking on callback promises to make sure they're still actually valid (usually every 10 minutes).

This implements weak consistency, and a one-update guarantee (we won't inadvertantly overwrite updates). read/write operations are performed on the local copy, and when the file is closed, the file is copied back to the server (in practice, just the changed parts). This model also works really poorly for files heavily written to by multiple users, like databases, because every change invalidates everyone else's changes, but AFS was not designed for this use case.

AFS doesn't handle concurrent file updates, and leaves it up to applications. It has cache consistency only on open and close operations.


A synchronization problem is one in which we need processes to cooperate and synchronize with each other. On a single system, we've altready looked at constructs to make this possible, like mutexes, semaphores, and so on. However, these don't generalize to distributed systems.

In a centralized system, we can have a fully unambiguous, well-defined clock. In contrast, a distributed system might have a different clock for every single machine, cause by clock skew (different quartz crystals oscillate at slightly different frequencies).

Many applications depend on time only moving in one direction and cause to precede effect. For example, GNU Make detects modified files on a single machine by checking the timestamps of the source and object files. Synchronization is needed whenever two processes interact and need to agree on an order in which events occur.

Physical clocks measure time in the real world. Real world time is a mess, even in the main standard, UTC. For example, September 1752 is missing days 3 through 13. There's also leap years and leap seconds (which are even unpredictable, added whenever required). Google's physical timekeeping system tries to mitigate this by slowly skewing the clock to the correct value over a full day.

One way to do clock synchronization is to give every machine a WWV receiver to get atomic time from the NIST-run atomic clocks. However, this requires special hardware, and being within range of the broadcasts (out of Colerado).

The Berkeley algorithm is another way to do physical clock synchronization. There's a time server that periodically gets elected, polls every other server, and then tells each of them how much they should adjust their clock by to get to the average of all polled times (rather than just telling them what the time should be). Also, the time server should take the round trip time into account for each other server.

The averaging algorithm is somewhat similar, but is fully decentralized. Each machine divides time into distinct intervals (e.g., every second), and broadcasts its current clock at the beginning of each interval. Then, every machine adjusts its clock toward the average of all the clock value broadcasts it received. Also, each server should take the round trip time into account for each server.

The issue with these two algorithms is that they both adjust toward the average of all the clocks, which will slowly skew away from real-world time.

NTP is a time synchronization protocol, based on UDP. It arranges servers heirarchically, starting from a strata 1 server (the root node). It's nice for intranets and fast connections, when we have low average latency, and supports fault tolerance by reparing the heirarchy upon server failure.

Using clock synchronization hardware like the kind used by Google, we can do a lot of interesting things. For example, we can enforce at-most-once message delivery to servers even if we crash (keep the highest seen timestamp so far, reject anything that has a lower timestamp, and initialize highest seen timestamp to current timestamp on start).

We usually don't actually need a real clock for synchronization. A logical clock just enforces an ordering in a sequence that might be scrambled out of order, like the sequence number in TCP packets.

An event A that starts and ends before event B means that A happens before B, and so B must have a higher timestamp - this is called the happens-before relation, denoted A \to B. This is transitive, so if A \to B and B \to C, A \to C. If A doesn't happen before B and B doesn't happen before A, then A is concurrent with B (it overlaps in terms of time). For example, the event of sending a message on a client, event A, happens-before the event of receiving a message on the corresponding server, event B, so A \to B.

Lamport timestamps are a common way to capture the happens-before relation, for implementing logical clocks:

This ensures that if an event A has a lower lamport timestamp than an event B, A is guaranteed to happen before B.


A replicated database can use lamport timestamps to make sure that updates get applied to all copies of the database in the correct order.

One way to implement updates in a distributed database is to have clients send updates to the database's sequencer, which decides what order to process them in. The sequencer then sends the updates to each replica in a way.

However, the sequencer can potentially become a bottleneck - instead, we might use independent replicas that talk to each other to synchronize their transactions.

Global State/Elections

The global state of the distributed system is the local state of each process and the messages in transit.

Sometimes it's useful to know the global state for a system, for things that cross process boundaries like garbage collection and detecting deadlocks.

A distributed snapshot/cut is a consistent representation of the system at a certain time. Consistent in this context means "if A sends B a message, we either record both A sending the message and B receiving the message, or neither A sending the message nor B receiving the message". Formally, consistency means that if an event e is in the snapshot, and f \to e (f happens before e), then f must also be in the cut.

For example, if we have a distributed banking algorithm where messages are transactions, a consistent snapshot of the system must ensure that the total amount of money on each node and on messages in transit should always equal the total balance. The state of each node might be "how much money do I hold?" and messages in transit might be "A has taken out $50 and is sending it to B".

The Chandy-Lamport algorithm is one way to take distributed snapshots. To take a snapshot with this algorithm:

  1. The process taking the snapshot (i.e., the observer process), P_i, saves its own local state.
  2. For each outgoing channel C (i.e., send to every other process it's connected to), P_i sends a unique marker on that channel and marks all further messages with that marker.
  3. When a process P_j receives a marker on channel C:
    1. If P_j has never received any markers before (on any channel, at any time), then it runs steps 1-2 itself and sends its saved local state to P_i.
    2. If P_j has received markers before, save the state of channel C as the set of messages received along C after P_j's state was last recorded, and P_j received the marker on C. Then, send the channel state to the observer processor.

A lot of algorithms need a single process to coordinate/lead, but don't particularly care which process does it - we just need to choose exactly one process to do it. Electing this coordinator process is called the election problem.

The general approach for this is to give every process a unique ID number, and let each process know the process number of every other process (but not whether those other processes are up or not). We want to elect the process that is up and has the highest ID number.

Lamport proposed paxos for this, but it's pretty complicated and not used much. A more recent algorithm, raft, is simpler but also not used much.

The bully algorithm is used a bit more often. If a process P_i needs to know who the coordinator is, but the current coordinator isn't responding (or if we just started):

  1. P_i starts the election process:
    1. P_i sends an ELECTION message to every node with a higher ID number than itself - P_{i + 1}, \ldots, P_n.
    2. P_i waits for a little while, for other processes to get a chance to respond.
    3. If P_i doesn't get any responses, it has the highest ID number and broadcasts a VICTORY message to every other node (establishing P_i as the coordinator).
    4. If P_i gets responses, it waits for a while for another process P_j to broadcast a VICTORY message (establishing P_j as the coordinator), or if none do, restarts the entire procedure (the higher ones probably died before one could declare victory).
  2. If a process P_j receives an ELECTION message:
    1. P_j responds to the sender with an OK message (to tell the sender that it's still alive)
    2. P_j sends an ELECTION message to every node with a higher ID number than itself - P_{j + 1}, \ldots, P_n.

This algorithm, compared to more naive ways of finding the coordinator, has the advantage of having all nodes always agreeing on who the coordinator is, even if nodes fail or come online partway through. However, it sends a quadratic number of messages with respect to the number of nodes.

The ring algorithm (also known as the Chang and Roberts algorithm) is a variant of the bully algorithm, where processes send/pass ELECTION messages along a ring of nodes. If a process P_i needs to know who the coordinator is, but the current coordinator isn't responding (or we just started):

  1. P_i starts the election process. All processes initially have an unset flag PARTICIPATED, representing whether they've participated in the election:
    1. P_i sends an ELECTION message to its neighbor containing P_i's ID number, and sets PARTICIPATED for itself. Neighbors that don't acknowledge the message are skipped and we go to the neighbor's neighbor, and so on, until our ELECTION message sends successfully.
  2. If a process P_j receives an ELECTION message from its neighbor:
    1. If the ID number for the ELECTION message is greater than P_j's ID number, P_j forwards it to its other neighbor and sets PARTICIPATED for itself. Neighbors that don't acknowledge the message are skipped and we go to the neighbor's neighbor, and so on, until our ELECTION message sends successfully.
    2. If the ID number for the ELECTION message is less than P_j's ID number, P_j discards it if P_j has PARTICIPATED set, or otherwise sends a new ELECTION message to its other neighbor containing P_j's own ID number. Neighbors that don't acknowledge the message are skipped and we go to the neighbor's neighbor, and so on, until our ELECTION message sends successfully.
    3. If the ID number of the ELECTION message matches P_j's ID number:
      1. P_j unsets PARTICIPATED for itself, and then sends a VICTORY message to its neighbor containing its ID number.
  3. If a process P_k receives a VICTORY message from its neighbor:
    1. If P_k's ID number matches the ID number in the VICTORY message, P_k stores the fact that it is the coordinator.
    2. Otherwise, P_k unsets PARTICIPATED for itself, stores the fact that the ID number in the VICTORY message is the coordinator, and then forwards the message to its other neighbor.

The advantage of the ring algorithm is that it only sends a linear number of messages with respect to the number of nodes. However, it takes longer to forward all those messages around, and it isn't very fault tolerant if nodes in the ring go down or new ones come online while the algorithm is running.


;wip: missed due to interviews


What we really want in a database is a guarantee of serializability - transactions should run concurrently, but give the same results as if we ran them one at a time.

Some of the possible issues when we don't have isolation but still run transactions concurrently are:

An execution schedule/history is a partial order over the operations in a set of transactions.

A serial history is one where for any tranaction, all operations for that transactions appear together, with no interleaving between transactions. Serial histories are inefficient, so our goal is to find a more concurrent execution history that has the same effect as a serial history.

We really just care about the operations in the transactions that conflict - the relative order of conflicting operations being equal in any two equivalent transactions (this is called conflict equivalence).

Specifically, conflicting operations means a read in transaction A and a write in transaction B to the same data (since the read value depends on whether we read before or after the write), and a write in transaction A and a write in transaction B to the same data (since only the last write will take effect). In other words, we can have read-write and write-write conflicts when accessing the same data from different transactions. For non-conflicting instructions, we can freely run them concurrently and so on (if A writes to X and B writes to Y, we can do both writes at the same time, since they don't conflict).

So, we can freely reorder non-conflicting instructions while retaining conflict equivalence, as long as the relative order of conflicting instructions always stays the same. If an execution history gives a result equivalent to a serial history, it is a serializable history. Furthermore, if an execution history is conflict equivalent to a serial history, it is always a serializable history. In other words, a serializable schedule is one that has the exact same effect as running the transactions serially in some order.

A precedence/serialization graph for an execution history is a directed graph where nodes are transactions and an edge from A to B means that B has an operation that depends on an operation in A. There is a theorem that says that if there are no edges in the precedence graph, then the transaction is serializable. Additionally, we can obtain the corresponding serial history by doing a topological sort on the precedence graph (essentially, arranging the nodes in a horizontal line such that all the arrows point right).

For distributed transactions, we have to make sure both local and global histories are serializable. A global history is serializable when each local history is serializable, and conflicting operations in transactions in local histories are in the same order in every local history (e.g., if transaction A has an operation that conflicts with one in transaction B, we can't have one local history that runs A's operation before B's operation, and another that rns B's operation before A's operation).

The goal is to allow transactions to execute as concurrently as possible, while ensuring that the database is always consistent. While the serialization graph is a useful way to understand serializability, it's expensive to do in the real world.

Instead, we usually use something like two-phase locking. In this scheme, transactions lock all the objects they need in a locking phase, and then release locks as we're done with them (we can't lock anything after we unlock anything). It's straightforward to prove that this guarantees serializability, and is relatively resource-efficient. However, we don't always know what we'll need to lock beforehand, it's highly prone to deadlock, and dirty reads are still possible (since we might unlock something before we actually commit).

In strict/rigorous two-phase locking, we hold all of the locks until we commit, instead of whenever we're done with them. This fixes the dirty reads issue, but it's also slower since the locks are held for longer. We can also fix the deadlock issue by always ensuring that the locks are acquired in a particular order (this ensures that cycles in the lock dependency graph are impossible).

There's also something in between non-strict and strict two-phase locking, called strong two-phase locking, that is used a lot in the real world.


A deadlock occurs when A is locking something and waiting for B's lock to release, while B is locking something andwaiting for A's lock to release. In other words, there's a cycle in the wait-for graph.

When dealing with deadlocks, we ideally want to prevent them from happening (e.g., formally prove correctness, pre-declare resources). Alternative options are to avoid them at runtime (e.g., order lock acquisitions and transactions such that deadlocks don't occur), or detect/recover from deadlocks at runtime (e.g., deadlocks are detected, and the cycle is broken by killing off a part of the cycle).

In real-world systems, the last one is generally the most practical - letting deadlocks form and then recovering from it. When we have a deadlock, we usually want to kill off the transactions that have done the least work so far (to minimise the amount of work lost), and transactions that are in as many deadlocks as possible (to minimise the number of transactions to kill to get rid of all the deadlocks).

In a distributed system, we generally detect deadlocks either using a centralized system that keeps track of the global wait-for graph, or by keeping track of the leaf node in the chain of wait-for statements (this allows us to detect cycles as they form).

Timestamp ordering schemes are an alternative to lock-based schemes like 2-phase locking. Instead of resolving conflicts between transactions when they're executed, timestamp ordering resolves them while transactions are being created.

In a timestamp ordering scheme, transactions are assigned a globally-unique timestamp, which is attached to every operation they perform. Given a set of transactions, we start off with rts(x) = 0 and wts(x) = 0, which represent the largest timestamps that have read/written an object x, respectively.

Then, for any read from x from a transaction with timestamp t R_t(x), we perform the operation if t \ge wts(x) or abort otherwise (because the value is out of date), and then set rts(x) to t. Likewise, for any write W_t(x), we perform the operation if t \ge rts(x) \land t \ge wts(x), skip the write if t < wts(x) (because the value will be overwritten later anyways), or abort otherwise (because the old value is still being used), and then set wts(x) to t.

The globally-unique timestamps ensure that we have a relative ordering over every operation in every transaction. The conditions we check ensure that we can't read stale values, and can't write over values other transactions are reading or writing to.

Timestamp ordering works fine for read-intensive loads, but for write-intensive loads it's pretty bad - a lot of transactions will abort relative to lock-based schemes.

Timestamp ordering can be extended to get multi-version timestamp ordering (MVTO), where each write to x makes a new version x' instead of changing the old value, and reads of x are always reading the largest timestamp that's before the read's timestamp. So in addition to ensuring that operations are in timestamp order, we also enforce version order, where operations can't see versions with timestamps after their own timestamps. Turns out though, this is somewhat inefficient in the real world.


Replication in distributed systems is very commonly used to improve performance (server can be closer to client, more servers online) and availability (more copies of data available, failover upon server going offline). However, it also makes consistency more difficult, because when we have multiple copies of state, they can get out of sync.


;wip: missed due to midterm studying


;wip: went to get food instead of class


;wip: overslept


;wip: ACID properties

Write-Ahead Logging: log database operations as we're performing them. We add a checkpoint to the lock every time we commit.

If the node goes down, the log keeps track of all the operations we were performing. Upon recovery, we can scan the log from tail to head, reconstruct the list of transactions, and then replay them.

Two-part commits (2PC) ensure atomicity and durability. It's one of the most common ways to implement these properties:

The two phases here are the voting phase and the decision phase.


;wip: catch up on this


Software as a service is the idea of delivering applications over the internet as services. This was made economically feasible with the advent of cheap, reliable virtualization, which allows us to granularly divide computer resources.

Virtualization can take many forms. For example, Amazon EC2 exposes full virtual machines that behave as if they're full computers, while something like AppEngine will manage things like the OS and backing storage for you.

Companies want to use cloud computing because they can then pay by use instead of having to pay for enough capacity to handle peak load. Cloud companies are essentially taking on the server hardware risk. If there isn't enough capacity, users get a bad experience. If there's too much capacity, the company is wasting money on capacity it isn't currently using.

In practice, cloud computing is 5 to 7 times cheaper than in-house daacenters. Another advantage is that the extra capacity can be used in other projects like compute for research teams or Spot Instances.

Most of the popular cloud apps are highly data-intensive, and many cloud providers expose data management solutions such as mapreduce, online transaction processing (OLTP), and online analytics processing (OLAP).

Some cloud database architectures are shared-nothing (processors share interconnect, but have their own memory and disk), shared-memory (processors share memory, but not disk), and shared-disk (processors share disk, but not memory).

Mapreduce is a programming model that's been around since Lisp, but it's become more popular lately due to how suitable it is for large scale data processing. In the Mapreduce data model, there's only a simple key-value store that supports retrieving and storing values by key.

The ideal case is when we have straightforward computations on a very large dataset, and we want to distribute this workload over hundreds or thousands of machines. Mapreduce proceeds with two components: a map function is mapped to the dataset to create an index from the source data, and a reduce function is used to compute grouped aggregates from this index.

Some considerations for mapreduce are conserving network bandwidth, moving mappers/reducers closer to the data, and balancing the number of mappers and reducers. Additionally, mapreduce is only really suitable for certain situations, and can't begin to match the feature set of real databases.

Final exam info and review next class.

Creative Commons License This work by Anthony Zhang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Copyright 2013-2017 Anthony Zhang.