Lecture Notes by Anthony Zhang.

\[ \newcommand{\set}[1]{\left\{ #1 \right\}} \newcommand{\tup}[1]{\left\langle #1 \right\rangle} \newcommand{\abs}[1]{\left\lvert #1 \right\rvert} \newcommand{\floor}[1]{\left\lfloor #1 \right\rfloor} \newcommand{\ceil}[1]{\left\lceil#1 \right\rceil} \newcommand{\mb}[1]{\mathbb{#1}} \newcommand{\rem}{\operatorname{rem}} \newcommand{\sign}{\operatorname{sign}} \newcommand{\imag}{\boldsymbol{i}} \newcommand{\dee}{\mathop{}\!\mathrm{d}} \newcommand{\lH}{\overset{\text{l'H}}{=}} \newcommand{\evalat}[1]{\left.\left(#1\right)\right|} \newcommand{\sech}{\operatorname{sech}} \newcommand{\spn}{\operatorname{Span}} \newcommand{\proj}{\operatorname{proj}} \newcommand{\prp}{\operatorname{perp}} \newcommand{\refl}{\operatorname{refl}} \newcommand{\magn}[1]{\left\lVert #1 \right\rVert} \newcommand{\rank}{\operatorname{rank}} \newcommand{\sys}[2]{\left[ #1 \mid #2\hskip2pt \right]} \newcommand{\range}{\operatorname{Range}} \newcommand{\adj}{\operatorname{adj}} \newcommand{\cof}{\operatorname{cof}} \newcommand{\diag}{\operatorname{diag}} \newcommand{\formlp}{\operatorname{Form}(\mathcal{L}^P)} \]


Applied cryptography.

Alfred Menezes
Section 001
Email: ajmeneze@uwaterloo.ca
Office hours: Mondays 3:00pm-5:00pm, Fridays 1:00pm-3:00pm in MC 5026
Mondays/Wednesdays/Fridays 11:30pm-12:20pm


All course resources are on LEARN. Course has 5 assignments (summing up to 20% of final grade), midterm (worth 30%), and a final exam (worth 50%). Midterm is March 8, 2017 at 7-9pm.

Cryptography is a tool we can use to secure communicatoins when there are malicious adversaries. Some of the fundaental goals of cyptography are:

Different applications might require different subsets of these goals.

In this course, we use Bob and Alice as placeholders to represent two parties trying to communicate securely. We use Eve or Mallory as a placeholder to represent one or more adversaries trying to attack the communications (by injecting data, eavesdropping, or other approaches, depending on the threat model)

Some famous early practical uses of cryptography was the enigma machine, which used multiple spinning rotors to encrypt/decrypt messages. In that cryptosystem, the German military leadership and the German U-boats had the role of Bob and Alice, trying to communicate confidentially, and Alan Turing and his team had the role of Eve, trying to break the confidentiality guarantees and read those communications.

However, cryptosystems like Enigma and its successor, Lorenz, are a far cry from modern cryptosystems, which have a much more mathematically sound foundation. Modern cryptography is what makes online banking, online shopping, and cellular networks possible.

For example, SSL, the protocol that makes online security possible, ensures origin authentication and confidentiality between a user and a website. SSL uses symmetric-key encryption to implement confidentiality, and a MAC scheme (HMAC) to implement origin authentication. Of course, to make this possible they both need to have the same secret key for the symmetric-key cryptography, and for the MAC. To confidentially and authentically share this secret key, we use public-key encryption. Of course, to make this possible we need a way to obtain authentic copies of the public keys of both parties - this is implemented using digital signatures, where trusted third parties known as certificate authorities keep track of public keys that are considered authentic (generally by vetting the site owner in the real world), and digitally sign them to confirm that they're authentic. The certificate authorities' public keys are pre-installed in the browser, which acts as a root of trust - if you trust that public key is authentic, and that the certificate authorities are correctly keeping track of authentic .

SSL is one of the most successful cryptosystems ever deployed, used for tons of sites on the internet. However, there are a number of possible weaknesses that might result in the guarantees being broken:

This demonstrates some useful concepts:

Cryptography is only one part of a large information security ecosystem, including other things like secure operating systems, auditing mechanisms, trusted computing, and risk analysis. In this context, cryptography provides a lot of useful and essential tools, but it's not all there is to security. Attackers will generally target the weakest part of the system, and if that link fails, it is possible that the entire system will.

This course focuses on breadth over depth. For depth, take CO 485 and try out the readings for this course as well.


SSL in more detail:

  1. Client makes request to the website (not the full URL, just the host).
  2. Server responds with the website's certificate, which contains the website identification info (like host and etc.).

Symmetric-key encryption

A symmetric key encryption scheme (SKES) is a definition containing:

To use one of these to implement message confidentiality, Alice and Bob agree on a particular \(k \in K\) over a secure channel (so that nobody else knows about the value of \(k\)). Then, Alice can compute \(c = E_k(m)\), where \(m\) is the message, and then sends \(c\) to Bob over an unsecure channel. Bob can read the message by then computing \(m = D_k(c)\), while anyone without \(k\) would have to figure out the correct \(D_k\) function to use. An SKES can be designed such that finding the correct \(D_k\) would be very difficult.

A substitution cipher is an SKES, under this definition: \(M\) is the set of all English messages, \(C\) is the set of all encrypted messages, \(K\) is the set of all permutations of the English alphabet, \(E_k\) maps letters onto \(k\) by index, and \(D_k\) does the inverse mapping.

A security model defines what the adversay is capable of, like what they can do to the communicating parties. For example, there are passive attacks like ciphertext attacks (attacker can obtain ciphertext, like if they're listening on the same network), known-plaintext attacks (attacker knows some of the plaintext as well the resulting ciphertext, like knowing that someone always signs off their message with their name). There are also active attacks, like chosen-plaintext attacks (attacker can choose some part of the plaintext), clandestine attacks (attacker is willing to use bribery/blackmail/etc.), and side-channel attacks (power analysis, RF emissions analysis, timing attacks).

The security model also includes the attacker's computational abilities. For example, information-theoretic security models assume the attacker has infinite computational resources, complexity-theoretic security models assume the attacker has a turing machine capable of computing any polynomial-time algorithm efficiently, and computational-theoretic security models assume the attacker simply has a lot of computers available.

The attacker's goals, in decreasing order of priority: recovering the secret key, recovering plaintext from ciphertext (without the key), or learn some characteristics of the plaintext given ciphertext (besides message length).

An SKES is secure if and only if it is resistant to a chosen-plaintext attack by a computationally bounded adversary (computational-theoretic security). A secure SKES is necessary but not sufficient to guarantee confidentiality.

Additionally, we also assume that the adversaries know everything about the cryptosystem except the plaintext and the key, including the all the algorithms and communicatoin mechanisms.

An SKES should ideally be efficient (for encryption/decryption), have small keys (but not so small that it makes brute force attacks possible), and be secure (even against the designer of the SKES).

A work factor measures how hard a task is, in terms of how many computations are needed to complete it: \(2^{56}\) operations is easy, \(2^{64}\) is feasible, \(2^{80}\) operations is barely feasible, and \(2^{128}\) operations is infeasible. This will change as our computers get faster. For example, the entire bitcoin network is doing \(2^{61}\) hashes per second. However, we use \(2^{128}\) operations as infeasible because the Landauer limit tells us doing that many operations on a classical computer would take a significant fraction of the world's power output for a year.

For example, a substitution cipher is not secure, because it's trivially broken by a chosen-plaintext attack. By choosing the plaintext as the alphabet, the ciphertext is just the secret key. In fact, it's not even secure against ciphertext-only messages. Although it's infeasible to exhaustively try every permutation of the alphabet for valid-looking decrypted messages, we can simply use letter frequency analysis to try likely candidates first.

A polyalphabetic cipher uses multiple alphabet permutations for substitution, choosing between them with a particular algorithm. For example, a Vigenere cipher has a word with non-repeated letters added to the message's letters mod 26. This is nice because as the message grows longer, the letter frequency graph is a bit flatter. We can break that by using a chosen-plaintext attack with plaintext "AAAAAAAAAAAAAAAAAA", so if "A" corresponds to the numerical value 0, the ciphertext is just the key. Therefore, the Vigenere cipher is not secure.

From now on, we assume plaintext, keys, and ciphertext are all binary strings. What happens when we apply the concept of a Vignere cipher, but with a uniformly random key that's as long as the entire message?

A one time pad XORs a random key with the plaintext (the key is as long as the message). However, it's important to not reuse keys, because \(c_1 = m_1 \xor k_1\) and \(c_2 = m_2 \xor k_2\) implies that \(c_1 \xor c_2 = m_1 \xor m_2\), which leaks a lot of information about the plaintext. If the key is uniformly randomly selected, every key is equally likely, so every ciphertext is also equally likely. This is information-theoretically secure - a one-time pad provably cannot be broken by a ciphertext-only attack even by attackers with infinite computational resources. While one-time pads have these nice properties, in practice they are hard to use because the key has to be as long as the ciphertext, making sharing keys a pain.

A stream cipher is like a one-time pad, but instead of a truly random key, we use a pseudorandom bit generator, where the PRBG's seed is the secret key. This is no longer perfectly secure because it depends on the quality of the randomness of the PRBG, but it's often a practical compromise since the key can be a lot shorter. Like with a one-time pad, we also shouldn't re-use keys, since that would give us some of the values of the PRBG's output by XORing the ciphertexts, which would make it easier to learn the PRBG's state. One example of a stream cipher is RC4.


The PRBG should satisfy two requirements:

Most random number generators built into programming languages are not cryptographically secure - they're only intended to satisfy the indistinguishability requirement. For example, rand in UNIX uses a linear congruential generator with known constants, which is straightforward to predict.

RC4 was designed by Rivest, the R in RSA. It was used in everything from SSL/TLS to Adobe Acrobat, as one of the most popular stream ciphers around. It's nice because it's very fast and has variable key sizes, but was proprietary for a long time has a lot of weaknesses, even though none of them are catastrophic. It consists of a key scheduling algorithm, and a keystream generator.

The key scheduling algorithm generates a random-looking permutation of \(0, \ldots, 255\). It starts by initializing \(S\) with \(0, \ldots, 255\), and \(\overline{K}\) to the key repeated over and over until it fills 256 entries in the array. Then, the array entries in \(S\) are swapped based on \(j_{i + 1} = (S[i] + \overline{K} + j_i) \mod 256\). The keystream generator just applies that permutation to generate the keystream, which is then XORed with the plaintext to get the ciphertext.

Wireless networks lose the physical security of wired networks, and security for those networks is a lot harder because attackers can do it from a distance with no physical evidence. The original WiFi standard, IEEE 802.11, includes the Wireless Equivalent Privacy (WEP) protocol for protecting link-level data in transit between clients and access points. WEP was intended to provide confidentiality using RC4, data integrity using a checksum, and access control by rejecting improperly encrypted packets.

In WEP, the client first shares a 40-bit or 104-bit key with the access point (this was because the US classified cryptography as munitions, disallowing most cryptography with keys greater than 40 bits for export; and keys greater than 104 bits for domestic use). Messages are then divided into fixed-size packets, which are then each encrypted with a per-packet initialization vector (IV). The issue is that WEP didn't specify certain things, like how the key should be distributed, and how IVs should be managed.

Implementations ended up using one shared key per LAN, infrequently changed, and just generating random IVs or using consecutive integers as IVs. To send a packet, a party would select a 24-bit IV, compute the CRC-32 checksum of the message, the checksum is then appended to the plaintext, and then XORed with the RC4 keystream to get the ciphertext, where the key is the IV the shared WEP key appended to the IV. The sender then transmits the IV and the ciphertext. The receiver then, having the IV and the WEP shared key, gets gets the plaintext concatenated with the checksum, and then verifies that the checksum is correct, rejecting the packet if the checksum doesn't validate.

Turns out, WEP implements none of its goals. One problem is IV collision - since the IV is only 24-bits, the birthday paradox means that only about \(2^{12}\) packets are needed for a collision. If two packets have the same IV, then \(c_1 \lxor c_2 = m_1 \lxor m_2\) - the ciphertexts can be XORed to get the plaintexts, which can then be analyzed using known plaintexts or statistical analysis, so confidentiality is broken. Another problem is that the checksum is linear - we can make certain changes to the ciphertext while still ensuring that the checksum will verify correctly, so data integrity is broken. Finally, the checksum is unkeyed, so knowing the plaintext for one encrypted packet is enough to get the RC4 keystream and encrypt messages properly themselves.

Shamir and co. (the S in RSA) came up with an even better attack in 2001, based on the fact that 104-bit keys were infrequently changed, the IV is very predictable/randomly selected, and we know the first byte of the plaintext (the protocols add known headers to the plaintext before encrypting). With 5 million packets, the secret key itself can be recovered. Modern techniques can recover the key in just 40000 packets.

IEEE published updated wireless security standards. For example, WPA was a temporary replacement, and WPA2 came out afterwards, using AES instead of RC4, and designed to be much stronger.


Assignment 1 should be started now.

Attacks only get better over time, never worse. In WEPs case, it wasn't the security of the algorithms like RC4 that were broken, it was the use of those algorithms that was incorrect. In the future, thanks to upcoming advances related to Moore's law and quantum computing, currently secure cryptosystems could easily become broken.

A block cipher is the other common type of SKES. While a stream cipher encrypts one bit at a time, a block cipher breaks up the plaintext into fixed-length blocks, encrypting the message one block at a time. Some commonly known block ciphers are DES and AES.

DES has a 56-bit key length (which is now relatively easy to break), and a block size of 64 bits. AES (Rijndael, "rined'all") is its successor, and has a 128/192/256 bit key length. There are currently no significant known attacks on AES.

Clause Shannon gave a couple of principles for designing good block ciphers:

Also, block ciphers should be fast, so we can use them in more applications.

A Feistel cipher is a class of ciphers, including DES. Feistel ciphers are parameterized based on \(n\) (half of block size), \(h\) (number of rounds), and \(l\) (key size). It generates \(h\) subkeys from the cipher key, one for each round. Each key is used to generate a component function \(f_1, \ldots, f_h\) that "scrambles" its input.

Each block \(m\) is then broken into two halves, \(\tup{m_0, m_1}\). Then, we perform \(h\) encryption rounds: at each round \(i\), the right half gets moved into the left half, and the original left half is XORed with \(f_i\) applied to the original right half, and that becomes the right half. After those \(h\) rounds, the two halves are the ciphertext.


;wip: feistel ciphers

The New Data Seal cipher was invented by IBM as the predecessor to DES. It's a Feistel cipher, and is a relatively complicated one, but it also happens to be completely broken today to chosen-plaintext attacks. It has a block size of 64 bits, and uses 16 rounds, so \(n = 64, h = 16\). Here's the basic idea:

  1. Let \(S_k: \set{0, 1}^8 \to \set{0, 1}^8\) be the secret key - an arbitrary function of a byte.
  2. Let \(f: \set{0, 1}^{64} \to \set{0, 1}^{64}\), the component function, be defined as follows:
    1. Since the input is 64-bits, let \(z = \tup{z^1, \ldots, z^8}\) be the 8 bytes of the input.
    2. For each byte \(z^j\), let \(n_1^j = z^j[7:4]\) and \(n_2^j = z^j[3:0]\) - the two nibbles of the byte.
    3. Let \(t = S_k(z^1[7] \ldots z^8[7])\) - the key function applied to the byte obtained by taking the first bit of each byte in \(z\).
    4. For each byte \(z_j\), if bit \(j\) of \(t\) is 1, let \(p_1^j = S_1(n_2^j)\) and \(p_2^j = S_0(n_1^j)\), otherwise let \(p_1^j = S_0(n_1^j)\) and \(p_2^j = S_1(n_2^j)\). \(S_0, S_1\) are functions defined by the NDS specification (that are too long to list here), and we're swapping the nibbles if the corresponding bit of the key function \(S_k\) is true.
    5. Output \(P(p_1^1 p_2^1 \ldots p_1^8 p_2^8)\) - a scrambling function \(P\) applied to all of the scrambled and permuted nibbles concatenated together. \(P\) is also a function defined by the NDS specification (that is too long to list here).
  3. For each 16-byte block of the input \(z = \tup{z^1, \ldots, z^8}\) (each block contains \(2n\) bits), run the Feistel cipher ladder with \(f\) as the component function:
    1. Let \(m_0, m_1\) be the two 8-byte halves of the 16-byte block.
    2. For 16 rounds \(1 \le j \le 16\), set \(m_j\) to \(m_{j - 1}\) and \(m_{j + 1}\) to \(m_{j - 1} \lxor f(m_j)\) (simultaneously).
    3. Output \(m_{16} m_{17}\).

Since \(S_k\) can be considered a random function, there are \(2^8\) possible values of \(S_k(x)\) for each of the \(2^8\) possible values of \(x\) - we might think of serializing the function as a 256-byte sequence. That means there are \(256^{256}\) possible values of \(S_k\), or \(2^{2048}\), which makes trying every possible \(S_k\) computationally infeasible.

In fact, we can recover the entire secret key from the ciphertext in a few hundred chosen plaintext attacks. The main issue is that there's no subkeys - each round used the same component function and secret key, rather than deriving unique component functions for each round - each \(f_1, \ldots, f_h\) is the same function! Let \(T\) denote the function that does one round of encryption, with regard to the secret key \(S_k\). Basically, \(T(\tup{m_{i - 1}, m_i}) = \tup{m_i, m_{i - 1} \xor f(m_i)}\) for some fixed function \(f\) and the two halves of the message are \(m_0, m_1\).

Clearly, applying \(T(m)\) to the message 16 times is the whole encryption function. Let's represent this as \(T^{16}(m)\). Let \(F = T^{16}\) - this is the full encryption function, since the cipher is just applying \(T\) 16 times. Clearly, \(T(F(m)) = T(T^{16}(m)) = T^{17}(m) = T^{16}(T(m)) = F(T(m))\) for any byte \(m\).

Here's the attack. For every possible byte \(r\), we want to determine \(S_k(r)\). Select \(u = \tup{m_0, m_1}\) such that the byte formed by taking the first bit of each byte in \(m_1\) is \(r\), and \(p_1^j \ne p_2^j\) for \(1 \le j \le 8\) - the scrambled nibbles aren't the same within each of the 8 bytes in \(m_1\).

As an aside, \(x^*\) seems to mean "the first bit of each byte in \(x\)".

By the rules of the chosen plaintext attack, we can get Bob to give us \(\tup{a, b} = F(u)\), but not what \(F(m)\) or \(T(m)\) are (we know the values of the function at this point, but that's it).

However, we know that \(T(F(u)) = F(T(u)) = \tup{b, \cdot}\). Since the value of \(S_k(r)\) is only a byte, we can guess every possible value of \(S_k(r)\), and check each guess \(t\) by computing \(T_t(u)\) and then get Bob to give us \(F(T_t(u)) = \tup{c, d}\). Clearly, \(b \ne c\) implies that \(S_k(r) \ne t\), and \(b = c\) implies that \(S_k(r) = t\) is very very likely (there is a tiny chance of accidentally getting it right). This is because we're assuming that \(F\) works roughly like a random permutation, so since \(b\) and \(c\) are 64 bits, \(F(T_t(u))\) has only a \(\frac 1 {2^{64}}\) probability of \(b = c\) without \(S_k(r) = t\), because we selected \(u\) such that \(p_1^j \ne p_2^j\). Eventually, we get every possible value of \(S_k\).

Also, since we check every possible byte \(r\) (256 values), and for each value of \(r\), we guess-and-check possible bytes \(t\) (on average, 128 checks, worst case 256), we should expect \(128 \times 256 = 2^{15}\) chosen plaintexts on average before recovering the entirety of \(S_k\).

The entire attack can be summarized as:

  1. For each possible byte \(r\):
    1. Find a byte \(u = \tup{m_0, m_1}\) such that \(m_1^* = r\) and \(p_1^j \ne p_2^j, 1 \le j \le 8\).
    2. Get Bob to compute \(F(u) = \tup{a, b}\).
    3. For each possible byte \(t\):
      1. Compute \(T_t(u)\).
      2. Get Bob to compute \(F(T_t(u)) = \tup{c, d}\).
      3. If \(b = c\), we know that \(S_k(r) = t\) with overwhelming probability, and then go to the next \(r\).
  2. Now we have \(S_k(r)\), the secret key. We can use this to decrypt the ciphertext.

Basically, we need about \(2^{15}\) chosen plaintexts to recover the entire secret key, which is quite feasible on today's computers.


Chosen plaintext attacks are not always feasible, but when they are they're very powerful. For example, consider a mail forwarder that encrypts incoming emails and forwards them to another destination.

DES is one of NDS' successors. Originally with 64-bit keys, the NSA weakened the final standard to 56 bits. DES is pretty weak compared to modern ciphers, but 3-DES is still commonly in use. The design principles are classified, but it's been heavily analyzed.

DES is a Feistel cipher with 16 rounds, a 64-bit block and a 56-bit key. For each round of the cipher, a 48-bit subkey is derived from the 56-bit secret key, by selecting 48 bits from the secret key. The plaintext halves are also expanded from 32-bits to 48-bits. After each Feistel round completes, we use 8 S-box functions (which are publicly known) to scramble the XORed result and reduce 48-bit results to 32-bits. This is the source of nonlinearity in DES, and is very important for security.

With a key space of size \(2^{56}\), it's perfectly practical to just brute force today, even if we can't do it on our laptops just yet. A massively parallel effort broke a DES-encrypted message in just over 22 hours in 1999, in response to a challenge set by RSA security. Also, it's got a block size of 64-bits, which means that you'd expect a collision at around \(2^{32}\) uniformly distributed blocks, thanks to the birthday paradox. So, on a busy network, if Alice and Bob are sending many blocks back and forth, Eve might observe a duplicated block, which tells us a little bit of information about the plaintext - those two corresponding blocks in the plaintext are identical (this isn't much of a problem in practice, but is still an issue).

Differential cryptoanalysis attacks can be used to recover the key in \(2^{47}\) chosen plaintexts, though this is usually an infeasibly large number of chosen plaintexts in practice (turns out, DES was specifically designed to guard against this - the NSA was aware of the attack before it was public knowledge). Linear cryptoanalysis attacks can do the same thing in just \(2^{43}\) chosen plaintexts, which is slightly more practical.

After these weakenings, 3-DES was invented as a more secure SKES in order to take advantage of existing DES hardware (DES was designed for hardware acceleration). Basically, it's just DES applied to DES applies to DES applied to the plaintext. Note that applying ciphers repeatedly doesn't always provide more security (consider a substitution cipher, for example), but in this case it provides reasonably good. Another variant is Double-DES, which is DES applied to DES applied to the plaintext (each DES function has its own key, so we have a 112-bit key overall).

Turns out Double-DES is vulnerable to a meet-in-the-middle attack. Basically, if \(E_{k_1}, E_{k_2}\) are the two DES encryption functions in Double-DES and \(c = E_{k_2}(E_{k_1}(m))\), then \(E^{-1}(c) = E_{k_1}(m)\).

Suppose we have 3 known plaintext/ciphertext pairs \(\tup{m_1, c_1}, \ldots, \tup{m_3, c_3}\). For each possible 56-bit \(k_2\), decrypting \(c_1\) with our guess for \(k_2\), and store that plaintext and the key used in a table, indexed by the plaintext. Then, for each possible 56-bit \(k_1\), try encrypting the \(m_1\) with our guess for \(k_1\), and see if it matches any entry in the table. If there's a match, and those guesses for \(k_1, k_2\) also satisfy \(c_2 = E_{k_2}(E_{k_1}(m_2))\) and \(c_3 = E_{k_2}(E_{k_1}(m_3))\), then our guesses for \(k_1, k_2\) are the actual secret keys.

Due to this attack, Double-DES is only slightly better than plain DES, with about \(2^{57}\) bits of security and requiring about \(2^{56}(64 + 56)\) bits of memory to store the table. The main hard part is getting the ~1 exabyte of storage needed to mount this attack, but it's quite within the realm of possibility. We can even do a time-memory tradeoff and do \(2^{56 + s}\) DES operations in return for \(2^{56 - s}\) bits of memory, which can easily bring the memory requirements down to a manageable size while keeping the number of operations feasible.


Why do we need 3 pairs of known plaintext/ciphertext pairs rather than just 1 or 2? Well, for each key in the key space, we can think of the encryption function is just a random function (it's not, but this is a good enough assumption for our purposes).

Let \(k\) be the actual \(l\)-bit secret key (so \(E_k(m) = c\)) and \(k'\) (also \(l\) bits) be our guess for this key. If \(E_{k'}\) is a uniformly random function (from our assumption), then it clearly has a \(\frac 1 {2^L}\) probability of satisfying \(E_{k'}(m) = c\), where \(L\) is the number of bits in the plaintext - it might just randomly happen to output the right ciphertext given our particular plaintext. What's the probability that our guess is correct if it satisfies \(E_{k'}(m) = c\)?

Suppose \(E_{k'}(m) = c\) and \(k' \ne k\) (our guess for the key is incorrect, but gives the right ciphertext for our particular plaintext). Then the number of \(E_{k'}\) such that \(E_{k'}(m) = c\) is \(\frac{2^l - 1}{2^{L}}\) (\(2^l - 1\) is the number of keys in the keyspace that are not the actual key, and \(2^{L}\) is the number of possible plaintexts). This is essentially the number of incorrect keys that work for a single plaintext/ciphertext pair without actually being the correct key.

We want to use enough plaintext/ciphertext pairs such that the expected number of these incorrect keys is close to 0. Clearly, with \(t\) of these plaintext/ciphertext pairs, we have \(\frac{2^l - 1}{2^{Lt}}\) expected false keys. For Double-DES, one pair gives us \(\frac{2^{112} - 1}{2^{64}}\) expected incorrect keys, which is far too many. With 3 pairs, we get around \(\frac 1 {2^{16}}\) expected false keys, much more reasonable.

There's also triple DES, which is just applying DES three times now - \(c = E_{k_1}(E_{k_2}(E_{k_3}(m)))\), where \(c\) is the resulting ciphertext, \(m\) is the plaintext, and \(k_1, k_2, k_3\) are the three 56-bit DES keys. We don't have any proof that it's more secure than DES alone, but a meet-in-the-middle attack takes around \(2^{112}\) steps, around \(2^{64}\) message/plaintext pairs, and a huge table.

Given our plaintext \(m = m_1 \ldots m_t\), how do we encrypt it if the Block cipher modes of operation:

AES was developed as part of a public, open competition in 1997 to build a successor to DES. It needed to have 3 key sizes (128, 192, and 256 bits), have a 128-bit block size, and be efficiently implementable on hardware/software. While 128 bits is infeasible for the foreseeable future, the larger key sizes are intended to protect against potential future attacks, as well as quantum computers - Grover's algorithm can perform exhaustive key search in only \(2^{\frac 1 2 l}\) operations, where \(l\) is the number of bits in the key.

With 15 submissions initially, Rijndael won in the end. Rijndael became AES in late 2001.

Rijndael is an iterated block cipher, not a Feistel cipher. Specifically, a substitution-permutation network. Currently, there's no known attack better than exhaustive key search on AES.

We won't cover the technical details of AES in this course, but some more details are available in the slides


\(x \in_R S\) means that \(x\) is randomly and independently chosen from \(S\).

A hash function is a fixed function that transforms an input of arbitrary length and output a fixed-length string, called a digest. Some common hash functions are MD5, SHA1, and SHA512.

Formally, an \(n\)-bit hash function is \(H: \set{0, 1}^* \to \set{0, 1}^n\). A good hash function should satisfy three examples:

A Davies-Meyer hash function is a family of hash functions that uses block ciphers. Basically, given a fixed IV \(H_0\) and an \(n\)-bit block encryption function \(E_k\), we break up the plaintext into \(n\)-bit blocks \(x_1, \ldots, x_t\) (after appending 1 to the end and padding with 0 to the nearest block size), then \(H_i = E_{x_i}(H_{i - 1}) \lxor H_{i - 1}\) and \(H_t\) is the hash value.

A one-way function is a pre-image resistant hash function. A cryptographic hash function is a pre-image resistant and collision-resistant hash function.

A generic attack is an attack on a hash function that treats it as a black-box/ideal hash function. As a result, a generic attack must treat the hash function as a random function \(H: \set{0, 1}^* \to \set{0, 1}^n\) (an ideal hash function is a random function). In practice, actual random functions are too complex to represent in a reasonable way.

A generic attack for finding pre-images: select arbitrary \(x \in \set{0, 1}^*\) until we find \(H(x) = y\). Clearly, we expect about \(2^n\) time. A generic attack for finding collisions: select arbitrary \(x \in \set{0, 1}^*\) and store them in a table indexed by \(H(x)\), until a hash is found. By birthday paradox, we expect that to take about \(\sqrt{2^n}\) time and memory.


Checking for collisions generically, as mentioned last class, would take around \(\sqrt{2^n}\) time and memory. However, there's actually a generic attack for finding collisions that has the same order of running time (\(\sqrt{\frac{\pi 2^n}{2}}\)), but requires far less memory.

Basically, given \(H: \set{0, 1}^* \to \set{0, 1}^n\), we want to find \(x_1, x_2 \in \set{0, 1}^*\), where \(x_1 \ne x_2\) yet \(H(x_1) = H(x_2)\). How does this algorithm, known as the Van Oorschot/Weiner (VW) algorithm, work?

First, we assume that \(H: \set{0, 1}^n \to \set{0, 1}^n\) is a random function. Now, we'll search for \(x_1, x_2\) in \(\set{0, 1}^n\) rather than \(\set{0, 1}^*\), by defining a sequence of hash values \(w_0 \in_R \set{0, 1}^n, w_i = H(w_{i - 1})\) - a sequence consisting of a hash value, the hash of that hash value, the hash of the hash of that hash value, and so on. This is a random sequence since we're assuming \(H\) is a random function.

Clearly, there must be a repeat (collision) in the sequence somewhere, because an element of the sequence can only have one of \(2^n\) possible values. That means the sequence is finite and ends in a cycle. Let \(j\) be the smallest index such that \(x_j = x_i\) for some \(i < j\) - \(j\) must exist because there must exist a collision. Since , \(x_{j + l} = x_{i + l}\) for any \(l \ge 0\).

Let \(N = 2^n\). From the birthday paradox, we expect the value of \(j\) to be \(E[j] = \sqrt{\frac{\pi N}{2}}\), or around \(\sqrt{2^n}\).

Also, \(E[i] = E[j - i] = \frac 1 2 \sqrt{\frac{\pi N}{2}}\). If we assume, without loss of generality, that \(i \ne 0\), then \(H(x_{i - 1}) = H(x_{j - 1})\) and \(x_{i - 1} \ne x_{j - 1}\) (with very high probability) - a collision!

How do we actually find \(x_{i - 1], x_{j - 1}\), without using much storage?

If we think of the sequence of hashes is like a linked list, where \(H(x_i)\) gets the next element \(x_{i + 1}\), then this simply becomes a problem analogous to finding cycles in a linked list, which we could solve using something like the tortoise-and-hare algorithm with just constant memory. However, that means a lot of wasted time unnecessarily computing hashes, and it only tells us that there is a collision, not what the collision actually is.

A better way we could do this is to define some easily distinguishable property of the hash values that applies to around fraction \(\theta\) of all hash values, for a small value of \(\theta\). For example, the property might be "the first 20 bits are 0" (\(\frac 1 {2^20}\) of hash values satisfy this property). Then, we only store the hash values in the sequence that satisfy this property (these hash values are called "distinguished points").

When we encounter a distinguished point for the second time, we've found a collision somewhere behind us in the sequence - assuming there's a distinguished point between \(i\) and \(j\), we must eventually encounter it again when our sequence loops around to \(x_{j}\). We can assume there's a distinguished point between \(i\) and \(j\) because our \(\theta\) is chosen to be large enough that there is a very high probability that there is a distinguished point in the cycle - for example, for a 32-bit hash we would expect the cycle to contain about \(\frac 1 2 2^{32} = 2^{31}\) entries, so we might make our distinguishing property "the first 21 bits are 0", giving an expected value of \(2^{10}\) distinguished points in the cycle.

That does mean there's a very tiny chance the algorithm doesn't terminate, however - we can get around this by limiting the length of the path to \(2^n\) and retrying. There are a lot of other edge cases too (what if the starting value has a collision in the sequence?), but they all happen with negligible probability and can be handled with relatively few operations.


  1. Randomly select a \(x_0\) from \(\set{0, 1}^n\), store \(\tup{x_0, 0, \text{empty}}\) in the table.
  2. Let \(L = x_0\) (\(L\) is the most recently stored distinguished point).
  3. For \(j \ge 1\) do (phase one starts - finding duplicate points):
    1. Let \(x_j = H(x_{j - 1})\).
    2. If \(x_j\) is distinguished:
      1. Store \(\tup{x_j, j, L}\) into the table and let \(L = x_j\).
      2. If \(x_j\) was already in the table, we know \(x_i = x_j\), and stop this loop to begin phase 2 - finding the actual collision values.
  4. Let \(a\) be the index of the distinguished point right before \(i\), and \(b\) be the index of the distinguished point right before \(j\). This can be looked up as the third element of tuples in the table - the values of \(L\) associated with \(x_i\) and \(x_j\) respectively.
  5. Let \(l_1 = i - a, l_2 = j - b\). Assume without loss of generality that \(l_1 \ge l_2\). Let \(R = l_1 - l_2\) - how much longer the earlier chain is until the collision than the later chain.
  6. Compute \(x_{a + 1}, \ldots, x_{a + R}\). Let \(p_1 = x_{a + R}\) and \(q_1 = x_b\). Now the collision is an equal number of hashes away from \(p_1\) and \(q_2\).
  7. Let \(p_{k + 1} = H(p_k)\) and \(q_{k + 1} = H(q_k)\). Compute \(p_k\) and \(q_k\) until we find a value \(k\) such that \(p_k = q_k\). Then, the previous value \(p_{k - 1}\) and \(q_{k - 1}\) form a collision.

So overall, step 3 takes around \(\sqrt{\frac{\pi N}{2}} + \frac 1 \theta\) time, while the steps after it take \(\frac 3 \theta\). The memory usage is around \(\theta \sqrt{\frac{\pi N}{2}} \times 3n\) bits.


If we have a 128-bit hash, we might choose a \(\theta = \frac 1 {2^{32}}\), for about \(2^{96}\) distinguished points. With this, we would expect around \(2^{64}\) operations and \(2^{41}\) bits, or around 256 gibibytes - totally feasible on modern hardware. This tells us that any 128-bit hashes are simply not collision-resistant today.

How do we parallelize the VW algorithm, given \(m\) processors? One naive way is to run the algorithm with different starting values on each processor (each with its own memory), and see which one finishes first. Turns out this uses time \(\frac{\sqrt{\frac{\pi N}{2}}}{\sqrt{m}} + \frac 4 \theta\) - not very good scaling.

Instead, we can store all of the distinguished points from the \(m\) processors in a central table - at each time unit, each processor adds a new unique hash value, until one of the distinguished points collide. This takes \(\frac{\sqrt{\pi N}{2}}{m} + \frac 4 \theta\) operations, which is about as well as you can expect this algorithm to be parallelized.

How do we find meaningful collisions in a hash function? Suppose \(m_1\) is "Alice owes Bob 1000 dollars" and \(m_2\) is "Alice owes Bob 10 dollars". Alice wants to find values \(m_1', m_2'\) that semantically mean the same thing as \(m_1, m_2\) (e.g., "Alice should give Bob $1000" and "Alice ought to pay Bob $10"), sign \(m_1\), make Bob agree to it, and then later claim to have signed \(m_2\). After Bob agrees to it, Alice can claim to have signed \(m_2\) instead, only owing 10 dollars rather than 1000 dollars. How can Alice find \(m_1', m_2'\) using the VW algorithm?

Partition \(m_1\) into \(n\) parts, with partitions at indices \(j_1, \ldots, j_n\). Let \(g_{m_1}(r): \set{0, 1}^n \to \set{0, 1}^*\) be \(m_1\) modified so that a space is inserted into \(m_1\) at position \(j_i\) if and only if the \(i\)th bit of \(r\) is 1. Note that \(g_{m_1}(r)\) should have the same meaning as \(m_1\), since we just added a space somewhere. Likewise, define \(g_{m_2}(r)\) for inserting spaces into \(m_2\).

Let \(S_0\) be the elements of \(\set{0, 1}^*\) that begin with 0, and \(S_1\) be the elements that begin with 1. Let \(f: \set{0, 1}^n \to \set{0, 1}^n\) as \(f(r) = \begin{cases} H(g_{m_1}(r)) \text{ if } r \in S_0 \\ H(g_{m_2}(r)) \text{ if } r \in S_1 \end{cases}\) - the hash of a message semantically equivalent to \(m_1\) if \(r\) begins with 0, and the hash of a message semantically equivalent to \(m_2\) if \(r\) begins with 1.

If we run the VW algorithm on \(f\) to find a colliding \(r\), we get a collision \(a, b \in \set{0, 1}^n\) with \(a \ne b\) and \(f(a) = f(b)\). With 50% probability, the first bit of \(a\) is different from the first bit of \(b\), assuming they're drawn from a uniform distribution (we can assume this because we assume the hash function is a random function).

Suppose that the first bits are different - then \(a = g_{m_1}(r)\) and \(b = g_{m_2}(r)\), and \(a, b\) are two messages, semantically the same thing as \(m_1, m_2\) respectively, that collide. If the first bits are not different, then we can just run the VW algorithm repeatedly with a different starting point until we get a collision that does have messages with two different first bits. Clearly, the running time is essentially the same for each iteration, and with a 50% probability of success, we can just run it a few times with different variants of \(g_{m_1}, g_{m_2}\) until we get a meaningful collision.


So now we've looked at the naive hash collision finding, the VW algorithm, the parallel VW algorithm, and the meaningful collisions with the parallel VW algorithm.

Now let's look at general methods for constructing hash functions. Merkle's meta method is one of these.

Given an input of \(n + r\) bits, we define a compression function \(f: \set{0, 1}^{n + r} \to \set{0, 1}^n\). We also have an IV value in \(\set{0, 1}^n\). Using these, we can use Merkle's meta method to define a hash function \(H: \set{0, 1}^* \to \set{0, 1}^n\).

Let \(x \in \set{0, 1}^*\). Let \(x_1 \ldots x_t\) be \(x\) broken into \(r\)-bit blocks (the last one should be zero-padded if needed). Let \(x_{t + 1}\) be the binary representation of the length of \(x\) in bits (technically, this limits the length of \(x\) to \(2^r\), but in practice \(2^r\) is so large that we can simply pretend \(x\) is unbounded in length).

Let \(H_0\) be the IV. For \(1 \le i \le t + 1\), we do \(H_i = f(H_{i - 1}, x_i)\). Then, \(H(x) = H_{t + 1}\).

Merkle's theorem says that if \(f\), the compression function, is collision-resistant, then so is \(H\). This means we only have to make a fixed input size function collision-resistant, rather than an unbounded-input-size function collision-resistant. That means we should usually try to attack the compression function \(f\) rather than the hash function itself.

This theorem can be proven using the contrapositive. Assume \(H\) is not collision resistant, so we can efficiently find a collision \(\tup{x, x'} \in \set{0, 1}^*\). Let \(\overline x = x_1 \ldots x_t\) and \(b\) be the length of \(x\) in bits and \(H_1, \ldots, H_{t + 1}\) be the iteration values for \(x\), and let \(\overline x' = x_1' \ldots x_{t'}'\) and \(b'\) be the length of \(x'\) in bits and \(H_1', \ldots, H_{t' + 1}'\) be the iteration values for \(x'\). Clearly, if \(b \be b'\), then \(x_{t + 1} \ne x_{t' + 1}'\), so \(f(H_t, x_{t + 1}) = f(H_t', x_{t' + 1}')\) and \(x_{t + 1} \ne x_{t' + 1}'\) - a collision! Now suppose \(b = b'\), so \(x_{t + 1} = x_{t + 1}'\) and \(t = t'\). If we work backwords from the last iteration toward the first iteration, we must eventually find some \(H_i = f(H_{i - 1}, x_i)\) and \(H_i' = f(H_{i - 1}', x_i')\) such that \(\tup{H_{i - 1}, x_i} \ne \tup{H_{i - 1}', x_i'}\), since \(x \ne x'\) - they must differ in at least one message block. This is a collision for \(f\). So, a collision is always computationally efficient to find for \(f\). So if \(f\) is collision resistant, \(H\) must be as well.


MDx is a family of iterated hash functions, the most well-known of which are MD4 and MD5. MD4 was a 128-bit hash invented by Ron Rivest in 1990, but it had some serious flaws - in 2004 someone found collisions by hand. Pre-image is still infeasible, however.

MD5 is another 128-bit hash, a strengthened version of MD4. Since its invention in 1991, we've found MD5 collisions in 31 seconds in 2006 on commodity hardware, while preimages can be found in \(2^{123.4}\) steps. However, MD5 is still in very wide use today.

SHA-1 is a 160-bit hash function invented by the NSA in 1993, and modified by the NSA again in 1994 to fix a classified weakness. In 2005, it was found that that fix increased the cost of finding collisions from \(2^{39}\) to \(2^{63}\). Collisions are feasible to find today, but nobody seems to have done it yet - we still want to move away from SHA-1 though, since it's theoretically feasible to find collisions. We don't know of any pre-image/second pre-image attacks on SHA-1 that are faster than the generic attacks. In about two weeks from now, SSL will no longer consider SHA-1 hashes for server certificates valid. SHA-1 is a Merkle-style hash with \(n = 160, r = 512\). Most of the design decisions are classified, because NSA, and the hash operations themselves are rather messy.

In 2001 the NSA proposed SHA-2, a family of hash functions with three variants, each with different output sizes - SHA-256, SHA-384, and SHA-512. Note that SHA-2 has the same security against collisions as AES-128/AES-192/AES-256. As of this writing, no attacks stronger than the generic attacks have been found, but there are concerns remaining because the design is pretty similar to SHA-1.

SHA-3 is a NIST hash function competition, much like the competition that resulted in AES. Starting with 64 candidates, Keecak was selected as the winner in 2012. Keecak isn't an iterated hash function, using a sponge construction method, and it isn't very widely used in practice due to most people being fine with the SHA-2 family.

NIST recommends we stop using SHA-1 if possible, though it's okay for HMACs/KDFs/RNGs. SHA-2 and SHA-3 are recommended.

Suppose we can find a meaningless collision in an iterated hash function \(\tup{x, y}\). How can we exploit this in practice?

Wang proposed an attack on SHA-1 that gives us a two-block collision (two two-block-long messages that hash to the same value) that takes only around \(2^{63}\) operations. Wang also proposed an attack on MD5 that gives us a one-block collision (two one-block-long messages that hash to the same value) in only \(2^{39}\) operations. We've actually applied these and found real collisions in SHA-1 and MD5. How do we exploit this concretely?

Suppose \(\tup{c_1, c_2}\) is a one-block MD5 collision (without loss of generality, assume the blocks don't contain ( or ), so it can fit into a Postscript literal). Now consider the following two Postscript files:

%%BoundingBox: 0 0 612 792                   
%%BoundingBox: 0 0 612 792                   

The two files are identical except for line 3, where we have <c_1> in the first and <c_2> in the second.

Clearly, the first one prints out the harmless message, while the second one prints out the harmful message. However, note that the second line has been padded with spaces so that the second block starts exactly at either <c_1> in the first message, or <c_2> in the second message (MD5 uses a block size of 64 bytes, and the file is assumed to be using Windows-style CR-LF line endings).

Suppose \(x\) and \(y\) have the same number of blocks in an iterated hash function. Let \(F(I, x) = H_t\) where \(I = H_0\) and \(H_i = f(H_{i - 1}, x_i)\). Clearly, if \(F(I, x) = F(I, y)\), then \(F(I, xz) = F(I, yz)\) for any string \(z\). In other words, given a collision \(\tup{x, y}\), \(\tup{xz, yz}\) is also a collision for any string \(z\). This is called a length extension attack.

Also, if \(F(I, x) = F(I, y)\), then \(F(I, zx) = F(I, zy)\) for any string \(z\) that is an integer multiple of the block size long. In other words, given a collision \(\tup{x, y}\), \(\tup{zx, zy}\) is also a collision for any blocks \(z\).

Note that each Postscript document can be made by prepending a block to one of <c_1>/<c_2>, and then appending the rest of the document. Since <c_1>/<c_2> collide, the Postscript documents themselves must also collide!

Suppose Alice sends Bob the first Postscript file, and asks Bob for a signature. Bob opens it in a Postscript viewer, sees the harmless message, then signs the MD5 hash of the document (because signing the whole document with public key crypto would be too unwieldy), and sends the signature to Alice. However, Alice now has a signature from Bob for the second Postscript document, which has the same MD5 hash but a different, harmful message - Alice has convinced Bob to sign something Bob didn't want to!

(Menezes now demonstrates an MD5 collision on a real document, where the MD5 hash of a Postscript file that results in a PDF saying "John didn't leave his hat in my office" collides with a Postscript file that results in a PDF saying "John should be given a mark of 100 in this course".)


;wip: slides up to slide 192 ish, MAC schemes (a lot like the cipher block chaining scheme used for symmetric ciphers),


Instead, we can append the key to the message before we hash it, rather than prepending it. This ensures that the attacker can't use a length extension attack anymore because they don't know what the secret key \(K\) is, but it requires the hash function to be collision resistant in order for the resulting MAC to secure. This is called the secret suffix method.

If the hash function isn't collision resistant, then we can efficiently find a collision \(\tup{x_1, x_2}\). Suppose that \(x_1\) and \(x_2\) have lengths that are multiples of the hash block length. Then \(H(x_1 . K) = H(x_2 . K)\), because the values being passed into the hash function start with two different but colliding values. Therefore, we've found a collision for the MAC - a collision attack!

To avoid the requirement that the hash function be collision resistant, we can use the envelope method, in which we prepend and append the key to the message before we hash it. This successfully avoids the length extension attack because the attacker doesn't know what \(K\) to append, and the collision attack because the attacker probably can't find a collision where both values start with \(K\), without even knowing what the value of \(K\) is.

The Hash MAC (HMAC) is a commonly used MAC scheme that . \(H_k(x) = H(K \oplus \mathrm{opad} . H(K \oplus \mathrm{ipad} . x))\), where \(\mathrm{ipad} = 0x36, \mathrm{opad} = 0x5C\) repeated for each byte of \(K\). The inner hash is just the secret prefix method, but hashing for the second time ensures that the hash can't . This can be proven to be secure if the compression function used in the iterated hash function \(H\) is itself a secure MAC scheme with fixed length messages and a secret IV.

HMAC is used a lot in industry, as part of IPSec, SSL, and TLS. In practice, we generally use SHA-1 for the HMAC hash function, and it seems to be unaffected by the known weaknesses in SHA-1.

A secure symmetric encryption scheme ensures confidentiality. A secure MAC scheme like HMAC ensures authentication (data integrity and data origin authentication). If Alice needs to send Bob a message with both confidentiality and authentication, one way to do this is Alice can send Bob \(\tup{c, t} = E_{k_1}(m), H_{k_2}(m)\), where \(k_1, k_2\) are secret keys pre-shared with Bob. This is called Encrypt-and-MAC. However, the issue here is that the MAC scheme isn't designed for confidentiality, only authentication, so theoretically, the HMAC might leak information about the plaintext. We still use this a lot in practice though, because most MAC schemes don't seem to leak any information about the plaintext.

A better way to do it would be to send \(\tup{c, t} = \tup{E_{k_1}(m), H_{k_2}(E_{k_1}(m))}\) (ciphertext, tag) instead - we tag the encrypted message using the HMAC, rather than tagging the plaintext. This can be proven to be secure if the encryption and MAC scheme are secure. This is called the Encrypt-then-MAC.

Authenticated encryption schemes try to provide both confidentiality and encryption in a more integrated package. The most common one in AES-GCM, which is faster than generic encrypt-then-MAC, as well as authentication-but-not-encrypted headers for data. It uses AES in counter mode (CTR mode - encrypting a counter using the key with AES, and then using the output of that as the CSPRNG for a stream cipher) and a custom MAC scheme, and requires the IV never be reused. One advantage of AES-GCM over something like the encrypt-then-MAC is that it's parallelizable and faster. AES-GCM outputs the IV, authenticated header, authenticated and encrypted message, and the 128-bit authentication tag.


AES-GCM is widely used in the real world for combined authentication and encryption, because it's fast and is proved correct by Iwata-Ohashi-Minematsu in 2012, under certain assumptions. In fact, Intel and AMD processors have special instructions to make AES-GCM implementations a lot faster.

Public Key Cryptography

The main drawback of symmetric cryptography is the key establishment problem - how do we distribute secret keys to all the parties that need it, but nobody else?

One way to solve this is to distribute keys over a secure channel point-to-point, like a face to face meeting, a trusted courier, or a key embedded in a smart card. However, it's hard to obtain a secure channel, and if we already had a secure channel, we could often just use that channel to communicate instead.

Another way is to give the key to a trusted third party, which distributes the secret key as desired. However, finding a trusted third party is quite difficult, and it's also a single point of failure.

There's also the key management problem - in a network of \(n\) users, each user needs one key for every other user, so there are \(O(n^2)\) keys needed.

Also, symmetric encryption can't achieve non-repudiation, because there are always multiple parties that could have written a given message (since the key needs to be shared with other parties in order to communicate). Theoretically we could do this by sharing a key only with a trusted third party that ensures non-repudiation, but that kind of defeats the point of using cryptography for non-repudiation in the first place.

Public key cryptography works somewhat differently. Using a public key cryptosystem, Alice and Bob first exchange authenticated (rather than secret) information, and then can communicate securely. It was first invented by Merkle, Diffie, and Hellman. It's much easier to exchange information in an authenticated way, then in a secret way. Intuitively this seems impossible, and Merkle's professor originally told him it was impossible back in 1975, but it is indeed possible.

The advantages of public key cryptography over symmetric cryptography are: only requiring an authenticated channel rather than a secured one, each user only has one keypair, a signed message can be verified by anyone, and signatures can achieve non-repudiation. The disadvantages are: private/public keys are usually a lot larger than symmetric cipher keys, making them harder to share in some cases, and public key cryptography is usually a lot slower.

Suppose Alice and Bob want to communicate securely. Alice and Bob want to establish a session key by communicating over a public, authenticated channel. We'll now describe Merkle's proof-of-concept public key cryptography idea.

Alice creates puzzles \(P_1, \ldots, P_N\) for some large \(N\), and each puzzle takes time \(t\) to solve. The solution to a given puzzle \(P_i\) is the session key \(k_i\) and the serial number \(n_i\), and Alice keeps track of the solutions to each generated puzzle. Alice sends all the puzzles to Bob, over the authenticated channel.

Then, Bob randomly selects a puzzle \(P_j\), solves it in time \(t\) to obtain \(k_j\) and \(n_j\), then sends \(n_j\) back to Alice. Since Alice knows \(n_j\), Alice can easily look up the corresponding \(k_j\). Now, both Alice and Bob know \(k_j\), which they can now use as the session key. In contrast, Eve doesn't know which puzzle has serial number \(n_j\), so Eve must try each puzzle to see if the serial number matches. That means on average Eve must solve \(\frac N 2\) puzzles, spending \(\frac{Nt}{2}\) time overall. For large enough values of \(N\) like \(N = 10^9\), it should become infeasible for Eve, while Bob can feasibly still solve the single puzzle.

One example of a Merkle puzzle is AES-encrypting the session key (128-bit) and the serial number (128-bit) (the serial number is appended to the session key twice, so we can check if the puzzle was solved correctly), with a randomly selected 40-bit string as the secret key. Then we can solve the puzzle by exhaustive key search in \(2^{40}\) operations.

More generally, we want each communicating party \(C\) to generate a key pair consisting of a public key \(P_C\) and a private key \(S_C\), such that it's hard to recover $S_AC from \(P_C\), then publishes \(P_C\) while keeping \(S_C\) secret.

If A wants to communicate confidentially with B, then A obtains an authentic copy of \(P_B\), then sends B the message encrypted with \(P_B\). Bob then decrypts the message using \(S_B\), which only B knows. Since public key cryptography is so slow, we usually implement this in practice by using public key cryptography to communicate a session key (which is a fixed-length, relatively short value), and then using symmetric cryptography to encrypt the message with that session key.

If A wants to sign a message, A just signs the message with \(S_A\), which only A knows. To verify the signature on the message, we obtain an authentic copy of \(P_A\) and verify the signature using \(P_A\). Since public key cryptography is so slow, we usually implement this in practice by hashing the message, and then using public key cryptography to sign the hash of the message (which is fixed-length, relatively short value).


Algorithmic Number Theory

Recall the fundamental theorem of arithmatic - ever integer \(n \ge 2\) has a unique prime factorization.

Just because it exists though, doesn't mean that prime factorization is efficient to find - as of this writing, we are not aware of any way to find the prime factorization in polynomial time - we need to do around \(O(\sqrt{n})\) operations, while the length of the input is around \(O(\log n)\) (the number of bits needed to write the input down in a reasonable encoding). This is still an open problem - whether we can factor numbers in polynomial time with respect to the length of the input.

However, it's quite simple to verify whether a given prime factorization of a number is correct - just multiply them together.

In this course, by running time we mean the worst case time complexity. and by efficient we mean polynomial time. We also allow average case time complexity in some cases, especially when we look at probabilistic algorithms.

Suppose we have two integers \(a, b\). Clearly, they have lengths \(O(\log a)\) and \(O(\log b)\) respectively. Since we usually look at length rather than the numerical value itself, we represent the length as \(k_a, k_b\) instead - the number of bits in each number.

Clearly, computing \(a + b\) takes \(O(k_a + k_b)\) - we always have to at least look at every bit of \(a\) and \(b\) to get the correct answer, and there are \(k_a + k_b\) bits total. Likewise, computing \(a - b\) is also \(O(k_a + k_b)\). We often assume without loss of generality that \(k_a = k_b\), and just call it \(k\)

Multiplication can be done naively in \(O(k^2)\), and modern algorithms can do it in as little as \(O(k^{\log_2 3})\) or \(k^{k \log k (\log \log k)}\) - it's still an open problem what the optimal value is.

For division, we can naively just repeatedly subtract the shifted version of the denominator from the numberator, zeroing one bit of the numerator each time. That means with the naive algorothm, we need to perform \(k\) subtractions, giving us an total time of \(O(k^2)\).

The GCD algorithm \(\gcd(a, 0) = 0, \gcd(a, b) = \gcd(b, a \mod b)\) continuously finds the remainder of dividing a number by another. Both numbers are positive and the second is strictly decreasing at each step. In fact, we can prove that the values are halved or lower after every two steps.

Since division/remainder takes \(O(k^2)\), and we're doing, in the worst case, \(k\) steps, we might expect that GCD takes \(O(k^3)\). However, GCD actually just takes \(O(k^2)\) time, because the values we're dividing are rapidly decreasing.

Let's say we're now doing modular arithmetic, so with the modulus \(n\), all values are within \([0, \ldots, n - 1]\). Now we can compute \(a + b \pmod n\) by computing \(a + b\) and subtracting \(n\) if \(a + b \ge n\), and we can compute \(a - b \pmod n\) by computing \(a - b\) and adding \(n\) if \(a - b < 0\). That means modular addition and subtraction are both \(O(k)\).

Multiplication in modular arithmetic can be done by multiplying normally, and then performing a division by \(n\) and finding the remainder - it takes \(O(k^2)\). Same goes for finding the inverse, \(\frac 1 a \mod n\)

Turns out, exponentiation can be done in \(O(k^3)\) time too - we can compute \(a^b \pmod n\) in just cubic time!


The naive way to compute \(a^m \mod n\) is to compute \(a^m\), then take that number mod \(n\). This potentially gets us a really large result, and is somewhat wasteful. Another slightly improved naive method is to let result = 1, then run result = (result * a) % n \(m\) times. This solves the issue of having really large intermediate results, but isn't polynomial time and is pretty inefficient.

An improved method is to use the squaring method, which decomposes \(m\) into a sum of powers of 2, then successively squares \(a\) and multiplies the result for each power:

A = a
result = 1
while m > 0:
    # at zero-index $i$, `A` is $a^{2^i}$ and the lowest bit of `m` is the $i$th bit of the original value of $m$
    if m & 2: # $m$ written as a sum of powers of 2 has a term with exponent $i$
        result = (result * A) % n
    A = (A ** 2) % n
    m >>= 1

Since there are at most \(k\) modular squarings and \(k\) multiplications, each of which is \(O(k^2)\), the worst case time is \(O(k^3)\).

Recall Fermat's Little Theorem, which says, among other things, that if \(a\) and \(n\) are coprime, and \(a^{n - 1} \mod n \ne 1\), then we know \(n\) is composite (if it is 1 though, we don't know if it's prime or not). This is a really useful primality test - we just need to choose the right \(a\) to prove that \(n\) is not prime.


RSA was invented by Rivest, Shamir, and Adleman in 1977.

To generate a key for a user U:

  1. U randomly selects 2 large distinct primes \(p, q\) of the same bit length \(k\). They're large so their product is hard to factor, and they're distinct so they can't jus tbe factored by square rooting the product. A good value of \(k\) is 1024, 2048, or 4096.
  2. U computes \(n = pq\) and \(\phi(n) - (p - 1)(q - 1)\).
  3. U selects an arbitrary \(e\) such that \(1 < e < \phi\) and \(e\) is coprime with \(\phi(n)\). A common value of \(e\) is 3 - we can just ensure \(e\) is coprime by ensuring it's prime.
  4. U computes \(d\) such that \(ed \equiv 1 \pmod{\phi(n)}\) and \(1 < d < \phi(n)\).
  5. U publishes the public key \(\tup{n, e}\), and keeps the secret key \(d\) to themselves.

To send an encrypted message from user U to user V (note that real implementations also need to handle things like side channel attacks):

  1. U obtains an authentic copy of V's public key \(\tup{n_V, e_V}\).
  2. U represents the message as \(0 \le m \le n_V - 1\) such that \(\gcd(m, n_V) = 1\) - the message is represented as a single, fixed-size number that is not one of the primes.
  3. U computes \(c = m^e_V \mod n_V\).
  4. U sends \(c\) to V.
  5. V computes \(m = c^d_V \mod n_V\). Note that no part of this requires \(U\)'s public or private keys.

Decryption is essentially computing \((m^e)^d = m \mod n\). Finding \(e\)th roots of a number mod \(n\) requires us to factor \(n\), which is hard. The hardness of solving for \(m\) is the basis for RSA's security.

Suppose \(p \mid m\). Then \(m \equiv 0 \pmod p\), so \(m^{ed} \equiv 0 \pmod p\), so \((m^e)^d = m \pmod p\).

Suppose \(p \nmid m\). Since \(ed \equiv 1 \pmod \phi(n)\), \(ed = 1 + k\phi(n) = 1 + k(p - 1)(q - 1)\) for some \(k\). By Fermat's Little Theorem, \(m^{p - 1} \equiv 1 \pmod p\), so taking both sides to the power of \(k(q - 1)\), we get \(m^{k(p - 1)(q - 1)} \equiv 1^{k(q - 1)} \pmod p\). Multiplying both sides by \(m\), we get \(m^{1 + k(p -1)(q - 1)} \equiv m \pmod p\), or \(m^{ed} \equiv m \pmod p\). By a similar argument, \(m^{ed} \equiv 1 \pmod q\). Since \(p\) and \(q\) are coprime (since they're distinct primes), \(m^{ed} \equiv m \pmod{pq}\), so \(m^{ed} \equiv m \pmod n\).


The obvious attack on RSA is to factor \(n\) to get \(p, q\), and then use the extended Euclidean algorithm to get \(d\) - we just need to compute the private key. After decades of research, however, the consensus is that it's a hard problem.

In other words, we can define the basic RSA attack as "given \(\tup{n, e}\), find \(p, q\) by factoring". Alternatively, "given \(\tup{n, e}\), compute \(d\)". Note that the compute-\(d\) attack is at least as hard as factor-\(n\) attack, because if we have \(p, q\), we can easily compute \(d\), but if we have \(d\), it's not particularly easy to get \(p, q\). More formally, there's a turing reduction from every compute-\(d\) problem to a corresponding factor-\(n\) problem - if we can factor \(n\) quickly, we can compute \(d\) quickly, by using the oracle to factor \(n\) to get \(p, q\), then compute \(\phi(n) = (p - 1)(q - 1)\) and \(ed \equiv 1 \pmod{\phi(n)}\).

Also, it turns out there's a turing reduction from every factor-\(n\) attack to a compute-\(d\) problem. Therefore, since factor-\(n\) and compute-\(d\) both turing reduce to each other, they are polytime equivalent (denoted \(A_1 = A_2\), where \(A_1\) is factor-\(n\) and \(A_2\) is compute-\(d\)). Knowing this is useful because factoring is a very well studied problem, and we've spent centuries convincing ourselves that it's a difficult problem.

The RSA problem is defined as: given \(\tup{n, e}\) and \(0 \le c \le n - 1\), find \(0 \le m \le n - 1\) such that \(c \equiv m^e \pmod n\). In other words, we're trying to find the \(e\)th root of \(c\) under mod \(n\). It turns out that the RSA problem turing reduces to factor-\(n\), so it's no harder than factoring. ;wip: why?

However, we'd really like the factor-\(n\) problem to turing reduce to the RSA problem, because that would mean the RSA problem is at least as hard as factoring. However, this is an open question, and nobody has managed to prove it yet. That said, we generally just assume the RSA problem is hard, because we want that to be the case and nobody'd solved it yet.

For RSA by itself, a dictionary attack is posible - if the plaintext space is small and predictable enough, the attacker can precompute every possible plaintext/ciphertext pairs. To prevent this, we can append/prepend a random salt (e.g., a random 128-bit string) to the plaintext before encrypting it, then remove it after decrypting the ciphertext (e.g., by trimming it off of the decryption result to get the plaintext). That means the plaintext space is no longer small and predictable.

For RSA by itself, a chosen plaintext attack is also possible - given \(c\) and a decryption oracle \(f(c') = m'\) for any \(c' \ne c\) (we can decrypt any ciphertext except the one we're given), we can compute \(m\) from \(c\). How does this work?

Suppose without loss of generality that \(\gcd(c, n) = 1\) (because if \(\gcd(c, n) \ne 1\), then it must be a factor of \(n\), so we could easily find \(p, q\) and compute \(d\)).

Now, select an \(2 \le x \le n - 1\) such that \(\gcd(x, n) = 1\) (again, if the GCD wasn't 1 then we've factored \(n\) and are done). Let \(\hat c = cx^e \pmod n\), so we can obtain \(\hat m\) using the decryption oracle: \(\hat m = f(\hat c)\). Since \(\hat m \equiv \hat c^d \pmod n\) and \(\hat c \equiv cx^e\), \(\hat m \equiv (cx^e)^d \equiv c^d x^{ed} \equiv mx \pmod n\). Therefore, \(m = \hat m x^{-1} \mod n = f(cx^e) x^{-1} \mod n\).

To mitigate this, we can make sure decryption routines only return results if the message is formatted correctly (e.g., a hash of the rest of the message included with the message is correct). That means the decryption oracle would be highly unlikely to give an answer for any ciphertext other than one resulting from a properly encrypted message. We add in the formatting before encrypting, and then strip it out after decrypting.


PKCS #1 V1.5 is a set of public key cryptography standards, and has a standard for RSA. It's not really that good in practice, but it's very widely used.

Among other things, it defines the correct way to format the plaintext. Let \(M\) be the original plaintext, \(m\) be the formatted plaintext, and \(c = m^e \mod n\) be the ciphertext. The format is as follows: the bytes 0x00 and 0x02 (the standard's version number), 8 or more random non-zero bytes (the salt, to prevent dictionary attacks), the 0x00 byte (to denote the end of the salt), followed by the message. The total length of the format can be up to the bit size of \(n\) - 2048-bit RSA means we can have total length of 256 bytes, and a maximum message size of 245 bytes (since there are at least 11 bytes before the message itself in the format).

A public key cryptosystem is secure if and only if it is semantically secure against chosen-ciphertext attacks by a computationally bounded adversary.

To simulate a computationally bounded adversary, the adversary gets a target ciphertext and a decryption oracle that can decrypt anything except that particular ciphertext, and is trying to learn anything about the plaintext besides its length. The decryption oracle rejects any decryptions that aren't properly formatted.

The PKCS #1 V1.5 RSA standard's formatting of messages does allow the decryption oracle to reject attacker-generated messages in practice, but there are theoretically weaknesses in this padding scheme.

A polynomial time algorithm has worst case time \(O(n^c)\) for a constant \(c\). An exponential time algorithm has worst case time \(O(2^{cn})\) for a constant \(c\). A subexponential time algorithm has worst case time \(2^{o(n)}\). Polynomial time (polytime) is efficient, Subexponential time is inefficient, and exponential time is very inefficient.

Subexponential algorithms can often be written in the form \(L_n[\alpha, c] = O(\exp((c + o(1)) (\log n)^\alpha (\log \log n)^{1 - \alpha}))\) where \(0 \le \alpha \le 1\) and \(c > 0\). Note that \(L_n[0, c]\) is polynomial time and \(L_n[1, c]\) is fully exponential.

The naive way of factoring a number \(n\) is to try dividing all the integers from 2 to \(\floor{\sqrt{n}}\). While this is polynomial with respect to \(n\), \(n\) means a different thing in this context - the value of the number (e.g., "5234" has value 5234), rather than how long the number is (e.g., "5234" needs at least 4 digits to store).

Besides trial divisions, there's also a number of other special-purpose factoring algorithms like the special number field seive, elliptic curve factoring algorithm, and so on, which are only efficient if \(n\) has a special form, like when \(n - 1\) has only small factors. To make sure RSA is robust to all known factoring attacks, we choose the RSA primes \(p, q\) at random and with the same bit length.

There are also new general-purpose algorithms that can factor arbitrary numbers in slightly better time. For example, the number field seive factoring algorithm can do it in \(L_n\left[\frac 1 3, 1.923\right]\), and before that, the quadratic seive factoring algorithm can do it in \(L_n\left[\frac 1 2, 1\right]\).

Because there haven't been improvements on these in several decades, most mathematicians believe that factoring is not solvable in polynomial time. That said, we've used these methods to successfully factor up to RSA-768. Today, RSA-1024 is the minimum length considered secure, while RSA-2048 and RSA-3072 is recommended. Also, Shor's algorithm can factor numbers very easily on quantum computers, though we currently don't know if large-scale quantum computers can be built at all, and depends on the details of quantum mechanics that we don't have yet.

In hand-wavy terms, 3-DES is roughly as secure as SHA-224 and RSA-2048 (112 bits of security), AES-128 is roughly as recure as SHA256 and RSA-3072 (128 bits of security), AES-192 is roughly as secure as SHA-384 and RSA-7680 (192 bits of security), and AES-256 is roughly as secure as SHA-512 and RSA-15360 (256 bits of security). Something having \(k\) bits of security means the fastest known attack for it requires around \(2^k\) operations.


To send a signed message from user U to user V (note that real implementations also need to handle things like side channel attacks):

  1. U computes \(s = m^d \mod n\), and publishes this as the signature (the signature is the \(e\)th root of the message).
  2. V obtains an authentic copy of U's public key \(\tup{n, e}\).
  3. V checks that \(s^e \equiv m \pmod n\) to verify that the signature is valid.

The RSA problem needs to be hard for this to work, because otherwise we could efficiently solve \(s^e \equiv m \mod n\), allowing us to easily compute \(s\) without \(d\), the secret key.

In decreasing order of strength, the adversary is trying to perform a total break (recovering the private key or forging signatures at will), selective forgery (forging signatures for some messages), and existential forgery (forging a signature for any one message). The adversary can perform a key-only attack, known plaintext, and chosen-plaintext attack.

Much like with MAC, a signature scheme is secure if it is existentially unforgeable by a computationally bounded adversary performing chosen-plaintext attack. We simulate computational boundedness by saying the adversary has access to a signing oracle - they can use the oracle to get an arbitrary message signed, as long as it's not one of the messages we actually want to forge a signature for.

Bellare and Rogaway proved proved that if the RSA problem is intractable and \(H\) is a random function, then RSA signatures with full-domain hashes is a secure signature scheme.

A full-domain hash function If we use something like SHA-256, there are \(2^{256}\) values, while the signature is between .

PKCS #1 V1.5 also defines a standard for RSA signatures. Invented in 1993, it's widely used today. The process looks like this:

  1. Alice computes \(h = H(m)\) where \(H\) is a hash function like SHA-1 or SHA-256.
  2. Alice formats \(h\) into a message \(M\) with length \(k\) (the RSA bit length): the bytes 0x00 and 0x01 (representing the version of the standard), the repeated bytes 0xFF (the padding, computed to be long enough to make the entire message have length exactly \(k\)), the byte 0x00 (to represent that the padding is over), the name of the hash function (e.g., "SHA-1", "SHA-256"), and finally the hash value itself.
  3. Alice signs \(M\) using \(s = M^d \pmod n\), then sends Bob \(\tup{m, s}\).
  4. Bob verifies the padding is valid, and checks the hash value in \(m\)

However, this signature scheme is actually really broken if you implement it the obvious way (forgetting to check for additional bytes after the padding), and we can even forge signatures by hand pretty easily.

Creative Commons License This work by Anthony Zhang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Copyright 2013-2017 Anthony Zhang.