Assignment 7
The Great Firewall of Santa Cruz:
Bloom Filters, Linked Lists and Hash Tables
Prof. Da
ell Long
CSE 13S – Spring 2021
First DESIGN.pdf draft due: May 27th at 11:59 pm PST
Assignment (formally) due: June 4th at 11:59 pm PST
1 Introduction
War is peace. Freedom is slavery. Ignorance is strength.
—George Orwell, 1984
You have been selected through thoroughly democratic processes (and the machinations of your friend and hero
Ernst Blofeld) to be the Dear and Beloved Leader of the Glorious People’s Republic of Santa Cruz following the fail-
ure of the short-lived anarcho-syndicalist commune, where each person in turn acted as a form of executive office
for the week but all the decisions of that officer have to be ratified at a special bi-weekly meeting by a simple ma-
jority in the case of purely internal affairs, but by a two-thirds majority in the case of more major decisions. Where
it was understood that strange women lying in ponds distributing swords is no basis for a system of government.
Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony. It
was a very silly place, and so with He
Blofeld’s assistance you are now the leader of your people.
In order to promote virtue and prevent vice, and to preserve social cohesion and discourage unrest, you have
decided that the Internet content must be filtered so that your beloved children are not co
upted through the use
of unfortunate, hurtful, offensive, and far too descriptive language.
2 Bloom Filters
The Ministry of Peace concerns itself with war, the Ministry of Truth with lies, the Ministry of
Love with torture and the Ministry of Plenty with starvation. These contradictions are not
accidental, nor do they result from from ordinary hypocrisy: they are deliberate exercises in
doublethink.
—George Orwell, 1984
The Internet is very large, very fast, and full of badthink. The masses spend their days sending each other cat videos
and communicating through their beloved social media platforms: Twitter and Discord. To your dismay, you find
that a portion of the masses frequently use words you deem improper—oldspeak. You decide, as the newly elected
Dear and Beloved Leader of the Glorious People’s Republic of Santa Cruz (GPRSC), that a more neutral newspeak
is required to keep your citizens of the GPRSC content, pure, and from thinking too much. But how do you process
and store so many words as they flow in and out of the GPRSC at 10Gbits/second? The solution comes to you
illiant and pure mind—a Bloom filter.
© 2021 Da
ell Long 1
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton H. Bloom in 1970, and is
used to test whether an element is a member of a set. False-positive matches are possible, but false negatives are
not—in other words, a query for set membership returns either “possibly in the set” or “definitely not in the set.”
Elements can be added to the set but not removed from it; the more elements added, the higher the probability of
false positives.
A Bloom filter can be represented as an a
ay of m bits, or a bit vector. A Bloom filter should
utilize k different hash functions. Using these hash functions, a set element added to the Bloom
filter is mapped to at most k of the m bit indices, generating a uniform pseudo-random distribu-
tion. Typically, k is a small constant which depends on the desired false e
or rate ², while m is
proportional to k and the number of elements to be added.
Assume you are adding a word w to your Bloom filter and are using k = 3 hash functions,
f (x), g (x), and h(x). To add w to the Bloom filter, you simply set the bits at indices f (w), g (w),
and h(w). To check if some word w ′ has been added to the same Bloom filter, you check if the
its at indices f (w ′), g (w ′), and h(w ′) are set. If they are all set, then w ′ has most likely been
added to the Bloom filter. If any one of those bits was cleared, then w ′ has definitely not been added to the Bloom
filter. The fact that the Bloom filter can only tell if some word has most likely been added to the Bloom filter means
that false positives can occur. The larger the Bloom filter, the lower the chances of getting false positives.
So what do Bloom filters mean for you as the Dear and Beloved Leader? It means you can take a list of pro-
scribed words, oldspeak and add each word into your Bloom filter. If any of the words that your citizens use seem
to be added to the Bloom filter, then this is very ungood and further action must be taken to discern whether or not
the citizen did transgress. You decide to implement a Bloom filter with three salts for three different hash functions.
Why? To reduce the chance of a false positive.
You can think of a “salt” as an initialization vector or a key. Using different salts with the same hash function
esults in a different, unique hash. Since you are equipping your Bloom filter with three different salts, you are
effectively getting three different hash functions: f (x), g (x), and h(x). Hashing a word w , with extremely high
probability, should result in f (w) 6= g (w) 6= h(w). These salts are to be used for the SPECK cipher, which requires
a 128-bit key, so we have used MD5 1 “message-digest” to reduce three books down to 128 bits each. You will use
the SPECK cipher as a hash function, which will be discussed in §4.
2.1 BloomFilte
The following struct defines the BloomFilter ADT. The three salts will be stored in the primary, secondary,
and tertiary fields. Each salt is 128 bits in size. To hold these 128 bits, we use an a
ay of two uint64_ts. This
struct definition must go in bf.c.
1 struct BloomFilter {
2 uint64_t primary [2];
Primary hash function salt.
3 uint64_t secondary [2];
Secondary hash function salt.
4 uint64_t tertiary [2];
Tertiary hash function salt.
5 BitVector *filter;
6 };
2.2 BloomFilter *bf_create(uint32_t size)
The constructor for a Bloom filter. Working code for the constructor, complete with primary, secondary, and ter-
tiary salts that you will use, is shown below. Note that you will also have to implement the bit vector ADT for you
Bloom filter, as it will serve as the a
ay of bits necessary for a proper Bloom filter.
1Rivest, R.. “The MD5 Message-Digest Algorithm.” RFC XXXXXXXXXX): 1-21.
© 2021 Da
ell Long 2
Preethika Rangamgari
Preethika Rangamgari
1 BloomFilter *bf_create(uint32_t size) {
2 BloomFilter *bf = (BloomFilter *) malloc(sizeof(BloomFilter));
3 if (bf) {
4
Grimm 's Fairy Tales
5 bf ->primary [0] = 0x5adf08ae86d36f21;
6 bf ->primary [1] = 0xa267
d3116f3957;
7
The Adventures of Sherlock Holmes
8 bf ->secondary [0] = 0x419d292ea2ffd49e;
9 bf ->secondary [1] = 0x09601433057d5786;
10
The Strange Case of Dr. Jekyll and Mr. Hyde
11 bf ->tertiary [0] = 0x50d8
08de3818df;
12 bf ->tertiary [1] = 0x4deaae187c16ae1d;
13 bf ->filter = bv_create(size);
14 if (!bf->filter) {
15 free(bf);
16 bf = NULL;
17 }
18 }
19 return bf;
20 }
2.3 void bf_delete(BloomFilter **bf)
The destructor for a Bloom filter. As with all other destructors, it should free any memory allocated by the con-
structor and null out the pointer that was passed in.
2.4 uint32_t bf_size(BloomFilter *bf)
Returns the size of the Bloom filter. In other words, the number of bits that the Bloom filter can access. Hint: this
is the length of the underlying bit vector.
2.5 void bf_insert(BloomFilter *bf, char *oldspeak)
Takes oldspeak and inserts it into the Bloom filter. This entails hashing oldspeak with each of the three salts fo
three indices, and setting the bits at those indices in the underlying bit vector.
2.6 bool bf_probe(BloomFilter *bf, char *oldspeak)
Probes the Bloom filter for oldspeak. Like with bf_insert(), oldspeak is hashed with each of the three salts fo
three indices. If all the bits at those indices are set, return true to signify that oldspeak was most likely added to
the Bloom filter. Else, return false.
2.7 uint32_t bf_count(BloomFilter *bf)
Returns the number of set bits in the Bloom filter.
2.8 void bf_print(BloomFilter *bf)
A debug function to print out a Bloom filter.
© 2021 Da
ell Long 3
Preethika Rangamgari
Preethika Rangamgari
Preethika Rangamgari
Preethika Rangamgari
Preethika Rangamgari
Preethika Rangamgari
Preethika Rangamgari
Preethika Rangamgari
Preethika Rangamgari
Preethika Rangamgari
Preethika Rangamgari
Preethika Rangamgari
Preethika Rangamgari
Preethika Rangamgari
Preethika Rangamgari
Preethika Rangamgari
3 Hashing with the SPECK Ciphe
You will need a good hash function to use in your Bloom filter and hash table (discussed in §5). We will have
discussed hash functions in lecture, and rather than risk having a poor one implemented, we will simply provide
you one. The SPECK2 block cipher is provided for use as a hash function.
SPECK is a family of lightweight block ciphers publicly released by the National Security Agency (NSA) in June
2013. SPECK has been optimized for performance in software implementations, while its sister algorithm, SIMON,
has been optimized for hardware implementations. SPECK is an add-rotate-xor (ARX) cipher. The reason a ciphe
is used for this is because encryption generates random output given some input; exactly what we want for a hash.
Encryption is the process of taking some file you wish to protect, usually called plaintext, and transforming its
data such that only authorized parties can access it. This transformed data is refe
ed to as ciphertext. Decryption
is the inverse operation of encryption, taking the ciphertext and transforming the encrypted data back to its origi-
nal state as found in the original plaintext. Encryption algorithms that utilize the same key for both encryption and
decryption, like SPECK, are symmetric-key algorithms, and algorithms that don’t, such as RSA, are asymmetric-key
algorithms.
You will be given two files, speck.h and speck.c. The former will provide the interface to using the SPECK
hash function which has been named hash(), and the latter contains the implementation. The hash function
hash() takes two parameters: a 128-bit salt passed in the form of an a
ay of two uint64_ts, and a key to hash.
The function will return a uint32_t which is exactly the index the key is mapped to.
1 uint32_t hash(uint64_t salt[], char *key);
4 Bit Vectors
Symmetrical equations are good in their place, but ‘vector’ is a useless survival, or offshoot from
quaternions, and has never been of the slightest use to any creature.
—Lord Kelvin
You will reuse much of the bit vector implementation from assignment 5 for this assignment. If there were any
problems with your implementation,