Hash Tables: A Deep Dive

By Ricky Zhang

Hash tables! What are they?

I'm sure we've all used them in one way or another, whether as a dict() in Python, a HashMap<>() in Java, or a Map() in JavaScript (as a side note, set() in Python is also implemented using a hash table).[1] But how do these hash tables actually work under the hood?

Introduction

First off, for those of you who don't know, a hash table is a data structure that implements the associative array[2] abstract data type: it stores (key, value) pairs where every key in the hash table is unique and distinct from the others. Insertion, deletion, and lookup/search operations all take a key as input and act on the pair associated with that key. Keys/values in hash tables are NOT ordered.

For example, consider the following code snippet of a hash table in Python:

hash_table = dict()

# insert pair (10, "hello") and (20, "world")
hash_table[10] = "hello"
hash_table[20] = "world"
# hash_table is now {10: "hello", 20: "world"}

# delete pair (10, "hello")
del hash_table[10]
# hash_table is now {20: "world"}

# update value at key 20 to "apples"
hash_table[20] = "apples"
# hash_table is now {20: "apples"}

# retrieve value at key 20:
value = hash_table[20]
# value is "apples"

And here's the cool part:

$$
\begin{array}{|c|c|c|}
\hline
\text{Operation} & \text{Average case} & \text{Worst case} \\
\hline
\text{Insert} & \Theta(1) & O(n) \\
\hline
\text{Delete} & \Theta(1) & O(n) \\
\hline
\text{Lookup} & \Theta(1) & O(n) \\
\hline
\end{array}
$$

All the operations have constant time average case!

Now, it's important to briefly mention that behind the scenes of a hash table lies a fixed size array where the (key, value) pairs are actually stored. Keep this in mind while reading the next section(s).

The Hash Function

$H: k \to \{0, 1, 2, \dots, m-1\}$

The hash function serves as the cornerstone of the hash table data structure. A hash function, in the context of hash tables, simply takes as input an arbitrary key value and mathematically calculates a hash value based on the input. Hash functions should demonstrate the following properties when calculating hash values:

  1. Deterministic: Must always produce the same hash value for the same key.
  2. Fixed range of outputs: A produced hash value $v$ should fall within the range $0 \le v \le m-1$ where $m$ is the length of the fixed size array.
  3. Uniformity: Every hash value in the output range should be generated with roughly the same probability.
  4. Minimised collisions: A good hash function should have minimal collisions where a collision is defined as when two (or more) unique keys map to the same hash value.
  5. Efficient: Hash value calculation should be computationally inexpensive.

Alright so now that we've established some properties of a good hash function, let's now see why these properties matter in relation to hash tables.

Do you remember when we talked about (key, value) pairs and the fixed size array? Well, these pairs are indexed within the array by first applying the hash function to the key and then using the resultant hash value as the index for the pair in the array.

Property 2 of the hash function now starts to make more sense. If we're using hash values as indexes for an array, then the hash values must be bounded by 0 and the last index of the array, or: $0 \le v \le m-1$ where $m$ is len(array).

Hopefully you can now see why the hash function is so important. It allows us to quickly get the index of a specific location within the hash table's array by simply hashing the input key and performing an operation on array[index]. We also know that indexing an array takes $O(1)$ constant time.
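To make this concrete, here's a toy hash function for string keys. It's just a sketch for illustration (the polynomial form and the multiplier 31 are my own choice, not what any particular language actually uses), but it is deterministic, cheap, and bounded to the range 0 to m-1:

def my_hash(key: str, m: int) -> int:
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % m  # stay bounded to 0..m-1 as we go
    return h

m = 10                        # length of the backing fixed size array
array = [None] * m

index = my_hash("apples", m)  # always the same index for "apples"
array[index] = ("apples", 1)  # an O(1) write straight into the array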

This is fantastic news! $O(1)$ time complexity for all operations is amazing! But, as we've seen before...the worst case is indeed $O(n)$, so what's the catch?

Nothing's perfect

Now, in property 4 of a hash function we talked about collisions: when two or more unique keys hash to the same value. This can sometimes happen (albeit rarely with well designed hash functions) and when it does, we have to deal with it. But by designing a hash function in consideration of properties 3 & 4, we can minimise the chance of collisions.

(Also for those of you wondering: yes, perfect hash functions with no collisions do exist.)[3]

Keeping collisions in mind, let's now see how we can handle them within the implementation of the operations.

Operations & Collision Resolution Methods

Let's begin by assuming we have a perfect hash function, one that never produces collisions, for the sake of visualising what the operations do in a perfect world.

In the following examples, we use integer keys but a hash function should be able to hash other data types such as strings.

Operations

Insert

(Note: when inserting a value with a key that already exists in the table, the old value associated with that key is simply updated to the new value)

[Figure: inserting two (key, value) pairs directly at their hashed indexes]

Delete

[Figure: deleting an entry by hashing its key and clearing that index]

Lookup

To look up a value, the input key is simply hashed and the position in the array indicated by the hash value is checked for the value.
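Here's what these perfect world operations might look like in Python, with h standing in for our imaginary collision-free hash function:

m = 8
array = [None] * m

def h(key):
    return key % m  # pretend this never collides for our keys

def insert(key, value):
    array[h(key)] = (key, value)  # also updates an existing key's value

def lookup(key):
    entry = array[h(key)]
    return entry[1] if entry is not None else None

def delete(key):
    array[h(key)] = None

insert(3, "cats")
print(lookup(3))  # cats
delete(3)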

Pretty simple right?

Now let's see what happens when there's a collision with a non-perfect hash function during insertion, in a separate example hash table.

Collision!
[Figure: keys 5 and 7 both hash to the value 2, causing a collision when (5, "fish") is inserted]

Oh no! What happened? As we can see in the figure, both keys 5 and 7 just so happen to hash to the same value 2 which results in a collision when the value "fish" is being inserted.

So...how do we deal with this? Well, there are several ways to approach handling collisions, but let's begin with one called "separate chaining".

Separate Chaining

This method involves storing a linked list at each index of the array, so when we have a collision during insertion, the new (key, value) pair is simply appended to the linked list at the index that has the collision.

Let's see how this works with the previous collision example:

[Figure: the collision resolved by appending (5, "fish") to the linked list at index 2]

So as we can now see in the figure above, "fish" is now appended onto the linked list at index 2 of the array. The linked list at index 2 now looks like:

(7, "cats") --> (5, "fish")

Now, this does have some implications for the rest of the operations (delete and lookup): instead of just hashing the input key and looking at the appropriate index in the array, we now also have to traverse/search through the linked list located at that index to find the desired key, since all the keys in the linked list hash to the same value.

You might now be able to see where the $O(n)$ worst case comes from...

If we were super-duper unlucky and all $n$ of our elements were stored in a linked list at a single index of the array, then we would obviously need to traverse that whole linked list in $O(n)$ time in order to delete or search for an element.
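Here's a minimal sketch of separate chaining, with plain Python lists standing in for the linked lists purely for brevity:

m = 8
buckets = [[] for _ in range(m)]  # one (initially empty) chain per index

def insert(key, value):
    chain = buckets[hash(key) % m]
    for j, (k, _) in enumerate(chain):
        if k == key:                 # key already present: update it
            chain[j] = (key, value)
            return
    chain.append((key, value))       # new key (or a collision): append

def lookup(key):
    for k, v in buckets[hash(key) % m]:  # traverse the whole chain
        if k == key:
            return v
    return None

def delete(key):
    chain = buckets[hash(key) % m]
    chain[:] = [(k, v) for (k, v) in chain if k != key]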

Instead of linked lists, other data structures may be used in separate chaining, but those won't be covered here; you can, however, have a read of them in the Wikipedia page for hash tables.[4]

Open Addressing

This is another collision resolution method, but unlike separate chaining, it does not involve introducing other data structures to resolve collisions.

Instead, all keys and values are kept within the array, and collisions are dealt with through a method called probing.

I know it sounds like the beginning of an alien abduction but trust me it's a pretty clever process.

Linear Probing

Suppose we try to insert an arbitrary (key, value) pair into our hash table but encounter a collision at some index $i$ (i.e. there's already an entry at index $i$ and the keys don't match). With linear probing, the next course of action is to simply check the next space in the array, at index $i+1$, and if that space is free, just insert into it. If not, continue searching at $i+2$, then $i+3$, and so on.

Sounds pretty simple right? That's because it is, at least on first impression, because there are a few implications of just chucking (key, value) pairs into the next available space in our array.

Insert

Suppose we have 4 arbitrary (key, value) pairs whose keys all just so happen to hash to the same index $i$. When inserting, we'll end up with a cluster of entries within the range

$\{i,\ i+1,\ i+2,\ i+3\}$

This concept of entries in the hash table clustering together is called primary clustering and can have a serious impact on the performance of operations on the hash table. This concept will be expanded upon later; for now, just internalise the fact that with linear probing, entries have a tendency to bunch together when inserting.

Insert

In this example, we are going to insert

  1. (6, "Carbon")
  2. (8, "Oxygen")
  3. (7, "Nitrogen")
  4. (1, "Hydrogen)

all of which have keys that hash to the value 1.

[Figure: the four pairs forming a cluster at indexes 1 through 4]
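In code, linear probing insertion might look something like this sketch (h is a stand-in hash function, and we assume the table never becomes completely full; real implementations resize well before that point, as we'll see later):

m = 8
table = [None] * m  # each slot holds a (key, value) pair or None

def h(key):
    return hash(key) % m

def insert(key, value):
    i = h(key)
    # probe forward, wrapping around, until we find a free slot or a
    # slot that already holds this key (in which case we update it)
    while table[i] is not None and table[i][0] != key:
        i = (i + 1) % m
    table[i] = (key, value)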

Lookup/Search

Now, when a search is conducted we of course first hash the key and look at array[i], where i is the index indicated by the hash value. If the key at array[i] matches our input key then we return the value at array[i]; if the key doesn't match, then we keep searching starting at array[i+1]. It's also important to note that when our search brings us to the end of the array, i.e. array[len(array)-1], we have to loop back and continue our search at array[0].

But what if the key we're looking for doesn't exist in our hash table? When do we stop our search?

We stop when we encounter an empty entry in the array, i.e. when array[i] == None, as this denotes the end of the cluster of entries.
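Continuing the sketch from above, lookup might look like this:

def lookup(key):
    i = h(key)
    while table[i] is not None:   # an empty slot marks the cluster's end
        if table[i][0] == key:
            return table[i][1]
        i = (i + 1) % m           # wraps back to index 0 past the end
    return None                   # key is not in the table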

Deletion

Ok let's talk about deletion in a linear probing hash table. Let's use the scenario from above, where we have a cluster of entries sitting in our hash table.

When we naively delete (7, "Nitrogen"), we now have a big gaping hole in our cluster of entries (which has now become two clusters). Now let's search for key 1 in our hash table.

Delete
[Figure: the cluster split in two by the hole left from deleting (7, "Nitrogen")]

Uh oh...we can clearly see that key 1 is in our hash table, but our search operation ended because we found an empty entry (i.e. where (7, "Nitrogen") was deleted) and we were not able to continue searching into the second cluster. So how do we handle this?

There are two ways to approach this:

  1. Reinsertion: After deleting an entry at array[i], reinsert all entries from array[i+1] until an empty entry.
  2. Ghost entry: After deleting an entry at array[i], insert a special symbol at array[i] to denote an element was deleted at that index.

Now, when using the ghost entry approach, the ghost entry essentially "fills" the empty hole that would otherwise be left by deleting the entry naively. So when searching, you can ignore the ghost entry and continue searching past it. When inserting, the to-be-inserted pair can safely be inserted at the index which holds the ghost entry.
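Continuing the same sketch, here's one way the ghost entry approach might look (GHOST is a sentinel name of my own choosing; these markers are often called "tombstones"):

GHOST = object()  # special symbol marking a deleted slot

def delete(key):
    i = h(key)
    while table[i] is not None:
        if table[i] is not GHOST and table[i][0] == key:
            table[i] = GHOST      # leave a marker, not an empty hole
            return
        i = (i + 1) % m

def lookup(key):
    i = h(key)
    while table[i] is not None:   # only a truly empty slot stops us
        if table[i] is not GHOST and table[i][0] == key:
            return table[i][1]
        i = (i + 1) % m           # probe straight past any ghosts
    return None

Insert can likewise treat a slot holding GHOST as free to overwrite.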

Primary Clustering and Load Factor

I briefly mentioned primary clustering before but let's now get a bit deeper into it.

So we now know what clusters in a hash table are, but why do they degrade performance? Well, imagine a hash table that contains a big, big cluster of entries of varying hash values that have merged together to form one giant cluster. Now, whenever we want to search our hash table, we have to linearly search this cluster to find the key we want. This is pretty bad, and you can probably see quite clearly that operations are quickly approaching $O(n)$ time.

On top of this, the larger the cluster, the higher the chance it's going to get even larger, since there will be a statistically higher chance that any given (key, value) pair inserted into the hash table hashes to an index somewhere inside the cluster and extends it, even if its hash value is not the same as those already in the cluster.

So what do we do?

Towards the beginning I mentioned that hash tables are implemented using fixed size arrays, but some of you might now be thinking...what if that array gets full? Well, I sorta lied to you: when a hash table reaches a certain threshold of "fullness", the size of the underlying array must be increased.

Or, more specifically, a new array is created and every (key, value) pair in the old array is reinserted/rehashed into the new array. Upon doing this, previously large clusters present in the old array can disappear, which improves the performance of operations on the new hash table.

This threshold/level of "fullness" is called the load factor of a hash table and is defined by this expression:

$\mathbf{load~factor~(\alpha)} = \frac{n}{m}$

where

  • $n$ is the number of entries present within the hash table.
  • $m$ is the size of the array.

Now, without getting too deep into it: rehashing is obviously a computationally expensive operation, so we don't want to constantly rehash our whole hash table, since the cost of rehashing would outweigh the benefit of reduced clustering in the new hash table. But we also don't want to wait until our table is almost completely full, since by then the performance degradation of operations will be too noticeable.

This is why it's important to come up with a reasonable $\alpha_{max}$ value to rehash at.
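As a rough sketch (continuing the linear probing example from earlier), resizing might be triggered like this. The 0.7 threshold is my own pick; for reference, CPython's dict resizes at a load factor of 2/3 and Java's HashMap defaults to 0.75:

ALPHA_MAX = 0.7

def maybe_resize():
    global table, m
    n = sum(1 for e in table if e is not None and e is not GHOST)
    if n / m <= ALPHA_MAX:        # real implementations track n in O(1)
        return
    old, m = table, m * 2         # grow the array (commonly doubled)
    table = [None] * m            # start fresh, ghosts and all gone
    for entry in old:
        if entry is not None and entry is not GHOST:
            insert(entry[0], entry[1])  # rehash every live entry

Calling maybe_resize() after every insert keeps the load factor at or below the threshold.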

Conclusion

Thanks for reading the first post on this blog! There's still quite a bit more to learn about hash tables; this post just serves as a light introduction to them (I also didn't want to make it super long). I recommend having a look at quadratic probing as a continuation of linear probing, and also reading up on actual implementations of hash functions to see how they work. The Wikipedia page on hash tables is also quite comprehensive on all the different methods and processes.[4]

Footnotes

  1. Set implementation in Python: GitHub

  2. Wikipedia page on associative arrays: Wikipedia

  3. Wikipedia page on perfect hash functions: Wikipedia

  4. Wikipedia page on hash tables: Wikipedia