Learning more about hash tables

Hash tables are one of the most commonly used data structures in programming. While I consider myself extremely comfortable using them in the code I write every day, up until this week I had never taken the time to consider how they actually work. In this article, I will give a brief overview of what hash tables are, how hash functions work, how to handle collisions, and the big O of hash table operations. Many of the screenshots in this post are from this Replit page that I made, where I constructed a hash table from scratch.

Overview

Hash tables are one of the most commonly used data structures. The only data structure I can think of that is more common is an array. It is really difficult to get through even some of the simplest coding exercises without encountering a hash table. Nearly every programming language makes use of hash tables, although each language tends to have its own name for them. For example, in JavaScript they are known as objects (object literals, not objects as in object-oriented programming), in Python they are called dictionaries, and in Ruby, Matz decided to keep it simple and call them hashes.

Hash tables use key and value pairs to store data, and one of their most important strengths is that these keys and values are easily readable by humans. For example, you might have a key called “species” with a value of “frog”. This makes them much easier for a new programmer to navigate than, say, a binary search tree.

Hash Functions

When you access a key in a hash map, it is often a string. For example, if every object in your hash map has a color property, you could call .color or ['color'] on an instance. However, our computers actually store those keys as numbers. In order to make sure that these keys are both human- and machine-readable, hash functions are used. These functions take human-readable strings and convert them into machine-readable numbers. This is far from the only use of hash functions, but it is the only one that we will be covering today.

A collision is when your hash function hashes multiple keys into the same integer. For example, let's say your hash function hashes ‘John’ and ‘James’ into 8. What happens then? Your initial thought might be that you should just write a better hash function, or that there is something wrong with the code, but that is not necessarily the case. While it is true that some hash functions are better than others and cause fewer collisions, collisions are basically inevitable. So, how can we handle them when they do happen? A couple of the more common approaches are called separate chaining and linear probing.

Separate chaining means that each index of the array holds its own collection of entries rather than a single entry. For example, we could have each index of the array be an array or a linked list, so that when a collision happens, the colliding key just gets added to the array or linked list at that index.

Linear probing takes a different approach. Rather than handling a collision by further nesting the data like separate chaining, we instead search through the array to find the next empty index and store the data there. While this is a little more complicated to implement, it does mean that we don't have to further nest our data. The downside of linear probing is that you are limited in the number of keys you can fit in your hash table.

Big O

Hash tables are extremely fast at completing some essential operations. The average case for search, insertion, and deletion is O(1). To my knowledge, no other commonly used data structure is so fast at all three of these operations. When you take into account the speed at which they complete these operations and how easy they are to use (especially for beginners), it is not surprising that hash tables are so ubiquitous. I'm very happy to have gone on a slightly deeper dive with hash tables, and am excited to try some of the more complex hash functions.