
Understanding Hash Tables and Functions
Explore the concepts of hash tables, hash functions, and collisions in data structures and algorithms. Learn how hash tables operate, the importance of hash functions, and how collisions are managed. Discover the key considerations when implementing a hash table for efficient data storage and retrieval.

ECE264: Data Structures and Algorithms I (DSA 1)
Hash Tables

Hash Tables
- A hash table is an abstract data type that supports three operations: insert, search, and (usually some form of) delete
- The goal is to support these three operations in constant average time
- This can be achieved, under certain assumptions
- Certain other operations that are available with binary search trees (including balanced binary search trees) will not be available
- For example, there is no efficient way to loop through all the elements in a hash table in sorted order by key

Hash Functions
- The hash table will have a size that we'll call TableSize (or later, M); this represents the capacity of the hash table
- Typically, the data in a hash table will be stored in an array or vector of the appropriate size
- A common convention is to have the array use indices ranging from 0 to TableSize - 1
- A key for each element, or item, in the hash table is typically a string with some associated meaning (e.g., a person's username or social security number)
- The key is considered the identifier of the item, and duplicate keys are not allowed in a hash table (i.e., each item's key must be unique)
- It must be possible to map every key to an index in the appropriate range
- That is, each element gets mapped to an appropriate slot, or cell, of the array
- The mapping is done by a hash function
- The application of the hash function to map an element's key to a slot in the hash table is called hashing

Collisions
- Ideally, a hash function would be able to ensure that any two distinct keys get mapped to different cells
- The book points out that "since there are a finite number of cells, and a virtually inexhaustible supply of keys, this is clearly impossible"
- Even if you could conceivably allocate enough memory to store every possible key, this would be wasteful if only a small percentage of keys are likely to occur in practice
- Instead, we seek a hash function that distributes the keys evenly among the cells
- When two distinct keys get mapped to the same location, this is called a collision
- Thus, there are at least three choices you need to make when implementing a hash table:
  1. The (initial) size of the hash table
  2. How to implement the hash function
  3. What to do when two keys hash to the same position (i.e., when a collision occurs)

Bad Hash Function Examples
- One possible hash function: add the ASCII values of the characters in the key and take the remainder when the sum is divided by the table size
- Taking this remainder as the final step makes the hash function a modular hash function
- This simple function is bad for multiple reasons:
  - If the keys tend to be short strings, the range of likely hash values might be much smaller than the size of a typical hash table (i.e., most of the hash table will not be used)
  - Strings with different permutations of the same characters will all collide
- Another simple hash function (assuming keys consist of letters and spaces, with each character having a value from 0 to 26, and assuming all keys have at least three characters): (Key[0] + Key[1] * 27 + Key[2] * 729) % TableSize
- Even if the range of this hash function is exactly the table size (i.e., the table was created with this hash function in mind), this is still a poor hash function for multiple reasons:
  - One problem is that English is not random; some combinations are much more likely than others
  - Another problem is that if all keys start with the same characters, everything will collide
- The next slide shows a hash function that the book says is good (but it is likely just OK at best)

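To make the first (bad) function concrete, here is a minimal C++ sketch (the function name sumHash and the sample keys are our own choices, not from the book); note how anagrams collide and how short keys use only the low end of a large table:

    #include <iostream>
    #include <string>
    using namespace std;

    // Bad hash: add the ASCII values of the characters, then take
    // the remainder modulo the table size (a modular hash function)
    int sumHash(const string & key, int tableSize) {
        int sum = 0;
        for (char c : key)
            sum += c;
        return sum % tableSize;
    }

    int main() {
        cout << sumHash("stop", 10007) << endl;  // anagrams produce...
        cout << sumHash("pots", 10007) << endl;  // ...identical indices
        // Also, an 8-character key sums to at most 8 * 127 = 1016,
        // so most of a table of size 10007 is never used
        return 0;
    }
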
A Semi-Reasonable Hash Function

    int hash(const string & key, int tableSize)
    {
        int hashVal = 0;

        // Horner's rule: treat the characters as the coefficients
        // of a polynomial, evaluated at 37
        for (size_t i = 0; i < key.length(); i++)
            hashVal = 37 * hashVal + key[i];

        // Reduce to the table's index range; the repeated
        // multiplications may have overflowed to a negative value,
        // so correct for that
        hashVal %= tableSize;
        if (hashVal < 0)
            hashVal += tableSize;

        return hashVal;
    }

More About Hash Functions
- Commonly, hash functions will not always use all the key's characters, and the exact implementation might depend on expected properties of the keys
- Some hash functions do not apply to strings; for example, for some applications, the keys that are mapped to slots in the hash table might be integers or floats
- There has been a lot of research into appropriate hash functions for various applications
- The goal is that for a hash table of size M (where M is typically a prime number), the probability of two unequal keys colliding is as close as possible to 1 / M
- Some hash functions even consider the fact that there can be malicious generators of keys
- A universal hash function chooses randomly from a set of appropriate hash functions
- Of course, once selected, it must be consistently used for the hash table

Choosing the (Initial) Size
- You also must choose a size for the hash table
- According to an earlier edition of our textbook, good values tend to be prime numbers that are "not close to a power of 2"
- Exact powers of two can be bad because you will wind up hashing based on the low-order bits of the computed value (assuming a modular hash function)
- It is not clear to me that "not close to a power of 2" really matters; I have never seen any explanation of why "close to a power of 2" would be bad
- I have also heard, semi-recently, that some modern hash functions do allow, or perhaps even suggest, powers of two as the size
- In any case, the current edition of our textbook still points out that "it is often a good idea to ensure that the table size is prime"
- Conventional hash table implementations tend to use prime numbers for the table size
- Prime numbers tend to avoid bad circumstances involving many collisions due to unexpected patterns in the data
- For certain collision resolution strategies (which we will talk about shortly), this can be crucial

Handling Collisions
- As previously stated, when two distinct keys hash to the same location in a hash table, this is called a collision
- A major issue when implementing a hash table is how to handle collisions
- There are various methods that can be used; these are commonly referred to as collision resolution strategies

Separate Chaining: Insertions
- The simplest technique is to store a linked list starting at every slot of the hash table
- This is known as separate chaining, because items that collide are chained together in a list, typically implemented as a linked list
- Each slot in the hash table's array stores a linked list of the items that collide at that slot
- Insertions typically insert a new item at the start of the appropriate list
- This is because, in practice, recently inserted items are often more likely to be searched for, compared to items inserted a long time ago
- Although the textbook points this out, their sample code inserts the new item at the end of the list
- Hypothetically, if you don't need to check for duplicate keys, insertion with separate chaining can be worst-case constant time
- Assuming you need to check for duplicate keys, which is typical, you need to search the entire list, making insertion worst-case linear
- Nonetheless, we will see that insertion can still be average-case constant time under certain assumptions

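As a rough illustration (not the textbook's code), here is a sketch of chained insertion in C++, storing the table as a vector of linked lists of string keys; the class name ChainedHashTable and the Horner-style hash are our own placeholder choices:

    #include <algorithm>
    #include <list>
    #include <string>
    #include <vector>
    using namespace std;

    class ChainedHashTable {
    public:
        explicit ChainedHashTable(int size) : table(size) {}

        // Insert at the front of the appropriate chain, after
        // scanning the chain to reject duplicate keys
        bool insert(const string & key) {
            list<string> & chain = table[hashIndex(key)];
            if (find(chain.begin(), chain.end(), key) != chain.end())
                return false;           // duplicate keys not allowed
            chain.push_front(key);      // recently inserted items first
            return true;
        }

    private:
        vector<list<string>> table;

        // Placeholder hash: Horner's rule, as on the earlier slide
        int hashIndex(const string & key) const {
            int hashVal = 0;
            for (char c : key)
                hashVal = 37 * hashVal + c;
            hashVal %= static_cast<int>(table.size());
            return hashVal < 0 ? hashVal + table.size() : hashVal;
        }
    };
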
Separate Chaining: Searches
- A search will use the hash function to determine the appropriate linked list, which then must be traversed one node at a time
- The amount of time required for a search depends on how many items are located at the current slot
- In the worst case, every key will map to the same slot (i.e., they will all collide), and using a hash table is no better than using a simple linked list
- Therefore, like insertion, search is worst-case linear
- If keys are distributed somewhat evenly among cells, you can expect that the time required for a search will be reduced by a factor of approximately M, the size of the hash table
- Under certain assumptions, this can provide us with average-case constant searches
- Successful searches tend to be empirically faster than unsuccessful searches, since they do not wind up searching the entire list (we'll see more details soon)

Separate Chaining: Deletion
- To delete an item with a specified key, first, you must search for it
- If the item is found, you delete it from the linked list
- As we discussed during a previous topic, deletion from a linked list is a constant-time routine, once you have found the item to delete
- Therefore, the complexity of deletion is the same as that of search (worst-case linear, average-case constant under certain assumptions)
- Separate chaining is the only collision resolution strategy we will discuss that supports true deletion

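Continuing the hypothetical ChainedHashTable sketch from above, search and deletion both start by hashing to the right chain; the unlink itself is the constant-time part:

    // Additional members for the ChainedHashTable sketch above

    // Search: hash to the appropriate chain, then traverse it
    bool contains(const string & key) const {
        const list<string> & chain = table[hashIndex(key)];
        return find(chain.begin(), chain.end(), key) != chain.end();
    }

    // Deletion: a search followed by a constant-time unlink
    bool remove(const string & key) {
        list<string> & chain = table[hashIndex(key)];
        auto it = find(chain.begin(), chain.end(), key);
        if (it == chain.end())
            return false;   // key not present
        chain.erase(it);    // constant time once found
        return true;
    }
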
Separate Chaining: Load Factor
- Define λ to be the load factor, which, for separate chaining, is the average number of elements stored in a chain (i.e., stored at each cell)
- Under the assumption of simple uniform hashing, every slot will be equally likely, and we will have λ = N / M, where N is the number of items in the hash table, and M is the size of the hash table
- The hash function requires constant time, so an unsuccessful search takes time O(1 + λ), on average
- The asymptotic running time for successful searches will be at least as fast
- Typically, you might assume that half of the nodes in a chain will be searched, on average, for successful searches
- If you have an idea of how many elements the hash table will need to deal with, you can create a hash table for which M = Θ(N)
- Then, search and deletion will take constant time on average
- Insertion with separate chaining, assuming you need to check for duplicate keys (which is typical), also takes average-case constant time, since it requires a search
- The worst-case time complexity for search, deletion, and insertion if you need to check for duplicates, using separate chaining, is linear, which occurs if everything collides (i.e., if every item is mapped to the same slot)

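As a quick worked example (the specific numbers are ours, not the textbook's): storing N = 2000 items in a table of size M = 1000 gives λ = 2000 / 1000 = 2, so an unsuccessful search examines about λ = 2 nodes on average, and a successful search about 1 + λ/2 = 2 nodes (the hit itself plus roughly half of the rest of its chain); neither quantity depends on N once λ is fixed.
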
Separate Chaining: In Practice
- Typically, when using separate chaining, choosing M to be a prime number close to the expected number of input values is a reasonable choice
- If λ unexpectedly grows to be significantly greater than 1, rehashing, which we will discuss near the end of the topic, can be used
- Consider a hash table that is using separate chaining, for which the table size is M, and N items with random keys are inserted
- Also assume that a good hash function is being used
- Under these conditions, it can be shown that the probability that the number of keys in each list is within a small constant factor of N/M is close to 1
- Although separate chaining has some nice properties, it requires the implementation of a secondary data structure (i.e., a linked list)
- This requires extra memory, and it empirically slows down the supported operations, since linked lists require dynamic memory allocation

Open Addressing Schemes
- An alternative type of collision resolution strategy is to use one of a set of open addressing methods
- With an open addressing collision resolution strategy, if a collision occurs, other cells of the hash table are probed until an empty slot is found
- Such hash tables are therefore also referred to as probing hash tables
- Such a scheme cannot be used if it is possible that N (the number of inserted elements) might be greater than M (the size of the hash table)
- This can be avoided with rehashing, which will be discussed later
- As described in the book, when an item, x, is inserted into the hash table, the slots, or cells, are probed in a particular order: h0(x), h1(x), h2(x), ...
- The sequence is defined by: hi(x) = (hash(x) + f(i)) % TableSize
- The main requirement for f is that f(0) = 0

Linear Probing
- In linear probing, f is a linear function of i, and typically, f(i) = i
- This amounts to trying cells sequentially, with wraparound, in search of an empty cell
- When you implement it, you don't need to use the formula from the previous slide
- Rather, you start with the return value of the hash function, and if there is a collision, increase the index with wraparound until an empty cell is found
- When performing an insertion, you eventually find an empty cell (assuming the table is not full) and put the new element there
- When searching, you probe until you either find a cell with a key matching the one you are searching for (a hit), or you find an empty cell (a miss)

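Here is a minimal sketch of linear probing in C++, under our own simplifying assumptions: keys are non-negative integers, a sentinel value marks empty cells, the table never completely fills, and deletion is ignored:

    #include <vector>
    using namespace std;

    const int EMPTY = -1;   // sentinel marking an unused cell

    struct LinearProbingTable {
        vector<int> table;
        explicit LinearProbingTable(int size) : table(size, EMPTY) {}

        int hashIndex(int key) const { return key % table.size(); }

        // Probe sequentially (with wraparound) until an empty
        // cell is found; assumes the table is not full
        void insert(int key) {
            int i = hashIndex(key);
            while (table[i] != EMPTY && table[i] != key)
                i = (i + 1) % table.size();
            table[i] = key;
        }

        // Stop at a matching key (hit) or an empty cell (miss)
        bool contains(int key) const {
            int i = hashIndex(key);
            while (table[i] != EMPTY) {
                if (table[i] == key)
                    return true;
                i = (i + 1) % table.size();
            }
            return false;
        }
    };

With a table of size 10, the hash is the key's one's digit, so this sketch reproduces the toy example on the slides that follow.
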
Linear Probing Example
- To explain linear probing, we will step through a toy example
- We'll assume there are only ten slots in the hash table
- We'll assume we are inserting integers, and the key is the one's digit
- We'll insert the following integers, in the order specified here: 89, 18, 49, 58, 69
- Initially, all ten slots are empty:

  0: -   1: -   2: -   3: -   4: -   5: -   6: -   7: -   8: -   9: -

Linear Probing Example (continued)
- Insert 89: it hashes to slot 9, which is empty

  0: -   1: -   2: -   3: -   4: -   5: -   6: -   7: -   8: -   9: 89

Linear Probing Example (continued)
- Insert 18: it hashes to slot 8, which is empty

  0: -   1: -   2: -   3: -   4: -   5: -   6: -   7: -   8: 18  9: 89

Linear Probing Example (continued)
- Insert 49: it hashes to slot 9, which is occupied; with wraparound, the next cell probed is slot 0, which is empty

  0: 49  1: -   2: -   3: -   4: -   5: -   6: -   7: -   8: 18  9: 89

Linear Probing Example (continued)
- Insert 58: it hashes to slot 8, which is occupied; slots 9 and 0 are also occupied, so it lands in slot 1

  0: 49  1: 58  2: -   3: -   4: -   5: -   6: -   7: -   8: 18  9: 89

Linear Probing Example (continued)
- Insert 69: it hashes to slot 9, which is occupied; slots 0 and 1 are also occupied, so it lands in slot 2

  0: 49  1: 58  2: 69  3: -   4: -   5: -   6: -   7: -   8: 18  9: 89

Linear Probing Analyzed
- You can see that the time to find a free cell can get quite large
- Even in a relatively empty table, clusters can form; this particular type of clustering is known as primary clustering
- For linear probing, we can still define the load factor to be λ = N / M
- With separate chaining, the load factor represents the average size of each linked list
- With linear probing, the load factor represents the fraction of the table that is currently occupied (so it must be ≤ 1)
- It can be shown that the expected (i.e., average) number of probes for successful searches (hits) is (1/2) * [1 + 1 / (1 - λ)]
- For unsuccessful searches (misses) and insertions, it is (1/2) * [1 + 1 / (1 - λ)²]
- For λ = 1/2, this leads to 1.5 and 2.5; for λ = 3/4, this leads to 2.5 and 8.5; for λ = 9/10, this leads to 5.5 and 50.5
- The textbook suggests that the load factor, λ, should be kept below 1/2

Quadratic Probing (briefly)
- One alternative to linear probing is quadratic probing, although this strategy is probably less common in practice than the others we are discussing
- This is similar to linear probing, but the function f(i) is replaced with a quadratic function; the book suggests that f(i) = i² is a good choice
- Using linear probing, when the table is nearly full, performance degrades
- For quadratic probing, the situation is more drastic
- You may not find an empty cell if the table is more than half full, or even before that if the table size is not prime
- However, if the table is less than half full, and the table size is prime, we are guaranteed to find an empty cell
- Quadratic probing gets rid of primary clustering
- There is still a problem called secondary clustering, because the initial hash function determines the entire sequence for a probe
- Our textbook refers to secondary clustering as a "slight theoretical blemish"

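A sketch of the quadratic probe sequence in C++, using the common incremental trick that each probe advances by 2k - 1 (since k² - (k - 1)² = 2k - 1), so i² never has to be computed directly; the function name, the sentinel parameter, and the non-negative integer keys are our own assumptions:

    #include <vector>
    using namespace std;

    // Quadratic probing: find the cell where key belongs, stopping
    // at a match or an empty cell (marked by emptyMarker). Assumes
    // the table size is prime and the table is less than half full,
    // so termination is guaranteed
    int findPos(const vector<int> & table, int key, int emptyMarker) {
        int M = table.size();
        int i = key % M;        // f(0) = 0: start at the home slot
        int step = 1;
        while (table[i] != emptyMarker && table[i] != key) {
            i += step;          // advance by 2k - 1 at probe k,
            step += 2;          // which yields f(k) = k * k in total
            if (i >= M)
                i -= M;         // wraparound
        }
        return i;
    }
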
Double Hashing
- Another open addressing strategy is called double hashing
- For double hashing, one popular choice is: f(i) = i * hash2(X)
- Here, hash2 is an auxiliary hash function that is not allowed to evaluate to 0
- The book's suggestion for hash2(X) is: hash2(X) = R - (X mod R), where R is a prime smaller than the table size
- However, this does not make sense for most applications, because X is typically a string
- A popular choice based on other sources seems to be to use another hash function that uses % (TableSize - 1) and then adds 1, instead of using % TableSize
- In general, it is important that hash2(X) never evaluates to zero, and that all return values are relatively prime to the actual table size
- Both of these constraints will be satisfied using the method mentioned above (which was based on other sources) if TableSize is prime

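Here is a sketch of that variant in C++ (the function names and the multiplier 31 are our own placeholders; TableSize is assumed prime): the second hash is reduced mod (TableSize - 1) and incremented, so the step is always in the range 1 to TableSize - 1:

    #include <string>
    using namespace std;

    // Primary hash: Horner's rule, as before
    int hash1(const string & key, int tableSize) {
        int h = 0;
        for (char c : key)
            h = 37 * h + c;
        h %= tableSize;
        return h < 0 ? h + tableSize : h;
    }

    // Secondary hash: a different multiplier, reduced mod
    // (tableSize - 1) and then incremented, so the step is never
    // zero; with a prime table size, every step is relatively
    // prime to the table size
    int hash2(const string & key, int tableSize) {
        int h = 0;
        for (char c : key)
            h = 31 * h + c;
        h %= tableSize - 1;
        if (h < 0)
            h += tableSize - 1;
        return h + 1;
    }

    // The i-th cell probed: h_i(x) = (hash1(x) + i * hash2(x)) % M
    int probe(const string & key, int i, int tableSize) {
        return (hash1(key, tableSize) + i * hash2(key, tableSize)) % tableSize;
    }
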
Double Hashing Analyzed
- Double hashing avoids both primary and secondary clustering
- From another source: the average number of probes for successful searches (hits), when double hashing is used, is (1/λ) * ln(1 / (1 - λ))
- The average number of probes for unsuccessful searches (misses) and insertions is 1 / (1 - λ)
- For λ = 1/2, this leads to approximately 1.4 and 2; for λ = 3/4, this leads to about 1.8 and 4; for λ = 9/10, this leads to about 2.6 and 10
- These numbers of probes look considerably smaller than the equivalent numbers for linear probing, listed a few slides back
- However, probing with double hashing does require the calculation of a second hash function

Lazy Deletion
- With all forms of open addressing, standard deletion cannot be performed
- This is because the cell you would be deleting may have caused a collision during an earlier insertion, pushing another item further along the probe sequence
- If you truly delete the item, searches may stop at this cell by mistake and miss keys located later in the same probe sequence
- Therefore, lazy deletion must be used instead
- Instead of actually deleting the item, you keep it in the hash table, but mark it as deleted
- Searches will then skip lazily deleted items, but not stop searching
- In the textbook's implementation, every cell has a state, which is ACTIVE, EMPTY, or DELETED

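A rough sketch of the three-state scheme, loosely following the textbook's ACTIVE/EMPTY/DELETED convention (the struct layout, the function names, and the home parameter, which stands for the index computed by the hash function, are our own; it assumes linear probing and at least one EMPTY cell, so that probes terminate):

    #include <string>
    #include <vector>
    using namespace std;

    enum EntryState { ACTIVE, EMPTY, DELETED };

    struct Entry {
        string key;
        EntryState state = EMPTY;
    };

    // Search that skips lazily deleted cells: a DELETED cell does
    // not stop the probe, but it can never be a hit either
    bool contains(const vector<Entry> & table, const string & key, int home) {
        int i = home;
        while (table[i].state != EMPTY) {
            if (table[i].state == ACTIVE && table[i].key == key)
                return true;
            i = (i + 1) % table.size();
        }
        return false;
    }

    // Lazy deletion: mark the cell rather than emptying it
    void remove(vector<Entry> & table, const string & key, int home) {
        int i = home;
        while (table[i].state != EMPTY) {
            if (table[i].state == ACTIVE && table[i].key == key) {
                table[i].state = DELETED;
                return;
            }
            i = (i + 1) % table.size();
        }
    }
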
Comparing Collision Resolution Strategies
- According to another source, the three major collision resolution strategies are linear probing, double hashing, and separate chaining
- Linear probing is the fastest if the hash table is big enough so that it remains sparse
- Double hashing makes the most efficient use of memory, but it takes extra time to compute the second hash function
- Separate chaining is the simplest to implement, it supports standard deletion, and it makes rehashing less important
- However, separate chaining is empirically slower than the other choices
- Typically, when you are using a hash table, you care about speed, so I would suggest that separate chaining is not usually a good choice

Rehashing
- If the hash table gets too full, operations might take longer than desired; this can happen more often when lazy deletion is used
- One solution: when an insertion causes the hash table to become too full, create a bigger hash table with a new hash function and copy all non-deleted values from the original hash table into the new one
- This is called rehashing
- Although rehashing means that, once in a while, an insertion operation will take a relatively long time, this might not be as bad as it seems
- It is customary to make the new hash table at least twice the size of the original
- Often, a precomputed list of prime numbers, each more than twice the size of the previous one, is used to quickly determine the new table size (there is no need to compute prime numbers on the fly)
- If you receive 1000 times as much data as you expect, you only perform the rehashing step about 10 times
- The rehashing operation is linear with respect to the size of the original hash table (I'll explain why in class); therefore, it is also linear with respect to N, the number of items inserted so far
- It can be shown that insertion still has constant average time after considering the rare, linear rehashes
- One way to intuit this is to consider dividing up the time taken by the linear rehash over the linear number of insertions that have taken place so far

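As a sketch, rehashing for the hypothetical ChainedHashTable from earlier might look like the following; for brevity, nextPrime computes the next prime directly, although, as noted above, a real implementation would consult a precomputed list of primes:

    // Additional members for the ChainedHashTable sketch above

    // Helper: smallest prime >= n (trial division, for brevity)
    static size_t nextPrime(size_t n) {
        for (;; ++n) {
            bool prime = (n >= 2);
            for (size_t d = 2; d * d <= n; ++d)
                if (n % d == 0) { prime = false; break; }
            if (prime) return n;
        }
    }

    // Create a table at least twice as big, then re-insert every
    // item; linear in N plus the old table size
    void rehash() {
        vector<list<string>> oldTable = std::move(table);
        table = vector<list<string>>(nextPrime(2 * oldTable.size()));
        for (const list<string> & chain : oldTable)
            for (const string & key : chain)
                table[hashIndex(key)].push_front(key);  // keys already unique
    }
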
Some Important Applications of Hash Tables
- Compilers use hash tables to store symbol tables to keep track of information about variables
- Some game-playing programs (e.g., chess, checkers, Othello, etc.) use hash tables for transposition tables to avoid repeating searches
- Spell checkers use hash tables to detect errors; a word that misses (i.e., is not found in) the table of valid words can be flagged as a possible misspelling
- Database management systems sometimes use hash tables (or B+ trees, or other related data structures) as indexes to optimize queries
- Information retrieval systems (e.g., web search) use hash tables for inverted indexes, which map words or terms to lists of documents