
Understanding Hashing and Hash Tables
This content provides detailed information about hashing, hash tables, and hash functions in computer science. Learn how data can be organized efficiently using these concepts to perform operations in constant time. Discover the basics of hashing, the role of hash functions in generating hash values, and how hash tables store associated data values.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
David Stotts Computer Science Department UNC Chapel Hill
Hashing, Hash Table, HashMap
A means of organizing data that allows us to do a subset of the BST operations in O(1) time !! i insert, delete, find A means of organizing data that allows us to do a subset of the BST operations in O(1) time !! nsert, delete, find o ordering operations ( done rdering operations ( findMin done findMin, traversals ) can t be , traversals ) can t be
Hashing integer (the hash some data value (the key We intend to use that hash integer as an index into an array or table of associated data Hash Table HashMap hashing or hash table Hashing is the basic concept of computing an hash or hash value key ) hash value ) from Hash Table is the array where data is stored HashMap is a MAP ADT implemented via
Hash Function generates a hash value from a key hash ( key ) Used to implement MAP via a hash table get ( key ) is (generally) table [ hash(key) ] Hash Function is the computation that hash ( key ) int int get ( key ) from the MAP ADT table [ hash(key) ] in the implementation
0 0 1 1 Hash Table associated data values Want to MAP key jones Hash Table is an array of the 2 2 3 3 4 4 jones to data 4824173 4824173 5 5 j jones, 4824173 ones, 4824173 Hash function takes STRING So if hash( jones ) we have stored this association ( jones , 4824173) By putting the data pair into array cell 5 5 STRING and produce INT INT 6 6 hash( jones ) is 5 5 7 7 ( jones , 4824173) 8 8 9 9 10 10
0 0 smith, 6713490 smith, 6713490 1 1 a allen llen, 5610072 , 5610072 2 2 miller, 8225193 m iller, 8225193 We really want something like this MAP ( jones ) is 4824173 We really want something like this MAP ( jones ) is 4824173 3 3 t trane rane, 3770216 , 3770216 We have to get it this way table[ hash( jones ) ] is We have to get it this way table[ hash( jones ) ] is 4824173 4 4 4824173 5 5 j jones, 4824173 ones, 4824173 table table[ [ 5 ] 5 ] is is ( jones, 4824173 ) ( jones, 4824173 ) 6 6 g gordon ordon, 1377216 , 1377216 7 7 For this hash function What is What is For this hash function 8 8 k knox nox, 9937010 , 9937010 What is hash( What is hash( miller ) ? hash( knox hash( miller ) ? knox )? )? 8 8 2 2 b barnes arnes, 2165561 , 2165561 9 9 10 10
0 0 1 1 We often show the hash table with just the key stored, for simplicity We often show the hash table with just the key stored, for simplicity 2 2 b bob ob 3 3 j jane ane Remember there will be associated data stored with the key Remember there will be associated data stored with the key b bill ill 4 4 h hash( amy ) = 9 so we put amy data into array slot 9 ash( amy ) = 9 so we put amy data into array slot 9 5 5 hash( sam array slot 7 hash( array slot 7 sam ) ) = 7 = 7 so we put so we put sam sam data into data into 6 6 7 7 What happens if hash( lara What happens if hash( lara ) = 7 ? ) = 7 ? s sam am l lara ara ? ? sam sam is already there no room for collision is already there no room for lara lara 8 8 collision a amy my 9 9 10 10
1) What is a good hash function? c can we find one that spreads keys out well over table, minimizes collisions? 1) What is a good hash function? an we find one that spreads keys out well over table, minimizes collisions? 2) How big to make the table/array? i if we have more room, might there be fewer collisions? 2) How big to make the table/array? f we have more room, might there be fewer collisions? 3) How do we manage collisions? t they will happen, so what do we do? 3) How do we manage collisions? hey will happen, so what do we do?
0 0 1 1 Let table size be N Let table size be N 2 2 b bob ob hash( amy ) = 9 assume this is fast, not dependent on table size hash( amy ) = 9 assume this is fast, not dependent on table size 3 3 j jane ane b bill ill 4 4 f find( amy ) ind( amy ) ( or get( amy ) in MAP ) ( or get( amy ) in MAP ) 5 5 if LIST this is O(N) linear search if BST, this is O(log N) binary search if LIST this is O(N) linear search if BST, this is O(log N) binary search 6 6 s sam am 7 7 Hashing O(1) Hashing O(1) 8 8 Look in array slot Look directly in slot 9 Look in array slot hash( amy ) Look directly in slot 9 hash( amy ) a amy my 9 9 10 10
Hash function has to turn a key type We will (for this discussion) assume key type into int int key type key type is string string Some hashing is done with other key types, or extreme keys For example, git hashes every object you store So the key type is file contents Hash generates a 40-digit hex number
Hash must be fast to compute Hash must be fast to compute Hash must distribute keys evenly over the available array subscripts Hash must distribute keys evenly over the available array subscripts Ideally, two distinct keys should get two different hash values Ideally, two distinct keys should get two different hash values However, ideally is not the same as reality However, ideally is not the same as reality
No hash function is perfect Eventually two keys will hash to same number No hash function is perfect Eventually two keys will hash to same number How do we know this? Chicken Holes Chicken Holes
Chicken Hole Principle Chicken Hole Principle If you have and you must have a box with at least If you have 8 8 chickens (each laying 1 egg a day) and 5 5 nesting boxes, you must have a box with at least 2 chickens (each laying 1 egg a day) nesting boxes, 2 eggs in it eggs in it In general In general For if For E E items being put into if E > B items being put into B B slots, E > B then some slot will have slots, then some slot will have 2 2 or more items or more items a collision
There are 2 people in Philadelphia with the same number of hairs on their heads Avg Allow 10x to 1,500,000) Philly has 1.55 million heads Chicken holes Eggs There are 2 people in Philadelphia with the same number of hairs on their heads Avg head has Allow 10x avg to 1,500,000) Philly has 1.55 million heads Chicken holes B = 1.5million Eggs E = 1.55 million E > B QED head has 150,000 hairs avg for variance (hair counts from 1 150,000 hairs for variance (hair counts from 1 B = 1.5million diff hair counts E = 1.55 million heads E > B QED diff hair counts heads
In any group of 367 people, at least 2 of them have the same birthday. This one is obvious in the realm of what we normally contemplate or experience Diff birthdays Eggs In any group of 367 people, at least 2 of them have the same birthday. This one is obvious, since the numbers are in the realm of what we normally contemplate or experience Diff birthdays B = 366 Eggs E = 367 E > B QED , since the numbers are B = 366 (remember Feb 29) E = 367 people E > B QED (remember Feb 29) people
Just a side note Just a side note In any group of 23 people, there is a 50% chance that 2 have the same birthday In any group of 23 people, there is a 50% chance that 2 have the same birthday In a group of 75, the probability goes to 99.9% In a group of 75, the probability goes to 99.9% that 2 have the same birthday that 2 have the same birthday Nothing to do with CHP it s just probabilities
So # of collisions is affected by quality of hash function -- So # of collisions is affected by quality of hash function -- how well it distributes keys over the integers how well it distributes keys over the integers table structure -- -- table structure -- total number of array slots -- mathematical properties of the maximum index total number of array slots mathematical properties of the maximum index
Lets say our hash function is this take first character of the string return position of that char in the alphabet (minus 1) Example: Let s say our hash function is this take first character of the string return position of that char in the alphabet (minus 1) Example: hash( hash( ardvaark ardvaark ) a is position 1, ) a is position 1, hash( hash( tarheels tarheels ) t is position 20, 20 ) t is position 20, 20- -1 = 19 hash( data ) d is position 4, 4 hash( data ) d is position 4, 4- -1 = 3 hash( apple ) a is position 1, 1 hash( apple ) a is position 1, 1- -1 = 1 1- -1 = 0 1 = 0 1 = 19 1 = 3 1 = 0 0 collision collision
Why is this bad? Only 26 different range elements by CHP can only store 26 keys before collisions Why is this bad? Only 26 different range elements by CHP can only store 26 keys before collisions English words themselves first char is not well distributed over the alphabet Lots of s , m , t words Not many x , z , q words English words themselves first char is not well distributed over the alphabet Lots of s , m , t words Not many x , z , q words We will use it anyway, later, for ease of illustration
Other ideas Use middle char, better distributed than first Or use several chars (or all of them) and build an integer arithmetically Other ideas Use middle char, better distributed than first Or use several chars (or all of them) and build an integer arithmetically function hash (key, tabSize) { hval = 0; for (var i=0; i<key.length(); i++) { hval += key.charAt(i); } return hval % tabSize; } If keys short (say 8 If tabSize keys short (say 8- -15 chars) tabSize is large (say 10007) is large (say 10007) 15 chars) then the sum of the chars is small and will then the sum of the chars is small and will cluster at low end cluster at low end of table of table
Other ideas Maybe add up chars several times (based on Adjust final integer to Other ideas Maybe add up chars several times (based on tabSize Adjust final integer to tabSize tabSize) ) tabSize function hash (key, tabSize) { hval = 0; nt = tabSize/key.length(); for (var i=0; i<key.length(); i++) { hval += key.charAt(i); } return (nt*hval) % tabSize; } Now low slots might be underused if keys are short (say 1 Now low slots might be underused if keys are short (say 1- -5 chars) 5 chars) then sum of the chars is small and is multiplied into table. then sum of the chars is small and is multiplied into middle table. middle of the of the
Other ideas Spin through string using prime cycles Other ideas Spin through string using prime mults cycles mults to prevent to prevent function hash (key, tabSize) { hval = 7; for (var i=0; i<key.length(); i++) { hval = (hval*31) + key.charAt(i); } hval % tabSize; if (hval<0) { hval += tabSize; } return hval; } Do in 32 use of overflow Java Do in 32- -bit integer arithmetic, make use of overflow Java hashCode bit integer arithmetic, make hashCode( ) does similar to this ( ) does similar to this
Best to pick table size that is Load factor d defined as ratio of # Best to pick table size that is prime Load factor efined as ratio of #elts prime elts in table --------------- table size in table --------------- table size for example if we have table is about half for example if we have 5003 in a table with size then 5003 elements stored (so far) in a table with size 10007 then is approx. elements stored (so far) 10007 is is approx. 0.5 is ( 5003 / 10007 ) ( 5003 / 10007 ) 0.5 table is about half- -full ? depends on how collisions are handled full ? depends on how collisions are handled
Two main forms of handling collisions Hash into lists (buckets, chaining) e each array slot contains not a single element, but rather a list Linear probing e each array slot contains one element hash to a full slot, there is a plan for going on to some next slot to try How will this affect O(1) on insert, find ? Two main forms of handling collisions Hash into lists (buckets, chaining) ach array slot contains not a single element, but rather a list Linear probing ach array slot contains one element hash to a full slot, there is a plan for going on to some next slot to try
Table entry is a null, or a list of cells Table entry is a null, or a list of cells Each cell contains a data object key and associated data Each cell contains a data object key and associated data If a new key hashes to an empty slot, then start a new list If a new key hashes to an empty slot, then start a new list with that key data with that key data If a new key hashes to an occupied slot, then If a new key hashes to an occupied slot, then add that key data to the list add that key data to the list
andy andy 0 0 1 1 Keys: a andy, dennis zorba, claire wanda charles, fern, warren, cindi, xerxes, donna Keys: hash ndy, dennis, , zorba, claire, , wanda, , charles, fern, warren, cindi, xerxes, donna hash 2 2 charles charles cindi cindi claire claire 0 0 3 3 3 3 donna donna dennis dennis 25 25 4 4 2 2 22 2 2 22 fern fern 5 5 5 5 22 22 2 2 warren warren wanda wanda 22 22 23 3 3 23 xerxes xerxes 23 23 Let s use that bad hash function (char pos in alphabet), for ease of illustration 24 24 25 25 zorba zorba
X X andy andy 0 0 X X 1 1 Keys: a andy, dennis zorba, claire wanda charles, fern, warren, cindi, xerxes, donna Keys: ndy, dennis, , zorba, claire, , wanda, , charles, fern, warren, cindi, xerxes, donna X X 2 2 claire claire cindi cindi charles charles X X 3 3 donna donna dennis dennis X X 4 4 X X fern fern 5 5 X X warren warren wanda wanda 22 22 X X xerxes xerxes 23 23 Add at head of list X X 24 24 X X 25 25 zorba zorba
Is a BST worth it (instead of LIST) ? find is O(1) hash + O( LIST length) i insert is O(1) hash + O(1) add to head of LIST Is a BST worth it (instead of LIST) ? find is O(1) hash + O( LIST length) nsert is O(1) hash + O(1) add to head of LIST If list length is short, O( LIST length ) is about O(1) If list length is short, O( LIST length ) is about O(1) BST is over complicated for little gain, if any BST is over complicated for little gain, if any Better to concentrate on Better to concentrate on how to keep lists short how to keep lists short Make table size bigger, make hash function distribute over more slots Make table size bigger, make hash function distribute over more slots
Table entry is null, or a single cell Table entry is null, or a single cell Each cell contains a data object key and associated data Each cell contains a data object key and associated data If a new key hashes to an empty slot, then store a cell there If a new key hashes to an empty slot, then store a cell there with that key data with that key data If a new key hashes to an occupied slot, then needed) If a new key hashes to an occupied slot, then compute a next slot needed) compute a next slot to try (repeat as to try (repeat as
On collision, we try other slots On collision, we try other slots near by near by This is a systematic, repeatable procedure This is a systematic, repeatable procedure Linear If table[ Linear probing says (informally) If table[ hash(key) table[ table[ until one is open probing says (informally) hash(key) ] is full, then try next slot table[ hash(key)+1 table[ hash(key)+2 until one is open ] is full, then try next slot hash(key)+1 ] if full try next slot hash(key)+2 ] ] if full try next slot ] etc. etc.
andy andy 0 0 1 1 Keys: a andy, dennis zorba, claire wanda charles, fern, warren, cindi, xerxes, donna Keys: hash ndy, dennis, , zorba, claire, , wanda, , charles, fern, warren, cindi, xerxes, donna hash 0 0 3 3 25 2 2 22 2 5 5 22 2 23 3 2 2 3 3 4 4 claire claire Shows clustering or clumping where you get heavily used crowded parts, empty parts dennis dennis charles charles 5 5 fern fern 25 cindi cindi 6 6 7 7 8 8 22 2 donna donna 3 3 4 4 22 2 23 3 23 3 24 4 23 4 24 4 3 4 5 5 6 6 22 22 wanda warren xerxes wanda warren xerxes 5 5 6 6 7 7 23 23 24 25 24 25 zorba zorba
Clustering slows access It s like having to search a list in hashing to lists Clustering slows access It s like having to search a list in hashing to lists Solution: larger table size Table space in probing is like the list cells in chaining More space means more open slots for initial hash, less hopping to probe Load should be distributed hash function) Solution: larger table size Table space in probing is like the list cells in chaining More space means more open slots for initial hash, less hopping to probe should be distributed hash function) for probing (assumes well Load for probing (assumes well
Solution: need a custom hash function? Some data may have a form that makes clusters with some hash Solution: need a custom hash function? Some data may have a form that makes clusters with some hash funcs funcs, not with others , not with others Consider: If the hash uses, say, first 3 chars, then they all are going to 2 or 3 cells Consider: McDuff, MacLean, If the hash uses, say, first 3 chars, then they all are going to 2 or 3 cells McDuff, MacBeth MacLean, McKensie MacBeth, McBride, McDaniel, McKensie, McDermott, , McBride, McDaniel, MacGraw , McDermott, MacGraw, MacDonald, , MacDonald, For any hash For any hash fn fn, we can make data to cause such clusters No hash func is perfect, or even good for ALL data sets , we can make data to cause such clusters
Solution: probe more randomly Obviously some clustering comes from putting a key right after one it collides with Solution: probe more randomly Obviously some clustering comes from putting a key right after one it collides with (linear probing) (linear probing) Maybe we could probe farther away from the collision site and perhaps leave the slots near the collision open for future keys, maybe find an open area Maybe we could probe farther away from the collision site and perhaps leave the slots near the collision open for future keys, maybe find an open area General probing formula General probing formula ?? = ?(?) = ? = ????(???) + ?(?) Can get different probing patterns from our general formula by selecting different function Can get different probing patterns from our general formula by selecting different function ?(?)
Lets formalize this Let s formalize this A key defines a sequence of hash values A key defines a sequence of hash values ?? , , ?? , , ??, , ??, , We try each hash We try each hash val ?? = ?(?) = ? val in sequence until we get an open slot = ????(???) + ?(?) in sequence until we get an open slot this makes this makes ?? = = ????(???) the basic hash value the basic hash value
We get different probing patterns by defining different Linear Probing: We get different probing patterns by defining different ?(?) functions Linear Probing: ?? = ?(?) = ? , functions = ????(???) + ?(?) , ?(?) = ? ??? ? > ? Sequence: Sequence: hash(key)+0 hash(key)+1 hash(key hash(key)+3 hash(key)+0 hash(key)+1 hash(key)+ hash(key)+3 % table length )+2 2 % table length
Probe via skipping by squares Probe via skipping by squares ?? = ?(?) = ? , = ????(???) + ?(?) , ?(?) = ?? ??? ? > ? Sequence: 0: hash(key)+0 0 1: hash(key)+?? is hash(key)+1 2: hash(key)+ 3: hash(key)+ 8: hash(key)+ % table length Sequence: 0: hash(key)+ 1: hash(key)+ 2: hash(key 3: hash(key is hash(key)+1 )+?? is )+?? is is hash(key)+ is hash(key)+9 hash(key)+4 4 hash(key)+9 8: hash(key)+?? is % table length is hash(key)+64 hash(key)+64
Probe via skipping by powers of 2 Probe via skipping by powers of 2 ?? = ?(?) = ? , = ????(???) + ?(?) , ?(?) = ?? ??? ? > ? Sequence: hash(key)+0 0 1: hash(key)+?? is hash(key)+2 2: hash(key)+ 3: hash(key)+?? is 8: hash(key)+?? is % table length Sequence: 0: 0: hash(key)+ 1: hash(key)+ 2: hash(key 3: hash(key)+ 8: hash(key)+ % table length is hash(key)+2 )+?? is is hash(key)+8 is hash(key)+ hash(key)+8 hash(key)+4 4 is hash(key)+256 hash(key)+256
Probe via adding a secondary hash value Probe via adding a secondary hash value ?? = ?(?) = ? , = ????(???) + ?(?) , ?(?) = ? ?????(???) ??? ? > ? Sequence: Sequence: Let ?????(???) can t ever be 0 Let ?????(???) be 13 be 13 0: 1: hash(key)+ 2: hash(key)+(2*13) 3: hash(key)+(3*13) 8: hash(key)+(8*13) % table length 0: hash(key)+ 1: hash(key)+( (1*13) 2: hash(key)+(2*13) 3: hash(key)+(3*13) 8: hash(key)+(8*13) is hash(key)+104 % table length hash(key)+0 0 1*13) is hash(key)+13 is hash(key)+26 is hash(key)+39 is hash(key)+13 is hash(key)+26 is hash(key)+39 is hash(key)+104
Optimal table loading varies Optimal table loading varies We assume a well We assume a well- -distributing hash function distributing hash function Recall load is ( Load should be Load should be Recall load is ( # #elts for 1 for elts stored / #slots in table stored / #slots in table ) ) should be should be 1 Load for probing probing Load for chaining chaining Java uses Java uses chaining chaining implementation for implementation for HashMap HashMap Java selects of Java selects of 0.75 0.75 by default (you can override) by default (you can override)
Load should be should be ? ? Means what? Load Means what? Normally means Normally means no more than no more than Table loading is dynamic, changes as we add elements Table loading is dynamic, changes as we add elements Consider table size 100 We add first element Consider table size 100 empty empty load is load is 0/100 = 0 0/100 = 0 We add first element load is load is 1/100 = 0.01 1/100 = 0.01 We add 50 more elements load is We add 50 more elements load is 51/100 = 0.51 51/100 = 0.51 Now what?
Load goes over the now what? goes over the limit limit Load now what? We can With chaining the table never really runs out of space Even if table has a list in every array slot, We can always extend the lists We can continue to load the table With chaining the table never really runs out of space continue to load the table Cons Cons suffer w won t work for probing, as the table *will* really fill suffer slowing on t work for probing, as the table *will* really fill slowing of access operations, O( LIST length) of access operations, O( LIST length)
Load goes over the We can extend the array (table) size Resize: Allocate a new larger array double it is a good plan, to a prime beyond goes over the limit extend the array (table) size limit now what? Load We can Resize: Allocate a new larger array double it is a good plan, to a prime beyond now what? Rehash: For each item in the old table, hash it into the new table O(N) operation for N keys in the table Must rehash since hash function is based on table size Rehashing will shorten the chains, space out keys Load is cut in half this way (doubling table size) Rehash: For each item in the old table, hash it into the new table O(N) operation for N keys in the table Must rehash since hash function is based on table size Rehashing will shorten the chains, space out keys Load is cut in half this way (doubling table size) Java Java HashMap HashMap does this automatically does this automatically
We have discussed insert, find What about remove ? We have discussed insert, find What about remove ? Chaining remove is easy, it s just removing from a list O(1) to hash key O( LIST length ) to find item O(1) to Chaining remove is easy, it s just removing from a list O(1) to hash key O( LIST length ) to find item O(1) to unlink it unlink it
We have discussed insert, find What about remove ? We have discussed insert, find What about remove ? Probing remove is not simple c can t just hash/probe to find, then empty the table cell might well cause a gap in some probe chain Probing remove is not simple an t just hash/probe to find, then empty the table cell might well cause a gap in some probe chain lazy deletion is required can replace the removed key with some on search, on insert, lazy deletion is required can replace the removed key with some inactive on search, inactive on insert, inactive inactive marker marker inactive says inactive says says occupied, keep probing says open occupied, keep probing open, , free for use free for use
Why is load for Linear Probing? Insert and unsuccessful search: Successful search: Work these numbers If we let L = Why is load for Linear Probing? Insert and unsuccessful search: ( 1 + 1/(1 Successful search: ( 1 + 1/(1 Work these numbers If we let L = insert: ( 1 + 1/(1 search: (1+1/(1 ( 1 + 1/(1- -L)^2 ) ( 1 + 1/(1- -L) ) L)^2 ) L) ) insert: ( 1 + 1/(1- - )^2 ) is (1+4) is search: (1+1/(1- - )) is (1+2) is )^2 ) is (1+4) is 2.5 )) is (1+2) is 1.5 2.5 1.5 If we let L = If we let L = insert: search 1 + 1/(1- - : (1+1/(1 insert: ( search: (1+1 ( 1 + 1/(1 )^2 ) is (1+16) is /(1- - )) is (1+4) is )^2 ) )) is (1+4) is 2.5 is (1+16) is 8 8.5 .5 2.5 If we let L = If we let L = insert: ? ?? insert: ( ( 1 + 1/(1 1 + 1/(1- - ??)^2 ) )^2 ) is (1+100) is is (1+100) is 50.5 50.5 ?
Ignore anything past this Ignore anything past this