Implementing Hashes for Practical Data Structures

COSC 2030, N.W.

Explore practical implementations of hashes in data structures, focusing on building a custom hash class for efficient data storage and retrieval. Covers collision resolution by open addressing (linear probing, quadratic probing, double hashing), rehashing, and an idealized algorithm used as a baseline for performance analysis.

  • Data Structures
  • Hashes
  • Implementation
  • Collision Resolution
  • Open Addressing




Presentation Transcript


  1. COSC 2030 Data Structures: Hashes, practical implementations

  2. Data structure for a hash.

The class itself will be very similar to the linked list and tree classes, but no node class/structure is necessary: the hash is simply an array of a size that works for the hash, and the size should be a prime number. We still need the standard functions, plus one very important one, findhash(std::string word), which creates the hash value from the word passed to it. This was talked about in the last lecture.

    template <class T>
    class myHash {
    private:
        // HASHSIZE is a number you are going to decide on;
        // it should be a prime number.
        const static int HASHSIZE = ?;   // placeholder: pick your prime here
        T * hash;    // the table itself
        int size;    // number of items stored
    public:
        myHash();    // constructor
        void insert(T item);
        bool find(T item);
        int getsize() { return size; }
        int findhash(std::string word);  // creates the hash value for word
    };
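
Since findhash was covered in the last lecture, here is only a minimal reminder sketch of one common choice, a polynomial rolling hash over the characters of the word. The multiplier 31 is an assumption, a conventional constant rather than anything the slides mandate:

    // A minimal findhash sketch: polynomial rolling hash.
    // The multiplier 31 is a common choice, not fixed by these slides.
    int findhash(std::string word) {
        unsigned long h = 0;
        for (char c : word) {
            h = (h * 31 + static_cast<unsigned char>(c)) % HASHSIZE;
        }
        return static_cast<int>(h);
    }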

  3. Constructor and destructor

Assuming you are going to pick a very large HASHSIZE, at least 133168 rounded up to the next prime (or doubled), we need the table to live in the heap, not in stack space. The constructor allocates the memory and initializes it; the destructor is very simple.

    myHash() {
        hash = new T[HASHSIZE];   // allocate on the heap
        size = 0;
        for (int i = 0; i < HASHSIZE; i++) {
            hash[i] = "";         // initialize; assumes T is a string-like type
        }
    }

    ~myHash() {
        delete [] hash;
    }

  4. Basic version

    void insert(T item) {
        int key = findhash(item);
        if (hash[key].empty()) {
            hash[key] = item;
        } else {
            // we have a problem.
        }
    }

    bool find(T item) {
        int key = findhash(item);
        if (hash[key].compare(item) == 0) {
            return true;    // found it
        } else if (hash[key].empty()) {
            return false;
        } else {
            // we have a problem
        }
    }

The problem in both cases is a collision: what if the hash of this word collides with the hash of another word?
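
For concreteness, a hypothetical usage sketch, assuming T is std::string and assuming the collision cases above have been filled in (as the following slides do):

    myHash<std::string> table;
    table.insert("apple");
    table.insert("banana");
    if (table.find("apple")) {
        // "apple" hashed to a slot that holds "apple"
    }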

  5. Open Addressing

Separate chaining has the disadvantage of using linked lists. Open addressing resolves collisions by trying alternative slots in the hash table until an empty cell is found. In general we try cells in the order h_0(key), h_1(key), h_2(key), etc., where

    h_i(key) = [hash(key) + f(i)] % T,   with f(0) = 0.

  6. An Ideal Algorithm

Suppose that the f(i) are randomly generated. This isn't practical, because although we could easily insert our data, we would have trouble finding the data again. However, this algorithm is easy to analyze, and it gives us a good idea of the best performance we can expect from hashing with open addressing.

  7. Analysis

How can we analyze the performance of open addressing? Clearly, as we increase the number of data items N, more and more items will be crammed into the table, potentially slowing everything down. Equally clearly, increasing the table size T allows you to hold more data in an efficient manner. It turns out that the ratio of the two is the important quantity to analyze. This is called the load factor: λ = N / T.

  8. Analysis

For open addressing, the load factor λ has to be greater than 0 but less than 1. The load factor λ gives the probability that I will encounter a filled slot in the hash table if I choose slots at random. Thus the value 1 - λ is the probability that I will encounter an empty slot.

  9. Analysis (reminder)

So, what is the expected number of slots U looked at in an unsuccessful search, for a given load factor λ? We hash the key. If that slot is empty we stop; otherwise we go to the next random slot. If that is empty we stop; otherwise we go to the next random slot, and we continue in this fashion. Each probe finds a filled slot with probability λ, so the expected number of slots examined is the geometric series U = 1 + λ + λ² + ... = 1 / (1 - λ).

  10. Conclusions (reminder)

This is the best that open addressing can do; practical implementations are somewhat worse. In general it is best to keep the load factor no higher than 0.5. At that load factor an insertion examines on average U = 1 / (1 - 0.5) = 2 slots (an insertion costs the same as an unsuccessful search), and a successful search examines about 1.4 slots. Note how much better this is than using trees!

  11. Linear Probing

With linear probing, f(i) = i. Here is a hash table of size T = 10, where the entries 89, 18, 49, 58, and 69 have been inserted. The hash function is h(key) = key % 10. (Entries marked ** were placed after a collision; empty slots are omitted.)

    0: 49**
    1: 58**
    2: 69**
    8: 18
    9: 89

Throughout this talk we use a table size T = 10, although in practice it should be prime.

  12. Updated insert

    void insert(T item) {
        int key = findhash(item);
        if (hash[key].empty()) {
            hash[key] = item;
        } else {
            bool entered = false;
            while (!entered) {
                key = (key + 1) % HASHSIZE;  // linear probing, wrapping at the end
                if (hash[key].empty()) {
                    hash[key] = item;
                    entered = true;
                }
            }
        }
    }

Also make sure the key does not overflow the array: the % HASHSIZE above wraps the probe back to the start of the table.

  13. Updated find

    bool find(T item) {
        int key = findhash(item);
        if (hash[key].compare(item) == 0) {
            return true;     // found it
        } else if (hash[key].empty()) {
            return false;
        } else {
            while (true) {
                key = (key + 1) % HASHSIZE;  // linear probing, wrapping at the end
                if (hash[key].compare(item) == 0) {
                    return true;             // found it
                } else if (hash[key].empty()) {
                    return false;
                }
            }
        }
    }

Again, make sure the key does not overflow the array; the % HASHSIZE handles the wraparound.

  14. Problem with Linear Probing

The biggest problem with linear probing is that blocks of occupied cells start forming (this can be seen in the previous example). This is referred to as primary clustering, and it means that any key that hashes into a cluster will require several attempts to resolve the collision, and will then add to the cluster. This decreases efficiency.

  15. Graph of Analysis

Linear Probing (LP) is worse than the ideal.

[Graph: number of slots examined (y-axis, roughly 1.0 to 6.0) versus load factor λ (x-axis, 0.0 to 1.0). Four curves are shown: LP U, Ideal U, LP S, and Ideal S. The linear probing curves lie above the corresponding ideal curves, and the gap widens sharply as λ approaches 1.]

  16. Conclusions

Due to primary clustering, linear probing is not quite as good as the ideal situation. In general it is best to keep the load factor no higher than 0.5. Let's see if we can remove the primary clustering effect...

  17. Quadratic Probing

With quadratic probing, f(i) = i². This eliminates the primary clustering problem of linear probing.

  18. Example

With quadratic probing, f(i) = i². Here is a hash table of size T = 10, where the entries 89, 18, 49, 58, and 69 have been inserted. The hash function is h(key) = key % 10. (Entries marked ** were placed after a collision; empty slots are omitted.)

    0: 49**
    2: 58**
    3: 69**
    8: 18
    9: 89

  19. Updated insert/find

Change the key++ (which is linear probing): keep a count of how many attempts have been made, and use it for the quadratic probing. Note that the offset must be taken from the home slot rather than accumulated (key += attempts * attempts would add a sum of squares, not i²): use key = (home + attempts * attempts) % HASHSIZE. The % HASHSIZE also makes sure the key does not overflow the array. A sketch follows.
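
A minimal sketch of the quadratic-probing search loop, assuming the member names from the class above (hash, HASHSIZE, findhash); the insert loop changes in the same way:

    // Quadratic probing: probe home, home+1, home+4, home+9, ... (mod HASHSIZE)
    bool find(T item) {
        int home = findhash(item);
        for (int attempts = 0; attempts < HASHSIZE; attempts++) {
            int key = (home + attempts * attempts) % HASHSIZE;
            if (hash[key].compare(item) == 0) {
                return true;   // found it
            } else if (hash[key].empty()) {
                return false;  // hit an empty slot: item is not in the table
            }
        }
        return false;          // probed every offset without finding it
    }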

  20. Advantage of Quadratic Probing The performance of open addressing with quadratic probing is much closer to the ideal situation than with linear probing.

  21. Problems with Quadratic Probing

You have to be more careful with quadratic probing. With linear probing, performance degrades gradually as the table becomes full; with quadratic probing there is no guarantee that an empty cell will be found once the table is more than half full. Also, it is more crucial that the table size be a prime number: if it is not, the number of alternative locations can be severely reduced.

  22. Example

Suppose the table size is T = 16. Computing the probe offsets f(i) % T = i² % 16:

    1² % 16 = 1
    2² % 16 = 4
    3² % 16 = 9
    4² % 16 = 0
    5² % 16 = 9
    6² % 16 = 4
    7² % 16 = 1

So the only alternative locations will be at distances 1, 4, and 9 away. Thus, large parts of the table will be inaccessible.
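
A quick standalone check of this, using the slide's (deliberately bad, non-prime) table size of 16:

    #include <iostream>

    // Print the quadratic-probe offsets i*i % T for T = 16.
    // Only the distances 0, 1, 4, and 9 ever appear.
    int main() {
        const int T = 16;
        for (int i = 1; i < T; i++) {
            std::cout << i << "^2 % " << T << " = " << (i * i) % T << "\n";
        }
    }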

  23. Double Hashing

With double hashing, f(i) = i * hash2(key). Thus we apply a second hash function to the key and probe at multiples of its value. We must make sure that this second hash function never evaluates to 0! (Why? If hash2(key) were 0, every probe would land on the same slot.) A function such as hash2(key) = R - (key % R), with R a prime smaller than T, will often work well.

  24. Example

In this example hash2(key) = 7 - (key % 7). Here is a hash table of size T = 10, where the entries 89, 18, 49, 58, and 69 have been inserted. The hash function is h(key) = key % 10. (Entries marked ** were placed after a collision; empty slots are omitted.)

    0: 69**
    3: 58**
    6: 49**
    8: 18
    9: 89

  25. Updated insert/find

Change the key++ (linear probing) or the quadratic step to key += doublehash(word). The return value of doublehash(word) should be computed once, before the loop, for efficiency reasons. Then make sure the key has not overflowed the array (wrap with % HASHSIZE). A sketch follows.
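
A minimal sketch of the double-hashing search loop, again assuming the class's members; doublehash is the slides' name for the second hash function, assumed here to be the R - (key % R) form from slide 23:

    // Double hashing: probe home, home+step, home+2*step, ... (mod HASHSIZE)
    bool find(T item) {
        int key = findhash(item);
        int step = doublehash(item);        // computed once, before the loop
        for (int probes = 0; probes < HASHSIZE; probes++) {
            if (hash[key].compare(item) == 0) {
                return true;                // found it
            } else if (hash[key].empty()) {
                return false;               // empty slot: item is not present
            }
            key = (key + step) % HASHSIZE;  // wrap so the key never overflows
        }
        return false;                       // cycled through without finding it
    }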

  26. Conclusions

Double hashing has performance that is almost optimal (almost as good as the ideal algorithm we outlined earlier). However, calculating the second hash function does add some computational overhead. What if the probe sequences in open addressing grow to 10, 20, or even 30 slots in length? Hint: think about the depth of a tree search.

  27. Rehashing

As noted before, with open addressing, if the hash table becomes too full, performance can suffer (a lot). So, what can we do? We can double the hash table size, modify the hash function, and re-insert the data. More specifically, the new size of the table will be the first prime that is more than twice as large as the old table size.
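
A sketch of what rehashing could look like for the myHash class. Two assumptions beyond the slides: nextPrime is a hypothetical helper (the slides don't provide one), and the table size is an ordinary member (hashsize) rather than a const static, since it must change:

    // Grow the table to the first prime past double the old size
    // and re-insert every stored item.  Assumes hashsize is a mutable
    // member used by findhash, and nextPrime() is a hypothetical helper.
    void rehash() {
        int oldsize = hashsize;
        T * oldtable = hash;
        hashsize = nextPrime(2 * oldsize);  // first prime more than twice as large
        hash = new T[hashsize];
        size = 0;
        for (int i = 0; i < hashsize; i++) {
            hash[i] = "";                   // initialize the new table
        }
        for (int i = 0; i < oldsize; i++) {
            if (!oldtable[i].empty()) {
                insert(oldtable[i]);        // findhash now uses the new hashsize
            }
        }
        delete [] oldtable;
    }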

  28. Example

Here is a hash table of size T = 7, where the entries 13, 15, 24, 6, and 23 have been inserted. The hash function is h(key) = key % 7, using linear probing to resolve collisions. (Entries marked ** were placed after a collision.)

    0: 6**
    1: 15
    2: 23
    3: 24
    6: 13

Because this table is too full, enlarge it to size 17 and redefine the hash function.

  29. Example

Here is the hash table of size T = 17, where the entries 6, 15, 23, 24, and 13 have been re-inserted. The hash function is now h(key) = key % 17, using linear probing to resolve collisions. Only part of the table is shown.

    6: 6
    7: 23**
    8: 24**
    13: 13
    15: 15

  30. Complexity of Rehashing It takes linear time to rehash, since there are N elements to rehash, and the table size is roughly 2N. However, it happens infrequently, namely when almost N items have been inserted since the last rehash. So, this adds a negligible constant cost to each of the N insertions.

  31. Cryptographic hash functions

These are similar to what we have already been talking about, with a "minor" change that makes a huge difference:

1. It is deterministic, meaning the same message will always result in the same hash value.
2. It is efficient, meaning the hash function is capable of returning the hash of an input quickly.
3. It is one-way, meaning that it's computationally infeasible to derive the original message (the pre-image) from its hash value. In other words, given a hash, it should be extremely difficult to retrace the deterministic steps taken to reproduce the pre-image of that hash.
4. It is collision resistant, meaning it is computationally infeasible to find two different messages with the same hash value!
5. Any small change to the message results in a large change in the resulting hash value.
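
To illustrate properties 1 and 5, here is a small sketch using SHA-256. The slides don't name a specific hash function; this assumes OpenSSL's libcrypto is available (compile with -lcrypto):

    #include <openssl/sha.h>
    #include <cstdio>
    #include <cstring>

    // Hash a message with SHA-256 and print the digest in hex.
    static void printDigest(const char *msg) {
        unsigned char digest[SHA256_DIGEST_LENGTH];
        SHA256(reinterpret_cast<const unsigned char*>(msg), strlen(msg), digest);
        printf("%-8s -> ", msg);
        for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
            printf("%02x", digest[i]);
        printf("\n");
    }

    int main() {
        printDigest("hello");  // same input always yields the same digest
        printDigest("hellp");  // one changed letter: completely different digest
    }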

  32. Cryptographic hash functions (2)

What is collision resistance and why is it important? Collision resistance dictates that it should be extremely hard to find or generate two different messages (M, M') such that Hash(M) == Hash(M'). Collision resistance is very important because, without it, the whole usefulness of a hash function is lost: if an attacker is able to create collisions, they can pass off malicious files or data as having a valid hash, and thereby as being legitimate.

https://medium.com/@StevieCEllis/the-beautiful-hash-algorithm-f18d9d2b84fb

  33. Q & A
