
Bloom Filters
Learn about Bloom Filters, a data structure for efficient item existence checks in large datasets. Explore encoding methods, error estimation, tradeoffs, and optimal parameter selection.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Bloom Filters 1 Bloom-Filters S.Sioutas CEID@UPATRAS
Bloom Filters 2 Bloom Filters Lookup questions: Does item x exist in a set or multiset? Data set may be very big or expensive to access. Filter lookup questions with negative results before accessing data. Allow false positive errors, as they only cost us an extra data access. Don t allow false negative errors, because they result in wrong answers.
Bloom Filters 3 Bloom Filter [B70] Encoding an attribute a U Maintain a Bit Vector V of size m Use k hash functions (h1..hk) , hi: U [1..m] Encoding: For item x, turn on bits V[h1(x)]..V[hk(x)]. Lookup: Check bits V[h1(i)]..V[hk(i)] . If all equal 1, return Probably Yes . Else Definitely No .
Bloom Filters 4 Bloom Filter x V0 Vm-1 0 0 0 1 h1(x) 0 0 0 1 h2(x) 0 1 h3(x) 0 1 hk(x) 0 0 0
Bloom Filters 5 Bloom Errors a b c d V0 Vm-1 0 0 0 1 h1(x) 0 0 0 1 h2(x) 0 1 h3(x) 0 1 hk(x) 0 0 0 x didn t appear, yet its bits are already set
Bloom Filters 6 Error Estimation Assumption: Hash functions are perfectly random Probability of a bit being 0 after hashing all n elements: ( ) e m = / 1 1 nk kn = / kn m , e m Let p=e-kn/m. The probability of a false positive is: k kn ( 1 ) 1 1 ( )k k = = / kn m 1 1 f e p m Assuming we are given m and n, the optimal k is: ( ) ( ( ) ( ) m dk ) = / kn m exp ln 1 f k 1 e = / kn m ln g k e / kn m dg kn e dg m = + / kn m = = ln 1 e 0 (ln ) 2 k / kn m min 1 e dk n = = / k m n ( ) / 1 ( ) 2 . 0 ( 6185 ) f k min
Bloom Filters 7 Bloom Filter Tradeoffs Three factors: m,k and n. Normally, n and m are given, and we select k. Small k Less computations. Actual number of bits accessed (nk) is smaller, so the chance of a step over is smaller too. However, less bits need to be stepped over to generate an error. For big k, the exact opposite holds. Not surprisingly, when k is optimal, the hit ratio (ratio of bits flipped in the array) is exactly 0.5