Forgetful Bloom Filter and Its Usage in NoSQL Databases

the forgetful bloom filter and their use in nosql n.w

1 / 24

Embed Share

Explore the Forgetful Bloom Filter and its applications in NoSQL databases, focusing on guaranteeing idempotence in counters and introducing the Forgetful Bloom Filter for efficient management of updates.

erunn Follow

Uploaded on Jun 18, 2025 | 1 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

The Forgetful Bloom Filter (and their use in NoSQL Databases) Rajath Subramanyam, Indranil Gupta Luke Leslie, Wenting Wang University of Illinois at Urbana-Champaign 1 DPRG: http://dprg.cs.uiuc.edu

Key-value/NoSQL Databases and Counters Key-value and NoSQL Database Systems Growing segment in Industry (21% CAGR, $3.4 B by 2020) Offer Eventual consistency Orders of magnitude lower latency Apache Cassandra (Facebook), HBase (Yahoo/Hortonworks), Voldemort (LinkedIn), Riak (Basho), MongoDB, Used to implement counters Incrementally count things Likes, tweets, updates, events, etc. 2

Need: Idempotence In Counters Basic counter operation: <inc> (update or +1) Update identified uniquely by (client id, client update sequence number) Applications using such counters require a guarantee of idempotence. Exactly-once semantics Idempotence known to be impossible in distributed systems with message losses and failures Client sends udpate to server, but receives no response [Cassandra JIRA-2495] 3

Guaranteeing Idempotence In Counters Three possible solutions 1. Approach 1: Clients don t submit duplicate updates. Under-count 2. Approach 2: Server maintains list of all updates. Duplicate updates are checked and rejected. Too much space 3. Approach 3: Server maintains a Bloom filter to store all client updates. Less space and few false positives Grows large and continuously over time Can t delete old entries (updates) 4

Bloom Filter - Refresher Can t forget old entries Fills up over time Bloom filter size is fixed Cannot auto-scale without maintaining history Compact way of representing a set of items Checking for existence in set is cheap Some probability of false positives: an item not in set may check true as being in set Never false negatives On insert, set all hashed bits. On check-if-present, return true if all hashed bits set. False positives Large Bit Map 0 1 2 3 Hash_1 False positive rate low k=4 hash functions 100 items 3200 bits FP rate = 0.02% Key-K Hash_2 . . 6 9 Hash_m 111 5 127

Introducing: Forgetful Bloom Filter (FBF) Kept at server Maintains a moving window that stores recent updates Updates are automatically forgotten after a timeout FBF is self-adaptive to meet user-specified inaccuracy requirement 6

Simple FBF Past Bloom Filter Insert happens here Present Bloom Filter Future Bloom Filter Membership check happens here Time Now 7

Simple FBF Over Time Past Bloom Filter Insert Present and Future Check All 3 Present Bloom Filter Future Bloom Filter Now Time 8

Simple FBF Over Time Past Bloom Filter Insert Present and Future Check All 3 Present Bloom Filter Now Time 9

Simple FBF Over Time Past Bloom Filter Insert Present and Future Check All 3 Present Bloom Filter Future Bloom Filter (empty) Now Time 10

FBF Optimizations and Analysis Faster membership check Future and Past BFs do not overlap Only check consecutive pairs of BFs, starting from Future BF First hit results in a true answer Mathematical Analysis of FBF False Positive Details in Paper Used to decide when to auto-scale FBF 11

FBF Refreshing and N-FBF Refresh operation slides Bloom Filters left Happens periodically, every t time units t=Refresh period (configurable) Also configurable: number of Bloom filters N-FBF Have N Past Bloom Filters, 1 Present BF, and 1 Future BF Simple FBF = 1-FBF N-FBF Refresh Oldest Past Second-oldest Past Second-Oldest Past Third-oldest Past Newest Past Present Present Future Future Empty 12

Self-Adaptive N-FBF Given a user-specified SLO on (minimum) false positive probability Perform Dynamic Resizing of N-FBF Adjust t (refresh period) Adjust N, number of BFs in N-FBF Continuously measure false positives and use analysis to calculate False Positive Probability (FPP) Adjust t and N 13

Self-Adaptive N-FBF (2) Given a user-specified SLO on (minimum) false positive probability Continuously measure false positives and use analysis to calculate False Positive Probability (FPP) If measured FPP > 90% of SLO FPP (in danger) Increase space and sampling rate Multiplicative increase of N, and Additive decrease of t If measured FPP < 10% of SLO FPP (too conservative) Decrease space and sampling rate Additive decrease of N, and Additive increase of t 14

Integration Into Apache Cassandra Background In Cassandra Counters are a special data structure (column family) Clients can issue update operations (e.g., +1 s) on any key in that table Counters may be replicated 15

(Non-)Idempotence in Cassandra Cassandra v0.8-2.0: uses sharding and replication Client sends requests to same server Not idempotent Incorrect values can result from client retries (e.g., after a write timeout) or commit log replay Cassandra v2.1: eliminated remote and local shards, instead uses local locking Higher overhead due to locking Can still be non-idempotent due to client retries 16

FBFs in Cassandra Integrated into Cassandra v0.8-2.0 The following changes are required: Client-side Each client operation needs to be identified by a globally unique identifier, such as < client id, per- client sequence number>. Server-side An FBF associated with each counter column. Commitlog shards also carry unique request id (to avoid duplication) Server immediately propagates update to all other servers (who update their FBFs and commitlogs) When server receives an update Performs membership check for update id in FBF If present, update rejected Else, applied, and added to FBF, commitlog, and memtable Above steps are performed atomically 17

I. FBF Datastructure Experiments Optimized membership check lowers false positives Analysis predicts false positives well 18

Self-Adaptive FBF More Filters (N) Lower FPP Frequent Refresh Lower FPP 19

FBF Self-Adaptation Over Time As more elements are inserted, FPP goes up; But FBF adjusts N and t To keep FPP below SLO 20

II. FBF Integrated into Generic Key-Value Database 100% accurate when using FBF Default is 6% inaccurate FBF increases latency by at most 10% (2 ms)

FBF vs Very Large Bloom Filter (Recycled - RBF) FBF uses same space to achieve lower false positive probabilities 22

III. FBF Integrated into Apache Cassandra v2.0 100% accurate counts with FBF Default has 8% error 23

Takeaways New Datastructure called Forgetful Bloom Filter Forgets entries after a while, self-adjusting to meet false positive rate SLO Used as compact way to maintain list of received counter updates (server side) Uses space more efficiently than just one large Bloom filter , to reach lower false positive rate Avoids locking of Cassandra v2.1 Increases latency by at most 10% Integrated into Cassandra v2.0 100% accurate, while default is off by about 6-8% 24 DPRG: http://dprg.cs.uiuc.edu

Forgetful Bloom Filter and Its Usage in NoSQL Databases

Download Presentation

Presentation Transcript

Related

More Related Content