Stream Processing Challenges and Solutions for Big Data Analytics

integrating scale out and fault tolerance n.w
1 / 19
Embed
Share

Explore the integration of scale-out and fault tolerance in stream processing through operator state management. Discover the complexities of real-time big data analysis, distributed stream processing systems, and elastic DSPSs in the cloud. Uncover the innovative approaches to building stream processing platforms in the cloud while addressing workload surges, resource allocation, and stateful operator maintenance.

  • Stream Processing
  • Big Data Analytics
  • Fault Tolerance
  • Operator State Management
  • Real-time Analysis

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management with Raul Castro Fernandez* Matteo Migliavacca+and Peter Pietzuch* *Imperial College London, +Kent Univerisity Peter R. Pietzuch prp@doc.ic.ac.uk

  2. Big data in numbers: 2.5 billions on gigabytes of data every day (source IBM) LSST telescope, Chile 2016, 30 TB nightly come from everywhere: web feeds, social networking mobile devices, sensors, cameras scientific instruments online transactions (public and private sectors) have value: Global Pulse forum for detecting human crises internationally real-time big data analytics in UK 25 billions 216 billions in 2012-17 recommendation applications (LinkedIn, Amazon) processing infrastructure for big data analysis 2

  3. A black-box approach for big data analysis users issue analysis queries with real-time semantics streams of data updates, time-varying rates, generated in real-time streams of result data processing in near real-time Stream Processing System time 3

  4. Distributed Stream Processing System queries consist of operators (join, map, select, ..., UDOs) operators form graphs operators process streams of tuples on-the-fly operators span nodes 4

  5. Elastic DSPSs in the Cloud Real-time big data analysis challenge traditional DSPS: ? what about continuous workload surges? ? what about real-time resource allocation to workload variations? ? keeping the state correct forstateful operators? Massively scalable , cloud-based DSPSs [SIGMOD 2013] 1. gracefully handles statefuloperators state 2. operator state management for combined scale out and fault tolerance 3. SEEP system and evaluation 4. related work 5. future research directions 5

  6. Stream Processing in the Cloud clouds provide infinite pools of resources ? How do we build a stream processing platform in the Cloud? Intra-query parallelism: provisioning for workload peaks unnecessarily conservative Failure resilience: active fault-tolerance needs 2x resources passive fault-tolerance leads to long recovery times dynamic scale out: increase resources when peaks appear hybrid fault-tolerance: low resource overhead with fast recovery Both mechanisms must support stateful operators 6

  7. Stateless vs Stateful Operators operator state: a summary of past tuples processing 7 9 7 1 5 9 stateless: failure recovery scale out filter > 5 9 9 filter filter (the, 10) (with, 5) (the, 2) !=12 (with, 1) !=6 (the, ) (the, 10) (with, 5) the with the counter with stateful: failure recovery scale out (with, ) counter counter 7

  8. State Management processing state: (summary of past tuples processing) routing state: (routing of tuples) B A buffer state: (tuples) C operator state is an external entity managed by the DSPS primitives for state management mechanisms (scale out, failure recovery) on top of primitives dynamic reconfiguration of the dataflow graph 8

  9. State Management Primitives checkpoint takes snapshot of state and makes it externally available backup moves copy of state from one operator to another restore partition A A splits state in a semantically correct fashion for parallel processing B B 9

  10. State Management Scale Out, Stateful Ops periodically, stateful operators checkpoint and back up state to designated upstream backup node, in memory backup node already has state of operator to be parallelised backup checkpoint A A A A B A A upstream ops send unprocessed tuples to update checkpointed state restore A partition A A How do we partition stateful operators? 10

  11. Partitioning Stateful Operators 1. Processing state modeled as (key, value) dictionary 2. State partitioned according to key k of tuples 3. Tuples will be routed to correct operator as of k buffer state t=1, key=c, computer t=3, key=c, cambridge processing state t=3, (c, computer:1, cambridge:1) t=1, computer t=2, laboratory t=3, cambridge counter A splitter counter (a k), A (l z), A routing state A t=2, (l, laboratory:1) t=2, key=l, laboratory 11

  12. Passive Fault-Tolerance Model recreate operator state by replaying tuples after failure: upstream backup: sends acks upstream for tuples processed downstream ACKs data C A B D d a b c may result in long recovery times due to large buffers: system is reprocessing streams after failure inefficient 12

  13. Recovering using State Management (R+SM) Benefit from state management primitives: use periodically backed up state on upstream node to recover faster trim buffers at backup node same primitives as in scale out A A A A A A A state is restored and unprocessed tuples are replayed from buffer same primitives for parallel recovery 13

  14. State Management in Action: SEEP (1) dynamic Scale Out: detect bottleneck , add new parallelised operator (2) failure Recovery: detect failure, replace with new operator queries EC2 stats (1) scale out coordinator bottleneck detector scaling policy VM pool query manager faults (2) recovery coordinator fault detector deployment manager 14

  15. Dynamic Scale Out: Detecting bottlenecks 85% 35% 30% CPU utilisation report 35% 85% 30% bottleneck detector logical infrastructure view 15

  16. The VM Pool: Adding operators problem: allocating new VMs takes minutes... monitoring information bottleneck detector fault detector VM2 VM2 VM3 VM1 (dynamic pool size) virtual machine pool provision VM from cloud (order of mins) add new VM to pool Cloud provider 16

  17. Experimental Evaluation Goals: investigate effectiveness of scale out mechanism recovery time after failure using R+SM overhead of state management Scalable and Elastic Event Processing (SEEP): implemented in Java; Storm-like data flow model Sample queries + workload Linear Road Benchmark (LRB) to evaluate scale out [VLDB 04] provides an increasing stream workload over time query with 8 operators, 3 are stateful; SLA: results < 5 secs Windowed word count query (2 ops) to evaluate fault tolerance induce failure to observe performance impact Deployment on Amazon AWS EC2 sources and sinks on high-memory double extra large instances operators on small instances 17

  18. Scale Out: LRB Workload scales to load factor L=350 with 50 VMs on Amazon EC2 (automated query parallelisation, scale out policy at 70%) L=512 highest result [VLDB 12] (hand-crafted query on cluster) scale out leads to latency peaks, but remains within LRB SLA SEEP scales out to increasing workload in the Linear Road Benchmark 18

  19. Conclusions Stream processing will grow in importance: handling the data deluge enables real-time response and decision making Integrated approach for scale out and failure recovery: operator state an independent entity primitives and mechanisms Efficient approach extensible for additional operators: effectively applied to Amazon EC2 running LRB parallel recovery 19

Related


More Related Content