Real-Time Stream Processing

Real-Time Stream Processing
Slide Note
Embed
Share

This content delves into the realm of real-time stream processing using Apache Storm, a top-level Apache project that enables distributed, fault-tolerant, and guaranteed computation. It covers the evolution from traditional data processing to the need for stream processing systems to supplement batch views, leading to the emergence of Apache Storm for addressing scalability, fault tolerance, and guaranteed data processing.

  • Real-Time
  • Stream Processing
  • Apache Storm
  • Distributed Computing
  • Big Data

Uploaded on Feb 28, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Real-Time Stream Processing CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook

  2. Agenda Apache Storm

  3. Traditional Data Processing Index Query Batch Pre- Computation (aka MapReduce) !!!ALL!!! the data Index Query Index

  4. Traditional Data Processing Slow... and views are out of date Absorbed into batch views Not absorbed Now Time

  5. Compensating for the real-time stuff Need some kind of stream processing system to supplement our batch views Applications can then merge the batch and the real time views together!

  6. How do we do that?

  7. APACHE STORM

  8. Enter: Storm Open-source project originally built by Twitter Now a top-level Apache project Enables distributed, fault-tolerant, real-time, guaranteed computation

  9. A History Lesson on Twitter Metrics Twitter Firehose

  10. A History Lesson on Metrics Twitter Firehose

  11. Problems! Scaling is painful Fault-tolerance is practically non-existent Coding for it is awful

  12. Wanted to Address Guaranteed data processing Horizontal Scalability Fault-tolerance No intermediate message brokers Higher level abstraction than message passing Just works

  13. Storm Delivers Guaranteed data processing Horizontal Scalability Fault-tolerance No intermediate message brokers Higher level abstraction than message passing Just works

  14. Use Cases Stream Processing Distributed RPC Continuous Computation

  15. Storm Architecture Supervisor ZooKeeper Supervisor ZooKeeper Supervisor Nimbus ZooKeeper Supervisor Supervisor

  16. Glossary Streams Constant pump of data as Tuples Spouts Source of streams Bolts Process input streams and produce new streams Functions, Filters, Aggregation, Joins, Talk to databases, etc. Topologies Network of spouts and bolts

  17. Tasks and Topologies

  18. Grouping When a Tuple is emitted from a Spout or Bolt, where does it go? Shuffle Grouping Pick a random task Fields Grouping Consistent hashing on a subset of tuple fields All Grouping Send to all tasks Global Grouping Pick task with lowest ID

  19. Topology [ id1 , id2 ] shuffle shuffle [ url ] shuffle all

  20. Guaranteed Message Processing A tuple has not been fully processed until it all tuples in the tuple tree have been completed If the tree is not completed within a timeout, it is replayed Programmers need to use the API to ack a tuple as completed

  21. Stream Processing Example Word Count TopologyBuilder builder = new TopologyBuilder(); builder.setSpout(1, new SentenceSpout(true), 5); builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1); builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields( word )); Map conf = new HashMap(); conf.put(Config.TOPOLOGY_WORKERS, 5); StormSubmitter.submitTopology( word-count , conf, builder.createTopology());

  22. public static class SplitSentence extends ShellBolt implements IRichBolt { public SplitSentence() { super( python , splitsentence.py ); } public void declareOutputFields(OutputFieldsDeclaraer declarer) { declarer.declare(new Fields( word )); } } #!/usr/bin/python import storm class SplitSentenceBolt(storm.BasicBolt): def process(Self, tup): words = tup.values[0].split( ) for word in words: storm.emit([word])

  23. public static class WordCount implements IBasicBolt { Map<String, Integer> counts = new HashMap<String, Integer>(); public void prepare(Map conf, TopologyContext context) {} public void execute(Tuple tuple, BasicOutputCollector collector) { String word = tuple.getString(0); Integer count = counts.get(word); if (count == null) { count = 0; } ++count; counts.put(Word, count); collector.emit(new Values(word, count)); } public void cleanup () {} } public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields( word , count )); }

  24. Local Mode! TopologyBuilder builder = new TopologyBuilder(); builder.setSpout(1, new SentenceSpout(true), 5); builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1); builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields( word )); Map conf = new HashMap(); conf.put(Config.TOPOLOGY_WORKERS, 5); LocalCluster cluster = new LocalCluster(); cluster.submitTopology( word-count , conf, builder.createTopology()); Thread.sleep(10000); cluster.shutdown();

  25. Command Line Interface Starting a topology storm jar mycode.jar twitter.storm.MyTopology demo Stopping a topology storm kill demo

  26. Distributed RPC

  27. DRPC Example Reach Reach is the number of unique people exposed to a specific URL on Twitter Follower Distinct Follower Tweeter Follower Distinct Follower Count Reach URL Tweeter Follower Follower Distinct Follower Tweeter Follower

  28. Reach Topology shuffle shuffle Spout GetTweeters GetFollowers [ follower-id ] Distinct global CountAggregator

  29. Storm Review Distributed code and configurations Robust process management Monitors topologies and reassigns failed tasks Provides reliability by tracking tuple trees Routing and partitioning of streams Serialization Fine-grained performance stats of topologies

  30. References http://storm.incubator.apache.org/ http://www.slideshare.net/nathanmarz/storm- distributed-and-faulttolerant-realtime- computation http://www.slideshare.net/Hadoop_Summit/real time-analytics-with-storm http://spark.apache.org http://www.cs.berkeley.edu/~matei/papers/2012 /nsdi_spark.pdf

More Related Content