Real-time Data Stream Processing and Late Data Handling in Structured Streaming


Explore real-time data stream processing with Structured Streaming, including unbounded tables, triggers, event timestamps, word count queries, and windowed aggregation. The slides also cover the challenges of handling late data and of updating counts within specified time windows.

  • Data Processing
  • Stream Processing
  • Late Data Handling
  • Structured Streaming




Presentation Transcript


  1. Data stream as an unbounded table. New data in the data stream = new rows appended to an unbounded table.

  2. Programming Model for Structured Streaming. With a trigger every 1 sec, the input at times t=1, 2, 3 is the data up to t=1, up to t=2, and up to t=3. The query runs over that input and yields the result up to t=1, up to t=2, and up to t=3. The output is written in complete mode.

  3. Model of the Quick Example. Lines typed into nc are the input, an unbounded table of all input data. A word count query produces the result table of word counts. In complete mode, all the counts are printed to the console at every trigger.

     t=1: new input "cat dog", "dog dog"; result cat 1, dog 3
     t=2: new input "owl cat"; result cat 2, dog 3, owl 1
     t=3: new input "dog owl"; result cat 2, dog 4, owl 2
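The complete-mode behaviour on slide 3 can be sketched in plain Python, without Spark. This is only an illustration of the model: the `batches` dictionary and the function name `complete_mode_results` are assumptions made for this example, and the way the input is split into lines is read off the slide rather than stated by it.

```python
from collections import Counter

# Lines arriving at each trigger, as read off slide 3 (assumed line split).
batches = {
    1: ["cat dog", "dog dog"],
    2: ["owl cat"],
    3: ["dog owl"],
}

def complete_mode_results(batches):
    """Return {trigger: full result table} for a running word count."""
    counts = Counter()  # running state over the whole unbounded input table
    results = {}
    for t in sorted(batches):
        for line in batches[t]:
            counts.update(line.split())
        # Complete mode: the entire result table is emitted at every trigger.
        results[t] = dict(counts)
    return results

results = complete_mode_results(batches)
for t, table in results.items():
    print(f"t={t}", sorted(table.items()))
```

Each trigger prints the whole table, not only the rows that changed, which is what distinguishes complete mode from the update and append modes on the later slides.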

  4. Windowed Grouped Aggregation with 10 min windows, sliding every 5 mins.

     Input stream (event time, words): 12:02 cat dog; 12:03 dog dog; 12:07 owl cat; 12:11 dog; 12:13 owl.

     Result tables after 5 minute triggers:
     At 12:05: 12:00 - 12:10 cat 1, dog 3.
     At 12:10: 12:00 - 12:10 cat 2, dog 3, owl 1; 12:05 - 12:15 cat 1, owl 1. Counts incremented for windows 12:00 - 12:10 and 12:05 - 12:15.
     At 12:15: 12:00 - 12:10 cat 2, dog 3, owl 1; 12:05 - 12:15 cat 1, owl 2, dog 1; 12:10 - 12:20 dog 1, owl 1. Counts incremented for windows 12:05 - 12:15 and 12:10 - 12:20.
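How one event lands in two overlapping windows on slide 4 can be sketched as follows. This is a plain-Python illustration, not Spark code. The helper names (`hm`, `window_starts`, `windowed_counts`) and the calendar date are assumptions for this example. Windows are treated as half-open intervals [start, end), and windows starting before 12:00 are left out to match the slide, which does not show them.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)   # window length
SLIDE = timedelta(minutes=5)     # slide interval
EPOCH = datetime(1970, 1, 1)     # windows are aligned to this origin

def hm(text):
    """Parse 'HH:MM' onto an arbitrary fixed date (assumed for the example)."""
    hour, minute = map(int, text.split(":"))
    return datetime(2024, 1, 1, hour, minute)

def window_starts(ts):
    """Starts of every window [start, start + WINDOW) that contains ts."""
    start = ts - (ts - EPOCH) % SLIDE   # latest aligned start at or before ts
    starts = []
    while start > ts - WINDOW:
        starts.append(start)
        start -= SLIDE
    return starts

# Events from slide 4 as (event time, word).
events = [("12:02", "cat"), ("12:02", "dog"), ("12:03", "dog"), ("12:03", "dog"),
          ("12:07", "owl"), ("12:07", "cat"), ("12:11", "dog"), ("12:13", "owl")]

def windowed_counts(events, first_start):
    """Count words per (window start, word), skipping windows before first_start."""
    counts = Counter()
    for text, word in events:
        for start in window_starts(hm(text)):
            if start >= first_start:
                counts[(start.strftime("%H:%M"), word)] += 1
    return counts

counts = windowed_counts(events, first_start=hm("12:00"))
for (start, word), n in sorted(counts.items()):
    print(start, word, n)
```

In PySpark the same grouping is normally expressed with `pyspark.sql.functions.window` inside a `groupBy`, using a 10 minute window duration and a 5 minute slide duration.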

  5. Late data handling in Windowed Grouped Aggregation.

     Input stream (event time, words): 12:02 cat dog; 12:03 dog dog; 12:04 dog (late data that was generated at 12:04 but arrived at 12:11); 12:07 owl cat; 12:13 owl.

     Result tables after 5 minute triggers: when the late dog event arrives, counts are updated for window 12:00 - 12:10, where the dog count goes from 3 to 4.

  6. Watermarking in Windowed Grouped Aggregation with Update Mode. The figure plots event time against processing time with 5 min triggers.

     Data as (event time, word): 12:07 dog; 12:08 dog; 12:08 owl; 12:09 cat; 12:13 owl; 12:14 dog; 12:15 cat; 12:17 owl; 12:21 owl; 12:04 donkey. Points are marked as data late but within watermark, or data too late outside watermark.

     Watermark = max event time seen till now - late threshold. The watermark is updated every trigger using late threshold = 10 min:
     wm = 12:14 - 10m = 12:04
     wm = 12:21 - 10m = 12:11

     Result tables after each trigger:
     Table updated with late data (12:17, owl).
     Table not updated with too late data (12:04, donkey): the data is too late and is ignored in counts.
     Intermediate state for 12:00 - 12:10 is dropped as watermark > 12:10.
     Purple rows are updated rows that are written to the sink as output.
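The watermark rule on slide 6 can be sketched as a small plain-Python simulation. It is a simplified model, not Spark's implementation. The grouping of events into `batches` is an assumption, since the transcript does not preserve which event arrived in which trigger, and the names `run_update_mode`, `hm`, `fmt` and `window_starts` are invented for this example.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)
SLIDE = timedelta(minutes=5)
LATE_THRESHOLD = timedelta(minutes=10)
EPOCH = datetime(1970, 1, 1)

def hm(text):
    """Parse 'HH:MM' onto an arbitrary fixed date (assumed for the example)."""
    hour, minute = map(int, text.split(":"))
    return datetime(2024, 1, 1, hour, minute)

def fmt(ts):
    return ts.strftime("%H:%M")

def window_starts(ts):
    """Starts of every window [start, start + WINDOW) that contains ts."""
    start = ts - (ts - EPOCH) % SLIDE
    starts = []
    while start > ts - WINDOW:
        starts.append(start)
        start -= SLIDE
    return starts

def run_update_mode(batches):
    state = {}               # (window start, word) -> running count
    watermark = None         # unknown until the first trigger has finished
    max_event_time = None
    log = []
    for batch in batches:
        updated, dropped = {}, []
        for text, word in batch:
            ts = hm(text)
            if max_event_time is None or ts > max_event_time:
                max_event_time = ts
            # Keep only windows that the watermark has not passed yet.
            live = [s for s in window_starts(ts)
                    if watermark is None or s + WINDOW >= watermark]
            if not live:
                dropped.append((text, word))       # too late, ignored in counts
            for start in live:
                key = (start, word)
                state[key] = state.get(key, 0) + 1
                updated[(fmt(start), word)] = state[key]
        if max_event_time is not None:
            # Watermark = max event time seen so far - late threshold.
            watermark = max_event_time - LATE_THRESHOLD
            # Drop intermediate state of windows that ended before the watermark.
            state = {k: v for k, v in state.items() if k[0] + WINDOW >= watermark}
        log.append({"watermark": fmt(watermark) if watermark else None,
                    "updated": updated, "dropped": dropped})
    return log

# Hypothetical arrival order, one list per 5 minute trigger.
batches = [
    [("12:07", "dog"), ("12:08", "owl")],
    [("12:14", "dog"), ("12:09", "cat")],
    [("12:15", "cat"), ("12:08", "dog"), ("12:13", "owl"), ("12:21", "owl")],
    [("12:04", "donkey"), ("12:17", "owl")],
]
log = run_update_mode(batches)
for entry in log:
    print(entry)
```

Under these assumptions the watermark reaches 12:04 and then 12:11, as on the slide. The late (12:17, owl) still updates its windows, while (12:04, donkey) is dropped. In update mode only the rows in `updated` would be written to the sink. In PySpark the threshold is declared with `DataFrame.withWatermark` on the event-time column.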

  7. Watermarking in Windowed Grouped Aggregation with Append Mode. The figure plots event time against processing time with 5 min triggers.

     Data as (event time, word): the events of the previous slide plus 12:26 owl. Points are marked as data late but within watermark, or data too late outside watermark. (12:04, donkey) is too late and is ignored in counts.

     Watermark = max event time seen till now - late threshold:
     wm = 12:14 - 10m = 12:04
     wm = 12:21 - 10m = 12:11
     wm = 12:26 - 10m = 12:16

     Result tables after each trigger:
     Partial counts for window 12:00 - 12:10 are maintained as internal state while waiting for late data, so they are not yet added to the result table.
     Final counts for 12:00 - 12:10 (owl 1, cat 1, dog 2) are added to the table when watermark > 12:10. Late data is counted, and the intermediate state for the window is dropped.
     Final counts for 12:05 - 12:15 are shown as owl 2, cat 2, dog 3.
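The append-mode rule on slide 7, where a window's counts reach the result table only once the watermark has passed the end of that window, can be sketched in plain Python. This is a simplified model rather than Spark's implementation. The arrival order in `batches` is an assumption, the helper names are invented for this example, and the run stops before the 12:26 event, so only window 12:00 - 12:10 is finalized.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)
SLIDE = timedelta(minutes=5)
LATE_THRESHOLD = timedelta(minutes=10)
EPOCH = datetime(1970, 1, 1)

def hm(text):
    """Parse 'HH:MM' onto an arbitrary fixed date (assumed for the example)."""
    hour, minute = map(int, text.split(":"))
    return datetime(2024, 1, 1, hour, minute)

def fmt(ts):
    return ts.strftime("%H:%M")

def window_starts(ts):
    """Starts of every window [start, start + WINDOW) that contains ts."""
    start = ts - (ts - EPOCH) % SLIDE
    starts = []
    while start > ts - WINDOW:
        starts.append(start)
        start -= SLIDE
    return starts

def run_append_mode(batches):
    state = {}             # (window start, word) -> partial count, kept internally
    watermark = None       # watermark computed at the end of the previous trigger
    max_event_time = None
    emitted = []           # rows appended to the result table, one dict per trigger
    for batch in batches:
        for text, word in batch:
            ts = hm(text)
            if max_event_time is None or ts > max_event_time:
                max_event_time = ts
            for start in window_starts(ts):
                if watermark is not None and start + WINDOW < watermark:
                    continue   # window already behind the watermark: ignore
                state[(start, word)] = state.get((start, word), 0) + 1
        final = {}
        if watermark is not None:
            for (start, word), count in list(state.items()):
                if start + WINDOW < watermark:
                    # Window is final: append it once and drop its state.
                    final[(fmt(start), word)] = count
                    del state[(start, word)]
        emitted.append(final)
        if max_event_time is not None:
            watermark = max_event_time - LATE_THRESHOLD
    return emitted

# Hypothetical arrival order, one list per 5 minute trigger.
batches = [
    [("12:07", "dog"), ("12:08", "owl")],
    [("12:14", "dog"), ("12:09", "cat")],
    [("12:15", "cat"), ("12:08", "dog"), ("12:13", "owl"), ("12:21", "owl")],
    [("12:04", "donkey"), ("12:17", "owl")],
]
emitted = run_append_mode(batches)
for i, rows in enumerate(emitted, start=1):
    print(f"trigger {i}:", sorted(rows.items()))
```

Nothing is appended during the first three triggers. The fourth trigger runs with watermark 12:11, so the rows for window 12:00 - 12:10 are emitted once and their state is released. The late (12:08, dog) is included in that final count, while (12:04, donkey) is not.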
