Stream processing fundamentals — CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (45 pages, 1.22 MB)
Traditional data warehouse: historical data; batched updates during downtimes, e.g. every night. Streaming Data Warehouse (SDW): low-latency materialized view updates over pre-aggregated, pre-processed streams and historical data.
Comparison (traditional vs. streaming):
• Data: append-only
• Update rates: relatively low vs. high, bursty
• Processing model: query-driven / pull-based vs. data-driven / push-based
• Queries: ad-hoc vs. continuous
• Latency: relatively high vs. low
Vasiliki Kalavri | Boston University 2020
Traditional DW vs. SDW:
• Update frequency: low vs. high
• Update propagation: synchronized vs. asynchronous
• Data: historical vs. recent and historical
• ETL process: complex vs. fast and …
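The SDW's low-latency materialized view maintenance can be contrasted with a traditional nightly batch rebuild. Below is a minimal Python sketch, assuming a view that keeps a running sum per key; the class and method names are illustrative, not part of any real warehouse API.

```python
from collections import defaultdict

class MaterializedView:
    """Pre-aggregated view kept fresh per event (SDW-style),
    instead of being recomputed in a nightly batch (traditional DW)."""

    def __init__(self):
        # running SUM(amount) GROUP BY key
        self.totals = defaultdict(float)

    def on_event(self, key, amount):
        # push-based, data-driven: update the view as soon as data arrives
        self.totals[key] += amount

    def query(self, key):
        # pull-based reads see low-latency, already-aggregated results
        return self.totals[key]

view = MaterializedView()
view.on_event("eu", 10.0)
view.on_event("eu", 20.0)
view.on_event("us", 5.0)
```

A query such as `view.query("eu")` returns the up-to-date aggregate immediately, without waiting for a batch window.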
Streaming optimizations — CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (54 pages, 2.83 MB)
Objectives:
• optimize resource utilization or minimize resources
• decrease latency, increase throughput
• minimize monetary costs (if running in the cloud)
Challenges in streaming optimization: re-optimizing a running query might be impractical because of
• state accumulation and re-partitioning
• high-availability and low-latency requirements
• scheduling overhead
Batching: process multiple data elements in a single batch, subject to resource constraints or QoS latency constraints. Batching trades latency for throughput; it improves …
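The batching optimization described above can be sketched in a few lines of Python; `batch_size` is an assumed tuning knob, and the generator shape is one possible design, not a specific system's API.

```python
def batched(stream, batch_size):
    """Group elements into fixed-size batches: amortizes per-element
    overhead (higher throughput) at the cost of waiting for a batch
    to fill (higher latency for the first element of each batch)."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # flush the incomplete final batch so no element is lost
        yield batch
```

Real systems typically add a timeout so a partially filled batch is flushed after a bounded wait, capping the latency the batching introduces.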
Notions of time and progress — CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (22 pages, 2.22 MB)
Watermarks provide a configurable trade-off between result confidence and latency:
• Eager watermarks ensure low latency but provide lower confidence; late events might arrive after the watermark.
• Slow watermarks increase confidence but might lead to higher processing latency.
Periodic watermarks: periodically ask the user-defined function …
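A common way to realize this trade-off is a periodic watermark generator with a bounded out-of-orderness assumption. The sketch below is a conceptual Python version (the parameter name `max_out_of_orderness` and the class are illustrative, not a Flink API): a small bound gives eager watermarks, a large bound gives slow ones.

```python
class BoundedOutOfOrdernessWatermarks:
    """Periodic watermark generator: the watermark is the maximum
    event timestamp seen so far minus a fixed lateness bound."""

    def __init__(self, max_out_of_orderness):
        self.bound = max_out_of_orderness
        self.max_ts = float("-inf")

    def on_event(self, timestamp):
        # track stream progress as events are observed
        self.max_ts = max(self.max_ts, timestamp)

    def current_watermark(self):
        # called periodically by the runtime; events with a timestamp
        # at or below this value are considered "probably complete"
        return self.max_ts - self.bound
```

With `bound = 5`, an event stream reaching timestamp 12 yields watermark 7: any event with timestamp ≤ 7 that arrives later is a late event.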
Introduction to Apache Flink and Apache Kafka — CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (26 pages, 3.33 MB)
Sources and sinks: event logs (Kafka, RabbitMQ, …), storage (HDFS, JDBC, …). Workloads: ETL, graphs, machine learning, relational queries; low latency, windowing, aggregations, …
Basic API concepts …
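The basic dataflow API concept — source, transformation, key-based partitioning, aggregation — can be illustrated with a plain Python analogue of the canonical word count. This is a conceptual sketch only; Flink expresses the same pipeline as `flatMap` → `keyBy` → `sum` over an unbounded stream.

```python
def word_count(lines):
    """Conceptual word count: each stage mirrors a dataflow operator."""
    counts = {}
    for line in lines:                            # source
        for word in line.split():                 # flatMap: line -> words
            key = word.lower()                    # keyBy: partition by word
            counts[key] = counts.get(key, 0) + 1  # rolling sum per key
    return counts
```

In a real streaming job the `counts` state would be keyed, fault-tolerant operator state, and results would be emitted continuously rather than returned once.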
Monitoring Apache Flink applications (getting started) — (23 pages, 148.62 KB)
4.12 Monitoring Latency
Alert conditions: records-lag-max > threshold; millisBehindLatest > threshold.
Generally speaking, latency is the delay between the creation of an event and the time at which results based on this event become visible. … which then writes the results to a database or calls a downstream system. In such a pipeline, latency can be introduced at each stage and for various reasons, including the following: 1. It might take …
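A consumer-lag alert of the kind described above can be sketched as a simple comparison between the newest produced timestamp and the last consumed one. The function below is illustrative, assuming timestamps in milliseconds; it is not an actual Flink or Kinesis metrics API, though it mirrors what `millisBehindLatest` measures.

```python
def lag_alert(latest_produced_ts_ms, last_consumed_ts_ms, threshold_ms):
    """Return True when the consumer has fallen further behind the
    head of the stream than the configured threshold allows."""
    millis_behind_latest = latest_produced_ts_ms - last_consumed_ts_ms
    return millis_behind_latest > threshold_ms
```

For example, with a 5-second threshold, a consumer 6 seconds behind the log head triggers an alert while one 1 second behind does not.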
Flow control and load shedding — CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (43 pages, 2.42 MB)
Load shedding trades off result accuracy for sustainable performance. It is suitable for applications with strict latency constraints that can tolerate approximate results. The alternative is to slow down the flow of data: the system buffers …
The DSMS detects overload and decides what actions to take in order to maintain acceptable latency and minimize result-quality degradation.
Load shedding decisions:
• When to shed load? Detect overload quickly to avoid latency increase; monitor input rates.
• Where in the query plan? Dropping at the sources vs. dropping …
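One standard shedding policy is probabilistic dropping: when the observed input rate exceeds capacity, drop each tuple with probability 1 − capacity/rate. The Python sketch below is a minimal illustration of that idea; the class name and `capacity` parameter are assumptions, not a specific system's interface.

```python
import random

class RandomLoadShedder:
    """Drop tuples uniformly at random, with a drop probability chosen
    so the surviving rate matches the operator's processing capacity."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity          # tuples/sec the system sustains
        self.rng = random.Random(seed)

    def keep(self, observed_rate):
        if observed_rate <= self.capacity:
            return True                   # no overload: keep everything
        p_drop = 1.0 - self.capacity / observed_rate
        return self.rng.random() >= p_drop
```

At twice the sustainable rate, roughly half the tuples are dropped, keeping latency bounded at the cost of approximate aggregates.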
Stream ingestion and pub/sub systems — CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (33 pages, 700.14 KB)
• Out-of-sync sources may produce out-of-order streams.
• Sources may be connected over the network: latency and unpredictable delays.
• Sources might be producing too fast: the stream processor needs to keep up.
Delivery models: the consumer periodically polls and retrieves new data (polling overhead, latency?) or receives a notification when new data is available (how to implement triggers?).
Ordering holds per partition, as each partition is read by a single thread. What would you use when the priority is latency but not ordering? Throughput and ordering?
How long to keep the log? Log compaction: …
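Log compaction answers the "how long to keep the log?" question by retaining only the latest record per key, so a new consumer can rebuild current state without replaying every update. Below is a minimal Python sketch of the idea, assuming a log of `(key, value)` records; it is not Kafka's actual compaction implementation.

```python
def compact(log):
    """Keep only the most recent record per key, preserving the
    relative order of the surviving records (by original offset)."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)   # later records overwrite earlier
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (offset, value) in survivors]
```

After compaction, a log of updates collapses to one record per key, which is exactly the state a fresh consumer needs.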
Elasticity and state migration, Part I — CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (93 pages, 2.42 MB)
Metrics: CPU utilization, congestion, back pressure, throughput.
Policy:
• queuing-theory models: for latency objectives
• control-theory models: e.g., a PID controller
• rule-based models: e.g., if CPU utilization …
State migration strategies:
• All-at-once: move the state to be migrated in one operation; high latency during migration if the state is large.
• Progressive: move the state to be migrated in smaller pieces.
Andrea Lattuada, Frank McSherry, Vasiliki Kalavri, John Liagouris, Timothy Roscoe. Megaphone: Latency-conscious state migration for distributed streaming dataflows. VLDB 2019.
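A rule-based scaling policy of the kind mentioned above can be sketched in a few lines. The thresholds (80% to scale out, 30% to scale in) are illustrative assumptions; real policies also add cooldowns to avoid oscillation.

```python
def scaling_decision(cpu_utilization, parallelism, high=0.8, low=0.3):
    """Rule-based elasticity sketch: add a worker above the high
    threshold, remove one below the low threshold, else do nothing."""
    if cpu_utilization > high:
        return parallelism + 1            # overloaded: scale out
    if cpu_utilization < low and parallelism > 1:
        return parallelism - 1            # underutilized: scale in
    return parallelism                    # within bounds: stay put
```

Queuing- and control-theory policies replace the fixed thresholds with a model that maps measured load to a target parallelism directly.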
Fault-tolerance demo & reconfiguration — CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (41 pages, 4.09 MB)
State migration goals:
• minimize communication
• keep duration short
• minimize performance disruption, e.g. latency spikes
• avoid introducing load imbalance
Resource management: utilization, isolation.
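Progressive migration keeps disruption short by moving keyed state in small batches rather than in one large transfer, bounding the latency spike per step. The generator below is a minimal sketch of that idea under the assumption that state is a key-to-value dictionary; the function name and batching scheme are illustrative.

```python
def migrate_progressively(state, batch_size):
    """Yield the state to be migrated in small key batches, so the
    running job pauses only briefly for each transfer step."""
    items = list(state.items())
    for i in range(0, len(items), batch_size):
        # each yielded dict is one migration step's payload
        yield dict(items[i:i + batch_size])
```

The receiving worker merges the batches as they arrive; once the last batch lands, it owns the full migrated state.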
Scalable stream processing: Spark Streaming and Flink — (113 pages, 1.22 MB)
Fault tolerance:
• A more coarse-grained approach than Storm's.
• Based on consistent global snapshots (inspired by Chandy-Lamport).
• Low runtime overhead, stateful exactly-once semantics.
Acks sequences …
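The consistent-global-snapshot idea can be illustrated with barrier alignment at a two-input operator: once a checkpoint barrier arrives on one input, records from that input are buffered until the barrier also arrives on the other input, at which point the operator snapshots its state. The sketch below is a simplified Python model, not Flink's implementation; the class and field names are assumptions.

```python
class TwoInputAligner:
    """Barrier alignment sketch for consistent snapshots: snapshot
    operator state only when barriers from both inputs have arrived."""

    def __init__(self):
        self.state = 0           # running sum over both inputs
        self.barriers = set()    # inputs whose barrier has arrived
        self.buffered = []       # records held back during alignment
        self.snapshots = []      # checkpointed state values

    def on_record(self, input_id, value):
        if input_id in self.barriers:
            self.buffered.append(value)   # belongs to the next epoch
        else:
            self.state += value

    def on_barrier(self, input_id):
        self.barriers.add(input_id)
        if self.barriers == {0, 1}:       # aligned: snapshot, then drain
            self.snapshots.append(self.state)
            for value in self.buffered:
                self.state += value
            self.barriers.clear()
            self.buffered.clear()
```

Because post-barrier records are excluded from the snapshot, the checkpointed state reflects exactly the records of one epoch across all inputs, which is what makes the global snapshot consistent.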
共 15 条
- 1
- 2













