Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020
  Vasiliki (Vasia) Kalavri, vkalavri@bu.edu. Spring 2020, 1/28: Stream ingestion and pub/sub systems. Streaming sources: files (e.g. transaction logs), sockets, IoT devices and sensors, databases ... might process a message out-of-order or twice. How can we avoid this? Publish/Subscribe Systems (diagram: publishers, subscribers with notify(); operations: subscribe, notify, unsubscribe, advertise(): information regarding future events). Pub/Sub levels of de-coupling • Space: interacting parties do not need to know each other
  33 pages | 700.14 KB | 1 year ago
Scalable Stream Processing - Spark Streaming and Flink
  The Course Web Page: https://id2221kth.github.io. Where Are We? Stream Processing Systems Design Issues ▶ Continuous vs. micro-batch processing ▶ Record-at-a-Time vs. declarative APIs ... streaming sources: 1. Basic sources directly available in the StreamingContext API, e.g., file systems, socket connections. 2. Advanced sources, e.g., Kafka, Flume, Kinesis, Twitter. 3. Custom sources
  113 pages | 1.22 MB | 1 year ago
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
  distributed streaming ... Fundamental for representing, summarizing, and analyzing data streams. Systems; Algorithms; Architecture and design; Scheduling and load management; Scalability and elasticity ... streaming systems • be proficient in using Apache Flink and Kafka to build end-to-end, scalable, and reliable streaming applications • have a solid understanding of how stream processing systems work ... and industry • Learn from experts with decades of hands-on experience in building and using distributed systems and data management platforms • Have fun! Vasiliki Kalavri | Boston University 2020 ... Important
  34 pages | 2.53 MB | 1 year ago
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
  Dataflow Streaming Model. Vasiliki Kalavri | Boston University 2020. Dataflow Systems: distributed execution, partitioned state, exact results, out-of-order support. Stream Database Systems: single-node execution, synopses and sketches, approximate results, in-order data processing. Evolution of Stream Processing (timeline figure; years shown: 1992, 2000, 2004, 2013; systems shown: Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, MapReduce, Naiad, Spark Streaming). Distributed dataflow systems • Computations as Directed Acyclic Graphs (DAGs) • nodes are operators and edges are data channels
  45 pages | 1.22 MB | 1 year ago
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
  Vasiliki Kalavri | Boston University 2020 • Multi-tenancy • in streaming systems that build one dataflow graph for several queries • when applications analyze data streams from ... Operator Placement for Stream-Processing Systems. ICDE 2006. • Brian Babcock et al. Chain: Operator Scheduling for Memory Minimization in Data Stream Systems. SIGMOD 2003. • Donald Carney et al.
  54 pages | 2.83 MB | 1 year ago
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020
  Operators • Probably the most important operators in stream processing systems • Almost universally supported across streaming systems and languages, albeit with various names and semantics • Allow un-blocking
  53 pages | 532.37 KB | 1 year ago
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020
  in dataflow computations ... Vasiliki Kalavri | Boston University 2020 ... Distributed streaming systems will fail • how can we guard state against failures and guarantee correct results after recovery
  49 pages | 2.08 MB | 1 year ago
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020
  Vasiliki Kalavri | Boston University 2020. Batch Graph Processing: Batch graph processing systems, such as Apache Giraph, GraphX, Pregel, operate offline. They are built to analyze a snapshot of
  72 pages | 7.77 MB | 1 year ago
Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020
  initiating process ... The Chandy-Lamport Algorithm: a snapshot algorithm that is used in distributed systems for recording a consistent global state of an asynchronous system. Vasiliki Kalavri | Boston University 2020 ... a streaming application • Some result records might be emitted multiple times to downstream systems ... End-to-end exactly once • Flink’s checkpointing ... How can we ensure exactly-once output? ... Enabling
  81 pages | 13.18 MB | 1 year ago
PyFlink 1.15 Documentation
  fine-grained control over state and timer, which allows for the implementation of advanced event-driven systems. You can run the latest version of these examples by yourself in ‘Live Notebook: DataStream’ at
  36 pages | 266.77 KB | 1 year ago