Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 1/28: Stream ingestion and pub/sub systems Streaming sources Files, e.g. transaction logs Sockets IoT devices and sensors Databases might process a message out-of-order or twice 14 How can we avoid this? 15 Publish/Subscribe Systems publisher publisher publisher publisher subscriber notify() subscriber notify() subscriber subscribe notify unsubscribe advertise(): information reg. future events Publish/Subscribe Systems 17 Pub/Sub levels of de-coupling • Space: interacting parties do not need to know each other • Publishers0 码力 | 33 页 | 700.14 KB | 1 年前3
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020in dataflow computations 3 Vasiliki Kalavri | Boston University 2020 4 Distributed streaming systems will fail • how can we guard state against failures and guarantee correct results after recovery MillWheel: Fault-Tolerant Stream Processing at Internet Scale (PVLDB’13) • https://research.google/pubs/pub41378/ 220 码力 | 49 页 | 2.08 MB | 1 年前3
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020Operators • Probably the most important operators in stream processing systems • Almost universally supported across streaming systems and languages albeit with various names and semantics • Allow un-blocking as input to multiple downstream operators. • Group by / Partition Operators split a stream into sub-streams according to a function or the event contents. • one stream per customer Id • round-robin0 码力 | 53 页 | 532.37 KB | 1 年前3
Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020split the input stream into m = 2p sub-streams S0, S1, …, Sm-1 For every element x, we compute h(x) and use the p first bits of the M-bit hash value to select a sub-stream and the next M-p bits to compute split the input stream into m = 2p sub-streams S0, S1, …, Sm-1 For every element x, we compute h(x) and use the p first bits of the M-bit hash value to select a sub-stream and the next M-p bits to compute that maps elements to a binary representation of length 5. We split the stream into m = 2p = 4 sub-streams. Consider the input elements {5, 14, 5, 2, 8, 1, …} ??? Vasiliki Kalavri | Boston University0 码力 | 69 页 | 630.01 KB | 1 年前3
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020distributed streaming 4 Fundamental for representing, summarizing, and analyzing data streams Systems Algorithms Architecture and design Scheduling and load management Scalability and elasticity streaming systems • be proficient in using Apache Flink and Kafka to build end-to-end, scalable, and reliable streaming applications • have a solid understanding of how stream processing systems work and industry • Learn from experts with decades of hands-on experience in building and using distributed systems and data management platforms • Have fun! 10 Vasiliki Kalavri | Boston University 2020 Important0 码力 | 34 页 | 2.53 MB | 1 年前3
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020Boston University 2020 Dataflow Streaming Model Vasiliki Kalavri | Boston University 2020 Dataflow Systems Distributed execution Partitioned state Exact results Out-of-order support Single-node execution execution Synopses and sketches Approximate results In-order data processing Stream Database Systems 2000 1992 2013 MapReduce 2004 Tapestry NiagaraCQ Aurora TelegraphCQ STREAM Naiad Spark Streaming Evolution of Stream Processing 35 Vasiliki Kalavri | Boston University 2020 Distributed dataflow systems • Computations as Directed Acyclic Graphs (DAGs) • nodes are operators and edges are data channels0 码力 | 45 页 | 1.22 MB | 1 年前3
Scalable Stream Processing - Spark Streaming and FlinkThe Course Web Page https://id2221kth.github.io 1 / 79 Where Are We? 2 / 79 Stream Processing Systems Design Issues ▶ Continuous vs. micro-batch processing ▶ Record-at-a-Time vs. declarative APIs streaming sources: 1. Basic sources directly available in the StreamingContext API, e.g., file systems, socket connections. 2. Advanced sources, e.g., Kafka, Flume, Kinesis, Twitter. 3. Custom sources streaming sources: 1. Basic sources directly available in the StreamingContext API, e.g., file systems, socket connections. 2. Advanced sources, e.g., Kafka, Flume, Kinesis, Twitter. 3. Custom sources0 码力 | 113 页 | 1.22 MB | 1 年前3
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020D A B C D ??? Vasiliki Kalavri | Boston University 2020 22 • Multi-tenancy • in streaming systems that build one dataflow graph for several queries • when applications analyze data streams from Operator Placement for Stream-Processing Systems. ICDE 2006. • Brian Babcock et. al. Chain : Operator Scheduling for Memory Minimization in Data Stream Systems. SIGMOD 2003. • Donald Carney et. al.0 码力 | 54 页 | 2.83 MB | 1 年前3
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 20202020 • Read all the previous subtask state from the checkpoint in all sub-tasks and filter out the matching keys for each sub-task • Sequential read pattern • Tasks read unnecessary data and the0 码力 | 41 页 | 4.09 MB | 1 年前3
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020?? Vasiliki Kalavri | Boston University 2020 Batch Graph Processing 9 Batch graph processing systems, such as Apache Graph, GraphX, Pregel, operate offline. They are built to analyze a snapshot of0 码力 | 72 页 | 7.77 MB | 1 年前3
共 14 条
- 1
- 2
相关搜索词













