Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 4/16: Skew mitigation … hitters … Lossy Counting • Find all items x in a data stream such that: • freq(x) > δ*N, where N is the number of stream elements • The solution will … randomized load balancing. IEEE TPDS 2001. • Manku, G.S., Motwani, R. Approximate frequency counts over data streams. VLDB 2002. Further reading
0 credits | 31 pages | 1.47 MB | 1 year ago

State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 2/25: State Management … types can you think of? • Count, sum, list, map, … All data maintained by a task and used to compute results: a local or instance variable that is accessed by … ache-flink … • RocksDB is a persistent key-value store: data lives on disk, state can grow larger than available memory and will not be lost upon failure.
0 credits | 24 pages | 914.13 KB | 1 year ago

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 4/14: Stream processing optimizations … Revisiting the basics: Dataflow graph • operators are nodes, data channels are edges • channels have FIFO semantics • streams of data elements flow continuously along edges Operators • receive … Types of Parallelism • Pipeline: A || B • Task: B || C • Data: A || A … Distributed execution in Flink …
0 credits | 54 pages | 2.83 MB | 1 year ago

Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 2/11: Windows and Triggers …
0 credits | 35 pages | 444.84 KB | 1 year ago

Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 1/21: Introduction … architecture of modern distributed streaming … Fundamental for representing, summarizing, and analyzing data streams Systems Algorithms Architecture and design Scheduling and load management Scalability Learn from experts with decades of hands-on experience in building and using distributed systems and data management platforms • Have fun! … Important dates
0 credits | 34 pages | 2.53 MB | 1 year ago

Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 Vasiliki (Vasia) Kalavri vkalavri@bu.edu CS 591 K1: Data Stream Processing and Analytics Spring 2020 2/06: Notions of time and progress … captures the progress of the stage itself • minimum of input watermarks and event-times of non-late data … Watermark propagation … Event-time update …
0 credits | 22 pages | 2.22 MB | 1 year ago

Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 1/23: Stream Processing Fundamentals … What is a stream? • In traditional data processing applications, we know the entire dataset in advance, e.g. tables stored in a database. A data stream is a data set that is produced incrementally over time, rather than being available in full before its processing begins. • Data streams are high-volume, real-time data that might be unbounded • we cannot store the entire stream in an accessible …
0 credits | 45 pages | 1.22 MB | 1 year ago

Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 4/23: Cardinality and frequency estimation … certain for very large data streams with high-frequency elements … Counting Bloom Filter … • A space-efficient probabilistic data structure that can be used to estimate frequencies and heavy hitters in data streams • It was introduced in 2003 by Cormode and Muthukrishnan • It uses a hash table of p arrays of m counters • Elements update different …
0 credits | 69 pages | 630.01 KB | 1 year ago

Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 4/28: Graph Streaming … Streaming Connected Components • State: a disjoint set (union-find) data structure for the components • it stores a set of elements partitioned in disjoint subsets • Single-pass … • Similar challenges exist for a data-parallel implementation of spanners • How to represent the spanner? As an adjacency list? which …
0 credits | 72 pages | 7.77 MB | 1 year ago

Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 4/21: Sampling and filtering streams … Synopses for massive data streams • Maintaining synopses is often the only means of providing interactive response times when exploring massive datasets or high-speed data streams. • Queries are … A simple and efficient synopsis: Suppose that our data consists of a large numeric time series. What summary would let us compute the statistical variance …
0 credits | 74 pages | 1.06 MB | 1 year ago
25 results in total