Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... state that reflect part of the stream history they have seen • windows, continuous aggregations, distinct… • State is commonly partitioned by key • State can be cleared based on watermarks or punctuations ... logical conjunction • if A is a projection on multiple attributes • if A is an idempotent aggregation ... Operator separation: separate operators into smaller computational steps • beneficial ... Boston University 2020 • Cost of Merge = 0.5 • Cost of A = 0.5 • Splitting A allows a pre-aggregation similar to what combiners do in MapReduce
54 pages | 2.83 MB | 1 year ago
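The pre-aggregation idea in the snippet above can be sketched as follows. This is a minimal illustration, assuming A is a per-key sum; the names `partial_sum` (playing the role of A1) and `merge_sums` (playing the role of A2) are hypothetical and do not come from the slides.

```python
from collections import defaultdict


def partial_sum(partition):
    """A1 (hypothetical name): pre-aggregate one partition into per-key partial sums."""
    acc = defaultdict(int)
    for key, value in partition:
        acc[key] += value
    return dict(acc)


def merge_sums(partials):
    """A2 (hypothetical name): combine partial results into the final per-key sums."""
    total = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


# Two partitions of (key, value) records, as they might arrive at parallel instances.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("a", 5)],
]

# Split plan: pre-aggregate each partition, then merge the partial results.
split_result = merge_sums(partial_sum(p) for p in partitions)

# Unsplit plan: aggregate all records in one step.
direct_result = partial_sum([record for p in partitions for record in p])
```

Both plans produce the same sums, but the split plan forwards at most one record per key and partition to the merge step, which is the effect combiners have in MapReduce.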
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... measurements analysis • Monitoring applications • Complex filtering and alarm activation • Aggregation of multiple sensors and joins • Examples • Real-time statistics, e.g. weather maps • Monitor activity analysis • Visualization and aggregation • impressions, clicks, transactions, likes, comments • Analytics on user activity • Filtering, aggregation, joins with static data (e.g. user profile ...
34 pages | 2.53 MB | 1 year ago
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... given the constraint that system throughput matches the data input rate • In the case of known aggregation functions, results can be scaled using approximate query processing techniques, where accuracy ... a data stream manager. (VLDB ’03) • N. Tatbul and S. Zdonik. Window-aware load shedding for aggregation queries over data streams. (VLDB ’06) • N. Tatbul, U. Çetintemel, and S. Zdonik. Staying fit: ...
43 pages | 2.42 MB | 1 year ago
Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... functions define the computation that is performed on the elements of a window • Incremental aggregation functions are applied when an element is added to a window: • They maintain a single value as ...
35 pages | 444.84 KB | 1 year ago
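A minimal sketch of an incremental aggregation function, assuming the aggregate is a maximum and that the single maintained value is the window's only state. The class `IncrementalMax` is a hypothetical illustration, not an API from the slides or from any stream processor.

```python
class IncrementalMax:
    """Keeps one value as window state and updates it per added element."""

    def __init__(self):
        # No element has been added to the window yet.
        self.value = None

    def add(self, element):
        # Called each time an element is added to the window.
        if self.value is None or element > self.value:
            self.value = element

    def result(self):
        # Called when the window is evaluated.
        return self.value


window = IncrementalMax()
for reading in [3, 7, 5]:
    window.add(reading)
window_result = window.result()
```

Because only the running maximum is stored, the state stays constant in size no matter how many elements the window receives.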
Scalable Stream Processing - Spark Streaming and Flink
... where("id > 10") // using untyped APIs
ds.filter(_.id > 10).map(_.action) // using typed APIs
// Aggregation
df.groupBy("action") // using untyped API
ds.groupByKey(_.action) // using typed API
// SQL commands ...
113 pages | 1.22 MB | 1 year ago
Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Counting distinct elements • Vasiliki Kalavri | Boston University 2020 • How can we count the number of distinct elements seen so far in a stream? • Example use-case: Distinct users visiting one or multiple webpages • Naive solution: maintain ...
69 pages | 630.01 KB | 1 year ago
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... ins(P:i) = insert(i, ins(P)), where P:i denotes the sequence P extended by item i. Insert-Unique (distinct): The reconstitution function ins_u checks for duplicates: • ins_u([]) = Ø • ins_u(P:i) = if i ... Vasiliki Kalavri | Boston University 2020 • The average of a stream of integers? • The number of distinct users who have visited a website? • The top-10 queries inserted in a search engine? • The connected ... universal synopsis solution • They are purpose-built and query-specific • different synopsis to count distinct elements than to keep track of top-K events ... Dataflow ...
45 pages | 1.22 MB | 1 year ago
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Use keyed state to store and access state in the context of a key attribute: • For each distinct value of the key attribute, Flink maintains one state instance. • The keyed state instances of ...
24 pages | 914.13 KB | 1 year ago
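The idea of one state instance per distinct key value can be sketched in plain Python. This is not Flink's API; the class `KeyedCounter` and its methods are hypothetical and only illustrate the access pattern, with a dictionary standing in for the keyed state backend.

```python
class KeyedCounter:
    """Counts events per key, holding one state instance per distinct key."""

    def __init__(self):
        # Maps each key value to its own state instance (here, an int count).
        self._state = {}

    def process(self, key):
        # Only the state instance of the current record's key is read and updated.
        count = self._state.get(key, 0) + 1
        self._state[key] = count
        return count

    def num_state_instances(self):
        return len(self._state)


counter = KeyedCounter()
outputs = [counter.process(k) for k in ["user1", "user2", "user1"]]
```

Each record touches only the entry for its own key, which is what allows keyed state to be partitioned across parallel operator instances.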
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Drawbacks of sampling • Vasiliki Kalavri | Boston University 2020 • ... difficult to find a good estimator for some queries: • How can we scale the answer for NOT IN, DISTINCT, anti-joins, outer-joins
74 pages | 1.06 MB | 1 year ago
9 results in total