Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020data access, high-rate append-only updates Data Warehouse • complex, offline analysis • large and relatively static and historical data • batched updates during downtimes, e.g. every night Boston University 2020 1. Process events online without storing them 2. Support a high-level language (e.g. StreamSQL) 3. Handle missing, out-of-order, delayed data 4. Guarantee deterministic (on analytics … Building a stream processor… 8 ? Vasiliki Kalavri | Boston University 2020 Basic Stream Models Vasiliki Kalavri | Boston University 2020 A stream can be viewed as a massive, dynamic, one-dimensional0 码力 | 45 页 | 1.22 MB | 1 年前3
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020type, content, timing constraints. • Actions define how to produce results from the matches. Language Types 3 Vasiliki Kalavri | Boston University 2020 Three classes of operators: • relation-to-relation: portions of a stream. • relation-to-stream: create streams through querying tables Declarative language: CQL 4 Vasiliki Kalavri | Boston University 2020 Select IStream(*) From S1 [Rows 5], S2 [Rows τ> whenever tuple s is in R at time τ. 6 Vasiliki Kalavri | Boston University 2020 Imperative language: Aurora SQuAl Queries are represented in graphical representation using boxes and arrows Tumble0 码力 | 53 页 | 532.37 KB | 1 年前3
Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020Queuing theory models: for latency objectives • Control theory models: e.g., PID controller • Rule-based models, e.g. if CPU utilization > 70% => scale out • Analytical dataflow-based models Action Predictive: at-once for all operators 8 ??? Vasiliki Kalavri | Boston University 2020 Queuing theory models 9 • Metrics • service time and waiting time per tuple and per task • total time spent processing predictive, at-once for all operators ??? Vasiliki Kalavri | Boston University 2020 Queuing theory models 9 • Metrics • service time and waiting time per tuple and per task • total time spent processing0 码力 | 93 页 | 2.42 MB | 1 年前3
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020maintains state: • rolling aggregations • window contents • input offsets • machine learning models State in dataflow computations 2 Vasiliki Kalavri | Boston University 2020 • No explicit state as regular objects on TaskManager’s heap • Low read/write latencies • OutOfMemoryError if large grows too large, GC pauses • Checkpoints sent to JobManager's heap memory, i.e. the state is lost in case to a remote file system and supports incremental checkpoints • Use for applications with very large state Which backend to choose? 9 Vasiliki Kalavri | Boston University 2020 RocksDB 10 RocksDB0 码力 | 24 页 | 914.13 KB | 1 年前3
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020during office hours. Vasiliki Kalavri | Boston University 2020 Dataset A subset of traces from a large (12.5k machines) Google cluster • https://github.com/google/cluster-data/blob/master/ ClusterData2011_2 challenging? 28 Vasiliki Kalavri | Boston University 2020 Using pseudocode (or the programming language of your choice), write a program that reads a stream of integers and computes: 29 1. the maximum0 码力 | 34 页 | 2.53 MB | 1 年前3
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020maintains state: • rolling aggregations • window contents • input offsets • machine learning models State in dataflow computations 3 Vasiliki Kalavri | Boston University 2020 Logic Statemaintains state: • rolling aggregations • window contents • input offsets • machine learning models State in dataflow computations 3 Vasiliki Kalavri | Boston University 2020 Logic State maintains state: • rolling aggregations • window contents • input offsets • machine learning models State in dataflow computations 3 Vasiliki Kalavri | Boston University 2020 4 Distributed streaming 0 码力 | 49 页 | 2.08 MB | 1 年前3
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020Kalavri | Boston University 2020 A simple and efficient synopsis Suppose that our data consists of a large numeric time series. What summary would let us compute the statistical variance of this series Kalavri | Boston University 2020 A simple and efficient synopsis Suppose that our data consists of a large numeric time series. What summary would let us compute the statistical variance of this series Kalavri | Boston University 2020 A simple and efficient synopsis Suppose that our data consists of a large numeric time series. What summary would let us compute the statistical variance of this series0 码力 | 74 页 | 1.06 MB | 1 年前3
Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020Boston University 2020 14 Combining estimates • Average won’t work: The expected value of 2R is too large. • Median won’t work: it is always a power of 2, thus, if the correct estimate is between two University 2020 18 Detect DNS DDoS attacks • Flooding the resources of the targeted system by sending a large number of query from a botnet • Group queries by their top-level domain and investigate most popular functions increases the collision probability • Counter overestimation is almost certain for very large data streams with high-frequency elements Counting Bloom Filter ??? Vasiliki Kalavri | Boston0 码力 | 69 页 | 630.01 KB | 1 年前3
PyFlink 1.15 Documentation. . . . . . . . . . . . . . . . . . . . . . 30 1.3.5.1 Q1: OverflowError: timeout value is too large . . . . . . . . . . . . . . . . . . . . 30 1.3.5.2 Q2: An error occurred while calling z:org.apache you to build scalable batch and streaming workloads, such as real-time data processing pipelines, large-scale exploratory data analysis, Machine Learning (ML) pipelines and ETL processes. If you’re already java: ˓→147) ... 39 more 1.3.5 Runtime issues 1.3.5.1 Q1: OverflowError: timeout value is too large File "D:\Anaconda3\envs\py37\lib\threading.py", line 926, in _bootstrap_inner self.run() File0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 Documentation. . . . . . . . . . . . . . . . . . . . . . 30 1.3.5.1 Q1: OverflowError: timeout value is too large . . . . . . . . . . . . . . . . . . . . 30 1.3.5.2 Q2: An error occurred while calling z:org.apache you to build scalable batch and streaming workloads, such as real-time data processing pipelines, large-scale exploratory data analysis, Machine Learning (ML) pipelines and ETL processes. If you’re already java: ˓→147) ... 39 more 1.3.5 Runtime issues 1.3.5.1 Q1: OverflowError: timeout value is too large File "D:\Anaconda3\envs\py37\lib\threading.py", line 926, in _bootstrap_inner self.run() File0 码力 | 36 页 | 266.80 KB | 1 年前3
共 18 条
- 1
- 2













