Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020single-pass Updates arbitrary append-only Update rates relatively low high, bursty Processing Model query-driven / pull-based data-driven / push-based Queries ad-hoc continuous Latency relatively a massive, dynamic, one-dimensional vector A[1…N]. The size N of the streaming vector is defined as the product of the attribute domain size(s). Note that N might be unknown. up-to-date frequencies University 2020 Time-Series Model: The jth update is (j, A[j]) and updates arrive in increasing order of j, i.e. we observe the entries of A by increasing index. This can model time-series data streams:0 码力 | 45 页 | 1.22 MB | 1 年前3
Scalable Stream Processing - Spark Streaming and Flinkproportional to the size of the state. ▶ mapWithState • It is executed only on set of keys that are available in the last micro batch. • The performance is proportional to the size of the batch. 46 / proportional to the size of the state. ▶ mapWithState • It is executed only on set of keys that are available in the last micro batch. • The performance is proportional to the size of the batch. 46 / proportional to the size of the state. ▶ mapWithState • It is executed only on set of keys that are available in the last micro batch. • The performance is proportional to the size of the batch. 46 /0 码力 | 113 页 | 1.22 MB | 1 年前3
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020Sliding windows have fixed size but both their bounds advance for new events • last 10 events or event in the last minute • Tumble windows are non-overlapping fixed-size • events every hour • Custom Custom windows have neither fixed bounds nor fixed size • events in a period during which a user was active 17 Vasiliki Kalavri | Boston University 2020 Flow Management Operators (I) • Join operators be expressed using only non-blocking operators? 22 Vasiliki Kalavri | Boston University 2020 Model and formalization (I) A stream is a sequence of unbounded length, where tuples are ordered by their0 码力 | 53 页 | 532.37 KB | 1 年前3
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020graph directed graph 4 ??? Vasiliki Kalavri | Boston University 2020 Graph streams Graph streams model interactions as events that update an underlying graph structure 5 Edge events: A purchase 2020 8 Some algorithms model graph streams a sequence of vertex events. A vertex stream consists of events that contain a vertex and all of its neighbors. Although this model can enable a theoretical theoretical analysis of streaming algorithms, it cannot adequately model real-world unbounded streams, as the neighbors cannot be known in advance. Vertex streams (not today) ??? Vasiliki Kalavri | Boston0 码力 | 72 页 | 7.77 MB | 1 年前3
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020processed? Was mo delivered downstream? Vasiliki Kalavri | Boston University 2020 A simple system model stream sources N1 NK N2 … input queue output queue primary nodes secondary nodes other apps The number of lost events depends on • failure detection delay • stream input rates • state size • No runtime overhead 13 Vasiliki Kalavri | Boston University 2020 Passive Standby • Each primary0 码力 | 49 页 | 2.08 MB | 1 年前3
Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020poles placement, sampling period, damping Cannot identify individual bottlenecks neither model 2-input operators ??? Vasiliki Kalavri | Boston University 2020 Heuristic models 11 • Metrics 1s o2 o1 ??? Vasiliki Kalavri | Boston University 2020 The DS2 model 17 ??? Vasiliki Kalavri | Boston University 2020 The DS2 model • Collect metrics per configurable observation window W • activity Rprc and records pushed to output Rpsd 17 ??? Vasiliki Kalavri | Boston University 2020 The DS2 model • Collect metrics per configurable observation window W • activity durations per worker • records0 码力 | 93 页 | 2.42 MB | 1 年前3
Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics Spring 2020g-102 • Watermarks, Tables, Event Time, and the Dataflow Model: https:// www.confluent.jp/blog/watermarks-tables-event-time-dataflow-model/ Further reading 220 码力 | 22 页 | 2.22 MB | 1 年前3
PyFlink 1.15 Documentation/bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m \ -Dyarn.application.name=\ -pyclientexec /path/to/venv/bin/python3 /bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m \ -Dyarn.application.name= \ -Dyarn.ship-files=/path/to/shipfiles via -pyarch will be distributed to the TaskManagers through blob server where the file size limit is 2 GB. If the size of an archive file is more than 2 GB, you could upload it to a distributed file system 0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 Documentation/bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m \ -Dyarn.application.name=\ -pyclientexec /path/to/venv/bin/python3 /bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m \ -Dyarn.application.name= \ -Dyarn.ship-files=/path/to/shipfiles via -pyarch will be distributed to the TaskManagers through blob server where the file size limit is 2 GB. If the size of an archive file is more than 2 GB, you could upload it to a distributed file system 0 码力 | 36 页 | 266.80 KB | 1 年前3
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020we limit the sample size from growing indefinitely? ??? Vasiliki Kalavri | Boston University 2020 Instead of a fixed proportion, assume we can only store a sample S of fixed size, e.g. s elements. 14 proportion, assume we can only store a sample S of fixed size, e.g. s elements. 14 How can we continuously maintain a representative fixed-size sample of the stream so far? ??? Vasiliki Kalavri | Boston proportion, assume we can only store a sample S of fixed size, e.g. s elements. 14 How can we continuously maintain a representative fixed-size sample of the stream so far? At all times, we want the0 码力 | 74 页 | 1.06 MB | 1 年前3
共 21 条
- 1
- 2
- 3













