Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020poles placement, sampling period, damping Cannot identify individual bottlenecks neither model 2-input operators ??? Vasiliki Kalavri | Boston University 2020 Heuristic models 11 • Metrics 1s o2 o1 ??? Vasiliki Kalavri | Boston University 2020 The DS2 model 17 ??? Vasiliki Kalavri | Boston University 2020 The DS2 model • Collect metrics per configurable observation window W • activity Rprc and records pushed to output Rpsd 17 ??? Vasiliki Kalavri | Boston University 2020 The DS2 model • Collect metrics per configurable observation window W • activity durations per worker • records0 码力 | 93 页 | 2.42 MB | 1 年前3
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020single-pass Updates arbitrary append-only Update rates relatively low high, bursty Processing Model query-driven / pull-based data-driven / push-based Queries ad-hoc continuous Latency relatively University 2020 Time-Series Model: The jth update is (j, A[j]) and updates arrive in increasing order of j, i.e. we observe the entries of A by increasing index. This can model time-series data streams: sequence of measurements from a temperature sensor • the volume of NASDAQ stock trades over time This model poses a severe limitation on the stream: updates cannot change past entries in A. 11 Useful in0 码力 | 45 页 | 1.22 MB | 1 年前3
Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics Spring 20202020 2/06: Notions of time and progress Vasiliki Kalavri | Boston University 2020 Mobile game application • input stream: user activity • output: rewards based on how fast the user meets goals • along dataflow edges. They are special records generated by the sources or assigned by the application. A watermark for time T states that event time has progressed to T in that particular stream g-102 • Watermarks, Tables, Event Time, and the Dataflow Model: https:// www.confluent.jp/blog/watermarks-tables-event-time-dataflow-model/ Further reading 220 码力 | 22 页 | 2.22 MB | 1 年前3
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020be expressed using only non-blocking operators? 22 Vasiliki Kalavri | Boston University 2020 Model and formalization (I) A stream is a sequence of unbounded length, where tuples are ordered by their t ∈ S to denote that, for some 1 ≤ i ≤ n, ti = t. 23 Vasiliki Kalavri | Boston University 2020 Model and formalization (II) Pre-sequence (prefix): Let S = [t1, … ,tn] be a sequence and 0 < k ≤ n. Then static data. Requirements (or why SQL is not enough) • Push-based model as opposed to the pull-based model of SQL, i.e. an application or client asks for the query results when they need them. • The0 码力 | 53 页 | 532.37 KB | 1 年前3
PyFlink 1.15 DocumentationConda install pandoc conda install pandoc 3. Build the docs python3 setup.py build_sphinx 4. Open the pyflink-docs/build/sphinx/html/index.html in the Browser 1.1 Getting Started This page summarizes environments to use. ./bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m \ -Dyarn.application.name=\ -pyclientexec could not meet. ./bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m \ -Dyarn.application.name= \ -Dyarn 0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 DocumentationConda install pandoc conda install pandoc 3. Build the docs python3 setup.py build_sphinx 4. Open the pyflink-docs/build/sphinx/html/index.html in the Browser 1.1 Getting Started This page summarizes environments to use. ./bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m \ -Dyarn.application.name=\ -pyclientexec could not meet. ./bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m \ -Dyarn.application.name= \ -Dyarn 0 码力 | 36 页 | 266.80 KB | 1 年前3
Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020Introduction to Apache Flink and Apache Kafka Vasiliki Kalavri | Boston University 2020 Apache Flink • An open-source, distributed data analysis framework • True streaming at its core • Streaming & Batch API Default parallelism for jobs. You can override this option by using env.setParallelism() in your application. taskmanager.numberOfTaskSlots: The number of parallel operator or user function instances that /bin/start-cluster.sh Stop Flink: ./bin/stop-cluster.sh Run an application with no arguments: ./bin/flink run ./examples/batch/WordCount.jar Run an application with input and output arguments: ./bin/flink run0 码力 | 26 页 | 3.33 MB | 1 年前3
Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020pre-snapshot p1 p2 p3 m m’ C’ A B ??? Vasiliki Kalavri | Boston University 2020 System model: • No failures during snapshotting • FIFO reliable channels: no lost or duplicate messages • recorded in the snapshot but enforces the causal consistency. 3. Starts recording all data (application) messages it receives on all of its incoming channels. 20 ??? Vasiliki Kalavri | Boston University University 2020 Can we apply this algorithm to retrieve a consistent snapshot of a stream processing application? 31 ??? Vasiliki Kalavri | Boston University 2020 32 Epoch-Based Stream Execution Logged Input0 码力 | 81 页 | 13.18 MB | 1 年前3
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020graph directed graph 4 ??? Vasiliki Kalavri | Boston University 2020 Graph streams Graph streams model interactions as events that update an underlying graph structure 5 Edge events: A purchase 2020 8 Some algorithms model graph streams a sequence of vertex events. A vertex stream consists of events that contain a vertex and all of its neighbors. Although this model can enable a theoretical theoretical analysis of streaming algorithms, it cannot adequately model real-world unbounded streams, as the neighbors cannot be known in advance. Vertex streams (not today) ??? Vasiliki Kalavri | Boston0 码力 | 72 页 | 7.77 MB | 1 年前3
Scalable Stream Processing - Spark Streaming and FlinkBuilt on the Spark SQL engine. ▶ Perform database-like query optimizations. 56 / 79 Programming Model (1/2) ▶ Two main steps to develop a Spark stuctured streaming: ▶ 1. Defines a query on the input new data (new row in the input table), and incrementally updates the result. 57 / 79 Programming Model (1/2) ▶ Two main steps to develop a Spark stuctured streaming: ▶ 1. Defines a query on the input new data (new row in the input table), and incrementally updates the result. 57 / 79 Programming Model (1/2) ▶ Two main steps to develop a Spark stuctured streaming: ▶ 1. Defines a query on the input0 码力 | 113 页 | 1.22 MB | 1 年前3
共 19 条
- 1
- 2













