Scalable Stream Processing - Spark Streaming and Flink
= new StreamingContext(conf, Seconds(1))
▶ It can also be created from an existing SparkContext object.
val sc = ... // existing SparkContext
val ssc = new StreamingContext(sc, Seconds(1))
Output operations
Input Operations
▶ Every input DStream is associated with a Receiver object.
• It receives the data from a source and stores it in Spark’s memory for processing.
▶ Three categories
0 points | 113 pages | 1.22 MB | 1 year ago

Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
single-pass
Updates: arbitrary | append-only
Update rates: relatively low | high, bursty
Processing Model: query-driven / pull-based | data-driven / push-based
Queries: ad-hoc | continuous
Latency: relatively
Time-Series Model: The jth update is (j, A[j]) and updates arrive in increasing order of j, i.e. we observe the entries of A by increasing index. This can model time-series data streams:
• sequence of measurements from a temperature sensor
• the volume of NASDAQ stock trades over time
This model poses a severe limitation on the stream: updates cannot change past entries in A.
Useful in
0 points | 45 pages | 1.22 MB | 1 year ago

PyFlink 1.15 Documentation
1.3.6.1 Q1: ‘tuple’ object has no attribute ‘_values’
1.3.6.2 Q2: AttributeError: ‘int’ object has no attribute ‘encode’
PyFlink jobs with Flink Kubernetes Operator.
1.1.2 QuickStart
1.1.2.1 QuickStart: Table API
This document is a short introduction to the PyFlink Table API, which is used to help novice users quickly understand
TableEnvironment at 0x7fcd1ad0c0f0>
Table Creation
Table is a core component of the Python Table API. A Table object describes a pipeline of data transformations. It does not contain the data itself in any way. Instead
0 points | 36 pages | 266.77 KB | 1 year ago

PyFlink 1.16 Documentation
1.3.6.1 Q1: ‘tuple’ object has no attribute ‘_values’
1.3.6.2 Q2: AttributeError: ‘int’ object has no attribute ‘encode’
PyFlink jobs with Flink Kubernetes Operator.
1.1.2 QuickStart
1.1.2.1 QuickStart: Table API
This document is a short introduction to the PyFlink Table API, which is used to help novice users quickly understand
TableEnvironment at 0x7fcd1ad0c0f0>
Table Creation
Table is a core component of the Python Table API. A Table object describes a pipeline of data transformations. It does not contain the data itself in any way. Instead
0 points | 36 pages | 266.80 KB | 1 year ago

Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020
poles placement, sampling period, damping
Cannot identify individual bottlenecks or model 2-input operators
Vasiliki Kalavri | Boston University 2020
Heuristic models
• Metrics
The DS2 model
• Collect metrics per configurable observation window W
• activity Rprc and records pushed to output Rpsd
The DS2 model
• Collect metrics per configurable observation window W
• activity durations per worker
• records
0 points | 93 pages | 2.42 MB | 1 year ago

Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020
graph directed graph
Vasiliki Kalavri | Boston University 2020
Graph streams
Graph streams model interactions as events that update an underlying graph structure
Edge events: A purchase
Some algorithms model graph streams as a sequence of vertex events. A vertex stream consists of events that contain a vertex and all of its neighbors. Although this model can enable a theoretical analysis of streaming algorithms, it cannot adequately model real-world unbounded streams, as the neighbors cannot be known in advance.
Vertex streams (not today)
0 points | 72 pages | 7.77 MB | 1 year ago

Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020
be expressed using only non-blocking operators?
Vasiliki Kalavri | Boston University 2020
Model and formalization (I)
A stream is a sequence of unbounded length, where tuples are ordered by their
$t \in S$ to denote that, for some $1 \le i \le n$, $t_i = t$.
Model and formalization (II)
Pre-sequence (prefix): Let $S = [t_1, \ldots, t_n]$ be a sequence and $0 < k \le n$. Then
streaming and static data.
Requirements (or why SQL is not enough)
• Push-based model as opposed to the pull-based model of SQL, i.e. an application or client asks for the query results when they need
0 points | 53 pages | 532.37 KB | 1 year ago

Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics Spring 2020
g-102
• Watermarks, Tables, Event Time, and the Dataflow Model: https://www.confluent.jp/blog/watermarks-tables-event-time-dataflow-model/
Further reading
0 points | 22 pages | 2.22 MB | 1 year ago

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020
map(String key, String value):
  // key: document name
  // value: document contents
  for each URL u in value:
    EmitIntermediate(u, "1");
reduce(String
0 points | 54 pages | 2.83 MB | 1 year ago

Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
aster/ClusterData2011_2.md
Make sure to read and become familiar with the format and schema document:
• https://drive.google.com/file/d/0B5g07T_gRDg9Z0lsSTEtTWtpOW8/view
Download and play around
0 points | 34 pages | 2.53 MB | 1 year ago
16 results in total