Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... dataflow with sources S1, S2, … Sn and rates λ1, λ2, … λn, identify the minimum parallelism πi per operator i, such that the physical dataflow can sustain all source rates. ... Vasiliki Kalavri | Boston University 2020 ... o1 src o2 back-pressure target: 40 rec/s 10 rec/s 100 rec/s. Which operator is the bottleneck? What if we scale o1 x 4? How much to scale o2? o1 cannot keep up waiting for ...
0 points | 93 pages | 2.42 MB | 1 year ago
PyFlink 1.15 Documentation
... 1.3.1.1 O1: Scala Dependency; 1.3.1.2 O2: Java gateway process exited before sending its port number; ... 1.3.2.1 O1: How to prepare Python Virtual Environment; 1.3.2.2 O2: How to add Python Files; ... 1.3.3.1 O1: InaccessibleObjectException: Unable to make field private final byte[] java.lang.String.value accessible: module java.base does not “opens java.lang” to unnamed module @4e4aea35
0 points | 36 pages | 266.77 KB | 1 year ago
PyFlink 1.16 Documentation
... 1.3.1.1 O1: Scala Dependency; 1.3.1.2 O2: Java gateway process exited before sending its port number; ... 1.3.2.1 O1: How to prepare Python Virtual Environment; 1.3.2.2 O2: How to add Python Files; ... 1.3.3.1 O1: InaccessibleObjectException: Unable to make field private final byte[] java.lang.String.value accessible: module java.base does not “opens java.lang” to unnamed module @4e4aea35
0 points | 36 pages | 266.80 KB | 1 year ago
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... variable that is accessed by a task’s business logic. Operator state is scoped to an operator task, i.e. records processed by the same parallel task have access to the same state • It cannot be accessed ... OutOfMemoryError if state grows too large, GC pauses • Checkpoints sent to JobManager's heap memory, i.e. the state is lost in case of failure • Use only for development and debugging purposes! FsStateBackend ... T • ListState.add(value: T) • ListState.addAll(values: java.util.List[T]) • ListState.get(): Iterable[T] • ListState.update(values: java.util.List[T]) Flink’s state primitives ... Vasiliki Kalavri
0 points | 24 pages | 914.13 KB | 1 year ago
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... different type. A series of transformations on streams in Stream SQL, Scala, Python, Rust, Java… Vasiliki Kalavri | Boston University 2020 ... Logic State <#Brexit, 521> <#WorldCup ... of output elements produced per number of input elements • a map operator has a selectivity of 1, i.e. it produces one output element for each input element it processes • an operator that tokenizes ... estimate the cost of different strategies? • before execution or during runtime. Query optimization (I) ... Optimization strategies • enumerate equivalent ...
0 points | 54 pages | 2.83 MB | 1 year ago
Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... temperature”) } } Flink programs are defined in regular Scala/Java methods. Set up the execution environment: local, cluster, I/O, time semantics, parallelism, … Example: Sensor Readings ... Vasiliki ... t_out ... Run with a class entry point and arguments: ./bin/flink run -c org.apache.flink.examples.java.wordcount.WordCount \ ./examples/batch/WordCount.jar \ ... more topics and processes the stream of records published in them. Topics are multi-subscriber, i.e. a topic can have zero, one, or many consumers that subscribe to the data written to it. For each ...
0 points | 26 pages | 3.33 MB | 1 year ago
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... They are produced by external sources, i.e. the DSMS has no control over their arrival order or the data rate. • They have unknown, possibly unbounded length, i.e. the DSMS does not know when the stream ... Time-Series Model: The jth update is (j, A[j]) and updates arrive in increasing order of j, i.e. we observe the entries of A by increasing index. This can model time-series data streams: • a ... append-only • what if packets arrive late? • we might need to revise the computed total traffic, i.e. output stream might contain updates to previously emitted items ...
0 points | 45 pages | 1.22 MB | 1 year ago
Streaming in Apache Flink
... serializer/deserializer for it) • Flink has a built-in type system which supports: • basic types, i.e., String, Long, Integer, Boolean, Array • composite types: Tuples, POJOs, and Scala case classes ... types. Type | Examples. Tuples: Tuple1 through Tuple25 types. POJOs: A POJO (plain old Java object) is any Java class that • has an empty default constructor • all fields are either ◦ public, or ◦ have ...
0 points | 45 pages | 3.00 MB | 1 year ago
Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... Vasiliki Kalavri | Boston University 2020 ... Passive Standby • Each primary periodically checkpoints its state and sends it to the secondary (figure: primary Ni, secondary N’i, input I1, output O1; update checkpoint, send state) ... How can we make sure that checkpoints are meaningful and coherent? ... Leslie Lamport: "The distributed snapshot algorithm described here came about when I visited Chandy, who was then at the University of Texas in Austin. He posed the problem to me over ..."
0 points | 81 pages | 13.18 MB | 1 year ago
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... shedding as an optimization problem. N: query network. I: set of input streams with known arrival rates. C: system processing capacity. H: headroom factor, i.e. a conservative estimate of the percentage of ... Load(N(I)): the load as a fraction of the total capacity C that network N(I) presents. Uacc: the aggregate utility ... Find a new network N' such that Load(N'(I)) < H x C and Uacc(N(I)) - Uacc(N'(I)) is minimized ... Vasiliki Kalavri | Boston University 2020 ... Implementation • Load shedding is commonly implemented by a standalone component integrated with the stream processor • The load ...
0 points | 43 pages | 2.42 MB | 1 year ago













