Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020
…subscription. • DB query results depend on a snapshot and clients are not notified if their query result changes later.
Message delivery and ordering: acknowledgements are messages from the client…
• Data streaming from various processes or devices • a residential sensor can stream data to backend servers hosted in the cloud.
A publisher application creates a topic and sends messages…
0 credits | 33 pages | 700.14 KB | 1 year ago
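
The excerpt describes the basic pub/sub interaction: a publisher creates a topic and sends messages, and subscribers are notified of new data instead of polling a query snapshot. Below is a minimal sketch of the publishing side, using Apache Kafka purely as an illustrative broker (the excerpt does not name a specific system); the broker address, topic name, and sensor values are hypothetical.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // A minimal publisher sketch: a sensor process publishes readings to a topic.
    // Kafka is used only as an example pub/sub system; all names are placeholders.
    public class SensorPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish one reading; subscribers consume it independently of this process.
                producer.send(new ProducerRecord<>("sensor-readings", "sensor-42", "21.5"));
                producer.flush();
            }
        }
    }

A subscriber would register for the same topic and receive each published reading as it arrives, without re-querying a backend database.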

Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Local State Backend / External State Backend.
…triggers a checkpoint of its local state at the state backend, and broadcasts barriers to all outgoing stream partitions. • The state backend notifies the task once its state checkpoint is complete…
…The checkpointing and recovery mechanism only resets the internal state of a streaming application • Some result records might be emitted multiple times to downstream systems.
0 credits | 81 pages | 13.18 MB | 1 year ago
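
For reference, the checkpointing that injects these barriers is enabled per job. A minimal sketch, assuming a Flink 1.x DataStream job; the interval and the trivial pipeline are placeholders, not taken from the slides.

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    // Enable periodic checkpoints so barriers are injected into the stream and
    // each task snapshots its local state at the configured state backend.
    public class CheckpointingSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Trigger a checkpoint every 10 seconds with exactly-once state guarantees
            // (interval chosen only for illustration).
            env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

            // Placeholder pipeline so the job has something to run.
            env.fromElements(1, 2, 3).print();

            env.execute("checkpointed-job");
        }
    }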

State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Which backend to choose? …state accesses and fault tolerance • Limited to TaskManager's memory and might suffer from GC pauses.
RocksDBStateBackend • Stores all state … and supports incremental checkpoints • Use for applications with very large state. RocksDB is an LSM-tree storage …
conf/flink-conf.yaml:
    # Supported backends are 'jobmanager', 'filesystem', 'rocksdb'
    # state.backend: rocksdb
    # Directory for checkpoints filesystem
    # state.checkpoints.dir: path/to/checkpoint/folder/
0 credits | 24 pages | 914.13 KB | 1 year ago
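
The same backend choice can also be made programmatically per job. A minimal sketch, assuming a Flink 1.x release (as used in the Spring 2020 course); the checkpoint URI is a placeholder and `true` enables incremental checkpoints.

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    // Configure the RocksDB state backend for one job instead of via flink-conf.yaml.
    public class RocksDBBackendSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setStateBackend(new RocksDBStateBackend("file:///path/to/checkpoint/folder/", true));
            env.enableCheckpointing(60_000L);

            // Placeholder pipeline; real jobs with large keyed state benefit most
            // from the on-disk RocksDB backend.
            env.fromElements("a", "b", "a").print();
            env.execute("rocksdb-backend-job");
        }
    }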

Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020
…the spanner? As an adjacency list? Which state primitives are suitable? Is RocksDB a suitable backend for graph state? • How to compute the distance between edges? Do we need to do that for every…
0 credits | 72 pages | 7.77 MB | 1 year ago
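
One possible answer to the state-primitives question (an illustration, not taken from the slides) is to keep each vertex's adjacency list in Flink's keyed MapState, which the RocksDB backend can spill to disk. Types and semantics below are assumptions.

    import org.apache.flink.api.common.state.MapState;
    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Per-vertex adjacency list stored as keyed MapState (neighbor id -> edge weight).
    public class AdjacencyListState
            extends KeyedProcessFunction<Long, Tuple2<Long, Long>, String> {

        private transient MapState<Long, Long> neighbors;

        @Override
        public void open(Configuration parameters) {
            neighbors = getRuntimeContext().getMapState(
                    new MapStateDescriptor<>("neighbors", Types.LONG, Types.LONG));
        }

        @Override
        public void processElement(Tuple2<Long, Long> edge, Context ctx, Collector<String> out)
                throws Exception {
            // edge = (src, dst); the stream is keyed by src, so this state is per vertex.
            neighbors.put(edge.f1, 1L); // weight fixed to 1 for illustration
            out.collect("vertex " + ctx.getCurrentKey() + " gained neighbor " + edge.f1);
        }
    }

A stream of edges keyed by source vertex would apply it with something like edges.keyBy(...).process(new AdjacencyListState()).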

High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020
…produce output. What can go wrong: • lost events • duplicate or lost state updates • wrong result. Was mi fully processed? Was mo delivered downstream?
Processing guarantees and result semantics [figure: a rolling "sum" operator processing a stream of numbers, shown over several animation frames]
0 credits | 49 pages | 2.08 MB | 1 year ago
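
To make the "wrong result" failure mode concrete, here is a toy calculation (not from the slides; all values invented) of a running sum when two unacknowledged events are replayed after a crash, with and without rolling the operator state back to a checkpoint.

    import java.util.List;

    // Toy illustration of duplicate state updates: the input [3, 1] is replayed after a
    // failure. If the sum's state is first restored from a checkpoint taken before those
    // events, the result is exact; if the post-crash state is kept, they are counted twice.
    public class GuaranteeDemo {
        public static void main(String[] args) {
            List<Integer> beforeFailure = List.of(4, 2, 3, 1); // processed, then crash
            List<Integer> replayed = List.of(3, 1);            // events never acknowledged

            int checkpointedSum = 4 + 2; // snapshot taken before 3 and 1 were processed

            // Exactly-once behaviour: restore state from the checkpoint, then replay.
            int exactlyOnce = checkpointedSum;
            for (int v : replayed) exactlyOnce += v;

            // At-least-once behaviour: state already reflects all four events, and the
            // replayed events are added on top of it, double-counting 3 and 1.
            int atLeastOnce = beforeFailure.stream().mapToInt(Integer::intValue).sum();
            for (int v : replayed) atLeastOnce += v;

            System.out.println("exactly-once sum:  " + exactlyOnce);  // 10
            System.out.println("at-least-once sum: " + atLeastOnce);  // 14
        }
    }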

Scalable Stream Processing - Spark Streaming and Flink
…DStream of (word, 1). ▶ Get the frequency of words in each batch of data. ▶ Finally, print the result.
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()
…trigger fires, Spark checks for new data (new row in the input table), and incrementally updates the result. Programming Model (1/2) ▶ Two main steps to develop a Spark structured streaming: ▶ 1 …
0 credits | 113 pages | 1.22 MB | 1 year ago
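
The Structured Streaming model excerpted above boils down to two steps: define a query over the unbounded input table, then start it so each trigger incrementally updates the result. The sketch below is an illustrative Java variant of the standard word-count example; the socket source on localhost:9999 and the console sink are assumptions, not taken from the slides.

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.explode;
    import static org.apache.spark.sql.functions.split;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class StructuredWordCount {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("structured-word-count")
                    .master("local[*]")
                    .getOrCreate();

            // Step 1: define the query over the unbounded input table.
            Dataset<Row> lines = spark.readStream()
                    .format("socket")
                    .option("host", "localhost")
                    .option("port", 9999)
                    .load();
            Dataset<Row> wordCounts = lines
                    .select(explode(split(col("value"), " ")).alias("word"))
                    .groupBy("word")
                    .count();

            // Step 2: start the streaming query; every time the trigger fires, Spark
            // checks for new rows and incrementally updates the result table.
            StreamingQuery query = wordCounts.writeStream()
                    .outputMode("complete")
                    .format("console")
                    .start();
            query.awaitTermination();
        }
    }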

PyFlink 1.15 Documentation
word_count.py
    # You will see outputs as following:
    # Use --input to specify file input.
    # Printing result to stdout. Use --output to specify output path.
    # +I[To, 1]
    # +I[be,, 1]
    # +I[or, 1]
    # +I[not, 1]
…used in Table API & SQL:
    from pyflink.table.udf import udf

    # create a general Python UDF
    @udf(result_type=DataTypes.BIGINT())
    def plus_one(i):
        return i + 1

    table.select(plus_one(col('id'))).to_pandas()
    #    _c0
    # 0    2
    # 1    3

    # create a general Python UDF
    @udf(result_type=DataTypes.BIGINT(), func_type='pandas')
    def pandas_plus_one(series):
        return series + 1
0 credits | 36 pages | 266.77 KB | 1 year ago

PyFlink 1.16 Documentation
(The indexed excerpt is identical to the PyFlink 1.15 entry above, apart from the release string "release-1.16".)
0 credits | 36 pages | 266.80 KB | 1 year ago

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
…conditions does the optimization preserve correctness? • maintain state semantics • maintain result and selectivity semantics • Dynamism: can the optimization be applied during runtime or does it …
Operator re-ordering; Re-ordering split and merge: …attributes A writes to. • Commutativity: the results of applying A and then B must be the same as the result of applying B and then A • holds if both operators are stateless.
0 credits | 54 pages | 2.83 MB | 1 year ago
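
As an illustration of the commutativity condition for two stateless operators (a sketch, not from the slides): the filter below reads only the original value and none of the attributes the map writes, so both orderings produce the same output, and pushing the selective filter first reduces the work done by the map. Operator names and values are invented.

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ReorderingSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<Integer> values = env.fromElements(1, 42, 7, 99, 3);

            // Original plan: map first (adds a label field), then filter on the raw value.
            values.map(v -> Tuple2.of(v, "label-" + v))
                  .returns(Types.TUPLE(Types.INT, Types.STRING))
                  .filter(t -> t.f0 > 10)
                  .print("map-then-filter");

            // Re-ordered plan: filter first; the output tuples are identical because the
            // filter predicate does not read anything the map writes.
            values.filter(v -> v > 10)
                  .map(v -> Tuple2.of(v, "label-" + v))
                  .returns(Types.TUPLE(Types.INT, Types.STRING))
                  .print("filter-then-map");

            env.execute("reordering-sketch");
        }
    }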

Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020
• They maintain a single value as window state and eventually emit the aggregated value as the result. • ReduceFunction and AggregateFunction • Full window functions collect all elements of a window …
    … accumulator);
    // compute the result from the accumulator and return it.
    OUT getResult(ACC accumulator);
    // merge two accumulators and return the result.
    ACC merge(ACC a, ACC b);
    }
(Slide annotations: input type, accumulator type, output type; initialization, accumulate one element, compute the result, merge two partial accumulators.)
Use the ProcessWindowFunction …
0 credits | 35 pages | 444.84 KB | 1 year ago
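
A complete AggregateFunction matching the interface excerpted above could look as follows: it keeps only a (sum, count) pair as window state and emits the average. The Double input type is an assumption for illustration.

    import org.apache.flink.api.common.functions.AggregateFunction;
    import org.apache.flink.api.java.tuple.Tuple2;

    // Incremental aggregation: the accumulator is a (running sum, element count) pair.
    public class AverageAggregate
            implements AggregateFunction<Double, Tuple2<Double, Long>, Double> {

        @Override
        public Tuple2<Double, Long> createAccumulator() {
            return Tuple2.of(0.0, 0L); // initialization
        }

        @Override
        public Tuple2<Double, Long> add(Double value, Tuple2<Double, Long> acc) {
            return Tuple2.of(acc.f0 + value, acc.f1 + 1); // accumulate one element
        }

        @Override
        public Double getResult(Tuple2<Double, Long> acc) {
            return acc.f1 == 0 ? 0.0 : acc.f0 / acc.f1; // compute the result
        }

        @Override
        public Tuple2<Double, Long> merge(Tuple2<Double, Long> a, Tuple2<Double, Long> b) {
            return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1); // merge two partial accumulators
        }
    }

It would be applied with something like stream.keyBy(...).window(...).aggregate(new AverageAggregate()).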

19 results in total (page 1 of 2)