Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… events bear a valid timestamp, Vs, after which they are considered valid and can contribute to the result. • Alternatively, events can have validity intervals. • The contents of the relation at time … indexes and materialized views for high rates. • Incremental computation: do we recompute the result from scratch whenever a new record is appended to the stream table? … Synopses: maintain summaries of streams and large state. • Streams do not correspond to state one-to-one, i.e. state can be the result of one or more base and/or derived streams • Each query (operator) maintains its own state • Queries…
45 pages | 1.22 MB | 1 year ago

State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
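The "Stream processing fundamentals" excerpt raises the incremental-computation question: recompute the result from scratch on every append, or maintain a small synopsis and update it per record. A minimal plain-Python sketch of that trade-off (names are illustrative, not from the course material):

```python
# Incremental aggregation vs. full recomputation over an append-only stream.
# The running-sum "synopsis" is updated in O(1) per record, while the naive
# approach rescans the whole stream table on every append.

class RunningSum:
    """Tiny synopsis: keeps only the aggregate, not the full stream."""
    def __init__(self):
        self.total = 0

    def update(self, record):
        self.total += record          # incremental: O(1) per new record
        return self.total

def recompute_from_scratch(stream_table):
    return sum(stream_table)          # naive: O(n) on every append

stream_table = []
synopsis = RunningSum()
for record in [4, 3, 2, 1]:
    stream_table.append(record)
    incremental = synopsis.update(record)
    full = recompute_from_scratch(stream_table)
    assert incremental == full        # same result, very different cost

print(synopsis.total)  # -> 10
```

Both strategies agree on the result; the synopsis simply avoids re-reading the growing stream table.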
A pluggable component that determines how state is stored, accessed, and maintained. State backends are responsible for: • local state management • checkpointing state to a remote and persistent … distributed filesystem or a database system • Available state backends in Flink: • In-memory • File system • RocksDB … (Vasiliki Kalavri | Boston University 2020) … In conf/flink-conf.yaml: # Supported backends are 'jobmanager', 'filesystem', 'rocksdb' # state.backend: rocksdb # Directory for checkpoints…
24 pages | 914.13 KB | 1 year ago

High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020
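The state-backend excerpt describes a pluggable component: operators read and write state through one interface while the backend decides where the bytes live and how they are checkpointed. A dependency-free sketch of that idea; the class and method names are illustrative, not Flink's actual API:

```python
import json
import os
import tempfile

# Sketch of a pluggable state backend: keyed put/get plus a checkpoint
# operation that persists a snapshot for recovery. A RocksDB- or
# filesystem-backed variant would implement the same three methods.

class InMemoryBackend:
    def __init__(self):
        self._state = {}

    def put(self, key, value):
        self._state[key] = value

    def get(self, key, default=None):
        return self._state.get(key, default)

    def checkpoint(self, path):
        with open(path, "w") as f:        # snapshot to durable storage
            json.dump(self._state, f)

backend = InMemoryBackend()
backend.put("sensor-1", 42)
ckpt = os.path.join(tempfile.gettempdir(), "ckpt.json")
backend.checkpoint(ckpt)

# Recovery: a fresh backend restores its state from the checkpoint.
restored = InMemoryBackend()
with open(ckpt) as f:
    restored._state = json.load(f)
print(restored.get("sensor-1"))  # -> 42
```

Swapping the backend changes durability and capacity, not the operator code, which is the point of making it pluggable.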
… produce output. What can go wrong: • lost events • duplicate or lost state updates • wrong result. Was mi fully processed? Was mo delivered downstream? … [Slides titled "Processing guarantees and result semantics": a running-sum operator stepped through several animation frames.]
49 pages | 2.08 MB | 1 year ago

Scalable Stream Processing - Spark Streaming and Flink
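The guarantees excerpt asks whether mi was fully processed and mo delivered, and lists duplicate state updates among the failure modes. A small sketch of why this matters for a running sum: replaying events after a failure without rolling state back (at-least-once) double-counts, while restoring state from the checkpoint first (exactly-once) does not. The scenario and names are illustrative:

```python
# A sum operator crashes after 3 events; all events since the last
# checkpoint (taken after event index 2) are then replayed.

def process(events, failure_at, rollback_state):
    checkpoint_index = 2
    total = 0
    for e in events[:failure_at]:                 # pre-failure processing
        total += e
    checkpointed = sum(events[:checkpoint_index])  # state saved at checkpoint
    if rollback_state:
        total = checkpointed        # exactly-once: restore state, then replay
    for e in events[checkpoint_index:]:            # replay from checkpoint
        total += e
    return total

events = [1, 2, 3, 4]
exactly_once = process(events, failure_at=3, rollback_state=True)
at_least_once = process(events, failure_at=3, rollback_state=False)
print(exactly_once, at_least_once)  # -> 10 13 (event 3 counted twice)
```

The true sum is 10; without the state rollback, the replayed event is a duplicate state update and the result drifts to 13.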
… a DStream of (word, 1). ▶ Get the frequency of words in each batch of data. ▶ Finally, print the result. val pairs = words.map(word => (word, 1)); val wordCounts = pairs.reduceByKey(_ + _); wordCounts… ▶ When the trigger fires, Spark checks for new data (a new row in the input table) and incrementally updates the result. … Programming Model (1/2) ▶ Two main steps to develop a Spark structured streaming: ▶ 1. …
113 pages | 1.22 MB | 1 year ago

PyFlink 1.15 Documentation
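The Spark excerpt maps each word to a (word, 1) pair and reduces by key with addition. A plain-Python sketch of the same per-batch logic; Spark distributes this across partitions, while here a Counter plays the role of reduceByKey:

```python
from collections import Counter

# map: word -> (word, 1); reduceByKey(_ + _): sum the ones per word.
def word_count(batch_lines):
    pairs = [(word, 1) for line in batch_lines for word in line.split()]
    counts = Counter()
    for word, one in pairs:
        counts[word] += one            # reduceByKey(_ + _) analogue
    return dict(counts)

print(word_count(["to be or not to be"]))
# -> {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Each micro-batch produces its own counts; keeping a running total across batches is exactly the stateful-streaming case the other excerpts discuss.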
word_count.py # You will see outputs as following: # Use --input to specify file input. # Printing result to stdout. Use --output to specify output path. # +I[To, 1] # +I[be,, 1] # +I[or, 1] # +I[not, 1] … used in Table API & SQL: from pyflink.table.udf import udf # create a general Python UDF @udf(result_type=DataTypes.BIGINT()) def plus_one(i): return i + 1 table.select(plus_one(col('id'))).to_pandas() … pyflink-docs, Release release-1.15 … output: _c0 = [2, 3] … # create a pandas Python UDF @udf(result_type=DataTypes.BIGINT(), func_type='pandas') def pandas_plus_one(series): return series + 1 table…
36 pages | 266.77 KB | 1 year ago

PyFlink 1.16 Documentation
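The PyFlink excerpts define the same UDF two ways: a general (row-at-a-time) UDF and a pandas (vectorized) UDF. A dependency-free sketch of the semantic difference; in real PyFlink the vectorized function receives a pandas.Series, for which a plain list stands in here:

```python
# The engine calls a general UDF once per row, but hands a vectorized
# UDF a whole column (batch of rows) in a single call.

def plus_one_scalar(i):
    return i + 1                     # invoked once per row

def plus_one_vectorized(column):
    return [i + 1 for i in column]   # invoked once per batch

rows = [2, 3]
per_row = [plus_one_scalar(i) for i in rows]   # engine-side loop
per_batch = plus_one_vectorized(rows)          # one call, whole column
assert per_row == per_batch == [3, 4]
print(per_batch)  # -> [3, 4]
```

Both forms compute the same column; the vectorized form amortizes the Python-call overhead over the batch, which is why the docs offer it.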
word_count.py # You will see outputs as following: # Use --input to specify file input. # Printing result to stdout. Use --output to specify output path. # +I[To, 1] # +I[be,, 1] # +I[or, 1] # +I[not, 1] … used in Table API & SQL: from pyflink.table.udf import udf # create a general Python UDF @udf(result_type=DataTypes.BIGINT()) def plus_one(i): return i + 1 table.select(plus_one(col('id'))).to_pandas() … pyflink-docs, Release release-1.16 … output: _c0 = [2, 3] … # create a pandas Python UDF @udf(result_type=DataTypes.BIGINT(), func_type='pandas') def pandas_plus_one(series): return series + 1 table…
36 pages | 266.80 KB | 1 year ago

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… Under which conditions does the optimization preserve correctness? • maintain state semantics • maintain result and selectivity semantics • Dynamism: can the optimization be applied during runtime, or does it … the attributes A writes to • Commutativity: the results of applying A and then B must be the same as the result of applying B and then A • holds if both operators are stateless (Operator re-ordering) … (Re-ordering split and merge)
54 pages | 2.83 MB | 1 year ago

Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020
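The optimizations excerpt states the re-ordering condition: two operators A and B may be swapped only if applying A then B gives the same result as B then A, which holds when both are stateless and B does not write the attributes A reads. A sketch with a stateless filter and a stateless map that preserves the attribute the filter reads (the operators are illustrative):

```python
# Re-ordering check: A = keep_even (reads parity), B = plus_two
# (preserves parity, so it doesn't "write" the attribute A reads).
# Both orders must yield the same stream for the swap to be legal.

def apply_pipeline(stream, ops):
    for op in ops:
        stream = op(stream)
    return list(stream)

def keep_even(stream):              # A: stateless filter
    return (x for x in stream if x % 2 == 0)

def plus_two(stream):               # B: stateless map, parity-preserving
    return (x + 2 for x in stream)

data = [1, 2, 3, 4]
a_then_b = apply_pipeline(data, [keep_even, plus_two])
b_then_a = apply_pipeline(data, [plus_two, keep_even])
assert a_then_b == b_then_a == [4, 6]   # re-ordering is safe here
print(a_then_b)  # -> [4, 6]
```

Replacing plus_two with a map like x + 1 breaks the condition (it flips the parity the filter reads), and the two orders diverge, so the optimizer must not swap them.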
• They maintain a single value as window state and eventually emit the aggregated value as the result. • ReduceFunction and AggregateFunction • Full window functions collect all elements of a window … // compute the result from the accumulator and return it. OUT getResult(ACC accumulator); // merge two accumulators and return the result. ACC merge(ACC a, ACC b); } … [Figure labels: input type, accumulator type, output type; initialization, accumulate one element, compute the result, merge two partial accumulators.] … Use the ProcessWindowFunction…
35 pages | 444.84 KB | 1 year ago

Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020
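The windows excerpt shows Flink's AggregateFunction interface (Java): create an accumulator, fold elements in one at a time, merge partial accumulators from parallel instances, and compute the final result. A plain-Python sketch of that lifecycle for an average; the method names mirror the interface illustratively:

```python
# Incremental average in the AggregateFunction style: the window keeps
# only a (sum, count) accumulator, never the raw elements.

class AverageAggregate:
    def create_accumulator(self):
        return (0, 0)                          # (running sum, count)

    def add(self, value, acc):
        return (acc[0] + value, acc[1] + 1)    # accumulate one element

    def merge(self, a, b):
        return (a[0] + b[0], a[1] + b[1])      # merge two partials

    def get_result(self, acc):
        return acc[0] / acc[1]                 # final aggregated value

agg = AverageAggregate()
left = agg.create_accumulator()
for v in [1, 2]:
    left = agg.add(v, left)
right = agg.add(9, agg.create_accumulator())   # a second partial
print(agg.get_result(agg.merge(left, right)))  # -> 4.0
```

Because merge is associative, partial accumulators can be combined in any order, which is what lets pre-aggregation run in parallel before the final result is produced.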
… rate : throughput. Why is it necessary? … • Ensure result correctness • the reconfiguration mechanism often relies on the fault-tolerance mechanism • State re-partitioning … • Re-partition and migrate state in a consistent manner • Block and unblock computations to ensure result correctness … Control: when and how much to adapt?
41 pages | 4.09 MB | 1 year ago

Streaming in Apache Flink
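The reconfiguration excerpt stresses re-partitioning and migrating state consistently when an operator is rescaled: every key's state must move to the worker that owns that key under the new partitioning, or results become wrong. A toy sketch (integer keys are used so the hash assignment is deterministic; the scheme is illustrative, not Flink's key-group mechanism):

```python
# Rescaling a keyed operator from 2 workers to 3: every (key, state)
# entry is re-routed to the key's new owner, with no key lost or
# duplicated along the way.

def partition(key, parallelism):
    return hash(key) % parallelism      # toy key -> worker assignment

def rescale(state_per_worker, new_p):
    new_state = [{} for _ in range(new_p)]
    for worker in state_per_worker:     # migrate every key consistently
        for key, value in worker.items():
            new_state[partition(key, new_p)][key] = value
    return new_state

old = [{} for _ in range(2)]
for key, value in [(1, "a"), (2, "b"), (3, "c")]:
    old[partition(key, 2)][key] = value

new = rescale(old, new_p=3)
merged = {k: v for w in new for k, v in w.items()}
assert merged == {1: "a", 2: "b", 3: "c"}            # nothing lost
assert all(k in new[partition(k, 3)] for k in merged)  # correct owners
print(merged)
```

While this migration runs, affected channels are typically blocked and then unblocked, which is the "block and unblock computations" step the slides mention.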
… average.add(item.f1); averageState.update(average); // return the smoothed result return new Tuple2<>(item.f0, average.getAverage()); } } … Connected Streams … SensorReading>(key, context.window().getEnd(), max)); } } … Precombine / produce final result. • Lateness: by default, when using event-time windows, late events are dropped. … SingleOutputStreamOperator result = stream .keyBy(...) .window(...) .sideOutputLateData(late) .process(...); DataStream lateStream = result.getSideOutput(lateTag); … Lab 3
45 pages | 3.00 MB | 1 year ago
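The last excerpt notes that event-time windows drop late events by default, and that sideOutputLateData routes them to a side output instead. A plain-Python sketch of that behavior: a tumbling window whose watermark is the maximum timestamp seen, with late events collected in a side-output list rather than silently dropped (the function and window scheme are illustrative):

```python
# Tumbling event-time windows with a side output for late events.
# An event is late if its window already closed, i.e. the window's
# end is at or below the current watermark.

def windowed_sum(events, window_size):
    """events: (timestamp, value) pairs in arrival order."""
    windows, late = {}, []
    watermark = float("-inf")            # watermark = max timestamp seen
    for ts, value in events:
        window_end = (ts // window_size + 1) * window_size
        if window_end <= watermark:
            late.append((ts, value))     # window closed: side output
        else:
            windows[window_end] = windows.get(window_end, 0) + value
        watermark = max(watermark, ts)
    return windows, late

windows, late = windowed_sum([(1, 10), (2, 20), (12, 5), (3, 99)], 10)
print(windows, late)
# window [0,10) sums the on-time events; (3, 99) arrives after
# timestamp 12 advanced the watermark, so it lands in the side output
```

Exposing the late stream lets the application reconcile or log the dropped contributions instead of losing them, which is the motivation for getSideOutput in the excerpt.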
18 results in total