Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
h be a hash function that maps each stream element into M = log2N bits, where N is the domain of input elements: For each element x, let rank(x) be the number of 0s in the end of h(x): • e.g. elements in the input stream so far and let R be the maximum value of rank(.) seen so far. ??? Vasiliki Kalavri | Boston University 2020 5 Let n be the number of distinct elements in the input stream so far to use multiple hash functions and combine their estimates: • Using many hash functions for a high-rate stream is expensive • Finding many random and independent hash functions is difficult ??? Vasiliki0 码力 | 69 页 | 630.01 KB | 1 年前3Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020
the synopsis rather than the entire dataset. 2 Synopsis: a lossy, compact summary of the input stream input stream synopsis maintenance component user queries approximate results ??? Vasiliki Kalavri a set of data elements selected via some random process Samples: the most fundamental synopses input stream add to sample or discard ??? Vasiliki Kalavri | Boston University 2020 How can we select • We can use a random generator that produces an integer ri between 0 and 9. We then select an input element i if ri=0. 8 Q: How many queries did users repeat last month? ??? Vasiliki Kalavri | Boston0 码力 | 74 页 | 1.06 MB | 1 年前3Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Triggers Vasiliki Kalavri | Boston University 2020 • Practical way to perform operations on unbounded input • e.g. joins, holistic aggregates • Compute on most recent events only • when providing real-time need to specify two window components: • A window assigner determines how the elements of the input stream are grouped into windows. A window assigner produces a WindowedStream (or All WindowedStream Vasiliki Kalavri | Boston University 2020 Window functions define the computation that is performed on the elements of a window • Incremental aggregation functions are applied when an element is added to a0 码力 | 35 页 | 444.84 KB | 1 年前3Scalable Stream Processing - Spark Streaming and Flink
actions): 1. Input operations 2. Transformation 3. Output operations 11 / 79 Operations on DStreams ▶ Input operations ▶ Transformation ▶ Output operations 12 / 79 Input Operations ▶ Every input DStream Flume, Kinesis, Twitter. 3. Custom sources, e.g., user-provided sources. 13 / 79 Input Operations ▶ Every input DStream is associated with a Receiver object. • It receives the data from a source and e.g., Kafka, Flume, Kinesis, Twitter. 3. Custom sources, e.g., user-provided sources. 13 / 79 Input Operations - Basic Sources ▶ Socket connection • Creates a DStream from text data received over0 码力 | 113 页 | 1.22 MB | 1 年前3Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Kalavri | Boston University 2020 Notation (I) Input: a stream of items N: number of items in the stream fe: true frequency of the item e in the input stream f: estimated frequency of item δ: user-defined 5 1 1 2 3 3 3 3 1 2 0 1 1 3 5 input stream ε=0.2 w1 w4 w3 w2 ??? Vasiliki Kalavri | Boston University 2020 Example 8 1 2 2 3 5 5 1 1 2 3 3 3 3 1 2 0 1 1 3 5 input stream ε=0.2 w1 w4 w3 w2 1 2 Vasiliki Kalavri | Boston University 2020 Example 8 1 2 2 3 5 5 1 1 2 3 3 3 3 1 2 0 1 1 3 5 input stream ε=0.2 w1 w4 w3 w2 1 2 2 3 5 w1 1 2 3 5 1 0 2 0 1 0 1 0 f1 ε1 f2 ε2 f3 ε3 f5 ε5 ??? Vasiliki0 码力 | 31 页 | 1.47 MB | 1 年前3Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020
events/s time rate decrease events/s time throughput degradation events/s time rate increase : input rate : throughput Why is it necessary? ??? Vasiliki Kalavri | Boston University 2020 • Ensure result • computation: load in terms of computation • communication: load in terms of flow size in the input channel of each parallel task • Partitioning function performance • space required to implement nodes. n4 In practice, each node is mapped to multiple points on the ring using multiple hash functions. Consistent hashing ??? Vasiliki Kalavri | Boston University 2020 n1 n3 n2 0 2128 When a0 码力 | 41 页 | 4.09 MB | 1 年前3Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
of data: • The system buffers excess data for later processing, once input rates stabilize. • Requires a persistent input source. • Suitable for transient load increase. Scale resource allocation: resources are left idle when the input load decreases. ??? Vasiliki Kalavri | Boston University 2020 Load shedding • Load shedding is the process of discarding data when input rates increase beyond system | Boston University 2020 Load shedding as an optimization problem N: query network I: set of input streams with known arrival rates C: system processing capacity H: headroom factor, i.e. a conservative0 码力 | 43 页 | 2.42 MB | 1 年前3PyFlink 1.15 Documentation
py -o word_count.py python3 word_count.py # You will see outputs as following: # Use --input to specify file input. # Printing result to stdout. Use --output to specify output path. # +I[To, 1] # +I[be PyFlink supports various UDFs and APIs to allow users to execute Python native functions. See also the latest User- defined Functions and Row-based Operations. The first example is UDFs used in Table API & GenericJdbcSinkFunction. ˓→open(GenericJdbcSinkFunction.java:52) at org.apache.flink.api.common.functions.util.FunctionUtils. ˓→openFunction(FunctionUtils.java:34) at org.apache.flink.streaming.api.operators0 码力 | 36 页 | 266.77 KB | 1 年前3PyFlink 1.16 Documentation
py -o word_count.py python3 word_count.py # You will see outputs as following: # Use --input to specify file input. # Printing result to stdout. Use --output to specify output path. # +I[To, 1] # +I[be PyFlink supports various UDFs and APIs to allow users to execute Python native functions. See also the latest User- defined Functions and Row-based Operations. The first example is UDFs used in Table API & GenericJdbcSinkFunction. ˓→open(GenericJdbcSinkFunction.java:52) at org.apache.flink.api.common.functions.util.FunctionUtils. ˓→openFunction(FunctionUtils.java:34) at org.apache.flink.streaming.api.operators0 码力 | 36 页 | 266.80 KB | 1 年前3Streaming in Apache Flink
REST API Rich Functions • open(Configuration c) • close() • getRuntimeContext() DataStream> input = … DataStream > smoothed = input.keyBy(0).map(new seconds(10)) ◦EventTimeSessionWindows.withGap(Time.minutes(30)) DataStream input = ... input .keyBy(“key”) .window(TumblingEventTimeWindows.of(Time.minutes(1))) .process(new MyWastefulMax()); public static class MyWastefulMax extends ProcessWindowFunction< SensorReading, // input type Tuple3 , // output type String, // key 0 码力 | 45 页 | 3.00 MB | 1 年前3
共 21 条
- 1
- 2
- 3