Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020through querying tables Declarative language: CQL 4 Vasiliki Kalavri | Boston University 2020 Select IStream(*) From S1 [Rows 5], S2 [Rows 10] Where S1.A = S2.A Last 5 elements of stream S1 and last when I is not detected. 10 Vasiliki Kalavri | Boston University 2020 Logic Operators Example Select IStream(S1.A, S2.B) From S1 [Rows 50], S2 [Rows 50] (A & B) || (C & D) Explicit conjunction and Id • round-robin assignment 19 Vasiliki Kalavri | Boston University 2020 CQL GroupBy Example Select IStream(Count(*)) From S1 [Rows 1000] Group By S1.B Count the number or events in the last0 码力 | 53 页 | 532.37 KB | 1 年前3
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020input stream add to sample or discard ??? Vasiliki Kalavri | Boston University 2020 How can we select a representative sample of an unbounded stream? • we want to ask queries and get statistically proportion of the stream, e.g. 1/10th 7 ??? Vasiliki Kalavri | Boston University 2020 How can we select a representative sample of an unbounded stream? • we want to ask queries and get statistically store 1/10th of the stream, we select a stream element i with probability 10%. • We can use a random generator that produces an integer ri between 0 and 9. We then select an input element i if ri=0.0 码力 | 74 页 | 1.06 MB | 1 年前3
PyFlink 1.15 DocumentationExpressions can be used to select the columns from a Table. For example, Table.select() takes the column Expression instances that returns another Table. [16]: table.select(table.id).to_pandas() [16]: [16]: id 0 1 1 2 [17]: table.select(col('id')).to_pandas() [17]: id 0 1 1 2 Assign new Column Expression instance. [18]: table.add_columns(col('data').upper_case.alias('upper_data')).to_pandas() [18]: [18]: id data upper_data 0 1 Hi HI 1 2 Hello HELLO To select a subset of rows, use Table.filter(). [19]: table.filter(col('id') == 1).to_pandas() [19]: id data 0 1 Hi Applying a Function on Table PyFlink0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 DocumentationExpressions can be used to select the columns from a Table. For example, Table.select() takes the column Expression instances that returns another Table. [16]: table.select(table.id).to_pandas() [16]: [16]: id 0 1 1 2 [17]: table.select(col('id')).to_pandas() [17]: id 0 1 1 2 Assign new Column Expression instance. [18]: table.add_columns(col('data').upper_case.alias('upper_data')).to_pandas() [18]: [18]: id data upper_data 0 1 Hi HI 1 2 Hello HELLO To select a subset of rows, use Table.filter(). [19]: table.filter(col('id') == 1).to_pandas() [19]: id data 0 1 Hi Applying a Function on Table PyFlink0 码力 | 36 页 | 266.80 KB | 1 年前3
Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020Sm-1 For every element x, we compute h(x) and use the p first bits of the M-bit hash value to select a sub-stream and the next M-p bits to compute the rank(.): Stochastic averaging Use one hash function Sm-1 For every element x, we compute h(x) and use the p first bits of the M-bit hash value to select a sub-stream and the next M-p bits to compute the rank(.): Stochastic averaging Use one hash function hash value into two parts h(x) = (i0i1 . . . iM−1)2, ik ∈ {0,1} j = (i0i1 . . . ip−1)2 For we select one of m counters COUNT[j], where ??? Vasiliki Kalavri | Boston University 2020 11 Stochastic0 码力 | 69 页 | 630.01 KB | 1 年前3
Scalable Stream Processing - Spark Streaming and Flink60 / 79 Structured Streaming Example (2/3) ▶ We could express it as the following SQL query. SELECT action, WINDOW(time, "1 hour"), COUNT * FROM events GROUP BY action, WINDOW(time, "1 hour") 61 / readStream.json("s3://logs") val ds: Dataset[Call] = df.as[Call] // Selection and projection df.select("action").where("id > 10") // using untyped APIs ds.filter(_.id > 10).map(_.action) // using typed groupByKey(_.action) // using typed API // SQL commands df.createOrReplaceTempView("dfView") spark.sql("select count(*) from dfView") // returns another streaming DF 63 / 79 Window Operation ▶ Aggregations0 码力 | 113 页 | 1.22 MB | 1 年前3
Apache Flink的过去、现在和未来5 | 12:06 | | ------------------------- | ---------------------------- Stream Mode: 12:01> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name; Flink 在阿里的服务情况 集群规模 超万台 状态数据 PetaBytes0 码力 | 33 页 | 3.36 MB | 1 年前3
Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020counters ??? Vasiliki Kalavri | Boston University 2020 The power of two choices • Instead, we select d destination bins, each uniformly at random, and place the ball at the least full bin: • when0 码力 | 31 页 | 1.47 MB | 1 年前3
共 8 条
- 1













