Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020University 2020 9 Identify the most efficient way to execute a query • There may exist several ways to execute a computation • query plans, e.g. order of operators • scheduling and placement decisions • How can we estimate the cost of different strategies? • before execution or during runtime Query optimization (I) ??? Vasiliki Kalavri | Boston University 2020 10 Optimization strategies • enumerate (if running in the cloud) Query optimization (II) ??? Vasiliki Kalavri | Boston University 2020 Cost-based optimization 11 Parsed program representation Optimizer statistics input plan A plan0 码力 | 54 页 | 2.83 MB | 1 年前3
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020• we can store a fixed proportion of the stream, e.g. 1/10th 7 search enginequery, timestamp> query stream Example use-case: Web search user behavior study Q: How many queries did users issued n queries in the last month: • s of those are unique • d of those are duplicates • no query was issued more than twice 9 How many of Ted’s queries will be in the 1/10th sample, S? Each of a flag indicating whether they belong to the sample or not • When a query arrives: • if the user is sampled: add the query to S • if we haven’t seen the user before: generate a random integer ru 0 码力 | 74 页 | 1.06 MB | 1 年前3
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 20205 ??? Vasiliki Kalavri | Boston University 2020 Load shedding as an optimization problem N: query network I: set of input streams with known arrival rates C: system processing capacity H: headroom continuously monitors input rates or other system metrics and can access information about the running query plan • It detects overload and decides what actions to take in order to maintain acceptable latency Fast approximate answers … S1 S2 Sr Input Manager Scheduler QoS Monitor Load Shedder Query Execution Engine Qm Q2 Q1 Ad-hoc or continuous queries Input streams … ??? Vasiliki Kalavri0 码力 | 43 页 | 2.42 MB | 1 年前3
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020single-pass Updates arbitrary append-only Update rates relatively low high, bursty Processing Model query-driven / pull-based data-driven / push-based Queries ad-hoc continuous Latency relatively high low• Derived stream: produced by a continuous query and its operators, e.g. total traffic from a source every minute ins_r(P:i) = insert(i, {j | j ∈ ins_r(P) ^ j.A ≠ i.A}). 28 Vasiliki Kalavri | Boston University 2020 Query processing challenges • Memory requirements: we cannot store the whole stream history. • Data rate: 0 码力 | 45 页 | 1.22 MB | 1 年前3
Scalable Stream Processing - Spark Streaming and Flinkengine. ▶ Perform database-like query optimizations. 56 / 79 Programming Model (1/2) ▶ Two main steps to develop a Spark stuctured streaming: ▶ 1. Defines a query on the input table, as a static table table. • Spark automatically converts this batch-like query to a streaming execution plan. ▶ 2. Specify triggers to control when to update the results. • Each time a trigger fires, Spark checks for develop a Spark stuctured streaming: ▶ 1. Defines a query on the input table, as a static table. • Spark automatically converts this batch-like query to a streaming execution plan. ▶ 2. Specify triggers0 码力 | 113 页 | 1.22 MB | 1 年前3
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020• A Blocking query operator can only return answers when it detects the end of its input. • NOT IN, set difference and division, traditional SQL aggregates • A Non-blocking query operator can produce operator, iff F is monotonic with respect to the partial ordering ⊆. A query Q on a stream S can be implemented by a non-blocking query operator iff Q(S) is monotonic with respect to ⊆. The traditional introduction of a new empl can only expand the set of departments that satisfy this query However this sum query cannot be expressed without the use of aggregates! 31 Non-blocking SQL Vasiliki Kalavri0 码力 | 53 页 | 532.37 KB | 1 年前3
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020search while MBs only offer topic-based subscription. • DB query results depend on a snapshot and clients are not notified if their query result changes later. 13 Message delivery and ordering Acknowledgements0 码力 | 33 页 | 700.14 KB | 1 年前3
Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020Detect DNS DDoS attacks • Flooding the resources of the targeted system by sending a large number of query from a botnet • Group queries by their top-level domain and investigate most popular domains remove({y, fy}) return X* Computing top-k ??? Vasiliki Kalavri | Boston University 2020 26 • Query approximation error • Error probability Guarantee: The estimation error for frequencies will0 码力 | 69 页 | 630.01 KB | 1 年前3
PyFlink 1.15 Documentationspecific TableEnvironment. It is not possible to combine tables from different TableEnvironments in same query, e.g., to join or union them. Firstly, you can create a Table from a Python List Object [3]: table Python function in SQL API table_env.create_temporary_function("plus_one", plus_one) table_env.sql_query("SELECT plus_one(id) FROM {}".format(table)).to_pandas() Another example is UDFs used in Row-based0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 Documentationspecific TableEnvironment. It is not possible to combine tables from different TableEnvironments in same query, e.g., to join or union them. Firstly, you can create a Table from a Python List Object [3]: table Python function in SQL API table_env.create_temporary_function("plus_one", plus_one) table_env.sql_query("SELECT plus_one(id) FROM {}".format(table)).to_pandas() Another example is UDFs used in Row-based0 码力 | 36 页 | 266.80 KB | 1 年前3
共 13 条
- 1
- 2













