Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020processing optimizations ??? Vasiliki Kalavri | Boston University 2020 2 • Costs of streaming operator execution • state, parallelism, selectivity • Dataflow optimizations • plan translation alternatives Distributed execution in Flink ??? Vasiliki Kalavri | Boston University 2020 9 Identify the most efficient way to execute a query • There may exist several ways to execute a computation • query plans, strategies? • before execution or during runtime Query optimization (I) ??? Vasiliki Kalavri | Boston University 2020 10 Optimization strategies • enumerate equivalent execution plans • minimize intermediate0 码力 | 54 页 | 2.83 MB | 1 年前3
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020wikimedia.org/wiki/File:Adaptive_streaming_overview_daseddon_2011_07_28.png 5 ??? Vasiliki Kalavri | Boston University 2020 Load shedding as an optimization problem N: query network I: set of input continuously monitors input rates or other system metrics and can access information about the running query plan • It detects overload and decides what actions to take in order to maintain acceptable latency approximate answers … S1 S2 Sr Input Manager Scheduler QoS Monitor Load Shedder Query Execution Engine Qm Q2 Q1 Ad-hoc or continuous queries Input streams … ??? Vasiliki Kalavri |0 码力 | 43 页 | 2.42 MB | 1 年前3
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020JobManager is a single point of failure Flink applications • It keeps metadata about application execution, such as pointers to completed checkpoints. • A high-availability mode migrates the responsibility increased load • scale in to save resources • Fix bugs or change business logic • Optimize execution plan • Change operator placement • skew and straggler mitigation • Migrate to a different across existing and new nodes • Random I/O and high network communication • Not suitable for adaptive applications 26 Uniform hashing ??? Vasiliki Kalavri | Boston University 2020 27 ??? Vasiliki0 码力 | 41 页 | 4.09 MB | 1 年前3
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020single-pass Updates arbitrary append-only Update rates relatively low high, bursty Processing Model query-driven / pull-based data-driven / push-based Queries ad-hoc continuous Latency relatively high low• Derived stream: produced by a continuous query and its operators, e.g. total traffic from a source every minute ins_r(P:i) = insert(i, {j | j ∈ ins_r(P) ^ j.A ≠ i.A}). 28 Vasiliki Kalavri | Boston University 2020 Query processing challenges • Memory requirements: we cannot store the whole stream history. • Data rate: 0 码力 | 45 页 | 1.22 MB | 1 年前3
Scalable Stream Processing - Spark Streaming and Flinkengine. ▶ Perform database-like query optimizations. 56 / 79 Programming Model (1/2) ▶ Two main steps to develop a Spark stuctured streaming: ▶ 1. Defines a query on the input table, as a static table table. • Spark automatically converts this batch-like query to a streaming execution plan. ▶ 2. Specify triggers to control when to update the results. • Each time a trigger fires, Spark checks for Spark stuctured streaming: ▶ 1. Defines a query on the input table, as a static table. • Spark automatically converts this batch-like query to a streaming execution plan. ▶ 2. Specify triggers to control0 码力 | 113 页 | 1.22 MB | 1 年前3
PyFlink 1.15 Documentationto set up PyFlink development environment in your local machine. This is usually used for local execution or development in an IDE. Set up Python environment It requires Python 3.6 or above with PyFlink given Python virtual environment at client side (for job compiling) and server side (for Python UDF execution) separately. 1.1. Getting Started 7 pyflink-docs, Release release-1.15 • Specify the Python virtual cluster nodes during job execution. It should be noted that option -pyexec is also required to specify the Python virtual environment to use at server side (for Python UDF execution). For the Python virtual0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 Documentationto set up PyFlink development environment in your local machine. This is usually used for local execution or development in an IDE. Set up Python environment It requires Python 3.6 or above with PyFlink given Python virtual environment at client side (for job compiling) and server side (for Python UDF execution) separately. 1.1. Getting Started 7 pyflink-docs, Release release-1.16 • Specify the Python virtual cluster nodes during job execution. It should be noted that option -pyexec is also required to specify the Python virtual environment to use at server side (for Python UDF execution). For the Python virtual0 码力 | 36 页 | 266.80 KB | 1 年前3
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020streams. • Declarative languages specify the expected results of the computation rather than the execution flow. • Imperative languages are used to describe plans of operators the streams must flow through • A Blocking query operator can only return answers when it detects the end of its input. • NOT IN, set difference and division, traditional SQL aggregates • A Non-blocking query operator can produce operator, iff F is monotonic with respect to the partial ordering ⊆. A query Q on a stream S can be implemented by a non-blocking query operator iff Q(S) is monotonic with respect to ⊆. The traditional0 码力 | 53 页 | 532.37 KB | 1 年前3
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020• we can store a fixed proportion of the stream, e.g. 1/10th 7 search enginequery, timestamp> query stream Example use-case: Web search user behavior study Q: How many queries did users issued n queries in the last month: • s of those are unique • d of those are duplicates • no query was issued more than twice 9 How many of Ted’s queries will be in the 1/10th sample, S? Each of a flag indicating whether they belong to the sample or not • When a query arrives: • if the user is sampled: add the query to S • if we haven’t seen the user before: generate a random integer ru 0 码力 | 74 页 | 1.06 MB | 1 年前3
Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020HOk8+K8Ox+z1iWnmDmAP3A+fwCD9I4G We need to retrieve a distributed cut in a system execution that yields a system configuration Validity (safety): Termination (liveness): Obtain a valid CfjxXg3PqatJaOY2QV/yvj8AfLTl3A= We need to retrieve a distributed cut in a system execution that yields a system configuration Validity (safety): Termination (liveness): Obtain a valid m m’ System Possible Execution ??? Vasiliki Kalavri | Boston University 2020 Validity Explained p1 p2 p3 p1 p2 p3 m m’ C events in cut System Possible Execution ??? Vasiliki Kalavri | Boston0 码力 | 81 页 | 13.18 MB | 1 年前3
共 18 条
- 1
- 2













