PyFlink 1.15 Documentation
… to set up a PyFlink development environment on your local machine. This is usually used for local execution or development in an IDE. Set up Python environment: it requires Python 3.6 or above with PyFlink … a given Python virtual environment at the client side (for job compiling) and at the server side (for Python UDF execution) separately. … • Specify the Python virtual environment used on the cluster nodes during job execution. It should be noted that the option -pyexec is also required to specify the Python virtual environment to use at the server side (for Python UDF execution). For the Python virtual …
36 pages | 266.77 KB | 1 year ago

PyFlink 1.16 Documentation
… to set up a PyFlink development environment on your local machine. This is usually used for local execution or development in an IDE. Set up Python environment: it requires Python 3.6 or above with PyFlink … a given Python virtual environment at the client side (for job compiling) and at the server side (for Python UDF execution) separately. … • Specify the Python virtual environment used on the cluster nodes during job execution. It should be noted that the option -pyexec is also required to specify the Python virtual environment to use at the server side (for Python UDF execution). For the Python virtual …
36 pages | 266.80 KB | 1 year ago
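
Both PyFlink excerpts above describe the same setup: one interpreter for compiling the job on the client and another, shipped to the cluster, for executing Python UDFs. The sketch below is an illustration only (it assumes a zipped virtual environment named venv.zip) and shows a rough programmatic counterpart to the -pyarch/-pyexec command-line options; exact method availability may differ slightly between 1.15 and 1.16.

    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()

    # Ship a zipped virtual environment to every cluster node ...
    env.add_python_archive("/path/to/venv.zip")
    # ... and point the workers at the interpreter inside it for UDF execution
    # (the server-side setting that -pyexec controls on the command line).
    env.set_python_executable("venv.zip/venv/bin/python")

The client-side interpreter, used only for compiling the job, is whichever Python runs the script; when submitting with the flink CLI, the -pyclientexec option plays that client-side role.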

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… processing optimizations … • Costs of streaming operator execution: state, parallelism, selectivity • Dataflow optimizations: plan translation alternatives … [figure: a dataflow graph illustrating pipeline (A || B), task (B || C), and data (A || A) parallelism] … Distributed execution in Flink … strategies? • before execution or during runtime … Query optimization (I) … Optimization strategies: • enumerate equivalent execution plans • minimize intermediate …
54 pages | 2.83 MB | 1 year ago
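
The pipeline/task/data parallelism distinction in the excerpt maps onto how a Flink program is deployed. A small illustrative sketch (the collection, functions, and parallelism values are made up for the example):

    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(4)                                  # default data parallelism

    source = env.from_collection([("a", 1), ("b", 2), ("a", 3)])

    # filter and map can be chained and executed as a pipeline within one task ...
    cleaned = source.filter(lambda t: t[1] > 0).map(lambda t: (t[0], t[1] * 2))

    # ... while the keyed aggregation runs as a separate task, here with its
    # own (data-parallel) degree of parallelism.
    counts = cleaned.key_by(lambda t: t[0]) \
                    .reduce(lambda a, b: (a[0], a[1] + b[1])) \
                    .set_parallelism(2)

    counts.print()
    env.execute("parallelism-sketch")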

Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… approximate answers … [figure: input streams S1 … Sr feed an Input Manager, Scheduler, QoS Monitor, and Load Shedder in front of the Query Execution Engine, which serves ad-hoc or continuous queries Q1 … Qm] … unnecessary result degradation! • Load shedding components rely on statistics gathered during execution • A statistics manager module monitors processing and input rates and periodically estimates … continuously or by running the system for a designated period of time, prior to regular query execution. … Estimating cost and selectivity • Selectivity: …
43 pages | 2.42 MB | 1 year ago
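
The statistics-driven shedding the excerpt describes can be reduced to a small core: estimate how far the input rate exceeds what the operators can sustain, and drop the corresponding fraction of tuples. A hypothetical sketch (class and method names are invented for illustration; this is not Aurora's or any engine's actual API):

    import random

    class RandomLoadShedder:
        def __init__(self, sustainable_rate_per_s: float):
            self.sustainable_rate = sustainable_rate_per_s
            self.drop_probability = 0.0

        def update(self, observed_rate_per_s: float) -> None:
            """Called periodically by a statistics-manager-like component."""
            if observed_rate_per_s <= self.sustainable_rate:
                self.drop_probability = 0.0
            else:
                # shed just enough load to bring the rate back to sustainable
                self.drop_probability = 1.0 - self.sustainable_rate / observed_rate_per_s

        def admit(self, tuple_) -> bool:
            """Return True if the tuple should be processed, False if shed."""
            return random.random() >= self.drop_probability

    shedder = RandomLoadShedder(sustainable_rate_per_s=10_000)
    shedder.update(observed_rate_per_s=25_000)   # drop_probability becomes 0.6
    admitted = [t for t in range(100) if shedder.admit(t)]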

Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… input streams 2. Wait for all in-flight data to be completely processed 3. Copy the state of each task to a remote, persistent storage 4. Wait until all tasks have finished their copies 5. Resume processing … We need to retrieve a distributed cut in a system execution that yields a system configuration. Validity (safety): … Termination (liveness): … Obtain a valid …
81 pages | 13.18 MB | 1 year ago
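
The numbered steps are the naive stop-the-world protocol that Flink's asynchronous, barrier-based checkpointing improves on. For reference, a minimal sketch of enabling exactly-once checkpoints from PyFlink; the interval and tuning values are arbitrary assumptions:

    from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

    env = StreamExecutionEnvironment.get_execution_environment()

    # Take an asynchronous snapshot of all operator state every 10 s instead of
    # the "halt inputs, drain, copy, resume" procedure listed above.
    env.enable_checkpointing(10_000, CheckpointingMode.EXACTLY_ONCE)

    cfg = env.get_checkpoint_config()
    cfg.set_min_pause_between_checkpoints(5_000)   # breathing room between snapshots
    cfg.set_checkpoint_timeout(60_000)             # give up on a checkpoint that stalls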

Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… The JobManager is a single point of failure for Flink applications • It keeps metadata about application execution, such as pointers to completed checkpoints. • A high-availability mode migrates the responsibility … • scale out to handle increased load • scale in to save resources • fix bugs or change business logic • optimize the execution plan • change operator placement • skew and straggler mitigation • migrate to a different … subtask state from the checkpoint in all sub-tasks and filter out the matching keys for each sub-task • sequential read pattern • tasks read unnecessary data and the distributed file system receives …
41 pages | 4.09 MB | 1 year ago
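
Filtering out "the matching keys for each sub-task" works because Flink assigns keys to a fixed number of key groups and key groups to sub-tasks in contiguous ranges. A simplified sketch of that assignment (Flink actually applies a murmur hash to the key's hash code; plain hash() and the constants here are assumptions for illustration):

    def key_group_for(key, max_parallelism: int) -> int:
        return hash(key) % max_parallelism

    def subtask_for_key_group(key_group: int, max_parallelism: int, parallelism: int) -> int:
        # key groups are mapped to sub-tasks in contiguous ranges
        return key_group * parallelism // max_parallelism

    MAX_PARALLELISM = 128
    for parallelism in (2, 4):                    # rescale from 2 to 4 sub-tasks
        owner = subtask_for_key_group(key_group_for("user-42", MAX_PARALLELISM),
                                      MAX_PARALLELISM, parallelism)
        print(f"parallelism={parallelism}: key 'user-42' handled by sub-task {owner}")

After rescaling, each restored sub-task only has to load the key groups in its own range rather than scanning every key in the checkpoint.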

Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… Dataflow systems: distributed execution, partitioned state, exact results, out-of-order support (versus single-node execution, synopses and sketches, approximate results, in-order data processing). • No particular basic stream model (time-series, turnstile, …) is imposed by the dataflow execution engine. • The burden of representation and denotations is left to the application developer/user. …

                  Classic SPEs           Dataflow systems
    Results       approximate            exact
    Language      SQL extensions, CQL    Java, Scala, Python, SQL
    Execution     centralized            distributed
    Parallelism   pipeline               pipeline, task, data
    State         limited, in-memory     partitioned, virtually unlimited

45 pages | 1.22 MB | 1 year ago

Scalable Stream Processing - Spark Streaming and Flink
… table, as a static table. • Spark automatically converts this batch-like query to a streaming execution plan. ▶ 2. Specify triggers to control when to update the results. • Each time a trigger fires …
113 pages | 1.22 MB | 1 year ago
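
The two numbered steps in the excerpt (write a batch-style query over an unbounded input table, then choose a trigger) look roughly like the following in PySpark Structured Streaming; the rate source, output mode, and 10-second trigger are assumptions for the example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("trigger-sketch").getOrCreate()

    # Unbounded input "table": a built-in rate source that appends rows continuously.
    events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

    counts = events.groupBy().count()            # ordinary batch-style query

    query = (counts.writeStream
                   .outputMode("complete")
                   .format("console")
                   .trigger(processingTime="10 seconds")   # when to update the results
                   .start())
    query.awaitTermination(30)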

Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… Scaling approaches … Queuing theory models • Metrics: service time and waiting time per tuple and per task • total time spent processing a tuple and all its derived results • CPU utilization, congestion … • Policy: each operator as …
93 pages | 2.42 MB | 1 year ago
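
The metrics above (per-tuple service and waiting time, utilization) are exactly what simple queueing models turn into a scaling signal. A worked sketch using standard M/M/1 formulas (the choice of M/M/1 and the numbers are assumptions, not necessarily the model the slides use):

    def mm1_metrics(arrival_rate: float, service_rate: float):
        """Rates in tuples/second; the formulas require arrival_rate < service_rate."""
        rho = arrival_rate / service_rate              # operator utilization
        wait = rho / (service_rate - arrival_rate)     # mean waiting time in the queue (s)
        total = 1.0 / (service_rate - arrival_rate)    # waiting + service time (s)
        return rho, wait, total

    rho, wait, total = mm1_metrics(arrival_rate=800, service_rate=1000)
    # rho = 0.8, wait = 0.004 s, total = 0.005 s
    # If rho approaches 1 (or waiting time exceeds a target), scale the operator out.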

监控Apache Flink应用程序(入门) [Monitoring Apache Flink Applications (Getting Started)]
… 7/dev/stream/operators/#task-chaining-and-resource-groups
4 Progress and throughput monitoring: Knowing that your application is running and that checkpointing works is good, but it does not tell you whether the application is actually making progress and keeping up with its upstream systems.
4.1 Throughput: Flink provides several metrics to measure application throughput. For each operator or task (remember: a task can contain multiple chained tasks), Flink counts the records and bytes going into and out of the system. Of these metrics, the rate of outgoing records per operator is usually the most intuitive and easiest to understand.
4.2 Key metrics:

    Metric                    Scope   Description
    numRecordsOutPerSecond    task    The number of records this operator/task sends …

… When enabled, Flink will insert so-called latency markers periodically at all sources. For each sub-task, a latency distribution from each source to this operator will be reported. The granularity of these …
23 pages | 148.62 KB | 1 year ago
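
Turning on the latency markers mentioned at the end of the excerpt can be done per job; a small sketch from PyFlink (the 1000 ms interval is an arbitrary assumption, and the same effect should be achievable via the metrics.latency.interval option in the Flink configuration):

    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()

    # Emit a latency marker from every source once per second; each sub-task
    # then reports a latency distribution from each source to that operator.
    env.get_config().set_latency_tracking_interval(1000)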

18 results in total