Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics, Spring 2020. Vasiliki Kalavri, Boston University. (53 pages, 532.37 KB)
2/04: Streaming languages and operator semantics. Streaming Operators.
Operator types (I): Single-Item Operators process stream elements one by one, e.g. selection and filtering. Consider events from stream S1 and stream S2.
Operator types (II): Sequence Operators capture the arrival of an ordered set of events; common in ...
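The single-item operators named in the entry above (selection, filtering) can be sketched with plain Python generators. This is an illustrative sketch, not Flink's API; all names and sample data are made up:

```python
def select(stream, predicate):
    """Selection/filtering: emit only elements satisfying the predicate."""
    for event in stream:
        if predicate(event):
            yield event

def transform(stream, fn):
    """Single-item transformation: exactly one output per input."""
    for event in stream:
        yield fn(event)

# Events as (sensor_id, reading) pairs from a stream S1
s1 = [("a", 3), ("b", 7), ("a", 9)]
hot = list(select(s1, lambda e: e[1] > 5))   # [("b", 7), ("a", 9)]
ids = list(transform(s1, lambda e: e[0]))    # ["a", "b", "a"]
print(hot, ids)
```

Both operators inspect one element at a time and keep no state, which is what distinguishes them from the sequence operators mentioned next.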
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics, Spring 2020. (43 pages, 2.42 MB)
... stabilize. Requires a persistent input source; suitable for transient load increases.
Scale resource allocation: addresses the case of increased load and additionally ensures no resources are ...
... input rates and periodically estimates operator selectivities. The load shedder assigns a cost, ci, in cycles per tuple, and a selectivity, si, to each operator i. The statistics manager collects ...
Selectivity: how many records does the operator produce per record in its input?
• map: 1 in, 1 out
• filter: 1 in; 1 or 0 out
• flatMap, join: 1 in; 0, 1, or more out
Cost: how many records can an operator process in a ...
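The cost/selectivity model sketched above yields a simple load estimate for an operator chain: each operator sees the input rate scaled by the product of the selectivities of all upstream operators. A minimal sketch of that standard formulation (function name and numbers are illustrative):

```python
def chain_load(rate, costs, selectivities):
    """Estimated load (cycles/sec) of an operator chain at input rate `rate`.

    Operator i costs costs[i] cycles per tuple and has selectivity
    selectivities[i]; it receives the input rate scaled by the product of
    all upstream selectivities.
    """
    load, upstream_sel = 0.0, 1.0
    for c, s in zip(costs, selectivities):
        load += rate * upstream_sel * c
        upstream_sel *= s
    return load

# A filter that drops half the tuples, followed by an expensive map:
# 1000 * 10 + 1000 * 0.5 * 100 = 60_000 cycles/sec
print(chain_load(rate=1_000, costs=[10, 100], selectivities=[0.5, 1.0]))
```

A load shedder can compare this estimate against the available CPU budget to decide how many tuples to drop, and where: dropping before a low-selectivity, high-cost operator saves the most cycles.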
PyFlink 1.15 Documentation. (36 pages, 266.77 KB)
It is possible to set up the Python environments in advance on the cluster nodes, or to use a custom Python virtual environment when there are special requirements that the pre-installed Python environments cannot meet. ./bin/flink run \ --jobmanager ...
Submit PyFlink jobs to a standalone Flink cluster: you could submit PyFlink jobs to ... See Submitting PyFlink jobs for more details.
1.1.1.4 YARN: Apache Hadoop YARN is a cluster resource management framework for managing the resources and scheduling jobs in a Hadoop cluster. It's supported ...
PyFlink 1.16 Documentation. (36 pages, 266.80 KB)
It is possible to set up the Python environments in advance on the cluster nodes, or to use a custom Python virtual environment when there are special requirements that the pre-installed Python environments cannot meet. ./bin/flink run \ --jobmanager ...
Submit PyFlink jobs to a standalone Flink cluster: you could submit PyFlink jobs to ... See Submitting PyFlink jobs for more details.
1.1.1.4 YARN: Apache Hadoop YARN is a cluster resource management framework for managing the resources and scheduling jobs in a Hadoop cluster. It's supported ...
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics, Spring 2020. (54 pages, 2.83 MB)
Processing optimizations:
• Costs of streaming operator execution: state, parallelism, selectivity
• Dataflow optimizations: plan translation alternatives ...
Operator selectivity: the number of output elements produced per number of input elements.
• a map operator has a selectivity of 1, i.e. it produces one output element for each input element it processes
• an operator that tokenizes sentences into words has selectivity > 1
• a filter operator typically has selectivity < 1
Is selectivity always known ...
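Selectivity is not always known in advance, but it can be estimated empirically from a sample. A minimal sketch, modeling each operator as a function that maps one input to a list of outputs (all names are illustrative):

```python
def selectivity(op, sample):
    """Estimate an operator's selectivity as total outputs / total inputs
    over a sample, where `op` maps one input to a list of outputs."""
    outputs = sum(len(op(x)) for x in sample)
    return outputs / len(sample)

to_upper = lambda s: [s.upper()]                  # map-like: exactly one output per input
tokenize = lambda s: s.split()                    # tokenizer: zero or more outputs per input
keep_short = lambda s: [s] if len(s) < 6 else []  # filter-like: zero or one output per input

sample = ["hello world", "a b c", "stream"]
print(selectivity(to_upper, sample))    # 1.0
print(selectivity(tokenize, sample))    # (2 + 3 + 1) / 3 = 2.0
print(selectivity(keep_short, sample))  # 1 / 3
```

An optimizer can maintain such estimates online over recent windows of the stream, since selectivities may drift as the data distribution changes.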
Monitoring Apache Flink Applications (Getting Started) - caolei. (23 pages, 148.62 KB)
(link: 7/dev/stream/operators/#task-chaining-and-resource-groups)
4 Progress and throughput monitoring
Knowing that your application is running and that checkpoints are working is good, but it does not tell you whether the application is actually making progress and keeping up with upstream systems.
4.1 Throughput
Flink provides several metrics to measure application throughput. For each operator or task (remember: a task can contain multiple chained operators), the records and bytes flowing in and out are counted. Among these metrics, the rate of records emitted by each operator is usually the most intuitive and easiest to understand.
4.2 Key metrics
• numRecordsOutPerSecond (scope: task): the number of records this operator/task sends per second
• numRecordsOutPerSecond (scope: operator): the number of records this operator sends per second
4.3 Dashboard example
Figure 3: Mean Records Out per Second per Operator
4.4 Possible alert conditions ...
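Rate metrics such as numRecordsOutPerSecond are derived from monotonically increasing record counters. A sketch of that derivation (not Flink's actual meter implementation, which computes the rate over a sliding time window; function name and numbers are illustrative):

```python
def records_per_second(counter_then, counter_now, seconds_elapsed):
    """Derive a throughput rate from two readings of a monotonic record
    counter, which is the idea behind meters like numRecordsOutPerSecond."""
    if seconds_elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return (counter_now - counter_then) / seconds_elapsed

# 12_000 new records between two reads taken 60 s apart -> 200 records/s
print(records_per_second(100_000, 112_000, 60))  # 200.0
```

The same delta-over-time computation is what dashboards apply when they plot a counter with a rate function.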
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics, Spring 2020. (41 pages, 4.09 MB)
• Scale in to save resources
• Fix bugs or change business logic
• Optimize the execution plan
• Change operator placement: skew and straggler mitigation
• Migrate to a different cluster or software version
• minimize performance disruption, e.g. latency spikes
• avoid introducing load imbalance
• Resource management: utilization, isolation
• Automation: continuous monitoring, bottleneck detection
• Operators with keyed state are scaled by repartitioning keys.
• Operators with operator list state are scaled by redistributing the list entries.
• Operators with operator broadcast state are scaled up by copying the state to ...
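Repartitioning keyed state in Flink moves whole key groups between subtasks: a key's group never changes, only which subtask owns that group after rescaling. A sketch of the assignment scheme (Flink applies a murmur hash to the key's hashCode; plain CRC32 is used here to keep the sketch deterministic):

```python
import zlib

MAX_PARALLELISM = 128  # 128 is Flink's lower bound for the default max parallelism

def key_group(key: str, max_parallelism: int = MAX_PARALLELISM) -> int:
    # Flink uses a murmur hash of the key's hashCode; CRC32 keeps this deterministic.
    return zlib.crc32(key.encode()) % max_parallelism

def subtask_for_group(group: int, parallelism: int,
                      max_parallelism: int = MAX_PARALLELISM) -> int:
    # Contiguous ranges of key groups map to each subtask.
    return group * parallelism // max_parallelism

g = key_group("user-42")
# Rescaling from 2 to 4 subtasks never changes a key's group,
# only which subtask owns that group:
owner_before = subtask_for_group(g, parallelism=2)
owner_after = subtask_for_group(g, parallelism=4)
print(g, owner_before, owner_after)
```

Because state is checkpointed per key group, rescaling only needs to reassign group ranges rather than rehash every individual key.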
The Past, Present, and Future of Apache Flink. (33 pages, 3.36 MB)
Time Window. Alibaba started using Flink in 2015 and has contributed to the community ever since.
Refactored distributed architecture (diagram: Client, Dispatcher, Job Manager, Resource Manager, Task Managers; 1. Submit job, 2. Start job, 3. Request slots, 4. Allocate ...)
Stack: JVM; Cloud (GCE, EC2); Cluster (Standalone, YARN); DataStream; Physical.
Unified operator abstraction: pull-based operator, push-based operator; operators can customize the order in which they read their inputs.
Table API & SQL 1.9 new features: a brand-new SQL type system; initial DDL support. Table API ...
Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics, Spring 2020. (22 pages, 2.22 MB)
Watermark records and timestamped records. Watermarks (in Flink) flow along dataflow edges. They are special records generated by the sources or assigned by the application. A watermark for time T states ...
... are essential to both event-time windows and to operators handling out-of-order events: when an operator receives a watermark with time T, it can assume that no further events with timestamp less than T will arrive.
... timestamp. Punctuated: check for a watermark in each passing record, e.g. if the stream contains special records that encode watermark information.
val env = StreamExecutionEnvironment.getExecutionEnvironment ...
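A common way to generate watermarks periodically is to let the watermark trail the largest timestamp seen so far by a fixed out-of-orderness bound. A pure-Python sketch, not the Flink API (Flink's built-in bounded-out-of-orderness generator additionally subtracts 1 ms so the watermark is exclusive, which is mirrored here):

```python
class BoundedOutOfOrdernessWatermarks:
    """Sketch of a periodic watermark generator: the watermark trails the
    largest timestamp observed by a fixed bound."""

    def __init__(self, bound_ms: int):
        self.bound_ms = bound_ms
        self.max_ts = float("-inf")

    def on_event(self, timestamp_ms: int) -> None:
        self.max_ts = max(self.max_ts, timestamp_ms)

    def current_watermark(self) -> float:
        return self.max_ts - self.bound_ms - 1

wm = BoundedOutOfOrdernessWatermarks(bound_ms=5_000)
for ts in [1_000, 7_000, 4_000]:  # 4_000 arrives out of order, within the bound
    wm.on_event(ts)
print(wm.current_watermark())  # 7_000 - 5_000 - 1 = 1_999
```

Any event with a timestamp at or below the current watermark is late; the bound trades latency (how long windows wait) against completeness (how many late events are still admitted).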
Scalable Stream Processing - Spark Streaming and Flink. (113 pages, 1.22 MB)
... data to external systems, e.g., a database or a file system.
• foreachRDD: the most generic output operator
  • Applies a function to each RDD generated from the stream.
  • The function is executed in the ...
Output Operations (3/4): What's wrong with this code?
• Creating a connection object has time and resource overheads.
• Creating and destroying a connection object for each record can incur unnecessarily ...
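The connection-overhead problem called out above is commonly fixed by opening one connection per partition (e.g. via foreachPartition inside foreachRDD) instead of one per record. A pure-Python simulation that just counts connection creations under the two patterns; all class and function names are illustrative, and no Spark API is used:

```python
class Connection:
    """Stand-in for an expensive external connection (database, socket, ...)."""
    created = 0

    def __init__(self):
        Connection.created += 1  # count how many connections get opened

    def send(self, record):
        pass  # stand-in for the actual write

def write_per_record(partitions):
    """Anti-pattern: open (and discard) a connection for every record."""
    for part in partitions:
        for rec in part:
            Connection().send(rec)

def write_per_partition(partitions):
    """Recommended pattern: one connection per partition (or a shared pool)."""
    for part in partitions:
        conn = Connection()
        for rec in part:
            conn.send(rec)

data = [[1, 2, 3], [4, 5]]  # two partitions, five records in total

Connection.created = 0
write_per_record(data)
per_record = Connection.created       # 5 connections

Connection.created = 0
write_per_partition(data)
per_partition = Connection.created    # 2 connections

print(per_record, per_partition)
```

With real streams the gap grows with the record rate, which is why the per-partition (or pooled) pattern is the one recommended for output operations.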