Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020CS 591 K1: Data Stream Processing and Analytics Spring 2020 2/04: Streaming languages and operator semantics Vasiliki Kalavri | Boston University 2020 Vasiliki Kalavri | Boston University 2020 Kalavri | Boston University 2020 Streaming Operators 9 Vasiliki Kalavri | Boston University 2020 Operator types (I) • Single-Item Operators process stream elements one-by-one. • selection, filtering Consider events from stream S1 and stream S2 11 Vasiliki Kalavri | Boston University 2020 Operator types (II) • Sequence Operators capture the arrival of an ordered set of events. • common in0 码力 | 53 页 | 532.37 KB | 1 年前3
Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020Kalavri | Boston University 2020 Basic API Concept Source Data Stream Operator Data Stream Sink Source Data Set Operator Data Set Sink Writing a Flink Program 1.Bootstrap Sources 2.Apply Operators programs are defined in regular Scala/Java methods Set up the execution environment: local, cluster, I/O, time semantics, parallelism, … Example: Sensor Readings 9 Vasiliki Kalavri | Boston University env.setParallelism() in your application. taskmanager.numberOfTaskSlots: The number of parallel operator or user function instances that a single TaskManager can run. This value is typically proportional0 码力 | 26 页 | 3.33 MB | 1 年前3
PyFlink 1.15 Documentationjobs in a standalone Flink cluster. Set up Python environment It requires Python 3.6 or above with PyFlink pre-installed to be available on the nodes of the standalone cluster. It’s sug- gested to use is available, it needs to be deployed on the cluster. There are the following options to deploy it: • Install Python virtual environments on all the cluster nodes in advance You could install Python virtual virtual environments on all the cluster nodes with PyFlink pre-installed before submitting PyFlink jobs. Note that if you have a lot of jobs which use different Python versions and Flink versions, you0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 Documentationjobs in a standalone Flink cluster. Set up Python environment It requires Python 3.6 or above with PyFlink pre-installed to be available on the nodes of the standalone cluster. It’s sug- gested to use is available, it needs to be deployed on the cluster. There are the following options to deploy it: • Install Python virtual environments on all the cluster nodes in advance You could install Python virtual virtual environments on all the cluster nodes with PyFlink pre-installed before submitting PyFlink jobs. Note that if you have a lot of jobs which use different Python versions and Flink versions, you0 码力 | 36 页 | 266.80 KB | 1 年前3
Apache Flink的过去、现在和未来年阿里巴巴开始使用 Flink 并持续贡献社区 重构分布式架构 Client Dispatcher Job Manager Task Manager Resource Manager Cluster Manager Task Manager 1. Submit job 2. Start job 3. Request slots 4. Allocate Container 5. Start Distributed Streaming Dataflow Query Processor DAG & StreamOperator Local Single JVM Cloud GCE, EC2 Cluster Standalone, YARN Runtime Distributed Streaming Dataflow DataStream API Stream Processing DataSet Relational Local Single JVM Cloud GCE, EC2 Cluster Standalone, YARN DataStream Physical 统一 Operator 抽象 Pull-based operator Push-based operator 算子可自定义读取顺序 Table API & SQL 1.9 新特性 全新的 SQL 类型系统0 码力 | 33 页 | 3.36 MB | 1 年前3
监控Apache Flink应用程序(入门)展并与上游系统保持同步。 4.1 吞吐量 Flink提供了多个metrics来衡量应用程序的吞吐量。对于每个operator或task(请记住:一个task可以包含多个 chained-task3),Flink会对进出系统的记录和字节进行计数。在这些metrics中,每个operator输出记录的速率 通常是最直观和最容易理解的。 4.2 关键指标 Metric Scope Description numRecordsOutPerSecond task The number of records this operator/task sends per second. numRecordsOutPerSecond operator The number of records this operator sends per second. caolei – 监控Apache Flink应用程序(入门) 进度和吞吐量监控 – 11 4.3 仪表盘示例 Figure 3: Mean Records Out per Second per Operator 4.4 可能的报警条件 • recordsOutPerSecond = 0 (for a non-Sink operator) 请注意:目前由于metrics体系只考虑Flink的内部通信,所以source operators的输入记录数是0,而sink0 码力 | 23 页 | 148.62 KB | 1 年前3
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020change business logic • Optimize execution plan • Change operator placement • skew and straggler mitigation • Migrate to a different cluster or software version 9 Reconfiguration cases ??? Vasiliki are scaled by repartitioning keys • Operators with operator list state are scaled by redistributing the list entries. • Operators with operator broadcast state are scaled up by copying the state to range boundaries. • The maximum parallelism parameter of an operator defines the number of key groups into which the keyed state of the operator is split. • The number of key groups limits the maximum0 码力 | 41 页 | 4.09 MB | 1 年前3
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020load management Scalability and elasticity Fault-tolerance and guarantees State management Operator semantics Window optimizations Filtering, counting, sampling Graph streaming algorithms Vasiliki University 2020 Dataset A subset of traces from a large (12.5k machines) Google cluster • https://github.com/google/cluster-data/blob/master/ ClusterData2011_2.md Make sure to read and become familiar0 码力 | 34 页 | 2.53 MB | 1 年前3
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020processing optimizations ??? Vasiliki Kalavri | Boston University 2020 2 • Costs of streaming operator execution • state, parallelism, selectivity • Dataflow optimizations • plan translation alternatives Vasiliki Kalavri | Boston University 2020 Operator selectivity 6 • The number of output elements produced per number of input elements • a map operator has a selectivity of 1, i.e. it produces one output element for each input element it processes • an operator that tokenizes sentences into words has selectivity > 1 • a filter operator typically has selectivity < 1 Is selectivity always known0 码力 | 54 页 | 2.83 MB | 1 年前3
Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020dataflow with sources S1, S2, … Sn and rates λ1, λ2, … λn identify the minimum parallelism πi per operator i, such that the physical dataflow can sustain all source rates. S1 S2 λ1 λ2 S1 S2 π=2 => scale out • Analytical dataflow-based models Action • Speculative: small changes at one operator at a time • Predictive: at-once for all operators 8 ??? Vasiliki Kalavri | Boston University per task • total time spent processing a tuple and all its derived results • Policy • each operator as a single-server queuing system • generalized Jackson networks • Action • predictive, at-once0 码力 | 93 页 | 2.42 MB | 1 年前3
共 18 条
- 1
- 2













