State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 2/25: State Management Vasiliki Kalavri | Boston University 2020 Logic State <#Brexit, 520> <#WorldCup, 480> key of the current record so that all records with the same key access the same state State management in Apache Flink Operator state Keyed state State state is stored, accessed, and maintained. State backends are responsible for: • local state management • checkpointing state to remote and persistent storage, e.g. a distributed filesystem or a database
0 码力 | 24 pages | 914.13 KB | 1 year ago

PyFlink 1.15 Documentation
1.2.2 QuickStart: DataStream API 1.2 User Guide It’s supported to use Python virtual environment in your PyFlink jobs, see PyFlink Dependency Management for more details. Create a virtual environment using virtualenv To create a virtual environment Submitting PyFlink jobs for more details. 1.1.1.4 YARN Apache Hadoop YARN is a cluster resource management framework for managing the resources and scheduling jobs in a Hadoop cluster. It’s supported to
0 码力 | 36 pages | 266.77 KB | 1 year ago

PyFlink 1.16 Documentation
1.2.2 QuickStart: DataStream API 1.2 User Guide It’s supported to use Python virtual environment in your PyFlink jobs, see PyFlink Dependency Management for more details. Create a virtual environment using virtualenv To create a virtual environment Submitting PyFlink jobs for more details. 1.1.1.4 YARN Apache Hadoop YARN is a cluster resource management framework for managing the resources and scheduling jobs in a Hadoop cluster. It’s supported to
0 码力 | 36 pages | 266.80 KB | 1 year ago

Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Algorithms Architecture and design Scheduling and load management Scalability and elasticity Fault-tolerance and guarantees State management Operator semantics Window optimizations Filtering experts with decades of hands-on experience in building and using distributed systems and data management platforms • Have fun! Vasiliki Kalavri | Boston University 2020 Important dates Deliverable 2020 Software requirements • All assignments assume a UNIX-based setup. • If you are a Windows user, you are advised to use Windows Subsystem for Linux (WSL), Cygwin, or a Linux virtual machine to
0 码力 | 34 pages | 2.53 MB | 1 year ago

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
optimizations • plan translation alternatives • Runtime optimizations • load management, scheduling, state management • Optimization semantics, correctness, profitability Topics covered in this lecture uk Connection: keep-alive Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.22 (KHTML, like Gecko) Ubuntu Chromium/25
0 码力 | 54 pages | 2.83 MB | 1 year ago

Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020
bounds nor fixed size • events in a period during which a user was active Vasiliki Kalavri | Boston University 2020 Flow Management Operators (I) • Join operators merge two streams by matching blocking and must be defined over a window Flow Management Operators (II) • Duplicate/Copy Operator replicates a stream, commonly to be used as input to > 1000 Derived stream as an append-only table. User-Defined Aggregates (UDAs) Constructs that allow the definition of custom aggregations using three
0 码力 | 53 pages | 532.37 KB | 1 year ago

Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
DBMS SDW DSMS Database Management System • ad-hoc queries, data manipulation tasks • insertions, updates, deletions of single row or groups of rows Data Stream Management System • continuous materialized view updates • pre-aggregated, pre-processed streams and historical data Data Management Approaches storage analytics static data streaming data Vasiliki Kalavri | Boston University execution engine. • The burden of representation and denotations is left to the application developer/user. • The programmer needs to design and maintain appropriate state synopses. • In order to parallelize
0 码力 | 45 pages | 1.22 MB | 1 year ago

Apache Flink: Past, Present, and Future
Incremental snapshots Credit-based flow control mechanism Streaming SQL
USER_SCORES
| User | Score | Time |
| Julie | 7 | 12:01 |
| Frank -- |
Stream Mode: 12:01> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;
Flink at Alibaba: cluster size: more than 10,000 machines; state data: petabytes; event processing: 10 trillion events/day; peak throughput: 1.7 billion events/second
Flink's past P_2 S_0 S_1 Order Inventory Payment Shipping Flow-Control Async Call Auto Scale State Management Event Driven Flink's future offline Real-time Batch Processing Continuous Processing & Streaming
0 码力 | 33 pages | 3.36 MB | 1 year ago

Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020
[Diagram: publishers, subscribers (notify()), and an Event Service with subscription management (publish, subscribe(), unsubscribe(), notify())] instances. • Distributing event notifications • a service that accepts user signups can send notifications whenever a new user registers, and downstream services can subscribe to receive notifications
0 码力 | 33 pages | 700.14 KB | 1 year ago

Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020
minimize performance disruption, e.g. latency spikes • avoid introducing load imbalance • Resource management • utilization, isolation • Automation • continuous monitoring • bottleneck detection and removed by Flink. • Savepoints are never automatically removed. Savepoints: user-triggered checkpoints Vasiliki Kalavri | Boston University 2020 • To decrease or increase the
0 码力 | 41 pages | 4.09 MB | 1 year ago
22 results in total