storage management - IT文库_程序员IT互联网编程电子书和文档免费下载，助您码力十足！

首页文库资料文章资讯上传文档发布文章登录账户

State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Processing and Analytics Vasiliki (Vasia) Kalavri  vkalavri@bu.edu Spring 2020 2/25: State Management Vasiliki Kalavri | Boston University 2020 Logic State <#Brexit, 520> <#WorldCup, 480> key of the current record so that all records with the same key access the same state State management in Apache Flink 5 Vasiliki Kalavri | Boston University 2020 Operator state Keyed state State maintained. State backends are responsible for: • local state management • checkpointing state to remote and persistent storage, e.g. a distributed filesystem or a database system • Available

0 码力 | 24 页 | 914.13 KB | 1 年前
3
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Algorithms Architecture and design Scheduling and load management Scalability and elasticity Fault-tolerance and guarantees State management Operator semantics Window optimizations Filtering experts with decades of hands-on experience in building and using distributed systems and data management platforms • Have fun! 10 Vasiliki Kalavri | Boston University 2020 Important dates Deliverable Netbeans with appropriate plugins installed. • gsutil for accessing datasets in Google Cloud Storage. More details: vasia.github.io/dspa20/exercises.html 14 Vasiliki Kalavri | Boston University

0 码力 | 34 页 | 2.53 MB | 1 年前
3
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020

DBMS SDW DSMS Database Management System • ad-hoc queries, data manipulation tasks • insertions, updates, deletions of single row or groups of rows Data Stream Management System • continuous view updates • pre-aggregated, pre-processed streams and historical data Data Management Approaches 4 storage analytics static data streaming data Vasiliki Kalavri | Boston University 2020 DBMS data State limited, in-memory partitioned, virtually unlimited, persisted to backends Load management shedding backpressure, elasticity Fault tolerance limited support, high availability full support

0 码力 | 45 页 | 1.22 MB | 1 年前
3
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020

JobGraph and all required metadata, such as the application’s JAR file, into a remote persistent storage system • Zookeeper also holds state handles and checkpoint locations 5 JobManager failures following steps: 1. It requests the storage locations from ZooKeeper to fetch the JobGraph, the JAR file, and the state handles of the last checkpoint from remote storage. 2. It requests processing slots minimize performance disruption, e.g. latency spikes • avoid introducing load imbalance • Resource management • utilization, isolation • Automation • continuous monitoring • bottleneck detection

0 码力 | 41 页 | 4.09 MB | 1 年前
3
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Message queues • Asynchronous point-to-point communication • Lightweight buffer for temporary storage • Messages stored on the queue until they are processed and deleted • transactional, timing, and explicitly deleted while MBs delete messages once consumed. • Use a database for long-term data storage! • MBs assume a small working set. If consumers are slow, throughput might degrade. • DBs support publisher subscriber notify() subscriber notify() subscriber notify() Subscription management Event Service publish publish notify() subscribe() unsubscribe() subscribe notify unsubscribe

0 码力 | 33 页 | 700.14 KB | 1 年前
3
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020

producer (back-pressure, flow control) 2 ??? Vasiliki Kalavri | Boston University 2020 Load management approaches 3 ! Load shedder (a) Load shedding (b) Back-pressure (c) Elasticity Selectively sources. • To ensure no data loss, a persistent input message queue, such as Kafka, and enough storage is required. 21 o1 src o2 back-pressure target: 40 rec/s 10 rec/s 100 rec/s ??? Vasiliki Kalavri Credit-based flow control • This classic networking technique turns out to be very useful for load management in modern, highly-parallel stream processors and is implemented in Apache Flink. • Each task

0 码力 | 43 页 | 2.42 MB | 1 年前
3
Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020

in-flight data to be completely processed 3. Copy the state of each task to a remote, persistent storage 4. Wait until all tasks have finished their copies 5. Resume processing and stream ingestion University 2020 32 Epoch-Based Stream Execution Logged Input Streams Committed Output Streams Stable Storage ⇧epi AB8nicbVA9T8MwEL2Ur1K+CowsFi0SU5V0AbYKFsYiEajURJHjOq1 incremental checkpoints: • take a local snapshot and use a background thread to copy the state to remote storage • compute state deltas to reduce data transfer ??? Vasiliki Kalavri | Boston University 2020

0 码力 | 81 页 | 13.18 MB | 1 年前
3
Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020

State is scoped to a single task • Each stateful task is responsible for processing and state management 31 ??? Vasiliki Kalavri | Boston University 2020 Pause-and-restart state migration • State State is scoped to a single task • Each stateful task is responsible for processing and state management 31 block channels and upstream operators ??? Vasiliki Kalavri | Boston University 2020 Pause-and-restart State is scoped to a single task • Each stateful task is responsible for processing and state management 31 snapshot snapshot block channels and upstream operators buffer incoming records

0 码力 | 93 页 | 2.42 MB | 1 年前
3
PyFlink 1.15 Documentation

It’s supported to use Python virtual environment in your PyFlink jobs, see PyFlink Dependency Management for more details. Create a virtual environment using virtualenv To create a virtual environment Submitting PyFlink jobs for more details. 1.1.1.4 YARN Apache Hadoop YARN is a cluster resource management framework for managing the resources and scheduling jobs in a Hadoop cluster. It’s supported to popular container-orchestration system for automating computer application deployment, scaling, and management. This page shows you how to set up Python environment and exeucte PyFlink jobs in a Kubernetes

0 码力 | 36 页 | 266.77 KB | 1 年前
3
PyFlink 1.16 Documentation

It’s supported to use Python virtual environment in your PyFlink jobs, see PyFlink Dependency Management for more details. Create a virtual environment using virtualenv To create a virtual environment Submitting PyFlink jobs for more details. 1.1.1.4 YARN Apache Hadoop YARN is a cluster resource management framework for managing the resources and scheduling jobs in a Hadoop cluster. It’s supported to popular container-orchestration system for automating computer application deployment, scaling, and management. This page shows you how to set up Python environment and exeucte PyFlink jobs in a Kubernetes

0 码力 | 36 页 | 266.80 KB | 1 年前
3

共 16 条前往

页

分类

语言

格式

State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020

PyFlink 1.15 Documentation

PyFlink 1.16 Documentation