Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020DSMS Database Management System • ad-hoc queries, data manipulation tasks • insertions, updates, deletions of single row or groups of rows Data Stream Management System • continuous queries • Useful in theory for the development of streaming algorithms With limited practical value in distributed, real-world settings Vasiliki Kalavri | Boston University 2020 Cash-Register Model: In this University 2020 Dataflow Streaming Model Vasiliki Kalavri | Boston University 2020 Dataflow Systems Distributed execution Partitioned state Exact results Out-of-order support Single-node execution Synopses0 码力 | 45 页 | 1.22 MB | 1 年前3
PyFlink 1.15 Documentationenvironment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible the above example, the Python virtual environment is specified via option -pyarch. It will be distributed to the cluster nodes during job execution. It should be noted that option -pyexec is also required environment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 Documentationenvironment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible the above example, the Python virtual environment is specified via option -pyarch. It will be distributed to the cluster nodes during job execution. It should be noted that option -pyexec is also required environment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible0 码力 | 36 页 | 266.80 KB | 1 年前3
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020Message queues and brokers Where do stream processors read data from? 2 Challenges • can be distributed • out-of-sync sources may produce out-of-order streams • can be connected to the network broker: a system that connects event producers with event consumers. • It receives messages from the producers and pushes them to the consumers. • A TCP connection is a simple messaging system which which connects one sender with one recipient. • A general messaging system connects multiple producers to multiple consumers by organizing messages into topics. 7 Message Broker producer producer0 码力 | 33 页 | 700.14 KB | 1 年前3
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020Vasiliki Kalavri | Boston University 2020 Today’s topics • High-availability and fault-tolerance in distributed stream processing • Recovery semantics and guarantees • Exactly-once processing in Apache Beam learning models State in dataflow computations 3 Vasiliki Kalavri | Boston University 2020 4 Distributed streaming systems will fail • how can we guard state against failures and guarantee correct results fully processed? Was mo delivered downstream? Vasiliki Kalavri | Boston University 2020 A simple system model stream sources N1 NK N2 … input queue output queue primary nodes secondary nodes other0 码力 | 49 页 | 2.08 MB | 1 年前3
Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020Mechanism: How to apply the re-configuration? 3 • Detect environment changes: external workload and system performance • Identify bottleneck operators, straggler workers, skew • Enumerate scaling actions processing a tuple and all its derived results • Policy • each operator as a single-server queuing system • generalized Jackson networks • Action • predictive, at-once for all operators ??? Vasiliki processing a tuple and all its derived results • Policy • each operator as a single-server queuing system • generalized Jackson networks • Action • predictive, at-once for all operators Too fine-grained0 码力 | 93 页 | 2.42 MB | 1 年前3
Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020and stream ingestion 12 ??? Vasiliki Kalavri | Boston University 2020 –Leslie Lamport The distributed snapshot algorithm described here came about when I visited Chandy, who was then at the University to retrieve a distributed cut in a system execution that yields a system configuration Validity (safety): Termination (liveness): Obtain a valid system configuration A full system configuration is eventually captured A snapshot algorithm attempts to capture a coherent global state of a distributed system ??? Vasiliki Kalavri | Boston University 2020 Snapshotting Protocols p1 p2 p3 C m0 码力 | 81 页 | 13.18 MB | 1 年前3
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020& reconfiguration ??? Vasiliki Kalavri | Boston University 2020 • To recover from failures, the system needs to • restart failed processes • restart the application and recover its state 2 Checkpointing and all required metadata, such as the application’s JAR file, into a remote persistent storage system • Zookeeper also holds state handles and checkpoint locations 5 JobManager failures ??? Vasiliki Vasiliki Kalavri | Boston University 2020 12 • Detect environment changes: external workload and system performance • Identify bottleneck operators, straggler workers, skew • Enumerate scaling actions0 码力 | 41 页 | 4.09 MB | 1 年前3
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020types • The system is unaware of which parts of an operator constitute state Streaming state 3 • Explicit state primitives including state types and interfaces • The system is aware of state state to remote and persistent storage, e.g. a distributed filesystem or a database system • Available state backends in Flink: • In-memory • File system • RocksDB State backends 7 Vasiliki Kalavri purposes! FsStateBackend • Stores state on TaskManager’s heap but checkpoints it to a remote file system • In-memory speed for local accesses and fault tolerance • Limited to TaskManager’s memory and might0 码力 | 24 页 | 914.13 KB | 1 年前3
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020| Boston University 2020 What is this course about? The design and architecture of modern distributed streaming 4 Fundamental for representing, summarizing, and analyzing data streams Systems in industry • Learn from experts with decades of hands-on experience in building and using distributed systems and data management platforms • Have fun! 10 Vasiliki Kalavri | Boston University alerts for abnormal system metrics • Detect invariant violations • Identify outlier tasks Inspired by this paper : “SAQL: A Stream-based Query System for Real- Time Abnormal System Behavior Detection”0 码力 | 34 页 | 2.53 MB | 1 年前3
共 23 条
- 1
- 2
- 3













