Scalable Stream Processing - Spark Streaming and Flink
Stream Processing - Spark Streaming and Flink. Amir H. Payberah (payberah@kth.se), 05/10/2018. The course web page: https://id2221kth.github.io … Design stateful streams: val ssc = new StreamingContext(conf, Seconds(1)); ssc.checkpoint("path/to/persistent/storage") … Stateful stream operations: the Spark API provides two functions for stateful processing … Output modes: 1. Append: only the new rows appended to the result table since the last trigger will be written to the external storage. 2. Complete: the entire updated result table will be written to external storage. 3. Update: only the rows that were updated in the result table since the last trigger will be written to external storage.
113 pages | 1.22 MB | 1 year ago

Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
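The three Structured Streaming output modes listed in the Spark Streaming entry can be illustrated with a minimal pure-Python simulation (the slides use Scala/Spark; the `emit` helper below is hypothetical and only shows which rows each mode writes per trigger):

```python
from collections import Counter

def emit(result, prev, mode):
    """Rows written to external storage for one trigger, per output mode."""
    if mode == "complete":   # entire updated result table
        return dict(result)
    if mode == "update":     # only rows changed since the last trigger
        return {k: v for k, v in result.items() if prev.get(k) != v}
    if mode == "append":     # only rows never written before
        return {k: v for k, v in result.items() if k not in prev}
    raise ValueError(f"unknown mode: {mode}")

# Running word counts over two micro-batches, emitted in "update" mode.
prev, result = {}, Counter()
for batch in (["a", "b"], ["b", "c"]):
    result.update(batch)
    print(emit(result, prev, "update"))  # trigger fires
    prev = dict(result)
# prints {'a': 1, 'b': 1} then {'b': 2, 'c': 1}
```

Swapping `"update"` for `"complete"` or `"append"` in the loop shows the other two behaviors over the same batches.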
Netbeans with appropriate plugins installed. • gsutil for accessing datasets in Google Cloud Storage. More details: vasia.github.io/dspa20/exercises.html (Vasiliki Kalavri | Boston University 2020) … detection • call frequency • top-K cell towers used … Web activity analysis: visualization and aggregation • impressions, clicks, transactions, likes, comments … Continuously arriving, possibly unbounded data; complete data accessible in persistent storage … Consider a set of 1000 sensors deployed in different …
34 pages | 2.53 MB | 1 year ago

Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020
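The course-introduction entry mentions top-K queries over streams (e.g., the top-K cell towers used). One standard streaming technique for this — not taken from the slides, sketched here as an illustration — is the Space-Saving algorithm, which tracks at most k counters regardless of stream length:

```python
def space_saving(stream, k):
    """Approximate top-k heavy hitters using at most k counters."""
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1
        elif len(counts) < k:
            counts[x] = 1
        else:
            # Evict the minimum counter; the newcomer inherits its count,
            # so counts are over-estimates, never under-estimates.
            victim = min(counts, key=counts.get)
            counts[x] = counts.pop(victim) + 1
    return counts

towers = ["t1", "t1", "t2", "t1", "t3", "t1", "t2"]
print(space_saving(towers, k=2))  # prints {'t1': 4, 't2': 3}; t2's true count is 2
```

The over-estimation error is bounded by the count of the smallest tracked counter, which is why this sketch works well when a few items (towers) dominate the stream.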
Distributed architecture: client, Flink program, JobManager, web dashboard, TaskManagers … DataStream … [Kafka is a] distributed and fault-tolerant publish-subscribe messaging system and serves as the ingestion, storage, and messaging layer for large production streaming pipelines. Kafka is commonly deployed on a …
26 pages | 3.33 MB | 1 year ago

Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
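The Flink/Kafka entry describes Kafka as an ingestion and storage layer. A toy append-only log with per-consumer-group offsets (the `MiniLog` class is my own illustration, not the Kafka API) shows the key property: consuming a record does not delete it, so independent groups can each read the full stream:

```python
class MiniLog:
    """Toy append-only log in the spirit of a Kafka partition (illustrative)."""
    def __init__(self):
        self.records = []
        self.offsets = {}   # consumer group -> next offset to read

    def produce(self, value):
        self.records.append(value)
        return len(self.records) - 1   # offset of the appended record

    def poll(self, group, max_records=10):
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)   # commit after read
        return batch

log = MiniLog()
for v in ("a", "b", "c"):
    log.produce(v)
print(log.poll("analytics"))  # prints ['a', 'b', 'c']
print(log.poll("archiver"))   # prints ['a', 'b', 'c'] -- records were retained
```

Durability and retention (records survive consumption) are what let Kafka decouple producers from many independent downstream pipelines.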
… updates • pre-aggregated, pre-processed streams and historical data … Data management approaches: storage, analytics, static data, streaming data … DBMS vs. DSMS … the total packets exchanged between two IP addresses • the collection of IP addresses accessing a web server … With some practical value for use cases with append-only data: it preserves all history …
45 pages | 1.22 MB | 1 year ago

Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020
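The fundamentals entry gives "total packets exchanged between two IP addresses" as an example of a standing query over unbounded data. A hedged pure-Python sketch (the `packet_totals` generator is my own) of such a continuous aggregation, which emits a refreshed result on every arrival rather than a final answer:

```python
from collections import defaultdict

def packet_totals(stream):
    """Standing query: running total of packets per (unordered) IP pair."""
    totals = defaultdict(int)
    for src, dst, n_packets in stream:
        pair = frozenset((src, dst))   # direction-agnostic pair key
        totals[pair] += n_packets
        yield (src, dst), totals[pair]   # emit an updated result per arrival

# The query never "finishes": each event yields the latest total.
events = [("10.0.0.1", "10.0.0.2", 3), ("10.0.0.2", "10.0.0.1", 2)]
for endpoints, total in packet_totals(events):
    print(endpoints, total)
```

Using a generator mirrors the DSMS model from the entry: the input is consumed incrementally and state (the totals) persists across arrivals instead of the complete data sitting in persistent storage.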
… [writes the] JobGraph and all required metadata, such as the application's JAR file, into a remote persistent storage system • ZooKeeper also holds state handles and checkpoint locations … JobManager failures: [recovery takes the] following steps: 1. It requests the storage locations from ZooKeeper to fetch the JobGraph, the JAR file, and the state handles of the last checkpoint from remote storage. 2. It requests processing slots …
41 pages | 4.09 MB | 1 year ago

Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020
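The JobManager recovery steps in the fault-tolerance entry can be sketched in a few lines. This is an illustration under stated assumptions — `zookeeper` and `remote_store` are plain dicts standing in for the real services, and the job id and paths are hypothetical:

```python
def recover(zookeeper, remote_store, job_id):
    """Sketch of the recovery sequence: ZooKeeper stores only small handles
    (pointers); the large artifacts live in remote persistent storage."""
    handles = zookeeper[job_id]                      # step 1: locations from ZooKeeper
    job_graph = remote_store[handles["job_graph"]]   # step 2: fetch artifacts
    jar = remote_store[handles["jar"]]
    state = remote_store[handles["checkpoint"]]      # state handles of last checkpoint
    return job_graph, jar, state                     # step 3: ready to request slots

# Hypothetical stand-ins for the real services:
zk = {"job-42": {"job_graph": "/jobs/42/graph", "jar": "/jobs/42/app.jar",
                 "checkpoint": "/ckpts/42/7"}}
store = {"/jobs/42/graph": "<JobGraph>", "/jobs/42/app.jar": "<JAR bytes>",
         "/ckpts/42/7": "<state handles>"}
print(recover(zk, store, "job-42"))
```

The split matters: ZooKeeper stays small and fast because it only holds pointers, while the bulky JobGraph, JAR, and checkpoint data live in storage built for large objects.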
Message queues • asynchronous point-to-point communication • lightweight buffer for temporary storage • messages stored on the queue until they are processed and deleted • transactional, timing, and … explicitly deleted, while MBs (message brokers) delete messages once consumed • use a database for long-term data storage! • MBs assume a small working set; if consumers are slow, throughput might degrade • DBs support …
33 pages | 700.14 KB | 1 year ago

State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
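The delete-once-consumed semantics from the ingestion entry can be shown with a toy point-to-point queue (the class and method names are my own illustration, not any broker's API): a received message is held in flight until the consumer acknowledges it, and only then deleted — the opposite of the retained-log model:

```python
import collections

class PointToPointQueue:
    """Toy message-broker queue: a message is removed once a consumer acks it."""
    def __init__(self):
        self._pending = collections.deque()
        self._inflight = {}
        self._next_id = 0

    def send(self, msg):
        self._pending.append((self._next_id, msg))
        self._next_id += 1

    def receive(self):
        if not self._pending:
            return None
        mid, msg = self._pending.popleft()
        self._inflight[mid] = msg   # held only until acked (temporary storage)
        return mid, msg

    def ack(self, mid):
        del self._inflight[mid]     # processed: deleted for good

    def nack(self, mid):
        # Processing failed: redeliver by putting the message back in front.
        self._pending.appendleft((mid, self._inflight.pop(mid)))
```

The small in-flight map is why brokers assume a small working set: it is a buffer, not long-term storage — hence the entry's advice to use a database for durable data.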
… are responsible for: • local state management • checkpointing state to remote and persistent storage, e.g. a distributed filesystem or a database system • Available state backends in Flink: • in-memory … [Which] backend to choose? … RocksDB is an LSM-tree storage engine with a key/value interface, where keys and values are arbitrary byte streams. https://rocksdb…
24 pages | 914.13 KB | 1 year ago

High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020
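The two responsibilities named in the state-management entry — local state management and checkpointing to remote storage — can be sketched together. This is a minimal illustration (the class is hypothetical; a dict stands in for the remote filesystem or database):

```python
import copy

class InMemoryStateBackend:
    """Sketch: keyed local state plus full-snapshot checkpoints to 'remote' storage."""
    def __init__(self):
        self.state = {}

    def update(self, key, fn, default=0):
        """Local state management: apply fn to the current value of key."""
        self.state[key] = fn(self.state.get(key, default))
        return self.state[key]

    def checkpoint(self, remote, checkpoint_id):
        """Copy a consistent snapshot of local state to remote storage."""
        remote[checkpoint_id] = copy.deepcopy(self.state)

    def restore(self, remote, checkpoint_id):
        """On failure, reload local state from the last remote snapshot."""
        self.state = copy.deepcopy(remote[checkpoint_id])
```

A RocksDB-style backend differs mainly in where `self.state` lives (on disk, in an LSM tree of byte-stream keys/values), which trades access latency for state sizes far beyond memory.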
… to address non-determinism • each output is checkpointed together with its unique ID to stable storage before being delivered to the next stage • retries simply replay the output that has been checkpointed … [if the Bloom filter returns] false, the record is not a duplicate • if it returns true, the worker sends a lookup to stable storage … http://streamingbook.net/fig/5-5 … Bloom filter: …
49 pages | 2.08 MB | 1 year ago

Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
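The deduplication check described in the high-availability entry relies on the Bloom filter's one-sided error: "no" is definite, "maybe" must be confirmed with a lookup to stable storage. A minimal sketch (sizes and the class itself are illustrative choices, not from the slides):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: 'no' is definite; 'maybe' needs a storage lookup."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k          # m bits, k hash functions
        self.bits = bytearray(m)

    def _positions(self, item):
        # Derive k positions from salted SHA-256 digests (an illustrative choice).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # Any zero bit proves the item was never added (no false negatives).
        return all(self.bits[p] for p in self._positions(item))
```

In the retry protocol this is exactly the fast path: most retried records fail `might_contain` and are processed immediately; only the rare "maybe" pays for a lookup against the checkpointed IDs in stable storage.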
… sources • To ensure no data loss, a persistent input message queue, such as Kafka, and enough storage are required. [figure: back-pressure between operators src, o1, and o2, with target rates of 40, 10, and 100 rec/s] …
43 pages | 2.42 MB | 1 year ago

Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020
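The back-pressure mechanism behind the flow-control entry can be demonstrated with a bounded channel: a fast producer blocks on `put` until the slower consumer drains the buffer, so the rate mismatch propagates upstream instead of losing data. A small sketch using Python's standard library (the rates and buffer size are arbitrary illustrations):

```python
import queue
import threading

# A bounded channel between a fast producer and a slow consumer.
channel = queue.Queue(maxsize=4)
produced, consumed = [], []

def producer():
    for i in range(20):
        channel.put(i)        # blocks when the buffer is full -> back-pressure
        produced.append(i)
    channel.put(None)         # poison pill to stop the consumer

def consumer():
    while (item := channel.get()) is not None:
        consumed.append(item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(len(consumed))          # prints 20 -- nothing dropped, producer was slowed
```

Load shedding is the opposite trade: drop records at the bounded buffer instead of blocking, sacrificing completeness for sustained throughput.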
… in-flight data to be completely processed; 3. copy the state of each task to remote, persistent storage; 4. wait until all tasks have finished their copies; 5. resume processing and stream ingestion … Epoch-based stream execution: logged input streams, committed output streams, stable storage … Incremental checkpoints: • take a local snapshot and use a background thread to copy the state to remote storage • compute state deltas to reduce data transfer …
81 pages | 13.18 MB | 1 year ago
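The "compute state deltas to reduce data transfer" idea from the exactly-once entry can be sketched in a few lines. This is a simplified illustration (the helpers are my own, and for brevity the sketch ignores key deletions, which a real incremental checkpoint must also record):

```python
def delta(prev, curr):
    """Keys added or changed since the previous checkpoint (a state delta)."""
    return {k: v for k, v in curr.items() if prev.get(k) != v}

def rebuild(deltas):
    """Recover the full state by replaying deltas in checkpoint order."""
    state = {}
    for d in deltas:
        state.update(d)
    return state

s1 = {"a": 1}              # state at checkpoint 1
s2 = {"a": 1, "b": 1}      # state at checkpoint 2
d1, d2 = delta({}, s1), delta(s1, s2)
print(d2)                  # prints {'b': 1} -- only the change is shipped
print(rebuild([d1, d2]))   # prints {'a': 1, 'b': 1}
```

Shipping only deltas shrinks checkpoint traffic when state is large and slowly changing, at the cost of a longer recovery (replaying a chain of deltas instead of loading one snapshot).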
12 results in total