Scalable Stream Processing - Spark Streaming and Flink
Stream Processing - Spark Streaming and Flink. Amir H. Payberah (payberah@kth.se), 05/10/2018. The course web page: https://id2221kth.github.io … Design stateful streams: val ssc = new StreamingContext(conf, Seconds(1)); ssc.checkpoint("path/to/persistent/storage") … Stateful stream operations: the Spark API provides two functions for stateful processing … Output modes: 1. Append: only the new rows appended to the result table since the last trigger will be written to the external storage. 2. Complete: the entire updated result table will be written to external storage. 3. Update: only the rows that were updated in the result table since the last trigger will be written to external storage.
113 pages | 1.22 MB | 1 year ago

Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
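The three Structured Streaming output modes listed in the Spark Streaming entry can be illustrated with a minimal pure-Python simulation (the slides use Scala/Spark; the `emit` helper below is hypothetical and only shows which rows each mode writes per trigger):

```python
from collections import Counter

def emit(result, prev, mode):
    """Rows written to external storage for one trigger, per output mode."""
    if mode == "complete":   # entire updated result table
        return dict(result)
    if mode == "update":     # only rows changed since the last trigger
        return {k: v for k, v in result.items() if prev.get(k) != v}
    if mode == "append":     # only rows never written before
        return {k: v for k, v in result.items() if k not in prev}
    raise ValueError(f"unknown mode: {mode}")

# Running word counts over two micro-batches, emitted in "update" mode.
prev, result = {}, Counter()
for batch in (["a", "b"], ["b", "c"]):
    result.update(batch)
    print(emit(result, prev, "update"))  # trigger fires
    prev = dict(result)
# prints {'a': 1, 'b': 1} then {'b': 2, 'c': 1}
```

Swapping `"update"` for `"complete"` or `"append"` in the loop shows the other two behaviors over the same batches.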
Netbeans with appropriate plugins installed. • gsutil for accessing datasets in Google Cloud Storage. More details: vasia.github.io/dspa20/exercises.html (Vasiliki Kalavri | Boston University 2020) … detection • call frequency • top-K cell towers used … Web activity analysis: visualization and aggregation • impressions, clicks, transactions, likes, comments … Continuously arriving, possibly unbounded data; complete data accessible in persistent storage … Consider a set of 1000 sensors deployed in different …
34 pages | 2.53 MB | 1 year ago

Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020
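The course-introduction entry mentions top-K queries over streams (e.g., the top-K cell towers used). One standard streaming technique for this — not taken from the slides, sketched here as an illustration — is the Space-Saving algorithm, which tracks at most k counters regardless of stream length:

```python
def space_saving(stream, k):
    """Approximate top-k heavy hitters using at most k counters."""
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1
        elif len(counts) < k:
            counts[x] = 1
        else:
            # Evict the minimum counter; the newcomer inherits its count,
            # so counts are over-estimates, never under-estimates.
            victim = min(counts, key=counts.get)
            counts[x] = counts.pop(victim) + 1
    return counts

towers = ["t1", "t1", "t2", "t1", "t3", "t1", "t2"]
print(space_saving(towers, k=2))  # prints {'t1': 4, 't2': 3}; t2's true count is 2
```

The over-estimation error is bounded by the count of the smallest tracked counter, which is why this sketch works well when a few items (towers) dominate the stream.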
Distributed architecture: client, Flink program, JobManager, web dashboard, TaskManagers … DataStream … [Kafka is a] distributed and fault-tolerant publish-subscribe messaging system and serves as the ingestion, storage, and messaging layer for large production streaming pipelines. Kafka is commonly deployed on a …
26 pages | 3.33 MB | 1 year ago

Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
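The Flink/Kafka entry describes Kafka as an ingestion and storage layer. A toy append-only log with per-consumer-group offsets (the `MiniLog` class is my own illustration, not the Kafka API) shows the key property: consuming a record does not delete it, so independent groups can each read the full stream:

```python
class MiniLog:
    """Toy append-only log in the spirit of a Kafka partition (illustrative)."""
    def __init__(self):
        self.records = []
        self.offsets = {}   # consumer group -> next offset to read

    def produce(self, value):
        self.records.append(value)
        return len(self.records) - 1   # offset of the appended record

    def poll(self, group, max_records=10):
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)   # commit after read
        return batch

log = MiniLog()
for v in ("a", "b", "c"):
    log.produce(v)
print(log.poll("analytics"))  # prints ['a', 'b', 'c']
print(log.poll("archiver"))   # prints ['a', 'b', 'c'] -- records were retained
```

Durability and retention (records survive consumption) are what let Kafka decouple producers from many independent downstream pipelines.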
… updates • pre-aggregated, pre-processed streams and historical data … Data management approaches: storage, analytics, static data, streaming data … DBMS vs. DSMS … the total packets exchanged between two IP addresses • the collection of IP addresses accessing a web server … With some practical value for use cases with append-only data: it preserves all history …
45 pages | 1.22 MB | 1 year ago

Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020
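The fundamentals entry gives "total packets exchanged between two IP addresses" as an example of a standing query over unbounded data. A hedged pure-Python sketch (the `packet_totals` generator is my own) of such a continuous aggregation, which emits a refreshed result on every arrival rather than a final answer:

```python
from collections import defaultdict

def packet_totals(stream):
    """Standing query: running total of packets per (unordered) IP pair."""
    totals = defaultdict(int)
    for src, dst, n_packets in stream:
        pair = frozenset((src, dst))   # direction-agnostic pair key
        totals[pair] += n_packets
        yield (src, dst), totals[pair]   # emit an updated result per arrival

# The query never "finishes": each event yields the latest total.
events = [("10.0.0.1", "10.0.0.2", 3), ("10.0.0.2", "10.0.0.1", 2)]
for endpoints, total in packet_totals(events):
    print(endpoints, total)
```

Using a generator mirrors the DSMS model from the entry: the input is consumed incrementally and state (the totals) persists across arrivals instead of the complete data sitting in persistent storage.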
… [writes the] JobGraph and all required metadata, such as the application's JAR file, into a remote persistent storage system • ZooKeeper also holds state handles and checkpoint locations … JobManager failures: [recovery takes the] following steps: 1. It requests the storage locations from ZooKeeper to fetch the JobGraph, the JAR file, and the state handles of the last checkpoint from remote storage. 2. It requests processing slots …
41 pages | 4.09 MB | 1 year ago

Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020
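The JobManager recovery steps in the fault-tolerance entry can be sketched in a few lines. This is an illustration under stated assumptions — `zookeeper` and `remote_store` are plain dicts standing in for the real services, and the job id and paths are hypothetical:

```python
def recover(zookeeper, remote_store, job_id):
    """Sketch of the recovery sequence: ZooKeeper stores only small handles
    (pointers); the large artifacts live in remote persistent storage."""
    handles = zookeeper[job_id]                      # step 1: locations from ZooKeeper
    job_graph = remote_store[handles["job_graph"]]   # step 2: fetch artifacts
    jar = remote_store[handles["jar"]]
    state = remote_store[handles["checkpoint"]]      # state handles of last checkpoint
    return job_graph, jar, state                     # step 3: ready to request slots

# Hypothetical stand-ins for the real services:
zk = {"job-42": {"job_graph": "/jobs/42/graph", "jar": "/jobs/42/app.jar",
                 "checkpoint": "/ckpts/42/7"}}
store = {"/jobs/42/graph": "<JobGraph>", "/jobs/42/app.jar": "<JAR bytes>",
         "/ckpts/42/7": "<state handles>"}
print(recover(zk, store, "job-42"))
```

The split matters: ZooKeeper stays small and fast because it only holds pointers, while the bulky JobGraph, JAR, and checkpoint data live in storage built for large objects.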
Message queues • asynchronous point-to-point communication • lightweight buffer for temporary storage • messages stored on the queue until they are processed and deleted • transactional, timing, and … explicitly deleted, while MBs (message brokers) delete messages once consumed • use a database for long-term data storage! • MBs assume a small working set; if consumers are slow, throughput might degrade • DBs support …
33 pages | 700.14 KB | 1 year ago

State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
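The delete-once-consumed semantics from the ingestion entry can be shown with a toy point-to-point queue (the class and method names are my own illustration, not any broker's API): a received message is held in flight until the consumer acknowledges it, and only then deleted — the opposite of the retained-log model:

```python
import collections

class PointToPointQueue:
    """Toy message-broker queue: a message is removed once a consumer acks it."""
    def __init__(self):
        self._pending = collections.deque()
        self._inflight = {}
        self._next_id = 0

    def send(self, msg):
        self._pending.append((self._next_id, msg))
        self._next_id += 1

    def receive(self):
        if not self._pending:
            return None
        mid, msg = self._pending.popleft()
        self._inflight[mid] = msg   # held only until acked (temporary storage)
        return mid, msg

    def ack(self, mid):
        del self._inflight[mid]     # processed: deleted for good

    def nack(self, mid):
        # Processing failed: redeliver by putting the message back in front.
        self._pending.appendleft((mid, self._inflight.pop(mid)))
```

The small in-flight map is why brokers assume a small working set: it is a buffer, not long-term storage — hence the entry's advice to use a database for durable data.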
… are responsible for: • local state management • checkpointing state to remote and persistent storage, e.g. a distributed filesystem or a database system • Available state backends in Flink: • in-memory … [Which] backend to choose? … RocksDB is an LSM-tree storage engine with a key/value interface, where keys and values are arbitrary byte streams. https://rocksdb…
24 pages | 914.13 KB | 1 year ago

High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020
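The two responsibilities named in the state-management entry — local state management and checkpointing to remote storage — can be sketched together. This is a minimal illustration (the class is hypothetical; a dict stands in for the remote filesystem or database):

```python
import copy

class InMemoryStateBackend:
    """Sketch: keyed local state plus full-snapshot checkpoints to 'remote' storage."""
    def __init__(self):
        self.state = {}

    def update(self, key, fn, default=0):
        """Local state management: apply fn to the current value of key."""
        self.state[key] = fn(self.state.get(key, default))
        return self.state[key]

    def checkpoint(self, remote, checkpoint_id):
        """Copy a consistent snapshot of local state to remote storage."""
        remote[checkpoint_id] = copy.deepcopy(self.state)

    def restore(self, remote, checkpoint_id):
        """On failure, reload local state from the last remote snapshot."""
        self.state = copy.deepcopy(remote[checkpoint_id])
```

A RocksDB-style backend differs mainly in where `self.state` lives (on disk, in an LSM tree of byte-stream keys/values), which trades access latency for state sizes far beyond memory.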
… to address non-determinism • each output is checkpointed together with its unique ID to stable storage before being delivered to the next stage • retries simply replay the output that has been checkpointed … [if the Bloom filter returns] false, the record is not a duplicate • if it returns true, the worker sends a lookup to stable storage … http://streamingbook.net/fig/5-5 … Bloom filter: …
49 pages | 2.08 MB | 1 year ago

Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
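The deduplication check described in the high-availability entry relies on the Bloom filter's one-sided error: "no" is definite, "maybe" must be confirmed with a lookup to stable storage. A minimal sketch (sizes and the class itself are illustrative choices, not from the slides):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: 'no' is definite; 'maybe' needs a storage lookup."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k          # m bits, k hash functions
        self.bits = bytearray(m)

    def _positions(self, item):
        # Derive k positions from salted SHA-256 digests (an illustrative choice).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # Any zero bit proves the item was never added (no false negatives).
        return all(self.bits[p] for p in self._positions(item))
```

In the retry protocol this is exactly the fast path: most retried records fail `might_contain` and are processed immediately; only the rare "maybe" pays for a lookup against the checkpointed IDs in stable storage.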
… sources • To ensure no data loss, a persistent input message queue, such as Kafka, and enough storage are required. [figure: back-pressure between operators src, o1, and o2, with target rates of 40, 10, and 100 rec/s] …
43 pages | 2.42 MB | 1 year ago

Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020
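The back-pressure mechanism behind the flow-control entry can be demonstrated with a bounded channel: a fast producer blocks on `put` until the slower consumer drains the buffer, so the rate mismatch propagates upstream instead of losing data. A small sketch using Python's standard library (the rates and buffer size are arbitrary illustrations):

```python
import queue
import threading

# A bounded channel between a fast producer and a slow consumer.
channel = queue.Queue(maxsize=4)
produced, consumed = [], []

def producer():
    for i in range(20):
        channel.put(i)        # blocks when the buffer is full -> back-pressure
        produced.append(i)
    channel.put(None)         # poison pill to stop the consumer

def consumer():
    while (item := channel.get()) is not None:
        consumed.append(item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(len(consumed))          # prints 20 -- nothing dropped, producer was slowed
```

Load shedding is the opposite trade: drop records at the bounded buffer instead of blocking, sacrificing completeness for sustained throughput.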
… in-flight data to be completely processed; 3. copy the state of each task to remote, persistent storage; 4. wait until all tasks have finished their copies; 5. resume processing and stream ingestion … Epoch-based stream execution: logged input streams, committed output streams, stable storage … Incremental checkpoints: • take a local snapshot and use a background thread to copy the state to remote storage • compute state deltas to reduce data transfer …
81 pages | 13.18 MB | 1 year ago
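The "compute state deltas to reduce data transfer" idea from the exactly-once entry can be sketched in a few lines. This is a simplified illustration (the helpers are my own, and for brevity the sketch ignores key deletions, which a real incremental checkpoint must also record):

```python
def delta(prev, curr):
    """Keys added or changed since the previous checkpoint (a state delta)."""
    return {k: v for k, v in curr.items() if prev.get(k) != v}

def rebuild(deltas):
    """Recover the full state by replaying deltas in checkpoint order."""
    state = {}
    for d in deltas:
        state.update(d)
    return state

s1 = {"a": 1}              # state at checkpoint 1
s2 = {"a": 1, "b": 1}      # state at checkpoint 2
d1, d2 = delta({}, s1), delta(s1, s2)
print(d2)                  # prints {'b': 1} -- only the change is shipped
print(rebuild([d1, d2]))   # prints {'a': 1, 'b': 1}
```

Shipping only deltas shrinks checkpoint traffic when state is large and slowly changing, at the cost of a longer recovery (replaying a chain of deltas instead of loading one snapshot).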
12 results in total