Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020traditional data processing applications, we know the entire dataset in advance, e.g. tables stored in a database. A data stream is a data set that is produced incrementally over time, rather than being available DBMS SDW DSMS Database Management System • ad-hoc queries, data manipulation tasks • insertions, updates, deletions of single row or groups of rows Data Stream Management System • continuous Useful in theory for the development of streaming algorithms With limited practical value in distributed, real-world settings Vasiliki Kalavri | Boston University 2020 Cash-Register Model: In this0 码力 | 45 页 | 1.22 MB | 1 年前3
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020Message queues and brokers Where do stream processors read data from? 2 Challenges • can be distributed • out-of-sync sources may produce out-of-order streams • can be connected to the network userSentPayment 4 Connecting producers to consumers • Indirectly • Producer writes to a file or database • Consumer periodically polls and retrieves new data • polling overhead, latency? • Consumer broker: a system that connects event producers with event consumers. • It receives messages from the producers and pushes them to the consumers. • A TCP connection is a simple messaging system which0 码力 | 33 页 | 700.14 KB | 1 年前3
PyFlink 1.15 Documentationenvironment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible the above example, the Python virtual environment is specified via option -pyarch. It will be distributed to the cluster nodes during job execution. It should be noted that option -pyexec is also required environment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 Documentationenvironment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible the above example, the Python virtual environment is specified via option -pyarch. It will be distributed to the cluster nodes during job execution. It should be noted that option -pyexec is also required environment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible0 码力 | 36 页 | 266.80 KB | 1 年前3
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020types • The system is unaware of which parts of an operator constitute state Streaming state 3 • Explicit state primitives including state types and interfaces • The system is aware of state state to remote and persistent storage, e.g. a distributed filesystem or a database system • Available state backends in Flink: • In-memory • File system • RocksDB State backends 7 Vasiliki Kalavri purposes! FsStateBackend • Stores state on TaskManager’s heap but checkpoints it to a remote file system • In-memory speed for local accesses and fault tolerance • Limited to TaskManager’s memory and might0 码力 | 24 页 | 914.13 KB | 1 年前3
Scalable Stream Processing - Spark Streaming and Flink30 / 79 Output Operations (1/4) ▶ Push out DStream’s data to external systems, e.g., a database or a file system. ▶ foreachRDD: the most generic output operator • Applies a function to each RDD generated stream as a table that is being continuously appended. ▶ Built on the Spark SQL engine. ▶ Perform database-like query optimizations. 56 / 79 Programming Model (1/2) ▶ Two main steps to develop a Spark minutes", "5 minutes"), col("word")).count() 67 / 79 Flink 68 / 79 Flink ▶ Distributed data flow processing system ▶ Unified real-time stream and batch processing ▶ Process unbounded and bounded0 码力 | 113 页 | 1.22 MB | 1 年前3
监控Apache Flink应用程序(入门)....................................................................................... 22 4.14 System Resources....................................................................................... queue, before it is processed by Apache Flink, which then writes the results to a database or calls a downstream system. In such a pipeline, latency can be introduced at each stage and for various reasons TaskManager (in case of a containerized setup), or by providing more TaskManagers. In general, a system already running under very high load during normal operations, will need much more time to catch-up0 码力 | 23 页 | 148.62 KB | 1 年前3
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020Pipeline: A || B Task: B || C Data: A || A ??? Vasiliki Kalavri | Boston University 2020 8 Distributed execution in Flink ??? Vasiliki Kalavri | Boston University 2020 9 Identify the most efficient context of streaming? • queries run continuously • streams are unbounded • In traditional ad-hoc database queries, the query plan is generated on- the-fly. Different plans can be used for two consecutive • Muhammad Anis Uddin Nasir et. al. The power of both choices: Practical load balancing for distributed stream processing engines. ICDE 2015. • Nikos R. Katsipoulakis et. al. A holistic view of stream0 码力 | 54 页 | 2.83 MB | 1 年前3
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020Vasiliki Kalavri | Boston University 2020 Today’s topics • High-availability and fault-tolerance in distributed stream processing • Recovery semantics and guarantees • Exactly-once processing in Apache Beam learning models State in dataflow computations 3 Vasiliki Kalavri | Boston University 2020 4 Distributed streaming systems will fail • how can we guard state against failures and guarantee correct results fully processed? Was mo delivered downstream? Vasiliki Kalavri | Boston University 2020 A simple system model stream sources N1 NK N2 … input queue output queue primary nodes secondary nodes other0 码力 | 49 页 | 2.08 MB | 1 年前3
Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020Mechanism: How to apply the re-configuration? 3 • Detect environment changes: external workload and system performance • Identify bottleneck operators, straggler workers, skew • Enumerate scaling actions processing a tuple and all its derived results • Policy • each operator as a single-server queuing system • generalized Jackson networks • Action • predictive, at-once for all operators ??? Vasiliki processing a tuple and all its derived results • Policy • each operator as a single-server queuing system • generalized Jackson networks • Action • predictive, at-once for all operators Too fine-grained0 码力 | 93 页 | 2.42 MB | 1 年前3
共 24 条
- 1
- 2
- 3













