Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020

1/23: Stream Processing Fundamentals
Vasiliki (Vasia) Kalavri, vkalavri@bu.edu

## What is a stream?

- In traditional data processing applications … database.
- A data stream is a data set that is produced incrementally over time, rather than being available in full before its processing begins.
- Data streams are high-volume, real-time data that might be unbounded
- we cannot store the entire stream in an accessible way
- we have to process stream elements on-the-fly using limited memory

## Properties of data streams

- They arrive continuously instead …

0 credits | 45 pages | 1.22 MB | 2 years ago

Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020

1/28: Stream ingestion and pub/sub systems
Vasiliki (Vasia) Kalavri, vkalavri@bu.edu

## Streaming sources

Where do stream processors read data from?

- Files, e.g. transaction logs
- Sockets
- IoT devices and sensors
- Databases and KV stores
- Message queues

… unpredictable delays

- might be producing too fast
- stream processor needs to keep up and not shed load
- might be producing too slow or become idle
- stream processor should be able to make progress
- might …

0 credits | 33 pages | 700.14 KB | 2 years ago

Scalable Stream Processing - Spark Streaming and Flink

Amir H. Payberah, payberah@kth.se
05/10/2018
https://id2221kth.github.io

## Data Processing

- Graph Data: Pregel, GraphLab, PowerGraph, GraphX, X-Stream, Chaos
- Batch Data: MapReduce, Dryad, FlumeJava, Spark
- Structured Data: Spark SQL
- Machine Learning: MLlib, TensorFlow
- Streaming Data: Storm, SEEP, Naiad, Spark Streaming, Flink, MillWheel

## Data Storage

- File Systems: GFS, Flat FS
- NoSQL Databases: Dynamo, BigTable, Cassandra

## Distributed Messaging Systems

- Kafka

## Resource Management

- Mesos, YARN

## Stream Processing Systems Design Issues

0 credits | 113 pages | 1.22 MB | 2 years ago

Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020

4/16: Skew mitigation
Vasiliki (Vasia) Kalavri, vkalavri@bu.edu

## Key partitioning

- … > δ*N, where N is the number of stream elements
- The solution will not contain any item y with frequency: … δ*N

## Notation (I)

- Input: a stream of items
- N: number of items in the stream
- $f_e$: true frequency of the item e in the input stream
- f: estimated frequency of item
- δ: user-defined …

0 credits | 31 pages | 1.47 MB | 2 years ago

State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020

2/25: State Management
Vasiliki (Vasia) Kalavri, vkalavri@bu.edu

## State in dataflow computations

Any non-trivial streaming computation … state types can you think of?

- Count, sum, list, map, ...

## State management in Apache Flink

All data maintained by a task and used to compute results: a local or instance variable that is accessed by … com/blog/manage-rocksdb-memory-size-apache-flink

## RocksDB

- RocksDB is a persistent key-value store: data lives on disk, state can grow larger than available memory and will not be lost upon failure.
- Keys …

0 credits | 24 pages | 914.13 KB | 2 years ago

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020

4/14: Stream processing optimizations
Vasiliki (Vasia) Kalavri, vkalavri@bu.edu

## Topics covered in this lecture

- Costs of streaming … on streams in Stream SQL, Scala, Python, Rust, Java...

## Dataflow graph

- operators are nodes, data channels are edges
- channels have FIFO semantics
- streams of data elements flow continuously along edges

## Operators

- receive one or more input streams
- perform tuple-at-a-time, window, logic, pattern matching …

0 credits | 54 pages | 2.83 MB | 2 years ago

Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020

2/11: Windows and Triggers
Vasiliki (Vasia) Kalavri, vkalavri@bu.edu

## Window operators

- Practical way to perform operations on unbounded … ingest sensor stream `val sensorData: DataStream[SensorReading] = env.addSource(...) … } }`

### Keyed vs. non-keyed windows

Window operators can be applied on a keyed or a non-keyed stream:

- Window … need to specify two window components:
- A window assigner determines how the elements of the input stream are grouped into windows. A window assigner produces a WindowedStream (or AllWindowedStream if applied …

0 credits | 35 pages | 444.84 KB | 2 years ago

Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020

1/21: Introduction
Vasiliki (Vasia) Kalavri, vkalavri@bu.edu

## Course Information

- Instructor: Vasiliki Kalavri
- Office: MCS 206
- … Fundamental Algorithms for representing, summarizing, and analyzing data streams

## Tools

… the course, you will hopefully:

- know when to use stream processing vs other technology
- be able to comprehensively compare features and processing guarantees of streaming systems
- be proficient in …

0 credits | 34 pages | 2.53 MB | 2 years ago

Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics Spring 2020

2/06: Notions of time and progress
Vasiliki (Vasia) Kalavri, vkalavri@bu.edu

## Mobile game application

- input stream: user activity …

## Notions of time

- Processing time
  - the time of the local clock where an event is being processed
  - a processing-time window wouldn’t account for game activity while the train is in the tunnel
  - results depend on the processing speed and aren’t deterministic
- Event time
  - the time when an event actually happened
  - an event-time window would give you the extra …

0 credits | 22 pages | 2.22 MB | 2 years ago

Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020

4/23: Cardinality and frequency estimation
Vasiliki (Vasia) Kalavri, vkalavri@bu.edu

## Counting distinct elements

## How can we count the number of distinct elements seen so far in a stream?

Example use-case: Distinct users visiting one or multiple webpages

Naive solution: maintain a hash table …

0 credits | 69 pages | 630.01 KB | 2 years ago

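
The naive solution named in the last snippet above (maintain a hash table of the elements seen so far) can be sketched as follows. This is a minimal illustration rather than code from the lecture: the function name `running_distinct_count` and the sample `visits` stream are assumptions made up for the example.

```python
from typing import Hashable, Iterable, Iterator


def running_distinct_count(stream: Iterable[Hashable]) -> Iterator[int]:
    """Yield the number of distinct elements seen so far, after each element."""
    seen = set()  # hash table holding every distinct element observed so far
    for element in stream:
        seen.add(element)  # no-op if the element was already seen
        yield len(seen)


# Hypothetical stream of user ids visiting a webpage (illustrative data only).
visits = ["alice", "bob", "alice", "carol", "bob"]
counts = list(running_distinct_count(visits))
print(counts)  # [1, 2, 2, 3, 3]
```

The set grows with the number of distinct elements, so on an unbounded stream this approach needs memory proportional to the cardinality it is trying to measure.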