Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
## 1/23: Stream Processing Fundamentals Vasiliki (Vasia) Kalavri vkalavri@bu.edu ## What is a stream? - In traditional data processing applications, we know the entire dataset in advance, e.g. tables stored in a database. A data stream is a data set that is produced incrementally over time, rather than being available in full before ... high-volume, real-time data that might be unbounded • we cannot store the entire stream in an accessible way • we have to process stream elements on-the-fly using limited memory ## Properties of data streams ...
0 credits | 45 pages | 1.22 MB | 2 years ago
Scalable Stream Processing - Spark Streaming and Flink
Amir H. Payberah payberah@kth.se 05/10/2018 https://id2221kth.github.io ## Data Processing Graph Data: Pregel, GraphLab, PowerGraph, GraphX, X-Stream, Chaos. Batch Data: MapReduce, Dryad, FlumeJava, Spark. Structured Data: Spark SQL. Machine Learning: MLlib, TensorFlow. Streaming Data: Storm, SEEP, Naiad, Spark Streaming, Flink, MillWheel ... BigTable, Cassandra ## Distributed Messaging Systems Kafka. Resource Management: Mesos, YARN ## Stream Processing Systems Design Issues ▶ Continuous vs. micro-batch processing ▶ Record-at-a-Time vs. declarative ...
0 credits | 113 pages | 1.22 MB | 2 years ago
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020
## 1/28: Stream ingestion and pub/sub systems Vasiliki (Vasia) Kalavri vkalavri@bu.edu ## Streaming sources Where do stream processors read data from? Files, e.g. transaction logs • Sockets • IoT devices and sensors • Databases and KV stores • Message queues ... unpredictable delays • might be producing too fast • stream processor needs to keep up and not shed load • might be producing too slowly or become idle • stream processor should be able to make progress • might ...
0 credits | 33 pages | 700.14 KB | 2 years ago
【04 RocketMQ 王鑫】Stream Processing with Apache RocketMQ and Apache Flink
王鑫 · The Apache Software Foundation, Nov. 4, 2018, Shanghai, Apache Flink China Meetup ... Kalavri vkalavri@bu.edu ## Key partitioning ... > δ*N, where N is the number of stream elements • The solution will not contain any item y with frequency: ... δ*N ## Notation (I) Input: a stream of items • N: number of items in the stream • $f_{e}$: true frequency of the item e in the input stream • f: estimated frequency of item • δ: user-defined ...
0 credits | 31 pages | 1.47 MB | 2 years ago
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
## 2/25: State Management Vasiliki (Vasia) Kalavri vkalavri@bu.edu ## State in dataflow computations Any non-trivial streaming computation ... and key the stream on the sensor ID val keyedData: KeyedStream[Reading, String] = sensorData.keyBy(_.id) KeyedStream // apply a stateful FlatMapFunction on the keyed stream val alerts: ... resources • Working with State: https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/state/state.html • Managing State in Apache Flink - Tzu-Li (Gordon) Tai: https://www.youtube.com/watch
0 credits | 24 pages | 914.13 KB | 2 years ago
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
## 4/14: Stream processing optimizations Vasiliki (Vasia) Kalavri vkalavri@bu.edu ## Topics covered in this lecture • Costs of streaming ... ## Revisiting the basics A series of transformations on streams in Stream SQL, Scala, Python, Rust, Java... - Stateful operators maintain state that reflects part of the stream history they have seen • windows, continuous aggregations, distinct... - State is commonly partitioned ...
0 credits | 54 pages | 2.83 MB | 2 years ago
Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020
## 2/11: Windows and Triggers Vasiliki (Vasia) Kalavri vkalavri@bu.edu ## Window operators • Practical way to perform operations on unbounded ... ingest sensor stream val sensorData: DataStream[SensorReading] = env.addSource(...) } } ### Keyed vs. non-keyed windows Window operators can be applied on a keyed or a non-keyed stream: • Window ... need to specify two window components: • A window assigner determines how the elements of the input stream are grouped into windows. A window assigner produces a WindowedStream (or AllWindowedStream if applied ...
0 credits | 35 pages | 444.84 KB | 2 years ago
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
## 1/21: Introduction Vasiliki (Vasia) Kalavri vkalavri@bu.edu ## Course Information • Instructor: Vasiliki Kalavri • Office: MCS 206 • ... ## Outcomes At the end of the course, you will hopefully: • know when to use stream processing vs other technology • be able to comprehensively compare features and processing guarantees ... build end-to-end, scalable, and reliable streaming applications • have a solid understanding of how stream processing systems work and what factors affect their performance • be aware of the challenges and ...
0 credits | 34 pages | 2.53 MB | 2 years ago
Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics Spring 2020
## 2/06: Notions of time and progress Vasiliki (Vasia) Kalavri vkalavri@bu.edu ## Mobile game application • input stream: user activity • online? ... • How long do we have to wait before we decide that we have seen all events? ## Watermarks ## Stream progress http://streamingbook ... A watermark for time T states that event time has progressed to T in that particular stream (or partition). ## Watermark propagation ...












