Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020conditions are met. • Conditions are commonly described as patterns that can match input stream events on type, content, timing constraints. • Actions define how to produce results from the matches CQL Consider events from stream S1 and stream S2 11 Vasiliki Kalavri | Boston University 2020 Operator types (II) • Sequence Operators capture the arrival of an ordered set of events. • common in in pattern languages • events must have associated timestamps • Iteration Operators define sequences of events or processing that satisfies a loop condition. • not commonly supported • a termination0 码力 | 53 页 | 532.37 KB | 1 年前3
Streaming in Apache Flinkpipelines • Flink managed state • Event time Streaming in Apache Flink • Streams are natural • Events of any type like sensors, click streams, logs • Batch processing as a subset of stream processing processing Processing Data Dataflows Let's Talk About Time • Processing Time • Event Time • Events may arrive out of order! What Can Be Streamed? • Anything (if you write a serializer/deserializer for ververica.com/trainingData/nycTaxiFares.gz • Walkthrough an example Taxi Rides Dataset Taxi Ride Events rideId Long a unique id for each ride taxiId Long a unique id0 码力 | 45 页 | 3.00 MB | 1 年前3
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020generate events in a higher rate than the rate consumers can process events. 2 ??? Vasiliki Kalavri | Boston University 2020 Keeping up with the producers • Producers can generate events in a higher process events. • What happens if consumers cannot keep up with the event rate? 2 ??? Vasiliki Kalavri | Boston University 2020 Keeping up with the producers • Producers can generate events in a higher higher rate than the rate consumers can process events. • What happens if consumers cannot keep up with the event rate? • drop messages 2 ??? Vasiliki Kalavri | Boston University 2020 Keeping up0 码力 | 43 页 | 2.42 MB | 1 年前3
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020or recipients) • Events are commonly grouped into the same topic • in a similar way batch data belonging to the same file are grouped together • topics are commonly events of the same type: userCreated producer/consumer can be implemented by multiple physical tasks running in parallel • Ιf a producer generates events with high rate, we can balance the load by spawning several consumer processes • The broker can subscribe() unsubscribe() subscribe notify unsubscribe advertise(): information reg. future events Publish/Subscribe Systems 17 Pub/Sub levels of de-coupling • Space: interacting parties do not0 码力 | 33 页 | 700.14 KB | 1 年前3
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020files, data cleaning and standardization 6 Vasiliki Kalavri | Boston University 2020 1. Process events online without storing them 2. Support a high-level language (e.g. StreamSQL) 3. Handle missing observed in IP connections that are currently active 10 The vector is updated by a continuous stream events where the jth update has the general form (k, c[j]) and modifies the kth entry of A with the without the option to discard old events Vasiliki Kalavri | Boston University 2020 Turnstile Model: The jth update (k, c[j]), can be either positive or negative. Events can be continuously inserted and0 码力 | 45 页 | 1.22 MB | 1 年前3
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020University 2020 Graph streams Graph streams model interactions as events that update an underlying graph structure 5 Edge events: A purchase, a movie rating, a like on an online post, a bitcoin bitcoin transaction, a packet routed from a source to destination Vertex events: A new product, a new movie, a user ??? Vasiliki Kalavri | Boston University 2020 6 ??? Vasiliki Kalavri | Boston University For every t > 0, we receive one event: • Insert-only edge stream: events indicate edge additions • Fully-dynamic edge stream: events indicate edge additions or deletions A t+1, the graph is obtained0 码力 | 72 页 | 7.77 MB | 1 年前3
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020amortizing operator firing and communication costs over more data items • Batching hurts latency as events can only be processed once the entire batch is complete Batching Profitability A A’ Spark batch jobs • the maximum every 100 events? 48 t t+1 t+3 t+4 t+5 t+6 t+7 t+2 3 events 4 events 2 events? How would you compute… • the maximum every 100 events? • clicks per user session? 49 t+1 t+3 t+4 t+5 t+6 t+7 t+2 logged in logged out How would you compute… • the maximum every 100 events? • clicks per user session? • faster than the batch size? • alerts when patterns occur? 500 码力 | 54 页 | 2.83 MB | 1 年前3
Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics Spring 2020never came back online? • How long do we have to wait before we decide that we have seen all events? How do we know what event time it is? 7 Vasiliki Kalavri | Boston University 2020 Watermarks progress metric that indicates a certain point in time when we are confident that no more delayed events will arrive. • Watermarks provide a logical clock which informs the system about the current event-time windows and operators handling out-of-order events: • When an operator receives a watermark with time T, it can assume that no further events with timestamp less than T will be received. •0 码力 | 22 页 | 2.22 MB | 1 年前3
监控Apache Flink应用程序(入门)and for various reasons including the following: 1. It might take a varying amount of time until events are persisted in the message queue. caolei – 监控Apache Flink应用程序(入门) 进度和吞吐量监控 – 15 4 https://ci during recovery, events might spend some time in the message queue until they are processed by Flink (see previous section). 3. Some operators in a streaming topology need to buffer events for some time checkpointing interval for each record. In practice, it has proven invaluable to add timestamps to your events at multiple stages (at least at creation, persistence, ingestion by Flink, publication by Flink;0 码力 | 23 页 | 148.62 KB | 1 年前3
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020Conditions might change • State is accumulated over time 10 events/s time rate decrease events/s time throughput degradation events/s time rate increase : input rate : throughput Why is it operators requires preserving the key semantics: • Existing state for a particular key and all future events with this key must be routed to the same parallel instance • Some kind of hashing is typically0 码力 | 41 页 | 4.09 MB | 1 年前3
共 17 条
- 1
- 2













