Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020merge cannot receive data because another channel is empty Operator fission Data parallelism, replication A A A split merge ??? Vasiliki Kalavri | Boston University 2020 33 • if operator is costly computations on small time intervals • Keep intermediate state in memory • Use Spark's RDDs instead of replication • Parallel recovery mechanism in case of failures 44 input stream time-based micro-batches micro-batches D-Streams • During an interval, input data received is stored using RDDs • A D-Stream is a group of such RDDs which can be processed using common operators 45 Example • pageViews is a D-Stream grouped0 码力 | 54 页 | 2.83 MB | 1 年前3
Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate separate processes or on separate machines. If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances. If all the consumer A consumer instance sees records in the order they are stored in the log. • For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the0 码力 | 26 页 | 3.33 MB | 1 年前3
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020Operator replicates a stream, commonly to be used as input to multiple downstream operators. • Group by / Partition Operators split a stream into sub-streams according to a function or the event contents Kalavri | Boston University 2020 CQL GroupBy Example Select IStream(Count(*)) From S1 [Rows 1000] Group By S1.B Count the number or events in the last 1000 rows for each value of B 20 Vasiliki Kalavri University 2020 Some queries expressed using aggregates are monotonic: SELECT DeptNo FROM empl GROUP BY DeptNo HAVING SUM(empl.Sal) > 10000 The introduction of a new empl can only expand the set0 码力 | 53 页 | 532.37 KB | 1 年前3
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020topic T can be viewed as becoming a member of a group T. • Publishing an event on topic T can be viewed as broadcasting the event to all members of group T. • Topic hierarchies allow topic organization0 码力 | 33 页 | 700.14 KB | 1 年前3
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020as ranges • On restore, reads are sequential within each key-group, and often across multiple key-groups • The metadata of key-group-to-subtask assignments are small. No need to maintain explicit0 码力 | 41 页 | 4.09 MB | 1 年前3
Scalable Stream Processing - Spark Streaming and FlinkKinesis, ... TwitterUtils.createStream(ssc, None) KafkaUtils.createStream(ssc, [ZK quorum], [consumer group id], [number of partitions]) 15 / 79 Input Operations - Custom Sources (1/3) ▶ To create a custom express it as the following SQL query. SELECT action, WINDOW(time, "1 hour"), COUNT * FROM events GROUP BY action, WINDOW(time, "1 hour") 61 / 79 Structured Streaming Example (3/3) val inputDF = spark0 码力 | 113 页 | 1.22 MB | 1 年前3
Apache Flink的过去、现在和未来---------------------------- Stream Mode: 12:01> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name; Flink 在阿里的服务情况 集群规模 超万台 状态数据 PetaBytes 事件处理 十万亿/天 峰值能力 17亿/秒 Flink 的过去 offline Real-time0 码力 | 33 页 | 3.36 MB | 1 年前3
Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020val sensorData: DataStream[SensorReading] = ... val avgTemp = sensorData .keyBy(_.id) // group readings in 1s event-time windows .window(TumblingEventTimeWindows.of(Time.seconds(1))) .process(new0 码力 | 35 页 | 444.84 KB | 1 年前3
监控Apache Flink应用程序(入门)partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers. millisBehindLatest user applies to FlinkKinesisConsumer0 码力 | 23 页 | 148.62 KB | 1 年前3
Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020Flooding the resources of the targeted system by sending a large number of query from a botnet • Group queries by their top-level domain and investigate most popular domains • Alert if we detect many0 码力 | 69 页 | 630.01 KB | 1 年前3
共 10 条
- 1













