Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Spring 2020 1/28: Stream ingestion and pub/sub systems. Streaming sources: files (e.g. transaction logs), sockets, IoT devices and sensors, databases and KV stores, message queues and brokers. Where … that have changed. • Logging to multiple systems: a Google Compute Engine instance can write logs to the monitoring system, to a database for later querying, and so on. • Data streaming from various … subscriber, it is removed from the subscription's queue of messages. Log-structured brokers: logs as message brokers. • In typical message brokers, once a message is consumed it is deleted. • Log-based …
0 points | 33 pages | 700.14 KB | 1 year ago
Scalable Stream Processing - Spark Streaming and Flink
… "1 hour") … Structured Streaming Example (3/3)
    val inputDF = spark.readStream.json("s3://logs")
    inputDF.groupBy(col("action"), window(col("time"), "1 hour")).count()
      .writeStream.format("jdbc") …
    … class Call(action: String, time: Timestamp, id: Int)
    val df: DataFrame = spark.readStream.json("s3://logs")
    val ds: Dataset[Call] = df.as[Call]
    // Selection and projection
    df.select("action").where("id > …
0 points | 113 pages | 1.22 MB | 1 year ago
Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 … [Figure: The Epoch Commit Protocol. Labels: Snapshot Coordinator, Input Logs (already committed), Output Logs; states: Prepared/Aborted, pre-committed, Commit, Mark Committed; steps 1a, 1b, 1c, 2a, 2b] … Asynchronous checkpoints in Apache Flink … are reset to the position up to which they were consumed when the checkpoint was taken. • Event logs like Apache Kafka can provide records from a previous offset of the stream.
0 points | 81 pages | 13.18 MB | 1 year ago
Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… at its core • Streaming & Batch API. Historic data; Kafka, RabbitMQ, ...; HDFS, JDBC, ...; Event logs; ETL, Graphs, Machine Learning, Relational, …; Low latency, windowing, aggregations, ... Vasiliki …
0 points | 26 pages | 3.33 MB | 1 year ago
Streaming in Apache Flink
Streaming in Apache Flink • Streams are natural • Events of any type, such as sensors, click streams, logs • Batch processing as a subset of stream processing. Processing Data. Dataflows. Let's Talk About …
0 points | 45 pages | 3.00 MB | 1 year ago
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… analyze a snapshot of the real graph: • the Facebook social network on January 30 2016 • user web logs gathered between March 1st 12:00 and 16:00 • retweets and replies for 24h after the announcement
0 points | 72 pages | 7.77 MB | 1 year ago
6 results in total