Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... or high speed data streams. • Queries are executed against the synopsis rather than the entire dataset. Synopsis: a lossy, compact summary of the input stream. [Diagram labels: input stream, synopsis maintenance, estimations] • For many queries, an exact answer would require storing and analyzing the entire dataset. • Instead, we can relax this requirement and provide a good enough approximation. • A small ... few tuples from the dataset • Providing an estimate via a sample can be much more expensive than estimation via other methods: • Evaluating a query over a 5% sample of a dataset may take 5% of the ...
0 credits | 74 pages | 1.06 MB | 1 year ago
Scalable Stream Processing - Spark Streaming and Flink
... (3/3) ▶ Stream-dataset joins
val dataset: RDD[(String, String)] = ...
val windowedStream = stream.window(Seconds(20))...
val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }
... operations on DataFrame/Dataset are supported for streaming.
case class Call(action: String, time: Timestamp, id: Int)
val df: DataFrame = spark.readStream.json("s3://logs")
val ds: Dataset[Call] = df.as[Call]
0 credits | 113 pages | 1.22 MB | 1 year ago
Streaming in Apache Flink
... carbone@ri.se> Senior Researcher @ RISE, Committer @ Apache Flink, @SenorCarbone. Contents: • DataSet API • DataStream API • Concepts • Set up an environment to develop Flink programs • Implement ... http://training.ververica.com/trainingData/nycTaxiFares.gz • Walkthrough an example
Taxi Rides Dataset, Taxi Ride Events: rideId (Long): a unique id for each ride; taxiId (Long) ... the ride end location; passengerCnt (Short): number of passengers on the ride
Taxi Fare Dataset, Taxi Fare Events: rideId (Long): a unique id for each ride; taxiId (Long) ...
0 credits | 45 pages | 3.00 MB | 1 year ago
Apache Flink: Past, Present, and Future
[Slide labels] Flink 0.6.0; Flink 0.7; Runtime (Distributed Streaming Dataflow); DataStream API (Stream Processing); DataSet API (Batch Processing); released in December 2014, start of official DataStream support; Flink 0.9; Sink; Source; Offset; Computation; Cluster (Standalone, YARN); Runtime (Distributed Streaming Dataflow); DataStream API (Stream Processing); DataSet API (Batch Processing); Table API & SQL (Relational); Table API & SQL (Relational); Local (Single JVM)
0 credits | 33 pages | 3.36 MB | 1 year ago
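[05 Computing Platform, Rong Rong] Flink Batch Processing and Its Applications
[Comparison table excerpt, three unlabeled columns]
... TB-PB | TB-PB | not validated in large-scale production
Performance: average (minutes to hours) | fast (seconds) | excellent x2
Stability: good | average | validated internally at Alibaba
API: poor (MR) | richest (RDD/DataSet/DataFrame), Python/Scala/R/Java | rich (TableAPI), Scala/Java
SQL: HiveSQL | SparkSQL | ANSI SQL
Usability: ...
0 credits | 12 pages | 1.44 MB | 1 year ago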
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... research-oriented project? Let’s discuss it during office hours. Vasiliki Kalavri | Boston University 2020. Dataset: A subset of traces from a large (12.5k machines) Google cluster • https://github.com/google/cl
0 credits | 34 pages | 2.53 MB | 1 year ago
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... University 2020. What is a stream? • In traditional data processing applications, we know the entire dataset in advance, e.g. tables stored in a database. A data stream is a data set that is produced incrementally ...
0 credits | 45 pages | 1.22 MB | 1 year ago
7 results in total