Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Operator selectivity: the number of output elements produced per number of input elements. A map operator has a selectivity of 1, i.e. it produces one output element for each input element it processes. [figure: selectivity of merge operator variants] … map(String key, String value): // key: document name // value: document contents … map: (k1, v1) → list(k2, v2); reduce: (k2, list(v2)) → list(v2) … MapReduce combiners example: URL access frequency … GET /dumprequest HTTP/1…
0 码力 | 54 pages | 2.83 MB | 1 year ago
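The combiner idea behind this deck's URL-access-frequency example can be sketched in plain Python (illustrative only; the log lines and names below are assumptions, not the course's code):

    from collections import Counter

    logs = [
        "GET /index.html HTTP/1.0",
        "GET /index.html HTTP/1.0",
        "GET /about.html HTTP/1.0",
    ]

    # map: emit one (url, 1) pair per log line, i.e. selectivity of 1
    pairs = [(line.split()[1], 1) for line in logs]

    # combine: pre-aggregate on the mapper before the shuffle, so far
    # fewer pairs cross the network to the reducers
    combined = Counter()
    for url, one in pairs:
        combined[url] += one

    # reduce: merging partial counts from all mappers gives the same totals
    print(combined)  # Counter({'/index.html': 2, '/about.html': 1})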
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
NiagaraCQ, Aurora, TelegraphCQ, STREAM, Naiad, Spark Streaming, Samza, Flink, MillWheel, Storm, S4, Google Dataflow: the evolution of stream processing … associated keys. Distributed dataflow model … [figure: dataflow graphs: source → map → topK → print; Twitter source → Extract hashtags → Count topics → Trends]

val env = StreamExecutionEnvironment.getExecutionEnvironment
val sensorData = env.addSource(new SensorSource)
val maxTemp = sensorData
  .map(r => Reading(r.id, r.time, (r.temp - 32) * (5.0 / 9.0)))
  .keyBy(_.id)
  .max("temp")

0 码力 | 45 pages | 1.22 MB | 1 year ago
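What the keyed-max pipeline above computes can be mimicked in plain Python (a sketch only; Reading and the input values are invented for illustration):

    from collections import namedtuple

    # Reading and the sensor stream are re-created here as assumptions
    Reading = namedtuple("Reading", ["id", "time", "temp"])

    stream = [Reading("s1", 1, 98.6), Reading("s2", 1, 77.0), Reading("s1", 2, 104.0)]

    # map (Fahrenheit -> Celsius) followed by a running per-key maximum,
    # mirroring .keyBy(_.id).max("temp") in the Flink snippet above
    max_temp = {}
    for r in stream:
        celsius = (r.temp - 32) * (5.0 / 9.0)
        max_temp[r.id] = max(max_temp.get(r.id, float("-inf")), celsius)

    print(max_temp)  # highest temperature seen so far per sensor id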
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… flink.apache.org, Apache Kafka: kafka.apache.org, Apache Beam: beam.apache.org, Google Cloud Platform: cloud.google.com … Outcomes: at the end of the course … Google cluster • https://github.com/google/cluster-data/blob/master/ClusterData2011_2.md Make sure to read and become familiar with the format and schema document: • https://drive.google.com… … Eclipse, or NetBeans with appropriate plugins installed. • gsutil for accessing datasets in Google Cloud Storage. More details: vasia.github.io/dspa20/exercises.html
0 码力 | 34 pages | 2.53 MB | 1 year ago
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… stream processing • Recovery semantics and guarantees • Exactly-once processing in Apache Beam / Google Cloud Dataflow … Logic, State <#Brexit, 520> … fail at the same time? … Exactly-once in Google Cloud Dataflow: Google Dataflow uses RPC for data communication • the sender will retry RPCs until it receives … stored in a scalable key/value store. Exactly-once in Google Cloud Dataflow: checkpointing to address non-determinism • each output is checkpointed together …
0 码力 | 49 pages | 2.08 MB | 1 year ago
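The retry-plus-deduplication scheme this entry describes can be modeled in a few lines of plain Python (a toy model, not Dataflow's implementation; record ids and names are assumptions):

    # Stand-ins for Dataflow's machinery: at-least-once RPC delivery plus
    # receiver-side deduplication keyed by a unique record id.
    seen_ids = set()   # in Dataflow this bookkeeping lives in a scalable key/value store
    output = []

    def receive(record_id, payload):
        if record_id in seen_ids:
            return "ACK"            # retried duplicate: acknowledge, apply nothing
        seen_ids.add(record_id)
        output.append(payload)      # effect and dedup record are committed together
        return "ACK"

    # the sender retries until it gets an ACK; a retry after a lost ACK
    # re-delivers the same record id
    receive(42, "#Brexit")
    receive(42, "#Brexit")
    print(output)  # ['#Brexit'], applied exactly once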
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… company == ‘Uber’ and price < 100 • Predecessors of Complex Event Processing (CEP) systems … Google Cloud Pub/Sub: publishers and subscribers are applications. Use-cases: • Balancing workloads in network clusters: tasks can be efficiently distributed among multiple workers, such as Google Compute Engine instances. • Distributing event notifications: a service that accepts user signups … invalidation events to update the IDs of objects that have changed. • Logging to multiple systems: a Google Compute Engine instance can write logs to the monitoring system, to a database for later querying …
0 码力 | 33 pages | 700.14 KB | 1 year ago
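A toy in-process version of the publish/subscribe decoupling these use-cases rely on (a sketch, not the Cloud Pub/Sub API; topic names and message contents are made up):

    from collections import defaultdict

    class Broker:
        """Toy pub/sub: publishers and subscribers share only a topic name."""
        def __init__(self):
            self.subscribers = defaultdict(list)

        def subscribe(self, topic, callback):
            self.subscribers[topic].append(callback)

        def publish(self, topic, message):
            # every subscriber of the topic receives its own copy
            for callback in self.subscribers[topic]:
                callback(message)

    broker = Broker()
    broker.subscribe("signups", lambda m: print("send notification:", m))
    broker.subscribe("signups", lambda m: print("update search index:", m))
    broker.publish("signups", {"user": "alice"})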
Scalable Stream Processing - Spark Streaming and Flink
Transformations (2/4): ▶ map • Returns a new DStream by passing each element of the source DStream through a given function. ▶ flatMap • Similar to map, but each input item can be mapped to 0 or more output items.
0 码力 | 113 pages | 1.22 MB | 1 year ago
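The map/flatMap contrast in this deck, shown with the (legacy) PySpark DStream API; a sketch only, and the socket source host and port are placeholders:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "MapVsFlatMap")
    ssc = StreamingContext(sc, batchDuration=1)

    lines = ssc.socketTextStream("localhost", 9999)  # placeholder source

    # flatMap: one input line becomes zero or more words
    words = lines.flatMap(lambda line: line.split(" "))
    # map: exactly one output element per input element
    pairs = words.map(lambda w: (w, 1))

    pairs.pprint()
    ssc.start()
    ssc.awaitTermination()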
Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Streaming word count:

textStream
  .flatMap { _.split("\\W+") }
  .map { (_, 1) }
  .keyBy(0)
  .sum(1)
  .print()

“live and let live” → “live” “and” “let” “live” → (live, 1) …

val env = StreamExecutionEnvironment.getExecutionEnvironment
val sensorData = env.addSource(new SensorSource)
val maxTemp = sensorData
  .map(r => Reading(r.id, r.time, (r.temp - 32) * (5.0 / 9.0)))
  .keyBy(_.id)
  .max("temp")

0 码力 | 26 pages | 3.33 MB | 1 year ago
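The same word count expressed with the PyFlink DataStream API (a sketch assuming a small bounded demo source; not taken from the course materials):

    from pyflink.common import Types
    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()

    text = env.from_collection(["live and let live"], type_info=Types.STRING())

    counts = (
        text
        .flat_map(lambda line: [(w, 1) for w in line.split()],
                  output_type=Types.TUPLE([Types.STRING(), Types.INT()]))
        .key_by(lambda pair: pair[0])
        .reduce(lambda a, b: (a[0], a[1] + b[1]))
    )

    counts.print()  # emits updated counts per word, e.g. (live,1) ... (live,2)
    env.execute("word_count")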
FIELD("data", DataTypes.STRING())])) def func(data: Row): return Row(data.id, data.data * 2) table.map(func).execute().print() +----+----------------------+--------------------------------+ | op | id | class MyFlatMapFunction(FlatMapFunction): def flat_map(self, value): for s in str(value.data).split('|'): yield Row(value.id, s) list(ds.flat_map(MyFlatMapFunction(), output_type=Types.ROW([Types.INT() part_size=1024 ** 3, rollover_interval=15 * 60 * 1000, inactivity_interval=5 *␣ ˓→60 * 1000)) .build()) ds.map(lambda i: (i[0] + 1, i[1]), Types.TUPLE([Types.INT(), Types.STRING()])).sink_ ˓→to(sink) # the result0 码力 | 36 页 | 266.77 KB | 1 年前3PyFlink 1.16 Documentation
FIELD("data", DataTypes.STRING())])) def func(data: Row): return Row(data.id, data.data * 2) table.map(func).execute().print() +----+----------------------+--------------------------------+ | op | id | class MyFlatMapFunction(FlatMapFunction): def flat_map(self, value): for s in str(value.data).split('|'): yield Row(value.id, s) list(ds.flat_map(MyFlatMapFunction(), output_type=Types.ROW([Types.INT() part_size=1024 ** 3, rollover_interval=15 * 60 * 1000, inactivity_interval=5 *␣ ˓→60 * 1000)) .build()) ds.map(lambda i: (i[0] + 1, i[1]), Types.TUPLE([Types.INT(), Types.STRING()])).sink_ ˓→to(sink) # the result0 码力 | 36 页 | 266.80 KB | 1 年前3Streaming in Apache Flink
Streaming in Apache Flink
Map Function (generic type parameters, stripped in the snippet, reconstructed here):

DataStream<TaxiRide> rides = env.addSource(new TaxiRideSource(...));
DataStream<EnrichedRide> enrichedNYCRides = rides
    .filter(new RideCleansing.NYCFilter())
    .map(new Enrichment());

class Enrichment implements MapFunction<TaxiRide, EnrichedRide> {
    @Override
    public EnrichedRide map(TaxiRide taxiRide) throws Exception {
        return new EnrichedRide(taxiRide);
    }
}

FlatMap Function …

DataStream<Tuple2<String, Double>> input = …
DataStream<Tuple2<String, Double>> smoothed = input.keyBy(0).map(new Smoother());

public static class Smoother extends RichMapFunction<Tuple2<String, Double>, Tuple2<String, Double>> {
0 码力 | 45 pages | 3.00 MB | 1 year ago
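The pattern behind Smoother, a map that keeps per-key state, sketched in plain Python (illustrative only; the smoothing formula and alpha are invented, and the Flink original would hold the average in managed keyed state):

    from collections import defaultdict

    class Smoother:
        """Keyed stateful map: exponential moving average per key."""
        def __init__(self, alpha=0.5):
            self.alpha = alpha
            self.state = defaultdict(lambda: None)  # Flink would use ValueState

        def map(self, key, value):
            prev = self.state[key]
            avg = value if prev is None else self.alpha * value + (1 - self.alpha) * prev
            self.state[key] = avg
            return (key, avg)

    smoother = Smoother()
    for key, reading in [("s1", 10.0), ("s1", 20.0), ("s2", 5.0)]:
        print(smoother.map(key, reading))  # smoothed value per key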
14 results in total