Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Operator selectivity • The number of output elements produced per number of input elements • A map operator has a selectivity of 1, i.e. it produces one output element for each input element it processes ... [figure: merge operator diagrams] ... Vasiliki Kalavri | Boston University 2020 ... map(String key, String value): // key: document name // value: document contents ... (k1, v1) → list(k2, v2) ... map() reduce() ... MapReduce combiners example: URL access frequency ... map() reduce() ... GET /dumprequest HTTP/1
0 credits | 54 pages | 2.83 MB | 1 year ago
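The combiner example named in this result (URL access frequency) can be sketched in plain Python. This is an illustration of the general technique, not code from the slides: the function names `map_phase`, `combine`, and `reduce_phase`, the two-node setup, and the sample log lines are all hypothetical.

```python
from collections import Counter

def map_phase(log_lines):
    """Emit (url, 1) per request line; assumes a 'METHOD URL PROTOCOL' layout."""
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 2:
            yield parts[1], 1

def combine(pairs):
    """Pre-aggregate counts on the mapper side, before anything is shuffled."""
    local = Counter()
    for url, n in pairs:
        local[url] += n
    return list(local.items())

def reduce_phase(partials):
    """Sum the partial counts produced by every combiner."""
    total = Counter()
    for partial in partials:
        for url, n in partial:
            total[url] += n
    return dict(total)

# Hypothetical request logs held by two mapper nodes.
node_a = [
    "GET /dumprequest HTTP/1.1",
    "GET /index HTTP/1.1",
    "GET /dumprequest HTTP/1.1",
]
node_b = ["GET /dumprequest HTTP/1.1"]

partials = [combine(map_phase(node_a)), combine(map_phase(node_b))]
url_counts = reduce_phase(partials)
```

Without the combiner, node A would ship three (url, 1) pairs to the reducer; with it, node A ships two pre-summed pairs, which is the saving a combiner is meant to provide.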
Scalable Stream Processing - Spark Streaming and Flink
Transformations (2/4) ▶ map • Returns a new DStream by passing each element of the source DStream through a given function. ▶ flatMap • Similar to map, but each input item can be mapped to
0 credits | 113 pages | 1.22 MB | 1 year ago
Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Kalavri | Boston University 2020 ... Streaming word count: textStream.flatMap {_.split("\\W+")}.map {(_, 1)}.keyBy(0).sum(1).print() ... “live and let live” “live” “and” “let” “live” (live ... val sensorData = env.addSource(new SensorSource) val maxTemp = sensorData.map(r => Reading(r.id, r.time, (r.temp-32)*(5.0/9.0))).keyBy(_.id).max("temp") maxTemp
0 credits | 26 pages | 3.33 MB | 1 year ago
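The streaming word count in this result emits an updated running count for every incoming word, because a keyed rolling sum produces one output per input record. A minimal plain-Python sketch of that behavior follows; it is not Flink code, and the function name `streaming_word_count` is hypothetical.

```python
import re
from collections import defaultdict

def streaming_word_count(lines):
    """Yield (word, running_count) once per word, as a keyed rolling sum would."""
    counts = defaultdict(int)
    for line in lines:
        # Same split pattern as the snippet: one or more non-word characters.
        for word in re.split(r"\W+", line):
            if not word:
                continue  # skip empty tokens from leading/trailing separators
            counts[word] += 1
            yield word, counts[word]

updates = list(streaming_word_count(["live and let live"]))
```

For the input "live and let live", the second occurrence of "live" produces an update with count 2 rather than a single final total, which is the difference between a streaming and a batch word count.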
PyFlink 1.15 Documentation
FIELD("data", DataTypes.STRING())])) def func(data: Row): return Row(data.id, data.data * 2) table.map(func).execute().print() ... | op | id | ... class MyFlatMapFunction(FlatMapFunction): def flat_map(self, value): for s in str(value.data).split('|'): yield Row(value.id, s) list(ds.flat_map(MyFlatMapFunction(), output_type=Types.ROW([Types.INT() ... part_size=1024 ** 3, rollover_interval=15 * 60 * 1000, inactivity_interval=5 * 60 * 1000)).build()) ds.map(lambda i: (i[0] + 1, i[1]), Types.TUPLE([Types.INT(), Types.STRING()])).sink_to(sink) # the result
0 credits | 36 pages | 266.77 KB | 1 year ago
PyFlink 1.16 Documentation
FIELD("data", DataTypes.STRING())])) def func(data: Row): return Row(data.id, data.data * 2) table.map(func).execute().print() ... | op | id | ... class MyFlatMapFunction(FlatMapFunction): def flat_map(self, value): for s in str(value.data).split('|'): yield Row(value.id, s) list(ds.flat_map(MyFlatMapFunction(), output_type=Types.ROW([Types.INT() ... part_size=1024 ** 3, rollover_interval=15 * 60 * 1000, inactivity_interval=5 * 60 * 1000)).build()) ds.map(lambda i: (i[0] + 1, i[1]), Types.TUPLE([Types.INT(), Types.STRING()])).sink_to(sink) # the result
0 credits | 36 pages | 266.80 KB | 1 year ago
Streaming in Apache Flink
} } ... Map Function: DataStream<TaxiRide> rides = env.addSource(new TaxiRideSource(...)); DataStream<EnrichedRide> enrichedNYCRides = rides.filter(new RideCleansing.NYCFilter()).map(new Enrichment()); class Enrichment implements MapFunction<TaxiRide, EnrichedRide> { @Override public EnrichedRide map(TaxiRide taxiRide) throws Exception { return new EnrichedRide(taxiRide); } } ... FlatMap Function ... DataStream<…> input = … DataStream<…> smoothed = input.keyBy(0).map(new Smoother()); public static class Smoother extends RichMapFunction<…, Tuple2
0 credits | 45 pages | 3.00 MB | 1 year ago
Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... to count cardinalities up to 1 billion or 2^30 with an accuracy of 4%. • The hash value needs to map elements to M = log2(2^30) = 30 bits. • We need 1024 counters, so m = 2^10 and we need p = log2(m) = 10 ... Space requirements ... Vasiliki Kalavri | Boston University 2020
0 credits | 69 pages | 630.01 KB | 1 year ago
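The arithmetic in this snippet can be checked with a few lines of Python. The variable names are hypothetical, and the last step (the bits left over once the counter index is taken from the hash) is an assumption based on how HyperLogLog-style sketches usually split the hash; the snippet itself is cut off before that point.

```python
import math

max_cardinality = 2 ** 30                       # about 1 billion distinct elements
hash_bits = int(math.log2(max_cardinality))     # M = log2(2^30) = 30

num_counters = 2 ** 10                          # m = 1024 counters
index_bits = int(math.log2(num_counters))       # p = log2(m) = 10

# Assumption (not stated in the snippet): the first p bits of the hash select
# a counter and the remaining bits are scanned for the leading-zero pattern.
remaining_bits = hash_bits - index_bits
```

Under these numbers, 10 of the 30 hash bits pick one of the 1024 counters and 20 bits remain for the rank computation.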
Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020
val sensorData = env.addSource(new SensorSource) val maxTemp = sensorData.map(r => Reading(r.id, r.time, (r.temp-32)*(5.0/9.0))).keyBy(_.id).timeWindow(Time.minutes(1)) ... Vasiliki Kalavri | Boston University 2020 ... val minTempPerWindow: DataStream[(String, Double)] = sensorData.map(r => (r.id, r.temperature)).keyBy(_._1).timeWindow(Time.seconds(15)).reduce((r1, r2) ... val avgTempPerWindow: DataStream[(String, Double)] = sensorData.map(r => (r.id, r.temperature)).keyBy(_._1).timeWindow(Time.seconds(15)).aggregate(new AvgTempFunction)
0 credits | 35 pages | 444.84 KB | 1 year ago
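The `minTempPerWindow` pipeline in this result keys readings by sensor id and reduces each 15-second tumbling window. The reduce function is truncated in the snippet, so the sketch below assumes, from the variable name only, that it keeps the minimum temperature. It is plain Python rather than Flink code, and the function name and sample readings are hypothetical.

```python
def min_per_tumbling_window(readings, window_ms):
    """Return {(sensor_id, window_start_ms): min_temperature}.

    readings: iterable of (sensor_id, timestamp_ms, temperature) tuples.
    """
    result = {}
    for sensor_id, ts, temp in readings:
        # A tumbling window of size w starting at 0 covers [k*w, (k+1)*w).
        window_start = ts - (ts % window_ms)
        key = (sensor_id, window_start)
        # Incremental reduce: keep only the smallest value seen so far.
        if key not in result or temp < result[key]:
            result[key] = temp
    return result

# Hypothetical readings: (sensor id, event time in ms, temperature).
readings = [
    ("s1", 1_000, 20.5),
    ("s1", 14_000, 19.0),
    ("s1", 16_000, 25.0),
    ("s2", 3_000, 30.0),
]
mins = min_per_tumbling_window(readings, window_ms=15_000)
```

Like a window reduce, this keeps a single value per key and window instead of buffering every reading, which is why incremental aggregation is cheaper than collecting the full window contents.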
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... interface. What operations should state support? What state types can you think of? • Count, sum, list, map, … ... Vasiliki Kalavri | Boston University 2020 ... All data maintained by a task and used to compute results: ... List[T]) ... Flink’s state primitives • MapState[K, V]: a map of keys and values • get(key: K), put(key: K, value: V), contains(key: K), remove(key: K) • iterators
0 credits | 24 pages | 914.13 KB | 1 year ago
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... selectivity • Selectivity: how many records does the operator produce per record in its input? • map: 1 in, 1 out • filter: 1 in, 1 or 0 out • flatMap, join: 1 in, 0, 1, or more out • Cost: how many ... connected to multiple queries. ... Vasiliki Kalavri | Boston University 2020 ... Load Shedding Road Map (LSRM) • A pre-computed table that contains materialized load shedding plans ordered by how much
0 credits | 43 pages | 2.42 MB | 1 year ago
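The selectivity definition in this result (records produced per record consumed) can be measured directly. The sketch below is plain Python with hypothetical names: every operator is modeled as a function that returns a list of outputs for one input record.

```python
def selectivity(operator, inputs):
    """Average number of output records per input record."""
    inputs = list(inputs)
    outputs = [out for record in inputs for out in operator(record)]
    return len(outputs) / len(inputs)

# Each operator maps one input record to a list of output records.
map_op = lambda x: [x * 2]                        # 1 in, 1 out
filter_op = lambda x: [x] if x % 2 == 0 else []   # 1 in, 1 or 0 out
flat_map_op = lambda s: s.split()                 # 1 in, 0, 1, or more out

map_sel = selectivity(map_op, [1, 2, 3, 4])
filter_sel = selectivity(filter_op, [1, 2, 3, 4])
flat_map_sel = selectivity(flat_map_op, ["a b c", "", "d"])
```

On these sample inputs the map operator has selectivity 1, the filter drops half of its input, and the flatMap produces more records than it consumes, matching the three cases listed in the snippet.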
11 results in total













