Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020
multicast, TCP • HTTP or RPC if the consumer exposes a service on the network • Failure handling: application needs to be aware of message loss, producers and consumers always online
Message queues: message is processed only once, by a single consumer • Event retrieval is not defined by content / structure but its order • FIFO, priority
[Figure: producer, consumer, queue]
Message brokers: multiple consumers can retrieve the same message - many-to-many communication - message content / structure matters for delivery
MB architecture advantages • Multiple producers/consumers as concurrent
33 pages | 700.14 KB | 1 year ago
PyFlink 1.15 Documentation
environments to use.
./bin/flink run-application -t yarn-application \
    -Djobmanager.memory.process.size=1024m \
    -Dtaskmanager.memory.process.size=1024m \
    -Dyarn.application.name=\
    -pyclientexec could not meet.
./bin/flink run-application -t yarn-application \
    -Djobmanager.memory.process.size=1024m \
    -Dtaskmanager.memory.process.size=1024m \
    -Dyarn.application.name= \
    -Dyarn zip and word_count.py for the above example.
As it executes the job on the JobManager in YARN application mode, the paths specified in -pyarch and -py are paths relative to shipfiles which is the directory
36 pages | 266.77 KB | 1 year ago
PyFlink 1.16 Documentation
environments to use.
./bin/flink run-application -t yarn-application \
    -Djobmanager.memory.process.size=1024m \
    -Dtaskmanager.memory.process.size=1024m \
    -Dyarn.application.name=\
    -pyclientexec could not meet.
./bin/flink run-application -t yarn-application \
    -Djobmanager.memory.process.size=1024m \
    -Dtaskmanager.memory.process.size=1024m \
    -Dyarn.application.name= \
    -Dyarn zip and word_count.py for the above example.
As it executes the job on the JobManager in YARN application mode, the paths specified in -pyarch and -py are paths relative to shipfiles which is the directory
36 pages | 266.80 KB | 1 year ago
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Stream denotation: An abstract interpretation of the stream as a mathematical structure, e.g. a sequence of (finite) relation states over a common schema R: [r1(R), r2(R), ..., ], the dataflow execution engine. • The burden of representation and denotations is left to the application developer/user. • The programmer needs to design and maintain appropriate state synopses
45 pages | 1.22 MB | 1 year ago
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Filtering streams
Vasiliki Kalavri | Boston University 2020
What data structure would you use to: • Filter out all emails that are sent from a suspected spam address? • Filter
upstream backup? The membership problem
• Introduced by Burton Bloom in 1970. • A probabilistic data structure for representing a (possibly growing) dataset of elements that supports: • adding an element
74 pages | 1.06 MB | 1 year ago
Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Filter • A space-efficient probabilistic data structure that can be used to estimate frequencies and heavy hitters in data streams • It was introduced
Space requirements
= 10^-6 The recommended number of counters is m = 2.71828 / 10^-6 ≈ 2,718,280. The sketch data structure requires a counter array of size 5 * 2,718,280. Considering 32-bit counters, the count-min sketch
69 pages | 630.01 KB | 1 year ago
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Graph streams
Graph streams model interactions as events that update an underlying graph structure. Edge events: A purchase, a movie rating, a like on an online post, a bitcoin transaction
Streaming Connected Components • State: a disjoint set (union-find) data structure for the components • it stores a set of elements partitioned in disjoint subsets • Single-pass
72 pages | 7.77 MB | 1 year ago
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
output Redundancy elimination variations
How can no-op or idempotent operators appear in an application?
Ensure the combination of A1, A2 is equivalent
GET /dumprequest HTTP/1.1
Host: rve.org.uk
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.22
54 pages | 2.83 MB | 1 year ago
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020
• To recover from failures, the system needs to • restart failed processes • restart the application and recover its state
Checkpointing guards the state from failures, but what about process
sufficient number of processing slots in order to execute all tasks of an application. • The JobManager cannot restart the application until enough slots become available. • Restart is automatic if there
standalone mode • The restart strategy determines how often the JobManager tries to restart the application and how long it waits between restart attempts.
TaskManager failures
41 pages | 4.09 MB | 1 year ago
Monitoring Apache Flink Applications (Introduction)
(framework or user code), as well as each network shuffle, takes time and adds to latency. 5. If the application emits through a transactional sink, the sink will only commit and publish transactions upon successful
So far we have only looked at Flink-specific metrics. As long as latency & throughput of your application are in line with your expectations and it is checkpointing consistently, this is probably everything
to the size of your application state (check the checkpointing metrics for an estimated size of the on-heap state). The possible reasons for growing state are very application-specific. Typically, an
23 pages | 148.62 KB | 1 year ago