Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 ... Combining estimates • Average won’t work: the expected value of 2^R is too large. • Median won’t work: it is always a power of 2, thus, if the correct estimate is between ... Filter ... • A space-efficient probabilistic data structure that can be used to estimate frequencies and heavy hitters in data streams • It was introduced ... = 10^-6. The recommended number of counters is m = 2.71828 / 10^-6 ≈ 2,718,280. The sketch data structure requires a counter array of size 5 × 2,718,280. Space requirements ...
0 credits | 69 pages | 630.01 KB | 1 year ago
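
The sizing arithmetic in the snippet above (m = e / ε counters per row, 5 rows) can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: the names `recommended_width` and `CountMinSketch` are hypothetical, the symbol ε for the 10^-6 error parameter is an assumption, and salted BLAKE2b hashes are an assumed stand-in for whatever hash family the slides use.

```python
import hashlib
import math


def recommended_width(epsilon: float) -> int:
    """Counters per row, m = e / epsilon, rounded up."""
    return math.ceil(math.e / epsilon)


class CountMinSketch:
    """Minimal frequency sketch: `depth` rows of `width` counters (hypothetical helper)."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Assumption: one salted BLAKE2b hash per row stands in for the
        # independent hash functions the analysis expects.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Increment one counter in every row.
        for row in range(self.depth):
            self.rows[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate counters, so take the minimum over rows.
        return min(
            self.rows[row][self._index(item, row)] for row in range(self.depth)
        )


width = recommended_width(1e-6)  # roughly 2.7 million counters per row
total_counters = 5 * width       # 5 rows, as in the snippet

# A much smaller sketch keeps the demo cheap to allocate.
sketch = CountMinSketch(width=272, depth=5)
for word in ["a", "b", "a", "c", "a", "b"]:
    sketch.add(word)
```

The estimate for an item is never below its true count; it can only overshoot when other items collide with it in every row.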
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... Graph streams: graph streams model interactions as events that update an underlying graph structure. Edge events: a purchase, a movie rating, a like on an online post, a bitcoin transaction ... Boston University 2020 ... Streaming Connected Components • State: a disjoint set (union-find) data structure for the components • it stores a set of elements partitioned in disjoint subsets • Single-pass ... Edge endpoints must have different signs • When merging components, if flipping all signs doesn’t work => the graph is not bipartite. Bipartite graph checking ... Vasiliki Kalavri | Boston University
0 credits | 72 pages | 7.77 MB | 1 year ago
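
The streaming connected components idea in the snippet above (a union-find structure updated once per edge event) might look roughly like this in Python. It is a sketch under assumptions: `DisjointSet` and `connected_components` are hypothetical names, and union by size with path compression is one common choice that the snippet does not confirm.

```python
class DisjointSet:
    """Union-find over elements that are registered lazily on first use."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register unseen elements as singleton components.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        # Walk up to the root of x's component.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root

    def union(self, a, b):
        """Merge the components of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        # Union by size: attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def connected_components(edge_stream):
    """Single pass over (u, v) edge events; returns a list of node sets."""
    ds = DisjointSet()
    for u, v in edge_stream:
        ds.union(u, v)
    groups = {}
    for node in list(ds.parent):
        groups.setdefault(ds.find(node), set()).add(node)
    return list(groups.values())


components = connected_components([(1, 2), (3, 4), (2, 3), (5, 6)])
```

Only the parent and size maps are kept as state, so memory grows with the number of vertices seen rather than the number of edges.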
PyFlink 1.15 Documentation
... declared pipeline can be printed, optimized, and eventually executed in a cluster. The pipeline can work with bounded or unbounded streams, which enables both streaming and batch scenarios. A Table is always ... declared pipeline can be printed, optimized, and eventually executed in a cluster. The pipeline can work with bounded or unbounded streams, which enables both streaming and batch scenarios. A DataStream ... following command: ls -lh /Users/duanchen/miniconda3/lib/python3.7/site-packages/pyflink The structure would be as follows: total 144 -rw-r--r-- 1 duanchen staff 1.3K Oct 19 16:01 README.txt -rw-r--r-- ...
0 credits | 36 pages | 266.77 KB | 1 year ago
PyFlink 1.16 Documentation
... declared pipeline can be printed, optimized, and eventually executed in a cluster. The pipeline can work with bounded or unbounded streams, which enables both streaming and batch scenarios. A Table is always ... declared pipeline can be printed, optimized, and eventually executed in a cluster. The pipeline can work with bounded or unbounded streams, which enables both streaming and batch scenarios. A DataStream ... following command: ls -lh /Users/duanchen/miniconda3/lib/python3.7/site-packages/pyflink The structure would be as follows: total 144 -rw-r--r-- 1 duanchen staff 1.3K Oct 19 16:01 README.txt -rw-r--r-- ...
0 credits | 36 pages | 266.80 KB | 1 year ago
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 ... Filtering streams ... What data structure would you use to: • Filter out all emails that are sent from a suspected spam address? • Filter ... upstream backup? The membership problem ... • Introduced by Burton Bloom in 1970. • A probabilistic data structure for representing a (possibly growing) dataset of elements that supports: • adding an element ...
0 credits | 74 pages | 1.06 MB | 1 year ago
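
The Bloom filter described in the snippet above can be illustrated with a short Python sketch for the spam-address example. The snippet is cut off after "adding an element", so the membership query shown here is the standard companion operation rather than something the snippet confirms. `BloomFilter`, `might_contain`, the example addresses, and the salted BLAKE2b hashing are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k hash positions per item in a fixed bit array."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch readable; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item: str):
        # Assumption: salted BLAKE2b hashes play the role of k hash functions.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("utf-8"), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: str) -> None:
        # Set every position the item hashes to.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means present or a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


spam_filter = BloomFilter(num_bits=1024, num_hashes=3)
for address in ["spam@example.com", "junk@example.org"]:
    spam_filter.add(address)
```

An added address is always reported as possibly present, while an address that was never added is usually, but not always, reported absent.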
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... message is processed only once, by a single consumer • Event retrieval is not defined by content / structure but by its order • FIFO, priority ... producer consumer queue ... Message brokers ... Message broker: multiple consumers can retrieve the same message - many-to-many communication - message content / structure matters for delivery ... MB architecture advantages • Multiple producers/consumers as concurrent ...
0 credits | 33 pages | 700.14 KB | 1 year ago
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... University 2020 ... Stream denotation: an abstract interpretation of the stream as a mathematical structure, e.g. a sequence of (finite) relation states over a common schema R: [r1(R), r2(R), ..., ], ...
0 credits | 45 pages | 1.22 MB | 1 year ago
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... and reliable streaming applications • have a solid understanding of how stream processing systems work and what factors affect their performance • be aware of the challenges and trade-offs one needs ...
0 credits | 34 pages | 2.53 MB | 1 year ago
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... checkpointId, long timestamp) void restoreState(List<T> state) ... Operator state • A function can work with operator list state by implementing the ListCheckpointed interface • snapshotState() is invoked ...
0 credits | 24 pages | 914.13 KB | 1 year ago
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... operators can be placed at any location in the query plan • Dropping near the source avoids wasting work, but it might affect the results of multiple queries if the source is connected to multiple queries.
0 credits | 43 pages | 2.42 MB | 1 year ago