Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
the day of the respective deadline. • Late submissions are only eligible for up to 50% of the original score. ... Vasiliki Kalavri | Boston University 2020 ... Quiz #0 ... bank-apache-flink ... Call monitoring • Service monitoring, e.g. source and destination phone numbers, their first and last cell towers ... Examples: • Location-based
34 pages | 2.53 MB | 1 year ago
PyFlink 1.15 Documentation
1.1 Preparation: This page shows you how to install PyFlink using pip, conda, or from source. ... Python Version Supported ... PyFlink Version / Python Version Supported: PyFlink 1.16 / Python 3 ... virtual environment needs to be activated before use. To activate the virtual environment, run: source venv/bin/activate. That is, execute the activate script under the bin directory of your virtual environment ... miniconda.sh ... # install miniconda: ./miniconda.sh -b -p miniconda; # Activate the miniconda environment: source miniconda/bin/activate; # Create conda virtual environment under a directory, e.g. venv: conda create
36 pages | 266.77 KB | 1 year ago
PyFlink 1.16 Documentation
1.1 Preparation: This page shows you how to install PyFlink using pip, conda, or from source. ... Python Version Supported ... PyFlink Version / Python Version Supported: PyFlink 1.16 / Python 3 ... virtual environment needs to be activated before use. To activate the virtual environment, run: source venv/bin/activate. That is, execute the activate script under the bin directory of your virtual environment ... miniconda.sh ... # install miniconda: ./miniconda.sh -b -p miniconda; # Activate the miniconda environment: source miniconda/bin/activate; # Create conda virtual environment under a directory, e.g. venv: conda create
36 pages | 266.80 KB | 1 year ago
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
covered in this lecture ... Revisiting the basics: source, sink, input port, output port, dataflow graph ... Revisiting operators ... according to the number of available cores / threads • Fused operators can share the address space but use separate threads of control • avoid communication cost without losing pipeline parallelism • use a shared buffer for communication • Fused filters / projections at the source can significantly reduce I/O and intermediate results size ... Synergies with scheduling and other optimizations
54 pages | 2.83 MB | 1 year ago
Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Stochastic ... stream into m = 2^p = 4 sub-streams. Consider the input elements {5, 14, 5, 2, 8, 1, …} Substream → address (counter values not shown): S0 → 00, S1 → 01, S2 → 10, S3 → 11 • x1=5, h5(5) = 00101 • x2=14, h5(14) = 10110 • x3=5
69 pages | 630.01 KB | 1 year ago
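The sub-stream addressing in the snippet above can be sketched as follows. This is a minimal illustration, assuming (as the address table and m = 2^p suggest, but the snippet does not state outright) that the first p bits of each 5-bit hash value select the sub-stream. The function name `substream_index` is hypothetical, and only the two hash values that appear in the snippet are used.

```python
def substream_index(hash_bits: str, p: int = 2) -> int:
    """Return the sub-stream index encoded in the first p bits of a hash.

    Assumption: the leading p bits are the sub-stream address, so with
    p = 2 there are m = 2**p = 4 sub-streams (S0..S3).
    """
    if not 0 < p <= len(hash_bits):
        raise ValueError("p must be between 1 and the hash length")
    if set(hash_bits) - {"0", "1"}:
        raise ValueError("hash_bits must be a string of 0s and 1s")
    # Interpret the leading p bits as a binary number.
    return int(hash_bits[:p], 2)


# Hash values taken verbatim from the snippet; the others are not given there.
hashes = {5: "00101", 14: "10110"}
assignments = {x: substream_index(h) for x, h in hashes.items()}
print(assignments)  # {5: 0, 14: 2}
```

Under this assumption, element 5 (hash 00101) falls into S0 and element 14 (hash 10110) into S2; what each sub-stream's counter stores is not shown in the snippet and is left out here.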
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020
high-availability mode migrates the responsibility and metadata for a job to another JobManager in case the original JobManager disappears. • Flink relies on Apache ZooKeeper for high-availability • coordination
41 pages | 4.09 MB | 1 year ago
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020
What data structure would you use to: • Filter out all emails that are sent from a suspected spam address? • Filter out all URLs that contain malware? • Filter out all compromised passwords? • Remove
74 pages | 1.06 MB | 1 year ago
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Exactly-once in Google Cloud Dataflow: Checkpointing to address non-determinism • Each output is checkpointed together with its unique ID to stable storage before
49 pages | 2.08 MB | 1 year ago
Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
cause imbalance ... Addressing skew • To address skew, the system needs to track the frequencies of the partitioning key values. • We can then
31 pages | 1.47 MB | 1 year ago
Scalable Stream Processing - Spark Streaming and Flink
Operations ▶ Every input DStream is associated with a Receiver object. • It receives the data from a source and stores it in Spark’s memory for processing. ▶ Three categories of streaming sources: 1. Basic ... [number of partitions]) ... Input Operations - Custom Sources (1/3) ▶ To create a custom source: extend the Receiver class. ▶ Implement onStart() and onStop(). ▶ Call store(data) to store received
113 pages | 1.22 MB | 1 year ago
21 results in total