Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 4/09: Flow control and load shedding ... Vasiliki Kalavri | Boston University 2020 ... Keeping up with the producers: what if the queue grows larger than available memory? • block the producer (back-pressure, flow control) ... Load management approaches ... Load shedder ... latency constraints that can tolerate approximate results. Slow down the flow of data: • The system buffers excess data for later processing, once input rates stabilize. • Requires a persistent ...
0 points | 43 pages | 2.42 MB | 1 year ago

Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020
& reconfiguration ... Vasiliki Kalavri | Boston University 2020 • To recover from failures, the system needs to • restart failed processes • restart the application and recover its state ... Checkpointing ... and all required metadata, such as the application’s JAR file, into a remote persistent storage system • ZooKeeper also holds state handles and checkpoint locations ... JobManager failures ... operator placement • skew and straggler mitigation • Migrate to a different cluster or software version ... Reconfiguration cases ... Streaming applications ...
0 points | 41 pages | 4.09 MB | 1 year ago

PyFlink 1.15 Documentation
PyFlink Version / Python Version Supported: PyFlink 1.16: Python 3.6 to 3.9; PyFlink 1.15: Python 3.6 to 3.8; PyFlink 1.14: Python 3.6 to 3.8. You can check your Python version as follows: python3 --version (pyflink-docs, Release release-1.15) Create a Python virtual environment: Virtual environment gives you the ability to isolate the Python dependencies of different projects ... g. venv: virtualenv venv # You can also create a Python virtual environment with a specific Python version: virtualenv --python /path/to/python/executable venv The virtual environment needs to be activated ...
0 points | 36 pages | 266.77 KB | 1 year ago

PyFlink 1.16 Documentation
PyFlink Version / Python Version Supported: PyFlink 1.16: Python 3.6 to 3.9; PyFlink 1.15: Python 3.6 to 3.8; PyFlink 1.14: Python 3.6 to 3.8. You can check your Python version as follows: python3 --version (pyflink-docs, Release release-1.16) Create a Python virtual environment: Virtual environment gives you the ability to isolate the Python dependencies of different projects ... g. venv: virtualenv venv # You can also create a Python virtual environment with a specific Python version: virtualenv --python /path/to/python/executable venv The virtual environment needs to be activated ...
0 points | 36 pages | 266.80 KB | 1 year ago

Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Kalavri | Boston University 2020 ... Control: When and how much to adapt? Mechanism: How to apply the re-configuration? • Detect environment changes: external workload and system performance • Identify ... to ensure result correctness ... Automatic Scaling Control ... The automatic scaling problem ... Given a logical ... congestion, back pressure, throughput ... Policy • Queuing theory models: for latency objectives • Control theory models: e.g., PID controller • Rule-based models, e.g. if CPU utilization > 70% => scale ...
0 points | 93 pages | 2.42 MB | 1 year ago

Scalable Stream Processing - Spark Streaming and Flink
Output Operations (1/4) ▶ Push out DStream’s data to external systems, e.g., a database or a file system. ▶ foreachRDD: the most generic output operator • Applies a function to each RDD generated from ... automatically converts this batch-like query to a streaming execution plan. ▶ 2. Specify triggers to control when to update the results. • Each time a trigger fires, Spark checks for new data (new row in the ...
0 points | 113 pages | 1.22 MB | 1 year ago

Streaming in Apache Flink
Streamed? • Anything (if you write a serializer/deserializer for it) • Flink has a built-in type system which supports: • basic types, i.e., String, Long, Integer, Boolean, Array • composite types: ... DataStream<String> control = env.fromElements("DROP", "IGNORE").keyBy(x -> x); DataStream<String> streamOfWords = env.fromElements("data", "DROP", "artisans", "IGNORE") .keyBy(x -> x); control ... ValueStateDescriptor<>("blocked", Boolean.class)); } @Override public void flatMap1(String control_value, Collector<String> out) throws Exception { blocked.update(Boolean.TRUE); } @Override ...
0 points | 45 pages | 3.00 MB | 1 year ago

Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
alerts for abnormal system metrics • Detect invariant violations • Identify outlier tasks ... Inspired by this paper: “SAQL: A Stream-based Query System for Real-Time Abnormal System Behavior Detection” ... Boston University 2020 ... [1, 4, 5, 23, 8, 0, 7], median: 5 ‣ We cannot store the entire stream ‣ No control over arrival rate or order ... Continuously arriving, possibly unbounded data ...
0 points | 34 pages | 2.53 MB | 1 year ago

Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
arrival and/or a generation timestamp. • They are produced by external sources, i.e. the DSMS has no control over their arrival order or the data rate. • They have unknown, possibly unbounded length, i ... DSMS ... Database Management System • ad-hoc queries, data manipulation tasks • insertions, updates, deletions of single row or groups of rows ... Data Stream Management System • continuous queries • ...
0 points | 45 pages | 1.22 MB | 1 year ago

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
available cores / threads • Fused operators can share the address space but use separate threads of control • avoid communication cost without losing pipeline parallelism • use a shared buffer for communication ... • fixed number of random state accesses, 32K L1 cache • the throughput of the non-shared version degrades first ... State sharing ... Profitability ... Vasiliki Kalavri | Boston University 2020
0 points | 54 pages | 2.83 MB | 1 year ago

22 results in total













