version control system - IT文库_程序员IT互联网编程电子书和文档免费下载，助您码力十足！

首页文库资料文章资讯上传文档发布文章登录账户

Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Stream Processing and Analytics Vasiliki (Vasia) Kalavri  vkalavri@bu.edu Spring 2020 4/09: Flow control and load shedding ??? Vasiliki Kalavri | Boston University 2020 Keeping up with the producers what if the queue grows larger than available memory? • block the producer (back-pressure, flow control) 2 ??? Vasiliki Kalavri | Boston University 2020 Load management approaches 3 ! Load shedder latency constraints that can tolerate approximate results. Slow down the flow of data: • The system buffers excess data for later processing, once input rates stabilize. • Requires a persistent

0 码力 | 43 页 | 2.42 MB | 1 年前
3
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020

& reconfiguration ??? Vasiliki Kalavri | Boston University 2020 • To recover from failures, the system needs to • restart failed processes • restart the application and recover its state 2 Checkpointing and all required metadata, such as the application’s JAR file, into a remote persistent storage system • Zookeeper also holds state handles and checkpoint locations 5 JobManager failures ??? Vasiliki operator placement • skew and straggler mitigation • Migrate to a different cluster or software version 9 Reconfiguration cases ??? Vasiliki Kalavri | Boston University 2020 Streaming applications

0 码力 | 41 页 | 4.09 MB | 1 年前
3
PyFlink 1.15 Documentation

Python Version Supported PyFlink Version Python Version Supported PyFlink 1.16 Python 3.6 to 3.9 PyFlink 1.15 Python 3.6 to 3.8 PyFlink 1.14 Python 3.6 to 3.8 You could check your Python version as following: following: 3 pyflink-docs, Release release-1.15 python3 --version Create a Python virtual environment Virtual environment gives you the ability to isolate the Python dependencies of different projects g. venv virtualenv venv # You can also create Python virtual environment with a specific Python version virtualenv --python /path/to/python/executable venv The virtual environment needs to be activated

0 码力 | 36 页 | 266.77 KB | 1 年前
3
PyFlink 1.16 Documentation

Python Version Supported PyFlink Version Python Version Supported PyFlink 1.16 Python 3.6 to 3.9 PyFlink 1.15 Python 3.6 to 3.8 PyFlink 1.14 Python 3.6 to 3.8 You could check your Python version as following: following: 3 pyflink-docs, Release release-1.16 python3 --version Create a Python virtual environment Virtual environment gives you the ability to isolate the Python dependencies of different projects g. venv virtualenv venv # You can also create Python virtual environment with a specific Python version virtualenv --python /path/to/python/executable venv The virtual environment needs to be activated

0 码力 | 36 页 | 266.80 KB | 1 年前
3
Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Kalavri | Boston University 2020 Control: When and how much to adapt? Mechanism: How to apply the re-configuration? 3 • Detect environment changes: external workload and system performance • Identify to ensure result correctness ??? Vasiliki Kalavri | Boston University 2020 Automatic Scaling Control 4 ??? Vasiliki Kalavri | Boston University 2020 The automatic scaling problem 5 Given a logical congestion, back pressure, throughput Policy • Queuing theory models: for latency objectives • Control theory models: e.g., PID controller • Rule-based models, e.g. if CPU utilization > 70% => scale

0 码力 | 93 页 | 2.42 MB | 1 年前
3
Scalable Stream Processing - Spark Streaming and Flink

Output Operations (1/4) ▶ Push out DStream’s data to external systems, e.g., a database or a file system. ▶ foreachRDD: the most generic output operator • Applies a function to each RDD generated from automatically converts this batch-like query to a streaming execution plan. ▶ 2. Specify triggers to control when to update the results. • Each time a trigger fires, Spark checks for new data (new row in the automatically converts this batch-like query to a streaming execution plan. ▶ 2. Specify triggers to control when to update the results. • Each time a trigger fires, Spark checks for new data (new row in the

0 码力 | 113 页 | 1.22 MB | 1 年前
3
Streaming in Apache Flink

Streamed? • Anything (if you write a serializer/deserializer for it) • Flink has a built-in type system which supports: • basic types, i.e., String, Long, Integer, Boolean, Array • composite types: DataStream control = env.fromElements("DROP", "IGNORE").keyBy(x -> x); DataStream streamOfWords = env.fromElements("data", "DROP", "artisans", "IGNORE") .keyBy(x -> x); control ValueStateDescriptor<>("blocked", Boolean.class)); } @Override public void flatMap1(String control_value, Collector out) throws Exception { blocked.update(Boolean.TRUE); } @Override

0 码力 | 45 页 | 3.00 MB | 1 年前
3
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020

alerts for abnormal system metrics • Detect invariant violations • Identify outlier tasks Inspired by this paper : “SAQL: A Stream-based Query System for Real- Time Abnormal System Behavior Detection” Boston University 2020 [1, 4, 5, 23, 8, 0, 7] 5 median ‣ We cannot store the entire stream ‣ No control over arrival rate or order f’ ∞ ? Continuously arriving, possibly unbounded data f read write

0 码力 | 34 页 | 2.53 MB | 1 年前
3
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020

arrival and/or a generation timestamp. • They are produced by external sources, i.e. the DSMS has no control over their arrival order or the data rate. • They have unknown, possibly unbounded length, i DSMS Database Management System • ad-hoc queries, data manipulation tasks • insertions, updates, deletions of single row or groups of rows Data Stream Management System • continuous queries •

0 码力 | 45 页 | 1.22 MB | 1 年前
3
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020

available cores / threads • Fused operators can share the address space but use separate threads of control • avoid communication cost without losing pipeline parallelism • use a shared buffer for communication • fixed number of random state accesses, 32K L1 cache • the throughput of the non-shared version degrades first State sharing B A Β Α Profitability ??? Vasiliki Kalavri | Boston University 2020

0 码力 | 54 页 | 2.83 MB | 1 年前
3

共 22 条前往

页

分类

语言

格式

Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020

PyFlink 1.15 Documentation

PyFlink 1.16 Documentation

Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Scalable Stream Processing - Spark Streaming and Flink

Streaming in Apache Flink

Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020