Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020logging tuples in their output queues until downstream operators have completely processed them. 4 Vasiliki Kalavri | Boston University 2020 Active Standby 5 • The secondary receives tuples from data to be completely processed 3. Copy the state of each task to a remote, persistent storage 4. Wait until all tasks have finished their copies 5. Resume processing and stream ingestion 12 ? sha1_base64="CtNBlvh1gZfP7+4+v4893Q12R8=">AB53icbVBNS8NAEJ34WetX1aOXxSJ4Kok I6q3oxWMLxhbaUDbSbt2swm7G6GE/gIvHlS8+pe8+W/ctjlo64OBx3szMwLU8G1cd1vZ2V1bX1js7RV3t7Z3duvHBw+6CRTDH2WiES1Q6pRcIm+4UZgO1VI41BgKxzdTv3WE0 码力 | 81 页 | 13.18 MB | 1 年前3
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 4/28: Graph Streaming ??? Vasiliki Kalavri | Boston University 2020 Modeling the world as a graph 2 Vasiliki Kalavri | Boston University 2020 Basics 1 5 4 3 2 “node” or “vertex” “edge” 1 5 4 3 2 undirected graph directed graph 4 ??? Vasiliki Kalavri | Boston University 2020 Graph streams vertex 3, 4 1 2 1, 4 5 3 . . . 1 5 4 3 2 ??? Vasiliki Kalavri | Boston University 2020 14 (Vi+1, outbox) <— compute(Vi, inbox) Superstep i Superstep i+1 1 3, 4 2 1, 4 5 3 . . . 1 3, 4 2 1,0 码力 | 72 页 | 7.77 MB | 1 年前3
Flink如何实时分析Iceberg数据湖的CDC数据如3实时写 4F取 ## 未来规划 #4 #见的CDC分析方案 #1 离线 HBase 集u分析 CDC 数a 、CDC记录实时写入HBase。高吞P + 低延迟。 2、小vSg询延迟低。 3、集u可拓展 ci评C B点 、行存o引不适O分析A务。 2、HBase集ur护成e较高。 3、通过Re12o4Server定DHF23e, ServerlB化Rs存完H用不上。 4、数a格式q定HF23e,不cF拓展到 方案评估 优点 、cedKudup群,a较小众。维护 O本q。 2、H HDFS / S3 / OSS 等D裂。数据c e,且KAO本不如S3 / OSS。 3、Kudud批量P描不如3ar4u1t。 4、不支持增量SF。 h点 直接D入CDC到Hi2+分析 、流程能E作 2、Hi2+存量数据不受增量数据H响。 方案评估 优点 、数据不是CR写入; 2、每次数据D致都要 MERGE a<4a CaCDC数据 1、仅依t S1a2k+D+/4a,架构简e。 2、无在k服务。l护和运nS本低。 2、D存存储,Ca速O快。 3、方便上S3 OSS,超高性价比。 方案s估 优点 1、增量和全量表割p,时效性不足。 2、r计和l护额外hChang+ S+4表。 3、计算引擎并非原g支UCDC。 4、不支U实时U13+24。 缺点 为何选择 0 码力 | 36 页 | 781.69 KB | 1 年前3
Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 4/16: Skew mitigation ??? Vasiliki Kalavri | Boston University 2020 Key partitioning 2 w2 w1 w3 solution will not contain any item y with frequency: • freq(y) < (δ - ε)*N, for a user-chosen value ε 4 (δ - ε)*Ν δ*Ν not included may be included included ??? Vasiliki Kalavri | Boston University 2020 1 1 3 5 input stream ε=0.2 w1 w4 w3 w2 ??? Vasiliki Kalavri | Boston University 2020 Example 8 1 2 2 3 5 5 1 1 2 3 3 3 3 1 2 0 1 1 3 5 input stream ε=0.2 w1 w4 w3 w2 1 2 2 3 5 w1 ??? Vasiliki0 码力 | 31 页 | 1.47 MB | 1 年前3
PyFlink 1.15 Documentationjava.lang.String.value accessible: module java.base does not “opens java.lang” to unnamed module @4e4aea35 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 1.3.4 Connector issues . . . . txt 2. Conda install pandoc conda install pandoc 3. Build the docs python3 setup.py build_sphinx 4. Open the pyflink-docs/build/sphinx/html/index.html in the Browser 1.1 Getting Started This page summarizes environment needs to be activated before to use it. To activate the conda virtual environment, run: 4 Chapter 1. How to build docs locally pyflink-docs, Release release-1.15 conda activate venv Install0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 Documentationjava.lang.String.value accessible: module java.base does not “opens java.lang” to unnamed module @4e4aea35 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 1.3.4 Connector issues . . . . txt 2. Conda install pandoc conda install pandoc 3. Build the docs python3 setup.py build_sphinx 4. Open the pyflink-docs/build/sphinx/html/index.html in the Browser 1.1 Getting Started This page summarizes environment needs to be activated before to use it. To activate the conda virtual environment, run: 4 Chapter 1. How to build docs locally pyflink-docs, Release release-1.16 conda activate venv Install0 码力 | 36 页 | 266.80 KB | 1 年前3
Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 4/02: Elasticity policies and state migration ??? Vasiliki Kalavri | Boston University 2020 Streaming ensure result correctness ??? Vasiliki Kalavri | Boston University 2020 Automatic Scaling Control 4 ??? Vasiliki Kalavri | Boston University 2020 The automatic scaling problem 5 Given a logical dataflow effect of Dhalion’s scaling actions in an initially under-provisioned wordcount dataflow 1 2 3 6 5 4 ??? Vasiliki Kalavri | Boston University 2020 Dataflow worker activities worker 1 worker 2 worker0 码力 | 93 页 | 2.42 MB | 1 年前3
Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 4/23: Cardinality and frequency estimation ??? Vasiliki Kalavri | Boston University 2020 Counting distinct 9013, h(x2) = 24 or 11000 => rank(x2) = 3 h(x) = M−1 ∑ k=0 ik2k = (i0i1 . . . iM−1)2, ik ∈ {0,1} 4 ??? Vasiliki Kalavri | Boston University 2020 5 Let n be the number of distinct elements in the input that indicates 2 distinct elements, whereas if two 0s is the maximum we’ve seen, that indicates 4 distinct elements, … 6 ??? Vasiliki Kalavri | Boston University 2020 The hash function h hashes0 码力 | 69 页 | 630.01 KB | 1 年前3
Scalable Stream Processing - Spark Streaming and Flinkmicro-batch processing ▶ Record-at-a-Time vs. declarative APIs 3 / 79 Outline ▶ Spark streaming ▶ Flink 4 / 79 Spark Streaming 5 / 79 Contribution ▶ Design issues • Continuous vs. micro-batch processing on DStreams ▶ Input operations ▶ Transformation ▶ Output operations 19 / 79 Transformations (1/4) ▶ Transformations on DStreams are still lazy! ▶ Now instead, computation is kicked off explicitly DStreams support many of the transformations available on normal Spark RDDs. 20 / 79 Transformations (2/4) ▶ map • Returns a new DStream by passing each element of the source DStream through a given function0 码力 | 113 页 | 1.22 MB | 1 年前3
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020machine learning models State in dataflow computations 3 Vasiliki Kalavri | Boston University 2020 4 Distributed streaming systems will fail • how can we guard state against failures and guarantee correct Precise t1 t2 t3 t4 t5 t6 … Gap t1 t2 t3 t5 t6 … Rollback-repeating t1 t2 t3 t2 t3 t4 … Rollback-convergent t1 t2 t3 t’2 t’3 t4 … Rollback-divergent t1 t2 t3 t’2 t’3 t’4 … The output semantics Processing guarantees and result semantics 11 sum 4 3 2 1 0 … Vasiliki Kalavri | Boston University 2020 Processing guarantees and result semantics 11 sum 4 3 2 1 … 1 5 Vasiliki Kalavri | Boston University0 码力 | 49 页 | 2.08 MB | 1 年前3
共 26 条
- 1
- 2
- 3













