Scalable Stream Processing - Spark Streaming and Flink
… sources. Input Operations - Basic Sources ▶ Socket connection • Creates a DStream from text data received over a TCP socket connection: ssc.socketTextStream("localhost", 9999) ▶ File stream … A custom receiver reads from the socket in a loop:
  private def receive() {
    ...
    socket = new Socket(host, port)
    val reader = ... // read from the socket connection
    var userInput = reader.readLine()
    while (!isStopped && userInput != null) {
      store(userInput)
      ...
0 points | 113 pages | 1.22 MB | 1 year ago

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
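The custom-receiver loop in the Spark Streaming entry above reads lines from a socket until the receiver is stopped or the peer disconnects, handing each line to store(). A minimal Python sketch of the same loop, with an in-memory reader and a hypothetical store callback standing in for the real socket and Spark's Receiver.store:

```python
import io

def receive_lines(reader, store, is_stopped=lambda: False):
    """Mimic the receiver loop from the snippet: read lines from a
    socket-like reader until the receiver is stopped or the peer
    closes the connection (readline() then returns '')."""
    line = reader.readline()
    while not is_stopped() and line != "":
        store(line.rstrip("\n"))  # hand each record to the engine
        line = reader.readline()

# Usage: in real Spark the reader would wrap new Socket(host, port);
# here we feed it an in-memory stream instead.
received = []
receive_lines(io.StringIO("a\nb\nc\n"), received.append)
print(received)  # -> ['a', 'b', 'c']
```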
… context of streaming? • queries run continuously • streams are unbounded • In traditional ad-hoc database queries, the query plan is generated on-the-fly. Different plans can be used for two consecutive … example: URL access frequency, map() / reduce() over raw HTTP requests:
  GET /dumprequest HTTP/1.1
  Host: rve.org.uk
  Connection: keep-alive
  Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
  User-Agent: …
  …en;q=0.8
  Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
0 points | 54 pages | 2.83 MB | 1 year ago

Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020
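The URL-access-frequency example in the streaming-optimizations entry above pairs a map() that extracts the requested URL with a reduce() that sums per-URL counts. A small sketch under that assumption, using made-up request lines like the one in the snippet:

```python
from collections import Counter

# Hypothetical request log; in the slides the input is raw HTTP
# request lines such as "GET /dumprequest HTTP/1.1".
requests = [
    "GET /dumprequest HTTP/1.1",
    "GET /index.html HTTP/1.1",
    "GET /dumprequest HTTP/1.1",
]

# map(): emit (URL, 1) for every request line
pairs = [(line.split()[1], 1) for line in requests]

# reduce(): sum the counts per URL
freq = Counter()
for url, n in pairs:
    freq[url] += n

print(dict(freq))  # -> {'/dumprequest': 2, '/index.html': 1}
```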
… userSentPayment … Connecting producers to consumers • Indirectly • Producer writes to a file or database • Consumer periodically polls and retrieves new data • polling overhead, latency? • Consumer … consumers. • It receives messages from the producers and pushes them to the consumers. • A TCP connection is a simple messaging system which connects one sender with one recipient. • A general messaging … Databases • DBs keep data until explicitly deleted, while MBs delete messages once consumed. • Use a database for long-term data storage! • MBs assume a small working set. If consumers are slow, throughput …
0 points | 33 pages | 700.14 KB | 1 year ago

PyFlink 1.15 Documentation
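The "message brokers vs. databases" bullets in the pub/sub entry above hinge on one difference: a broker deletes a message once consumed, while a database keeps rows until explicitly deleted. A toy illustration (both classes are hypothetical, not any real broker or DB API):

```python
from collections import deque

class MessageBroker:
    """Toy broker: messages are deleted once consumed, as the
    'MBs delete messages once consumed' bullet describes."""
    def __init__(self):
        self.queue = deque()
    def publish(self, msg):
        self.queue.append(msg)
    def consume(self):
        return self.queue.popleft() if self.queue else None

class Database:
    """Toy store: data is kept until explicitly deleted."""
    def __init__(self):
        self.rows = []
    def insert(self, row):
        self.rows.append(row)
    def query(self):
        return list(self.rows)  # reading does not remove anything

broker, db = MessageBroker(), Database()
for m in ("userSentPayment", "userLoggedIn"):
    broker.publish(m)
    db.insert(m)

broker.consume()          # 'userSentPayment' is now gone from the broker
print(len(broker.queue))  # -> 1
print(len(db.query()))    # -> 2 (reads are non-destructive)
```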
… api.ValidationException: Unable to create a source for reading table 'default_catalog.default_database.sourceKafka'. Table options are:
  'connector'='kafka'
  'format'='json'
  'properties.bootstrap.servers'='192…
…
  at org.apache.flink.connector.jdbc.internal.connection.SimpleJdbcConnectionProvider.loadDriver(SimpleJdbcConnectionProvider.java:90)
  at org.apache.flink.connector.jdbc.internal.connection.SimpleJdbcConnectionProvider.getLoadedDriver(SimpleJdbcConnectionProvider.java:100)
  at org.apache.flink.connector.jdbc.internal.connection.SimpleJdbcConnectionProvider.…
… 1.3. Frequently Asked Questions (FAQ) …
0 points | 36 pages | 266.77 KB | 1 year ago

PyFlink 1.16 Documentation
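The ValidationException in the PyFlink 1.15 entry above lists the options of a Kafka source table. A hedged sketch of DDL that would produce exactly those options; the column list and the bootstrap-server address are invented, since the snippet truncates them:

```python
# Hypothetical DDL matching the options shown in the error message;
# the schema and the broker address are made up for illustration.
ddl = """
CREATE TABLE sourceKafka (
    msg STRING
) WITH (
    'connector' = 'kafka',
    'format' = 'json',
    'properties.bootstrap.servers' = '192.0.2.10:9092'
)
"""
# In PyFlink this string would be passed to t_env.execute_sql(ddl);
# a common cause of the error above is a missing kafka connector jar.
print("'connector' = 'kafka'" in ddl)  # -> True
```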
… api.ValidationException: Unable to create a source for reading table 'default_catalog.default_database.sourceKafka'. Table options are:
  'connector'='kafka'
  'format'='json'
  'properties.bootstrap.servers'='192…
…
  at org.apache.flink.connector.jdbc.internal.connection.SimpleJdbcConnectionProvider.loadDriver(SimpleJdbcConnectionProvider.java:90)
  at org.apache.flink.connector.jdbc.internal.connection.SimpleJdbcConnectionProvider.getLoadedDriver(SimpleJdbcConnectionProvider.java:100)
  at org.apache.flink.connector.jdbc.internal.connection.SimpleJdbcConnectionProvider.…
… 1.3. Frequently Asked Questions (FAQ) …
0 points | 36 pages | 266.80 KB | 1 year ago

Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… traditional data processing applications, we know the entire dataset in advance, e.g. tables stored in a database. A data stream is a data set that is produced incrementally over time, rather than being available … not know when the stream ends. … Vasiliki Kalavri | Boston University 2020 … DW / DBMS / SDW / DSMS … Database Management System • ad-hoc queries, data manipulation tasks • insertions, updates, deletions … Single-node execution • Synopses and sketches • Approximate results • In-order data processing … [timeline figure: Tapestry (1992); stream database systems NiagaraCQ, Aurora, TelegraphCQ, STREAM (~2000); MapReduce (2004); Naiad (2013)]
0 points | 45 pages | 1.22 MB | 1 year ago

Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020
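The fundamentals entry above contrasts knowing the whole dataset in advance with processing a stream incrementally; its "synopses and sketches" bullet is the key idea: keep a small summary instead of the unbounded data. A Python sketch under that reading, with a running average as the (hypothetical) summary:

```python
import itertools

def running_average(stream):
    """Process records incrementally: maintain a small summary
    (count, sum) instead of storing the whole, unbounded stream."""
    count, total = 0, 0.0
    for x in stream:
        count += 1
        total += x
        yield total / count  # result is refreshed per record

# An unbounded source (here: 1, 2, 3, ...); we can only ever
# observe a finite prefix of it.
unbounded = itertools.count(1)
first_four = list(itertools.islice(running_average(unbounded), 4))
print(first_four)  # -> [1.0, 1.5, 2.0, 2.5]
```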
… Kalavri | Boston University 2020 … ESL: Expressive Stream Language • Ad-hoc SQL queries • Updates on database tables • Continuous queries on data streams • New streams (derived) are defined as virtual views … References: • … In Proceedings of the 10th International Conference on Database Theory (ICDT’05). • Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. Query languages and data models for database sequences and data streams. In Proceedings …
0 points | 53 pages | 532.37 KB | 1 year ago

Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… it is on the TCP channel. • If there is no buffer on the consumer side, reading from the TCP connection is interrupted. • The producer uses a threshold to control how much data is in-flight. … through an ATM network, each pair of endpoints first needs to establish a virtual circuit (VC) or connection. • CFC uses a credit system to signal the availability of buffer space from receivers to senders … pairs of communicating tasks only • it does not interfere with other tasks sharing the same TCP connection. • CFC maximizes network utilization and prevents faults caused by high congestion. • In the …
0 points | 43 pages | 2.42 MB | 1 year ago

State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
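The credit-based flow control (CFC) described in the flow-control entry above has receivers advertise free buffer space as credits, and senders transmit only while they hold credits. A toy simulation under those assumptions (class names and buffer sizes are invented):

```python
class Receiver:
    """Grants credits that reflect its free buffer slots."""
    def __init__(self, buffer_slots):
        self.free = buffer_slots
        self.received = []
    def grant_credits(self):
        credits, self.free = self.free, 0
        return credits
    def deliver(self, record):
        self.received.append(record)
    def drain(self, n):  # downstream consumer frees n buffer slots
        self.free += n

class Sender:
    """Sends only as many records as it holds credits for."""
    def __init__(self):
        self.credits = 0
    def send(self, records, receiver):
        sent = 0
        for r in records:
            if self.credits == 0:
                break  # back-pressure: wait for more credits
            receiver.deliver(r)
            self.credits -= 1
            sent += 1
        return sent

rx, tx = Receiver(buffer_slots=2), Sender()
tx.credits += rx.grant_credits()
print(tx.send(["a", "b", "c"], rx))  # -> 2: only 2 slots were free
rx.drain(1)                          # consumer made room
tx.credits += rx.grant_credits()
print(tx.send(["c"], rx))            # -> 1
```

Because credits are granted per receiver buffer, a slow receiver stalls only the sender feeding it, matching the "pairs of communicating tasks only" bullet.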
… management • checkpointing state to remote and persistent storage, e.g. a distributed filesystem or a database system • Available state backends in Flink: • In-memory • File system • RocksDB … State backends …
0 points | 24 pages | 914.13 KB | 1 year ago

监控Apache Flink应用程序(入门) [Monitoring Apache Flink Applications (Getting Started)]
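The state-management entry above describes checkpointing state to persistent storage so it can be restored after a failure. A minimal file-based sketch; real Flink state backends (filesystem, RocksDB) snapshot to a distributed filesystem, not json.dump to a temp file:

```python
import json, os, tempfile

# Hypothetical operator state; field names are invented.
state = {"counts": {"a": 3, "b": 1}, "offset": 42}

def checkpoint(state, path):
    """Write a consistent snapshot to persistent storage."""
    with open(path, "w") as f:
        json.dump(state, f)

def restore(path):
    """Recover the last snapshot after a failure."""
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "chk-1.json")
checkpoint(state, path)
print(restore(path) == state)  # -> True
```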
… persistent message queue, before it is processed by Apache Flink, which then writes the results to a database or calls a downstream system. In such a pipeline, latency can be introduced at each stage and for …
0 points | 23 pages | 148.62 KB | 1 year ago
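The monitoring entry above notes that each pipeline stage (queue, Flink, sink) can add latency. A sketch of measuring end-to-end latency by stamping a record at ingestion and comparing at the end of the pipeline; the per-stage delays here are invented:

```python
import time

def stage(record, work_seconds):
    """A pipeline stage that takes some time to process a record."""
    time.sleep(work_seconds)
    return record

# Timestamp attached at ingestion (e.g. when the event enters the
# message queue); stage delays are made up for illustration.
event = {"payload": "order-1", "ingest_ts": time.monotonic()}
for delay in (0.01, 0.02):  # queue -> Flink -> database
    event = stage(event, delay)

latency = time.monotonic() - event["ingest_ts"]
print(latency >= 0.03)  # -> True: every stage added to end-to-end latency
```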
11 results in total