Scalable Stream Processing - Spark Streaming and Flink

Basic Operations
▶ Most operations on DataFrame/Dataset are supported for streaming.

case class Call(action: String, time: Timestamp, id: Int)
val df: DataFrame = spark.readStream.json("s3://logs")

Aggregations
// count calls per action within 1-hour windows
// streaming DataFrame of schema {action: String, time: Timestamp, id: Int}
val calls = ...
val actionHours = calls.groupBy(col("action"), window(col("time"), "1 hour")).count()

Late Data (3/3)
// count words within 10-minute windows, updating every 5 minutes
// streaming DataFrame of schema {timestamp: Timestamp, word: String}
val words = ...
val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("word"))
  .count()
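A runnable PySpark version of the late-data example above, as a minimal sketch: the socket source, host/port, and console sink are illustrative assumptions; any streaming source that yields an event-time column works the same way.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, split, window

    spark = SparkSession.builder.appName("windowed-wordcount").getOrCreate()

    # Demo source: one "value" column per line; includeTimestamp adds a
    # "timestamp" column we can window on.
    lines = (spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .option("includeTimestamp", "true")
        .load())

    # Split lines into words, keeping the timestamp for windowing.
    words = lines.select(explode(split(col("value"), " ")).alias("word"),
                         col("timestamp"))

    # Tolerate data up to 10 minutes late, then count per 10-minute window,
    # sliding every 5 minutes.
    windowed_counts = (words
        .withWatermark("timestamp", "10 minutes")
        .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("word"))
        .count())

    query = (windowed_counts.writeStream
        .outputMode("update")
        .format("console")
        .start())
    query.awaitTermination()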
PyFlink 1.15 Documentation

[4]: root
      |-- id: TINYINT
      |-- data: STRING

Create a Table from a Pandas DataFrame

[5]: import pandas as pd
     df = pd.DataFrame({'id': [1, 2], 'data': ['Hi', 'Hello']})
     table = table_env.from_pandas(df)

(printed result truncated) ... 2 rows in set

PyFlink Table also provides the conversion back to a pandas DataFrame to leverage the pandas API.

[14]: table.to_pandas()
[14]:    id   data
      0   1     Hi
      1   2  Hello
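An end-to-end sketch of the round trip shown above. It assumes only PyFlink and pandas are installed, and creates the table_env that the documentation snippet takes as given (batch mode chosen for a simple local run):

    import pandas as pd
    from pyflink.table import EnvironmentSettings, TableEnvironment

    # Create the TableEnvironment the documentation snippet assumes.
    table_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

    df = pd.DataFrame({'id': [1, 2], 'data': ['Hi', 'Hello']})

    # pandas DataFrame -> PyFlink Table
    table = table_env.from_pandas(df)
    table.print_schema()  # prints the inferred schema (int64 columns map to BIGINT)

    # PyFlink Table -> pandas DataFrame
    print(table.to_pandas())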
PyFlink 1.16 Documentation — contains the same "Create a Table from a Pandas DataFrame" / to_pandas example as the 1.15 documentation above.
【05 计算平台 蓉荣】Flink 批处理及其应用 — Flink Batch Processing and Its Applications (compute-platform talk)

Batch-engine comparison (translated from Chinese; the columns are inferred from the row contents as MapReduce/Hive, Spark, and Flink; the row preceding "Performance" is truncated and ends "... not yet validated in large-scale production"):

                MapReduce/Hive               Spark                                     Flink
Performance     average (minutes to hours)   fast (seconds)                            excellent (~2x)
Stability       good                         average                                   validated internally at Alibaba
API             poor (MR)                    richest (RDD/DataSet/DataFrame);          rich (Table API); Scala/Java
                                             Python/Scala/R/Java
SQL             HiveSQL                      SparkSQL                                  ANSI SQL
Ease of use     average                      easy to use                               (truncated)
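To make the "rich (Table API)" and "ANSI SQL" rows concrete, a minimal PyFlink sketch in batch mode; the view name, columns, and data are made up for illustration:

    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

    # Build a small in-memory table with the Table API ...
    orders = t_env.from_elements(
        [('alice', 3), ('bob', 5), ('alice', 2)],
        ['name', 'amount'])
    t_env.create_temporary_view('orders', orders)

    # ... and query it with standard SQL.
    result = t_env.sql_query(
        "SELECT name, SUM(amount) AS total FROM orders GROUP BY name")
    print(result.to_pandas())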













