7. UDF in ClickHouse
Computing Task Result Table. Pipeline = Directed Acyclic Graph (DAG) of modules; Module = Input + Task + Output; Task = Query or external program; Query = "CREATE TABLE ... AS SELECT ..." against a database provided by the user. UDF in ClickHouse: • Scalar functions • Aggregate functions & combinators • Table functions & storage engines. Usage examples in our ML systems: data preprocessing, filling invalid values (the type can be passed as a parameter, just like in the CAST function). • Difficulties in cross-platform compatibility • Pull requests #4686 and #5124. Miscellaneous statistics.
0 credits | 29 pages | 1.54 MB | 1 year ago
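ClickHouse can also run user-defined functions as external programs, which matches the "Task = Query or external program" idea in this deck: the server streams rows to a script, one TabSeparated row per line, and reads one result line back per row. A minimal sketch of such a script's core; the `capitalize_words` transform is a made-up example, not anything from the deck:

```python
def capitalize_words(s: str) -> str:
    # The actual transform is arbitrary; any scalar computation fits here.
    return " ".join(w.capitalize() for w in s.split())

def run(lines):
    # One input line per row, one output line per row -- the contract an
    # external-program UDF has to honor so rows stay aligned.
    return [capitalize_words(line.rstrip("\n")) for line in lines]

# When wired to ClickHouse this would simply be:
#   for line in sys.stdin:
#       print(capitalize_words(line.rstrip("\n")))
```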
ClickHouse in Production
In ClickHouse: DDL
    CREATE TABLE EventLogHDFS
    (
        EventTime DateTime,
        BannerID UInt64,
        Cost UInt64,
        CounterType Enum('Hit' = 0, 'Show' = 1, 'Click' = 2)
    )
    ENGINE = HDFS('hdfs://hdfs1:9000/event_log.parq', 'Parquet')
Elapsed: 109.586 sec. Processed 28.75 mln rows.
In ClickHouse: Local Log Copy
    CREATE TABLE EventLogLocal AS EventLogHDFS
    ENGINE = MergeTree() ORDER BY BannerID;
    Ok.
    INSERT INTO EventLogLocal
0 credits | 100 pages | 6.86 MB | 1 year ago
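The local-copy pattern in this excerpt (a MergeTree clone of an HDFS-backed table, then a bulk INSERT) comes down to two statements. A sketch that just builds them as strings; note the `SELECT *` body of the INSERT is an assumption, since the excerpt cuts off mid-statement:

```python
def copy_table_ddl(src: str, dst: str, order_by: str):
    """Build the two statements behind the deck's local-log-copy slide:
    a local MergeTree clone of a remote table, then a full row copy."""
    create = (f"CREATE TABLE {dst} AS {src} "
              f"ENGINE = MergeTree() ORDER BY {order_by}")
    # Assumed completion of the truncated INSERT from the excerpt.
    insert = f"INSERT INTO {dst} SELECT * FROM {src}"
    return create, insert

create, insert = copy_table_ddl("EventLogHDFS", "EventLogLocal", "BannerID")
```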
1. Machine Learning with ClickHouse
    resp = requests.get(url, data=query)
    string_io = io.StringIO(resp.text)
    table = pd.read_csv(string_io, sep="\t")
Table (part)
How to sample data: you already know it! › LIMIT N › WHERE for a fixed sample query › SAMPLE (only for MergeTree)
    SAMPLE x OFFSET y
    CREATE TABLE trips_sample_time
    (
        pickup_datetime DateTime
    )
    ENGINE = MergeTree
    ORDER BY sipHash64(pickup_datetime)
How to store a trained model: you can store the model as an aggregate function state in a separate table. Example:
    CREATE TABLE models ENGINE = MergeTree ORDER BY tuple()
    AS SELECT stochasticLinearRegressionState(total_amount
0 credits | 64 pages | 1.38 MB | 1 year ago
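The sampling trick in this deck works because the table is ordered by a hash of the sampling key, so SAMPLE can take a deterministic, repeatable slice of the hash space. A rough illustration in Python, with SHA-256 standing in for sipHash64:

```python
import hashlib

def in_sample(key: str, fraction: float) -> bool:
    """Keep a row iff the hash of its sampling key falls in the first
    `fraction` of the 64-bit hash space -- deterministic, so repeated
    queries see the same rows, in the spirit of ClickHouse's SAMPLE."""
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return h < fraction * 2**64

# Hypothetical pickup timestamps used as the sampling key.
rows = [f"2015-07-01 00:{m:02d}:00" for m in range(60)]
sample = [r for r in rows if in_sample(r, 0.1)]
```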
0. Machine Learning with ClickHouse
0 credits | 64 pages | 1.38 MB | 1 year ago
3. Sync Clickhouse with MySQL_MongoDB
You can't update/delete tables frequently in ClickHouse. Possible solutions: 2. MySQL engine — not suitable for big tables, not suitable for MongoDB. 3. Re-initialize the whole table every day……
PTS key features: ● only one config file needed for a new ClickHouse table ● init and keep syncing data in one app per table ● sync multiple data sources to ClickHouse in minutes.
PTS Provider, Transform — mongodb, redis
    Listen: binlog,        // binlog, kafka
    DataSource: user:pass@tcp(example.com:3306)/user,
    Table: user,
    QueryKeys: [           // usually primary key
        id
    ],
    Pairs: {               // field mapping
        id: id,
        name: name
0 credits | 38 pages | 7.13 MB | 1 year ago
Column-Orient Model ► (2) Time-Series-Orient Model How we do ► Column-Orient Model How we do CREATE TABLE demonstration.insert_view ( `Time` DateTime, `Name` String, `Age` UInt8, ..., `HeartRate` PARTITION BY toYYYYMM(Time) ORDER BY (Name, Time, Age, ...); ► Column-Orient Model How we do CREATE TABLE demonstration.insert_view ( `Time` DateTime, `Name` LowCardinality(String), `Age` UInt8 rows, 5.19 GB (168.64 million rows/s., 6.07 GB/s.) ► Time-Series-Orient Model How we do CREATE TABLE demonstration.test ( `time_series_interval` DateTime, `metric_name` String, `Name`0 码力 | 42 页 | 911.10 KB | 1 年前3ClickHouse: настоящее и будущее
Обработка графов • Batch jobs • Data Hub Support For Semistructured Data 27 JSO data type: CREATE TABLE games (data JSON) ENGINE = MergeTree; • You can insert arbitrary nested JSONs • Types are automatically games dataset CREATE TABLE games (data String) ENGINE = MergeTree ORDER BY tuple(); SELECT JSONExtractString(data, 'teams', 1, 'name') FROM games; — 0.520 sec. CREATE TABLE games (data JSON) ENGINE teams.name[1] FROM games; — 0.015 sec. Support For Semistructured Data <-- inferred type DESCRIBE TABLE games SETTINGS describe_extend_object_types = 1 name: data type: Tuple( `_id.$oid` String, `date0 码力 | 32 页 | 2.62 MB | 1 年前3ClickHouse: настоящее и будущее
Обработка графов • Batch jobs • Data Hub Support For Semistructured Data 27 JSO data type: CREATE TABLE games (data JSON) ENGINE = MergeTree; • You can insert arbitrary nested JSONs • Types are automatically games dataset CREATE TABLE games (data String) ENGINE = MergeTree ORDER BY tuple(); SELECT JSONExtractString(data, 'teams', 1, 'name') FROM games; — 0.520 sec. CREATE TABLE games (data JSON) ENGINE teams.name[1] FROM games; — 0.015 sec. Support For Semistructured Data <-- inferred type DESCRIBE TABLE games SETTINGS describe_extend_object_types = 1 name: data type: Tuple( `_id.$oid` String, `date0 码力 | 32 页 | 776.70 KB | 1 年前32. Clickhouse玩转每天千亿数据-趣头条
1:趣头条和米读的上报数据是按照”事件类型”(eventType)进行区分 2:指标系统分”分时”和”累时”指标 3:指标的一般都是会按照eventType进行区分 select count(1) from table where dt='' and timestamp>='' and timestamp<='' and eventType='' 建表的时候缺乏深度思考,由于分时指标的特性,我们的表是order 1:max_memory_usage指定单个SQL查询在该机器上面最大内存使用量 2:除了些简单的SQL,空间复杂度是O(1) 如: select count(1) from table where column=value select column1, column2 from table where column=value 凡是涉及group by, order by, distinct, join这样的SQL内存占用不再是O(1)0 码力 | 14 页 | 1.10 MB | 1 年前32. ClickHouse MergeTree原理解析-朱凯
这 些数据片段,属于相同分区的数据片段会被合成一个新的片段。这种数据片 段往复合并的特点也正是合并树的名称由来。 MergeTree的创建方式 CREATE TABLE [IF NOT EXISTS] [db_name.]table_name ( name1 [type] [DEFAULT|MATERIALIZED|ALIAS expr], name2 [type] [DEFAULT|MATERIALIZED|ALIAS0 码力 | 35 页 | 13.25 MB | 1 年前3
共 14 条
- 1
- 2