Design and Implementation of TiDB Observability, 陈霜
Usage By Tag CPU Usage By Tag Another approach to CPU resource bind in Go Goroutine CPU Stats ● Try to collect goroutine runtime information. begin := GetGoroutineStats() executeSQL() end := … ebug/pprof/trace\?seconds\=1 --output trace.out go tool trace trace.out ● In the Goroutine analysis page, choose the goroutine you want to view. Trace - drawback ● Large performance impact. ● May generate … Event goroutine_id description: 0 GoCreate 1 (create a goroutine); 10 GoStart 1 (goroutine starts running); 30 GoBlockSelect 1 (blocked, goroutine stops running); 50 GoUnpark 1 (goroutine is no longer blocked); 60 GoStart 1 (goroutine starts running); 80 …
0 credits | 39 pages | 3.97 MB | 1 year ago
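The snippet above fetches a trace over HTTP and opens it with `go tool trace`. As a rough in-process counterpart, the sketch below records an execution trace with the standard `runtime/trace` package. The helper name `captureTrace` and the toy workload are assumptions made for illustration and do not come from the slides.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/trace"
	"sync"
)

// captureTrace runs work with the Go execution tracer enabled and returns
// the raw trace bytes. The helper name is hypothetical.
func captureTrace(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err // for example, tracing is already enabled
	}
	work()
	trace.Stop() // flushes the remaining events into buf
	return buf.Bytes(), nil
}

func main() {
	data, err := captureTrace(func() {
		// A toy workload: a few goroutines doing a little arithmetic.
		var wg sync.WaitGroup
		for i := 0; i < 4; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				sum := 0
				for j := 0; j < 1000; j++ {
					sum += j
				}
				_ = sum
			}()
		}
		wg.Wait()
	})
	if err != nil {
		fmt.Println("trace failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of trace data\n", len(data))
}
```

Writing the returned bytes to a file such as trace.out should let `go tool trace trace.out` open it. As the slide notes, tracing has a large performance impact, so a sketch like this suits short, targeted captures rather than always-on collection.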
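Tracing in TiDB: A Look at Full-Link Monitoring, from Application to Database to Runtime
… error) Can we go one step further? "I can see where the time was spent, but why did it take so long?" As everyone knows … The soul-searching question: "Is it how my SQL is written, or is goroutine scheduling falling short?" Network Ready Goroutine Wakeup 4.368ms go tool trace ● Pros: easy to use, good-looking (UI) ● Cons: the performance overhead is too high, so it cannot be left on all the time … how does it work? Trace instruments the Go runtime code to collect CPU time: it records start_run_time when a goroutine starts executing and end_run_time when the goroutine is scheduled out, and the accumulated sum of (end_run_time - start_run_time) is that goroutine's CPU time. A little bit about Go runtime https://learnku s-dev2 Tracing Runtime pseudocode // When the goroutine starts running, record the start information gp.lastSchedTime = nanotime() if gp.statCtx != nil { atomic.Xadd64(&gp.statCtx.schedtick, 1) } … // When the goroutine stops running, collect the elapsed run time endTickTime := nanotime() …
0 credits | 39 pages | 3.43 MB | 1 year ago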
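The accumulation rule quoted above (sum of end_run_time minus start_run_time over every run interval) can be sketched in ordinary Go, outside the runtime. The `runStats` type, its field names, and the timestamps below are all made up for illustration. The real change described in the slides lives inside the Go scheduler and is not reproduced here.

```go
package main

import "fmt"

// runStats is a hypothetical per-goroutine accumulator, loosely standing in
// for the statCtx mentioned in the slide's pseudocode.
type runStats struct {
	lastStart int64 // timestamp (ns) when the goroutine last started running
	cpuNanos  int64 // accumulated running time (ns)
	schedTick int64 // number of times the goroutine was scheduled
}

// onStart records the moment the goroutine is put on a CPU.
func (s *runStats) onStart(now int64) {
	s.lastStart = now
	s.schedTick++
}

// onStop adds the interval that just ended to the running total.
func (s *runStats) onStop(now int64) {
	s.cpuNanos += now - s.lastStart
}

func main() {
	s := &runStats{}

	// Made-up timeline: run from 10 to 30, blocked, then run from 60 to 100.
	s.onStart(10)
	s.onStop(30)
	s.onStart(60)
	s.onStop(100)

	fmt.Println(s.cpuNanos, s.schedTick) // prints "60 2"
}
```

Time spent blocked (30 to 60 in this timeline) is deliberately excluded, which is what separates this figure from wall-clock latency.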
2.1.4 PingCAP Go runtime related problems in TiDB production environment
Agenda Part I - Latency in scheduler ● The client consists of a goroutine and a channel ○ The channel batches the requests ○ A goroutine runs a read-send-recv loop Background Description ● When the machine … Network IO is ready => goroutine wake up == 4.3ms ○ Sometimes even 10ms+ latency here! ○ The time spent on runtime scheduling is not negligible ● When the CPU is overloaded, which goroutine should be given priority? Analysis ● The goroutine is special: it blocks all the callers ● The scheduler treats them equally Analysis ● Under heavy workload, goroutines take longer to be scheduled ● The runtime scheduling …
0 credits | 56 pages | 50.15 MB | 6 months ago
1.3 Go practices in TiDB, 姚维
… com Agenda ● How to build a stable database ○ Schrodinger-test platform ○ Failpoint injection ○ Goroutine-leak detection ● Optimization ○ Chunk vs interface{} ○ Vectorized execution TiDB Overview TiDB … } } } } } Let us talk about goroutine leak What is goroutine leak? func main() { go func() { // Just invalid the deadlock detection. for { time * time.Second) } }() done := make(chan bool) leakCh := make(chan string, 1) go func() { // This goroutine is leaked. for { recv, more := <-leakCh if !more { break } fmt.Printf("recv: %v", recv) } done <- …
0 credits | 32 pages | 1.76 MB | 6 months ago
1.2 Go in TiDB
… improvement Go in TiDB • More than 100k lines of Go code and 94 contributors. Goroutine • Starting a goroutine is easy and cheap. • Goroutines come with built-in primitives to communicate safely … Parallel HashJoin Operator Goroutine Leak • Write to a chan with no reader. • Read from a chan with no writer. • How to resolve? • Block profile • Timeout • Context Goroutine Leak Test • A useful tool …
0 credits | 27 pages | 935.47 KB | 6 months ago
TiDB v5.3 Documentation
… remote memory will greatly reduce performance. By default, TiDB will use all CPUs of the server, and goroutine scheduling will inevitably lead to cross-CPU memory access. Therefore, it is recommended to deploy … the usage exceeds 10 G, an alert is triggered. • Solution: Use the HTTP API to troubleshoot the goroutine leak issue. 7.5.1.3.2 TiDB_query_duration • Alert rule: histogram_quantile(0.99, sum(rate(ti … Cause 2: Early versions of TiDB (earlier than v3.0.8) have heavy internal load because of a lot of goroutine at high concurrency. – Cause 3: In early versions (v2.1.15 & versions < v3.0.0-rc1), PD instances …
0 credits | 2996 pages | 49.30 MB | 1 year ago
TiDB v5.2 Documentation
… TiDB configuration feedback-probability. If the value is not 0, the “panic in the recoverable goroutine” error will occur after the upgrade, but this error does not affect the upgrade. • TiDB is now … remote memory will greatly reduce performance. By default, TiDB will use all CPUs of the server, and goroutine scheduling will inevitably lead to cross-CPU memory access. Therefore, it is recommended to deploy … the usage exceeds 10 G, an alert is triggered. • Solution: Use the HTTP API to troubleshoot the goroutine leak issue. 7.5.1.3.2 TiDB_query_duration • Alert rule: histogram_quantile(0.99, sum(rate(ti …
0 credits | 2848 pages | 47.90 MB | 1 year ago
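TiDB v5.2 Chinese Manual
… changes from WARN to OFF. • Before upgrading, check the value of the TiDB configuration item feedback-probability. If it is not 0, the “panic in the recoverable goroutine” error is triggered after the upgrade, but this does not affect the upgrade. • Compatible with the MySQL 5.7 noop variable innodb_default_row_format; setting this variable has no actual effect #23541. • Starting from TiDB 5.2 … DML errors, TiCDC fails fast and exits #1928 * Do not run resolve lock immediately after a Region is initialized #2235 * Optimize workerpool to reduce the number of goroutines under high concurrency #2201 – Dumpling * Split the data of TiDB v3.x tables by tidb_rowid to save TiDB memory #301 * Reduce Dumpling's … information_schema … usage bottleneck, so why is TiDB's CPU utilization still low? Some high-end machines use NUMA-architecture CPUs, where cross-CPU access to remote memory greatly reduces performance. By default, TiDB uses all CPUs of the server, and goroutine scheduling inevitably leads to cross-CPU memory access. Therefore, on NUMA servers it is recommended to deploy n TiDB instances (n = the number of NUMA CPUs) and to set the value of TiDB's max-procs variable …
0 credits | 2259 pages | 48.16 MB | 1 year ago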
TiDB v5.1 Documentation
… TiDB configuration feedback-probability. If the value is not 0, the “panic in the recoverable goroutine” error will occur after the upgrade, but this error does not affect the upgrade. • Upgrade the … remote memory will greatly reduce performance. By default, TiDB will use all CPUs of the server, and goroutine scheduling will inevitably lead to cross-CPU memory access. Therefore, it is recommended to deploy … the usage exceeds 10 G, an alert is triggered. • Solution: Use the HTTP API to troubleshoot the goroutine leak issue. 7.5.1.3.2 TiDB_query_duration • Alert rule: histogram_quantile(0.99, sum(rate(ti …
0 credits | 2745 pages | 47.65 MB | 1 year ago
TiDB v5.4 Documentation
… CPU profiling of TiFlash. – More forms of profiling display: Supports showing CPU profiling and Goroutine results on flame charts. – More deployment environments supported: Continuous Profiling can also … remote memory will greatly reduce performance. By default, TiDB will use all CPUs of the server, and goroutine scheduling will inevitably lead to cross-CPU memory access. Therefore, it is recommended to deploy … the usage exceeds 10 G, an alert is triggered. • Solution: Use the HTTP API to troubleshoot the goroutine leak issue. 7.5.1.3.2 TiDB_query_duration • Alert rule: histogram_quantile(0.99, sum(rate(ti …
0 credits | 3650 pages | 52.72 MB | 1 year ago
39 results in total