2.1.4 PingCAP Go runtime related problems in TiDB production environment
Agenda. Part I - Latency in scheduler
● The client consists of a goroutine and a channel
○ The channel batches the requests
○ A goroutine runs a read-send-recv loop
Background Description
● From the machine's network IO being ready to the goroutine waking up: 4.3ms
○ Sometimes even 10ms+ latency here!
○ The time spent on runtime scheduling is not negligible
● When the CPU is overloaded, which goroutine should be given priority?
Analysis
● This goroutine is special: it blocks all the callers
● The scheduler treats all goroutines equally
Analysis
● Under heavy workload, goroutines take longer to be scheduled
● The runtime scheduling ...
56 pages | 50.15 MB | 6 months ago
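SOFAMOSN: Continuous Evolution Path and Practice Sharing
Protocol not recognized: close the connection. Otherwise: continue reading data.
Technical case: HTTP/2.0 optimization
Problems with the official HTTP/2.0 implementation:
1. Many read syscalls, so efficiency is low
2. Each stream is handled by its own goroutine, so scheduling overhead is high
3. Many temporary objects, so GC takes a large share of the time
4. It mostly implements only the MUST parts of the RFC, and some features do not match requirements, such as the gRPC trailer implementation
Technical case: HTTP/2.0 optimization
Optimization approach: ...
[Diagram: Netpoll implemented in the Golang runtime; each conn has its own goroutine calling conn.read; scheduling switch / readiness notification]
Technical case: RawEpoll mode for the long-connection gateway
RawEpoll mode: after epoll detects a readable event, a goroutine is allocated from the goroutine pool to handle it. This greatly reduces the number of goroutine instances, which lowers memory and scheduling overhead.
[Diagram: Netpoll implemented in the Golang runtime; conn.read; scheduling switch / readiness notification; Raw Epoll; goroutine pool]
3. During request processing, goroutine scheduling is the same as in the classic netpoll mode.
1. After the connection is established, register a oneshot readable-event listener with epoll; at this point no goroutine may call conn.read, to avoid conflicting with the runtime netpoll.
29 pages | 7.03 MB | 6 months ago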
1.3 Go practices in TiDB 姚维
Agenda
● How to build a stable database
○ Schrodinger-test platform
○ Failpoint injection
○ Goroutine-leak detection
● Optimization
○ Chunk vs interface{}
○ Vectorized execution
TiDB Overview ...
Let us talk about goroutine leaks. What is a goroutine leak?
func main() { go func() { // Just invalid the deadlock detection. for { time * time.Second) } }() done := make(chan bool) leakCh := make(chan string, 1) go func() { // This goroutine is leaked. for { recv, more := <-leakCh if !more { break } fmt.Printf("recv: %v", recv) } done <- ...
32 pages | 1.76 MB | 6 months ago
1.2 Go in TiDB
... improvement
Go in TiDB
• More than 100k lines of Go code and 94 contributors.
Goroutine
• Starting a goroutine is easy and cheap.
• Goroutines come with built-in primitives to communicate safely ...
Parallel HashJoin Operator
Goroutine Leak
• Write to a chan with no reader.
• Read from a chan with no writer.
• How to resolve?
  • Block profile
  • Timeout
  • Context
Goroutine Leak Test
• A useful tool ...
27 pages | 935.47 KB | 6 months ago
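
The "Goroutine Leak" bullets in the excerpt above (a write to a chan with no reader, resolved with a timeout or a context) can be illustrated with a minimal Go sketch. The names `produce` and `readOnceThenCancel` are hypothetical and do not come from the slides or from TiDB; this is only one assumed way to apply the "Context" remedy, not the approach the talk itself shows.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// produce (hypothetical name) sends increasing integers on out.
// The ctx.Done() case is what prevents the leak: without it, the
// send would block forever once nobody reads from out.
func produce(ctx context.Context, out chan<- int, done chan<- struct{}) {
	defer close(done) // signal that this goroutine has exited
	for i := 0; ; i++ {
		select {
		case out <- i:
		case <-ctx.Done():
			return
		}
	}
}

// readOnceThenCancel (hypothetical name) reads a single value, stops
// reading, cancels the context, and reports whether the producer
// goroutine exited within one second.
func readOnceThenCancel() (first int, exited bool) {
	ctx, cancel := context.WithCancel(context.Background())
	out := make(chan int) // unbuffered: a send needs a reader
	done := make(chan struct{})
	go produce(ctx, out, done)

	first = <-out // the reader goes away after this
	cancel()      // tell the producer to stop

	select {
	case <-done:
		return first, true
	case <-time.After(time.Second):
		return first, false // the producer would be leaked
	}
}

func main() {
	first, exited := readOnceThenCancel()
	fmt.Println("first value:", first, "producer exited:", exited)
}
```

Removing the `ctx.Done()` case from `produce` reproduces the "write to a chan with no reader" situation: the goroutine stays blocked on the send for the lifetime of the process.
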
TiDB v8.2 Documentation
... remote memory will greatly reduce performance. By default, TiDB will use all CPUs of the server, and goroutine scheduling will inevitably lead to cross-CPU memory access. Therefore, it is recommended to deploy ...
... the usage exceeds 10 G, an alert is triggered.
• Solution: Use the HTTP API to troubleshoot the goroutine leak issue.
9.6.1.3.2 TiDB_query_duration
• Alert rule: histogram_quantile(0.99, sum(rate(ti ...
... Cause 3: Early versions of TiDB (earlier than v3.0.8) have heavy internal load because of a lot of goroutines at high concurrency.
• Cause 4: In early versions (v2.1.15 & versions < v3.0.0-rc1), PD instances ...
6549 pages | 108.77 MB | 10 months ago
TiDB v8.3 Documentation
... remote memory will greatly reduce performance. By default, TiDB will use all CPUs of the server, and goroutine scheduling will inevitably lead to cross-CPU memory access. Therefore, it is recommended to deploy ...
... the usage exceeds 10 G, an alert is triggered.
• Solution: Use the HTTP API to troubleshoot the goroutine leak issue.
9.6.1.3.2 TiDB_query_duration
• Alert rule: histogram_quantile(0.99, sum(rate(ti ...
... Cause 3: Early versions of TiDB (earlier than v3.0.8) have heavy internal load because of a lot of goroutines at high concurrency.
• Cause 4: In early versions (v2.1.15 & versions < v3.0.0-rc1), PD instances ...
6606 pages | 109.48 MB | 10 months ago
TiDB v8.4 Documentation
... remote memory will greatly reduce performance. By default, TiDB will use all CPUs of the server, and goroutine scheduling will inevitably lead to cross-CPU memory access. Therefore, it is recommended to deploy ...
... the usage exceeds 10 G, an alert is triggered.
• Solution: Use the HTTP API to troubleshoot the goroutine leak issue.
9.6.1.3.2 TiDB_query_duration
• Alert rule: histogram_quantile(0.99, sum(rate(ti ...
... Cause 3: Early versions of TiDB (earlier than v3.0.8) have heavy internal load because of a lot of goroutines at high concurrency.
• Cause 4: In early versions (v2.1.15 & versions < v3.0.0-rc1), PD instances ...
6705 pages | 110.86 MB | 10 months ago
TiDB v8.1 Documentation
... remote memory will greatly reduce performance. By default, TiDB will use all CPUs of the server, and goroutine scheduling will inevitably lead to cross-CPU memory access. Therefore, it is recommended to deploy ...
... the usage exceeds 10 G, an alert is triggered.
• Solution: Use the HTTP API to troubleshoot the goroutine leak issue.
9.6.1.3.2 TiDB_query_duration
• Alert rule: histogram_quantile(0.99, sum(rate(ti ...
... Cause 3: Early versions of TiDB (earlier than v3.0.8) have heavy internal load because of a lot of goroutines at high concurrency.
• Cause 4: In early versions (v2.1.15 & versions < v3.0.0-rc1), PD instances ...
6479 pages | 108.61 MB | 10 months ago
TiDB v8.5 Documentation
... remote memory will greatly reduce performance. By default, TiDB will use all CPUs of the server, and goroutine scheduling will inevitably lead to cross-CPU memory access. Therefore, it is recommended to deploy ...
... panel is as follows:
• Uptime: The time for which TiKV nodes and TiCDC nodes have been running
• Goroutine count: The number of goroutines of a TiCDC node
• Open FD count: The number of file handles opened ...
... the usage exceeds 10 G, an alert is triggered.
• Solution: Use the HTTP API to troubleshoot the goroutine leak issue.
9.6.1.3.2 TiDB_query_duration
• Alert rule: histogram_quantile(0.99, sum(rate(ti ...
6730 pages | 111.36 MB | 10 months ago
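TiDB v8.4 Chinese Manual
... usage bottleneck, why is TiDB's CPU utilization still low?
On some high-end devices with NUMA-architecture CPUs, cross-CPU access to remote memory greatly reduces performance. By default, TiDB uses all CPUs of the server, and goroutine scheduling inevitably leads to cross-CPU memory access. Therefore, on servers with a NUMA architecture, it is recommended to deploy n TiDB instances (n = the number of NUMA CPUs) and to set the value of TiDB's max-procs variable ...
... tes{job="tidb"} > 1e+10
• Rule description: Monitors TiDB memory usage. If the usage exceeds 10 G, an alert is triggered.
• Solution: Use the HTTP API to troubleshoot the goroutine leak issue.
9.6.1.3.2 TiDB_query_duration
• Alert rule: histogram_quantile(0.99, sum(rate(tidb_ser ...
... panic; report a bug.
- PD OOM: see 5.3 PD OOM issues.
- Other causes: capture the goroutines with curl http://127.0.0.1:2379/debug/pprof/goroutine?debug=2 and report a bug.
• 5.2.4 Other issues
- PD reports a FATAL error and the log contains range failed to find revision pair. This issue is fixed in v3.0.8, see ...
5072 pages | 104.05 MB | 10 months ago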