[Wang Qiong] Evolution of Container Monitoring Architecture
Wang Qiong, YY Live (YY直播)
23 pages | 2.17 MB | 1 year ago
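Design, Evolution, and Practice of Bilibili's Unified Monitoring System
... process monitoring.
Business layer: QPS/TPS, latency distribution, saturation, throughput, dependency responses, cache hit rate, call chain (tracing), SLA, logs.
Playback quality: VOD/live, playback stutter, average time to first frame, playback failure rate, danmaku (bullet comment) loading, CDN quality.
Client quality: user-side network quality, hijacking, crashes and freezes, return codes.
... Alert thresholds need to be adjusted as traffic changes (wrong). Suggested alert rule: slow-request ratio of service A > 80%.
Case 2. Alert rule: available disk capacity < 10%. Alert rule: disk capacity is predicted to be exhausted within 3 hours (chart axis: -1h, now, +3h):
predict_linear(node_filesystem_free{}[1h], 3 * 3600) < 0
Anomaly detection: abnormal traffic, abs(requests - req ...
34 pages | 650.25 KB | 1 year ago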
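The disk-saturation rule in the snippet above relies on PromQL's `predict_linear`, which fits a straight line to the samples in the range window and extrapolates it forward. The following is a minimal Python sketch of that idea, not Prometheus's own implementation. The function names `predict_linear` and `disk_will_fill` and the sample data are hypothetical. The sketch also assumes the extrapolation starts from the timestamp of the latest sample, whereas Prometheus extrapolates from the query evaluation time.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, value)


def predict_linear(samples: List[Sample], horizon_s: float) -> float:
    """Fit a least-squares line to the samples and return the value
    predicted horizon_s seconds after the latest sample (an assumption;
    Prometheus uses the evaluation time as the reference point)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")

    # Shift timestamps so the latest sample sits at x = 0.
    t_ref = samples[-1][0]
    xs = [t - t_ref for t, _ in samples]
    ys = [v for _, v in samples]

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # fitted value at x = 0
    return intercept + slope * horizon_s


def disk_will_fill(samples: List[Sample], horizon_s: float = 3 * 3600) -> bool:
    """Mirror the rule `predict_linear(...[1h], 3 * 3600) < 0`:
    fire when the extrapolated free space drops below zero."""
    return predict_linear(samples, horizon_s) < 0


if __name__ == "__main__":
    # Hypothetical one-hour window: free space falls from 1000 to 500 units.
    shrinking = [(0.0, 1000.0), (1800.0, 750.0), (3600.0, 500.0)]
    print(disk_will_fill(shrinking))
```

Compared with a static threshold such as "available capacity < 10%", a trend-based rule of this kind fires according to how fast the disk is filling, which is the point the snippet makes about thresholds that would otherwise need adjusting as traffic changes.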
Prometheus Deep Dive - Monitoring. At scale.
... backend. Caveat: Prometheus 2.0 comes with storage v3. Staleness handling. Remote read & write API is now stable-ish. Links to in-depth talks about these features are at the end. ... Richard Hartmann & Frederic ... Remote read API: Playing nicely with others. We now have a stable-ish remote read/write API. Twelve integrations for this API. Ongoing work to send write-ahead-log ... Cortex. On storage level, there are object storage backends for Prometheus, e.g. Thanos. Remote API can now send WAL over the wire to fill gaps in data. There are twelve different systems which are able to ingest ...
34 pages | 370.20 KB | 1 year ago
Intro to Prometheus - With a dash of operations & observability
... endpoint. Hard API commitments within major versions. No built-in TLS yet, use reverse proxies for now. ... Richard Hartmann & Frederic Branczyk (@TwitchiH & @fredbrancz) ...
19 pages | 63.73 KB | 1 year ago
4 results in total