Prometheus Deep Dive - Monitoring. At scale.
Outline: Introduction; Intro; 2.0 to 2.2.1; 2.4 - 2.6; Beyond; Outro. Richard Hartmann & Frederic Branczyk (@TwitchiH & @fredbrancz), 2018-12-12. … environments in mind. It's the second project ever to join CNCF and the de facto standard in cloud-native monitoring: Kubelets, sidecars, microservices, ALL the cloud-native. But it's a monolithic application. … What do you need for operations? Power and cooling; network connectivity; observability, a.k.a. monitoring. The rest you can fix. …
34 pages | 370.20 KB
Intro to Prometheus - With a dash of operations & observability
Prometheus is a pull-based system. Black-box monitoring: looking at a service from the outside (does the server answer HTTP requests?). White-box monitoring: instrumenting code from the inside (How much …
19 pages | 63.73 KB
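To make the white-box idea in this snippet concrete, here is a minimal sketch using only the Python standard library. It is not taken from the slides: the names `http_requests`, `handle_request`, and `render_metrics` are hypothetical, and a real service would normally use a Prometheus client library and serve the text over HTTP for the server to pull.

```python
from collections import Counter

# Hypothetical in-process counter, incremented by the service's own code
# (white-box: the service reports on itself from the inside).
http_requests = Counter()


def handle_request(path: str) -> None:
    """Stand-in for a request handler that instruments itself."""
    http_requests[path] += 1


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format.

    In a pull-based setup, a scraper would fetch this text periodically.
    """
    lines = ["# TYPE http_requests_total counter"]
    for path, count in sorted(http_requests.items()):
        lines.append(f'http_requests_total{{path="{path}"}} {count}')
    return "\n".join(lines) + "\n"


for p in ["/", "/", "/health"]:
    handle_request(p)

metrics_text = render_metrics()
print(metrics_text)
```

A black-box check, by contrast, would only probe the service from the outside, for example by testing whether an HTTP request gets an answer.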
OpenMetrics - Standing on the shoulders of Titans
Outline: Introduction; Quick intro; OpenMetrics; Outro. Problem statement, before Prometheus: historically, the monitoring landscape has been highly fragmented. Many solutions are based on ancient technology. Most data formats … Problem statement, after Prometheus: Prometheus has become a de facto standard in cloud-native metric monitoring. Ease of exposing data has led to an explosion in compatible metrics endpoints. Prometheus' exposition …
21 pages | 84.83 KB
Alert OnCall Incident Center Construction Methodology White Paper
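Convergence is generally done in three levels: event -> alert -> incident. As an example, take the most basic alert event: host1 raises a cpu_usage_idle alert at timestamp1, and we call this an event. If the condition has not recovered, then after some time, say at timestamp1 + 60min, another alert is usually sent, again for host1 and again for the cpu_usage_idle metric. Clearly, these two alert events are related … In the example above, suppose the alert rule ID is 32 and the label set is ["__name__=cpu_usage_idle", "host=host1"]; the alert events produced at these two timestamps then have the same hash value. It is computed as: hash(32 + ["__name__=cpu_usage_idle", "host=host1"]). We call this convergence from event to alert first-level convergence. This level of convergence alone is not enough; the alert information is still rather … How should aggregation be done?
Alert aggregation: aggregating events into alerts is fairly easy. Typically an algorithm like the following is used to compute the relationship between different events: hash(32 + ["__name__=cpu_usage_idle", "host=host1"]). We will call this value the event hash; events with the same hash are aggregated into a single alert. Merging alerts into incidents is more complex. Currently we support rule-based aggregation, and algorithm-based aggregation will follow: for example …
23 pages | 1.75 MB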
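The first-level convergence described in this snippet can be sketched as follows. This is only an illustration under stated assumptions, not the white paper's implementation: the snippet gives just `hash(32 + [...])`, so the choice of SHA-256, the canonical string layout, the sorting of labels, and the names `event_hash` and `group_events` are all hypothetical.

```python
import hashlib
from collections import defaultdict


def event_hash(rule_id: int, labels: list[str]) -> str:
    """Hash the rule ID plus the label set.

    The timestamp is deliberately left out, so repeated firings of the same
    rule on the same labels map to the same hash. Labels are sorted so that
    their order does not matter (an assumption, not stated in the source).
    """
    canonical = str(rule_id) + "|" + ",".join(sorted(labels))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def group_events(events: list[dict]) -> dict[str, list[dict]]:
    """Aggregate events into alerts: one alert per distinct event hash."""
    alerts = defaultdict(list)
    for ev in events:
        alerts[event_hash(ev["rule_id"], ev["labels"])].append(ev)
    return dict(alerts)


events = [
    # Same rule and labels at timestamp1 and timestamp1 + 60 min (in seconds).
    {"rule_id": 32, "labels": ["__name__=cpu_usage_idle", "host=host1"], "timestamp": 0},
    {"rule_id": 32, "labels": ["host=host1", "__name__=cpu_usage_idle"], "timestamp": 3600},
    # A different host yields a different hash, hence a separate alert.
    {"rule_id": 32, "labels": ["__name__=cpu_usage_idle", "host=host2"], "timestamp": 0},
]

alerts = group_events(events)
print(len(alerts))  # 2
```

The two host1 events collapse into one alert while host2 gets its own. The alert-to-incident merge that the snippet calls more complex is not covered here, since the source text is cut off before describing its rules.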
4 results in total