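00: Introduction
This article covers alert management in Prometheus.
01: Deploying Alertmanager
As before, we deploy from the binary release downloaded from the official site.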
wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz
tar -xf alertmanager-0.23.0.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/
ln -s alertmanager-0.23.0.linux-amd64/ alertmanager
/lib/systemd/system/prometheus-alertmanager.service
[Unit]
Description=Alertmanager for prometheus
Documentation=https://prometheus.io/docs/alerting/alertmanager/
[Service]
Restart=always
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data --web.listen-address=0.0.0.0:9093
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
SendSIGKILL=no
[Install]
WantedBy=multi-user.target
Load the service file, then start and enable the service:
systemctl daemon-reload
systemctl start prometheus-alertmanager.service
systemctl enable prometheus-alertmanager.service
systemctl status prometheus-alertmanager.service
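Check the listening ports:
ss -nutlp | grep alertmanager
# Normally it listens on 9093 and 9094
Port 9093 serves the web UI and the HTTP API that Prometheus sends alerts to; 9094 is the cluster communication port.
Then open the web UI on port 9093.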
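02: Integrating with Prometheus
For production, it is recommended to deploy Prometheus and Alertmanager separately and to run them as a highly available cluster.
cd /usr/local/prometheus/cfg
/usr/local/prometheus/cfg/prometheus.yml
alerting:
  alertmanagers:            # Alertmanager addresses
    - static_configs:
        - targets:
            - localhost:9093
rule_files:                 # alerting rule files
  - "rules/*.yml"
Restart the Prometheus service and check the result in the web UI: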
Status -> Runtime & Build Information
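Besides the status page, Prometheus exposes the Alertmanagers it has discovered at /api/v1/alertmanagers. The following is a minimal sketch of reading that response. The sample body is hypothetical and only mirrors the documented shape, and the Prometheus address mentioned in the note afterwards is an assumption.

```python
import json


def active_alertmanagers(response_body: str) -> list:
    """Return the URLs of active Alertmanagers from an /api/v1/alertmanagers response."""
    doc = json.loads(response_body)
    if doc.get("status") != "success":
        raise ValueError(f"unexpected status: {doc.get('status')!r}")
    return [am["url"] for am in doc["data"]["activeAlertmanagers"]]


# Hypothetical response body, shaped like the documented API output.
sample = json.dumps({
    "status": "success",
    "data": {
        "activeAlertmanagers": [{"url": "http://localhost:9093/api/v2/alerts"}],
        "droppedAlertmanagers": [],
    },
})
print(active_alertmanagers(sample))
```

On a live host the body could be fetched with urllib.request.urlopen from, for example, http://localhost:9090/api/v1/alertmanagers, assuming Prometheus listens on its default port 9090.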
03: Configuring alerting rules
mkdir /usr/local/prometheus/cfg/rules
/usr/local/prometheus/cfg/rules/k8s.yml
groups:
  - name: kubernetes
    rules:
      - alert: InstanceDown
        expr: up == 0        # every instance has an `up` series: 0 means the scrape failed, 1 means it succeeded
        for: 1m              # fire only after `up == 0` has held continuously for 1 minute
        labels:
          severity: error    # alert severity
        annotations:         # descriptive annotations attached to the alert
          summary: "Instance {{ $labels.instance }} has stopped working"
          description: "{{ $labels.instance }} job {{ $labels.job }} has stopped working for more than 1 minute"
Validate the rules file:
# promtool check rules k8s.yml
Checking k8s.yml
SUCCESS: 1 rules found
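The effect of the `for: 1m` clause can be illustrated with a small state machine. This is only a sketch of the idea (inactive, then pending, then firing), not Prometheus's actual implementation. The 15-second evaluation interval and the timestamps are assumptions made for the example.

```python
from datetime import datetime, timedelta


def alert_state(samples, hold=timedelta(minutes=1)):
    """Return 'inactive', 'pending' or 'firing' after evaluating samples in order.

    samples: list of (timestamp, up_value) pairs, one per rule evaluation.
    """
    active_since = None
    state = "inactive"
    for ts, up in samples:
        if up == 0:
            if active_since is None:
                active_since = ts  # the condition just became true
            state = "firing" if ts - active_since >= hold else "pending"
        else:
            active_since = None    # any recovery resets the timer
            state = "inactive"
    return state


t0 = datetime(2022, 1, 1, 12, 0, 0)
step = timedelta(seconds=15)                   # assumed evaluation interval
down = [(t0 + i * step, 0) for i in range(5)]  # up == 0 from 0s to 60s
print(alert_state(down[:3]))  # condition has held for 30s
print(alert_state(down))      # condition has held for 60s
```

The point of the sketch is that a target which recovers before the hold time elapses never leaves the pending state, so no notification is sent for short blips.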
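After restarting, check the web console.
Disconnect one of the monitored nodes and observe the effect.
04: Alert flow
05: Configuring the email alert channel
Official reference: https://prometheus.io/docs/alerting/latest/configuration/#configuration-file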
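global:
  resolve_timeout: 5m                          # an alert is treated as resolved if it has not been updated within this time
  smtp_smarthost: 'smtp.126.com:25'            # SMTP server address
  smtp_from: 'linux98_mail@126.com'            # sender address
  smtp_auth_username: 'linux98_mail@126.com'   # SMTP username
  smtp_auth_password: 'HJKXELGNVVTVPQXC'       # mailbox authorization code
  smtp_hello: '126.com'                        # hostname announced to the SMTP server
  smtp_require_tls: false                      # whether to require TLS; if enabled, change the port in smtp_smarthost to match
route:
  group_by: ['alertname']    # labels used to group alerts
  group_wait: 30s            # wait after the first alert of a new group, so that further alerts for the group are sent together
  group_interval: 30s        # wait before notifying about new alerts added to a group that was already notified
  repeat_interval: 1m        # wait before re-sending a notification for alerts that are still firing
  receiver: 'email'          # default receiver
receivers:                   # notification channels
  - name: 'email'            # receiver name
    email_configs:           # email settings
      - to: 'linux98_mail@yeah.net'   # recipient
        send_resolved: true           # also notify when the alert is resolved
inhibit_rules:               # suppress less important alerts while a critical one is firing
  - source_match:            # matchers for the alert that does the suppressing
      severity: 'critical'   # source alerts have severity critical
    target_match:            # matchers for the alerts to be suppressed
      severity: 'warning'    # target alerts have severity warning
    equal: ['alertname', 'dev', 'instance']   # source and target must have equal values for these labels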
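To make the inhibit rule concrete, here is a rough Python sketch of the decision it describes. It is a simplification under stated assumptions: only exact-match matchers are handled, and a label missing on both sides is treated as equal. The sample alerts use hypothetical critical and warning severities, since the InstanceDown rule earlier is labelled severity: error and would not be touched by this rule.

```python
def is_inhibited(target, firing_alerts, rule):
    """Return True if `target` is muted by some firing alert under `rule`."""
    def matches(labels, matchers):
        return all(labels.get(k) == v for k, v in matchers.items())

    if not matches(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # Labels listed in `equal` must agree; a missing label counts as empty.
        if all(source.get(name, "") == target.get(name, "") for name in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}
critical = {"alertname": "InstanceDown", "instance": "192.168.31.11:9100", "severity": "critical"}
warning_same = {"alertname": "InstanceDown", "instance": "192.168.31.11:9100", "severity": "warning"}
warning_other = {"alertname": "InstanceDown", "instance": "192.168.31.12:9100", "severity": "warning"}

print(is_inhibited(warning_same, [critical], rule))
print(is_inhibited(warning_other, [critical], rule))
```

The warning on the same instance is muted while the critical alert fires, whereas the warning on a different instance is still delivered because its `instance` label differs.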
After configuring, restart the prometheus-alertmanager service.
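To try the email channel without taking a node offline, one option is to push a synthetic alert to Alertmanager's v2 API. The sketch below builds such a payload; the label values are illustrative, the base URL assumes the local deployment from section 01, and the actual POST is left commented out.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(instance, job, now=None):
    """Build a one-element alert list in the shape accepted by POST /api/v2/alerts."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "InstanceDown",
            "instance": instance,
            "job": job,
            "severity": "error",
        },
        "annotations": {"summary": f"Instance {instance} has stopped working"},
        "startsAt": now.isoformat(),
    }]


def post_alerts(alerts, base_url="http://localhost:9093"):
    """Send the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


payload = build_test_alert("192.168.31.11:9100", "mysql-prod")
print(json.dumps(payload, indent=2))
# post_alerts(payload)  # uncomment on a host where Alertmanager is reachable
```

Because no endsAt is given, Alertmanager should treat the alert as resolved once resolve_timeout passes without a refresh, so a test alert like this clears itself.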
06: Configuring multiple routes and alert channels
Alertmanager dispatches alerts through route. To use several alert channels, define routes under route and match alerts against them.
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 1m
  receiver: 'email'
  routes:
    - match:
        idc: m8-a1-a1
      group_by: [env, app]
      receiver: 'm8'
    - match_re:
        app: http-dev|mysql-dev
      receiver: 'dev'
    - match_re:
        app: redis-prod|redis-dev
      receiver: 'redis'
receivers:
  - name: 'email'
    email_configs:
      - to: 'linux98_mail@yeah.net'
        send_resolved: true
        headers: { Subject: "[email] Alert email"}
  - name: 'm8'
    email_configs:
      - to: 'linux98_mail@yeah.net'
        send_resolved: true
        headers: { Subject: "[m8] Alert email"}
  - name: 'dev'
    email_configs:
      - to: 'linux98_mail@yeah.net'
        send_resolved: true
        headers: { Subject: "[dev] Alert email"}
  - name: 'redis'
    email_configs:
      - to: 'linux98_mail@yeah.net'
        send_resolved: true
        headers: { Subject: "[redis] Alert email"}
With the child routes configured, we use six machines to simulate services of different types in different data centers and monitor them:
  - job_name: "mysql-prod"
    metrics_path: '/metrics'
    static_configs:
      - targets: ["192.168.31.11:9100"]
        labels:
          app: mysql-prod
          env: prod
          idc: m8-a1-a1
  - job_name: "http-prod"
    metrics_path: '/metrics'
    static_configs:
      - targets: ["192.168.31.12:9100"]
        labels:
          app: http-prod
          env: prod
          idc: m8-a1-a1
  - job_name: "mysql-dev"
    metrics_path: '/metrics'
    static_configs:
      - targets: ["192.168.31.13:9100"]
        labels:
          app: mysql-dev
          env: dev
          idc: m8-a1-a2
  - job_name: "http-dev"
    metrics_path: '/metrics'
    static_configs:
      - targets: ["192.168.31.21:9100"]
        labels:
          app: http-dev
          env: dev
          idc: m8-a1-a2
  - job_name: "redis-dev"
    metrics_path: '/metrics'
    static_configs:
      - targets: ["192.168.31.22:9100"]
        labels:
          app: redis-dev
          env: dev
          idc: m8-a1-a3
  - job_name: "redis-prod"
    metrics_path: '/metrics'
    static_configs:
      - targets: ["192.168.31.23:9100"]
        labels:
          app: redis-prod
          env: prod
          idc: m8-a1-a3
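Using a batch management tool, stop the node_exporter service on all six machines and watch the emails.
Conclusion:
The entries in routes under route are evaluated in order, and an alert is handled by the first entry it matches:
- Alerts from all targets with idc=m8-a1-a1 are matched first
  - They are grouped by group_by, and each distinct group is emailed separately
  - They are sent to the m8 channel
- Alerts not matched above are then checked against the regex app=http-dev or app=mysql-dev
  - Matches are sent to the dev channel
- The remaining alerts are checked against app=redis-prod or app=redis-dev
  - Matches are sent to the redis channel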
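The first-match behaviour summarized above can be sketched in a few lines of Python. This is an illustration of the matching order only, not Alertmanager's code. It assumes the default behaviour where an alert stops at the first matching child route, and it anchors the regexes the way Alertmanager does. The label sets mirror the six scrape jobs.

```python
import re

ROUTES = [
    {"match": {"idc": "m8-a1-a1"}, "receiver": "m8"},
    {"match_re": {"app": "http-dev|mysql-dev"}, "receiver": "dev"},
    {"match_re": {"app": "redis-prod|redis-dev"}, "receiver": "redis"},
]
DEFAULT_RECEIVER = "email"


def route_matches(route, labels):
    """True if every exact and every regex matcher of the route is satisfied."""
    exact = all(labels.get(k) == v for k, v in route.get("match", {}).items())
    # Regex matchers are anchored, so the whole label value must match.
    regex = all(
        re.fullmatch(pattern, labels.get(k, "")) is not None
        for k, pattern in route.get("match_re", {}).items()
    )
    return exact and regex


def pick_receiver(labels):
    """Walk the child routes in order and stop at the first match."""
    for route in ROUTES:
        if route_matches(route, labels):
            return route["receiver"]
    return DEFAULT_RECEIVER


targets = {
    "mysql-prod": {"app": "mysql-prod", "env": "prod", "idc": "m8-a1-a1"},
    "http-prod": {"app": "http-prod", "env": "prod", "idc": "m8-a1-a1"},
    "mysql-dev": {"app": "mysql-dev", "env": "dev", "idc": "m8-a1-a2"},
    "http-dev": {"app": "http-dev", "env": "dev", "idc": "m8-a1-a2"},
    "redis-dev": {"app": "redis-dev", "env": "dev", "idc": "m8-a1-a3"},
    "redis-prod": {"app": "redis-prod", "env": "prod", "idc": "m8-a1-a3"},
}
for job, labels in targets.items():
    print(job, "->", pick_receiver(labels))
```

An alert that matches none of the child routes falls back to the top-level receiver, which is why the default email receiver should always stay defined.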
07: Configuring the webhook alert channel
7.1: Installing prometheus-webhook-dingtalk
Project page: https://github.com/timonwong/prometheus-webhook-dingtalk
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.0.0/prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz
tar -xf prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/
ln -s prometheus-webhook-dingtalk-2.0.0.linux-amd64/ prometheus-webhook-dingtalk
Service unit file:
/lib/systemd/system/prometheus-webhook-dingtalk.service
[Unit]
Description=prometheus-webhook-dingtalk
Documentation=https://github.com/timonwong/prometheus-webhook-dingtalk
[Service]
Restart=always
ExecStart=/usr/local/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/usr/local/prometheus-webhook-dingtalk/config.yml
[Install]
WantedBy=multi-user.target
Edit the configuration file:
/usr/local/prometheus-webhook-dingtalk/config.yml
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/sendxxxxxxxxxxxxxxxxxx
    # webhook URL
    secret: xxxxxxxxxxxxxxxxxxx
    # security mode: secret (signed requests)
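For context on what the secret is used for: as far as DingTalk's custom-robot documentation describes it, each request carries a timestamp and an HMAC-SHA256 signature derived from that timestamp and the secret. prometheus-webhook-dingtalk handles this internally, so the sketch below is purely illustrative. The URL, token and secret are placeholders.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def signed_url(webhook_url, secret, timestamp_ms=None):
    """Append `timestamp` and `sign` query parameters to a robot webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    # The string to sign is "<timestamp>\n<secret>", keyed with the secret itself.
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"), string_to_sign.encode("utf-8"), hashlib.sha256
    ).digest()
    sign = urllib.parse.quote_plus(base64.b64encode(digest))
    separator = "&" if "?" in webhook_url else "?"
    return f"{webhook_url}{separator}timestamp={timestamp_ms}&sign={sign}"


# Placeholder values, not real credentials.
print(signed_url("https://example.invalid/robot/send?access_token=TOKEN", "SECexample"))
```

Since the signature depends on the current time, clock drift on the host running the webhook bridge is a plausible cause when DingTalk rejects otherwise correct requests.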
Start the service:
systemctl daemon-reload
systemctl start prometheus-webhook-dingtalk
# Listens on port 8060 by default
7.2: Configuring Alertmanager
/usr/local/alertmanager/alertmanager.yml
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 1m
  receiver: 'email'
  routes:
    - match:
        idc: m8-a1-a1
      group_by: [env, app]
      receiver: 'm8'
    - match_re:
        app: http-dev|mysql-dev
      receiver: 'dev'
    - match_re:
        app: redis-prod|redis-dev
      receiver: 'dingtalk'
receivers:
  # the email, m8 and dev receivers from section 06 stay as they are
  - name: 'dingtalk'
    webhook_configs:
      - url: http://localhost:8060/dingtalk/webhook1/send
        send_resolved: true
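A webhook receiver simply gets an HTTP POST with a JSON document describing the alert group. The sketch below shows the general shape of that document and one way to condense it into a short message. The sample payload is hand-written for illustration and contains only a subset of the fields, and summarize is a hypothetical helper rather than part of any library.

```python
def summarize(payload):
    """Turn an Alertmanager webhook payload into a short plain-text message."""
    header = (
        f"[{payload['status'].upper()}] receiver={payload['receiver']} "
        f"alerts={len(payload['alerts'])}"
    )
    lines = [header]
    for alert in payload["alerts"]:
        labels = alert["labels"]
        lines.append(
            f"- {alert['status']}: {labels.get('alertname', '?')} "
            f"on {labels.get('instance', '?')}"
        )
    return "\n".join(lines)


# Hand-written sample with a subset of the webhook fields.
sample = {
    "version": "4",
    "status": "firing",
    "receiver": "dingtalk",
    "groupLabels": {"alertname": "InstanceDown"},
    "commonLabels": {"alertname": "InstanceDown", "severity": "error"},
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "InstanceDown",
                "instance": "192.168.31.22:9100",
                "app": "redis-dev",
            },
            "annotations": {"summary": "Instance 192.168.31.22:9100 has stopped working"},
            "startsAt": "2022-01-01T12:00:00Z",
        }
    ],
}
print(summarize(sample))
```

With send_resolved: true, the same endpoint later receives a payload whose status is "resolved", so a receiver should handle both values.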
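7.3: Testing
Shut down the target hosts with app=redis-prod or app=redis-dev and check DingTalk.
08: Configuring an email alert template
8.1: Configuring Prometheus
First update the Prometheus rule so that it passes the alert's current value ($value) along as an annotation: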
/usr/local/prometheus/cfg/rules/k8s.yml
groups:
  - name: kubernetes
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} has stopped working"
          description: "{{ $labels.instance }} job {{ $labels.job }} has stopped working for more than 1 minute"
          value: "{{ $value }}"
Restart Prometheus.
8.2: Creating the template
/usr/local/alertmanager/templates/email.tmpl
{{ define "test.html" }}
<table border="1">
<tr>
<td>Alert</td>
<td>Instance</td>
<td>Value</td>
<td>Start time</td>
</tr>
{{ range $i, $alert := .Alerts }}
<tr>
<td>{{ index $alert.Labels "alertname" }}</td>
<td>{{ index $alert.Labels "instance" }}</td>
<td>{{ index $alert.Annotations "value" }}</td>
<td>{{ $alert.StartsAt }}</td>
</tr>
{{ end }}
</table>
{{ end }}
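8.3: Configuring Alertmanager
templates:                   # template file paths
  - 'templates/*.tmpl'
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 1m
  receiver: 'email'
  routes:
    - match:
        idc: m8-a1-a1
      group_by: [env, app]
      receiver: 'm8'
receivers:
  # the default 'email' receiver must also remain defined here
  - name: 'm8'
    email_configs:
      - to: 'linux98_mail@yeah.net'
        send_resolved: true
        html: '{{ template "test.html" . }}'       # reference the template
        headers: { Subject: "[m8] Alert email"}    # email subject
Restart the Alertmanager service.
8.4: Testing
Shut down the target hosts with idc=m8-a1-a1 and check the result.
With this template the recovery email looks the same as the alert email, which still needs improvement.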
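One possible direction for that improvement is to branch on each alert's status and show the end time for resolved alerts. The real template is written in Go template syntax, where the per-alert fields are .Status, .StartsAt and .EndsAt. The Python sketch below only illustrates the row-building logic, with hand-written sample alerts.

```python
from html import escape


def render_table(alerts):
    """Render alerts as an HTML table with a status column and a status-dependent time."""
    rows = [
        '<table border="1">',
        "<tr><td>Status</td><td>Alert</td><td>Instance</td><td>Value</td><td>Time</td></tr>",
    ]
    for alert in alerts:
        resolved = alert["status"] == "resolved"
        cells = [
            "Resolved" if resolved else "Firing",
            alert["labels"].get("alertname", ""),
            alert["labels"].get("instance", ""),
            alert["annotations"].get("value", ""),
            # Resolved alerts show when they ended, firing alerts when they started.
            alert["endsAt"] if resolved else alert["startsAt"],
        ]
        rows.append("<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in cells) + "</tr>")
    rows.append("</table>")
    return "\n".join(rows)


# Hand-written sample alerts for illustration.
alerts = [
    {
        "status": "firing",
        "labels": {"alertname": "InstanceDown", "instance": "192.168.31.11:9100"},
        "annotations": {"value": "0"},
        "startsAt": "2022-01-01T12:00:00Z",
        "endsAt": "0001-01-01T00:00:00Z",
    },
    {
        "status": "resolved",
        "labels": {"alertname": "InstanceDown", "instance": "192.168.31.12:9100"},
        "annotations": {"value": "0"},
        "startsAt": "2022-01-01T12:00:00Z",
        "endsAt": "2022-01-01T12:10:00Z",
    },
]
print(render_table(alerts))
```

In the Go template, the same distinction could be made with a conditional on the alert's status inside the existing range loop.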
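This is how the email looks with the default Subject.
09: Maintenance
When we receive an alert and take the host in for maintenance, the alert should be silenced. Alerts for a host can also be silenced ahead of time, before they fire.
9.1: Creating a silence
In the menu bar of the Alertmanager home page, click Silences -> New Silence.
Once the silence has been created:
9.2: Testing
Stop the service on the target with app=redis-dev.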
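A silence is essentially a set of label matchers plus a time window. The sketch below builds a silence in the shape used by Alertmanager's v2 API (POST /api/v2/silences) and checks whether a given alert would fall under it. The creator name, the comment and the two-hour window are made-up values, and is_silenced is a simplified stand-in that handles only equality matchers.

```python
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, created_by, comment, now=None):
    """Build a silence body with equality matchers, valid for `hours` from now."""
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def is_silenced(labels, silence, at):
    """True if `at` is inside the window and every matcher equals the alert's label."""
    start = datetime.fromisoformat(silence["startsAt"])
    end = datetime.fromisoformat(silence["endsAt"])
    in_window = start <= at < end
    return in_window and all(
        labels.get(m["name"], "") == m["value"] for m in silence["matchers"]
    )


now = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)
silence = build_silence({"app": "redis-dev"}, 2, "ops", "planned maintenance", now=now)
alert = {"alertname": "InstanceDown", "app": "redis-dev", "instance": "192.168.31.22:9100"}
print(is_silenced(alert, silence, now + timedelta(hours=1)))
print(is_silenced(alert, silence, now + timedelta(hours=3)))
```

While the silence is active, the InstanceDown alert for app=redis-dev should still appear in the Alertmanager UI but no notification is sent for it.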