Prometheus系列05-告警管理

00：文章简介

介绍prometheus的的告警管理。

01：部署AlertManager

这里依然使用官网下载的二进制程序部署

wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz
tar -xf alertmanager-0.23.0.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/
ln -s alertmanager-0.23.0.linux-amd64/ alertmanager

/lib/systemd/system/prometheus-alertmanager.service

[Unit]
Description=Alertmanager for prometheus
Documentation=https://prometheus.io/docs/alerting/alertmanager/

[Service]
Restart=always
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data --web.listen-address=0.0.0.0:9093
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
SendSIGKILL=no

[Install]
WantedBy=multi-user.target

设置服务文件

systemctl daemon-reload 
systemctl start prometheus-alertmanager.service
systemctl enable prometheus-alertmanager.service
systemctl status prometheus-alertmanager.service

检查监听端口

ss -nutlp | grep alertmanager
# 正常会监听在9093 和 9094
9093是与prometheus通信使用，9094是集群通信端口

查看web界面

02：与Prometheus集成

对于生产环境，建议将Prometheus和AlertManger分开部署并搭建高可用集群

cd /usr/local/prometheus/cfg

/usr/local/prometheus/cfg/prometheus.yml

alerting: 
  alertmanagers: # 指定alertmanager地址
    - static_configs:
        - targets:
           - localhost:9093
           
rule_files: # 告警的规则
  - "rules/*.yml"

重启prometheus服务，查看效果

Status -> Runtime & Build Information

03：配置告警规则

mkdir /usr/local/prometheus/cfg/rules

/usr/local/prometheus/cfg/rules/k8s.yml

groups:
- name: kubernetes
  rules:
  - alert: InstanceDown
    expr: up == 0 # 每一个instance都有一个up状态，0为失败，1为存活
    for: 1m       # 报警的持续时间，1分钟内都是up == 0状态则报警
    labels:
      severity: error  # 报警级别
    annotations:       # 报警的注视信息
      summary: "Instance {{ $labels.instance }} has stopped working"
      description: "{{ $labels.instance }} job {{ $labels.job }} has stopped working for more than 1 minute"

检查rules文件

# promtool check rules k8s.yml 
Checking k8s.yml
  SUCCESS: 1 rules found

重启后查看web控制台

我们把监控的一个节点断开，查看效果

04：告警流程

05：配置邮件告警通道

官网参考：https://prometheus.io/docs/alerting/latest/configuration/#configuration-file

global:
  resolve_timeout: 5m      											# 解析超时时间
  smtp_smarthost: 'smtp.126.com:25'							# 邮箱smtp地址
  smtp_from: 'linux98_mail@126.com'							# 发信人地址
  smtp_auth_username: 'linux98_mail@126.com'		# 发信人用户名
  smtp_auth_password: 'HJKXELGNVVTVPQXC'				# 发信邮箱授权码
  smtp_hello: '126.com'													# 邮箱服务提供域名
  smtp_require_tls: false												# 是否启用tls，启用的话，上面smtp地址后面端口要修改

route:
  group_by: ['alertname']							# 采用哪个标签进行分组
  group_wait: 30s											# 分组的等待时间，收到信息后，不立即发送，而是看时间内该组还有告警发送，就一起发送
  group_interval: 30s									# 每组报警发送的时间间隔
  repeat_interval: 1m									# 重复报警时间
  resolve_timeout: 30s   							# 该时间内未收到告警，则认为问题已经被恢复
  receiver: 'email'										# 使用的报警配置

receivers: # 用于配置报警渠道
- name: 'email'												# 报警配置名称
  email_configs:											# email的配置
  - to: 'linux98_mail@yeah.net'				# 接收人
    send_resolved: true								# 发送恢复通知

inhibit_rules:		 # 用于配置报警规则，抑制不重要报警，只发送关键报警
  - source_match:  # 用于匹配重要告警的规则，如果匹配到，其他报警会被抑制
      severity: 'critical'  # 匹配告警级别为 critical的报警
    target_match:  # 其他报警
      severity: 'warning'  # 其他报警的级别
    equal: ['alertname', 'dev', 'instance'] # 对那些分组进行生效

配置完成后，重启prometheus-alertmanager服务

06：配置多个路由和报警通道

alertmanager的告警是经过route进行分发的，如果要配置多个告警渠道，需要在route下配置routes然后进行匹配。

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 1m
  receiver: 'email'
  routes:
    - match:
        idc: m8-a1-a1
      group_by: [env, app]
      receiver: 'm8'
    - match_re:
        app: http-dev|mysql-dev
      receiver: 'dev'
    - match_re:
        app: redis-prod|redis-dev
      receiver: 'redis'

receivers:
- name: 'email'
  email_configs:
  - to: 'linux98_mail@yeah.net'
    send_resolved: true
    headers: { Subject: "[email] 报警邮件"}

- name: 'm8'
  email_configs:
  - to: 'linux98_mail@yeah.net'
    send_resolved: true
    headers: { Subject: "[m8] 报警邮件"}

- name: 'dev'
  email_configs:
  - to: 'linux98_mail@yeah.net'
    send_resolved: true
    headers: { Subject: "[dev] 报警邮件"}

- name: 'redis'
  email_configs:
  - to: 'linux98_mail@yeah.net'
    send_resolved: true
    headers: { Subject: "[redis] 报警邮件"}

配置子路由后，我们用6台设备模拟不同数据中心业务，不同类型业务进行监控

  - job_name: "mysql-prod"
    metrics_path: '/metrics'
    static_configs:
      - targets: ["192.168.31.11:9100"]
        labels:
          app: mysql_prod
          env: prod
          idc: m8-a1-a1

  - job_name: "http-prod"
    metrics_path: '/metrics'
    static_configs:
      - targets: ["192.168.31.12:9100"]
        labels:
          app: http-prod
          env: prod
          idc: m8-a1-a1

  - job_name: "mysql-dev"
    metrics_path: '/metrics'
    static_configs:
      - targets: ["192.168.31.13:9100"]
        labels:
          app: mysql-dev
          env: dev
          idc: m8-a1-a2

  - job_name: "http-dev"
    metrics_path: '/metrics'
    static_configs:
      - targets: ["192.168.31.21:9100"]
        labels:
          app: http-dev
          env: dev
          idc: m8-a1-a2

  - job_name: "redis-dev"
    metrics_path: '/metrics'
    static_configs:
      - targets: ["192.168.31.22:9100"]
        labels:
          app: redis-dev
          env: dev
          idc: m8-a1-a3

  - job_name: "redis-prod"
    metrics_path: '/metrics'
    static_configs:
      - targets: ["192.168.31.23:9100"]
        labels:
          app: redis-prod
          env: prod
          idc: m8-a1-a3

使用批量管理工具，对6台设备中的node_export进行服务停止，观察邮件

最后得出结论

在router下的routers配置中，是逐层筛选过滤的

匹配idc=m8-a1-a1的所有target
- 使用group_by进行分组，不同值的分别发送邮件
- 发送邮件到m8通道
  - 上面匹配剩下的，再进行匹配，根据正则，app=http-dev或mysql-dev的
    - 发送到dev通道
      - 再匹配剩下的，根据app=redis-prod或redis-dev
        
        匹配到的发送给redis通道

07：配置webhook报警通道

7.1：安装prometheus-webhook-dingtalk

官方地址：https://github.com/timonwong/prometheus-webhook-dingtalk

wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.0.0/prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz
tar -xf prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz -C /usr/local/
ln -s prometheus-webhook-dingtalk-2.0.0.linux-amd64/ prometheus-webhook-dingtalk

启动配置文件

/lib/systemd/system/prometheus-webhook-dingtalk.service

[Unit]
Description=prometheus-webhook-dingtalk
Documentation=https://github.com/timonwong/prometheus-webhook-dingtalk

[Service]
Restart=always
ExecStart=/usr/local/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/usr/local/prometheus-webhook-dingtalk/config.yml

[Install]
WantedBy=multi-user.target

编辑配置文件

/usr/local/prometheus-webhook-dingtalk/config.yml

targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/sendxxxxxxxxxxxxxxxxxx
    # webhook url
    secret: xxxxxxxxxxxxxxxxxxx
    # 安全方式选择secret

启动服务

systemctl daemon-reload 
systemctl start prometheus-webhook-dingtalk
# 默认监听在8060

7.2：配置alertmanager

/usr/local/alertmanager/alertmanager.yml

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 1m
  receiver: 'email'
  routes:
    - match:
        idc: m8-a1-a1
      group_by: [env, app]
      receiver: 'm8'
    - match_re:
        app: http-dev|mysql-dev
      receiver: 'dev'
    - match_re:
        app: redis-prod|redis-dev
      receiver: 'dingtalk'

- name: 'dingtalk'
  webhook_configs:
  - url: http://localhost:8060/dingtalk/webhook1/send
    send_resolved: true

7.3：测试

关闭app=redis-prod或redis-dev的target主机，查看钉钉

08：配置邮件告警模板

8.1：配置prometheus

先配置prometheus的rule，使其可以传送阈值

/usr/local/prometheus/cfg/rules/k8s.yml

groups:
- name: kubernetes
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} has stopped working"
      description: "{{ $labels.instance }} job {{ $labels.job }} has stopped working for more than 1 minute"
      value: "{{ $value }}"

重启prometheus

8.2：创建模板

/usr/local/alertmanager/templates/email.tmpl

{{ define "test.html" }}
<table border="1">
<tr> 
    <td>报警项</td>
    <td>实例</td>
    <td>报警阈值</td>
    <td>开始时间</td>
</tr>
{{ range $i, $alert := .Alerts }}
<tr> 
    <td>{{ index $alert.Labels "alertname" }}</td>
    <td>{{ index $alert.Labels "instance" }}</td>
    <td>{{ index $alert.Annotations "value" }}</td>
    <td>{{ $alert.StartsAt }}</td>
</tr>
{{ end }}
</table>
{{ end }}

8.3：配置alertmanager

templates: # 设置模板路径
  - 'templates/*.tmpl'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 1m
  receiver: 'email'
  routes:
    - match:
        idc: m8-a1-a1
      group_by: [env, app]
      receiver: 'm8'

receivers:
- name: 'm8'
  email_configs:
  - to: 'linux98_mail@yeah.net'
    send_resolved: true
    html: '{{ template "test.html" . }}'  # 引用模板
    headers: { Subject: "[m8] 报警邮件"}   # 邮件标题

重启alertmanager服务

8.4：测试

关闭掉idc=m8-a1-a1的target主机，查看结果

对于恢复邮件，这里和告警邮件是一样的，需要改进

如果使用默认的Subject是这样的

09：维护

当我们接受到告警对主机进行维护时，需要对该告警进行静默处理。或者在发生告警前，静默主机的告警。

9.1：创建Silence

在Alertmanager主页菜单栏中点击Silence -> New Silence

创建完成后

9.2：测试

我们将app=redis-dev的target停止服务

目录CONTENT