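Building a Prometheus and Grafana Monitoring Stack in Practice - 微服務成熟的海豚博客

This article walks through building a Prometheus + Grafana monitoring stack that covers servers, applications, and databases.

Preface

A production environment needs monitoring in order to:

- Detect problems promptly
- Trace back through historical data
- Provide a basis for capacity planning
- Send alert notifications

Prometheus + Grafana is one of the most widely used open-source monitoring stacks today:

- Prometheus: collects and stores metrics
- Grafana: visualizes them
- A rich ecosystem: exporters for all kinds of systems

Let's build a complete monitoring stack.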

1. Architecture Design

1.1 Overall Architecture

┌─────────────────────────────────────────────────────┐
│                       Grafana                       │
│                   (visualization)                   │
└─────────────────────────────────────────────────────┘
                           ↑
┌─────────────────────────────────────────────────────┐
│                     Prometheus                      │
│           (scraping + storage + queries)            │
└─────────────────────────────────────────────────────┘
        ↑                  ↑                  ↑
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Node Exporter │  │ MySQL Exporter│  │ Redis Exporter│
│ (host metrics)│  │(MySQL metrics)│  │(Redis metrics)│
└───────────────┘  └───────────────┘  └───────────────┘

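1.2 Data Flow

1. Exporters collect metrics → expose them over HTTP (:9100 and so on)
2. Prometheus scrapes on a schedule → stores the time series
3. Grafana queries Prometheus → renders charts
4. Prometheus evaluates alert rules → Alertmanager sends the notifications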
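
Step 1 of the flow can be illustrated with a toy exporter. The sketch below uses only the Python standard library and serves two made-up gauges at /metrics in the text format that Prometheus scrapes. The metric names `demo_queue_length` and `demo_up`, and port 9200, are hypothetical choices for illustration; real exporters such as Node Exporter do far more.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(values):
    """Render a dict of gauge values in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(values.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical demo values; a real exporter would read them from the system.
    values = {"demo_queue_length": 3, "demo_up": 1}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.values).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def serve(port=9200):
    """Block and serve /metrics; the port is an arbitrary assumption."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and adding the host and port as a target in a scrape job would complete steps 1 and 2 for this toy endpoint.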

2. Deploying Prometheus

2.1 Deploying with Docker Compose

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

2.2 Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Host monitoring
  - job_name: 'node'
    static_configs:
      - targets: 
        - '192.168.1.101:9100'
        - '192.168.1.102:9100'
        - '192.168.1.103:9100'

  # MySQL monitoring
  - job_name: 'mysql'
    static_configs:
      - targets: ['192.168.1.101:9104']

  # Redis monitoring
  - job_name: 'redis'
    static_configs:
      - targets: ['192.168.1.101:9121']

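2.3 Starting the Services

# Create the directories
mkdir -p prometheus/rules alertmanager

# prometheus/prometheus.yml and alertmanager/alertmanager.yml must exist before starting;
# otherwise Docker creates directories in place of the missing files

# Start the stack
docker compose up -d

# Access
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin/admin123; change this password outside of a test setup)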

3. Node Exporter (Host Monitoring)

3.1 Installation

# Option 1: Docker
docker run -d --name node_exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter:latest \
  --path.rootfs=/host

# Option 2: binary install
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-*.tar.gz
cd node_exporter-*/
./node_exporter &

3.2 Verification

curl http://localhost:9100/metrics

# Sample output
# node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
# node_memory_MemTotal_bytes 8.3e+09
# node_filesystem_size_bytes{device="/dev/sda1"} 1.0e+11
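
To make the format of that output concrete, here is a small parsing sketch in Python. It is a deliberate simplification, not a full parser for the exposition format: it ignores optional timestamps and escaped characters inside label values. The function name `parse_metrics` is made up for this example.

```python
import re

# Matches lines such as: metric_name{label="value",...} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this simple parser understands
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


SAMPLE_OUTPUT = """\
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_memory_MemTotal_bytes 8.3e+09
node_filesystem_size_bytes{device="/dev/sda1"} 1.0e+11
"""

samples = parse_metrics(SAMPLE_OUTPUT)
```

Each sample is a metric name, a set of labels, and a float value; Prometheus attaches the scrape timestamp and the `job` and `instance` labels itself.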

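3.3 Common Metrics

Metric	Description
node_cpu_seconds_total	Cumulative CPU time in seconds, per core and mode
node_memory_MemTotal_bytes	Total memory
node_memory_MemAvailable_bytes	Available memory
node_filesystem_size_bytes	Filesystem size
node_filesystem_avail_bytes	Filesystem space available
node_network_receive_bytes_total	Bytes received over the network
node_network_transmit_bytes_total	Bytes transmitted over the network
node_load1, node_load5, node_load15	System load average (1, 5, and 15 minutes)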

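4. Application Monitoring

4.1 MySQL Exporter

# Deploy
# DATA_SOURCE_NAME is read by mysqld_exporter releases before 0.15.0, so the tag is pinned here;
# newer releases take credentials from a my.cnf file or --mysqld.* flags instead
docker run -d --name mysql_exporter \
  -p 9104:9104 \
  -e DATA_SOURCE_NAME="exporter:password@(mysql:3306)/" \
  prom/mysqld-exporter:v0.14.0

# Create the monitoring user (run these statements in the MySQL client)
CREATE USER 'exporter'@'%' IDENTIFIED BY 'password';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
FLUSH PRIVILEGES;

Common metrics:

mysql_up: whether MySQL is reachable (1 = up)
mysql_global_status_connections: cumulative connection attempts (for currently open connections, use mysql_global_status_threads_connected)
mysql_global_status_slow_queries: cumulative number of slow queries
mysql_global_status_questions: cumulative number of statements executed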

4.2 Redis Exporter

docker run -d --name redis_exporter \
  -p 9121:9121 \
  -e REDIS_ADDR=redis://192.168.1.101:6379 \
  oliver006/redis_exporter

Common metrics:

redis_up: whether Redis is reachable (1 = up)
redis_connected_clients: number of client connections
redis_memory_used_bytes: memory in use
redis_commands_processed_total: cumulative number of commands processed

4.3 Nginx Exporter

# The Nginx status module (stub_status) must be enabled first
# Add to nginx.conf:
# location /nginx_status {
#     stub_status on;
# }

docker run -d --name nginx_exporter \
  -p 9113:9113 \
  nginx/nginx-prometheus-exporter \
  --nginx.scrape-uri=http://192.168.1.101/nginx_status

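4.4 Java Applications (Micrometer)

<!-- pom.xml -->
<!-- Assumes a Spring Boot project that already includes spring-boot-starter-actuator;
     the version is supplied by Spring Boot's dependency management -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: prometheus,health
  # Spring Boot 2.x property path; on Spring Boot 3.x the equivalent is
  # management.prometheus.metrics.export.enabled
  metrics:
    export:
      prometheus:
        enabled: true

The metrics are served at http://localhost:8080/actuator/prometheus.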

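5. Grafana Configuration

5.1 Adding the Data Source

1. Configuration → Data Sources → Add data source (Connections → Data sources in recent Grafana releases)
2. Select Prometheus
3. URL: http://prometheus:9090 (from inside the Docker network)
   or http://192.168.1.100:9090 (from outside)
4. Save & Test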

5.2 Importing Dashboards

Recommended dashboards (IDs from the Grafana website):

ID	Name	Purpose
1860	Node Exporter Full	Host monitoring
7362	MySQL Overview	MySQL monitoring
763	Redis Dashboard	Redis monitoring
12708	Nginx Exporter	Nginx monitoring

To import:
1. Dashboards → Import
2. Enter the ID: 1860
3. Load → select the data source → Import

5.3 Custom Panels

# CPU usage (%), per host
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage (%); consider excluding pseudo-filesystems, e.g. {fstype!~"tmpfs|overlay"}
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Network traffic (bytes per second)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
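
The arithmetic behind the CPU expression can be worked through by hand. The Python sketch below is a simplified model: the real `rate()` also extrapolates to the window boundaries, and its counter-reset handling is more careful. It shows why "100 minus the average idle rate times 100" yields a usage percentage. The sample values are invented.

```python
def counter_rate(first, last):
    """Per-second increase between two (timestamp, value) counter samples."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    # Simplified reset handling: a drop means the counter restarted from zero.
    increase = v1 if v1 < v0 else v1 - v0
    return increase / (t1 - t0)


def cpu_usage_percent(idle_rates):
    """100 minus the average per-core idle fraction, in percent."""
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


# Two cores observed over a 5-minute (300 s) window; idle seconds are counters.
idle_rates = [
    counter_rate((0, 1000.0), (300, 1240.0)),  # core 0 was idle 240 of 300 s
    counter_rate((0, 2000.0), (300, 2270.0)),  # core 1 was idle 270 of 300 s
]
usage = cpu_usage_percent(idle_rates)  # roughly 15 percent
```

Because `node_cpu_seconds_total` counts seconds, its per-second rate for one mode is the fraction of time a core spent in that mode, which is what makes the subtraction from 100 meaningful.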

6. Alerting

6.1 Alert Rules

# prometheus/rules/alert.yml
groups:
  - name: host-alerts
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"
          description: "The host has been unreachable for more than 1 minute"

      - alert: HighCpuUsage
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on host {{ $labels.instance }}"
          description: "CPU usage is above 80%, current value: {{ $value }}%"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on host {{ $labels.instance }}"
          description: "Memory usage is above 80%, current value: {{ $value }}%"

      - alert: LowDiskSpace
        expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on host {{ $labels.instance }}"
          description: "Disk usage on {{ $labels.mountpoint }} is above 85%, current value: {{ $value }}%"
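
The `for` clause in these rules is easy to misread, so here is a small model of it in Python. It is a sketch of the documented pending-then-firing behavior, not Prometheus source code, and it ignores newer options such as `keep_firing_for`. An alert becomes pending the first time its expression matches, and fires once the condition has held for at least the `for` duration.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp_seconds, condition_true) pairs to alert states."""
    states = []
    active_since = None  # time of the first evaluation in the current streak
    for timestamp, breached in evaluations:
        if not breached:
            active_since = None  # any miss resets the streak
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Rules evaluated every 15 s; the host is unreachable from t=15 to t=90.
evaluations = [(0, False), (15, True), (30, True), (45, True),
               (60, True), (75, True), (90, True), (105, False)]
states = alert_states(evaluations, for_seconds=60)
```

With `for: 1m`, the alert in this example stays pending from t=15 to t=60 and fires at t=75. A short blip that recovers within the window never reaches Alertmanager, which is the point of the clause.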

6.2 Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'webhook'

receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://your-webhook-url/alert'
        send_resolved: true

  # Or use email
  # - name: 'email'
  #   email_configs:
  #     - to: 'admin@example.com'
  #       from: 'alert@example.com'
  #       smarthost: 'smtp.example.com:587'
  #       auth_username: 'alert@example.com'
  #       auth_password: 'password'
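
The `webhook_configs` receiver needs something listening at the configured URL. Alertmanager POSTs a JSON document whose `alerts` array carries each alert's `status`, `labels`, and `annotations`. The following is a minimal receiver sketch using the Python standard library. Port 5001 and printing to stdout are assumptions for illustration; a real receiver would forward to chat, ticketing, or similar.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alerts(payload):
    """Turn an Alertmanager webhook payload into one line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append("[{status}] {name} on {instance}: {summary}".format(
            status=alert.get("status", "unknown"),
            name=labels.get("alertname", "?"),
            instance=labels.get("instance", "?"),
            summary=annotations.get("summary", ""),
        ))
    return lines


class AlertWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        for line in summarize_alerts(payload):
            print(line)  # placeholder for a real notification call
        self.send_response(200)
        self.end_headers()


def serve(port=5001):
    """Block and accept webhook calls; the port is an arbitrary assumption."""
    HTTPServer(("0.0.0.0", port), AlertWebhookHandler).serve_forever()
```

Because `send_resolved: true` is set, the same endpoint also receives a payload with `status` set to `resolved` when an alert clears.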

6.3 Testing Alerts

# Check alert status
curl http://localhost:9090/api/v1/alerts

# Check rule status
curl http://localhost:9090/api/v1/rules
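
The JSON returned by the alerts endpoint is easier to check when grouped by state. A small Python sketch follows; the helper names `fetch_alerts` and `alerts_by_state` are made up here, and the base URL assumes the local deployment from this article.

```python
import json
import urllib.request


def fetch_alerts(base_url="http://localhost:9090"):
    """Call the Prometheus HTTP API and return the decoded JSON document."""
    with urllib.request.urlopen(f"{base_url}/api/v1/alerts", timeout=5) as response:
        return json.load(response)


def alerts_by_state(api_response):
    """Group alert names by state ('pending' or 'firing')."""
    if api_response.get("status") != "success":
        raise ValueError(f"unexpected API status: {api_response.get('status')!r}")
    grouped = {}
    for alert in api_response["data"]["alerts"]:
        name = alert["labels"].get("alertname", "?")
        grouped.setdefault(alert["state"], []).append(name)
    return grouped
```

A quick way to test the pipeline end to end is to stop one Node Exporter, then call `alerts_by_state(fetch_alerts())` and watch the host-down alert move from pending to firing.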

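7. Multi-Site Monitoring

7.1 Scenario

Monitoring requirements:
- 10 servers in the headquarters data center
- 5 servers in the branch A data center
- 3 servers in the branch B data center
- 2 servers in the cloud

Challenge: the sites have no network connectivity to each other

7.2 Traditional Approaches

Option 1: deploy a Prometheus at every site

Pros: each site runs independently
Cons: no unified view, and alerts are scattered

Option 2: expose the exporters on the public internet

Pros: centralized collection
Cons: high security risk

7.3 Overlay Network Approach (Recommended)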

Use overlay networking software (such as 星空組網) to connect all the nodes:

Architecture once the sites are connected:
                    ┌──────────────────────┐
                    │  Central Prometheus  │
                    │      10.10.0.1       │
                    └──────────────────────┘
                              ↑
        ┌─────────────────────┼─────────────────────┐
        ↑                     ↑                     ↑
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│ Headquarters │      │   Branch A   │      │   Branch B   │
│  10.10.0.2   │      │  10.10.0.3   │      │  10.10.0.4   │
│ Node Exporter│      │ Node Exporter│      │ Node Exporter│
│    :9100     │      │    :9100     │      │    :9100     │
└──────────────┘      └──────────────┘      └──────────────┘

Prometheus configuration:

scrape_configs:
  # Headquarters servers (overlay network IPs)
  - job_name: 'node-headquarters'
    static_configs:
      - targets:
        - '10.10.0.10:9100'
        - '10.10.0.11:9100'
        - '10.10.0.12:9100'
    relabel_configs:
      # With the default regex this always matches, so location gets a fixed value
      - source_labels: [__address__]
        target_label: location
        replacement: 'headquarters'

  # Branch A servers (overlay network IPs)
  - job_name: 'node-branch-a'
    static_configs:
      - targets:
        - '10.10.0.20:9100'
        - '10.10.0.21:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: location
        replacement: 'branch-a'

  # Branch B servers (overlay network IPs)
  - job_name: 'node-branch-b'
    static_configs:
      - targets:
        - '10.10.0.30:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: location
        replacement: 'branch-b'
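
As the number of sites grows, maintaining one hand-written job per site gets tedious. One alternative, not used in the configuration above, is Prometheus file-based service discovery: a scrape job lists `file_sd_configs` with a `files` pattern, and each JSON file holds target groups with their own labels. The Python sketch below generates such a file. The site inventory and the output path are assumptions for illustration.

```python
import json

# Hypothetical inventory: location label -> overlay network IPs.
SITES = {
    "headquarters": ["10.10.0.10", "10.10.0.11", "10.10.0.12"],
    "branch-a": ["10.10.0.20", "10.10.0.21"],
    "branch-b": ["10.10.0.30"],
}


def build_target_groups(sites, port=9100):
    """Build file_sd target groups, one per site, each tagged with a location label."""
    return [
        {
            "targets": [f"{ip}:{port}" for ip in ips],
            "labels": {"location": location},
        }
        for location, ips in sites.items()
    ]


def write_file_sd(path, sites, port=9100):
    """Write the target groups as JSON for file_sd_configs to pick up."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_target_groups(sites, port), handle, indent=2)


groups = build_target_groups(SITES)
```

Prometheus watches the listed files for changes, so adding a server becomes an edit to the inventory and a call to `write_file_sd` rather than a change to prometheus.yml.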

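Advantages:

- A single entry point for monitoring
- All data displayed in one place
- Alerts managed centrally
- No exposure to the public internet
- Simple configuration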

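8. High-Availability Deployment

8.1 Prometheus Federation

# Central Prometheus configuration
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Matches every series that has a job label; narrow this in practice to limit load
        - '{job=~".+"}'
    static_configs:
      - targets:
        - '10.10.0.2:9090'  # Headquarters Prometheus
        - '10.10.0.3:9090'  # Branch Prometheus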
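
Before wiring a site into the central server, it can help to request the `/federate` endpoint by hand and see what a given `match[]` selector returns. The selectors contain characters that must be URL-encoded, which the Python sketch below handles. The helper name `federate_url` and the example selectors are assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url, matchers):
    """Build a /federate URL with one match[] parameter per series selector."""
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


url = federate_url(
    "http://10.10.0.2:9090",
    ['{job="node"}', '{__name__=~"job:.*"}'],  # example selectors
)
# Decoding the query string recovers the original selectors.
decoded = parse_qs(urlsplit(url).query)["match[]"]
```

Fetching the resulting URL with curl or a browser shows the series the central server would pull, which makes it easier to choose a selector narrower than "everything".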

8.2 Grafana High Availability

# Use an external MySQL database for storage
services:
  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_DATABASE_TYPE=mysql
      - GF_DATABASE_HOST=mysql:3306
      - GF_DATABASE_NAME=grafana
      - GF_DATABASE_USER=grafana
      - GF_DATABASE_PASSWORD=password

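9. Common Problems

9.1 High Prometheus Memory Usage

# Shorten data retention (this mainly frees disk; memory is driven mostly by the number of active series)
--storage.tsdb.retention.time=15d

# Scrape less frequently
global:
  scrape_interval: 30s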
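
To see what these two settings buy, it helps to put numbers on them. The Prometheus storage documentation gives a rough disk sizing formula: retention time in seconds, times ingested samples per second, times bytes per sample (about 1 to 2 bytes). The Python sketch below applies it. The figure of 100,000 active series is a made-up assumption, and the result is an estimate of disk rather than memory.

```python
SECONDS_PER_DAY = 86_400


def samples_per_second(active_series, scrape_interval_seconds):
    """Ingestion rate: every active series yields one sample per scrape."""
    return active_series / scrape_interval_seconds


def estimated_disk_bytes(retention_days, active_series,
                         scrape_interval_seconds, bytes_per_sample=2.0):
    """Rough sizing: retention seconds * samples per second * bytes per sample."""
    retention_seconds = retention_days * SECONDS_PER_DAY
    rate = samples_per_second(active_series, scrape_interval_seconds)
    return retention_seconds * rate * bytes_per_sample


# Hypothetical load: 100,000 active series.
before = estimated_disk_bytes(30, 100_000, 15)  # 30d retention, 15s interval
after = estimated_disk_bytes(15, 100_000, 30)   # 15d retention, 30s interval
```

Under these assumptions the estimate drops from roughly 35 GB to under 9 GB. Halving retention and doubling the scrape interval each cut it in half. Reducing the number of active series, for example by dropping unused metrics or high-cardinality labels, is what lowers memory most directly.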

9.2 Slow Queries

# Precompute expensive expressions with recording rules
groups:
  - name: recording
    rules:
      - record: job:node_cpu_usage:avg
        expr: avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (job)

9.3 Hot-Reloading the Configuration

# Requires Prometheus to be started with --web.enable-lifecycle
curl -X POST http://localhost:9090/-/reload
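
The same reload can be triggered from a deployment script. The Python sketch below mirrors the curl call using only the standard library. The helper names are made up, and the base URL assumes the local deployment from this article. When the lifecycle endpoint is disabled or the new configuration is rejected, Prometheus answers with an error status, which `urlopen` raises as `HTTPError`.

```python
import urllib.error
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request for the /-/reload lifecycle endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/-/reload", data=b"", method="POST"
    )


def reload_prometheus(base_url="http://localhost:9090"):
    """Return True if Prometheus accepted the reload, False otherwise."""
    try:
        with urllib.request.urlopen(build_reload_request(base_url), timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        print(f"reload rejected: HTTP {error.code}")
        return False
    except urllib.error.URLError as error:
        print(f"Prometheus not reachable: {error.reason}")
        return False
```

Validating the configuration before reloading, for example with `promtool check config`, avoids pushing a broken file to a running server.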

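10. Summary

Key points for building a monitoring stack:

- Core stack: Prometheus + Grafana + Alertmanager
- Host monitoring: Node Exporter is a must
- Application monitoring: pick exporters that match your tech stack
- Dashboards: import ready-made ones, then customize
- Alert rules: set them by priority
- Multi-site: connect the sites with an overlay network, then monitor centrally
- High availability: federation + external storage

My monitoring checklist:

Must-monitor items:
- CPU / memory / disk / network
- Service liveness
- Database connection counts and slow queries
- Application response time and error rate

Monitoring is the eyes of operations; a system without monitoring is flying blind.

References

Prometheus documentation: https://prometheus.io/docs/
Grafana documentation: https://grafana.com/docs/
Awesome Prometheus Alerts: https://awesome-prometheus-alerts.grep.to/

💡 Tip: start with the core metrics and refine step by step. Keep the number of alerts small, or people grow numb to them.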
