Concept Overview

Kubernetes monitoring and log management are key to keeping a cluster stable and applications healthy. Monitoring focuses on system performance metrics and health status, while log management captures the detailed behavior of applications and system components.

Monitoring Concepts

  1. Metric monitoring: collecting and analyzing system performance metrics such as CPU, memory, network, and disk usage
  2. Health checks: monitoring the availability and responsiveness of applications and services (see the probe sketch after this list)
  3. Alerting: triggering notifications and automated responses when anomalies are detected
  4. Visualization: presenting monitoring data intuitively through charts and dashboards
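
In Kubernetes, health checks are usually expressed as liveness and readiness probes on the container spec. A minimal sketch, assuming the application answers on /healthz and /ready on port 8080 (both paths are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: app
    image: my-web-app:1.0
    ports:
    - containerPort: 8080
    livenessProbe:            # restart the container if this check fails repeatedly
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:           # remove the Pod from Service endpoints until this passes
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5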

Log Management Concepts

  1. Application logs: business and error logs produced by applications
  2. System logs: logs produced by Kubernetes components and node operating systems
  3. Audit logs: records of cluster API access and security-related events (a policy sketch follows this list)
  4. Log aggregation: centrally collecting, storing, and analyzing logs from a distributed system
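
Audit logging is enabled on the kube-apiserver via the --audit-policy-file and --audit-log-path flags. A minimal policy sketch (the rules are illustrative, not a recommended production policy):

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Record full request and response bodies for Secret access
- level: RequestResponse
  resources:
  - group: ""
    resources: ["secrets"]
# Record only metadata (user, verb, timestamp) for Pod access
- level: Metadata
  resources:
  - group: ""
    resources: ["pods"]
# Ignore everything else
- level: None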

Core Components

  1. Data collection: Prometheus, Fluentd, Logstash, etc.
  2. Data storage: Prometheus TSDB, Elasticsearch, Loki, etc.
  3. Data presentation: Grafana, Kibana, etc.
  4. Alerting: Alertmanager, ElastAlert, etc.

Core Features

Monitoring Features

  1. Multi-dimensional monitoring: node-, Pod-, container-, and service-level views
  2. Auto-discovery: newly deployed services are discovered and scraped automatically
  3. Metric aggregation: metrics can be aggregated, computed, and transformed (see the PromQL sketch after this list)
  4. Historical data: long-term storage and querying of past monitoring data
  5. Alert rules: flexible configuration and management of alerting rules
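
Aggregation and transformation are typically expressed in PromQL. Two small sketches (http_request_duration_seconds_bucket is an assumed application histogram, not a built-in metric):

# Per-namespace CPU usage, summed from per-container counters
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)

# 95th-percentile request latency derived from a histogram
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))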

Log Management Features

  1. Real-time collection: logs are collected and shipped as they are written
  2. Structured parsing: log content is parsed and structured automatically
  3. Full-text search: efficient searching and querying across logs
  4. Log filtering: logs can be filtered and routed based on conditions
  5. Compressed storage: efficient compression and storage of log data

Hands-On Tutorial

Deploying the Prometheus Monitoring System

# Add the Prometheus community Helm chart repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create the monitoring namespace
kubectl create namespace monitoring

# Deploy Prometheus (persistence disabled here for demo purposes only)
helm install prometheus prometheus-community/prometheus \
  -n monitoring \
  --set alertmanager.persistentVolume.enabled=false \
  --set server.persistentVolume.enabled=false
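
A quick sanity check after the install (prometheus-server is the chart's default server Service name):

# Check that the Prometheus pods are running
kubectl get pods -n monitoring

# Port-forward the server Service, then open http://localhost:9090
kubectl port-forward svc/prometheus-server 9090:80 -n monitoring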

Deploying the Grafana Visualization Platform

# Add the Grafana Helm chart repository
helm repo add grafana https://grafana.github.io/helm-charts

# Deploy Grafana (the hard-coded admin password is for demos only)
helm install grafana grafana/grafana \
  -n monitoring \
  --set persistence.enabled=false \
  --set adminPassword=admin123

# Retrieve the Grafana admin password (useful when adminPassword is not set explicitly)
kubectl get secret --namespace monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

# Port-forward to access Grafana
kubectl port-forward svc/grafana 3000:80 -n monitoring
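
Grafana still needs Prometheus registered as a data source. This can be done in the UI, or provisioned through the chart's datasources value; a minimal sketch, assuming the in-cluster prometheus-server Service from the previous step:

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      url: http://prometheus-server.monitoring.svc.cluster.local
      access: proxy
      isDefault: true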

Deploying the EFK Logging Stack

# Add the Elastic and Fluent Helm chart repositories
helm repo add elastic https://helm.elastic.co
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

# Deploy Elasticsearch (single node, for demo purposes)
helm install elasticsearch elastic/elasticsearch \
  -n monitoring \
  --set replicas=1 \
  --set minimumMasterNodes=1

# Deploy Fluentd
helm install fluentd fluent/fluentd \
  -n monitoring

# Deploy Kibana
helm install kibana elastic/kibana \
  -n monitoring
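
Once Elasticsearch and Kibana report ready, Kibana can be reached via a port-forward (kibana-kibana is the Service name the elastic/kibana chart generates for a release named kibana; verify with kubectl get svc -n monitoring):

# Port-forward Kibana, then open http://localhost:5601
kubectl port-forward svc/kibana-kibana 5601:5601 -n monitoring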

Configuring Application Monitoring

# Deployment with Prometheus scrape annotations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  labels:
    app: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: app
        image: my-web-app:1.0
        ports:
        - containerPort: 8080
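
With these annotations (and scrape configs that honor them, such as the kubernetes-pods job in the case study below), Prometheus performs an HTTP GET against /metrics on port 8080 of each Pod. The application must respond in the Prometheus text exposition format; a sketch of such a response (the metric names are illustrative):

# HELP http_requests_total Total HTTP requests handled
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/orders",status="200"} 1027
# HELP process_resident_memory_bytes Resident memory size in bytes
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.62144e+07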

Real-World Case Study

Case: Full-Stack Monitoring for an E-Commerce Platform

A large e-commerce platform needs a complete monitoring and log management system covering infrastructure, application services, and business metrics:

# Prometheus configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    rule_files:
      - "/etc/prometheus-rules/*.rules"
    scrape_configs:
      # Kubernetes node monitoring
      # (10255 is the kubelet's legacy read-only port, disabled by default on
      # modern clusters; prefer HTTPS scraping on 10250, as shown in the
      # configuration reference later in this document)
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - source_labels: [__address__]
          regex: '(.*):10250'
          target_label: __address__
          replacement: '${1}:10255'
      
      # Kubernetes Pod monitoring
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
      
      # Application business-metric monitoring
      - job_name: 'business-metrics'
        static_configs:
        - targets: ['business-metrics-exporter:8080']
---
# Alertmanager configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  config.yml: |
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alert@example.com'
      smtp_auth_username: 'alert'
      smtp_auth_password: 'password'
    
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'team-X-mails'
    
    receivers:
    - name: 'team-X-mails'
      email_configs:
      - to: 'team-X@example.com'
    
    - name: 'webhook'
      webhook_configs:
      - url: 'http://webhook-service:8080/alert'
---
# Grafana dashboard configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
data:
  kubernetes-cluster-dashboard.json: |
    {
      "dashboard": {
        "id": null,
        "title": "Kubernetes Cluster Monitoring",
        "timezone": "browser",
        "panels": [
          {
            "title": "Cluster CPU Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (instance)",
                "legendFormat": "{{instance}}"
              }
            ]
          },
          {
            "title": "Cluster Memory Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(container_memory_usage_bytes) by (instance)",
                "legendFormat": "{{instance}}"
              }
            ]
          }
        ]
      }
    }
---
# Fluentd configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: monitoring
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>
    
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>
    
    <match kubernetes.var.log.containers.app-**>
      @type elasticsearch
      host elasticsearch
      port 9200
      logstash_format true
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 2
        flush_interval 5s
        retry_forever
        retry_max_interval 30
        chunk_limit_size 2M
        queue_limit_length 8
        overflow_action block
      </buffer>
    </match>
---
# Custom business-metrics exporter
apiVersion: apps/v1
kind: Deployment
metadata:
  name: business-metrics-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: business-metrics-exporter
  template:
    metadata:
      labels:
        app: business-metrics-exporter
    spec:
      containers:
      - name: exporter
        image: business/metrics-exporter:1.0
        ports:
        - containerPort: 8080
        env:
        - name: DATABASE_HOST
          value: "mysql-service"
        - name: REDIS_HOST
          value: "redis-service"
---
apiVersion: v1
kind: Service
metadata:
  name: business-metrics-exporter
  namespace: monitoring
spec:
  selector:
    app: business-metrics-exporter
  ports:
  - port: 8080
    targetPort: 8080
---
# Monitoring alert rules
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitoring
data:
  app.rules: |
    groups:
    - name: app.rules
      rules:
      - alert: HighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
      
      - alert: ApplicationDown
        expr: up{job="kubernetes-pods"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Application instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

Advantages of this full-stack monitoring system:

  • Complete coverage: monitoring from infrastructure up to business metrics
  • Real-time alerting: problems are detected promptly and the right people notified
  • Visualization: intuitive dashboards make problems easier to analyze
  • Log aggregation: centralized log management simplifies troubleshooting
  • Auto-discovery: newly deployed services are monitored automatically
  • Extensibility: custom metrics and alerting rules are supported

Configuration Reference

Prometheus Configuration in Detail

global:
  scrape_interval: 15s      # global scrape interval
  evaluation_interval: 15s  # rule evaluation interval

rule_files:
  - "first_rules.yml"       # 告警規則文件
  - "second_rules.yml"

scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  
  # Scrape Kubernetes nodes (kubelet metrics proxied through the API server)
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
    - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics
  
  # Scrape Kubernetes Service endpoints
  - job_name: 'kubernetes-service-endpoints'
    kubernetes_sd_configs:
    - role: endpoints
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
      action: replace
      target_label: __scheme__
      regex: (https?)
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: kubernetes_name
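
Before loading a configuration like this, it can be validated with promtool, which ships with Prometheus (assuming the file is saved as prometheus.yml); a running server can then be reloaded without a restart, provided it was started with --web.enable-lifecycle:

# Validate the configuration file
promtool check config prometheus.yml

# Ask a running server to reload its configuration
curl -X POST http://localhost:9090/-/reload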

Grafana Configuration in Detail

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config
  namespace: monitoring
data:
  grafana.ini: |
    [server]
    domain = grafana.example.com
    root_url = %(protocol)s://%(domain)s:%(http_port)s/
    serve_from_sub_path = true
    
    [security]
    admin_user = admin
    admin_password = $__file{/etc/grafana/secrets/admin_password}
    
    [users]
    allow_sign_up = false
    auto_assign_org = true
    auto_assign_org_role = Viewer
    
    [auth.anonymous]
    enabled = false
    
    [dashboards]
    versions_to_keep = 20
    
    [metrics]
    enabled = true
    interval_seconds = 10
    
    [unified_alerting]
    enabled = true
    
    [alerting]
    enabled = false
    
    [log]
    mode = console file
    level = info
    
    [paths]
    logs = /var/log/grafana
    plugins = /var/lib/grafana/plugins
    provisioning = /etc/grafana/provisioning
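
The provisioning path above is where Grafana discovers data sources and dashboards declared as files. A minimal dashboard provider sketch, placed under /etc/grafana/provisioning/dashboards/ (the folder name and dashboard path are illustrative):

apiVersion: 1
providers:
- name: default
  folder: Kubernetes
  type: file
  options:
    path: /var/lib/grafana/dashboards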

Fluentd Configuration in Detail

<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_key time
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

<filter kubernetes.**>
  @type kubernetes_metadata
  @id filter_kube_metadata
  kubernetes_url "#{ENV['FLUENT_FILTER_KUBERNETES_URL'] || 'https://kubernetes.default.svc:443/api'}"
  bearer_token_file /var/run/secrets/kubernetes.io/serviceaccount/token
  ca_file /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  skip_labels false
  skip_master_url false
  skip_container_metadata false
</filter>

<filter kubernetes.var.log.containers.**_app-**>
  @type parser
  key_name log
  reserve_data true
  <parse>
    @type json
  </parse>
</filter>

<match kubernetes.var.log.containers.**_app-**>
  @type rewrite_tag_filter
  <rule>
    key log_level
    pattern ^ERROR$
    tag error.${tag}
  </rule>
</match>

<match error.**>
  @type elasticsearch
  host elasticsearch
  port 9200
  logstash_format true
  logstash_prefix error-logs
  <buffer>
    @type file
    path /var/log/fluentd-buffers/error.buffer
    flush_mode interval
    flush_interval 5s
    flush_thread_count 2
    retry_type exponential_backoff
    retry_forever true
    retry_max_interval 30
    chunk_limit_size 2M
    total_limit_size 512MB
    overflow_action block
  </buffer>
</match>

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch
  port 9200
  logstash_format true
  logstash_prefix kubernetes-logs
  <buffer>
    @type file
    path /var/log/fluentd-buffers/kubernetes.buffer
    flush_mode interval
    flush_interval 10s
    flush_thread_count 4
    retry_type exponential_backoff
    retry_forever true
    retry_max_interval 60
    chunk_limit_size 4M
    total_limit_size 1GB
    overflow_action block
  </buffer>
</match>
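
A configuration of this size is worth validating before rollout; fluentd can parse the file and report errors without launching any workers:

# Parse and validate the configuration, then exit
fluentd --dry-run -c fluent.conf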

Troubleshooting

Common Issues and Solutions

  1. Missing monitoring metrics

    # Check Prometheus target status
    kubectl port-forward svc/prometheus-server 9090:80 -n monitoring
    # Open http://localhost:9090/targets to inspect target health

    # Check the Pod's scrape annotations
    kubectl describe pod <pod-name> -n <namespace>

    # Check for network policies blocking the scrape
    kubectl get networkpolicy -n <namespace>

  2. Incomplete log collection

    # Check the Fluentd logs
    kubectl logs <fluentd-pod> -n monitoring

    # Check log file permissions on the node
    kubectl exec <node-pod> -- ls -la /var/log/containers/

    # Inspect the Fluentd configuration
    kubectl describe configmap fluentd-config -n monitoring

  3. Alerts not firing

    # Check the alerting rules
    kubectl port-forward svc/prometheus-server 9090:80 -n monitoring
    # Open http://localhost:9090/rules to inspect rule status

    # Check alert state
    # Open http://localhost:9090/alerts to inspect pending and firing alerts

    # Check the Alertmanager logs and configuration
    kubectl logs <alertmanager-pod> -n monitoring

  4. Grafana panels show no data

    # Check the data source configuration
    kubectl port-forward svc/grafana 3000:80 -n monitoring
    # Verify the data source connection status in Grafana

    # Check the query
    # Validate the PromQL expression in the Grafana query editor

    # Check the time range
    # Confirm that the selected time range actually contains data
    

Best Practices

  1. Monitoring strategy

    • Build a layered monitoring hierarchy (infrastructure, platform, application, business)
    • Set sensible alert thresholds and silence windows
    • Review and tune monitoring rules regularly
    • Establish metric baselines and trend analysis
  2. Log management

    • Standardize log formats and structure
    • Set log levels appropriately
    • Implement log rotation and cleanup policies (see the kubelet sketch after this list)
    • Build a log classification and labeling scheme
  3. Performance optimization

    • Choose reasonable scrape intervals
    • Optimize log collection and shipping performance
    • Apply sampling and downsampling strategies
    • Use efficient storage and indexing mechanisms
  4. Security

    • Restrict access to monitoring and logging systems
    • Encrypt sensitive monitoring data in transit
    • Audit access to monitoring and log data regularly
    • Implement backup and disaster recovery for monitoring data
  5. Operations

    • Establish monitoring and alert response procedures
    • Run regular health checks on the monitoring stack itself
    • Maintain monitoring documentation and runbooks
    • Train team members on the monitoring tools
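
For the log rotation policy above, nodes using the CRI container log path can rotate container logs directly in the kubelet configuration; a minimal sketch (the sizes are illustrative):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 10Mi   # rotate a container's log file once it reaches 10 MiB
containerLogMaxFiles: 5     # keep at most 5 rotated files per container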

Security Considerations

Monitoring System Security Configuration

apiVersion: v1
kind: Secret
metadata:
  name: prometheus-secrets
  namespace: monitoring
type: Opaque
data:
  admin-password: <base64-encoded-password>
  tls-cert: <base64-encoded-cert>
  tls-key: <base64-encoded-key>
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-network-policy
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
    - podSelector:
        matchLabels:
          app: grafana
    ports:
    - protocol: TCP
      port: 9090
  egress:
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 443
    - protocol: TCP
      port: 10250
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
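
With the ClusterRoleBinding in place, the ServiceAccount's effective permissions can be verified with kubectl auth can-i:

# Read access should be granted...
kubectl auth can-i list pods --as=system:serviceaccount:monitoring:prometheus
kubectl auth can-i get nodes --as=system:serviceaccount:monitoring:prometheus
# ...while write access should be denied
kubectl auth can-i delete pods --as=system:serviceaccount:monitoring:prometheus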

Logging System Security Configuration

apiVersion: v1
kind: Secret
metadata:
  name: elasticsearch-secrets
  namespace: monitoring
type: Opaque
data:
  elastic-password: <base64-encoded-password>
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: elasticsearch-network-policy
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: elasticsearch
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
    - podSelector:
        matchLabels:
          app: fluentd
    ports:
    - protocol: TCP
      port: 9200
    - protocol: TCP
      port: 9300

Command Cheat Sheet

Command | Description
kubectl top nodes | Show node resource usage
kubectl top pods | Show Pod resource usage
kubectl logs <pod> | Show a Pod's logs
kubectl logs -f <pod> | Stream a Pod's logs in real time
kubectl port-forward svc/prometheus-server 9090:80 -n monitoring | Port-forward to the Prometheus UI
kubectl port-forward svc/grafana 3000:80 -n monitoring | Port-forward to the Grafana UI
kubectl exec -it <prometheus-pod> -n monitoring -- wget -qO- http://localhost:9090/metrics | Dump Prometheus's own /metrics endpoint
kubectl get servicemonitor -n monitoring | List ServiceMonitor resources (Prometheus Operator)
kubectl get prometheusrule -n monitoring | List PrometheusRule resources (Prometheus Operator)
kubectl logs <fluentd-pod> -n monitoring --tail=100 | Show the last 100 lines of Fluentd logs

Summary

Monitoring and log management are essential safeguards for a stable Kubernetes cluster. After working through this document, you should be able to:

  • Understand the core concepts of monitoring and log management and why they matter
  • Deploy and configure mainstream monitoring and logging systems
  • Design and implement a full-stack monitoring solution
  • Configure sophisticated monitoring rules and alerting policies
  • Troubleshoot common monitoring and logging problems
  • Follow monitoring and logging best practices and security guidelines

This K8s skills documentation series covers everything from basic concepts to advanced features, providing a comprehensive body of Kubernetes knowledge and practical guidance.