A First Look at End-to-End Tracing with OpenTelemetry: Instrumentation and Jaeger

Preface

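One day a developer from one of the business teams came to me for help.

  • Developer: My service is returning 504s, but I can't tell which hop is failing. Every request touches 4 microservices, 2 databases, 1 Redis instance, 1 message queue...
  • Long-suffering ops engineer: Stop, stop, say no more. We don't have distributed tracing yet, so all I can do is check the services by hand, one at a time.
  • First I asked him to walk me through the business logic and the call path, which took 10 minutes. Then I went through each service and the resources behind it, level by level, and finally located the fault half an hour later...

That is far too slow, so a tracing project went onto the agenda with a single goal: locate problems quickly and improve stability. OpenTelemetry is the current industry focus for tracing, so this ops engineer decided to dig into it.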

Environment

Component          Version
Operating system   Ubuntu 22.04.4 LTS
opentelemetry-sdk  1.35.0

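Installation

Here is a quick outline of how OpenTelemetry collects data. We'll get something running first and discuss the details afterwards.

  • OpenTelemetry collects data from instrumentation points embedded in your code (opentelemetry-sdk)
  • The data is then sent over a standard protocol to a backend that displays it (Jaeger UI)

Install the OpenTelemetry SDK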

pip3 install opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-api

Install Jaeger UI to display the data

docker pull docker.m.daocloud.io/jaegertracing/all-in-one:latest

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  docker.m.daocloud.io/jaegertracing/all-in-one:latest

Once the container is running, open http://127.0.0.1:16686

watermarked-first_1.png

First example

Web service

First we need a web service. We use tornado here; install it with: pip3 install tornado

import tornado.httpserver as httpserver
import tornado.web
from tornado.ioloop import IOLoop


class TestFlow(tornado.web.RequestHandler):
    def get(self):
        self.finish('hello world')


def applications():
    urls = []
    urls.append([r'/', TestFlow])
    return tornado.web.Application(urls)

def main():
    app = applications()
    server = httpserver.HTTPServer(app)
    server.bind(10000, '0.0.0.0')
    server.start(1)
    IOLoop.current().start()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt as e:
        IOLoop.current().stop()
    finally:
        IOLoop.current().close()

Check that the service responds:

watermarked-first_2.png

Adding instrumentation

import tornado.httpserver as httpserver
import tornado.web
from tornado.ioloop import IOLoop
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter


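# Register a global TracerProvider; the resource tags every span with the service name "s1".
trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({SERVICE_NAME: "s1"}))
)
tracer = trace.get_tracer(__name__)
# Batch finished spans and send them to Jaeger's OTLP/HTTP endpoint (port 4318).
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
trace.get_tracer_provider().add_span_processor(span_processor)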


class TestFlow(tornado.web.RequestHandler):
    def get(self):
        views()
        self.finish('hello world')

def views():
    span = tracer.start_span("s1-span")
    span.end()

def applications():
    urls = []
    urls.append([r'/', TestFlow])
    return tornado.web.Application(urls)

def main():
    app = applications()
    server = httpserver.HTTPServer(app)
    server.bind(10000, '0.0.0.0')
    server.start(1)
    IOLoop.current().start()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt as e:
        IOLoop.current().stop()
    finally:
        IOLoop.current().close()

Request the service again with curl http://localhost:10000, then open Jaeger UI:

watermarked-first_3.png

watermarked-first_4.png

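The data is there: the span we just added has been reported to Jaeger. BatchSpanProcessor exports in the background, so a span can take a few seconds to show up.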

Span attributes

Let's enrich the span with some attributes

def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    span.end()

watermarked-first_5.png

Tracing database access

def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
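    # Wrap the span in a Context so it can be passed on as the parent.
    ctx = trace.set_span_in_context(span)
    get_db(ctx)
    span.end()

def get_db(parent_ctx):
    # Passing context= makes this span a child of "s1-span".
    span = tracer.start_span("s1-span-db", context=parent_ctx)
    span.end()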

watermarked-first_6.png

Tracing across services

Add a second web service: s2.py

import tornado.httpserver as httpserver
import tornado.web
from tornado.ioloop import IOLoop
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator



trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({SERVICE_NAME: "s2"}))
)
tracer = trace.get_tracer(__name__)
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
trace.get_tracer_provider().add_span_processor(span_processor)


class TestFlow(tornado.web.RequestHandler):
    def get(self):
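        # Rebuild the caller's trace context from the incoming request headers.
        ctx = TraceContextTextMapPropagator().extract(self.request.headers)
        # Start s2's span as a child of the remote span from s1.
        span = tracer.start_span("s2-span", context=ctx)
        span.end()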
        self.finish('hello world')

def applications():
    urls = []
    urls.append([r'/', TestFlow])
    return tornado.web.Application(urls)

def main():
    app = applications()
    server = httpserver.HTTPServer(app)
    server.bind(20000, '0.0.0.0')
    server.start(1)
    IOLoop.current().start()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt as e:
        IOLoop.current().stop()
    finally:
        IOLoop.current().close()

Modify s1.py

from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
import requests

def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    ctx = trace.set_span_in_context(span)
    get_db(ctx)
    headers = {}
    # Write the trace context into the outgoing headers (W3C traceparent).
    TraceContextTextMapPropagator().inject(headers, context=ctx)
    requests.get("http://localhost:20000", headers=headers)
    span.end()

watermarked-first_7.png

Moving to k8s

jaeger

Manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: jaeger
  name: jaeger
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - image: docker.m.daocloud.io/jaegertracing/all-in-one:latest
        imagePullPolicy: Always
        name: jaeger
      dnsPolicy: ClusterFirst
      restartPolicy: Always

---

apiVersion: v1
kind: Service
metadata:
  labels:
    app: jaeger-service
  name: jaeger-service
  namespace: default
spec:
  ports:
  - name: port-4317
    port: 4317
    protocol: TCP
    targetPort: 4317
  - name: port-4318
    port: 4318
    protocol: TCP
    targetPort: 4318
  - name: port-16686
    port: 16686
    protocol: TCP
    targetPort: 16686
  selector:
    app: jaeger
  type: NodePort

s2

1) Build the image

Inside the k8s cluster, Jaeger is reached through a Service, so s2.py needs a small change

s2.py

...
import os

JAEGER_ADDR=os.environ.get('JAEGER_ADDR')
...
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=JAEGER_ADDR))
...
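One thing to watch: os.environ.get('JAEGER_ADDR') returns None when the variable is unset, so the exporter would be given endpoint=None and whatever it falls back to may not be what you intended. A minimal sketch of reading the variable with an explicit default follows; the helper name jaeger_addr and the default value are my own assumptions, not part of the original s2.py.

```python
import os

# Assumption: fall back to the local endpoint used earlier in the article.
DEFAULT_JAEGER_ADDR = "http://localhost:4318/v1/traces"


def jaeger_addr(environ=os.environ):
    """Return the OTLP/HTTP traces endpoint, falling back to the default.

    `environ` is injectable so the function can be checked without
    touching the real process environment.
    """
    value = environ.get("JAEGER_ADDR", "").strip()
    return value or DEFAULT_JAEGER_ADDR


print(jaeger_addr({}))
print(jaeger_addr({"JAEGER_ADDR": "http://jaeger-service:4318/v1/traces"}))
```

With this, the same s2.py runs unchanged on a laptop (no variable set) and in the cluster (variable set by the Deployment).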

Dockerfile

FROM python:3.8

WORKDIR /opt
RUN pip3 install tornado opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp -i https://pypi.tuna.tsinghua.edu.cn/simple
ADD s2.py /opt
CMD python3 s2.py

2) Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: s2
  name: s2
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: s2
  template:
    metadata:
      labels:
        app: s2
    spec:
      containers:
      - env:
        - name: JAEGER_ADDR
          value: http://jaeger-service:4318/v1/traces
        image: s2:v1
        imagePullPolicy: Always
        name: s2
      dnsPolicy: ClusterFirst
      restartPolicy: Always
---

apiVersion: v1
kind: Service
metadata:
  labels:
    app: s2-service
  name: s2-service
  namespace: default
spec:
  ports:
  - name: s2-port
    port: 20000
    protocol: TCP
    targetPort: 20000
  selector:
    app: s2
  type: NodePort

s1

1) Build the image

Inside the k8s cluster, s2 and Jaeger are both reached through Services, so s1.py needs a small change

s1.py

...
import os

S2_ADDR=os.environ.get('S2_ADDR')
JAEGER_ADDR=os.environ.get('JAEGER_ADDR')

...
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=JAEGER_ADDR))
...

def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    ctx = trace.set_span_in_context(span)
    get_db(ctx)
    headers = {}
    TraceContextTextMapPropagator().inject(headers, context=ctx)
    requests.get(S2_ADDR, headers=headers)
    span.end()

...

Dockerfile:

FROM python:3.8

WORKDIR /opt
RUN pip3 install tornado opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp -i https://pypi.tuna.tsinghua.edu.cn/simple
ADD s1.py /opt
CMD python3 s1.py

2) Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: s1
  name: s1
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: s1
  template:
    metadata:
      labels:
        app: s1
    spec:
      containers:
      - env:
        - name: S2_ADDR
          value: http://s2-service:20000
        - name: JAEGER_ADDR
          value: http://jaeger-service:4318/v1/traces
        image: s1:v1
        imagePullPolicy: Always
        name: s1
      dnsPolicy: ClusterFirst
      restartPolicy: Always

---

apiVersion: v1
kind: Service
metadata:
  labels:
    app: s1-service
  name: s1-service
  namespace: default
spec:
  ports:
  - name: s1-port
    port: 10000
    protocol: TCP
    targetPort: 10000
  selector:
    app: s1
  type: NodePort

Checking the result

▶ kubectl get pod -owide
NAME                            READY   STATUS    RESTARTS         AGE     IP             NODE       NOMINATED NODE   READINESS GATES
jaeger-6669cd7c4-4pl5j          1/1     Running   0                7m31s   10.244.0.236   minikube   <none>           <none>
s1-5c569c5b4b-lctzq             1/1     Running   0                73s     10.244.0.237   minikube   <none>           <none>
s2-5bb648dcdf-mlnbj             1/1     Running   0                61s     10.244.0.238   minikube   <none>           <none>

▶ kubectl get svc
NAME             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
jaeger-service   NodePort    10.106.13.217    <none>        4317:31891/TCP,4318:31997/TCP,16686:31002/TCP   5m49s
s1-service       NodePort    10.102.25.195    <none>        10000:32376/TCP                                 4m23s
s2-service       NodePort    10.103.114.198   <none>        20000:30032/TCP                                 3m40s

Send a test request:

  • Call the s1 service

    ▶ curl http://192.168.49.2:32376
    hello world%
  • View the trace in Jaeger UI at http://192.168.49.2:31002/

watermarked-first_10.png

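Summary

In this first example we collected trace records from the business services, that is, the full path a request takes, including database reads, cross-service calls and so on.

Throughout the trace, trace_id and span_id do the essential work: the former uniquely identifies the request's whole chain and ties every step together, while the latter identifies each individual operation on that chain.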
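Both IDs travel between services in the W3C traceparent header, laid out as version-trace_id-span_id-flags. A minimal sketch of pulling the two IDs out of that header with only the standard library is below; it handles the version 00 layout only, and the names parse_traceparent and TraceParent are hypothetical helpers of mine, not part of the OpenTelemetry API.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace_id (32 hex) - span_id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str   # shared by every span in the request
    span_id: str    # identifies the caller's span
    sampled: bool   # lowest bit of the flags field


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs and version "ff" are defined as invalid by the spec.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


# Example header taken from the W3C Trace Context specification.
tp = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(tp)
```

This is handy when grepping access logs or debugging a proxy: the 32-character trace_id is the value to paste into the Jaeger UI search box.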

watermarked-first_8.png

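  • Collect: instrumentation embedded in the code captures the flows you most want to monitor, such as database read/write latency or downstream service latency
  • Process: opentelemetry-sdk processes the data: filtering, buffering, batching
  • Export: the processed data is sent to a backend such as Jaeger over a standard protocol (for example OTLP, carried over gRPC or HTTP)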

watermarked-first_9.png

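Contact me

  • Get in touch if you'd like to discuss this in more depth


That's the end of this article.
My knowledge is limited, so if I've slipped up anywhere, corrections are very welcome...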
