
Learning Flink from Scratch: The Art of Data Transformation

In a real-time data processing pipeline, transformations are the bridge between data sources and output results, and the core stage where the computation logic lives. Flink ships a rich set of transformation operations that let developers process and analyze data streams flexibly. This article centers on the Flink DataStream API, explores Flink's data transformations, and combines them with the Kafka Source from the previous article to build a complete processing pipeline.

I. Data Transformation Overview

A data transformation turns raw input data into the desired output through a series of operations. In Flink these operations fall into the following categories:

  • Basic transformations: Map, Filter, FlatMap, etc.
  • Keyed transformations: KeyBy, aggregations such as Reduce and Aggregate, etc.
  • Multi-stream transformations: Union, Connect, Join, etc.
  • Stateful transformations: Keyed State, Operator State, etc.

These operations act as a "processing plant" for data: raw records pass through a series of "work stations" and come out as valuable information.

II. Environment Setup and Dependencies

To demonstrate the transformations we will keep using the Kafka Source environment from the previous article. If you have already completed the setup from "Learning Flink from Scratch: Data Sources", you can reuse that configuration; otherwise, please follow that article first.

1. Versions

  • Flink: 1.20.1
  • Kafka: 3.4.0
  • JDK: 17+
  • Gradle: 8.3+

2. Core Dependencies

Beyond the base Flink and Kafka dependencies, this article pulls in a few extra libraries to support richer processing scenarios:

dependencies {
    // Flink core dependencies (since Flink 1.15 the DataStream artifacts no longer carry a Scala suffix)
    implementation 'org.apache.flink:flink-java:1.20.1'
    implementation 'org.apache.flink:flink-streaming-java:1.20.1'

    // Flink Kafka Connector (released separately from Flink; pick the connector release matching your Flink version)
    implementation 'org.apache.flink:flink-connector-kafka:3.3.0-1.20'

    // Logging
    implementation 'org.apache.logging.log4j:log4j-api:2.17.1'
    implementation 'org.apache.logging.log4j:log4j-core:2.17.1'
    implementation 'org.apache.logging.log4j:log4j-slf4j-impl:2.17.1'

    // JSON handling (for JSON-formatted payloads)
    implementation 'com.fasterxml.jackson.core:jackson-databind:2.14.2'
    implementation 'com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.14.2'
}

III. Basic Transformations

Basic transformations are Flink's most common and simplest operations. They process each element of the stream independently and involve no state.

1. Map

A Map turns every element of the input stream into exactly one output element. For example, converting strings to upper case:

// Read string data from Kafka
DataStream<String> kafkaStream = env.fromSource(
    kafkaSource,
    WatermarkStrategy.noWatermarks(),
    "Kafka Source"
);

// Use Map to upper-case each string
DataStream<String> upperCaseStream = kafkaStream.map(s -> s.toUpperCase());

upperCaseStream.print("UppercaseData");

2. Filter

A Filter drops elements that fail a predicate, keeping only those that satisfy it:

// Keep only messages containing the keyword "flink"
DataStream<String> filteredStream = kafkaStream.filter(s -> s.toLowerCase().contains("flink"));

filteredStream.print("FilteredData");

3. FlatMap

A FlatMap is similar to Map, but each input element may produce zero, one, or many output elements; it is commonly used to split records:

// Split each line of text into words
DataStream<String> wordStream = kafkaStream.flatMap((String value, Collector<String> out) -> {
    // Split on spaces
    String[] words = value.split(" ");
    // Emit each word downstream
    for (String word : words) {
        out.collect(word);
    }
}).returns(Types.STRING); // lambdas with a generic Collector lose type information to erasure, so declare the output type explicitly

wordStream.print("WordData");

IV. Keyed Transformations

Keyed transformations group and aggregate data by key; they are the foundation for most non-trivial business logic.

1. KeyBy

KeyBy partitions the stream by the given key: elements with the same key are routed to the same partition for processing:

// Assume the Kafka messages have the form "userId:message"
// First split each message into a user ID and a message body
DataStream<Tuple2<String, String>> userMessageStream = kafkaStream.flatMap((String value, Collector<Tuple2<String, String>> out) -> {
    if (value.contains(":")) {
        String[] parts = value.split(":", 2);
        if (parts.length == 2) {
            out.collect(new Tuple2<>(parts[0], parts[1]));
        }
    }
}).returns(Types.TUPLE(Types.STRING, Types.STRING)); // declare the tuple type lost to erasure

// Key the stream (here by user ID)
KeyedStream<Tuple2<String, String>, String> keyedStream = userMessageStream.keyBy(tuple -> tuple.f0);
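The `split(":", 2)` call above deserves a note: the limit of 2 means only the first colon separates the user ID from the message, so the message body may itself contain colons. A quick plain-Java check, independent of Flink (the class and method names here are illustrative):

```java
public class SplitLimitDemo {
    // With limit 2, only the first colon splits; the rest stays in the payload.
    public static String[] keyAndPayload(String line) {
        return line.split(":", 2);
    }

    public static void main(String[] args) {
        String[] parts = keyAndPayload("user42:hello:world");
        // parts[0] = "user42", parts[1] = "hello:world"
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

Without the limit, "user42:hello:world" would split into three parts and the `parts.length == 2` guard would silently drop the record.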

2. Reduce

Reduce incrementally aggregates a KeyedStream and is typically used for running sums, maxima, and similar rolling results:

// Assume the messages have the form "userId:count", where count is a number
// First turn each message into a (userId, count) tuple
DataStream<Tuple2<String, Integer>> userCountStream = kafkaStream.flatMap((String value, Collector<Tuple2<String, Integer>> out) -> {
    if (value.contains(":")) {
        String[] parts = value.split(":");
        if (parts.length == 2) {
            try {
                int count = Integer.parseInt(parts[1]);
                out.collect(new Tuple2<>(parts[0], count));
            } catch (NumberFormatException e) {
                // Skip malformed counts
                LOG.warn("Invalid number format: {}", parts[1]);
            }
        }
    }
}).returns(Types.TUPLE(Types.STRING, Types.INT)); // declare the tuple type lost to erasure

// Key by user ID
KeyedStream<Tuple2<String, Integer>, String> keyedCountStream = userCountStream.keyBy(tuple -> tuple.f0);

// Use Reduce to compute a running total per user
DataStream<Tuple2<String, Integer>> sumStream = keyedCountStream.reduce((value1, value2) ->
    new Tuple2<>(value1.f0, value1.f1 + value2.f1)
);

sumStream.print("SumData");
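Conceptually, the Reduce above folds all values that share a key into one result. Setting aside Flink's incremental emission and keyed state, the per-key arithmetic can be sketched in plain Java (the class and method names are illustrative, not part of the Flink API):

```java
import java.util.HashMap;
import java.util.Map;

public class ReduceSketch {
    // Folds (userId, count) pairs per key, the way the Reduce does.
    public static Map<String, Integer> sumByKey(String[][] events) {
        Map<String, Integer> totals = new HashMap<>();
        for (String[] e : events) {
            // merge applies Integer::sum when the key already has a total
            totals.merge(e[0], Integer.parseInt(e[1]), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[][] events = {{"u1", "3"}, {"u2", "5"}, {"u1", "4"}};
        // u1 -> 7, u2 -> 5
        System.out.println(sumByKey(events));
    }
}
```

The important difference in Flink: Reduce emits an updated result for every incoming element, not just one final value per key.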

3. Custom Aggregation (Aggregate)

For more complex aggregations you can supply an AggregateFunction, which separates the accumulator type from the result type. Note that in the DataStream API, aggregate() is only available on windowed streams (KeyedStream itself does not expose it), so the keyed stream must be windowed first:

// Compute the average message length per user over 1-minute tumbling windows
DataStream<Double> avgLengthStream = keyedStream
    .window(TumblingProcessingTimeWindows.of(Duration.ofMinutes(1)))
    .aggregate(new AggregateFunction<Tuple2<String, String>, Tuple2<Integer, Integer>, Double>() {
        // Create the initial accumulator
        @Override
        public Tuple2<Integer, Integer> createAccumulator() {
            return new Tuple2<>(0, 0); // (total length, message count)
        }

        // Fold one element into the accumulator
        @Override
        public Tuple2<Integer, Integer> add(Tuple2<String, String> value, Tuple2<Integer, Integer> accumulator) {
            return new Tuple2<>(accumulator.f0 + value.f1.length(), accumulator.f1 + 1);
        }

        // Produce the final result
        @Override
        public Double getResult(Tuple2<Integer, Integer> accumulator) {
            return accumulator.f1 > 0 ? (double) accumulator.f0 / accumulator.f1 : 0.0;
        }

        // Merge two accumulators (required by the interface; used e.g. for session windows)
        @Override
        public Tuple2<Integer, Integer> merge(Tuple2<Integer, Integer> a, Tuple2<Integer, Integer> b) {
            return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
        }
    });

avgLengthStream.print("AvgLengthData");
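The accumulator logic of this AggregateFunction can be exercised outside Flink. The sketch below mirrors create/add/getResult/merge with plain int arrays (illustrative only, no Flink types):

```java
public class AvgLengthSketch {
    // Mirrors the AggregateFunction above: accumulator = {totalLength, messageCount}.
    public static int[] create() { return new int[]{0, 0}; }

    public static int[] add(String message, int[] acc) {
        return new int[]{acc[0] + message.length(), acc[1] + 1};
    }

    public static double result(int[] acc) {
        return acc[1] > 0 ? (double) acc[0] / acc[1] : 0.0;
    }

    public static int[] merge(int[] a, int[] b) {
        return new int[]{a[0] + b[0], a[1] + b[1]};
    }

    public static void main(String[] args) {
        int[] acc = create();
        acc = add("hello", acc);              // length 5
        acc = add("hi", acc);                 // length 2
        int[] other = add("flink", create()); // length 5, as if from another accumulator
        System.out.println(result(merge(acc, other))); // 12 / 3 = 4.0
    }
}
```

Keeping the sum and the count in the accumulator, rather than a running average, is what makes merge() correct: averages of averages do not combine, but sums and counts do.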

V. Multi-Stream Transformations

Real applications frequently work with several data streams at once. Flink offers a number of multi-stream operations for handling these more complex scenarios.

1. Union

Union merges two or more streams of the same type into a single stream:

// Assume two Kafka topics, both carrying string data
KafkaSource<String> kafkaSource1 = KafkaSource.<String>builder()
    .setBootstrapServers(kafkaBootstrapServers)
    .setTopics("topic1")
    .setGroupId(consumerGroup)
    .setDeserializer(KafkaRecordDeserializationSchema.valueOnly(StringDeserializer.class))
    .setStartingOffsets(OffsetsInitializer.earliest())
    .build();

KafkaSource<String> kafkaSource2 = KafkaSource.<String>builder()
    .setBootstrapServers(kafkaBootstrapServers)
    .setTopics("topic2")
    .setGroupId(consumerGroup)
    .setDeserializer(KafkaRecordDeserializationSchema.valueOnly(StringDeserializer.class))
    .setStartingOffsets(OffsetsInitializer.earliest())
    .build();

// Create the two streams
DataStream<String> stream1 = env.fromSource(kafkaSource1, WatermarkStrategy.noWatermarks(), "Kafka Source 1");
DataStream<String> stream2 = env.fromSource(kafkaSource2, WatermarkStrategy.noWatermarks(), "Kafka Source 2");

// Merge them
DataStream<String> unionStream = stream1.union(stream2);

unionStream.print("UnionData");

2. Connect

Connect joins two streams of possibly different types while preserving each one's element type, which suits scenarios where the two kinds of records must be processed cooperatively:

// Assume a user stream and an order stream
// User stream format: userId:username
DataStream<Tuple2<String, String>> userStream = kafkaStream1.flatMap((String value, Collector<Tuple2<String, String>> out) -> {
    if (value.contains(":")) {
        String[] parts = value.split(":");
        if (parts.length == 2) {
            out.collect(new Tuple2<>(parts[0], parts[1]));
        }
    }
}).returns(Types.TUPLE(Types.STRING, Types.STRING));

// Order stream format: orderId:userId:amount
DataStream<Tuple3<String, String, Double>> orderStream = kafkaStream2.flatMap((String value, Collector<Tuple3<String, String, Double>> out) -> {
    if (value.contains(":")) {
        String[] parts = value.split(":");
        if (parts.length == 3) {
            try {
                double amount = Double.parseDouble(parts[2]);
                out.collect(new Tuple3<>(parts[0], parts[1], amount));
            } catch (NumberFormatException e) {
                LOG.warn("Invalid number format: {}", parts[2]);
            }
        }
    }
}).returns(Types.TUPLE(Types.STRING, Types.STRING, Types.DOUBLE));

// Connect the two streams by key (here the user ID)
ConnectedStreams<Tuple2<String, String>, Tuple3<String, String, Double>> connectedStreams =
    userStream.keyBy(tuple -> tuple.f0).connect(orderStream.keyBy(tuple -> tuple.f1));

// Process the connected streams; ConnectedStreams.map takes a CoMapFunction
// with one method per input side
DataStream<String> resultStream = connectedStreams.map(new CoMapFunction<Tuple2<String, String>, Tuple3<String, String, Double>, String>() {
    @Override
    public String map1(Tuple2<String, String> user) {
        return "User: " + user.f1;
    }

    @Override
    public String map2(Tuple3<String, String, Double> order) {
        return "Order from user " + order.f1 + ", amount: " + order.f2;
    }
});

resultStream.print("ConnectedData");

VI. Hands-On: A Real-Time Log Analysis System

Now let's combine the Kafka Source from the previous article with the transformations introduced above to build a simple real-time log analysis system.

1. Requirements

Read application logs from Kafka and, in real time, analyze the distribution of log levels, the number of error logs, and the log volume per hour.

2. Data Model

Assume the log format is timestamp|logLevel|message, for example: 2025-09-22 12:30:45|ERROR|Failed to connect to database

3. Full Implementation

Create a main class, LogAnalysisDemo, that implements the log analysis pipeline.
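The demo relies on a LogEntry class holding the parsed fields, whose definition is not shown elsewhere; here is a minimal sketch whose shape is inferred from the constructor and getters the demo calls:

```java
import java.util.Date;

// Minimal LogEntry assumed by LogAnalysisDemo; the field names are inferred
// from the getters the demo uses (getTimestamp, getLogLevel, getMessage).
public class LogEntry implements java.io.Serializable {
    private final Date timestamp;
    private final String logLevel;
    private final String message;

    public LogEntry(Date timestamp, String logLevel, String message) {
        this.timestamp = timestamp;
        this.logLevel = logLevel;
        this.message = message;
    }

    public Date getTimestamp() { return timestamp; }
    public String getLogLevel() { return logLevel; }
    public String getMessage() { return message; }

    @Override
    public String toString() {
        return logLevel + ": " + message;
    }
}
```

Note: with final fields and no no-arg constructor, Flink will fall back to its generic (Kryo) serializer for this type; add a no-arg constructor and setters if you want Flink's more efficient POJO serializer.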

package com.cn.daimajiangxin.flink.transformation;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializationSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.Duration;
import java.util.Date;

public class LogAnalysisDemo {
    private static final Logger LOG = LoggerFactory.getLogger(LogAnalysisDemo.class);
    // Note: SimpleDateFormat is not thread-safe; with parallelism > 1 prefer a
    // per-instance formatter or java.time.format.DateTimeFormatter.
    private static final SimpleDateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    public static void main(String[] args) throws Exception {
        // 1. Create the Flink stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Enable checkpointing
        env.enableCheckpointing(10000); // checkpoint every 10 seconds
        env.getCheckpointConfig().setCheckpointTimeout(60000); // checkpoint timeout: 60 seconds
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000); // minimum pause between checkpoints
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1); // at most one concurrent checkpoint

        // 2. Kafka connection parameters
        String kafkaBootstrapServers = "172.30.244.152:9092";
        String topic = "app_logs";
        String consumerGroup = "flink-log-analysis";

        // 3. Define the Kafka Source
        KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
                .setBootstrapServers(kafkaBootstrapServers)
                .setTopics(topic)
                .setGroupId(consumerGroup)
                .setDeserializer(new KafkaRecordDeserializationSchema<String>() {
                    @Override
                    public void deserialize(ConsumerRecord<byte[], byte[]> record, Collector<String> out) throws IOException {
                        String value = new String(record.value(), StandardCharsets.UTF_8);
                        out.collect(value);
                    }

                    @Override
                    public TypeInformation<String> getProducedType() {
                        return TypeInformation.of(String.class);
                    }
                })
                .setStartingOffsets(OffsetsInitializer.earliest())
                // Extra Kafka client properties for stability
                .setProperty("enable.auto.commit", "false") // offsets are committed by Flink
                .setProperty("session.timeout.ms", "45000")
                .setProperty("max.poll.interval.ms", "300000")
                .setProperty("heartbeat.interval.ms", "10000")
                .setProperty("retry.backoff.ms", "1000")
                .setProperty("reconnect.backoff.max.ms", "10000")
                .setProperty("reconnect.backoff.ms", "1000")
                .build();

        // 4. Read data from Kafka
        DataStream<String> logStream = env.fromSource(
                kafkaSource,
                WatermarkStrategy.noWatermarks(),
                "Kafka Log Source"
        );

        // 5. Parse the log lines
        DataStream<LogEntry> parsedLogStream = logStream.flatMap(new FlatMapFunction<String, LogEntry>() {
            @Override
            public void flatMap(String value, Collector<LogEntry> out) throws Exception {
                try {
                    String[] parts = value.split("\\|", 3);
                    if (parts.length == 3) {
                        Date timestamp = DATE_FORMAT.parse(parts[0]);
                        String logLevel = parts[1];
                        String message = parts[2];
                        LogEntry entry = new LogEntry(timestamp, logLevel, message);
                        LOG.info("Parsed log entry: {}", entry);
                        out.collect(entry);
                    } else {
                        LOG.warn("Failed to parse log entry (wrong part count): {}", value);
                    }
                } catch (ParseException e) {
                    LOG.warn("Failed to parse log entry: {}", value, e);
                } catch (Exception e) {
                    LOG.error("Unexpected error while parsing log entry: {}", value, e);
                }
            }
        });

        // 6. Log-level distribution
        KeyedStream<LogEntry, String> levelKeyedStream = parsedLogStream.keyBy(entry -> entry.getLogLevel());
        DataStream<Tuple2<String, Long>> levelCountStream = levelKeyedStream
                .window(TumblingProcessingTimeWindows.of(Duration.ofMinutes(1))) // one count per minute
                .aggregate(
                        new AggregateFunction<LogEntry, Long, Long>() {
                            @Override
                            public Long createAccumulator() {
                                return 0L;
                            }

                            @Override
                            public Long add(LogEntry value, Long accumulator) {
                                return accumulator + 1;
                            }

                            @Override
                            public Long getResult(Long accumulator) {
                                return accumulator;
                            }

                            @Override
                            public Long merge(Long a, Long b) {
                                return a + b;
                            }
                        },
                        new ProcessWindowFunction<Long, Tuple2<String, Long>, String, TimeWindow>() {
                            @Override
                            public void process(String level, Context context, Iterable<Long> elements, Collector<Tuple2<String, Long>> out) throws Exception {
                                long count = elements.iterator().next();
                                out.collect(new Tuple2<>(level, count));
                            }
                        }
                );
        levelCountStream.print("LogLevelCount");

        // 7. Count error logs
        DataStream<LogEntry> errorLogStream = parsedLogStream.filter(entry -> entry.getLogLevel().equals("ERROR"));
        KeyedStream<LogEntry, String> errorKeyedStream = errorLogStream.keyBy(entry -> "ERROR"); // all error logs share one key
        DataStream<Tuple2<String, Long>> errorCountStream = errorKeyedStream.window(TumblingProcessingTimeWindows.of(Duration.ofMinutes(1)))
                .aggregate(
                        new AggregateFunction<LogEntry, Long, Long>() {
                            @Override
                            public Long createAccumulator() {
                                return 0L;
                            }

                            @Override
                            public Long add(LogEntry value, Long accumulator) {
                                return accumulator + 1;
                            }

                            @Override
                            public Long getResult(Long accumulator) {
                                return accumulator;
                            }

                            @Override
                            public Long merge(Long a, Long b) {
                                return a + b;
                            }
                        },
                        new ProcessWindowFunction<Long, Tuple2<String, Long>, String, TimeWindow>() {
                            @Override
                            public void process(String key, Context context, Iterable<Long> elements, Collector<Tuple2<String, Long>> out) {
                                long count = elements.iterator().next();
                                out.collect(new Tuple2<>("ERROR_COUNT", count));
                            }
                        }
                );

        errorCountStream.print("ErrorCount");

        // 8. Hourly log volume
        DataStream<Tuple2<String, LogEntry>> hourlyLogStream = parsedLogStream.map(new MapFunction<LogEntry, Tuple2<String, LogEntry>>() {
            @Override
            public Tuple2<String, LogEntry> map(LogEntry entry) throws Exception {
                String hourKey = new SimpleDateFormat("yyyy-MM-dd HH").format(entry.getTimestamp());
                return new Tuple2<>(hourKey, entry);
            }
        }).returns(new TypeHint<Tuple2<String, LogEntry>>() {});

        KeyedStream<Tuple2<String, LogEntry>, String> hourlyKeyedStream = hourlyLogStream.keyBy(tuple -> tuple.f0);
        DataStream<Tuple3<String, Long, Long>> hourlyCountStream = hourlyKeyedStream
                .window(TumblingProcessingTimeWindows.of(Duration.ofHours(1)))
                .aggregate(
                        new AggregateFunction<Tuple2<String, LogEntry>, Long, Long>() {
                            @Override
                            public Long createAccumulator() {
                                return 0L;
                            }

                            @Override
                            public Long add(Tuple2<String, LogEntry> value, Long accumulator) {
                                return accumulator + 1;
                            }

                            @Override
                            public Long getResult(Long accumulator) {
                                return accumulator;
                            }

                            @Override
                            public Long merge(Long a, Long b) {
                                return a + b;
                            }
                        },
                        new ProcessWindowFunction<Long, Tuple3<String, Long, Long>, String, TimeWindow>() {
                            @Override
                            public void process(String hour, Context context, Iterable<Long> elements, Collector<Tuple3<String, Long, Long>> out) {
                                long count = elements.iterator().next();
                                out.collect(new Tuple3<>(hour, count, context.window().getEnd()));
                            }
                        }
                );

        hourlyCountStream.print("HourlyLogCount");

        // 9. Submit the job
        env.execute("Log Analysis Demo");
    }
}

VII. Testing and Verification

1. Create the Test Topic

Create the log topic in Kafka:

# Create the log topic
$KAFKA_HOME/bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic app_logs

2. Send Test Data

Use a Kafka console producer to send test log lines:

# Start a Kafka console producer (--broker-list is deprecated in modern Kafka; use --bootstrap-server)
$KAFKA_HOME/bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic app_logs

# Enter the following test lines (one per line)
2025-10-05 12:30:45|INFO|Application started
2025-10-05 12:31:10|DEBUG|Connecting to database
2025-10-05 12:31:15|ERROR|Failed to connect to database
2025-10-05 12:32:00|INFO|Retry connection to database
2025-10-05 12:32:05|INFO|Database connected successfully
2025-10-05 12:33:20|WARN|Low disk space warning
2025-10-05 12:34:00|ERROR|Out of memory error
2025-10-05 13:00:00|INFO|Daily report generated
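You can sanity-check the per-level counts the job should report for these eight lines by hand, or with a quick standalone snippet (no Flink involved; the class name is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class SampleDataCheck {
    static final String[] LINES = {
        "2025-10-05 12:30:45|INFO|Application started",
        "2025-10-05 12:31:10|DEBUG|Connecting to database",
        "2025-10-05 12:31:15|ERROR|Failed to connect to database",
        "2025-10-05 12:32:00|INFO|Retry connection to database",
        "2025-10-05 12:32:05|INFO|Database connected successfully",
        "2025-10-05 12:33:20|WARN|Low disk space warning",
        "2025-10-05 12:34:00|ERROR|Out of memory error",
        "2025-10-05 13:00:00|INFO|Daily report generated",
    };

    public static Map<String, Integer> countByLevel() {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : LINES) {
            // Same split as the job: timestamp|logLevel|message
            counts.merge(line.split("\\|", 3)[1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Expected: INFO -> 4, DEBUG -> 1, ERROR -> 2, WARN -> 1
        System.out.println(countByLevel());
    }
}
```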

3. Run and Verify

Run the main method of LogAnalysisDemo from your IDE and watch the console. Since the job uses processing-time windows, the exact grouping and window-end timestamps depend on when records arrive; you should see output similar to:

LogLevelCount: INFO, 4
LogLevelCount: ERROR, 2
LogLevelCount: DEBUG, 1
LogLevelCount: WARN, 1
ErrorCount: ERROR_COUNT, 2
HourlyLogCount: 2025-10-05 12, 7, 1730793600000
HourlyLogCount: 2025-10-05 13, 1, 1730797200000

VIII. Performance Tuning and Best Practices

1. Parallelism Tuning

Setting an appropriate parallelism makes full use of cluster resources and improves throughput:

// Set the global parallelism
env.setParallelism(4);

2. Avoiding Data Skew

Data skew makes some subtasks much slower than others and drags down overall throughput. Ways to mitigate it:

  • Design keys carefully to avoid hot keys
  • Use a custom partitioner
  • Pre-aggregate skewed data
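One common form of pre-aggregation is key salting: split a hot key into N random subkeys, aggregate partially per subkey, then strip the salt and combine the partial results in a second aggregation keyed by the original key. A minimal sketch of the salting step (illustrative helper names, not a Flink API):

```java
import java.util.concurrent.ThreadLocalRandom;

public class KeySalting {
    // Spreads a hot key over `buckets` subkeys so the partial aggregation
    // is distributed; a second aggregation over unsalt(key) combines them.
    public static String salt(String key, int buckets) {
        int bucket = ThreadLocalRandom.current().nextInt(buckets);
        return key + "#" + bucket;
    }

    public static String unsalt(String saltedKey) {
        return saltedKey.substring(0, saltedKey.lastIndexOf('#'));
    }

    public static void main(String[] args) {
        String salted = salt("hotUser", 4);
        System.out.println(salted + " -> " + unsalt(salted));
    }
}
```

The trade-off: salting only works for aggregations that combine associatively (sums, counts, max), since the per-bucket partials must be merged in the second stage.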

3. State Management

For stateful operations, managing state well improves both reliability and performance:

  • Use checkpoints to keep state consistent
  • For large state, consider the RocksDB state backend
  • Clean up state that is no longer needed

IX. Summary and Outlook

This article walked through Flink's transformation operations, covering basic, keyed, and multi-stream transformations, and combined them with a Kafka Source to build a real-time log analysis system. With these building blocks you can flexibly process and analyze real-time streams and implement a wide range of business logic.

Upcoming articles will dig deeper into Flink's window computations, state management, and data output (Sinks), including the various sink connectors, output format configuration, and delivery guarantees, to round out your picture of Flink's end-to-end processing capabilities. Stay tuned!


Original post: http://blog.daimajiangxin.com.cn

Source code: https://gitee.com/daimajiangxin/flink-learning
