文章目錄

  • 前言
  • 驗證SQL代碼
  • 函數分解詳解
  • 1. SEQUENCE函數
  • 2. FILTER函數
  • 3. SIZE函數
  • 實際應用示例
  • 場景1: 單個字符差異
  • 場景2: 多個字符差異
  • 實際應用_[使用序列函數(Spark 3.0+)]
  • 驗證數據sql
  • 字段解釋
  • 總結

前言

  • SEQUENCE函數 與 FILTER函數 與 SIZE函數 組合比較字符串的差異

驗證SQL代碼

  • 為了驗證SIZE(FILTER(SEQUENCE(…)))這個字符差異分析函數的正確性,我編寫了以下測試SQL:
  • 可以直接運行
-- 驗證字符差異分析函數的測試SQL
WITH test_data AS (
  SELECT 
    'VIN001' as vin,
    'ABC123' as t1_engine,
    'ABC124' as t2_engine,
    '1個字符差異' as expected_result
  UNION ALL
  SELECT 
    'VIN002' as vin,
    'CA6DM2-42E52' as t1_engine,
    'CA6DM2-46E52' as t2_engine,
    '1個字符差異' as expected_result
  UNION ALL
  SELECT 
    'VIN003' as vin,
    'XYZ789' as t1_engine,
    'XWZ789' as t2_engine,
    '2個字符差異' as expected_result
  UNION ALL
  SELECT 
    'VIN004' as vin,
    'ENGINE001' as t1_engine,
    'ENGINE002' as t2_engine,
    '1個字符差異' as expected_result
  UNION ALL
  SELECT 
    'VIN005' as vin,
    'TEST1234' as t1_engine,
    'TEST5678' as t2_engine,
    '4個字符差異' as expected_result
),
diff_analysis AS (
  SELECT 
    vin,
    t1_engine,
    t2_engine,
    expected_result,
    -- 核心分析函數
    SIZE(
      FILTER(
        SEQUENCE(1, LENGTH(t1_engine)),
        i -> SUBSTRING(UPPER(t1_engine), i, 1) != SUBSTRING(UPPER(t2_engine), i, 1)
      )
    ) as actual_diff_count,
    -- 分解步驟驗證
    SEQUENCE(1, LENGTH(t1_engine)) as char_positions,
    FILTER(
      SEQUENCE(1, LENGTH(t1_engine)),
      i -> SUBSTRING(UPPER(t1_engine), i, 1) != SUBSTRING(UPPER(t2_engine), i, 1)
    ) as diff_positions,
    -- 逐個字符對比
    TRANSFORM(
      SEQUENCE(1, LENGTH(t1_engine)),
      i -> STRUCT(
        i as position,
        SUBSTRING(UPPER(t1_engine), i, 1) as t1_char,
        SUBSTRING(UPPER(t2_engine), i, 1) as t2_char,
        CASE WHEN SUBSTRING(UPPER(t1_engine), i, 1) != SUBSTRING(UPPER(t2_engine), i, 1) 
             THEN 'DIFF' 
             ELSE 'SAME' END as status
      )
    ) as detailed_comparison
  FROM test_data
)
SELECT 
  vin,
  t1_engine,
  t2_engine,
  expected_result,
  actual_diff_count,
  char_positions,
  diff_positions,
  detailed_comparison,
  CASE 
    WHEN actual_diff_count = 1 AND expected_result LIKE '%1個字符差異%' THEN '✅ 正確'
    WHEN actual_diff_count = 2 AND expected_result LIKE '%2個字符差異%' THEN '✅ 正確'
    WHEN actual_diff_count = 4 AND expected_result LIKE '%4個字符差異%' THEN '✅ 正確'
    ELSE '❌ 異常'
  END as verification_result
FROM diff_analysis
ORDER BY vin;

函數分解詳解

1. SEQUENCE函數

SEQUENCE(1, LENGTH(t1_engine_no))
  • 作用: 生成從1到字符串長度的整數序列
  • 示例: 對於’ABC123’ (長度6),生成[1, 2, 3, 4, 5, 6]
  • 目的: 為每個字符位置創建索引

2. FILTER函數

FILTER(
    sequence_result,
    i -> SUBSTRING(UPPER(t1_engine_no), i, 1) != SUBSTRING(UPPER(t2_engine_no), i, 1)
)
  • 作用: 過濾出字符不同的位置索引
  • Lambda函數: i -> condition 對每個位置i檢查字符是否不同
  • SUBSTRING函數: 提取指定位置的單個字符
  • UPPER函數: 統一轉為大寫,忽略大小寫差異

3. SIZE函數

SIZE(filtered_sequence)
  • 作用: 計算過濾後序列的長度
  • 結果: 即字符差異的數量

實際應用示例

場景1: 單個字符差異

-- 輸入: 'ABC123' vs 'ABC124'
-- 處理過程:
SEQUENCE(1, 6) → [1,2,3,4,5,6]
FILTER(...) → [6]  -- 只有第6個字符不同 ('3' vs '4')
SIZE([6]) → 1
-- 結果: 1個字符差異 ✅

場景2: 多個字符差異

-- 輸入: 'TEST1234' vs 'TEST5678'
-- 處理過程:
SEQUENCE(1, 8) → [1,2,3,4,5,6,7,8]
FILTER(...) → [5,6,7,8]  -- 第5-8個字符不同 ('1234' vs '5678')
SIZE([5,6,7,8]) → 4
-- 結果: 4個字符差異 ✅

實際應用_[使用序列函數(Spark 3.0+)]

-- 使用sequence函數生成位置序列(Spark 3.0+)
WITH diff_analysis AS (
    SELECT 
        t1.vin,
        t1.engine_no as t1_engine,
        t2.engine_no as t2_engine,
        SIZE(
            FILTER(
                SEQUENCE(1, LENGTH(t1.engine_no)),
                i -> SUBSTRING(UPPER(t1.engine_no), i, 1) != SUBSTRING(UPPER(t2.engine_no), i, 1)
            )
        ) as diff_char_count
    FROM (
        SELECT vin, vehicle_model, engine_no 
        FROM dwd.dwd_001
        WHERE `date` = 20251119
        AND engine_no NOT IN ('', 'N/A')
    ) t1
    INNER JOIN (
        SELECT vin, vehicle_model, engine_no 
        FROM ods.ods_001 
        WHERE `date` = 20251119
        AND engine_no != 'NA'
    ) t2 ON t1.vin = t2.vin
    WHERE LENGTH(t1.engine_no) = LENGTH(t2.engine_no)
        AND UPPER(t1.engine_no) != UPPER(t2.engine_no)
)
SELECT 
    diff_char_count,
    COUNT(*) as record_count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM diff_analysis
WHERE diff_char_count BETWEEN 1 AND 5
GROUP BY diff_char_count
ORDER BY diff_char_count;
  • 查詢到的數據結果

diff_char_count

record_count

percentage

1

434550

29.48

2

596840

40.49

3

253237

17.18

4

163933

11.12

5

25661

1.74

驗證數據sql

SELECT diff_char_count, vin, t1_engine, t2_engine FROM diff_analysis
WHERE diff_char_count BETWEEN 1 AND 1 limit 100;

字段解釋

字段名

含義説明

diff_char_count

字符差異數量,表示兩個表中相同VIN記錄的engine_no字段有多少個字符不同

record_count

記錄數量,表示對應字符差異數量的數據有多少條

percentage

百分比,表示該差異級別在所有差異記錄中的佔比


總結

  • 這個SIZE(FILTER(SEQUENCE(…)))組合函數是一個強大的字符串差異分析工具,通過函數式編程的方式優雅地解決了字符級比較的問題。在實際數據質量分析項目中,它幫助我們精確識別了不同級別的數據不一致問題,為數據治理提供了有力的技術支持。
  • 通過驗證SQL的測試,確認該函數能夠準確計算字符串間的字符差異數量,為後續的數據清洗和映射規則制定提供了可靠依據。