文章目錄
- 前言
- 驗證SQL代碼
- 函數分解詳解
- 1. SEQUENCE函數
- 2. FILTER函數
- 3. SIZE函數
- 實際應用示例
- 場景1: 單個字符差異
- 場景2: 多個字符差異
- 實際應用_[使用序列函數(Spark 3.0+)]
- 驗證數據sql
- 字段解釋
- 總結
前言
- SEQUENCE函數 與 FILTER函數 與 SIZE函數 組合比較字符串的差異
驗證SQL代碼
- 為了驗證SIZE(FILTER(SEQUENCE(…)))這個字符差異分析函數的正確性,我編寫了以下測試SQL:
可以直接運行
-- 驗證字符差異分析函數的測試SQL
WITH test_data AS (
SELECT
'VIN001' as vin,
'ABC123' as t1_engine,
'ABC124' as t2_engine,
'1個字符差異' as expected_result
UNION ALL
SELECT
'VIN002' as vin,
'CA6DM2-42E52' as t1_engine,
'CA6DM2-46E52' as t2_engine,
'1個字符差異' as expected_result
UNION ALL
SELECT
'VIN003' as vin,
'XYZ789' as t1_engine,
'XWZ789' as t2_engine,
'2個字符差異' as expected_result
UNION ALL
SELECT
'VIN004' as vin,
'ENGINE001' as t1_engine,
'ENGINE002' as t2_engine,
'1個字符差異' as expected_result
UNION ALL
SELECT
'VIN005' as vin,
'TEST1234' as t1_engine,
'TEST5678' as t2_engine,
'4個字符差異' as expected_result
),
diff_analysis AS (
SELECT
vin,
t1_engine,
t2_engine,
expected_result,
-- 核心分析函數
SIZE(
FILTER(
SEQUENCE(1, LENGTH(t1_engine)),
i -> SUBSTRING(UPPER(t1_engine), i, 1) != SUBSTRING(UPPER(t2_engine), i, 1)
)
) as actual_diff_count,
-- 分解步驟驗證
SEQUENCE(1, LENGTH(t1_engine)) as char_positions,
FILTER(
SEQUENCE(1, LENGTH(t1_engine)),
i -> SUBSTRING(UPPER(t1_engine), i, 1) != SUBSTRING(UPPER(t2_engine), i, 1)
) as diff_positions,
-- 逐個字符對比
TRANSFORM(
SEQUENCE(1, LENGTH(t1_engine)),
i -> STRUCT(
i as position,
SUBSTRING(UPPER(t1_engine), i, 1) as t1_char,
SUBSTRING(UPPER(t2_engine), i, 1) as t2_char,
CASE WHEN SUBSTRING(UPPER(t1_engine), i, 1) != SUBSTRING(UPPER(t2_engine), i, 1)
THEN 'DIFF'
ELSE 'SAME' END as status
)
) as detailed_comparison
FROM test_data
)
SELECT
vin,
t1_engine,
t2_engine,
expected_result,
actual_diff_count,
char_positions,
diff_positions,
detailed_comparison,
CASE
WHEN actual_diff_count = 1 AND expected_result LIKE '%1個字符差異%' THEN '✅ 正確'
WHEN actual_diff_count = 2 AND expected_result LIKE '%2個字符差異%' THEN '✅ 正確'
WHEN actual_diff_count = 4 AND expected_result LIKE '%4個字符差異%' THEN '✅ 正確'
ELSE '❌ 異常'
END as verification_result
FROM diff_analysis
ORDER BY vin;
函數分解詳解
1. SEQUENCE函數
SEQUENCE(1, LENGTH(t1_engine_no))
- 作用: 生成從1到字符串長度的整數序列
- 示例: 對於’ABC123’ (長度6),生成[1, 2, 3, 4, 5, 6]
- 目的: 為每個字符位置創建索引
2. FILTER函數
FILTER(
sequence_result,
i -> SUBSTRING(UPPER(t1_engine_no), i, 1) != SUBSTRING(UPPER(t2_engine_no), i, 1)
)
- 作用: 過濾出字符不同的位置索引
- Lambda函數: i -> condition 對每個位置i檢查字符是否不同
- SUBSTRING函數: 提取指定位置的單個字符
- UPPER函數: 統一轉為大寫,忽略大小寫差異
3. SIZE函數
SIZE(filtered_sequence)
- 作用: 計算過濾後序列的長度
- 結果: 即字符差異的數量
實際應用示例
場景1: 單個字符差異
-- 輸入: 'ABC123' vs 'ABC124'
-- 處理過程:
SEQUENCE(1, 6) → [1,2,3,4,5,6]
FILTER(...) → [6] -- 只有第6個字符不同 ('3' vs '4')
SIZE([6]) → 1
-- 結果: 1個字符差異 ✅
場景2: 多個字符差異
-- 輸入: 'TEST1234' vs 'TEST5678'
-- 處理過程:
SEQUENCE(1, 8) → [1,2,3,4,5,6,7,8]
FILTER(...) → [5,6,7,8] -- 第5-8個字符不同 ('1234' vs '5678')
SIZE([5,6,7,8]) → 4
-- 結果: 4個字符差異 ✅
實際應用_[使用序列函數(Spark 3.0+)]
-- 使用sequence函數生成位置序列(Spark 3.0+)
WITH diff_analysis AS (
SELECT
t1.vin,
t1.engine_no as t1_engine,
t2.engine_no as t2_engine,
SIZE(
FILTER(
SEQUENCE(1, LENGTH(t1.engine_no)),
i -> SUBSTRING(UPPER(t1.engine_no), i, 1) != SUBSTRING(UPPER(t2.engine_no), i, 1)
)
) as diff_char_count
FROM (
SELECT vin, vehicle_model, engine_no
FROM dwd.dwd_001
WHERE `date` = 20251119
AND engine_no NOT IN ('', 'N/A')
) t1
INNER JOIN (
SELECT vin, vehicle_model, engine_no
FROM ods.ods_001
WHERE `date` = 20251119
AND engine_no != 'NA'
) t2 ON t1.vin = t2.vin
WHERE LENGTH(t1.engine_no) = LENGTH(t2.engine_no)
AND UPPER(t1.engine_no) != UPPER(t2.engine_no)
)
SELECT
diff_char_count,
COUNT(*) as record_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM diff_analysis
WHERE diff_char_count BETWEEN 1 AND 5
GROUP BY diff_char_count
ORDER BY diff_char_count;
- 查詢到的數據結果
|
diff_char_count
|
record_count
|
percentage
|
|
1
|
434550
|
29.48
|
|
2
|
596840
|
40.49
|
|
3
|
253237
|
17.18
|
|
4
|
163933
|
11.12
|
|
5
|
25661
|
1.74
|
驗證數據sql
SELECT diff_char_count, vin, t1_engine, t2_engine FROM diff_analysis
WHERE diff_char_count BETWEEN 1 AND 1 limit 100;
字段解釋
|
字段名
|
含義説明
|
|
diff_char_count
|
字符差異數量,表示兩個表中相同VIN記錄的engine_no字段有多少個字符不同
|
|
record_count
|
記錄數量,表示對應字符差異數量的數據有多少條
|
|
percentage
|
百分比,表示該差異級別在所有差異記錄中的佔比
|
總結
- 這個SIZE(FILTER(SEQUENCE(…)))組合函數是一個強大的字符串差異分析工具,通過函數式編程的方式優雅地解決了字符級比較的問題。在實際數據質量分析項目中,它幫助我們精確識別了不同級別的數據不一致問題,為數據治理提供了有力的技術支持。
- 通過驗證SQL的測試,確認該函數能夠準確計算字符串間的字符差異數量,為後續的數據清洗和映射規則制定提供了可靠依據。