[Dev Tips] How to Get the Stroke Count of a Chinese Character?

阿爾的代碼屋博客 (tags: python, tips)

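Background

While writing a simple divination script, I ran into this interesting problem. For a handful of specific characters we could simply hard-code a dictionary in the script, but what if we want the stroke count of an arbitrary Chinese character?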
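For the handful-of-characters case, a hard-coded table might look like the sketch below. The names `KNOWN_STROKES` and `lookup_known_strokes` are made up for illustration and are not part of the original script.

```python
# Hypothetical hard-coded table: it only covers the characters listed here.
KNOWN_STROKES = {
    "一": 1,
    "二": 2,
    "三": 3,
    "十": 2,
    "人": 2,
    "大": 3,
}


def lookup_known_strokes(char: str) -> int:
    """Return the stroke count for a character in the table, or raise KeyError."""
    return KNOWN_STROKES[char]


print(lookup_known_strokes("大"))  # prints 3
```

Any character missing from the table raises `KeyError`, which is exactly why this approach does not scale to arbitrary input.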

The pypinyin library

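from pypinyin import pinyin, Style

def get_strokes_count(chinese_character):
    # pinyin() returns one inner list of readings per input character
    pinyin_list = pinyin(chinese_character, style=Style.NORMAL)
    # NOTE: this counts the readings of the first character, not its strokes
    strokes_count = len(pinyin_list[0])
    return strokes_count

character = input("Enter a Chinese character: ")
strokes = get_strokes_count(character)
print("The stroke count of '{}' is: {}".format(character, strokes))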

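I tried it and found that the value returned is actually the number of readings pypinyin yields for the character in the NORMAL pinyin style, not its stroke count.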


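The Unihan database

The Unihan database is a Han character database maintained by the Unicode Consortium. It looks reliable, and it also comes with an online tool.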

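Searching in its online query tool, Unihan Database Lookup, I found that the results include a kTotalStrokes field, which is exactly the stroke count data needed.
As Unicode's official database, its current version is more than enough for basic Chinese character lookups.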

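Nice! One step closer to success!

Getting stroke information from the Unihan database

At first I planned to send query requests directly to the lookup tool, but hmmm, that was too slow because the server is hosted overseas. The database files themselves turned out to be fairly small, so I simply downloaded them.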

Unihan download link

Open the archive; it contains several files.
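The archive can also be inspected from Python rather than by hand. The following is a minimal sketch using the standard `zipfile` module; the helper names `list_archive` and `extract_member` are hypothetical, and the archive and member names are whatever your download actually contains.

```python
import zipfile
from pathlib import Path


def list_archive(zip_path: Path) -> list[str]:
    """Return the names of all members in the zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()


def extract_member(zip_path: Path, member: str, dest_dir: Path) -> Path:
    """Extract a single member into dest_dir and return the extracted path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        return Path(archive.extract(member, path=dest_dir))
```

Assuming the download is saved as `Unihan.zip`, a call such as `extract_member(Path("Unihan.zip"), "Unihan_IRGSources.txt", Path("Stroke"))` would place the file where the later code expects it.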

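Based on the lookup results, the kTotalStrokes field we need lives in the IRG Sources file, so pull that file out of the archive.
After testing a regular expression on regex101, extract the Unicode code point and the stroke count from each matching line and save them to a separate file for later lookups.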
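The same regex check can be reproduced locally. The sketch below runs a pattern in the spirit of the one used in this post against a few illustrative lines; `SAMPLE_LINES` is made-up input in the tab-separated "code point, field, value" layout, not an excerpt of the real file, and `parse_total_strokes` is a hypothetical helper name.

```python
import re

# Illustrative sample input for this sketch (not copied from the real file).
SAMPLE_LINES = [
    "# a comment line",
    "U+4E00\tkRSUnicode\t1.0",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E8C\tkTotalStrokes\t2",
    "U+4E09\tkTotalStrokes\t3",
]

# Group 1: the code point key; group 2: the last number on a kTotalStrokes line.
PATTERN = re.compile(r"(U\+[0-9A-F]+)\skTotalStrokes.*\s(\d+)")


def parse_total_strokes(lines):
    """Map code point keys such as 'U+4E00' to integer stroke counts."""
    strokes = {}
    for line in lines:
        match = PATTERN.search(line.strip())
        if match:
            strokes[match.group(1)] = int(match.group(2))
    return strokes


print(parse_total_strokes(SAMPLE_LINES))
# prints {'U+4E00': 1, 'U+4E8C': 2, 'U+4E09': 3}
```

Comment lines and lines for other fields simply fail to match and are skipped. If a line lists more than one number, this pattern keeps the last one.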

Coding

Extracting the stroke information

import json
import re
from pathlib import Path

file = Path("Stroke/Unihan_IRGSources.txt")
output = Path("Stroke/unicode2stroke.json")
# Group 1: code point key (e.g. "U+963F"); group 2: last number on the line
pattern = re.compile(r"(U\+[0-9A-F]+)\skTotalStrokes.*\s(\d+)")
stroke_dict = dict()
with open(file, mode="r", encoding="utf-8") as f:
    for line in f:
        result = pattern.search(line.strip())
        if result is None:
            # Skip comments and lines that belong to other fields
            continue
        unicode_key, unicode_stroke = result.groups()
        print(f"{unicode_key}: {unicode_stroke}")
        stroke_dict[unicode_key] = unicode_stroke

# Stroke counts are stored as strings, keyed by code point
with open(file=output, mode="w", encoding="utf-8") as f:
    json.dump(stroke_dict, f, ensure_ascii=False, indent=4)

Exporting the result to a JSON file makes it easy to access later.

Writing the lookup function

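import json
from pathlib import Path

output = Path("Stroke/unicode2stroke.json")
with open(output, encoding="utf-8") as f:
    unicode2stroke = json.load(f)

def get_character_stroke_count(char: str) -> int:
    # Build the key in the same "U+XXXX" form used in the JSON file
    code_point = f"U+{ord(char):04X}"
    return int(unicode2stroke[code_point])

test_char = "阿"
print(get_character_stroke_count(char=test_char))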

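When looking a character up, note that it must first be converted to its Unicode code point in hexadecimal (for example, 阿 becomes U+963F).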
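The conversion step is easy to check in isolation. Here is a minimal sketch of it, together with a lookup that returns `None` instead of raising when a character is missing from the table; `to_unihan_key` and `safe_stroke_count` are hypothetical helper names, and the table passed in is assumed to map keys like `"U+4E00"` to stroke counts stored as strings or integers.

```python
from typing import Optional


def to_unihan_key(char: str) -> str:
    """Format a single character as a Unihan-style key such as 'U+963F'."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    # Upper-case hex, padded to at least four digits
    return f"U+{ord(char):04X}"


def safe_stroke_count(char: str, table: dict) -> Optional[int]:
    """Return the stroke count from table, or None if the character is absent."""
    value = table.get(to_unihan_key(char))
    return None if value is None else int(value)


print(to_unihan_key("阿"))  # prints U+963F
```

Characters outside the Basic Multilingual Plane simply produce a five-digit key, for example `U+20000`.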

Success! The result is exactly what we expected!
