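Simple Examples of Sentiment Classification, Named Entity Recognition, and Question Answering with a Pretrained BERT - BERT, Machine Learning, Artificial Intelligence (Maverick1218 blog)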

Building the models

# coding=utf-8
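"""
BERT three-task demo (simplified)
Features:
1. Sentiment analysis - classify text as positive or negative
2. Named entity recognition - identify person names, place names, organizations, etc.
3. Question answering - extract the answer from a passage

Uses offline (local) models so the code is easy to study and understand
"""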

import time
import warnings
warnings.filterwarnings('ignore')


# ============================================================================
# Configuration: local model paths
# ============================================================================
MODEL_PATHS = {
    'sentiment': 'C:/Users/PC/Desktop/models/sentiment-analysis',      # sentiment analysis model
    'ner': 'C:/Users/PC/Desktop/models/ner',                           # named entity recognition model
    'qa': 'C:/Users/PC/Desktop/models/qa'                              # question answering model
}


# ============================================================================
# Utility: check the environment
# ============================================================================
def check_environment():
    """Check the Python environment and the required libraries."""
    try:
        import transformers
        import torch

        print(f"✓ transformers 版本: {transformers.__version__}")
        print(f"✓ torch 版本: {torch.__version__}")

        # Check whether a GPU is available
        if torch.cuda.is_available():
            print(f"GPU 可用: {torch.cuda.get_device_name(0)}")
        else:
            print("使用 CPU 模式")

        print()
        return True

    except ImportError as e:
        print(f"缺少必要的庫: {e}")
        print("\n請安裝依賴:")
        print("  pip install transformers torch")
        return False


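# Task 1: sentiment analysis
class SentimentAnalyzer:
    """
    Sentiment analyzer.
    Purpose: decide whether a text is positive or negative.
    Typical uses: e-commerce review analysis, public-opinion monitoring, etc.
    """
    
    def __init__(self, model_path):
        """
        Initialize the model.
        Args:
            model_path: path to the local model folder
        Imports used here:
            AutoTokenizer: the tokenizer class (tokenize, encode, decode, ...);
                from_pretrained(model_path) loads its saved data (vocabulary, config values, ...)
            AutoModelForSequenceClassification: the BERT model with a sequence classification head
            torch: the PyTorch deep learning framework
        """
        from transformers import AutoTokenizer, AutoModelForSequenceClassification
        import torch
        
        print("【加載情感分析模型】")
        print(f"模型路徑: {model_path}\n")

        # Load the pretrained tokenizer from the local path (it carries the vocabulary, special tokens, etc.).
        # The tokenizer (1) splits text into tokens and (2) maps each token to its fixed integer ID,
        # which is the form the model consumes.
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)

        # Load the sentiment classification model from disk (pretrained BERT + classification head)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path)
        
        # Switch to evaluation mode (disables dropout). Gradient tracking is turned off
        # separately, by torch.no_grad() in predict().
        self.model.eval()

        # Pick the device: use the GPU if one is available, otherwise the CPU
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)

        # Label mapping: class index -> display label
        self.labels = {0: "負面 ", 1: "正面 "}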

        print("模型加載成功\n")

    def predict(self, text):
        """
        Predict the sentiment of a text.
        Args:
            text: input text
        Returns:
            a dict containing the prediction result
        """
        import torch

        # 1. Preprocess the text with the tokenizer (tokenize, convert to IDs, pad, ...)
        inputs = self.tokenizer(
            text,
            return_tensors='pt',      # return PyTorch tensors
            padding=True,              # pad to the longest sequence in the batch
            truncation=True,           # truncate over-long input
            max_length=512             # maximum length in tokens
        )

        # 2. Move the input tensors to the same device as the model (CPU or GPU)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        # 3. torch.no_grad(): inference only, no gradient tracking; outputs.logits holds the raw scores
        with torch.no_grad():
            outputs = self.model(**inputs)

        # 4. Convert logits to probabilities: for raw scores such as tensor([[-3.2156,  4.8234]]),
        #    softmax maps them into [0, 1] so that they sum to 1
        probs = torch.softmax(outputs.logits, dim=-1)

        # 5. Predicted class: for tensor([[0.0003, 0.9997]]), take the index of the maximum -> Python int (1)
        pred_id = torch.argmax(probs, dim=-1).item()

        # 6. Return the result
        return {
            'text': text,                            # e.g. '這家餐廳非常好'
            'label': self.labels[pred_id],           # e.g. '正面 '
            'confidence': probs[0][pred_id].item(),  # e.g. 0.9997
            'all_scores': {
                self.labels[i]: probs[0][i].item()   # e.g. {'負面 ': 0.0003,
                for i in range(len(self.labels))     #       '正面 ': 0.9997}
            }
        }
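
To make steps 4 and 5 of `predict` concrete, here is a minimal sketch that redoes the softmax and argmax by hand in plain Python, using the example logits quoted in the comments. The helper name `softmax` is introduced here only for illustration; the class itself relies on `torch.softmax` and `torch.argmax`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                      # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-3.2156, 4.8234]               # raw scores for [negative, positive]
probs = softmax(logits)
pred_id = max(range(len(probs)), key=probs.__getitem__)  # argmax

print([round(p, 4) for p in probs])      # [0.0003, 0.9997]
print(pred_id)                           # 1
```

The gap between the two logits is about 8, so almost all of the probability mass lands on index 1, which is why the reported confidence is so close to 100%.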



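# Task 2: named entity recognition
class NamedEntityRecognizer:
    """
    Named entity recognizer (NER).
    Purpose: identify person names, place names, organizations, etc. in a text.
    Typical uses: information extraction, knowledge graph construction, etc.
    """

    def __init__(self, model_path):
        """
        Initialize the model.
        Args:
            model_path: path to the local model folder
        """
        # AutoModelForTokenClassification: token classification model (predicts one label per token)
        from transformers import AutoTokenizer, AutoModelForTokenClassification
        import torch

        print("【加載命名實體識別模型】")
        print(f"模型路徑: {model_path}\n")

        # Load the tokenizer. use_fast=True selects the Rust-backed fast tokenizer,
        # which supports return_offsets_mapping.
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

        # Load the token classification model: BERT + classification head (one entity class per token)
        self.model = AutoModelForTokenClassification.from_pretrained(model_path)

        # Evaluation mode (disables dropout)
        self.model.eval()

        # Pick the device
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)

        # Mapping from label ID to entity tag, for example:
        # {1: 'B-PER',    # Begin-Person (start of a person name)
        #  2: 'I-PER'}    # Inside-Person (continuation of a person name)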
        self.id2label = self.model.config.id2label

        print("✓ 模型加載成功\n")

    def predict(self, text):
        """
        Recognize the named entities in a text.
        Args:
            text: input text
        Returns:
            a dict containing the list of entities
        """
        import torch

        # Step 1: preprocess the text (tokenize and record character positions)
        inputs = self.tokenizer(
            text,
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=512,
            return_offsets_mapping=True  # return each token's character span in the original text
        )

        # Step 2: pull out the offset mapping and remove it from the inputs (the model does not need it)
        offset_mapping = inputs.pop('offset_mapping')[0]

        # Step 3: move the inputs to the device
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        # Step 4: run the model
        with torch.no_grad():
            outputs = self.model(**inputs)

        # Step 5: pick the most likely label index for each token, e.g. predictions = [0, 1, 2, ...],
        # where 1 might map to 'B-PER' (start of a person name)
        predictions = torch.argmax(outputs.logits, dim=-1)[0]

        # Step 6: decode the BIO tags into entities
        # Input:  per-token label predictions, e.g. ['O', 'B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC']
        # Output: a list of entities, e.g. [{'text': '王明', 'type': 'PER'}, {'text': '北京', 'type': 'LOC'}]

        entities = [] # all entities recognized so far
        current_entity = None # entity currently being built (state machine)

        # Walk over the tokens with their labels, e.g. pred_id = 1 ('B-PER') and offset = (0, 1),
        # the half-open character span [start, end) of the token in the original text
        for pred_id, offset in zip(predictions, offset_mapping):
            
            # Skip special tokens: [CLS], [SEP], [PAD], etc. have offset (0, 0) and need no handling
            if offset[0] == 0 and offset[1] == 0:
                continue

            # Look up the label by its index, e.g. pred_id = 1 with
            # self.id2label = {0: 'O', 1: 'B-PER', ...} gives label = 'B-PER'
            label = self.id2label[pred_id.item()]

            # Case 1: label starts with 'B-', a new entity begins
            if label.startswith('B-'):
                if current_entity: # save any unfinished entity first
                    entities.append(current_entity)
                current_entity = { # start a new entity
                    'text': text[offset[0]:offset[1]], # extract the text fragment
                    'type': label[2:], # strip the 'B-' prefix
                    'start': offset[0].item(), # start position
                    'end': offset[1].item() # end position
                }

            # Case 2: label starts with 'I-', the entity continues
            elif label.startswith('I-'):

                # Only extend when an entity is open and its type matches label[2:];
                # an 'I-' tag with no matching open entity is ignored
                if current_entity and label[2:] == current_entity['type']:

                    # Re-extract the full entity text from the original string:
                    # current_entity['start'] keeps the original start position,
                    # offset[1] extends the end to the current token, e.g. text[0:2] = '王明'
                    current_entity['text'] = text[current_entity['start']:offset[1]]
                    current_entity['end'] = offset[1].item() # update the end position

            # Case 3: 'O', not part of an entity
            else:
                if current_entity:
                    entities.append(current_entity)
                    current_entity = None

        # Append the last entity, if one is still open
        if current_entity:
            entities.append(current_entity)

        return {
            'text': text,
            'entities': entities
        }
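
The BIO decoding loop in `predict` can be exercised without loading a model. The sketch below is a plain-Python restatement of the same state machine; the function name `decode_bio` and the hand-written labels and offsets are illustrative assumptions, not output from the actual checkpoint (whose tag names, judging by the run shown in this post, look more like `name` and `organization` than `PER` and `LOC`).

```python
def decode_bio(text, labels, offsets):
    """Merge per-token BIO labels into entity spans.

    labels  : one tag per token, e.g. 'B-PER', 'I-PER', 'O'
    offsets : one (start, end) character span per token;
              (0, 0) marks special tokens such as [CLS] and [SEP]
    """
    entities, current = [], None
    for label, (start, end) in zip(labels, offsets):
        if start == 0 and end == 0:          # special token, skip
            continue
        if label.startswith('B-'):           # a new entity begins
            if current:
                entities.append(current)
            current = {'text': text[start:end], 'type': label[2:],
                       'start': start, 'end': end}
        elif label.startswith('I-'):         # continuation of the open entity
            if current and label[2:] == current['type']:
                current['text'] = text[current['start']:end]
                current['end'] = end
        else:                                # 'O' closes any open entity
            if current:
                entities.append(current)
                current = None
    if current:                              # flush the last entity
        entities.append(current)
    return entities

text = "王明在北京"
# Hypothetical tokenization: [CLS] 王 明 在 北 京 [SEP]
labels = ['O', 'B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O']
offsets = [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 0)]

entities = decode_bio(text, labels, offsets)
for entity in entities:
    print(entity)
```

Keeping the decoding separate from the model call like this makes it easy to unit-test edge cases, for instance an `I-` tag that appears without a preceding `B-` tag.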


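# Task 3: question answering
class QuestionAnsweringSystem:
    """
    Extractive question answering system.
    Input: a question plus a text (the context) that contains the answer
    Output: the answer extracted from the context
    Typical uses: customer-service bots, search engines, etc.
    """

    def __init__(self, model_path):
        """
        Initialize the model.
        Args:
            model_path: path to the local model folder
        """
        # AutoModelForQuestionAnswering has two outputs, start_logits (start position) and
        # end_logits (end position), which together can describe an answer span of any length
        from transformers import AutoTokenizer, AutoModelForQuestionAnswering
        import torch

        print("【加載問答模型】")
        print(f"模型路徑: {model_path}\n")

        # Load the tokenizer and the model
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
        self.model = AutoModelForQuestionAnswering.from_pretrained(model_path)
        self.model.eval()

        # Pick the device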
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)

        print(" 模型加載成功\n")

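    def answer(self, question, context):
        """Extract the answer to `question` from `context` and return it with a confidence score."""
        import torch

        # 1. Encode the question and the context together as one pair of sequences
        inputs = self.tokenizer(
            question, # first text: the question
            context,  # second text: the context
            return_tensors='pt', # return PyTorch tensors
            padding=True, # pad to the longest sequence in the batch (not to max_length)
            truncation=True, # truncate when the sequence exceeds max_length
            max_length=512  # maximum sequence length of 512 tokens
        )

        # 2. Move the inputs to the device
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        # 3. Run the model (predict the start and end positions of the answer)
        with torch.no_grad(): # inference only: tell PyTorch not to track gradients
            
            # Forward pass: outputs contains start_logits (start position) and end_logits (end position)
            outputs = self.model(**inputs) 

        # 4. Take the highest-scoring index as the start and as the end position
        start_idx = torch.argmax(outputs.start_logits).item()
        end_idx = torch.argmax(outputs.end_logits).item()

        # 5. Make sure the end is not before the start; otherwise set end = start,
        #    so that a single token is returned as the answer
        if end_idx < start_idx:
            end_idx = start_idx

        # 6. Compute a confidence score
        # softmax turns the logits over all positions into a probability distribution (sums to 1)
        start_score = torch.softmax(outputs.start_logits[0], dim=-1)[start_idx].item() # e.g. start_probs[10] = 0.957

        end_score = torch.softmax(outputs.end_logits[0], dim=-1)[end_idx].item() # e.g. end_probs[11] = 0.964
        
        confidence = (start_score + end_score) / 2 # e.g. (0.957 + 0.964) / 2

        # 7. Slice [start_idx:end_idx+1] to get the token IDs of the answer, e.g. [1266, 776]
        answer_tokens = inputs['input_ids'][0][start_idx:end_idx+1]

        # e.g. [1266, 776] -> '北京' (skip_special_tokens drops special tokens)
        answer_text = self.tokenizer.decode(answer_tokens, skip_special_tokens=True)

        # 8. Handle the no-answer case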
        if not answer_text.strip():
            return {
                'question': question,
                'context': context,
                'answer': " 無法找到答案",
                'confidence': 0.0
            }

        return {
            'question': question,
            'context': context,
            'answer': answer_text,
            'confidence': confidence
        }
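
Steps 4 and 5 of `answer` pick the start and end positions independently, which can in principle select a position inside the question or a start that lies after the end. A common alternative, sketched here as an assumption rather than as what the code above does, is to score every valid (start, end) pair inside the context and keep the best one. The function `best_span`, its parameters, and the toy logits are all hypothetical.

```python
def best_span(start_logits, end_logits, context_start, context_end, max_answer_len=30):
    """Return ((start, end), score) maximising start_logits[s] + end_logits[e]
    with context_start <= s <= e <= context_end and e - s < max_answer_len."""
    best, best_score = None, float('-inf')
    for s in range(context_start, context_end + 1):
        last = min(context_end, s + max_answer_len - 1)
        for e in range(s, last + 1):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score

# Toy example: positions 0-2 stand for [CLS], a question token, and [SEP];
# positions 3-4 are context tokens; position 5 is the final [SEP].
start_logits = [5.0, 0.1, 0.2, 3.0, 0.5, 0.1]
end_logits   = [0.1, 4.0, 0.2, 0.3, 3.5, 0.1]

span, score = best_span(start_logits, end_logits, context_start=3, context_end=4)
print(span, score)   # (3, 4) 6.5
```

With independent argmax the toy logits would give start 0 and end 1, both outside the context. In the real pipeline the context token range could probably be read from the fast tokenizer's `sequence_ids()` on the encoding before it is converted to a plain dict; that wiring is left out here.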

Calling the models

def demo_sentiment():
    """Demonstrate sentiment analysis."""
    try:
        # Create a SentimentAnalyzer from the model path; this step loads the BERT model and the tokenizer
        analyzer = SentimentAnalyzer(MODEL_PATHS['sentiment'])

        # Test texts
        test_texts = [
            "這家餐廳的菜品非常好吃，服務態度也很棒！",
            "價格太貴了，性價比很低，不推薦",
            "環境很好，但是菜品一般般",
            "超級好吃！強烈推薦給大家！",
            "服務態度極差，再也不會來了"
        ]

        # Predict one text at a time
        for i, text in enumerate(test_texts, 1): # number the texts starting from 1
            print(f" 文本 {i}: {text}")

            # Run the prediction
            result = analyzer.predict(text)

            print(f"   預測結果: {result['label']}")
            print(f"   置信度: {result['confidence']:.2%}")

    except Exception as e:
        print(f"✗ 演示失敗: {e}")
        print("請檢查模型路徑是否正確\n")


def demo_ner():
    """Demonstrate named entity recognition."""
    try:
        # Create the named entity recognizer
        recognizer = NamedEntityRecognizer(MODEL_PATHS['ner'])

        # Test texts
        test_texts = [
            "王明在北京大學學習計算機科學",
            "馬化騰是騰訊公司的創始人",
            "2024年3月，李華在上海舉辦了畫展",
            "張偉博士在清華大學擔任教授"
        ]

        # Recognize entities one text at a time
        for i, text in enumerate(test_texts, 1):
            print(f"文本 {i}: {text}")

            result = recognizer.predict(text)

            if result['entities']:
                print(f"   識別到 {len(result['entities'])} 個實體:")
                for entity in result['entities']:
                    
                    # Formatted output: left-align the entity text and pad it with spaces to width 10 (long text is not truncated)
                    print(f"      • {entity['text']:<10} → [{entity['type']}]")
            else:
                print("   未識別到實體")

            print()

        print("✓ 命名實體識別演示完成\n")

    except Exception as e:
        print(f"演示失敗: {e}")
        print("請檢查模型路徑是否正確\n")


def demo_qa():
    """Demonstrate the question answering system."""
    try:
        # Initialize the model
        qa_system = QuestionAnsweringSystem(MODEL_PATHS['qa'])

        # Test question/context pairs
        test_cases = [
            {
                'context': 'BERT是Google在2018年提出的預訓練語言模型。它的全稱是Bidirectional Encoder Representations from Transformers。BERT在多個自然語言處理任務上都取得了突破性的成果。',
                'question': 'BERT是什麼時候提出的？'
            },
            {
                'context': '北京是中國的首都，',
                'question': '中國的首都是哪兒？'
            },
            {
                'context': '深度學習是機器學習的一個分支，它使用多層神經網絡來學習數據的表示。深度學習在計算機視覺、語音識別和自然語言處理等領域都有廣泛應用。',
                'question': '深度學習在哪些領域有應用？'
            },
            {
                'context': '張大彪非常喜歡打羽毛球',
                'question': '誰非常喜歡打羽毛球？'
            }
        ]

        # Answer one question at a time
        for i, case in enumerate(test_cases, 1):
            print(f"問答 {i}:")
            print(f"   上下文: {case['context']}")
            print(f"   問題: {case['question']}")

            
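            # qa_system.answer() internally: 1. encode: convert the question and context to token IDs -> 2. predict: the model outputs start_logits and end_logits
            # -> 3. locate: find the start and end positions of the answer -> 4. decode: convert the token IDs back to text -> 5. return the result dict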
            result = qa_system.answer(case['question'], case['context'])

            print(f"   答案: {result['answer']}")
            print(f"   置信度: {result['confidence']:.2%}")
            print()

        print("✓ 問答系統演示完成\n")

    except Exception as e:
        print(f"✗ 演示失敗: {e}")
        print("請檢查模型路徑是否正確\n")



# Main program

def main():
    """Main function: run all demos."""

    print("        BERT 三大任務演示 - 簡化版")
    print("\n 包含任務:")
    print("   1. 情感分析")
    print("   2. 命名實體識別")
    print("   3. 問答系統")
    print("\n 模式: 離線模式（使用本地模型）")
    print("\n 模型路徑:")
    for task, path in MODEL_PATHS.items():
        print(f"   {task}: {path}")
    print("="*70)

    # Check the environment
    if not check_environment():
        return

    # Run the three demos
    demo_sentiment()      # Task 1: sentiment analysis
    demo_ner()            # Task 2: named entity recognition
    demo_qa()             # Task 3: question answering


if __name__ == "__main__":
    main()

BERT 三大任務演示 - 簡化版

 包含任務:
   1. 情感分析
   2. 命名實體識別
   3. 問答系統

 模式: 離線模式（使用本地模型）

 模型路徑:
   sentiment: C:/Users/PC/Desktop/models/sentiment-analysis
   ner: C:/Users/PC/Desktop/models/ner
   qa: C:/Users/PC/Desktop/models/qa
======================================================================
✓ transformers 版本: 4.51.1
✓ torch 版本: 2.6.0+cpu
使用 CPU 模式

【加載情感分析模型】
模型路徑: C:/Users/PC/Desktop/models/sentiment-analysis

模型加載成功

 文本 1: 這家餐廳的菜品非常好吃，服務態度也很棒！
   預測結果: 正面 
   置信度: 98.31%
 文本 2: 價格太貴了，性價比很低，不推薦
   預測結果: 負面 
   置信度: 98.19%
 文本 3: 環境很好，但是菜品一般般
   預測結果: 負面 
   置信度: 61.52%
 文本 4: 超級好吃！強烈推薦給大家！
   預測結果: 正面 
   置信度: 97.42%
 文本 5: 服務態度極差，再也不會來了
   預測結果: 負面 
   置信度: 99.63%
【加載命名實體識別模型】
模型路徑: C:/Users/PC/Desktop/models/ner

✓ 模型加載成功

文本 1: 王明在北京大學學習計算機科學
   識別到 2 個實體:
      • 王明         → [name]
      • 北京大學       → [organization]

文本 2: 馬化騰是騰訊公司的創始人
   識別到 3 個實體:
      • 馬化騰        → [name]
      • 騰訊公司       → [company]
      • 創始人        → [position]

文本 3: 2024年3月，李華在上海舉辦了畫展
   識別到 2 個實體:
      • 李華         → [name]
      • 上海         → [address]

文本 4: 張偉博士在清華大學擔任教授
   識別到 4 個實體:
      • 張偉         → [name]
      • 博士         → [position]
      • 清華大學       → [organization]
      • 教授         → [position]

✓ 命名實體識別演示完成

【加載問答模型】
模型路徑: C:/Users/PC/Desktop/models/qa

 模型加載成功

問答 1:
   上下文: BERT是Google在2018年提出的預訓練語言模型。它的全稱是Bidirectional Encoder Representations from Transformers。BERT在多個自然語言處理任務上都取得了突破性的成果。
   問題: BERT是什麼時候提出的？
   答案: 2018 年
   置信度: 80.47%

問答 2:
   上下文: 北京是中國的首都，
   問題: 中國的首都是哪兒？
   答案: 北 京
   置信度: 72.87%

問答 3:
   上下文: 深度學習是機器學習的一個分支，它使用多層神經網絡來學習數據的表示。深度學習在計算機視覺、語音識別和自然語言處理等領域都有廣泛應用。
   問題: 深度學習在哪些領域有應用？
   答案: 計 算 機 視 覺 、 語 音 識 別 和 自 然 語 言 處 理 等 領 域
   置信度: 73.05%

問答 4:
   上下文: 張大彪非常喜歡打羽毛球
   問題: 誰非常喜歡打羽毛球？
   答案: 張 大 彪
   置信度: 91.61%

✓ 問答系統演示完成
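
The answers in the question answering output contain spaces between characters (for example "北 京"), most likely because `tokenizer.decode` joins the per-character tokens with spaces. One way to avoid this, sketched below under stated assumptions, is to slice the original context string by character offsets instead of decoding token IDs. The helper `answer_from_offsets` and the hand-built `offsets` list are hypothetical stand-ins for what a fast tokenizer returns with `return_offsets_mapping=True`; the token indices 11 and 12 are likewise illustrative and assume one token per character.

```python
def answer_from_offsets(context, offsets, start_idx, end_idx):
    """Slice the answer out of the original context using token character spans."""
    start_char = offsets[start_idx][0]   # first character of the start token
    end_char = offsets[end_idx][1]       # one past the last character of the end token
    return context[start_char:end_char]

question = "中國的首都是哪兒？"
context = "北京是中國的首都，"

# Hypothetical offsets for the pair: [CLS] question [SEP] context [SEP].
# Special tokens get (0, 0); each sequence's offsets are relative to its own string.
offsets = ([(0, 0)]
           + [(i, i + 1) for i in range(len(question))]
           + [(0, 0)]
           + [(i, i + 1) for i in range(len(context))]
           + [(0, 0)])

start_idx, end_idx = 11, 12              # the tokens for 北 and 京 in this layout
answer = answer_from_offsets(context, offsets, start_idx, end_idx)
print(answer)                            # 北京
```

Because the slice comes straight from the context string, the answer keeps the original spacing and punctuation. As with the NER class, the offset mapping would need to be popped from the tokenizer output before the remaining inputs are passed to the model.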
