Part 1. Machine Learning Basics and Feature Engineering - Machine Learning - leafgood blog

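1. Machine Learning Basics

1.1 Mathematical Foundations

Mathematics you will need:
calculus, linear algebra, and probability and statistics.

You do not have to go deep at the start; you can build this knowledge up gradually as you learn.

1.2 Programming Language

Python is the most popular language in the AI field and has a low barrier to entry, so it is a good first language for machine learning.
If you have the energy, learn C/C++ as well; knowing one more language never hurts.

1.3 The Machine Learning Development Workflow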

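1) Obtain raw data from the company's existing databases, from web crawlers, or from other sources.
The data we get falls roughly into two types: discrete data and continuous data.
2) Process the raw data; in Python you can use the pandas library.
pandas reference documentation: https://pandas.pydata.org/pandas-docs/stable/
3) Feature processing: a very important step that directly affects the model's predictions.
4) Choose a suitable algorithm and build the model.
5) Evaluate the model against expectations. If the results are unsatisfactory, choose a different algorithm or redo the feature processing.
6) Once the evaluation passes, put the model into use.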
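
The six steps can be walked through end to end on a toy problem. The sketch below uses only the Python standard library; the data values, the column meanings, and the nearest-centroid "model" are all invented for illustration and are stand-ins for whatever data and algorithm a real project would use.

```python
# Toy walk-through of the workflow. Every value and name here is made up.

# 1) Raw data: (height_cm, weight_kg, label); None marks a missing value.
raw = [
    (150.0, 45.0, "small"), (155.0, 50.0, "small"), (None, 48.0, "small"),
    (180.0, 85.0, "large"), (185.0, 90.0, "large"), (175.0, None, "large"),
    (152.0, 47.0, "small"), (182.0, 88.0, "large"),
]

# 2) Data processing: drop rows that contain missing values.
clean = [row for row in raw if None not in row]

# 3) Feature processing: min-max scale each feature column to [0, 1].
#    For brevity the scaling uses all rows; in practice, fit it on the
#    training rows only and reuse it on the test rows.
def min_max(column):
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

heights = min_max([r[0] for r in clean])
weights = min_max([r[1] for r in clean])
features = list(zip(heights, weights))
labels = [r[2] for r in clean]

train_x, train_y = features[:4], labels[:4]
test_x, test_y = features[4:], labels[4:]

# 4) Build a model: a nearest-centroid classifier (one mean point per label).
def fit_centroids(xs, ys):
    centroids = {}
    for label in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = tuple(sum(col) / len(col) for col in zip(*rows))
    return centroids

def predict(centroids, x):
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=sq_dist)

model = fit_centroids(train_x, train_y)

# 5) Evaluate: fraction of held-out rows predicted correctly.
predictions = [predict(model, x) for x in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print("accuracy:", accuracy)

# 6) If the accuracy is acceptable the model is put to use; otherwise go
#    back to step 3 or step 4.
```

If the accuracy in step 5 were too low, the loop back to feature processing or algorithm choice is exactly the retry described in step 5 of the list.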

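2. The scikit-learn Library

scikit-learn is a machine learning toolkit for Python:
Simple and efficient tools for data mining and data analysis
Reusable by everyone in a variety of contexts
Built on NumPy, SciPy, and matplotlib
Open source and commercially usable - BSD license

We will use this library to gradually lift the veil on machine learning.

2.1 Feature Engineering

2.1.1 Feature Extraction

scikit-learn provides a feature extraction API: sklearn.feature_extraction

Dictionary feature extraction: sklearn.feature_extraction.DictVectorizer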

from sklearn.feature_extraction import DictVectorizer

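def dict_ex():
    """
    Dictionary feature extraction
    :return:
    """
    list_demo = [{'city': '美國', 'new_add': 24500},
                 {'city': '俄羅斯', 'new_add': 4000},
                 {'city': '英國', 'new_add': 5600}]

    # sparse=False returns a dense NumPy array instead of a SciPy sparse matrix
    dict_ins = DictVectorizer(sparse=False)

    data = dict_ins.fit_transform(list_demo)

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print(dict_ins.get_feature_names_out().tolist())
    print(data.astype('int'))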

if __name__ == '__main__':
    dict_ex()

Output

['city=俄羅斯', 'city=美國', 'city=英國', 'new_add']
[[    0     1     0 24500]
 [    1     0     0  4000]
 [    0     0     1  5600]]
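
To make the output above less of a black box, here is a minimal pure-Python sketch of the idea behind dictionary feature extraction: each distinct string value becomes its own one-hot `key=value` column, while numeric values pass through unchanged. The helper name `one_hot_dicts` is made up for this illustration; it mimics the behaviour shown above and is not scikit-learn's actual implementation.

```python
def one_hot_dicts(records):
    """Turn a list of dicts into (feature_names, rows).

    Hypothetical helper for illustration only; not scikit-learn code.
    """
    # String values get one column per distinct "key=value" pair;
    # numeric values keep a single column named after the key.
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    feature_names = sorted(names)
    index = {name: i for i, name in enumerate(feature_names)}

    rows = []
    for record in records:
        row = [0] * len(feature_names)
        for key, value in record.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1  # one-hot flag
            else:
                row[index[key]] = value           # numeric value kept as-is
        rows.append(row)
    return feature_names, rows


list_demo = [{'city': '美國', 'new_add': 24500},
             {'city': '俄羅斯', 'new_add': 4000},
             {'city': '英國', 'new_add': 5600}]
feature_names, rows = one_hot_dicts(list_demo)
print(feature_names)
for row in rows:
    print(row)
```

One-hot columns are used here because a categorical value such as a place name has no meaningful numeric order; mapping it to a single integer would suggest one to the algorithm.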

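Text feature extraction: sklearn.feature_extraction.text.CountVectorizer
Out of the box it only suits space-delimited text such as English; for other languages, run a word-segmentation tool first.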

from sklearn.feature_extraction.text import CountVectorizer

def text_ex():
    """
    Text feature extraction
    :return:
    """
    text_demo = ["""
        Beautiful is better than ugly.
        Explicit is better than implicit.
        Simple is better than complex.
        Complex is better than complicated.
        Flat is better than nested.
        Sparse is better than dense.
        Readability counts.
        Special cases aren't special enough to break the rules.
        Although practicality beats purity.
        Errors should never pass silently.
        Unless explicitly silenced.
        In the face of ambiguity, refuse the temptation to guess.
        There should be one-- and preferably only one --obvious way to do it.
        Although that way may not be obvious at first unless you're Dutch.
        Now is better than never.
        Although never is often better than *right* now.
        If the implementation is hard to explain, it's a bad idea.
        If the implementation is easy to explain, it may be a good idea.
        Namespaces are one honking great idea -- let's do more of those!
    """]

    cv = CountVectorizer()
    data = cv.fit_transform(text_demo)
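    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print(cv.get_feature_names_out().tolist())
    print(data)  # sparse format by default
    print(data.toarray())  # convert to a dense array
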
if __name__ == '__main__':
    text_ex()

Output

['although', 'ambiguity', 'and', 'are', 'aren', 'at', 'bad', 'be', 'beats', 'beautiful', 'better', 'break', 'cases', 'complex', 'complicated', 'counts', 'dense', 'do', 'dutch', 'easy', 'enough', 'errors', 'explain', 'explicit', 'explicitly', 'face', 'first', 'flat', 'good', 'great', 'guess', 'hard', 'honking', 'idea', 'if', 'implementation', 'implicit', 'in', 'is', 'it', 'let', 'may', 'more', 'namespaces', 'nested', 'never', 'not', 'now', 'obvious', 'of', 'often', 'one', 'only', 'pass', 'practicality', 'preferably', 'purity', 're', 'readability', 'refuse', 'right', 'rules', 'should', 'silenced', 'silently', 'simple', 'sparse', 'special', 'temptation', 'than', 'that', 'the', 'there', 'those', 'to', 'ugly', 'unless', 'way', 'you']
  (0, 9)    1
  (0, 38)    10
  (0, 10)    8
  :    :
  (0, 40)    1
  (0, 42)    1
  (0, 73)    1
[[ 3  1  1  1  1  1  1  3  1  1  8  1  1  2  1  1  1  2  1  1  1  1  2  1
   1  1  1  1  1  1  1  1  1  3  2  2  1  1 10  3  1  2  1  1  1  3  1  2
   2  2  1  3  1  1  1  1  1  1  1  1  1  1  2  1  1  1  1  2  1  8  1  5
   1  1  5  1  2  2  1]]
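
The feature list above contains 'aren' and 're' but no 'a', 's', or 't'. That follows from CountVectorizer's defaults: text is lowercased, and the default token pattern `(?u)\b\w\w+\b` keeps only tokens of two or more word characters. The sketch below approximates that behaviour in plain Python; the helper name `bag_of_words` and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern; the (?u) flag is
# omitted because Python 3 str patterns are Unicode-aware already.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def bag_of_words(text):
    """Hypothetical helper: lowercase, tokenize, and count words."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))

counts = bag_of_words(
    "Special cases aren't special enough to break the rules. It's a bad idea."
)
vocabulary = sorted(counts)
print(vocabulary)
print([counts[word] for word in vocabulary])
```

The apostrophe splits "aren't" into "aren" and "t", and the one-letter piece is then discarded, which is why such fragments show up in the vocabulary.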

Feature extraction example for Chinese text

from sklearn.feature_extraction.text import CountVectorizer
import jieba

def zh_exc():
    """
    Chinese text features
    :return:
    """
    text_demo1 = """
    　我與父親不相見已二年餘了，我最不能忘記的是他的背影。那年冬天，祖母死了，父親的差使也交卸了，
    正是禍不單行的日子，我從北京到徐州，打算跟着父親奔喪回家。到徐州見着父親，看見滿院狼藉的東西，
    又想起祖母，不禁簌簌地流下眼淚。父親説，“事已如此，不必難過，好在天無絕人之路！
    """
    text_demo2 = """
    沿着荷塘，是一條曲折的小煤屑路。這是一條幽僻的路；白天也少人走，夜晚更加寂寞。荷塘四面，長着許多樹，
    蓊蓊鬱鬱的。路的一旁，是些楊柳，和一些不知道名字的樹。沒有月光的晚上，這路上陰森森的，有些怕人。
    今晚卻很好，雖然月光也還是淡淡的。
    """

    c1 = jieba.cut(text_demo1)
    c2 = jieba.cut(text_demo2)

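    # Join the segmented words into one space-separated string per document
    text1 = ' '.join(c1)
    text2 = ' '.join(c2)

    cv = CountVectorizer()
    data = cv.fit_transform([text1, text2])
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print(cv.get_feature_names_out().tolist())
    print(data.toarray())

if __name__ == '__main__':
    zh_exc()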

Output

['一些', '一旁', '一條', '不必', '不禁', '不能', '東西', '事已如此', '二年', '交卸', '今晚', '冬天', '北京', '名字', '四面', '回家', '夜晚', '天無絕人之路', '奔喪', '寂寞', '少人', '差使', '幽僻', '徐州', '忘記', '怕人', '想起', '打算', '日子', '晚上', '曲折', '更加', '月光', '有些', '楊柳', '正是', '沒有', '沿着', '流下', '淡淡的', '滿院', '煤屑', '父親', '狼藉', '白天', '相見', '看見', '眼淚', '知道', '祖母', '禍不單行', '簌簌', '背影', '荷塘', '蓊蓊鬱鬱', '雖然', '許多', '跟着', '路上', '還是', '這是', '那年', '長着', '陰森森', '難過']
[[0 0 0 1 1 1 1 1 1 1 0 1 1 0 0 1 0 1 1 0 0 1 0 2 1 0 1 1 1 0 0 0 0 0 0 1
  0 0 1 0 1 0 5 1 0 1 1 1 0 2 1 1 1 0 0 0 0 1 0 0 0 1 0 0 1]
 [1 1 2 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 1 1 0 1 0 0 1 0 0 0 1 1 1 2 1 1 0
  1 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 2 1 1 1 0 1 1 1 0 1 1 0]]
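
Notice that single-character words such as 我 and 的 are missing from the feature list. That is consistent with CountVectorizer's default token pattern, which keeps only tokens of at least two word characters. The sketch below applies that pattern to a short string segmented by hand for illustration (it is not actual jieba output):

```python
import re

# CountVectorizer's default token_pattern, without the inline (?u) flag,
# which Python 3 applies to str patterns anyway.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

# Hand-segmented sample; real jieba output may split differently.
segmented = "我 最 不能 忘記 的 是 他 的 背影"
tokens = TOKEN_PATTERN.findall(segmented)
print(tokens)  # ['不能', '忘記', '背影']
```

If single-character words matter for your task, one option is to pass a different `token_pattern` to CountVectorizer, for example `r"(?u)\b\w+\b"`.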

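The core idea of TF-IDF is to measure how important a word is to one document relative to the rest of the collection.
Corresponding API: sklearn.feature_extraction.text.TfidfVectorizer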

from sklearn.feature_extraction.text import TfidfVectorizer
import jieba


def tfidf_ext():
    text_demo1 = """
        　我與父親不相見已二年餘了，我最不能忘記的是他的背影。那年冬天，祖母死了，父親的差使也交卸了，
        正是禍不單行的日子，我從北京到徐州，打算跟着父親奔喪回家。到徐州見着父親，看見滿院狼藉的東西，
        又想起祖母，不禁簌簌地流下眼淚。父親説，“事已如此，不必難過，好在天無絕人之路！
        """
    text_demo2 = """
        沿着荷塘，是一條曲折的小煤屑路。這是一條幽僻的路；白天也少人走，夜晚更加寂寞。荷塘四面，長着許多樹，
        蓊蓊鬱鬱的。路的一旁，是些楊柳，和一些不知道名字的樹。沒有月光的晚上，這路上陰森森的，有些怕人。
        今晚卻很好，雖然月光也還是淡淡的。
        """

    c1 = jieba.cut(text_demo1)
    c2 = jieba.cut(text_demo2)

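    # Join the segmented words into one space-separated string per document
    text1 = ' '.join(c1)
    text2 = ' '.join(c2)

    tf = TfidfVectorizer()
    data = tf.fit_transform([text1, text2])
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print(tf.get_feature_names_out().tolist())
    print(data.toarray())

if __name__ == '__main__':
    tfidf_ext()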

Output

['一些', '一旁', '一條', '不必', '不禁', '不能', '東西', '事已如此', '二年', '交卸', '今晚', '冬天', '北京', '名字', '四面', '回家', '夜晚', '天無絕人之路', '奔喪', '寂寞', '少人', '差使', '幽僻', '徐州', '忘記', '怕人', '想起', '打算', '日子', '晚上', '曲折', '更加', '月光', '有些', '楊柳', '正是', '沒有', '沿着', '流下', '淡淡的', '滿院', '煤屑', '父親', '狼藉', '白天', '相見', '看見', '眼淚', '知道', '祖母', '禍不單行', '簌簌', '背影', '荷塘', '蓊蓊鬱鬱', '雖然', '許多', '跟着', '路上', '還是', '這是', '那年', '長着', '陰森森', '難過']
[[0.         0.         0.         0.12598816 0.12598816 0.12598816
  0.12598816 0.12598816 0.12598816 0.12598816 0.         0.12598816
  0.12598816 0.         0.         0.12598816 0.         0.12598816
  0.12598816 0.         0.         0.12598816 0.         0.25197632
  0.12598816 0.         0.12598816 0.12598816 0.12598816 0.
  0.         0.         0.         0.         0.         0.12598816
  0.         0.         0.12598816 0.         0.12598816 0.
  0.62994079 0.12598816 0.         0.12598816 0.12598816 0.12598816
  0.         0.25197632 0.12598816 0.12598816 0.12598816 0.
  0.         0.         0.         0.12598816 0.         0.
  0.         0.12598816 0.         0.         0.12598816]
 [0.15617376 0.15617376 0.31234752 0.         0.         0.
  0.         0.         0.         0.         0.15617376 0.
  0.         0.15617376 0.15617376 0.         0.15617376 0.
  0.         0.15617376 0.15617376 0.         0.15617376 0.
  0.         0.15617376 0.         0.         0.         0.15617376
  0.15617376 0.15617376 0.31234752 0.15617376 0.15617376 0.
  0.15617376 0.15617376 0.         0.15617376 0.         0.15617376
  0.         0.         0.15617376 0.         0.         0.
  0.15617376 0.         0.         0.         0.         0.31234752
  0.15617376 0.15617376 0.15617376 0.         0.15617376 0.15617376
  0.15617376 0.         0.15617376 0.15617376 0.        ]]
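
The numbers above can be checked by hand. With its default settings, TfidfVectorizer uses the raw count as term frequency, a smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then scales each row to unit Euclidean length. The sketch below re-implements that recipe in plain Python as a study aid, not as the library's code; the function name `tfidf` and the placeholder tokens `a0`, `b0`, ... are assumptions. The token counts are read off the CountVectorizer output earlier in this section: the first passage has one word with count 5, two with count 2, and thirty with count 1; the second has three words with count 2 and twenty-nine with count 1.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns (vocabulary, L2-normalized rows)."""
    n = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Placeholder tokens stand in for the words that occur once.
doc1 = ["父親"] * 5 + ["徐州"] * 2 + ["祖母"] * 2 + [f"a{i}" for i in range(30)]
doc2 = ["一條"] * 2 + ["月光"] * 2 + ["荷塘"] * 2 + [f"b{i}" for i in range(29)]

vocab, rows = tfidf([doc1, doc2])
w_father = rows[0][vocab.index("父親")]
w_once_doc1 = rows[0][vocab.index("a0")]
w_once_doc2 = rows[1][vocab.index("b0")]
print(round(w_father, 8), round(w_once_doc1, 8), round(w_once_doc2, 8))
```

Because the two passages share no words, every term has the same idf, so it cancels during normalization and each weight reduces to count / sqrt(sum of squared counts): 1/sqrt(63) for a word that occurs once in the first passage and 1/sqrt(41) in the second. TF-IDF only starts to down-weight words once they appear in several documents.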

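2.1.2 Feature Processing

What is feature processing? It means using specific mathematical methods to convert the data into the form our algorithm requires.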

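Normalization
The min-max normalization formula is:

$$X' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

The mean normalization formula is:

$$X' = \frac{x - \mathrm{mean}(x)}{\max(x) - \min(x)}$$

where mean(x), min(x), and max(x) are the mean, minimum, and maximum of the sample data.
Then compute X'':

$$X'' = X' \cdot (mx - mi) + mi$$

where mx and mi are the upper and lower bounds of the target interval [mi, mx]; typically mx is 1 and mi is 0.

Purpose of normalization: transform the raw data so that it maps into a specified interval, usually [0, 1]. In that case X'' equals X', so the step of computing X'' can be skipped.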
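
The two steps can be written out directly. This is a minimal sketch for a single column of numbers; the function name `min_max_scale`, the sample column, and the handling of a constant column are assumptions made for this example.

```python
def min_max_scale(column, mi=0.0, mx=1.0):
    """Map a list of numbers onto the interval [mi, mx]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; this sketch maps it to mi.
        return [mi for _ in column]
    scaled = []
    for x in column:
        x1 = (x - lo) / (hi - lo)   # X'  in [0, 1]
        x2 = x1 * (mx - mi) + mi    # X'' in [mi, mx]
        scaled.append(x2)
    return scaled

column = [99, 88, 89, 97]
unit = min_max_scale(column)                        # default interval [0, 1]
symmetric = min_max_scale(column, mi=-1.0, mx=1.0)  # interval [-1, 1]
print(unit)
print(symmetric)
```

With the default interval the second step leaves the values unchanged, which is why it can be skipped for [0, 1].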

scikit-learn provides a normalization API: sklearn.preprocessing.MinMaxScaler
Code example:

from sklearn.preprocessing import MinMaxScaler

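def mm_ex():
    """
    Normalization example
    :return:
    """
    # Each inner list is one sample; each column is one feature
    test_dict = [[99, 1, 18, 1002],
                 [88, 4, 18, 1400],
                 [89, 4, 25, 1201],
                 [97, 2, 19, 2800]]
    mm = MinMaxScaler()  # scales each column to [0, 1] by default
    data = mm.fit_transform(test_dict)
    print(data)
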
if __name__ == '__main__':
    mm_ex()

Output

[[1.         0.         0.         0.        ]
 [0.         1.         0.         0.22135706]
 [0.09090909 1.         1.         0.11067853]
 [0.81818182 0.33333333 0.14285714 1.        ]]

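The result of normalization depends on the maximum and minimum values, so it is easily distorted by outliers. For example:

    test_dict = [[99, 1, 18, 200000],
                 [88, 4, 18, 1400],
                 [89, 4, 25, 1201],
                 [97, 2, 19, 2800]]

Here the single value 200000 has a large effect on the result.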
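
To see the size of the effect, the sketch below scales the last column with and without the outlier, using a hand-written min-max function (a hypothetical helper) rather than the library.

```python
def min_max_scale(column):
    """Scale a list of numbers to [0, 1] with the min-max formula."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

normal = min_max_scale([1002, 1400, 1201, 2800])
with_outlier = min_max_scale([200000, 1400, 1201, 2800])

print([round(v, 4) for v in normal])
print([round(v, 4) for v in with_outlier])
```

With the outlier present, the three ordinary values are all squeezed below 0.01, so the differences between them almost vanish. This is the main reason min-max normalization suits data without extreme values.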

Standardization
Formula:

$$x' = \frac{x - \mu}{\sigma}$$

where μ is the mean and σ is the standard deviation.

scikit-learn provides a standardization API: sklearn.preprocessing.StandardScaler
Code example:

from sklearn.preprocessing import StandardScaler

def standard_ex():
    """
    StandardScaler API example
    :return:
    """
    test_dict = [[99, 1, 18, 1002],
                 [88, 4, 18, 1400],
                 [89, 4, 25, 1201],
                 [97, 2, 19, 2800]]
    
    ss = StandardScaler()
    data = ss.fit_transform(test_dict)
    print(data)


if __name__ == '__main__':
    standard_ex()

Output

[[ 1.1941005  -1.34715063 -0.68599434 -0.84743801]
 [-1.09026568  0.96225045 -0.68599434 -0.28413057]
 [-0.88259602  0.96225045  1.71498585 -0.56578429]
 [ 0.7787612  -0.57735027 -0.34299717  1.69735287]]
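
The first column of this output can be reproduced by hand with the standardization formula. The sketch below uses only the standard library; the function name `standardize` is hypothetical. It uses the population standard deviation (dividing by n), which is the convention StandardScaler follows.

```python
from statistics import mean, pstdev

def standardize(column):
    """Return (x - mean) / std for each value in the column."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (divides by n)
    return [(x - mu) / sigma for x in column]

first_column = [99, 88, 89, 97]
z = standardize(first_column)
print([round(v, 8) for v in z])
```

After standardization each column has mean 0 and standard deviation 1. Because every sample contributes to the mean and standard deviation, a single outlier shifts the result less than it does with min-max normalization.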
