Quick Intro to Full-Text Search with Elasticsearch

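1. Overview

Full-text search runs linguistic searches against documents. A query consists of one or more words or phrases, and it returns the documents that match the search condition.

Elasticsearch is a search engine built on Apache Lucene, a free and open-source information retrieval library. It provides a distributed full-text search engine with an HTTP web interface and schema-free JSON documents.

This article explores the Elasticsearch REST API and demonstrates basic operations using nothing but HTTP requests.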

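2. Setup

To install Elasticsearch on your machine, refer to the official installation guide.

By default, the RESTful API listens on port 9200. Let's use the following curl command to check that it is up and running: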

curl -XGET 'http://localhost:9200/'

If you see a response like the following, the instance is running properly:

{
  "name": "NaIlQWU",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "enkBkWqqQrS0vp_NXmjQMQ",
  "version": {
    "number": "5.1.2",
    "build_hash": "c8c4c16",
    "build_date": "2017-01-11T20:18:39.146Z",
    "build_snapshot": false,
    "lucene_version": "6.3.0"
  },
  "tagline": "You Know, for Search"
}
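
To make this check scriptable, the response can be parsed instead of inspected by eye. The following is a minimal Python sketch that works on the response body as a string; the function name parse_root_response is an assumption made for illustration, and the sample is a trimmed copy of the response shown above.

```python
import json


def parse_root_response(body: str) -> dict:
    """Pull a few identifying fields out of the JSON returned by GET /."""
    info = json.loads(body)
    return {
        "cluster_name": info["cluster_name"],
        "version": info["version"]["number"],
        "lucene_version": info["version"]["lucene_version"],
    }


# Trimmed copy of the response shown in the article.
sample = """
{
  "name": "NaIlQWU",
  "cluster_name": "elasticsearch",
  "version": {
    "number": "5.1.2",
    "build_snapshot": false,
    "lucene_version": "6.3.0"
  },
  "tagline": "You Know, for Search"
}
"""

summary = parse_root_response(sample)
print(summary)
```

In a real script the body would come from an HTTP call to http://localhost:9200/ rather than from a string literal.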

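3. Indexing Documents

Elasticsearch is document-oriented: it stores and indexes documents. Indexing creates or updates a document. Once a document is indexed, you can search, sort, and filter complete documents rather than rows of columnar data. This is a fundamentally different way of thinking about data, and it is one of the reasons Elasticsearch can perform complex full-text searches.

Documents are represented as JSON objects. JSON serialization is supported by most programming languages and has become the standard format of the NoSQL movement. It is simple, concise, and easy to read.

We'll use the following random entries to run our full-text searches: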

{
  "title": "He went",
  "random_text": "He went such dare good fact. The small own seven saved man age."
}

{
  "title": "He oppose",
  "random_text": "He oppose at thrown desire of no. Announcing impression unaffected day his are unreserved indulgence."
}

{
  "title": "Repulsive questions",
  "random_text": "Repulsive questions contented him few extensive supported."
}

{
  "title": "Old education",
  "random_text": "Old education him departure any arranging one prevailed."
}

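Before indexing a document, we need to decide where to store it. An Elasticsearch instance can hold multiple indices, each index contains multiple types, each type holds multiple documents, and each document has multiple fields.

We'll store our documents using the following scheme:

text: the index name.
article: the type name.
id: the ID of this particular text entry.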
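
The three parts of this scheme map directly onto the URL path of a document. As a rough Python sketch, assuming the index/type/id layout used in this article and a local instance on port 9200 (the helper name document_url is hypothetical):

```python
from urllib.parse import quote


def document_url(index, doc_type, doc_id, host="localhost", port=9200):
    """Build the REST URL for one document: /<index>/<type>/<id>."""
    parts = [quote(str(part), safe="") for part in (index, doc_type, doc_id)]
    return "http://{}:{}/{}".format(host, port, "/".join(parts))


print(document_url("text", "article", 1))
```

Percent-encoding each path segment keeps the URL valid even if an ID contains characters such as spaces or slashes.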

To add a document, we run the following command:

curl -XPUT 'localhost:9200/text/article/1?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "title": "He went",
  "random_text": 
    "He went such dare good fact. The small own seven saved man age."
}'

Here we use id=1; the other entries can be added with the same command by incrementing the ID.
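
Repeating the command by hand for every entry gets tedious, so the incrementing-ID pattern can be scripted. The sketch below only builds the four PUT requests with Python's standard library and does not send them; the names DOCUMENTS and build_index_requests are assumptions for illustration, and the base URL assumes the text index and article type chosen above.

```python
import json
from urllib.request import Request

# The four sample entries from this article.
DOCUMENTS = [
    {"title": "He went",
     "random_text": "He went such dare good fact. The small own seven saved man age."},
    {"title": "He oppose",
     "random_text": "He oppose at thrown desire of no. Announcing impression "
                    "unaffected day his are unreserved indulgence."},
    {"title": "Repulsive questions",
     "random_text": "Repulsive questions contented him few extensive supported."},
    {"title": "Old education",
     "random_text": "Old education him departure any arranging one prevailed."},
]


def build_index_requests(documents, base_url="http://localhost:9200/text/article"):
    """Create one PUT request per document, numbering the IDs from 1."""
    requests = []
    for doc_id, doc in enumerate(documents, start=1):
        requests.append(Request(
            url="{}/{}".format(base_url, doc_id),
            data=json.dumps(doc).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        ))
    return requests


index_requests = build_index_requests(DOCUMENTS)
for request in index_requests:
    print(request.get_method(), request.full_url)
```

Each request could then be passed to urllib.request.urlopen against a running instance.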

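4. Retrieving Documents

Once all the documents have been added, we can check how many documents the cluster contains with the following command: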

curl -XGET 'http://localhost:9200/_count?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_all": {}
  }
}'

We can also fetch a document by its ID:

curl -XGET 'localhost:9200/text/article/1?pretty'

We should get the following response from Elasticsearch:

{
  "_index": "text",
  "_type": "article",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "title": "He went",
    "random_text": 
      "He went such dare good fact. The small own seven saved man age."
  }
}

As you can see, this response corresponds to the entry we added with ID 1.
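
A client would usually check the found flag before reading _source. The Python sketch below does this on response bodies held as strings. The function name extract_source is an assumption, and the miss body is a hypothetical example written for illustration rather than output captured from a server.

```python
import json


def extract_source(body: str):
    """Return the stored document from a get-by-ID response, or None on a miss."""
    response = json.loads(body)
    if not response.get("found", False):
        return None
    return response["_source"]


# Response for ID 1, as shown in the article.
hit_body = """
{
  "_index": "text",
  "_type": "article",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "title": "He went",
    "random_text": "He went such dare good fact. The small own seven saved man age."
  }
}
"""

# Hypothetical body for an ID that was never indexed.
miss_body = '{"_index": "text", "_type": "article", "_id": "9", "found": false}'

print(extract_source(hit_body))
print(extract_source(miss_body))
```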

5. Querying Documents

Now let's run a full-text search with the following command:

curl -XGET 'localhost:9200/text/article/_search?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": {
      "random_text": "him departure"
    }
  }
}'

We get the following result:

{
  "took": 32,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.4513469,
    "hits": [
      {
        "_index": "text",
        "_type": "article",
        "_id": "4",
        "_score": 1.4513469,
        "_source": {
          "title": "Old education",
          "random_text": "Old education him departure any arranging one prevailed."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "3",
        "_score": 0.28582606,
        "_source": {
          "title": "Repulsive questions",
          "random_text": "Repulsive questions contented him few extensive supported."
        }
      }
    ]
  }
}

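We searched for "him departure" and got two results with different scores. The first result is the obvious one: its text contains the exact phrase we searched for, and it scores 1.4513469.

The second result is returned because its document contains the word "him".

By default, Elasticsearch sorts matching results by relevance score, that is, by how well each document matches the query. Note that the second result's score is small compared with the first hit's, which indicates lower relevance.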
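
To work with the ranking programmatically, the hits array can be reduced to a short list of IDs, scores, and titles. This is a minimal Python sketch over a trimmed copy of the response above; summarize_hits is a hypothetical helper name, and the explicit sort is only a safeguard, since the response is already ordered by score.

```python
def summarize_hits(response: dict):
    """Return (id, score, title) for each hit, highest score first."""
    hits = response["hits"]["hits"]
    ranked = sorted(hits, key=lambda hit: hit["_score"], reverse=True)
    return [(hit["_id"], hit["_score"], hit["_source"]["title"]) for hit in ranked]


# Trimmed copy of the search response shown in the article.
sample_response = {
    "hits": {
        "total": 2,
        "max_score": 1.4513469,
        "hits": [
            {"_id": "4", "_score": 1.4513469,
             "_source": {"title": "Old education"}},
            {"_id": "3", "_score": 0.28582606,
             "_source": {"title": "Repulsive questions"}},
        ],
    }
}

for doc_id, score, title in summarize_hits(sample_response):
    print(doc_id, score, title)
```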

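6. Fuzzy Search

Fuzzy matching treats two words that are "fuzzily" similar as if they were the same word. First, we need to define what we mean by fuzziness.

The maximum edit distance that Elasticsearch supports, specified with the fuzziness parameter, is 2. The fuzziness parameter can also be set to AUTO, which results in the following maximum edit distances:

0 for strings of one or two characters
1 for strings of three, four, or five characters
2 for strings of more than five characters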
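
The AUTO thresholds listed above can be written as a small function. This Python sketch only mirrors the three rules as stated; it is not Elasticsearch's own code, and the name auto_fuzziness is an assumption.

```python
def auto_fuzziness(term: str) -> int:
    """Maximum edit distance allowed for a term under the AUTO rules above."""
    length = len(term)
    if length <= 2:
        return 0  # one or two characters: exact match only
    if length <= 5:
        return 1  # three to five characters: one edit allowed
    return 2      # more than five characters: two edits allowed


for term in ("he", "him", "departure"):
    print(term, auto_fuzziness(term))
```

Under these rules the query term "him" would be allowed one edit and "departure" two.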

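You may find that an edit distance of 2 returns results that appear unrelated to the target.

With a maximum fuzziness of 1, you may get better results as well as better performance. Distance here means the Levenshtein distance, a string metric that measures the difference between two sequences. Formally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.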
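
The definition can be made concrete with the classic dynamic-programming computation of the Levenshtein distance. The Python sketch below is a textbook implementation for illustration; it is not how Elasticsearch evaluates fuzzy queries internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("him", "his"))
print(levenshtein("he", "him"))
```

The words "him" and "his" are one substitution apart, while "he" needs two edits to become "him", so a maximum distance of 2 admits noticeably looser matches than a maximum of 1.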

Now let's run a search with fuzziness:

curl -XGET 'localhost:9200/text/article/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query":
  {
    "match":
    {
      "random_text":
      {
        "query": "him departure",
        "fuzziness": "2"
      }
    }
  }
}'

Here are the results:

{
  "took": 88,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.5834423,
    "hits": [
      {
        "_index": "text",
        "_type": "article",
        "_id": "4",
        "_score": 1.4513469,
        "_source": {
          "title": "Old education",
          "random_text": "Old education him departure any arranging one prevailed."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "2",
        "_score": 0.41093433,
        "_source": {
          "title": "He oppose",
          "random_text": "He oppose at thrown desire of no. Announcing impression unaffected day his are unreserved indulgence."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "Repulsive questions",
          "random_text": "Repulsive questions contented him few extensive supported."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "1",
        "_score": 0.0,
        "_source": {
          "title": "He went",
          "random_text": "He went such dare good fact. The small own seven saved man age."
        }
      }
    ]
  }
}

As we can see, fuzziness gives us more results.

We need to use fuzziness carefully, because it tends to retrieve results that look unrelated.
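
One way to keep fuzziness under control is to build the query body in code and make the setting an explicit parameter. The Python sketch below produces the same body shape as the curl example in this section; the helper name fuzzy_match_body and its AUTO default are assumptions for illustration.

```python
import json


def fuzzy_match_body(field, text, fuzziness="AUTO"):
    """Build a match query body with an explicit fuzziness setting."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "fuzziness": fuzziness,
                }
            }
        }
    }


# A stricter variant of the query used above: at most one edit per term.
body = fuzzy_match_body("random_text", "him departure", fuzziness=1)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to the _search endpoint with curl's -d option in place of the hand-written body.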

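7. Conclusion

In this quick tutorial, we focused on indexing documents and running full-text searches through the Elasticsearch REST API.

Of course, client APIs are available for multiple programming languages when we need them, but the REST API remains very convenient and language-agnostic.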
