Quick Intro to Full-Text Search with Elasticsearch

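1. Overview

Full-text search queries perform linguistic searches against documents. A query contains one or more words or phrases and returns the documents that match the search condition.

Elasticsearch is a distributed full-text search engine built on Apache Lucene, a free and open-source information retrieval library. It exposes an HTTP web interface and stores schema-free JSON documents.

This article explores the Elasticsearch REST API and demonstrates basic operations using only HTTP requests.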

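2. Setup

To install Elasticsearch on your machine, refer to the official installation guide.

The RESTful API listens on port 9200 by default. Let's check that it is running with the following curl command: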

curl -XGET 'http://localhost:9200/'

If you see a response like the following, the instance is running correctly:

{
  "name": "NaIlQWU",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "enkBkWqqQrS0vp_NXmjQMQ",
  "version": {
    "number": "5.1.2",
    "build_hash": "c8c4c16",
    "build_date": "2017-01-11T20:18:39.146Z",
    "build_snapshot": false,
    "lucene_version": "6.3.0"
  },
  "tagline": "You Know, for Search"
}

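3. Indexing Documents

Elasticsearch is document-oriented: it stores and indexes documents. An index operation creates or updates a document. Once a document is indexed, you can search, sort, and filter complete documents rather than rows of columnar data. This is a fundamentally different way of thinking about data, and it is one of the reasons Elasticsearch can perform complex full-text search.

Documents are represented as JSON objects. JSON serialization is supported by most programming languages and has become the standard format of the NoSQL movement. It is simple, concise, and easy to read.

We will use the following random entries to perform our full-text search: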

{
  "title": "He went",
  "random_text": "He went such dare good fact. The small own seven saved man age."
}

{
  "title": "He oppose",
  "random_text": "He oppose at thrown desire of no. Announcing impression unaffected day his are unreserved indulgence."
}

{
  "title": "Repulsive questions",
  "random_text": "Repulsive questions contented him few extensive supported."
}

{
  "title": "Old education",
  "random_text": "Old education him departure any arranging one prevailed."
}

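To index a document, we first need to decide where to store it. A cluster can hold multiple indices, each index contains one or more types, each type holds multiple documents, and each document has multiple fields.

We will store our documents using the following scheme:

text: the index name.
article: the type name.
id: the ID of this particular text entry.

To add a document, we run the following command: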

curl -XPUT 'localhost:9200/text/article/1?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "title": "He went",
  "random_text": 
    "He went such dare good fact. The small own seven saved man age."
}'

Here we use id=1; the other entries can be added with the same command by incrementing the ID.
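The remaining entries can be indexed the same way. As a minimal sketch, assuming Python's standard library and the same local node on port 9200, the snippet below builds one PUT request per document with incrementing IDs. The helper name `build_index_requests` and the `BASE_URL` constant are illustrative and not part of Elasticsearch; nothing is sent until a request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request

# Assumed base URL: a local node on the default port, as in the curl examples.
BASE_URL = "http://localhost:9200"

DOCUMENTS = [
    {"title": "He went",
     "random_text": "He went such dare good fact. The small own seven saved man age."},
    {"title": "He oppose",
     "random_text": "He oppose at thrown desire of no. "
                    "Announcing impression unaffected day his are unreserved indulgence."},
    {"title": "Repulsive questions",
     "random_text": "Repulsive questions contented him few extensive supported."},
    {"title": "Old education",
     "random_text": "Old education him departure any arranging one prevailed."},
]


def build_index_requests(documents, index="text", doc_type="article"):
    """Build one PUT request per document, using IDs 1, 2, 3, ... (hypothetical helper)."""
    requests = []
    for doc_id, document in enumerate(documents, start=1):
        url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}?pretty"
        body = json.dumps(document).encode("utf-8")
        requests.append(urllib.request.Request(
            url, data=body, method="PUT",
            headers={"Content-Type": "application/json"}))
    return requests


for request in build_index_requests(DOCUMENTS):
    # Sending requires a running node: urllib.request.urlopen(request)
    print(request.get_method(), request.full_url)
```

Building the requests separately from sending them makes it easy to inspect the URLs and bodies before touching the cluster.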

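4. Retrieving Documents

After adding all the documents to the cluster, we can check how many documents it contains with the following command: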

curl -XGET 'http://localhost:9200/_count?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_all": {}
  }
}'

We can also fetch a document by its ID:

curl -XGET 'localhost:9200/text/article/1?pretty'

Elasticsearch should return the following response:

{
  "_index": "text",
  "_type": "article",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "title": "He went",
    "random_text": 
      "He went such dare good fact. The small own seven saved man age."
  }
}

This response corresponds to the entry we added with ID 1.
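In client code, the stored document is usually what we want from this response. The following is a small sketch, assuming Python's standard `json` module and the sample response shown above; `extract_source` is a hypothetical helper name, not an Elasticsearch API.

```python
import json

# Sample response body copied from the GET request above.
RESPONSE_BODY = """
{
  "_index": "text",
  "_type": "article",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "title": "He went",
    "random_text": "He went such dare good fact. The small own seven saved man age."
  }
}
"""


def extract_source(response_body):
    """Return the stored document, or None when the ID was not found (hypothetical helper)."""
    response = json.loads(response_body)
    if not response.get("found", False):
        return None
    return response["_source"]


document = extract_source(RESPONSE_BODY)
print(document["title"])
```

The metadata fields prefixed with an underscore describe where the document lives, while `_source` holds the JSON that was originally indexed.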

5. Querying Documents

Now let's run a full-text search with the following command:

curl -XGET 'localhost:9200/text/article/_search?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": {
      "random_text": "him departure"
    }
  }
}'

We get the following result:

{
  "took": 32,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.4513469,
    "hits": [
      {
        "_index": "text",
        "_type": "article",
        "_id": "4",
        "_score": 1.4513469,
        "_source": {
          "title": "Old education",
          "random_text": "Old education him departure any arranging one prevailed."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "3",
        "_score": 0.28582606,
        "_source": {
          "title": "Repulsive questions",
          "random_text": "Repulsive questions contented him few extensive supported."
        }
      }
    ]
  }
}

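We searched for "him departure" and received two results with different scores. The first result is the obvious one: its text contains both search terms, and it scores 1.4513469.

The second result was retrieved because the document contains the word "him".

By default, Elasticsearch sorts matching results by relevance score, that is, by how well each document matches the query. Note that the second result scores much lower than the first hit, indicating lower relevance.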
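To see why both documents match, recall that a match query analyzes the query text into terms and, by default, returns documents containing any of them. The following is a deliberately simplified sketch in Python: it imitates lowercasing and word splitting with a regular expression and does not reproduce Elasticsearch's analyzers or its relevance scoring. The function names are illustrative.

```python
import re

DOCUMENTS = {
    1: "He went such dare good fact. The small own seven saved man age.",
    2: "He oppose at thrown desire of no. "
       "Announcing impression unaffected day his are unreserved indulgence.",
    3: "Repulsive questions contented him few extensive supported.",
    4: "Old education him departure any arranging one prevailed.",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (a rough stand-in for analysis)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_any(query, documents):
    """Map each matching document ID to the number of distinct query terms it contains."""
    query_terms = set(tokenize(query))
    matches = {}
    for doc_id, text in documents.items():
        shared = query_terms & set(tokenize(text))
        if shared:
            matches[doc_id] = len(shared)
    return matches


print(match_any("him departure", DOCUMENTS))
```

Document 4 shares both terms with the query and document 3 shares one, which is consistent with the ranking above, although the actual scores come from Elasticsearch's own similarity model rather than from this count.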

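6. Fuzzy Search

Fuzzy matching treats two words that are "fuzzily" similar as if they were the same word. First, we need to define what we mean by fuzziness.

Elasticsearch supports a maximum edit distance of 2, specified with the fuzziness parameter. The fuzziness parameter can also be set to AUTO, which results in the following maximum edit distances:

0 for strings of one or two characters
1 for strings of three, four, or five characters
2 for strings of more than five characters

You may find that an edit distance of 2 returns results that don't appear to be related. Using a maximum fuzziness of 1 usually gives better results and better performance.

The distance here is the Levenshtein distance, a string metric for measuring the difference between two sequences. Formally, the Levenshtein distance between two words is the minimum number of single-character edits needed to change one word into the other.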
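Both ideas can be sketched in a few lines of Python. The snippet below is an illustration under stated assumptions, not Elasticsearch's implementation: `levenshtein` computes the plain Levenshtein distance with the standard dynamic-programming recurrence, and `auto_fuzziness` encodes the AUTO thresholds listed above. Both function names are hypothetical. Elasticsearch may additionally treat a swap of two adjacent characters as a single edit, which this sketch omits.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Maximum edit distance for a term under the AUTO thresholds listed above."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


for candidate in ("his", "he", "them"):
    print("him ->", candidate, levenshtein("him", candidate))
print("AUTO for 'him':", auto_fuzziness("him"))
print("AUTO for 'departure':", auto_fuzziness("departure"))
```

Under these assumptions, "his" is one edit away from "him" and "he" is two edits away, so a fuzziness of 2 lets a short query term match quite different words.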

Now let's run the search with fuzziness:

curl -XGET 'localhost:9200/text/article/_search?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": {
      "random_text": {
        "query": "him departure",
        "fuzziness": "2"
      }
    }
  }
}'

Here is the result:

{
  "took": 88,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.5834423,
    "hits": [
      {
        "_index": "text",
        "_type": "article",
        "_id": "4",
        "_score": 1.4513469,
        "_source": {
          "title": "Old education",
          "random_text": "Old education him departure any arranging one prevailed."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "2",
        "_score": 0.41093433,
        "_source": {
          "title": "He oppose",
          "random_text": "He oppose at thrown desire of no. Announcing impression unaffected day his are unreserved indulgence."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "Repulsive questions",
          "random_text": "Repulsive questions contented him few extensive supported."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "1",
        "_score": 0.0,
        "_source": {
          "title": "He went",
          "random_text": "He went such dare good fact. The small own seven saved man age."
        }
      }
    ]
  }
}

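As we can see, the fuzzy query returns more results.

Fuzzy queries should be used with care, because they tend to retrieve results that look unrelated.

7. Conclusion

In this quick tutorial, we focused on indexing documents and querying them for full-text search directly through the Elasticsearch REST API.

Client libraries are, of course, available for many programming languages when we need them, but the REST API itself remains convenient and language-agnostic.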
