[ElasticStack-29] How to Manage Synonyms in Elasticsearch


Gibbs Kim's playground


Lio Grande 2020. 9. 1. 17:30

Once the tokenizer has split the text into tokens, Elasticsearch can apply a variety of token filters to them.

Applying the Synonym filter, one of those token filters, lets Elasticsearch handle synonyms.

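* Ways to add synonyms

1) Register the synonyms in advance as a parameter in the index settings (the analysis section)

2) Create and manage a separate file (e.g., a synonym dictionary)

The first approach is rarely used in practice.

-> When synonyms are managed in the index settings, they are hard to change while the index is in operation.

For this reason, synonyms in Elasticsearch are usually managed through a synonym dictionary.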
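
For comparison, the first approach can be sketched as follows. This is only an illustration, assuming the synonym filter's inline "synonyms" parameter and reusing the two example rules that appear in the dictionary section of this post. The analyzer and filter names are placeholders, and the printed body is what would be sent when creating an index.

```python
import json

# Hypothetical request body for approach 1: the synonym rules are listed
# inline through the filter's "synonyms" parameter instead of a file.
inline_synonym_settings = {
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "synonym_test": {
                        "tokenizer": "whitespace",
                        "filter": ["synonym"],
                    }
                },
                "filter": {
                    "synonym": {
                        "type": "synonym",
                        # Each entry uses the same rule syntax as one dictionary line.
                        "synonyms": [
                            "Elasticsearch,엘라스틱서치",
                            "Harry => 해리",
                        ],
                    }
                },
            }
        }
    }
}

# Print the JSON body that would accompany a PUT <index name> request.
print(json.dumps(inline_synonym_settings, ensure_ascii=False, indent=2))
```

Because the rules live inside the index settings here, changing them means changing the settings themselves, which is the operational drawback described above.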

* Creating a synonym dictionary

$ES_HOME/config/analysis/synonym.txt

Synonyms can be managed in this file in the following ways.

[1] Adding synonyms

# To add synonyms, list the words separated by commas (,)
Elasticsearch,엘라스틱서치

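[2] Replacing synonyms

# Use replacement when you want to change a specific word into another word.
# With replacement, the original token is removed and the new token is added in its place.
# Written with an arrow (=>)
Harry => 해리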
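
The difference between the two rule types can be illustrated with a small Python sketch. It is a deliberately simplified model, not the actual Lucene implementation: it handles single-word, case-sensitive rules only, and the function names parse_rules and apply_rules are made up for this example.

```python
def parse_rules(lines):
    """Parse dictionary lines into a {source token: [emitted tokens]} table."""
    table = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Replacement rule: the source token is swapped for the targets.
            left, right = line.split("=>", 1)
            sources = [w.strip() for w in left.split(",")]
            targets = [w.strip() for w in right.split(",")]
            for src in sources:
                table[src] = targets
        else:
            # Addition rule: every word in the group also emits the others.
            words = [w.strip() for w in line.split(",")]
            for src in words:
                table[src] = [src] + [w for w in words if w != src]
    return table


def apply_rules(tokens, table):
    """Return, for each input token, the tokens emitted at its position."""
    return [table.get(tok, [tok]) for tok in tokens]


rules = parse_rules([
    "# sample dictionary",
    "Elasticsearch,엘라스틱서치",
    "Harry => 해리",
])
print(apply_rules(["Elasticsearch", "Harry", "Potter"], rules))
```

In this model "Elasticsearch" keeps its original token and gains "엘라스틱서치" at the same position, "Harry" is replaced by "해리", and "Potter" passes through unchanged.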

The following is the process I tested myself.

# Create synonym test index 1
PUT sample_synonym
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym_test": {
            "tokenizer": "whitespace",
            "filter": [ "synonym" ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonym.txt",
            "updateable": true
          }
        }
      }
    }
  }
}

Below are _analyze examples run against test index 1.

# Query example 1
GET sample_synonym/_analyze
{
  "analyzer": "synonym_test",
  "text": "코타키나바루 호치민 시티"
}

# Query example 2
GET sample_synonym/_analyze
{
  "analyzer": "synonym_test",
  "text": "코타키나바루,호치민 시티"
}

Below are the _analyze results for test index 1.

# Result for "text": "코타키나바루 호치민 시티"
{
  "tokens" : [
    {
      "token" : "코타키나바루",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Kota",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "코타",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "호치민",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Kinabalu",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "키나바루",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "Thành",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "Ho",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "호치",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "시티",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "Pho",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 2
    },
    {
      "token" : "Chi",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 2
    },
    {
      "token" : "민",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 2
    },
    {
      "token" : "Ho",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "Minh",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "시티",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "Chí",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 4
    },
    {
      "token" : "City",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 4
    },
    {
      "token" : "Minh",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 5
    }
  ]
}

# Result for "text": "코타키나바루,호치민 시티"
{
  "tokens" : [
    {
      "token" : "코타키나바루,호치민",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "시티",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    }
  ]
}

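** Because the tokenizer splits on whitespace, input delimited in another way (e.g., with a comma) is not tokenized properly: "코타키나바루,호치민" stays a single token, so no synonym rule matches it.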
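
The offsets in the two results above can be reproduced with a few lines of Python. This is a rough stand-in for the whitespace tokenizer, written only to show why the comma-separated input stays in one piece. The helper name whitespace_tokenize is made up for this example.

```python
import re


def whitespace_tokenize(text):
    """Split on whitespace only and keep (token, start_offset, end_offset)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


# Space-separated input: three tokens, each of which can match a synonym rule.
print(whitespace_tokenize("코타키나바루 호치민 시티"))

# Comma-separated input: the comma is not a delimiter, so the first two
# words are emitted as one token that matches no rule in the dictionary.
print(whitespace_tokenize("코타키나바루,호치민 시티"))
```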

Below are _analyze examples run against test index 2 (sample_synonym_nori).

# Query using whitespace
GET sample_synonym_nori/_analyze
{
  "analyzer": "synonym_test",
  "text": "코타키나바루 호치민 시티"
}

# Query using a comma
GET sample_synonym_nori/_analyze
{
  "analyzer": "synonym_test",
  "text": "코타키나바루,호치민 시티"
}

Below is the _analyze result for test index 2.

{
  "tokens" : [
    {
      "token" : "코타키",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Kota",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "코타",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "나",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Kinabalu",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "키",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "바루",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "나",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 2
    },
    {
      "token" : "호치민",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "바루",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "Thành",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "Ho",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "호치",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "시티",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "Pho",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 4
    },
    {
      "token" : "Chi",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 4
    },
    {
      "token" : "민",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 4
    },
    {
      "token" : "Ho",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 5
    },
    {
      "token" : "Minh",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 5
    },
    {
      "token" : "시티",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 5
    },
    {
      "token" : "Chí",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 6
    },
    {
      "token" : "City",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 6
    },
    {
      "token" : "Minh",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 7
    }
  ]
}

** Because the tokenizer here is "nori_tokenizer", which by default discards punctuation such as the comma, both queries produced the same result.
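
The post does not show how test index 2 was created. A minimal sketch, assuming it mirrors test index 1 with only the tokenizer swapped for nori_tokenizer (which requires the analysis-nori plugin), might look like this. The helper build_synonym_index_body is hypothetical, and the actual settings used in the test may have differed.

```python
import json


def build_synonym_index_body(tokenizer):
    """Build an index-creation body like test index 1, with a chosen tokenizer."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "analyzer": {
                        "synonym_test": {
                            "tokenizer": tokenizer,
                            "filter": ["synonym"],
                        }
                    },
                    "filter": {
                        "synonym": {
                            "type": "synonym",
                            "synonyms_path": "analysis/synonym.txt",
                            # Option name as spelled in the Elasticsearch docs.
                            "updateable": True,
                        }
                    },
                }
            }
        }
    }


# Assumed body for: PUT sample_synonym_nori
nori_body = build_synonym_index_body("nori_tokenizer")
print(json.dumps(nori_body, ensure_ascii=False, indent=2))
```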

Additional index settings

- Setting "updateable": true on the synonym filter in the index settings is said to let an edited synonym dictionary take effect once the API below is called.

POST /sample_synonym/_reload_search_analyzers

POST /sample_synonym_nori/_reload_search_analyzers

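- In my own test on version 7.7.1, the dictionary was not updated properly even after calling this API, so I applied the update by closing and reopening the index instead. One possible cause is the option name: the Elasticsearch documentation spells it "updateable", and a different spelling may be silently ignored.

# Close the index
POST /sample_synonym_nori/_close
# Open the index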
POST /sample_synonym_nori/_open
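
The two ways of applying an edited dictionary can be wrapped in a small script. The sketch below only builds and prints the request paths used in this post. The send helper and the base URL http://localhost:9200 are assumptions (a local node without authentication), so the actual call is left commented out.

```python
import urllib.request


def refresh_requests(index, use_reload=True):
    """List the (method, path) calls that make the index re-read the synonym file."""
    if use_reload:
        return [("POST", f"/{index}/_reload_search_analyzers")]
    # Fallback used in this post: close the index, then open it again.
    return [("POST", f"/{index}/_close"), ("POST", f"/{index}/_open")]


def send(base_url, method, path):
    """Send one request without a body and return the HTTP status code."""
    req = urllib.request.Request(base_url + path, method=method)
    with urllib.request.urlopen(req) as resp:
        return resp.status


for method, path in refresh_requests("sample_synonym_nori", use_reload=False):
    print(method, path)
    # send("http://localhost:9200", method, path)  # assumed local, unsecured node
```

Keep in mind that a closed index cannot serve searches until it is opened again, so the fallback causes a short interruption that the reload API is meant to avoid.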

Reference: icarus8050.tistory.com/49
