[Elasticsearch] Index Creation, Elasticsearch Vector Search, and Hybrid Search

Search engine algorithms

BM25 (Best Match 25)
An extension of the TF-IDF (term frequency-inverse document frequency) approach
Used for keyword search
Produces sparse vectors

Semantic search

A search approach that aims to improve the accuracy and relevance of results by understanding the context and meaning behind a query

Elasticsearch Search API

Search results

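took: time the search took, in milliseconds
timed_out: whether the search timed out
_shards: number of shards searched, with counts of successful, skipped, and failed shards (failure details appear only when a shard fails)
hits: the search results
hits.total: total number of documents matching the query
hits.hits: the array of actual result documents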
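
As a rough illustration of how these fields fit together, the sketch below reads them out of a response that is already a Python dict. The function name `summarize_search_response` and the sample values are made up for this example; only the field names come from the list above.

```python
def summarize_search_response(resp):
    """Collect the fields described above from a search response dict."""
    shards = resp["_shards"]
    hits = resp["hits"]
    return {
        "took_ms": resp["took"],
        "timed_out": resp["timed_out"],
        "all_shards_ok": shards["failed"] == 0,
        "total_hits": hits["total"]["value"],
        # (document id, score) pairs in the order Elasticsearch returned them
        "top": [(h["_id"], h["_score"]) for h in hits["hits"]],
    }


# Hand-written sample shaped like a search response (not real output).
sample = {
    "took": 7,
    "timed_out": False,
    "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "max_score": 3.5,
        "hits": [
            {"_id": "doc-1", "_score": 3.5, "_source": {"Name": "Drama A"}},
            {"_id": "doc-2", "_score": 1.25, "_source": {"Name": "Drama B"}},
        ],
    },
}

summary = summarize_search_response(sample)
print(summary)
```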

POST /drama_hnsw/_search
{
  "query": {
    "match": {
      "Synopsis": "medical"
    }
  },
  "_source": [
    "Synopsis", "Name"
  ]
}

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 7,
      "relation": "eq"
    },
    "max_score": 27.088001,
    "hits": [
      {
        "_index": "drama_index",
        "_id": "WOKGH5ABGE6D2-d56djW",
        "_score": 27.088001,
        "_source": {
          "Name": "God's Quiz: Reboot",
          "Synopsis": "This is a drama about elite doctors and forensic scientists investigating mysterious deaths and solving mysteries related to rare diseases. Han Jin Woo is the tortured medical genius with a miracle brain that has been through its share of trouble. After he gets involved in an unexpected case, he returns to the medical examiner office for the first time in 4 years."
        }
      },

Index creation code

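from elasticsearch import Elasticsearch

# Connect to Elasticsearch
# (http_auth is deprecated in elasticsearch-py 8.x in favor of basic_auth)
es = Elasticsearch(hosts=['http://localhost:9200'], http_auth=("elastic", "changeme"))

# Index settings and mappings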
index_definition = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1,
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "standard"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "Name": {
                "type": "text",
                "analyzer": "standard"
            },
            "Synopsis": {
                "type": "text",
                "analyzer": "standard"
            },
            "synopsis_vector": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "l2_norm",
                "index_options": {
                    "type": "hnsw",
                    "m": 16,
                    "ef_construction": 100
                }
            }
        }
    }
}

# Name of the index
index_name = 'drama_hnsw'
# Create the index (delete any existing index first)
es.indices.delete(index=index_name, ignore=[400, 404])
es.indices.create(index=index_name, body=index_definition)

# Verify that the index was created
if es.indices.exists(index=index_name):
    print(f"Index '{index_name}' created successfully.")
else:
    print(f"Failed to create index '{index_name}'.")

Data ingestion code

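from elasticsearch import helpers

# df: pandas DataFrame with 'Name' and 'Synopsis' columns
# embed_text: embedding function defined elsewhere; it must return a 768-dimensional vector to match the mapping
upload_list = []
for idx, row in df.iterrows():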
    document = {
        "_index": "drama_hnsw",
        "_id": idx,
        "_source": {
            "Name": row['Name'],
            "Synopsis": row['Synopsis'],
            "synopsis_vector": embed_text(row['Synopsis'])
        }
    }
    upload_list.append(document)

resp = helpers.bulk(es, upload_list)
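
The loop above builds the whole list in memory before uploading. As an optional variation, the sketch below yields the same action dicts one at a time from a generator. `build_actions` and `fake_embed` are hypothetical names introduced here, and `fake_embed` is only a placeholder so the example runs without a model; it does not produce meaningful embeddings.

```python
def build_actions(rows, embed, index_name="drama_hnsw"):
    """Yield one bulk action per (doc_id, name, synopsis) row."""
    for doc_id, name, synopsis in rows:
        yield {
            "_index": index_name,
            "_id": doc_id,
            "_source": {
                "Name": name,
                "Synopsis": synopsis,
                "synopsis_vector": embed(synopsis),
            },
        }


def fake_embed(text, dims=768):
    # Placeholder only: a constant vector of the right length, not a real embedding.
    return [float(len(text) % 7)] * dims


rows = [
    (0, "Drama A", "A medical drama."),
    (1, "Drama B", "A legal drama."),
]
actions = list(build_actions(rows, fake_embed))
print(len(actions), len(actions[0]["_source"]["synopsis_vector"]))
```

Because `helpers.bulk` accepts any iterable of actions, the generator can be passed to it directly, for example `helpers.bulk(es, build_actions(rows, embed_text))`, with the real embedding function in place of the placeholder.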

VectorDB

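Elasticsearch also provides its own models:
- ELSER: for English text
- E5: for text in languages other than English
Elasticsearch supports the dense_vector data type, which stores vectors of a fixed number of dimensions, and provides similarity functions for scoring them (7.3 and later):
- cosineSimilarity: cosine similarity
- l2norm: Euclidean distance
- dotProduct: dot product
Generate vector values with an embedding model and index them (this stores and indexes the documents in Elasticsearch, but vectors are indexed differently from text)
-> store the field with type dense_vector
-> vectorize the user's question
-> compute similarity with a script_score query (Elasticsearch provides both the query and the similarity functions) to retrieve the most similar documents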

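script_query = {
    "script_score": {
        # Query that selects the documents to score
        "query": {"match_all": {}},
        # Script that computes the score of each selected document
        "script": {
            # +1.0 keeps the score non-negative (cosine similarity ranges from -1 to 1)
            "source": "cosineSimilarity(params.query_vector, 'synopsis_vector') + 1.0",
            # `query` is the embedded user question (an array-like vector)
            "params": {"query_vector": query.tolist()}
        }
    }
}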
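
To make the `+ 1.0` in the script concrete, here is a small pure-Python sketch of the same arithmetic, written only for illustration (Elasticsearch runs its own implementation of `cosineSimilarity`). The helper name `script_score_value` is made up for this example. Cosine similarity ranges from -1 to 1 and script_score does not allow negative scores, so the shift moves the range to 0 through 2.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def script_score_value(query_vector, doc_vector):
    # Mirrors the script source: cosineSimilarity(...) + 1.0
    return cosine_similarity(query_vector, doc_vector) + 1.0


same_direction = script_score_value([1.0, 0.0], [2.0, 0.0])
orthogonal = script_score_value([1.0, 0.0], [0.0, 3.0])
opposite = script_score_value([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score highest and vectors pointing in opposite directions score lowest, regardless of their lengths.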

    "hits": [
      {
        "_index": "drama_index", 
        "_id": "WOKGH5ABGE6D2-d56djW",
        "_score": 27.088001, # relevance score: how well the document matches the query
        "_source": {
          "Name": "God's Quiz: Reboot",
          "Synopsis": "This is a drama about elite doctors and forensic scientists investigating mysterious deaths and solving mysteries related to rare diseases. Han Jin Woo is the tortured medical genius with a miracle brain that has been through its share of trouble. After he gets involved in an unexpected case, he returns to the medical examiner office for the first time in 4 years."
        }
      },

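_score is a value the search engine computes automatically to express how well a document matches the search query.
- Factors considered: TF-IDF-style term statistics (term frequency, field-length normalization, position of the search terms, and so on). The score comes from the query's scoring formula; users cannot assign it directly on a document.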

Differences

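Category	Elasticsearch (text search)	Elasticsearch vector search
Purpose	Text-based search	Vector similarity search
Search method	Quickly looks up documents that contain specific terms	Computes distances between vectors to find similar vectors
Use case	Keyword search	Document similarity search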
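
The two columns roughly correspond to two different request shapes. The sketch below builds both as plain dicts. It assumes Elasticsearch 8.x, where the search API accepts a top-level `knn` option on an indexed dense_vector field; the helper names and default values are illustrative, not taken from the original post.

```python
def keyword_search_body(field, text, size=10):
    """Text search: match documents that contain the query terms."""
    return {"size": size, "query": {"match": {field: text}}}


def vector_search_body(field, query_vector, k=10, num_candidates=100):
    """Vector search: approximate nearest neighbours over the HNSW index."""
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }


text_body = keyword_search_body("Synopsis", "medical")
knn_body = vector_search_body("synopsis_vector", [0.1, 0.2, 0.3], k=5, num_candidates=50)
print(text_body)
print(knn_body)
```

The short three-value vector is only for display; against the index defined earlier the query vector would need 768 dimensions.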

Hybrid search

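Article on advanced RAG techniques: https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6
Hybrid search merges dense vectors (semantic search) with sparse vectors (keyword search).
Keyword search: given a word or phrase, looks up matching terms in the database
Vector search: identifies the meaning of the query and returns semantically related documents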
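
The post does not say how the two result lists are merged. One common option is Reciprocal Rank Fusion (RRF), sketched below as an assumption rather than as the method used here. Each document earns 1 / (k + rank) from every list it appears in, so only ranks matter and the keyword and vector scores never need to share a scale. The constant k = 60 is a conventional default, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]  # e.g. ids from a match (BM25) query, best first
vector_hits = ["b", "d", "a"]   # e.g. ids from a vector query, best first

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents found by both searches ("a" and "b") rise above documents found by only one, which is the intended effect of combining keyword and semantic retrieval.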

BM25

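TF (term frequency): how many times the search term appears in a document
IDF (inverse document frequency): gives more weight to rare terms
Document length normalization: keeps long documents from unfairly dominating the results
Query term saturation: keeps excessively repeated terms from skewing the results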
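
The four components above can be seen in a minimal sketch of the classic BM25 formula. This is a toy implementation over pre-tokenized documents, not Elasticsearch's internal code. The values k1 = 1.2 and b = 0.75 are the usual defaults, and the tiny corpus is invented for illustration.

```python
import math


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against the query terms."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)  # TF: occurrences in this document
        if tf == 0:
            continue
        df = sum(1 for doc in corpus if term in doc)
        # IDF: rarer terms get a larger weight
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization: documents longer than average are penalized
        length_norm = 1 - b + b * len(doc_tokens) / avg_len
        # Saturation: the gain from each extra occurrence shrinks as tf grows
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


corpus = [
    "medical drama about doctors".split(),
    "medical medical medical medical".split(),
    "a long romance drama set in a small town with no hospital at all".split(),
]
scores = [bm25_score(["medical"], doc, corpus) for doc in corpus]
print(scores)
```

The second document repeats the term four times yet scores well under four times the first, which is the saturation effect; the third document does not contain the term and scores zero.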

hybrid architecture

Opensearch VS Elasticsearch

License: Attribution, NonCommercial, NoDerivatives (CC BY-NC-ND)


월클데싸
