VectorDB #2 (memory issue)

코코팜 2025. 4. 2. 02:37

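Experiments on, and measurements of, the memory issues of HNSW indexing in a vector DB

A summary of the service-level problems caused by the structural requirement that the HNSW graph stay resident in memory.

Environment setup

Rebuilt a local test environment similar to production.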

OpenSearch cluster (3 node)

OpenSearch cluster created as 3 containers on a single host
Host: 11 cores, 36 GB
1.2M documents indexed

Metric

CPU usage (average number of cores occupied over 1 minute)
virtual memory
RSS (physical memory allocated)
file I/O

Index Create

"embedding": {
  "type": "knn_vector",
  "dimension": 1024,
  "method": {
    "engine": "faiss",
    "space_type": "l2",
    "name": "hnsw",
    "parameters": {
      "ef_construction": 128,
      "m": 16
    }
  }
},
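
To put a number on how much native memory this mapping implies, the sketch below applies the rule of thumb from the OpenSearch k-NN documentation for HNSW, 1.1 * (4 * dimension + 8 * m) bytes per vector. The function name is hypothetical, and the result is an estimate of graph memory only, not of JVM heap or on-disk size.

```python
GIB = 1024 ** 3


def hnsw_graph_bytes(num_vectors: int, dimension: int, m: int) -> float:
    """Estimate native memory for an HNSW graph.

    Uses the OpenSearch k-NN rule of thumb:
    1.1 * (4 * dimension + 8 * m) bytes per vector.
    """
    return 1.1 * (4 * dimension + 8 * m) * num_vectors


if __name__ == "__main__":
    # Values taken from this post: 1.2M documents, dimension 1024, m = 16.
    per_copy = hnsw_graph_bytes(1_200_000, 1024, 16)
    print(f"one copy of the index: {per_copy / GIB:.2f} GiB")
```

For 1.2M vectors this comes to roughly 5.2 GiB per copy of the index. Assuming one replica per shard and an even spread over three nodes, that is about 3.5 GiB per node.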

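Document Insert (200,000 documents inserted)

Checked the cluster state after submitting a task that inserts the embedded documents.

While the documents were being indexed, RSS stayed at roughly the same level (the JVM appears to be managing the memory).
Virtual memory, however, fluctuated heavily -> the indexed data is being flushed to files.
Storing the index was expected to produce mostly file writes, but reads were heavy as well -> the index is repeatedly moved between memory and files.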

Search

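After the insert, measured latency and monitored the nodes while running top-k 100 queries.

Confirmed that right after the insert the HNSW index is not loaded into memory but stays on disk as files.
It is loaded into memory only when the next request arrives, which causes a cold start.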

First search latency

Second search latency
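
A minimal sketch of how the first and second search latencies could be compared. It assumes a cluster reachable over plain HTTP without authentication at a hypothetical `base_url`, and it uses the `target-index` name and the `embedding` field from this post. The helper names are illustrative only.

```python
import json
import time
import urllib.request


def knn_query(vector, k=100, field="embedding"):
    """Build a top-k k-NN query body for the given vector field."""
    return {"size": k, "query": {"knn": {field: {"vector": list(vector), "k": k}}}}


def timed_search(base_url, index, body):
    """Run one search; return (client-side latency in ms, server-reported 'took' in ms)."""
    req = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return elapsed_ms, payload.get("took")


def compare_cold_and_warm(base_url, index, vector):
    """Issue the same query twice; the first call pays any graph-loading cost."""
    body = knn_query(vector)
    return [timed_search(base_url, index, body) for _ in range(2)]
```

Calling `compare_cold_and_warm("http://localhost:9200", "target-index", [0.0] * 1024)` right after an insert would return two latency pairs, the first of which includes the cold start.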

Warm up

Load the HNSW graphs of the index into memory ahead of time so that search requests can be served immediately.

On a warm-up request, confirmed that about 3.5 GB of index per node is loaded into memory.
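
The warm-up and the per-node check can be sketched with the k-NN plugin's warmup and stats endpoints (`/_plugins/_knn/warmup/<index>` and `/_plugins/_knn/stats`). This assumes an unauthenticated HTTP endpoint at a hypothetical `base_url`, and that the stats response reports `graph_memory_usage` in kilobytes per node. The helper names are illustrative.

```python
import json
import urllib.request


def warmup_url(base_url, index):
    """URL of the k-NN warmup API for one index."""
    return f"{base_url}/_plugins/_knn/warmup/{index}"


def stats_url(base_url):
    """URL of the k-NN stats API."""
    return f"{base_url}/_plugins/_knn/stats"


def get_json(url):
    """GET a URL and decode the JSON response."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def graph_memory_gib_per_node(stats):
    """Convert each node's graph_memory_usage (assumed to be in KB) to GiB."""
    return {
        node_id: node.get("graph_memory_usage", 0) / (1024 ** 2)
        for node_id, node in stats.get("nodes", {}).items()
    }


def warm_up_and_report(base_url, index):
    """Trigger a warm-up, then report native graph memory per node."""
    get_json(warmup_url(base_url, index))
    return graph_memory_gib_per_node(get_json(stats_url(base_url)))
```

For example, `warm_up_and_report("http://localhost:9200", "target-index")` would return a mapping from node ID to loaded graph memory in GiB.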

After warm-up

Issues

Repeatedly loading the index into memory and storing it back triggers JVM GC
2025-03-30 17:41:11 opensearch-node3 | [2025-03-30T08:41:11,213][INFO ][o.o.m.j.JvmGcMonitorService] [opensearch-node3] [gc][130] overhead, spent [708ms] collecting in the last [1.7s]
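
To quantify log lines like the one above, the sketch below extracts the time spent collecting and the observation window from a `JvmGcMonitorService` overhead message. The regular expression is written against the single line shown here, so other message variants may not match.

```python
import re

# Matches e.g. "[gc][130] overhead, spent [708ms] collecting in the last [1.7s]"
GC_OVERHEAD = re.compile(
    r"\[gc\]\[(\d+)\] overhead, spent \[([\d.]+)(ms|s|m)\] "
    r"collecting in the last \[([\d.]+)(ms|s|m)\]"
)
UNIT_TO_MS = {"ms": 1.0, "s": 1000.0, "m": 60000.0}


def gc_overhead(line):
    """Return (spent_ms, window_ms, ratio) or None if the line does not match."""
    match = GC_OVERHEAD.search(line)
    if match is None:
        return None
    spent_ms = float(match.group(2)) * UNIT_TO_MS[match.group(3)]
    window_ms = float(match.group(4)) * UNIT_TO_MS[match.group(5)]
    return spent_ms, window_ms, spent_ms / window_ms
```

For the line above, 708 ms out of a 1.7 s window means roughly 42% of that window was spent in garbage collection.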

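Shard rebalancing
The cluster tries to keep shards evenly balanced across nodes; when node state occasionally leaves a node with an uneven share of shards, those shards are reallocated.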

GET _list/shards/target-index

target-index 0 r STARTED    238002 5.3gb 172.23.0.3 opensearch-node3
target-index 0 p STARTED    238002 5.3gb 172.23.0.5 opensearch-node1
target-index 1 r STARTED    238174 5.3gb 172.23.0.3 opensearch-node3
target-index 1 p STARTED    238174 5.3gb 172.23.0.4 opensearch-node2
target-index 2 r STARTED    238053 5.3gb 172.23.0.3 opensearch-node3
target-index 2 p STARTED    238053 5.3gb 172.23.0.5 opensearch-node1
target-index 3 r STARTED    238743 5.3gb 172.23.0.3 opensearch-node3
target-index 3 p RELOCATING 238743 5.3gb 172.23.0.5 opensearch-node1 -> 172.23.0.4 UnHMdjXFREi63---U6yhEA opensearch-node2
target-index 4 r STARTED    238772 5.3gb 172.23.0.5 opensearch-node1
target-index 4 p STARTED    238772 5.3gb 172.23.0.4 opensearch-node2
target-index 5 p STARTED    239197 5.3gb 172.23.0.5 opensearch-node1
target-index 5 r STARTED    239197 5.3gb 172.23.0.4 opensearch-node2
next_token null

Heavy file I/O and memory load were observed during this relocation.
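
The imbalance in the shard listing above can be tallied per node with a few lines of Python. This is a sketch that assumes the whitespace-separated column order shown in this post (index, shard, prirep, state, docs, store, ip, node); a relocating shard is counted on its source node.

```python
from collections import Counter


def shards_per_node(listing):
    """Count shard copies per node from a whitespace-separated shard listing."""
    counts = Counter()
    for line in listing.splitlines():
        parts = line.split()
        # Skip headers, blank lines, and trailers such as "next_token null".
        if len(parts) < 8 or parts[2] not in ("p", "r"):
            continue
        counts[parts[7]] += 1
    return counts
```

Applied to the listing above, this gives 5 shard copies on opensearch-node1, 3 on opensearch-node2, and 4 on opensearch-node3, which matches the relocation of one primary from node1 to node2.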

Insight

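Loading from disk into memory should be fast in a local environment, but wouldn't this step take considerably longer in a cloud environment?
- Most cloud-based DBs presumably use EBS storage; wouldn't its latency be too high for building a real-time service?
- In particular, because of how vector search (HNSW) is structured, the amount of data stored in the index grows very quickly.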
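
A back-of-the-envelope sketch of the concern above: how long a purely sequential load of the roughly 3.5 GB per-node graph would take at a given storage throughput. The throughput values are assumptions, not measurements from this experiment: 125 MiB/s is the documented baseline of an AWS gp3 EBS volume, and 2000 MiB/s is a hypothetical figure for a local NVMe disk. Page cache, deserialization cost, and random-access overhead are ignored, and 3.5 GB is treated as 3.5 GiB.

```python
def load_seconds(size_gib: float, throughput_mib_s: float) -> float:
    """Time to read size_gib sequentially at throughput_mib_s (lower bound)."""
    return size_gib * 1024 / throughput_mib_s


if __name__ == "__main__":
    # Assumed throughputs; replace with values measured on the target volume.
    for label, mib_s in (("gp3 baseline (assumed)", 125), ("local NVMe (hypothetical)", 2000)):
        print(f"{label}: {load_seconds(3.5, mib_s):.1f} s")
```

Under these assumptions the cold load takes on the order of tens of seconds on the baseline EBS volume versus a couple of seconds locally, which is why warm-up matters more in the cloud setting.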