向量检索--ByteHouse云数仓版-火山引擎

文档中心

导航

向量检索

最近更新时间：2024.11.06 13:58:49首次发布时间：2024.11.06 13:58:49

向量是一种常见的非结构化数据表现形式。基于向量相似度的 KNN 计算广泛使用于图像搜索、多模态搜索、推荐、大模型推理等场景。我们在 ByteHouse 中提供向量数据的管理与近似度查询功能，同时通过支持多种常见近似向量查询（ANN）算法来提升检索性能，以提供对非结构化数据的处理能力。当前Bytehouse支持 HNSW(hnswlib)、Faiss 两个算法库，后续还会对 DiskANN 等算法库提供支持。

添加索引

create table 时定义

CREATE TABLE test_ann
(
    `id` UInt64,
    `label` String,
    `vector` Array(Float32),
    INDEX v1 vector TYPE HNSW('DIM=960, METRIC=COSINE'),
    CONSTRAINT cons_vec_len CHECK length(vector) = 960 -- 用于数据插入检查
)
ENGINE = CnchMergeTree
ORDER BY id
SETTINGS 
index_granularity = 1024, -- 缓解读放大问题
index_granularity_bytes = 0 -- 避免生成 adaptive mark

Add index添加

ALTER TABLE test_ann ADD INDEX v1 vector TYPE HNSW('DIM=960, METRIC=COSINE')

说明：

索引只接受一个参数，类型为 String，内部的定义格式为 k1=v1, k2=v2, ... ， DIM 是一个必须的参数，代表索引对应 vector 的维度信息，必须与 vector 实际的维度一致。
METRIC 参数定义了建立索引时的度量方式。目前 HNSW 以及 Faiss 都支持 L2 与 COSINE 距离。METRIC 参数可以不指定，缺省值为 L2。需要说明的是，后续只有使用对应 metric 的 distance 函数，才能执行基于索引的查询链路。例如，索引中 metric 为 L2 的话，后续查询需使用 L2Distance 函数进行查询才能使用对应索引，同理，metric 为 COSINE 时，需要使用 cosineDistance 函数进行查询。
目前一张表仅支持构建一个 vector index，如果为一个 vector column 定义多个 vector index，或者为大于一个 vector column 定义 vector index，会报错
如果插入数据中出现了空 vector 行，如果定义了类似上述 cons_vec_len 的长度检测 constraint，则插入时会报错。如果已经插入成功了，那么在 build index 过程中也会有维度一致性检测，如果发现有维度不一致情况，也会报错，该 part 构建 index 失败。
如果遇到Not allow to create ANN INDEX的报错，可以先执行SET enable_new_ann = 1;

使用索引查询

使用索引查询方式如下：

select 
    id， 
    label,
    dist 
from test_ann 
order by cosineDistance(vector, [query_vector]) as dist 
limit 100 
settings enable_new_ann=1

说明：

在设置 enable_new_ann 为 1 时，才会进行基于我们新实现的 ann 查询路径执行
目前只有带有 order by {metric}Distance limit k 这种模式的查询才会使用向量索引进行计算
Distance 函数必须与 index 定义的 metric 对应上才能正确使用索引

此外，还支持带有 prewhere 的混合查询使用方式，比如

select 
    id,
    label,
    dist 
from test_ann 
prewhere id > 1000
order by cosineDistance(vector, [query_vector]) as dist 
limit 100 
settings enable_new_ann=1

ANN Index 详细使用说明

HNSW

图类型索引，通过一种最小世界建立图的思路，为每个节点维护 M 个最近邻节点。同时，维护一个分层图结构，越向上越稀疏，查询时从最上层开始，找到最近点以后继续向下一层计算，直到最底层，得到查询向量的 k 个最近邻。

构造语法

HNSW 使用时可以定义参数 M 和 EF_CONSTRUCTION 两个参数来在性能和准确度之间做权衡，一般来说 M 越大，EF_CONSTRUCTION 越大，索引构建时间越长，准确度越高，搜索 latency 越高。M 的缺省值为 16，EF_CONSTRUCTION 的缺省值为 200。
一个典型的构造 HNSW 索引的语句如下：

CREATE TABLE test_ann
(
    `id` UInt64,
    `vector` Array(Float32),
    INDEX v1 vector TYPE HNSW('DIM=960, METRIC=COSINE, M=32, EF_CONSTRUCTION=512')
)
ENGINE = CnchMergeTree
ORDER BY id
SETTINGS index_granularity = 1024

查询语法

查询时，可以通过设定 hnsw_ef_s 参数来控制准确度和 latency，该值越大，准确度越高，latency 越长，大于 ef_construction 参数后，理论上准确度不会有更大提升。使用方法如下

select 
    id， 
    dist 
from test_ann 
order by cosineDistance(vector, [query_vector]) as dist 
limit 100 
settings enable_new_ann=1, hnsw_ef_s=200

Faiss

Faiss 是 facebook 开源的 ann 算法库，包含了倒排(ivf), PQ, SQ 等多种类型的索引，同时多种索引还可以组合使用。我们主要使用 faiss 的 ivf 类索引，同时支持 PQ ， SQ 等向量压缩方法，以减少索引的内存使用。

构造语法

Faiss index的创建语法如下, 包括3个参数

Basic information（必须）: 数据的维度信息的必须的，除此之外还可以指定metric type。Example：'DIM=960, METRIC=COSINE'
Index key（可选）：Index信息，不同的index决定了准确度，性能和资源使用，没有最优的index，不同index的选择是这3个维度的tradeoff, 最常用的index是IVFPQ，IVF{$nlist},PQ{$m}需要指定nlist和m这2个参数。Example：'IVF1024,PQ64',
Build parameters （可选）：这个参数是控制准确度和性能tradeoff的参数。Example: 'nprobe=512, k_factor_rf=32‘

CREATE TABLE test_ann
(
    `id` UInt64,
    `vector` Array(Float32),
    INDEX v1 vector TYPE Faiss('Basic information', 'Index key', 'Build parameters')
)
ENGINE = CnchMergeTree
ORDER BY id
SETTINGS index_granularity = 128

构造 API

Faiss index的使用有3种方式 (推荐使用第一种, 对于构建时间有要求、内存紧张的场景，优先推荐 IVF_PQ_FS 类型索引)

提供几种常用的index类型，并提供微调参数给用户使用。并会根据数据量自动adaptively的调整其中少部分参数，仅保留跟数据静态特征相关的参数给用户。

支持的index类型，加粗的参数表示必须的，IVF*索引实际还要一个参数nlist，跟part数据量相关，引擎会自动推算（计算方法在下面）

FLAT('dim', 'metric')
- Dim: 维度，通常是2的幂
- Metric：L2，cosine
IVF_FLAT('dim', 'metric', 'nprobe')
- nprobe: 默认是1，越大准确率越好，性能越差，取值为2的n次方，例如32，64，范围介于[1, std::min(nlist/2, 查询里的top n)]
IVF_PQ('dim', 'metric', 'm', 'nbit', 'nprobe')
- m: the number of subvectors that we will split our vectors into, 经验取值通常为64或者32，取值范围为2的n次方，范围介于[1, std::min(dim, 64)]
- nbit: the number of bits that each subquantizer can use，默认8，通常不用动，取值为偶数, 通常范围为[1, 16]
IVF_PQ_FS('dim', 'metric', 'm', 'refine', 'k_factor_rf')
- fs scan with refine（RFlat）
- m同上
- refine,默认使用RFlat，准确度最高，还可以使用SQ4, SQ6, SQ8, SQfp16。准确度RFlat>SQfp16>SQ8>SQ6>SQ4，内存资源使用也从多到少。
- k_factor_rf：默认是1，越大准确率越好，性能越差，正整数，通常取值范围[1, 64]，可以在查询语句中单独定义。

例子, 每种索引各给了一个建表示例

#FLAT
CREATE TABLE tab_flat(id Int32, vector Array(Float32), INDEX faiss_index vector TYPE FLAT('dim=4') GRANULARITY 1) ENGINE = CnchMergeTree ORDER BY id SETTINGS index_granularity = 128
#IVF_FLAT
CREATE TABLE tab_ivf_flat(id Int32, vector Array(Float32), INDEX faiss_index vector TYPE IVF_FLAT('dim=4') GRANULARITY 1) ENGINE = CnchMergeTree ORDER BY id SETTINGS index_granularity = 128

#IVF_PQ
CREATE TABLE tab_ivf_pq(id Int32, vector Array(Float32), INDEX faiss_index vector TYPE IVF_PQ('dim=4','m=4','nbit=4') GRANULARITY 1) ENGINE = CnchMergeTree ORDER BY id SETTINGS index_granularity = 128

#IVF_PQ_FS
CREATE TABLE tab_ivf_pq_fs(id Int32, vector Array(Float32), INDEX faiss_index vector TYPE IVF_PQ_FS('dim=4','m=4') GRANULARITY 1) ENGINE = CnchMergeTree ORDER BY id SETTINGS index_granularity = 128

#IVF_PQ_FS_SQ8
CREATE TABLE tab_ivf_pq_fs_sq8(id Int32, vector Array(Float32), INDEX faiss_index vector TYPE IVF_PQ_FS('dim=4','m=2', 'refine=SQ8') GRANULARITY 1) ENGINE = CnchMergeTree ORDER BY id SETTINGS index_granularity = 128

-- 查询
SELECT *, d
FROM xxx.yyy
ORDER BY L2Distance(vector, [...]) as d limit 3 
settings 
enable_new_ann=1, 
nprobe=256,
k_factor_rf=32;

nlist计算逻辑 (聚类中心数量，越大构建速度越慢）
建表的时候可以nlist参数指定每个part构建index时候nlist的上限

static size_t decideIndexNlist(size_t n)
{
    if (n < ONE_MILLION)
        return std::min(int(n), int(4 * std::floor(std::sqrt(n))));
    else if (n >= ONE_MILLION && n < ONE_MILLION * 2)
        return 1024;
    else if (n >= ONE_MILLION * 2 && n < TEN_MILLION)
        return 4096;
    else if (n > TEN_MILLION)
        return 65536;
}

自动选择index，用户不需要指定具体的index类型，引擎会根据用户的数据量自动选择适合的index。建表的时候用户只需要指定维度信息即可。

最简单的构造 Faiss 索引的语句如下：

CREATE TABLE test_ann
(
    `id` UInt64,
    `vector` Array(Float32),
    INDEX v1 vector TYPE Faiss('DIM=960')
)
ENGINE = CnchMergeTree
ORDER BY id
SETTINGS index_granularity = 128

# 指定维度和metric type
CREATE TABLE test_ann
(
    `id` UInt64,
    `vector` Array(Float32),
    INDEX v1 vector TYPE Faiss('DIM=960, METRIC=COSINE')
)
ENGINE = CnchMergeTree
ORDER BY id
SETTINGS index_granularity = 128

#查询
SELECT *, d
FROM xxx.yyy
ORDER BY L2Distance(vector, [...]) as d limit 3 settings enable_new_ann=1;

自动选择适合的index的逻辑

static const String IVF2048 = "IVF2048";
static const String IVF4096 = "IVF4096"; // recommended for 2 * ONE_MILLION <=n < 5 * ONE_MILLION
static const String IVF8192 = "IVF8192"; // recommended for 5 * ONE_MILLION <=n < TEN_MILLION
static const String IVF65536 = "IVF65536"; // recommended for ONE_MILLION <=n < TEN_MILLION
static const String IVF262144  = "IVF262144"; // recommended for TEN_MILLION <=n < HUNDRED_MILLION
static const String IVF1048576  = "IVF1048576"; // recommended for HUNDRED_MILLION <=n < ONE_BILLION

static String getAdaptiveIndexKey(size_t n, size_t d)
{
    if (n < 256)
    {
        // for PQ, if nbits is 8, at least 256 vectors is required
        return "Flat";
    }
    else
    {
        // for simplicity, use ivf pq for auto indexing, adjust nlist and m adaptively, and d % m must be 0,
        // so we required d is power of 2 here, usually dimension is multiple of 32 or 64, pq32 and pq64 are widely used.
        std::ostringstream os;
        size_t pq_m = std::min(64, int(largestPowerOf2LessThan(d)));
        if (d % pq_m != 0)
            throw Exception("d must be power of 2 for ivfpq index", ErrorCodes::BAD_ARGUMENTS);

        size_t nlist;
        if (n < ONE_MILLION)
        {
            nlist = std::min(1024, int(4 * std::floor(std::sqrt(n))));
            os << "IVF" << nlist << ",PQ" << pq_m;
        }
        else if (n >= ONE_MILLION && n < ONE_MILLION * 2)
        {
            os << IVF2048 << ",PQ" << pq_m;
        }   
        else if (n >= ONE_MILLION * 2 && n < TEN_MILLION)
        {
            os << IVF4096 << ",PQ" << pq_m;
        }
        else if (n >= TEN_MILLION && n < HUNDRED_MILLION)
        {
            // exmaple IVF262144_FS="IVF262144(IVF512,PQ64x4fs,RFlat)"
            os << "IVF262144(IVF512,PQ" << pq_m << "x4fs,RFlat),PQ" << pq_m; 
        }
        else if (n >= HUNDRED_MILLION && n < ONE_BILLION)
        {
            os << IVF1048576 << ",PQ" << pq_m;
        }
        else
            throw Exception("Not supported yet for 1B vectors in one part", ErrorCodes::NOT_IMPLEMENTED);

        return os.str();
    }
}

查询语法

查询时候可以通过settings参数对准确率进行调优支持的参数如下，固定pattern API会用到前面2个参数nprobe和k_factor_rf，通过FAISS创建的自动index，还会用到后两个。

nprobe
k_factor_rf
quantizer_nprobe
quantizer_k_factor_rf

Example query:

-- 基本查询
SELECT *, d
FROM xxx.yyy
ORDER BY L2Distance(vector, [...]) as d limit 3 
settings 
enable_new_ann=1;


-- IVF_FLAT, IVF_PQ 可以设置nprobe
SELECT *, d
FROM xxx.yyy
ORDER BY L2Distance(vector, [...]) as d limit 3 
settings 
enable_new_ann=1, nprobe=256;

-- IVF_PQ_FS 可以设置nprobe和k_factor_rf
SELECT *, d
FROM xxx.yyy
ORDER BY L2Distance(vector, [...]) as d limit 3 
settings 
enable_new_ann=1, nprobe=256, k_factor_rf=32;

-- 对于使用FAISS自动index创建的index，还可以指定quantizer_nprobe和quantizer_k_factor_rf
select 
    id, label 
from test_ann 
order by L2Distance(vector, [1.0, 1.0, 1.0]) 
limit 100
settings 
enable_new_ann=1,
nprobe=128, -- 计算使用 128 个聚类中心
k_factor_rf=32，-- refine index 使用至少 32 个结果做结果重排
quantizer_nprobe=96， -- nested ivf index 使用 96 个聚类中心计算
quantizer_k_factor_rf=16，-- nested refine index 使用至少 16 个结果做重排

注意事项

Faiss不同于hnsw，构建index的时候会做训练，对cpu消耗比较大，有如下一些注意事项：
使用 faiss index 的时候需要把 merge 上线参数 max_total_rows_to_merge 设置到 50000000，避免build过大的index。

alter table zhaojie.test_1b_3 MODIFY SETTING merge_selector_config='{"name": "simple", "max_total_rows_to_merge": 50000000}'

根据内存大小设置 index cache 大小，参数是 vector_index_cache_size，默认200GB，是一个LRU实现
如果对冷度敏感，内存足够，可以打开 enable_cache_vector_index_after_write 参数，在写入的时候把build的index加载到内存中
测试的时候可以关注下cpu的使用，如果cpu使用过高，
- 限制merge并发，max_replicated_merges_in_queue
- 可以考虑限制下train index的并发，参数是vector_index_build_threads，默认是8

Filter By Distance

当前可以通过 where 语句来支持基于 distance 的过滤，语法示例如下：

select id, dist from test_ann where dist < 0.9 
order by cosineDistance(embedding_column, [query_vector]) as dist
limit 100

说明：

针对 where dist < max_score 以及 where dist > min_score 这类单一 filter 条件查询，bytehouse 有相对应的优化，目前推荐这种使用方式

Python API

clickhouse_driver: TCP 连接

建立连接

from clickhouse_driver import Client
client = Client(host="server", # server ip
                         port=9000,  # server tcp port
                         user="test", # user
                         password="password", # password
                         connect_timeout=3000,
                         send_receive_timeout=3000) # connect timeout

建表

schema = f"""\
                CREATE TABLE IF NOT EXISTS {database}.{table}(
                    id UInt64,
                    embedding Array(Float32),
                    CONSTRAINT cons_vec_len CHECK length(embedding) = {dim},
                    INDEX vec_idx embedding TYPE HNSW('METRIC={metric.upper()}, DIM={dim}')
                ) ENGINE = {engine} ORDER BY id\
                """


client.execute(schema)

插入向量

#  embeddings(list[list[float]]): list of embeddings
#  ids(list[int]): list of ids
data = zip(ids, embeddings)
values = [list(elem) for elem in data]
client.execute(f'INSERT INTO {database}.{table} (id, embedding) VALUES', values)

查询

# query: list[float]
q_str = f"""
        SELECT id
        FROM {database}.{collection}
        ORDER BY {metric}Distance(embedding, {str(query)}) 
        LIMIT {k}
        settings enable_new_ann=1, hnsw_ef_s={search_param["ef"]}
        """
        
results = client.execute(q_str)
result_ids = [int(i) for (i,) in results]

最佳实践

建表时参数：

index_granularity=128, index_granularity_bytes=0,enable_vector_index_preload = 1

ES 分数对应

转换为 es 相关 score 的公式

L2 similarity 公式：1 / (1 + pow(dist, 2)) * 100 AS similarity
Cosine similarity 公式：(1 - (dist * 0.5)) * 100 AS similarity