向量是一种常见的非结构化数据表现形式。基于向量相似度的 KNN 计算广泛使用于图像搜索、多模态搜索、推荐、大模型推理等场景。我们在 ByteHouse 中提供向量数据的管理与近似度查询功能,同时通过支持多种常见近似向量查询(ANN)算法来提升检索性能,以提供对非结构化数据的处理能力。当前Bytehouse支持 HNSW(hnswlib)、Faiss 两个算法库, 后续还会对 DiskANN 等算法库提供支持。
CREATE TABLE test_ann ( `id` UInt64, `label` String, `vector` Array(Float32), INDEX v1 vector TYPE HNSW('DIM=960, METRIC=COSINE'), CONSTRAINT cons_vec_len CHECK length(vector) = 960 -- 用于数据插入检查 ) ENGINE = CnchMergeTree ORDER BY id SETTINGS index_granularity = 1024, -- 缓解读放大问题 index_granularity_bytes = 0 -- 避免生成 adaptive mark
ALTER TABLE test_ann ADD INDEX v1 vector TYPE HNSW('DIM=960, METRIC=COSINE')
说明:
Not allow to create ANN INDEX
的报错,可以先执行SET enable_new_ann = 1;
使用索引查询方式如下:
select id, label, dist from test_ann order by cosineDistance(vector, [query_vector]) as dist limit 100 settings enable_new_ann=1
说明:
此外,还支持带有 prewhere 的混合查询使用方式,比如
select id, label, dist from test_ann prewhere id > 1000 order by cosineDistance(vector, [query_vector]) as dist limit 100 settings enable_new_ann=1
图类型索引,通过一种最小世界建立图的思路,为每个节点维护 M 个最近邻节点。同时,维护一个分层图结构,越向上越稀疏,查询时从最上层开始,找到最近点以后继续向下一层计算,直到最底层,得到查询向量的 k 个最近邻。
HNSW 使用时可以定义参数 M 和 EF_CONSTRUCTION 两个参数来在性能和准确度之间做权衡,一般来说 M 越大,EF_CONSTRUCTION 越大,索引构建时间越长,准确度越高,搜索 latency 越高。M 的缺省值为 16,EF_CONSTRUCTION 的缺省值为 200。
一个典型的构造 HNSW 索引的语句如下:
CREATE TABLE test_ann ( `id` UInt64, `vector` Array(Float32), INDEX v1 vector TYPE HNSW('DIM=960, METRIC=COSINE, M=32, EF_CONSTRUCTION=512') ) ENGINE = CnchMergeTree ORDER BY id SETTINGS index_granularity = 1024
查询时,可以通过设定 hnsw_ef_s 参数来控制准确度和 latency,该值越大,准确度越高,latency 越长,大于 ef_construction 参数后,理论上准确度不会有更大提升。使用方法如下
select id, dist from test_ann order by cosineDistance(vector, [query_vector]) as dist limit 100 settings enable_new_ann=1, hnsw_ef_s=200
Faiss 是 facebook 开源的 ann 算法库,包含了倒排(ivf), PQ, SQ 等多种类型的索引,同时多种索引还可以组合使用。我们主要使用 faiss 的 ivf 类索引,同时支持 PQ , SQ 等向量压缩方法,以减少索引的内存使用。
Faiss index的创建语法如下, 包括3个参数
CREATE TABLE test_ann ( `id` UInt64, `vector` Array(Float32), INDEX v1 vector TYPE Faiss('Basic information', 'Index key', 'Build parameters') ) ENGINE = CnchMergeTree ORDER BY id SETTINGS index_granularity = 128
Faiss index的使用有3种方式 (推荐使用第一种, 对于构建时间有要求、内存紧张的场景,优先推荐 IVF_PQ_FS 类型索引)
支持的index类型,加粗的参数表示必须的,IVF*索引实际还要一个参数nlist,跟part数据量相关,引擎会自动推算(计算方法在下面)
例子, 每种索引各给了一个建表示例
#FLAT CREATE TABLE tab_flat(id Int32, vector Array(Float32), INDEX faiss_index vector TYPE FLAT('dim=4') GRANULARITY 1) ENGINE = CnchMergeTree ORDER BY id SETTINGS index_granularity = 128 #IVF_FLAT CREATE TABLE tab_ivf_flat(id Int32, vector Array(Float32), INDEX faiss_index vector TYPE IVF_FLAT('dim=4') GRANULARITY 1) ENGINE = CnchMergeTree ORDER BY id SETTINGS index_granularity = 128 #IVF_PQ CREATE TABLE tab_ivf_pq(id Int32, vector Array(Float32), INDEX faiss_index vector TYPE IVF_PQ('dim=4','m=4','nbit=4') GRANULARITY 1) ENGINE = CnchMergeTree ORDER BY id SETTINGS index_granularity = 128 #IVF_PQ_FS CREATE TABLE tab_ivf_pq_fs(id Int32, vector Array(Float32), INDEX faiss_index vector TYPE IVF_PQ_FS('dim=4','m=4') GRANULARITY 1) ENGINE = CnchMergeTree ORDER BY id SETTINGS index_granularity = 128 #IVF_PQ_FS_SQ8 CREATE TABLE tab_ivf_pq_fs_sq8(id Int32, vector Array(Float32), INDEX faiss_index vector TYPE IVF_PQ_FS('dim=4','m=2', 'refine=SQ8') GRANULARITY 1) ENGINE = CnchMergeTree ORDER BY id SETTINGS index_granularity = 128 -- 查询 SELECT *, d FROM xxx.yyy ORDER BY L2Distance(vector, [...]) as d limit 3 settings enable_new_ann=1, nprobe=256, k_factor_rf=32;
nlist计算逻辑 (聚类中心数量,越大构建速度越慢)
建表的时候可以nlist参数指定每个part构建index时候nlist的上限
static size_t decideIndexNlist(size_t n) { if (n < ONE_MILLION) return std::min(int(n), int(4 * std::floor(std::sqrt(n)))); else if (n >= ONE_MILLION && n < ONE_MILLION * 2) return 1024; else if (n >= ONE_MILLION * 2 && n < TEN_MILLION) return 4096; else if (n > TEN_MILLION) return 65536; }
最简单的构造 Faiss 索引的语句如下:
CREATE TABLE test_ann ( `id` UInt64, `vector` Array(Float32), INDEX v1 vector TYPE Faiss('DIM=960') ) ENGINE = CnchMergeTree ORDER BY id SETTINGS index_granularity = 128 # 指定维度和metric type CREATE TABLE test_ann ( `id` UInt64, `vector` Array(Float32), INDEX v1 vector TYPE Faiss('DIM=960, METRIC=COSINE') ) ENGINE = CnchMergeTree ORDER BY id SETTINGS index_granularity = 128 #查询 SELECT *, d FROM xxx.yyy ORDER BY L2Distance(vector, [...]) as d limit 3 settings enable_new_ann=1;
自动选择适合的index的逻辑
static const String IVF2048 = "IVF2048"; static const String IVF4096 = "IVF4096"; // recommended for 2 * ONE_MILLION <=n < 5 * ONE_MILLION static const String IVF8192 = "IVF8192"; // recommended for 5 * ONE_MILLION <=n < TEN_MILLION static const String IVF65536 = "IVF65536"; // recommended for ONE_MILLION <=n < TEN_MILLION static const String IVF262144 = "IVF262144"; // recommended for TEN_MILLION <=n < HUNDRED_MILLION static const String IVF1048576 = "IVF1048576"; // recommended for HUNDRED_MILLION <=n < ONE_BILLION static String getAdaptiveIndexKey(size_t n, size_t d) { if (n < 256) { // for PQ, if nbits is 8, at least 256 vectors is required return "Flat"; } else { // for simplicity, use ivf pq for auto indexing, adjust nlist and m adaptively, and d % m must be 0, // so we required d is power of 2 here, usually dimension is multiple of 32 or 64, pq32 and pq64 are widely used. std::ostringstream os; size_t pq_m = std::min(64, int(largestPowerOf2LessThan(d))); if (d % pq_m != 0) throw Exception("d must be power of 2 for ivfpq index", ErrorCodes::BAD_ARGUMENTS); size_t nlist; if (n < ONE_MILLION) { nlist = std::min(1024, int(4 * std::floor(std::sqrt(n)))); os << "IVF" << nlist << ",PQ" << pq_m; } else if (n >= ONE_MILLION && n < ONE_MILLION * 2) { os << IVF2048 << ",PQ" << pq_m; } else if (n >= ONE_MILLION * 2 && n < TEN_MILLION) { os << IVF4096 << ",PQ" << pq_m; } else if (n >= TEN_MILLION && n < HUNDRED_MILLION) { // exmaple IVF262144_FS="IVF262144(IVF512,PQ64x4fs,RFlat)" os << "IVF262144(IVF512,PQ" << pq_m << "x4fs,RFlat),PQ" << pq_m; } else if (n >= HUNDRED_MILLION && n < ONE_BILLION) { os << IVF1048576 << ",PQ" << pq_m; } else throw Exception("Not supported yet for 1B vectors in one part", ErrorCodes::NOT_IMPLEMENTED); return os.str(); } }
查询时候可以通过settings参数对准确率进行调优支持的参数如下,固定pattern API会用到前面2个参数nprobe和k_factor_rf,通过FAISS创建的自动index,还会用到后两个。
Example query:
-- 基本查询 SELECT *, d FROM xxx.yyy ORDER BY L2Distance(vector, [...]) as d limit 3 settings enable_new_ann=1; -- IVF_FLAT, IVF_PQ 可以设置nprobe SELECT *, d FROM xxx.yyy ORDER BY L2Distance(vector, [...]) as d limit 3 settings enable_new_ann=1, nprobe=256; -- IVF_PQ_FS 可以设置nprobe和k_factor_rf SELECT *, d FROM xxx.yyy ORDER BY L2Distance(vector, [...]) as d limit 3 settings enable_new_ann=1, nprobe=256, k_factor_rf=32; -- 对于使用FAISS自动index创建的index,还可以指定quantizer_nprobe和quantizer_k_factor_rf select id, label from test_ann order by L2Distance(vector, [1.0, 1.0, 1.0]) limit 100 settings enable_new_ann=1, nprobe=128, -- 计算使用 128 个聚类中心 k_factor_rf=32,-- refine index 使用至少 32 个结果做结果重排 quantizer_nprobe=96, -- nested ivf index 使用 96 个聚类中心计算 quantizer_k_factor_rf=16,-- nested refine index 使用至少 16 个结果做重排
alter table zhaojie.test_1b_3 MODIFY SETTING merge_selector_config='{"name": "simple", "max_total_rows_to_merge": 50000000}'
当前可以通过 where 语句来支持基于 distance 的过滤,语法示例如下:
select id, dist from test_ann where dist < 0.9 order by cosineDistance(embedding_column, [query_vector]) as dist limit 100
说明:
from clickhouse_driver import Client client = Client(host="server", # server ip port=9000, # server tcp port user="test", # user password="password", # password connect_timeout=3000, send_receive_timeout=3000) # connect timeout
schema = f"""\ CREATE TABLE IF NOT EXISTS {database}.{table}( id UInt64, embedding Array(Float32), CONSTRAINT cons_vec_len CHECK length(embedding) = {dim}, INDEX vec_idx embedding TYPE HNSW('METRIC={metric.upper()}, DIM={dim}') ) ENGINE = {engine} ORDER BY id\ """ client.execute(schema)
# embeddings(list[list[float]]): list of embeddings # ids(list[int]): list of ids data = zip(ids, embeddings) values = [list(elem) for elem in data] client.execute(f'INSERT INTO {database}.{table} (id, embedding) VALUES', values)
# query: list[float] q_str = f""" SELECT id FROM {database}.{collection} ORDER BY {metric}Distance(embedding, {str(query)}) LIMIT {k} settings enable_new_ann=1, hnsw_ef_s={search_param["ef"]} """ results = client.execute(q_str) result_ids = [int(i) for (i,) in results]
建表时参数:
转换为 es 相关 score 的公式