Replies: 19 comments 13 replies
-
Hi @Ryanhya, in response to your question:
-
Yes.
-
@xieydd @usamoi :) Thanks! A few more questions:
-
The algorithm works like this:
Therefore, the smaller the estimated lower bounds, the more distances the algorithm needs to compute to be confident that a certain vector is the closest to the query vector, and thus the slower the algorithm becomes. In the document, we refer to this algorithm as "rerank", which may have caused some confusion.
Centroids do shift due to updated vectors. So if you do a full table update, you'd better execute
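The pruning loop described above can be sketched as follows (a minimal top-1 version; `estimate_lower_bound` is a hypothetical stand-in for the RaBitQ estimate computed from the quantized vectors, not VectorChord's actual implementation):

```python
import numpy as np

def search_with_rerank(query, vectors, estimate_lower_bound):
    """Prune with cheap lower bounds; compute exact distances only when needed.

    `estimate_lower_bound(query, i)` must return a value <= the true distance
    between `query` and `vectors[i]` (standing in for the RaBitQ estimate
    derived from the quantized vectors).
    """
    best_dist, best_idx = np.inf, -1
    exact_computations = 0
    for i in range(len(vectors)):
        # A lower bound at or above the current best distance proves this
        # vector cannot beat the current nearest neighbor: skip the exact
        # full-precision distance computation entirely.
        if estimate_lower_bound(query, i) >= best_dist:
            continue
        d = float(np.linalg.norm(query - vectors[i]))  # full-precision distance
        exact_computations += 1
        if d < best_dist:
            best_dist, best_idx = d, i
    return best_idx, best_dist, exact_computations
```

This makes the trade-off visible: the looser (smaller) the lower bounds, the more often the `continue` branch is skipped and the more exact distances must be computed.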
-
So the lower bound is estimated by the RaBitQ algorithm using quantized vectors, and the distance is calculated using the full-precision raw vectors?
What about an incremental update that only influences a limited number of rows?
-
Yes.
No.
It doesn't matter. Centroids only shift after a significant portion of the entire table has been updated, and the statistical properties of the new data differ from those of the old data.
-
I see. Thanks for your patient reply! My problem is solved :).
-
Hi! I recently conducted a simple experiment comparing against pgvector, using the SIFT10M dataset. All the indices were built with the default configurations or the settings recommended in the documents (for example, for the HNSW index). Although I tried different values for the search parameters, the overall results are as follows (the VectorChord line is obtained by varying a search parameter). It seems that the results are not consistent with the claims in the document, so I wonder about the parameter settings used in your blog. I would appreciate any more insightful advice on improving the performance :).
-
SIFT's dimension is 128, which is too small for typical workloads. Also, for an in-memory benchmark, don't modify epsilon; keep it at 1.9 and just increase nprobes.
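For reference, the session settings involved would look roughly like this (a sketch assuming the `vchordrq.epsilon` and `vchordrq.probes` GUC names from the VectorChord documentation; verify against your installed version):

```sql
-- Keep epsilon at its default for in-memory benchmarks.
SET vchordrq.epsilon = 1.9;
-- Trade QPS for recall by probing more clusters instead.
SET vchordrq.probes = 100;
```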
-
Also, can you post the full CREATE INDEX config?
-
The SQL used is: Thanks for your advice :). I will try another high-dimensional dataset and test the performance, and will report the latest results here.
-
The best approach is to use your real-world data instead of any existing dataset, and to create your own query set and labels. The results will vary depending on whether you use top 10 or top 100, the data distribution, the query distribution (whether it matches the data or is out-of-distribution), the target recall, filter conditions, and other factors. We have indeed observed that VectorChord is much faster than HNSW on certain datasets, while on others the performance is completely the opposite, so existing public datasets can't be a direct reference for your dataset.
-
Thanks :)! I will give it a try.
-
@VoVAllen Hi! Considering convenience and privacy, I still used two public datasets to evaluate the performance:
The overall result is as follows. At first I only used GIST1M with the following SQL to create the index (the default configurations were still used for the other indices): I then followed your advice of increasing nprobes. The SQL used is: Although the dimension is relatively small, VectorChord begins to show its advantages when the recall target is at least 0.8. What makes me curious is that the VectorChord line on GIST1M is almost flat. At first I thought this was caused by an imbalanced cluster distribution, but I abandoned that assumption because the recall increased quickly with only a slight decrease in QPS, which does not align with my empirical knowledge. Do you have any ideas about this? By the way, which k-means algorithm is used in VectorChord? Is it the classic Lloyd's version?
-
@Ryanhya It's normal Lloyd, and something must be wrong with the performance here, because we've tested against the GIST dataset in https://blog.vectorchord.ai/vectorchord-store-400k-vectors-for-1-in-postgresql. The performance is as shown, and it's consistent with your experiment for pgvector. Can you try making kmeans_iterations larger, to 25? And also do
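For context, "normal Lloyd" is the classic alternate-assign-and-update iteration. A minimal NumPy sketch (illustrative only, not VectorChord's actual implementation; the `lloyd_kmeans` name and `iterations=25` default are chosen here to mirror the `kmeans_iterations` suggestion above):

```python
import numpy as np

def lloyd_kmeans(points, k, iterations=25, seed=0):
    """Classic Lloyd's k-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct input points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

More iterations give the centroids more chances to settle, which is why raising the iteration count can change cluster quality (and hence query performance) on a difficult dataset.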
-
I did, and built another new index, but the trend seemed to be the same.
The log of the first query (omitting the raw query vector data) is shown as: The other 999 queries' logs are almost the same as the above, except for the raw query data. Meanwhile, I recorded the latency distribution of the 1000 queries when
-
@Ryanhya Can you show the full EXPLAIN result, with
-
The full logs of the EXPLAIN result are as follows:
-
Can you try DROP INDEX gist1m_vchord_ivf_2000_25;
CREATE INDEX gist1m_vchord_ivf_2000_25 ON gist1m USING vchordrq (embedding vector_l2_ops) WITH (options = $$
residual_quantization = true
[build.internal]
lists = [2000]
build_threads = 16
$$);
-
After reading the documents, I have the following questions:
With lists = [64, 4096] and probes = [16, 1024], does the search first get the top 16 clusters among the 64 clusters of the top layer, and then get the top 1024 clusters within the regions covered by the retrieved 16 clusters?
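If that reading is right, the two-level probing could be sketched like this (illustrative only, under the question's own interpretation of lists/probes; `two_level_probe` and `fine_parent` are hypothetical names, not VectorChord's actual code):

```python
import numpy as np

def two_level_probe(query, coarse_centroids, fine_centroids, fine_parent, probes):
    """Two-level IVF probing with probes = (n_coarse, n_fine).

    `fine_parent[j]` is the index of the coarse cell that fine centroid j
    belongs to. Returns the indices of the fine cells to scan.
    """
    n_coarse, n_fine = probes
    # Level 1: the n_coarse coarse centroids closest to the query.
    d1 = np.linalg.norm(coarse_centroids - query, axis=1)
    top_coarse = set(np.argsort(d1)[:n_coarse].tolist())
    # Level 2: restrict to fine centroids under the selected coarse cells,
    # then keep the n_fine closest of those.
    candidates = [j for j in range(len(fine_centroids)) if fine_parent[j] in top_coarse]
    d2 = np.linalg.norm(fine_centroids[candidates] - query, axis=1)
    order = np.argsort(d2)[:n_fine]
    return [candidates[j] for j in order]
```

With probes = (16, 1024) this would select 16 of the 64 top-layer clusters, then the 1024 closest second-layer clusters among those 16 regions.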