Replies: 19 comments 13 replies
-
Hi @Ryanhya, in response to your question:
-
Yes.
-
@xieydd @usamoi :) Thanks! A few more questions:
-
The algorithm works like this:
Therefore, the smaller the estimated lower bounds, the more distances the algorithm needs to compute to be confident that a certain vector is the closest to the query vector, and thus the slower the algorithm becomes. In the document, we refer to this algorithm as "rerank", which may have caused some confusion.
Centroids do shift due to updated vectors. So if you do a full table update, you'd better execute
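The pruning loop described above can be sketched as follows (a minimal top-1 version; `estimate_lower_bound` is a hypothetical stand-in for the RaBitQ estimate computed from the quantized vectors, not VectorChord's actual implementation):

```python
import numpy as np

def search_with_rerank(query, vectors, estimate_lower_bound):
    """Prune with cheap lower bounds; compute exact distances only when needed.

    `estimate_lower_bound(query, i)` must return a value <= the true distance
    between `query` and `vectors[i]` (standing in for the RaBitQ estimate
    derived from the quantized vectors).
    """
    best_dist, best_idx = np.inf, -1
    exact_computations = 0
    for i in range(len(vectors)):
        # A lower bound at or above the current best distance proves this
        # vector cannot beat the current nearest neighbor: skip the exact
        # full-precision distance computation entirely.
        if estimate_lower_bound(query, i) >= best_dist:
            continue
        d = float(np.linalg.norm(query - vectors[i]))  # full-precision distance
        exact_computations += 1
        if d < best_dist:
            best_dist, best_idx = d, i
    return best_idx, best_dist, exact_computations
```

This makes the trade-off visible: the looser (smaller) the lower bounds, the more often the `continue` branch is skipped and the more exact distances must be computed.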
-
So the lower bound is estimated by the RaBitQ algorithm using quantized vectors, and the distance is calculated using the full-precision raw vectors?
What about an incremental update that only influences a limited number of rows?
-
Yes.
No.
It doesn't matter. Centroids only shift after a significant portion of the entire table has been updated, and the statistical properties of the new data differ from those of the old data.
-
I see. Thanks for your patient reply! My problem is solved :).
-
Hi! I recently conducted a simple experiment comparing against pgvector, using the SIFT10M dataset. All the indices were built with the default configurations or the settings recommended in the documents (for example, for the HNSW index). Although I tried different values for the search parameters, the overall results are as follows (the VectorChord line is obtained by varying a search parameter). It seems that the results are not consistent with the claims in the document, so I wonder about the parameter settings used in your blog. I would appreciate any more insightful advice on improving the performance :).
-
SIFT's dimension is 128, which is too small for typical workloads. Also, for an in-memory benchmark, don't modify epsilon; keep it at 1.9 and just increase nprobes.
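For reference, the session settings involved would look roughly like this (a sketch assuming the `vchordrq.epsilon` and `vchordrq.probes` GUC names from the VectorChord documentation; verify against your installed version):

```sql
-- Keep epsilon at its default for in-memory benchmarks.
SET vchordrq.epsilon = 1.9;
-- Trade QPS for recall by probing more clusters instead.
SET vchordrq.probes = 100;
```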
-
Also, can you post the full CREATE INDEX config?
-
The SQL used is: Thanks for your advice :). I will try another high-dimensional dataset and test the performance, and will report the latest results here.
-
The best approach is to use your real-world data instead of any existing dataset, and to create your own query set and labels. The results will vary depending on whether you use top 10 or top 100, the data distribution, the query distribution (whether it matches the data or is out-of-distribution), the target recall, filter conditions, and other factors. We have indeed observed that VectorChord is much faster than HNSW on certain datasets, while on others the performance is completely the opposite, so existing public datasets can't be a direct reference for your dataset.
-
Thanks :)! I will give it a try.
-
@VoVAllen Hi! Considering convenience and privacy, I still used two public datasets to evaluate the performance:
The overall result is as follows. At first I only used GIST1M with the following SQL to create the index (the default configurations were still used for the other indices): I then followed your advice of increasing nprobes. The SQL used is: Although the dimension is relatively small, VectorChord begins to show its advantages when the recall target is at least 0.8. What makes me curious is that the VectorChord line on GIST1M is almost flat. At first I thought this was caused by an imbalanced cluster distribution, but I abandoned that assumption because the recall increased quickly with only a slight decrease in QPS, which does not align with my empirical knowledge. Do you have any ideas about this? By the way, which k-means algorithm is used in VectorChord? Is it the classic Lloyd's version?
-
@Ryanhya It's normal Lloyd, and something must be wrong with the performance here, because we've tested against the GIST dataset in https://blog.vectorchord.ai/vectorchord-store-400k-vectors-for-1-in-postgresql. The performance is as shown, and it's consistent with your experiment for pgvector. Can you try making kmeans_iterations larger, to 25? And also do
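For context, "normal Lloyd" is the classic alternate-assign-and-update iteration. A minimal NumPy sketch (illustrative only, not VectorChord's actual implementation; the `lloyd_kmeans` name and `iterations=25` default are chosen here to mirror the `kmeans_iterations` suggestion above):

```python
import numpy as np

def lloyd_kmeans(points, k, iterations=25, seed=0):
    """Classic Lloyd's k-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct input points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

More iterations give the centroids more chances to settle, which is why raising the iteration count can change cluster quality (and hence query performance) on a difficult dataset.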
-
I did, and built another new index, but the trend seemed to be the same.
The log of the first query (omitting the raw query vector data) is shown as: The other 999 queries' logs are almost the same as the above, except for the raw query data. Meanwhile, I recorded the latency distribution of the 1000 queries when
-
@Ryanhya Can you show the full EXPLAIN result, with
-
The full logs of the EXPLAIN result are as follows:
-
Can you try DROP INDEX gist1m_vchord_ivf_2000_25;
CREATE INDEX gist1m_vchord_ivf_2000_25 ON gist1m USING vchordrq (embedding vector_l2_ops) WITH (options = $$
residual_quantization = true
[build.internal]
lists = [2000]
build_threads = 16
$$);
-
After reading the documents, I have the following questions:
With lists = [64, 4096] and probes = [16, 1024], does the search first get the top 16 clusters among the 64 clusters of the top layer, and then get the top 1024 clusters within the regions covered by the retrieved 16 clusters?
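If that reading is right, the two-level probing could be sketched like this (illustrative only, under the question's own interpretation of lists/probes; `two_level_probe` and `fine_parent` are hypothetical names, not VectorChord's actual code):

```python
import numpy as np

def two_level_probe(query, coarse_centroids, fine_centroids, fine_parent, probes):
    """Two-level IVF probing with probes = (n_coarse, n_fine).

    `fine_parent[j]` is the index of the coarse cell that fine centroid j
    belongs to. Returns the indices of the fine cells to scan.
    """
    n_coarse, n_fine = probes
    # Level 1: the n_coarse coarse centroids closest to the query.
    d1 = np.linalg.norm(coarse_centroids - query, axis=1)
    top_coarse = set(np.argsort(d1)[:n_coarse].tolist())
    # Level 2: restrict to fine centroids under the selected coarse cells,
    # then keep the n_fine closest of those.
    candidates = [j for j in range(len(fine_centroids)) if fine_parent[j] in top_coarse]
    d2 = np.linalg.norm(fine_centroids[candidates] - query, axis=1)
    order = np.argsort(d2)[:n_fine]
    return [candidates[j] for j in order]
```

With probes = (16, 1024) this would select 16 of the 64 top-layer clusters, then the 1024 closest second-layer clusters among those 16 regions.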