How superlinked concatenates different vectors #91
Replies: 3 comments 4 replies
-
Hi @aman-gupta-doc - you are right: Superlinked normalizes and aggregates vectors in different ways, and when it comes to combining vectors from different Spaces, this happens through concatenation. We are working on benchmarks that we will be able to share more broadly to demonstrate the merit of this approach. In the meantime, it is easy to see why it helps: if your data objects have numerical properties that you want to bring into the embedding itself (i.e. not just use as metadata for filtering), then stringifying the number and encoding it together with the surrounding text using a text encoder is inferior to encoding the number directly with a dedicated numerical encoder. You can read a bit more about this here: https://docs.superlinked.com/getting-started/why-superlinked Let me know if you have any questions!
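A minimal sketch of the idea, assuming a hypothetical numeric encoder (the function names and the circle-based encoding here are illustrative, not Superlinked's actual implementation): a number is mapped into its own small embedding so that nearby values get similar vectors, and each Space's vector is normalized and weighted before concatenation.

```python
import numpy as np

def encode_number(value, lo, hi):
    # Hypothetical dedicated numeric encoder: scale the value into [0, 1]
    # and place it on a quarter of the unit circle, so that numerically
    # close values produce vectors with high cosine similarity.
    t = (value - lo) / (hi - lo)
    return np.array([np.cos(t * np.pi / 2), np.sin(t * np.pi / 2)])

def combine(text_vec, num_vec, w_text=1.0, w_num=1.0):
    # L2-normalize each part, apply per-space weights, then concatenate
    # into a single vector that a vector database can index.
    t = w_text * text_vec / np.linalg.norm(text_vec)
    n = w_num * num_vec / np.linalg.norm(num_vec)
    return np.concatenate([t, n])
```

With this encoder, `encode_number(10, 0, 100)` is much closer to `encode_number(12, 0, 100)` than to `encode_number(90, 0, 100)`, which is exactly the behavior a stringified "10" fed through a text encoder cannot guarantee.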
-
Thank you, @svonava, for the detailed explanation and the link to the documentation! I find the approach of combining vectors through concatenation and using dedicated numerical encoders particularly intriguing. I’m looking forward to the benchmarks you’re working on—it would be great to see how this methodology performs in practice.
-
Hi, like @aman-gupta-doc, I was initially a bit confused. Coming from a search engineering background, I would have approached this differently: calculating cosine similarity on separate vectors rather than concatenated ones, and potentially using metadata filters to speed up the search process. I tried out one of your notebooks (the Vector Sampler notebook).

After giving it more thought, your approach makes sense—cosine similarity would effectively cancel out attributes with a value of 0, leading to the same result as long as some vector re-weighting is applied. That said, I’d be very interested to see some benchmarks, perhaps with comparisons to a text-to-SQL approach for specific use cases. I’m also curious about embedding binary filters: is it genuinely worth embedding categorical or numerical data instead of relying on metadata filters? I’d need to dive deeper into this.

One primary concern I see with Superlinked, however, is the size of concatenated vectors. This approach could eventually hit the vector size limit in vector databases, especially with the latest embedding models that already have dimensions in the 1000+ range (MongoDB, for instance, seems to have a limit of around 4096, if I’m not mistaken).
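The equivalence hinted at above can be checked directly: because the dot product distributes over a concatenation, scoring one concatenated vector is the same as scoring each Space's vector separately and summing with per-space weights. A small numpy sketch (the 384/4 dimensions and 0.7/0.3 weights are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

# Two hypothetical Spaces: a text embedding and a small numeric embedding.
q_text, d_text = unit(rng.normal(size=384)), unit(rng.normal(size=384))
q_num,  d_num  = unit(rng.normal(size=4)),   unit(rng.normal(size=4))
w_text, w_num = 0.7, 0.3  # per-space re-weighting

# The query carries the weights; the stored document is the plain concat.
q = np.concatenate([w_text * q_text, w_num * q_num])
d = np.concatenate([d_text, d_num])

# The dot product over the concatenation decomposes into a weighted sum
# of per-space dot products, so both scoring schemes rank identically.
score_concat = q @ d
score_separate = w_text * (q_text @ d_text) + w_num * (q_num @ d_num)
```

`score_concat` and `score_separate` agree to floating-point precision, which is why re-weighting at query time over a single concatenated index works.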
-
Hi,
I’ve explored some of Superlinked’s codebase and noticed that, when inserting data, it creates a concatenated vector whose dimension equals the sum of the individual vectors’ dimensions. Could you clarify how this process works? Does it simply append all the vectors, append and then normalize them, or use some other algorithm?
Additionally, are there any research papers or studies that discuss this approach? Specifically, I’d like to know whether there is research showing improved performance or other benefits from using concatenated vectors. If so, I’d appreciate it if you could share those references.
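To make the two options in the question concrete, here is a sketch of both schemes (function names are hypothetical, and this is not a claim about which one Superlinked actually uses):

```python
import numpy as np

def concat_plain(vecs):
    # Option 1: simply append the raw vectors end to end.
    return np.concatenate(vecs)

def concat_normalized(vecs, weights=None):
    # Option 2: L2-normalize each part (optionally weighting it) before
    # appending, so no Space dominates purely through raw magnitude.
    weights = weights if weights is not None else [1.0] * len(vecs)
    return np.concatenate(
        [w * v / np.linalg.norm(v) for w, v in zip(weights, vecs)]
    )
```

The difference matters when the parts have very different scales: with plain appending, a part with a large norm dominates any dot-product score, whereas per-part normalization keeps every Space's contribution bounded by its weight.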