OmidAfroozeh (Contributor)

Hi DuckDB team,

This PR implements a Unified String Dictionary (USD). The idea is taken from the paper Optimistically Compressed Hash Tables & Strings in the USSR and has been adapted and integrated into DuckDB.

Motivation

According to workload studies, strings are the most common data type in databases. Yet they are rather troublesome to process: they can be large and do not fit nicely into the memory hierarchy. Furthermore, typical operations such as applying a hash function, checking equality, and copying can be very expensive compared to numerical data, which require far fewer instructions. To speed up string processing, we implement a dictionary-like data structure that attempts to unify strings coming from different sources, such as per-block dictionaries and different columns. This dictionary enables compressed execution for various operations in DuckDB.

Overview

The USD is a per-query dictionary which contains the most valuable strings of a query. The goal of this data structure is to speed up hashing and equality checks on strings. Furthermore, strings within the USD only need to be materialized once in memory; subsequent copies can be limited to the string_t itself. It consists of two main components: a data region and a linear-probing hash table. Strings and their hash values are materialized in the data region. The hash table provides fast lookups for the strings within the USD and is only used during the insertion phase.
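A minimal sketch of these two components, under illustrative assumptions (the names, the 16-bit salt, and the bucket encoding are this sketch's choices, not DuckDB's actual code):

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

struct UnifiedStringDictionary {
    // One hash table bucket: the upper 16 bits hold a salt (extra hash bits),
    // the lower bits hold the offset of the string's first 8-byte chunk
    // in the data region.
    using Bucket = std::atomic<uint64_t>;
    static constexpr uint64_t EMPTY = 0;

    std::vector<Bucket> hash_table;    // linear-probing table, used only during insertion
    std::vector<uint64_t> data_region; // 8-byte chunks holding [hash][length, bytes...] per string
};
```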

Insert operation

The main operation that the USD supports is insertion. It takes a string_t as input and applies DuckDB's hash function to it. Using the first few bits of the hash value, it selects a slot in the hash table. If the slot is empty, it attempts to materialize the string and its hash value in the data region. Afterwards, it finalizes the hash table bucket with the index into the data region alongside extra hash bits (the salt). Future occurrences of the same string use the salt stored in the bucket as a cheap check that the values are likely the same before executing a full string comparison. If the strings match, the input string_t is simply set to point to the materialized string.

With regards to concurrency, atomic operations are used to implement a lock-free USD. Threads contend over hash table buckets, and the winner "dirties" the bucket using the salt and a sentinel value. It then reserves the number of 8-byte chunks it needs from the data region and materializes the string and hash there. The loser and other threads that observe the dirtied bucket use the salt to decide whether they should wait until the bucket value is finalized or continue probing.
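Continuing the sketch above, a simplified version of this claim protocol could look as follows; the entry encoding and helpers are assumptions, and string materialization and the final comparison are omitted:

```cpp
// Entry encoding: 16-bit salt in the top bits, 48-bit data-region offset below.
static constexpr uint64_t OFFSET_MASK = (uint64_t(1) << 48) - 1;
static constexpr uint64_t SENTINEL = OFFSET_MASK; // "dirty": claimed but not yet finalized

static uint64_t SaltOf(uint64_t hash_or_entry) { return hash_or_entry >> 48; }
static uint64_t OffsetOf(uint64_t entry) { return entry & OFFSET_MASK; }
static uint64_t MakeEntry(uint64_t salt, uint64_t offset) { return (salt << 48) | offset; }

// Returns the data-region offset at which the string is (or becomes) materialized.
// Note: a real implementation must ensure a finalized entry never equals EMPTY.
uint64_t Insert(UnifiedStringDictionary &usd, uint64_t hash /*, string_t &str */) {
    const uint64_t salt = SaltOf(hash);
    uint64_t slot = hash & (usd.hash_table.size() - 1); // table size is a power of two
    while (true) {
        uint64_t expected = UnifiedStringDictionary::EMPTY;
        // Winner path: claim the empty bucket by "dirtying" it with salt + sentinel.
        if (usd.hash_table[slot].compare_exchange_strong(expected, MakeEntry(salt, SENTINEL))) {
            // Reserve 8-byte chunks, materialize hash + string bytes (omitted here),
            // then finalize the bucket with the real offset.
            uint64_t offset = 0; // placeholder for the reserved offset
            usd.hash_table[slot].store(MakeEntry(salt, offset));
            return offset;
        }
        if (SaltOf(expected) == salt) {
            // Possibly the same string: wait until the winner finalizes the bucket,
            // then a full string comparison decides (comparison omitted).
            while (OffsetOf(expected = usd.hash_table[slot].load()) == SENTINEL) {
                // spin
            }
            return OffsetOf(expected);
        }
        slot = (slot + 1) & (usd.hash_table.size() - 1); // different salt: keep probing
    }
}
```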

Valuable strings

A string is considered valuable if it can be guaranteed that the performance gain from inserting it into the USD exceeds the cost of the insertion itself. Currently, the following strings are considered valuable:

  • Dictionary Vectors emitted from storage: It is expected that many of the dictionary-encoded strings are repeating values. Thus, by inserting the deduplicated unique values once, we gain performance for all the repeating instances.
  • Constant Vectors
  • Flat Vectors: We also try to insert the unique string values on the build side of a join if we can determine that the join is on a primary-foreign key relation.

How to recognize USD strings

In this iteration, pointer tagging is used to recognize a string that is backed by the USD: we set the highest bit of the pointer contained within the string_t. Further changes ensure that the tag is cleared before dereferencing the pointer in string_t, even in cases where it does not go through the string_t API. Note that because this relies on the pointer in string_t, only non-inlined strings are considered for insertion into the USD.
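As a rough illustration, tagging and untagging could look like the following (the helper names and the choice of bit 63 are assumptions of this sketch, which presumes that bit is unused in user-space pointers on the target platforms):

```cpp
#include <cstdint>

static constexpr uintptr_t USD_TAG = uintptr_t(1) << 63; // highest pointer bit

inline char *TagUSD(char *ptr) { // mark a string as USD-backed
    return reinterpret_cast<char *>(reinterpret_cast<uintptr_t>(ptr) | USD_TAG);
}
inline bool IsUSD(const char *ptr) { // is this string backed by the USD?
    return (reinterpret_cast<uintptr_t>(ptr) & USD_TAG) != 0;
}
inline char *UntagUSD(char *ptr) { // must run before every dereference
    return reinterpret_cast<char *>(reinterpret_cast<uintptr_t>(ptr) & ~USD_TAG);
}
```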

Unified String Dictionary Insertion operator

I decided to add a USD insertion operator that inserts strings emitted from storage on their way to the materializing operators. This operator makes it possible to control the flow of strings into the USD; for example, we try to avoid columns with a high domain cardinality. We also keep track of dictionary IDs in this operator.

Optimizer Rule

An optimizer rule is also implemented to make sure the USD is enabled only if the query can make use of it. To this end, we traverse the query plan depth-first and attempt to add the USD insertion operator once, between the leaf and the first "target" operator; a minimal sketch follows the list of target operators below.

Currently the following target operators are considered:

  • logical_aggregate_and_groupby and logical_distinct: These operators can potentially benefit from faster hashing, faster equality checks, and a single string allocation in the query intermediates. We do not activate this for single-column group bys, because TryAddCompressedGroups is a superior optimization compared to the USD there.
  • logical_comparison_join: Same reasons as above; additionally, it can be useful to put "payload" columns into the USD as well, to reduce memory pressure and perform less copying.
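A minimal sketch of the traversal, with hypothetical plan-node types standing in for DuckDB's LogicalOperator:

```cpp
#include <memory>
#include <vector>

struct PlanNode {
    bool is_target = false; // e.g. aggregate/distinct/comparison join, per the list above
    bool is_leaf = false;   // e.g. a scan emitting strings from storage
    std::vector<std::unique_ptr<PlanNode>> children;
};

// Depth-first traversal: once we are below a target operator, wrap the first
// qualifying leaf in a USD insertion operator and stop (insert at most once).
bool TryInsertUSDOperator(std::unique_ptr<PlanNode> &node, bool below_target) {
    below_target = below_target || node->is_target;
    if (node->is_leaf && below_target) {
        auto usd_op = std::make_unique<PlanNode>(); // the USD insertion operator
        usd_op->children.push_back(std::move(node));
        node = std::move(usd_op);
        return true;
    }
    for (auto &child : node->children) {
        if (TryInsertUSDOperator(child, below_target)) {
            return true;
        }
    }
    return false;
}
```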

Lifetime

The USD is created by the optimizer when it sees a suitable query, and it lives throughout the query execution. Once the query has finished, it is moved from the ClientContext into the QueryResult class, because the result may rely on the strings materialized in the USD. It is destroyed along with the QueryResult.

Capacity

The original paper chose 512 kB as the static size of the data region so that it stays L2-cache resident. With that size, the hash table only needs 16 bits to index the 64K 8-byte slots of the data region (512 kB / 8 B = 2^16 slots). We decided to allow for larger USDs as well: some workloads may benefit far more from the faster string operations than they lose to cache misses.

The baseline USD size is still 512 kB, but it is possible to initialize the USD with a usd_sf parameter. We use this power-of-two parameter to calculate the extra bits required to index into the data region.
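For example, assuming usd_sf simply multiplies the 512 kB baseline, usd_sf = 4 yields a 2 MB data region with 2^18 8-byte slots, so 16 + log2(4) = 18 bits are needed to index it.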

System modification required

  • Hash function: the hash function for strings is modified to check whether a non-inlined string is backed by the USD. If so, it can access the stored hash value directly with simple pointer arithmetic and a dereference (see the sketch below).
  • USD-backed strings as inlined strings: To prevent the system from copying USD strings, several changes are made to TupleDataCollection, ColumnDataCollection, and various row operations. The goal is to treat USD strings like inlined strings.
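A sketch of the hash fast path, assuming the stored hash sits in the 8 bytes directly before the string bytes (this layout and ComputeHash are assumptions; the tag helpers are reused from the tagging sketch above):

```cpp
#include <cstdint>
#include <cstring>

uint64_t ComputeHash(const char *ptr, uint64_t len); // hypothetical: the regular string hash

uint64_t HashString(const char *ptr, uint64_t len) {
    if (IsUSD(ptr)) { // pointer-tag check
        const char *data = UntagUSD(const_cast<char *>(ptr));
        uint64_t hash;
        std::memcpy(&hash, data - sizeof(uint64_t), sizeof(hash)); // read the stored hash
        return hash;
    }
    return ComputeHash(ptr, len); // non-USD strings: fall back to the normal hash
}
```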

Performance results

Experiments were run with 8 threads on a MacBook Air M2 with 16 GB of memory. Each experiment was run 5 times and the average run time is reported.

Aggregation experiments

Synthetic data: 10 million rows, 100 unique strings in total per column
Query: Simple double column group by

```sql
select str1, str2 from varchars group by str1, str2;
```

(benchmark chart: double-column group-by runtimes, Main vs. USD)

Join experiments

  • Join on integer keys with one string payload column each: The goal of this experiment is to show that even payload columns in a join can benefit from being in the USD. The main benefit here is that less copying and memory allocation is required.
    Synthetic data: 10 million rows, 100 unique strings of length 32 per string column.
    Query:
```sql
select * from varchars_a join varchars_b on varchars_a.id = varchars_b.id;
```

Result:

| Main (s) | USD (s) |
|-:|-:|
| 0.265 | 0.187 |
  • Join on string columns: The customer and nation tables in TPC-H are modified to use UUIDs stored as varchars (36 characters) in place of their integer keys. The experiment was run on scale factor 30.
    Query:
```sql
select * from nation JOIN customer on nation.n_nationkey_uuid_str = customer.c_nationkey_uuid_str;
```

Result:

| Main (s) | USD (s) |
|-:|-:|
| 0.195 | 0.158 |

TPC-H

The only query that gets faster with this optimization is query 16, as it contains a three-column group by in which one of the columns, p_type, has a domain cardinality of 150.
Result:

| Scale factor | Main (s) | USD (s) |
|-:|-:|-:|
| 1 | 0.0147 | 0.0132 |
| 10 | 0.0890 | 0.0708 |
| 30 | 0.251 | 0.201 |

TODOs

There are a few issues and loose ends that I am aware of with this PR that I hope to address soon.

  • The commented-out tests fail in debug mode as a result of the new operator.
  • The filter operator could be added as a target operator.
  • It is also possible to unify and enhance the string_t API so that the same logic is not duplicated in different places. This could also be a separate PR.
  • This optimization should technically lead to less spilling, as it reduces memory pressure for materializing operators. However, I have not yet been able to properly verify this. For example, I have made some changes in row_external.cpp, but I am not sure whether these are triggered when a join or aggregation goes out of core.
  • Currently, in the USD insertion operator, we assess whether a column comes from a high-cardinality domain by inserting the first few dictionaries and monitoring the ratio of unique values within them (a minimal sketch of this heuristic follows this list). Perhaps more can be done with other heuristics, or by tuning the number of dictionaries in the grace period, to avoid performance regressions. For now, I have been extremely conservative with the number of unique values inserted per column.
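A minimal sketch of that grace-period heuristic (thresholds and names are illustrative assumptions):

```cpp
#include <cstdint>

struct ColumnUSDState {
    uint64_t dictionaries_seen = 0;
    uint64_t values_seen = 0;
    uint64_t unique_inserted = 0;
    bool give_up = false;
};

// Observe a few dictionaries first; if too many of their values are unique,
// the column's domain is likely high-cardinality and we stop inserting.
bool ShouldKeepInserting(ColumnUSDState &state, uint64_t dict_size, uint64_t unique_in_dict) {
    constexpr uint64_t GRACE_DICTIONARIES = 4; // dictionaries observed in the grace period
    constexpr double MAX_UNIQUE_RATIO = 0.5;   // beyond this, treat as high cardinality
    if (state.give_up) {
        return false;
    }
    state.dictionaries_seen++;
    state.values_seen += dict_size;
    state.unique_inserted += unique_in_dict;
    if (state.dictionaries_seen >= GRACE_DICTIONARIES) {
        double ratio = double(state.unique_inserted) / double(state.values_seen);
        state.give_up = ratio > MAX_UNIQUE_RATIO;
    }
    return !state.give_up;
}
```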

Final note

There is quite a vast design space for implementing and integrating this feature in DuckDB. Although this is the way I chose to implement it, there are various things that could be changed and improved. I am excited to hear what you think about this, and I am open to any suggestions and requests.

Thanks!

@carlopi (Contributor) commented Jul 10, 2025

Brief comment on my side: I think this is cool work, both allowing a payload on string_t and being able to unify strings, but the impact on the codebase is significant, and I am unsure whether it is justified given the added complexity and the limits it adds on future implementation choices.

Also unsure whether this would be easier to implement once additional infrastructure on top of string_t is offered, like with #17992 (see the String class).

And a question: I am not sure what happens if one were to modify a string_t backed by the dictionary; how is that handled (say two different string_t values point to the same dictionary data, and one of the strings is modified)? Or is that forbidden, and if so, how is that enforced at the code level?

@OmidAfroozeh (Contributor, Author) commented Jul 10, 2025

> Brief comment on my side: I think this is cool work, both allowing a payload on string_t and being able to unify strings, but the impact on the codebase is significant, and I am unsure whether it is justified given the added complexity and the limits it adds on future implementation choices.

Thanks for the comment, glad you liked it! Indeed, it does add some complexity to the codebase. I am unsure about the limits on future work, unless you are planning on using the upper bits of the pointer for some other purpose.

> Also unsure whether this would be easier to implement once additional infrastructure on top of string_t is offered, like with #17992 (see the String class).

Thanks for pointing that out. I think, for example, we could rely on an allocator for the USD's buffer allocation, since right now it gets memory directly from the OS. But other than that, I am not sure how it would simplify the implementation.

> And a question: I am not sure what happens if one were to modify a string_t backed by the dictionary; how is that handled (say two different string_t values point to the same dictionary data, and one of the strings is modified)? Or is that forbidden, and if so, how is that enforced at the code level?

Indeed, if you were to modify the content of the string in place, it would affect all the other instances pointing to it. However, I don't think this is done anywhere. As far as I understand the codebase, string_t is an immutable data structure with regards to its prefix and length, and I don't believe the content of the string itself on the heap is ever modified in place. In most cases, such as in ColumnDataCollection and TupleDataCollection, a full copy of the string is made and a new string_t is constructed that points to the newly copied string.

But let's say, for instance, there is a use case that needs to modify the string_t pointer in place (set it to point elsewhere). That is possible, and no harm is done; it only causes the modified string_t to lose the benefits of being in the USD.

A great benefit of this data structure is its flexibility, as it does not need to be enforced on all strings. Strings within it gain all the benefits, and the other ones are untouched.

@carlopi (Contributor) commented Jul 10, 2025

I meant via the GetDataWritable() and Finalize() interface. I don't know if there is or might be a usage for that; I might be pushing for a corner case.

@OmidAfroozeh (Contributor, Author)

I see, that's a fair point indeed. I am not sure if we can really enforce that this does not happen to USD strings. Like you mentioned, it is a corner case: essentially, you would want to modify a string in place while also not affecting its other instances. Perhaps just extra care is required to make a full copy of the string in those situations.

@hannes (Member) commented Aug 11, 2025

Thanks for filing this, but regrettably we can't merge this PR. It proposes a fundamental design change with far-reaching and unknown consequences for current and future development of DuckDB.

Please also refer to our contribution guidelines, specifically

> Discuss your intended changes with the core team on Github

@hannes closed this Aug 11, 2025
hannes added a commit that referenced this pull request Aug 13, 2025
This PR implements caching for hashes of string dictionaries. This avoids hashing the same values twice, which can be expensive, especially for long strings.

## Implementation
When creating a dictionary `Vector` from a storage dictionary using
`Vector::Dictionary`, we now create an uninitialized `VectorChildBuffer`
to store the hashes. If `VectorOperations::Hash` is called on the
dictionary `Vector`, we check if this buffer is there, and if it is
initialized. If it's uninitialized, we hash the storage dictionary, and
cache the hashes in the buffer. If it's initialized, we simply grab the
cached hashes.
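
A simplified sketch of this caching pattern, with stand-in types (DuckDB's actual Vector and VectorChildBuffer classes differ):

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <vector>

struct DictionaryVectorSketch {
    std::vector<std::string> dictionary;         // the storage dictionary
    std::optional<std::vector<uint64_t>> hashes; // created empty, filled on first use
};

const std::vector<uint64_t> &GetDictionaryHashes(DictionaryVectorSketch &vec) {
    if (!vec.hashes) {
        // First Hash() call on this dictionary: hash every entry once and cache.
        std::vector<uint64_t> computed;
        computed.reserve(vec.dictionary.size());
        for (const auto &str : vec.dictionary) {
            computed.push_back(std::hash<std::string>{}(str)); // stand-in hash function
        }
        vec.hashes = std::move(computed);
    }
    return *vec.hashes; // subsequent calls simply grab the cached hashes
}
```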

This essentially gets us half the benefit of the (now closed) PR #18184
for string dictionaries that are used as keys to aggregation (as shown
in the benchmark below). However, this PR has a tiny code footprint, and
also applies to dictionaries with higher unique counts (for example, it
speeds up Q34 of ClickBench by ~5%).

## Benchmark
```sql
-- generate similar data to PR #18184
create or replace table test as
select
  format('{:a>256}', cast(range % 100 as varchar)) a256, a256 b256,
  a256[-128:] a128, a128 b128,
  a256[-64:] a64, a64 b64,
  a256[-32:] a32, a32 b32,
  a256[-16:] a16, a16 b16
from range(10_000_000);
-- run each query 3 times
.mode trash
.timer on
select a16, b16 from test group by all;select a16, b16 from test group by all;select a16, b16 from test group by all;
select a32, b32 from test group by all;select a32, b32 from test group by all;select a32, b32 from test group by all;
select a64, b64 from test group by all;select a64, b64 from test group by all;select a64, b64 from test group by all;
select a128, b128 from test group by all;select a128, b128 from test group by all;select a128, b128 from test group by all;
select a256, b256 from test group by all;select a256, b256 from test group by all;select a256, b256 from test group by all;
```

Average of 3 runs:

| String Length | Current [s] | This PR [s] | Speedup [x] |
|-:|-:|-:|-:|
| 16 | 0.031 | **0.024** | 1.29x |
| 32 | 0.028 | **0.020** | 1.40x |
| 64 | 0.034 | **0.024** | 1.42x |
| 128 | 0.051 | **0.028** | 1.82x |
| 256 | 0.088 | **0.039** | 2.26x |

I have no idea why the strings of length 32 are faster (for both Current
and This PR).

Anyways, we get a 2.26x speedup for the longest strings. The benefit
comes from caching the hashes. PR #18184 has a >5x speedup, as it also
benefits from using pointer comparisons instead of full string
comparisons.