Unified String Dictionary #18184
Conversation
Brief comment on my side: I think this is cool work, both allowing a payload to … Also unsure if this would be made easier to implement with additional infrastructure on top of … And a question: I am not sure what happens if one were to modify a `string_t` backed by the dictionary, and how that's handled (say that two different …).
Thanks for the comment, glad you liked it! Indeed, it does add some complexity to the codebase. I am unsure about the limits for the future, unless you are planning on using the upper bits in the pointer for some other purposes.
Thanks for pointing that out. I think, for example, we could rely on an allocator for the buffer allocation of the USD, since right now it gets its memory from the OS. But other than that, I am not sure how it would simplify the implementation.
Indeed, if you were to modify the content of the string in place, it would affect all the other instances that point to it. However, I don't think that is ever done. As I understand the codebase, `string_t` is an immutable data structure with regard to its prefix and length, and I don't believe the content of the string itself on the heap is ever modified in place either. In most cases, like in …. But for instance, let's say there is a use case that might need to modify the …

A great benefit of this data structure is its flexibility, as it does not need to be enforced on all strings. Strings within it gain all the benefits, and the other ones are untouched.
I meant via: …
I see. Indeed, that's a fair point. I am not sure if we can really enforce that this does not happen to USD strings. Like you mentioned, it is a corner case: essentially, you would want to modify a string in place while not affecting the other instances of it. Perhaps just extra care is required to make a full copy of the string in these situations.
Thanks for filing this, but regrettably we can't merge this PR. It proposes a fundamental design change with far-reaching and unknown consequences for current and future development of DuckDB. Please also refer to our contribution guidelines, specifically …
This PR implements caching for hashes of string dictionaries. This avoids hashing the same values twice, which can be expensive, especially for long strings.

## Implementation

When creating a dictionary `Vector` from a storage dictionary using `Vector::Dictionary`, we now create an uninitialized `VectorChildBuffer` to store the hashes. If `VectorOperations::Hash` is called on the dictionary `Vector`, we check if this buffer is there, and if it is initialized. If it's uninitialized, we hash the storage dictionary and cache the hashes in the buffer. If it's initialized, we simply grab the cached hashes.

This essentially gets us half the benefit of the (now closed) PR #18184 for string dictionaries that are used as keys to aggregation (as shown in the benchmark below). However, this PR has a tiny code footprint, and also applies to dictionaries with higher unique counts (for example, it speeds up Q34 of ClickBench by ~5%).

## Benchmark

```sql
-- generate similar data to PR #18184
create or replace table test as
select
    format('{:a>256}', cast(range % 100 as varchar)) a256, a256 b256,
    a256[-128:] a128, a128 b128,
    a256[-64:] a64, a64 b64,
    a256[-32:] a32, a32 b32,
    a256[-16:] a16, a16 b16
from range(10_000_000);

-- run each query 3 times
.mode trash
.timer on
select a16, b16 from test group by all;select a16, b16 from test group by all;select a16, b16 from test group by all;
select a32, b32 from test group by all;select a32, b32 from test group by all;select a32, b32 from test group by all;
select a64, b64 from test group by all;select a64, b64 from test group by all;select a64, b64 from test group by all;
select a128, b128 from test group by all;select a128, b128 from test group by all;select a128, b128 from test group by all;
select a256, b256 from test group by all;select a256, b256 from test group by all;select a256, b256 from test group by all;
```

Average of 3 runs:

| String Length | Current [s] | This PR [s] | Speedup [x] |
|-:|-:|-:|-:|
| 16 | 0.031 | **0.024** | 1.29x |
| 32 | 0.028 | **0.020** | 1.40x |
| 64 | 0.034 | **0.024** | 1.42x |
| 128 | 0.051 | **0.028** | 1.82x |
| 256 | 0.088 | **0.039** | 2.26x |

I have no idea why the strings of length 32 are faster (for both Current and This PR). Anyway, we get a 2.26x speedup for the longest strings. The benefit comes from caching the hashes. PR #18184 has a >5x speedup, as it also benefits from using pointer comparisons instead of full string comparisons.
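For illustration, here is a minimal sketch of the lazy hash-caching pattern described above. It uses hypothetical stand-in types (`StringDictionary`, `HashCache`) and `std::hash` as a placeholder; DuckDB's actual `Vector`/`VectorChildBuffer` internals and hash function differ.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for a dictionary vector and its auxiliary hash
// buffer; these are not DuckDB's actual Vector/VectorChildBuffer types.
struct HashCache {
	bool initialized = false;
	std::vector<uint64_t> hashes; // one hash per dictionary entry
};

struct StringDictionary {
	std::vector<std::string> entries;
	HashCache cache; // created uninitialized alongside the dictionary vector
};

// Returns the per-entry hashes, computing and caching them on first use.
const std::vector<uint64_t> &GetDictionaryHashes(StringDictionary &dict) {
	if (!dict.cache.initialized) {
		dict.cache.hashes.reserve(dict.entries.size());
		for (const auto &entry : dict.entries) {
			// std::hash is a placeholder for DuckDB's string hash function.
			dict.cache.hashes.push_back(std::hash<std::string> {}(entry));
		}
		dict.cache.initialized = true;
	}
	return dict.cache.hashes;
}
```

Every later hash request against the same dictionary then pays only a buffer lookup instead of rehashing each string.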
Hi DuckDB team,

This PR implements a Unified String Dictionary (USD). The idea is taken from Optimistically Compressed Hash Tables & Strings in the USSR and is further adapted and integrated into DuckDB.
Motivation
According to workload studies, strings are the most common data type in databases. Yet they are rather troublesome to process, as they can be large and do not fit nicely into the memory hierarchy. Furthermore, typical operations such as applying a hash function, checking equality, and copying can be very expensive compared to numerical data, which require far fewer instructions. To speed up string processing, we implement a dictionary-like data structure that attempts to unify strings coming from different data sources, such as per-block dictionaries and different columns. This dictionary enables compressed execution for various operations in DuckDB.
Overview
The USD is a per-query dictionary which contains the most valuable strings of a query. The goal of this data structure is to speed up string operations, specifically hashing and equality checks. Furthermore, strings within the USD only need to be materialized once in memory, and further copying of those strings can be limited to the `string_t` itself. It consists of two main components: a data region and a linear-probing hash table. Strings and their hash values are materialized in the data region. The hash table is used for fast lookups of strings within the USD and is only used during the insertion phase.
The main operation that the USD supports is insertion. It takes a `string_t` as input and applies DuckDB's hash function to it. Using the hash value's first few bits, it slots into the hash table. If the bucket is empty, it attempts to materialize the string and its hash value in the data region. Afterwards, it finalizes the hash table bucket with the index into the data region alongside extra hash bits (a salt). Future occurrences of the same string will use the stored salt value in the bucket to be extra certain that the values are the same before executing a full string comparison. If the strings match, the input `string_t` simply points to the materialized string.

With regard to concurrency, atomic operations are used to implement a lock-free USD. Threads contend over hash table buckets, and the winner "dirties" the bucket using the salt and a sentinel value. It then reserves the number of 8-byte chunks it needs from the data region and materializes the string and hash there. The loser and other threads that observe the dirtied bucket use the salt to decide whether they should wait until the bucket value is finalized or continue probing.
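To make the bucket protocol concrete, here is a minimal C++ sketch of the claim/finalize sequence described above. The bit layout, the salt derivation, and the `Materialize` helper are assumptions made for this illustration, not the PR's actual code.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint64_t EMPTY          = 0;
constexpr uint32_t SENTINEL_INDEX = 0xFFFFFFFFu; // claimed but not yet finalized

static uint64_t EncodeBucket(uint32_t salt, uint32_t index) {
	return (static_cast<uint64_t>(salt) << 32) | static_cast<uint64_t>(index);
}
static uint32_t BucketSalt(uint64_t word)  { return static_cast<uint32_t>(word >> 32); }
static uint32_t BucketIndex(uint64_t word) { return static_cast<uint32_t>(word); }

// Hypothetical: reserves 8-byte chunks from the data region and copies the
// string plus its hash there, returning the starting chunk index.
extern uint32_t Materialize(uint64_t hash);

uint32_t InsertOrFind(std::atomic<uint64_t> *buckets, uint64_t capacity, uint64_t hash) {
	// Nonzero salt so that a dirtied bucket can never look EMPTY.
	const uint32_t salt = static_cast<uint32_t>(hash >> 32) | 1u;
	for (uint64_t slot = hash % capacity;; slot = (slot + 1) % capacity) {
		uint64_t word = buckets[slot].load(std::memory_order_acquire);
		if (word == EMPTY) {
			uint64_t expected = EMPTY;
			// The winner "dirties" the bucket with the salt and the sentinel.
			if (buckets[slot].compare_exchange_strong(expected, EncodeBucket(salt, SENTINEL_INDEX))) {
				uint32_t index = Materialize(hash);
				// Finalize: publish the data-region index alongside the salt.
				buckets[slot].store(EncodeBucket(salt, index), std::memory_order_release);
				return index;
			}
			word = expected; // lost the race; inspect the winner's word
		}
		if (BucketSalt(word) == salt) {
			// Possible match: wait out the sentinel, then (in the real code)
			// verify with a full string comparison before reusing the index.
			while (BucketIndex(word) == SENTINEL_INDEX) {
				word = buckets[slot].load(std::memory_order_acquire);
			}
			return BucketIndex(word);
		}
		// Different salt: definitely a different string, keep probing.
	}
}
```

The key property is that the bucket word transitions atomically from empty to dirtied to finalized, so readers never observe a partially written data-region entry.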
Valuable strings
A string is considered valuable if it can be guaranteed that the performance gain from inserting it into the USD outweighs the cost of the insertion itself. Currently, the following strings are considered valuable: …
How to recognize USD strings
In this iteration, pointer tagging is used to recognize a string that is backed by the USD: we set the highest bit in the pointer contained within `string_t`. Further changes are made to ensure that the tag is cleared before dereferencing the pointer in `string_t`, even in cases where it does not go through the `string_t` API. It should be noted that, because this relies on the pointer in `string_t`, only non-inlined strings are considered for insertion into the USD.
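As a rough illustration of this scheme, the helpers below show how such a tag might be set, tested, and cleared; the mask and helper names are assumptions for this sketch, not the PR's actual code.

```cpp
#include <cstdint>

constexpr uint64_t USD_TAG_MASK = 1ULL << 63; // highest pointer bit marks USD strings

static inline char *TagUSDPointer(char *ptr) {
	return reinterpret_cast<char *>(reinterpret_cast<uint64_t>(ptr) | USD_TAG_MASK);
}

static inline bool IsUSDPointer(const char *ptr) {
	return (reinterpret_cast<uint64_t>(ptr) & USD_TAG_MASK) != 0;
}

// The tag must always be cleared before the pointer is dereferenced.
static inline char *UntagUSDPointer(char *ptr) {
	return reinterpret_cast<char *>(reinterpret_cast<uint64_t>(ptr) & ~USD_TAG_MASK);
}
```

This relies on the highest bit of a 64-bit user-space pointer being unused on mainstream platforms; inlined strings store their bytes rather than a pointer, which is why only non-inlined strings can carry the tag.

Unified String Dictionary Insertion operator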
I decided to add a USD insertion operator to insert strings that are emitted from storage and are on their way to the materializing operators. This operator makes it possible to control the flow of strings into the USD; for example, we try to avoid columns with a high domain cardinality. We also keep track of dictionary IDs in this operator.
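Purely as an illustration of such a gating heuristic (the threshold, names, and statistics tracked here are my assumptions, not the PR's actual operator):

```cpp
#include <cstdint>
#include <unordered_set>

// Hypothetical per-column state the operator might track.
struct ColumnStats {
	uint64_t seen_values = 0;
	uint64_t distinct_estimate = 0;
	std::unordered_set<uint64_t> seen_dictionary_ids; // dictionary IDs observed so far
};

// Decide whether a column's strings should flow into the USD at all.
bool ShouldInsertColumn(const ColumnStats &stats) {
	if (stats.seen_values == 0) {
		return true; // nothing observed yet, optimistically try inserting
	}
	// Avoid columns with high domain cardinality: materializing many
	// distinct strings costs more than the later hashing savings.
	double unique_ratio =
	    static_cast<double>(stats.distinct_estimate) / static_cast<double>(stats.seen_values);
	return unique_ratio < 0.01; // threshold is an assumption
}
```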
Optimizer Rule
An optimizer rule is also implemented to make sure the USD is enabled only if the query can make use of it. To this end, we traverse the query plan in a depth-first search and attempt to add the USD insertion operator once between the leaf and the first "target" operator (a rough sketch follows the list below).
Currently, the following target operators are considered:

- `logical_aggregate_and_groupby` and `logical_distinct`: these operators can potentially benefit from faster hashing, faster equality checks, and a single string allocation in the query intermediates. We have limited this to not activate for a single-column group by due to `TryAddCompressedGroups`, as that is a superior optimization compared to the USD.
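A rough sketch of that traversal, assuming a simplified plan-node representation (DuckDB's actual `LogicalOperator` API differs):

```cpp
#include <memory>
#include <vector>

enum class OpType { AGGREGATE_AND_GROUP_BY, DISTINCT, USD_INSERT, SCAN, OTHER };

struct PlanNode {
	OpType type = OpType::OTHER;
	std::vector<std::unique_ptr<PlanNode>> children;
};

static bool IsTarget(OpType type) {
	return type == OpType::AGGREGATE_AND_GROUP_BY || type == OpType::DISTINCT;
}

// Depth-first search: once a target operator lies on the path from the root,
// wrap the leaf (e.g. a scan) with the USD insertion operator, at most once.
bool TryInsertUSDOperator(std::unique_ptr<PlanNode> &node, bool below_target) {
	below_target = below_target || IsTarget(node->type);
	if (node->children.empty()) {
		if (below_target) {
			auto usd = std::make_unique<PlanNode>();
			usd->type = OpType::USD_INSERT;
			usd->children.push_back(std::move(node));
			node = std::move(usd);
			return true;
		}
		return false;
	}
	for (auto &child : node->children) {
		if (TryInsertUSDOperator(child, below_target)) {
			return true; // add the operator only once
		}
	}
	return false;
}
```

Lifetime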
The USD is created by the optimizer when it sees a suitable query, and it lives throughout the query execution. Once the query has finished, it is moved from the `ClientContext` into the `QueryResult` class, because the result may rely on the strings materialized in the USD. It is destroyed along with the `QueryResult`.

Capacity
The original paper chose 512 kB as the static size for the data region so that it stays L2-cache resident. Thus, for example, the hash table only needs 16 bits to index into the 64k slots of 8 bytes each in the data region. We decided to allow for larger USDs as well: there could be workloads that benefit much more from the faster string operations than they lose to cache misses.

The baseline USD size is still 512 kB, but it is possible to initialize the USD with a `usd_sf` parameter. We use this power-of-two parameter to calculate the extra bits required to slot into the data region.
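The index-width arithmetic is straightforward; a small sketch under the stated assumptions (8-byte chunks, power-of-two `usd_sf`):

```cpp
#include <cstdint>

constexpr uint64_t BASE_DATA_REGION_BYTES = 512ULL * 1024; // 512 kB, L2 resident
constexpr uint64_t CHUNK_BYTES = 8;

// 512 kB / 8 B = 65,536 chunks, addressable with 16 bits at usd_sf = 1.
static uint64_t IndexBits(uint64_t usd_sf /* power of two */) {
	uint64_t chunks = (BASE_DATA_REGION_BYTES * usd_sf) / CHUNK_BYTES;
	uint64_t bits = 0;
	while ((1ULL << bits) < chunks) {
		bits++;
	}
	return bits; // 16 for usd_sf = 1, 17 for usd_sf = 2, and so on
}
```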
System modifications required

Changes were made to `TupleDataCollection`, `ColumnDataCollection`, and various row operations. The goal is to treat USD strings like inlined strings.

Performance results
Experiments were run with 8 threads on a MacBook Air M2 with 16 GB of memory. Each was run 5 times and the average runtime is reported.
Aggregation experiments
Synthetic data: 10 million rows, 100 unique strings in total per column
Query: a simple double-column group by
Join experiments
Synthetic data: 10 million rows, 100 unique strings of length 32 per string column.
Query:
Result:
Query:
Result:
TPC-H
The only query that gets faster with this optimization is query 16, as it contains a triple-column group by and one of these columns is p_type, which has a domain cardinality of 150.
Result:
TODOs
There are a few issues and loose ends that I am aware of with this PR that I hope to address soon.
- There are changes in `row_external.cpp`, but I am not sure if these are triggered when a join or aggregation goes out of core.
There is quite a vast design space for implementing and integrating this feature in DuckDB. Although this is the way I chose to implement it, there are various things that could be changed and improved. I am excited to hear what you think about this, and I am open to any suggestions and requests.
Thanks!