Skip to content

Conversation

adamchainz
Copy link
Member

There’s no need to raise an error if two items have the same hash value. Instead, we can secondarily sort by the key values to get a consistent ordering. Since hashes are based upon the seed, two tests that collide with one seed are highly unlikely to collide with another seed, so they get shuffled rather than always run in alphabetical order.

There’s no need to raise an error if two items have the same hash value. Instead, we can secondarily sort by the key values to get a consistent ordering. Since hashes are based upon the seed, two tests that collide with one seed are highly unlikely to collide with another seed, so they get shuffled rather than always run in alphabetical order.
@adamchainz
Copy link
Member Author

Post-merge follow up to #14137.

The birthday problemn basically guarantees hash collisions across large code bases, or smaller ones with enough iterations - and with all the Django installs and test runs in the world, there will certainly be enough iterations in time. We can avoid raising the error like so.

@adamchainz adamchainz requested a review from cjerdonek July 8, 2021 11:04
@adamchainz adamchainz changed the title Removed raising RuntimeError from shuffle() method. Refs #24522 -- Removed raising RuntimeError from shuffle() method. Jul 8, 2021
@adamchainz adamchainz requested a review from felixxm July 8, 2021 12:16
@cjerdonek
Copy link
Contributor

Some high-level comments:

The RuntimeError serves an important purpose because it also serves to catch two items having the same key, which is a bug the user should know about. (Indeed, that's the case that is tested.) If you want to handle the case of hash collisions with different keys, I would keep the error but change its text, etc. to handle only that case.

Regarding handling hash collisions, since the shuffling is supposed to be random, it would be better if the resolution was also random instead of always favoring tests that sort early. This could be done by adding an additional hashed string at the end as necessary, for all items that hash to the same value. This could be done e.g. by hashing with a seed that is modified in a deterministic way from the original seed, say by appending a character. Adding the hashed string at the end ensures the ordering relative to other items won't be impacted.

An alternative way of resolving this issue would be to use a stronger hash algorithm, or make it easier for people to use a stronger hash algorithm, for people that actually care about getting an occasional hash collision (if this can even be said to be occasional).

@adamchainz
Copy link
Member Author

The RuntimeError serves an important purpose because it also serves to catch two items having the same key, which is a bug the user should know about. (Indeed, that's the case that is tested.) If you want to handle the case of hash collisions with different keys, I would keep the error but change its text, etc. to handle only that case.

In what case is it actually possible to have two items with the same key? Afaict they're always test ID's or class paths, which means they're unique. And if they're non-unique, perhaps the user is deliberately running the same test twice in one process, perhaps to debug an isolation error.

If there's a legitimate concern about items with the same keys, that should be added as a separate check in the test suite, so we can protect users not using --shuffle. Indeed, it would make a good PR for unittest itself.

Regarding handling hash collisions, since the shuffling is supposed to be random, it would be better if the resolution was also random instead of always favoring tests that sort early. This could be done by adding an additional hashed string at the end as necessary, for all items that hash to the same value. This could be done e.g. by hashing with a seed that is modified in a deterministic way from the original seed, say by appending a character. Adding the hashed string at the end ensures the ordering relative to other items won't be impacted.

Hash collisions are already rare so I don't think there's any concern with sorting the items that collide. Adding a second hashes just regresses the problem, because what if there are collisions with the secondary hash?

An alternative way of resolving this issue would be to use a stronger hash algorithm, or make it easier for people to use a stronger hash algorithm, for people that actually care about getting an occasional hash collision (if this can even be said to be occasional).

That just "pushes the boat out" mathematically. I'd prefer to resolve this in a deterministic "that can never happen" way.

Out of interest I computed the current probability of a collision in a given test run. Taking the birthday problem formula and plugging in MD5's 128-bit space, and a million tests, the chance of collision comes out as ~5 * 10 ** -19. So yes it is very low.

In [2]: d = Decimal(2**128)

In [3]: n = Decimal(100_000)

In [4]: 1 - ((d-1)/d)**(n*(n-1)/2)
Out[4]: Decimal('4.999950000E-19')

@cjerdonek
Copy link
Contributor

In what case is it actually possible to have two items with the same key? Afaict they're always test ID's or class paths, which means they're unique. And if they're non-unique, perhaps the user is deliberately running the same test twice in one process, perhaps to debug an isolation error.

Having the same test occur twice isn't compatible with shuffle(), so I think it's important to let the user know if / when that case happens instead of masking it. The RuntimeError does that, so I think it makes sense to keep it in some form.

Adding a second hashes just regresses the problem, because what if there are collisions with the secondary hash?

Yes, that's what I was getting at when I said "as necessary" (as many as necessary). If it's worth handling the less than one-in-a-quintillion case (for a test suite of 100,000 tests), one might as well implement a true shuffle.

@adamchainz adamchainz closed this Nov 27, 2021
@adamchainz adamchainz deleted the shuffle_no_error branch November 27, 2021 18:59
felixxm pushed a commit to knyghty/django that referenced this pull request Nov 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants