-
-
Notifications
You must be signed in to change notification settings - Fork 33.1k
Refs #24522 -- Removed raising RuntimeError from shuffle() method. #14611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There’s no need to raise an error if two items have the same hash value. Instead, we can secondarily sort by the key values to get a consistent ordering. Since hashes are based upon the seed, two tests that collide with one seed are highly unlikely to collide with another seed, so they get shuffled rather than always run in alphabetical order.
Post-merge follow up to #14137. The birthday problemn basically guarantees hash collisions across large code bases, or smaller ones with enough iterations - and with all the Django installs and test runs in the world, there will certainly be enough iterations in time. We can avoid raising the error like so. |
Some high-level comments: The Regarding handling hash collisions, since the shuffling is supposed to be random, it would be better if the resolution was also random instead of always favoring tests that sort early. This could be done by adding an additional hashed string at the end as necessary, for all items that hash to the same value. This could be done e.g. by hashing with a seed that is modified in a deterministic way from the original seed, say by appending a character. Adding the hashed string at the end ensures the ordering relative to other items won't be impacted. An alternative way of resolving this issue would be to use a stronger hash algorithm, or make it easier for people to use a stronger hash algorithm, for people that actually care about getting an occasional hash collision (if this can even be said to be occasional). |
In what case is it actually possible to have two items with the same key? Afaict they're always test ID's or class paths, which means they're unique. And if they're non-unique, perhaps the user is deliberately running the same test twice in one process, perhaps to debug an isolation error. If there's a legitimate concern about items with the same keys, that should be added as a separate check in the test suite, so we can protect users not using
Hash collisions are already rare so I don't think there's any concern with sorting the items that collide. Adding a second hashes just regresses the problem, because what if there are collisions with the secondary hash?
That just "pushes the boat out" mathematically. I'd prefer to resolve this in a deterministic "that can never happen" way. Out of interest I computed the current probability of a collision in a given test run. Taking the birthday problem formula and plugging in MD5's 128-bit space, and a million tests, the chance of collision comes out as ~5 * 10 ** -19. So yes it is very low.
|
Having the same test occur twice isn't compatible with
Yes, that's what I was getting at when I said "as necessary" (as many as necessary). If it's worth handling the less than one-in-a-quintillion case (for a test suite of 100,000 tests), one might as well implement a true shuffle. |
…nd Client classes.
There’s no need to raise an error if two items have the same hash value. Instead, we can secondarily sort by the key values to get a consistent ordering. Since hashes are based upon the seed, two tests that collide with one seed are highly unlikely to collide with another seed, so they get shuffled rather than always run in alphabetical order.