feat: add stratify param support to ml.model_selection.train_test_split method #815
Conversation
test_dfs.append(test)

train_df = cast(
    bpd.DataFrame, bpd.concat(train_dfs).drop(columns="bigframes_stratify_col")
The complexity of `concat` has come up in several threads (the RAG notebook, the Google Next demo), e.g. https://screenshot.googleplex.com/7jehfHCAVJrmc3p. The stratify column could have enough unique values to run into the same problem. It would be a good idea to test and document where the limit lies.
Yes. `concat` grows both the BFET and the generated SQL, so it may hit SQL size limits or OOM errors when the stratify column has too many unique values. Added a note.
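For readers following the thread, a minimal plain-Python sketch of the pattern under discussion: split each stratum separately, then concatenate the per-stratum pieces. This is not the PR's implementation and does not use bigframes; `stratified_split`, its parameters, and the list-of-dicts row format are hypothetical stand-ins chosen for illustration. The point is only that the number of concat inputs equals the number of unique values in the stratify column.

```python
import random
from collections import defaultdict


def stratified_split(rows, stratify_key, test_size=0.2, seed=0):
    """Split rows per stratum, then concatenate the per-stratum pieces.

    Hypothetical helper for illustration; rows is a list of dicts.
    """
    rng = random.Random(seed)

    # Group rows by the value of the stratify column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[stratify_key]].append(row)

    train_parts, test_parts = [], []
    for _, group in sorted(groups.items()):
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_size)
        test_parts.append(shuffled[:n_test])
        train_parts.append(shuffled[n_test:])

    # One concat input per unique value: this is the step whose cost
    # scales with the cardinality of the stratify column.
    train = [row for part in train_parts for row in part]
    test = [row for part in test_parts for row in part]
    return train, test, len(train_parts)


# 40 rows, 4 unique labels, 10 rows per label.
rows = [{"x": i, "label": i % 4} for i in range(40)]
train, test, n_parts = stratified_split(rows, "label", test_size=0.2)
```

Here `n_parts` is 4 because the label column has four unique values. With a high-cardinality stratify column the same loop would produce one piece per value, which is the growth the review comment asks to test and document.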
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #<issue_number_goes_here> 🦕