feat: add stratify param support to ml.model_selection.train_test_split method #815

GarrettWu · 2024-07-02T20:09:56Z

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

- Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
- Ensure the tests and linter pass
- Code coverage does not decrease (if any source code was changed)
- Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

shobsi · 2024-07-09T23:09:08Z

bigframes/ml/model_selection.py

+            test_dfs.append(test)
+
+        train_df = cast(
+            bpd.DataFrame, bpd.concat(train_dfs).drop(columns="bigframes_stratify_col")


The complexity of concat has been discussed at length in several threads (for example the RAG notebook and the Google Next demo), e.g. https://screenshot.googleplex.com/7jehfHCAVJrmc3p. The stratify column could have enough unique values to run into that problem. It would be a good idea to test and document where the limit lies.

Yes. Concat grows both the BFET and the generated SQL, so it may hit SQL size limits or OOM errors when the stratify column has too many unique values. Added a note.
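
For readers following the thread, here is a minimal sketch of the pattern under discussion: split each unique stratify value separately, then concatenate the pieces. It uses plain Python lists rather than BigQuery DataFrames, and `stratified_split`, its parameters, and the sample rows are hypothetical names chosen for illustration, not the PR's code. It only aims to show that the number of pieces being concatenated equals the number of unique values in the stratify column, which is where the size concern comes from.

```python
import random
from collections import defaultdict


def stratified_split(rows, stratify_key, test_size=0.25, seed=0):
    """Split rows into (train, test), preserving each stratum's share.

    Illustrative only: a plain-Python stand-in, not the bigframes code.
    """
    rng = random.Random(seed)

    # Group rows by the value found in the stratify column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[stratify_key]].append(row)

    train_parts, test_parts = [], []
    for members in groups.values():
        members = members[:]  # copy so the caller's rows are not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test_parts.append(members[:n_test])
        train_parts.append(members[n_test:])

    # One piece per unique value is concatenated here, so the work done
    # by this step grows with the number of unique stratify values.
    train = [row for part in train_parts for row in part]
    test = [row for part in test_parts for row in part]
    return train, test


# Usage: 80 rows labelled "a" and 20 labelled "b".
rows = [{"id": i, "label": "a" if i < 80 else "b"} for i in range(100)]
train, test = stratified_split(rows, "label", test_size=0.25)
```

In a dataframe setting, each piece would be a separate dataframe and the final step a single concat over all of them, so a stratify column with many unique values yields a correspondingly large concat.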

feat: add stratify param to ml.model_selection.train_test_split

8615e37

GarrettWu requested review from junyazhang and shobsi July 2, 2024 20:09

GarrettWu requested review from a team as code owners July 2, 2024 20:09

product-auto-label bot added the size: m Pull request size is medium. label Jul 2, 2024

blunderbuss-gcf bot assigned orrbradford Jul 2, 2024

product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. label Jul 2, 2024

fix mypy

82d34d3

GarrettWu assigned GarrettWu and unassigned orrbradford Jul 2, 2024

Merge branch 'main' into garrettwu-split

fd0e813

shobsi approved these changes Jul 9, 2024

GarrettWu added 3 commits July 10, 2024 18:51

add notes for limit
