Releases · skrub-data/skrub
Skrub release 0.7.0
✨ Highlights
- Data Ops can now be tuned with Optuna.
- It is now possible to pass extra named arguments to an estimator through `DataOp.skb.apply`.
- The `TableReport` now supports numpy arrays.
- The minimum supported version of Python has been increased to 3.10, and the minimum supported versions of scikit-learn and requests are now 1.4.2 and 2.27.1 respectively.
- Added support for the upcoming Pandas 3.0.
- A lot of bugs have been fixed.
16 new contributors participated in this release 🎉
New features
- It is now possible to tune the choices in a `DataOp` with Optuna. #1661 by @jeromedockes.
- `DataOp.skb.apply` now allows passing extra named arguments to the estimator's methods through the parameters `fit_kwargs`, `predict_kwargs`, etc. #1642 by @jeromedockes.
- `TableReport` now displays the mean statistic for boolean columns. #1647 by @abenechehab.
- `DataOp.skb.get_vars` allows inspecting all the variables, or all the named dataops, in a `DataOp`. This lets us easily know which keys should be present in the `environment` dictionary passed to `DataOp.skb.eval` or to `SkrubLearner.fit`, `SkrubLearner.predict`, etc. #1646 by @jeromedockes.
- `DataOp.skb.iter_cv_splits` iterates over the training and testing environments produced by a CV splitter -- similar to `DataOp.skb.train_test_split` but for multiple cross-validation splits. #1653 by @jeromedockes.
- `TableReport` now supports `np.array`. #1676 by @Nismamjad1.
- `DataOp.skb.full_report` now accepts a new parameter, `title`, that is displayed in the HTML report. #1654 by @MarieSacksick.
- `TableReport` now includes the `open_tab` parameter, which lets the user select which tab should be opened when the `TableReport` is rendered. #1737 by @rcap107.
Changes
- The minimum supported version of Python has been increased to 3.10. Additionally, the minimum supported versions of scikit-learn and requests are 1.4.2 and 2.27.1 respectively. Support for Python 3.14 has been added. #1572 by @rcap107.
- The `DataOp.skb.full_report` method now deletes reports created with `output_dir=None` after 7 days. #1657 by @dierickxsimon.
- The `tabular_pipeline` uses a `SquashingScaler` instead of a `StandardScaler` for centering and scaling numerical features when linear models are used. #1644 by @dierickxsimon.
- The transformer `ToFloat`, previously called `ToFloat32`, is now public. #1687 by @MarieSacksick.
- Improved the error message raised when a Polars lazyframe is passed to `TableReport`, clarifying that `.collect()` must be called first. #1767 by @fatiben2002.
- Computing the associations in `TableReport` is now deterministic and can be controlled by the new parameter `subsampling_seed` of the global configuration. #1775 by @thomass-dev.
- Added the `cast_to_str` parameter to `Cleaner` to prevent unintended conversion of list/object-like columns to strings unless explicitly enabled. #1789 by @PilliSiddharth.
Bugfixes
- The `skrub.cross_validate` function now raises a specific exception if the wrong variable type is passed. #1799 by @emassoulie.
- Fixed various issues with some transformers by adding `get_feature_names_out` to all single-column transformers. #1666 by @rcap107.
- Issues occurring when `DataOp.skb.apply` was passed a DataOp as the estimator have been fixed in #1671 by @jeromedockes.
- `TableReport` could raise an error while trying to check if Polars columns with some dtypes (lists, structs) are sorted, and would not indicate Polars columns sorted in descending order. Fixed in #1673 by @jeromedockes.
- Fixed nightly checks and added support for upcoming library versions, including Pandas v3.0. #1664 by @auguste-probabl and @rcap107.
- Fixed the use of `TableReport` and `Cleaner` with Polars dataframes containing a column with an empty string as its name. #1722 by @MarieSacksick.
- Fixed an issue where `TableReport` would fail when computing associations for Polars dataframes if PyArrow was not installed. #1742 by @rcap107.
- Fixed an issue in the Data Ops report generation in cases where the DataOp contained escape characters or spanned multiple lines. #1764 by @rcap107.
- Added `get_feature_names_out` to `Cleaner` for consistency with the `TableVectorizer` and other transformers. #1762 by @rcap107.
- Improved the error message when `TextEncoder` is used without the optional transformers dependencies. #1769 by @fxzhou22.
- Accessing `.skb.applied_estimator` on a `DataOp` after calling `.skb.set_name()`, `.skb.set_description()`, `.skb.mark_as_X()` or `.skb.mark_as_y()` used to raise an error; this has been fixed in #1782 by @jeromedockes.
- Fixed potential issues that could arise in `ParamSearch.plot_results` when NaN values were present in the cross-validation results. #1800 by @rcap107.
New Contributors
- @csejourne made their first contribution in #1503
- @divakaivan made their first contribution in #1598
- @star1327p made their first contribution in #1599
- @abenechehab made their first contribution in #1647
- @dierickxsimon made their first contribution in #1644
- @kudos07 made their first contribution in #1670
- @DimitriPapadopoulos made their first contribution in #1692
- @auguste-probabl made their first contribution in #1664
- @amirakahub made their first contribution in #1715
- @Nismamjad1 made their first contribution in #1676
- @Alispirale made their first contribution in #1717
- @basile-desjuzeur made their first contribution in #1685
- @fatiben2002 made their first contribution in #1767
- @fxzhou22 made their first contribution in #1769
- @PilliSiddharth made their first contribution in #1789
- @emassoulie made their first contribution in #1799
Full Changelog: 0.6.2...0.7.0
Skrub release 0.6.2
New features
- The `DataOp.skb.full_report()` now displays the time each node took to evaluate. #1596 by Jérôme Dockès.
- The User guide has been reworked and expanded.
Changes and deprecations
- Ken embeddings are now deprecated. #1546 by Vincent Maladiere.
- The accepted values for the parameter `how` of `.skb.apply()` have changed. The new values are "auto", "cols", "frame", and "no_wrap". #1628 by Jérôme Dockès.
- The parameter `splitter` of `.skb.train_test_split()` has been renamed `split_func`. #1630 by Jérôme Dockès.
Main bugfixes
- Fixed the display of DataOp objects in Google Colab cell outputs. #1590 by Jérôme Dockès.
- Fixed the range from which `choose_float()` and `choose_int()` sample values when `log=False` and `n_steps` is None. It was between low and low + high; now it is between low and high. #1603 by Jérôme Dockès.
- The `SkrubLearner` used to do a prediction on the train set during `fit()`; this has been fixed. #1610 by Jérôme Dockès.
Full Changelog: 0.6.1...0.6.2
Skrub release 0.6.1
Bugfixes
- `get_feature_names_out` now works correctly when GapEncoder, DropCols, or SelectCols are used from within a scikit-learn Pipeline. In addition, DropCols's `get_feature_names_out` method now returns the names of the columns that are not dropped, rather than the names of the columns that are dropped. #1543 by Riccardo Cappuzzo.
Full Changelog: 0.6.0...0.6.1
Skrub release 0.6.0
🚀 Highlights
- Major feature! Skrub DataOps are a powerful new way of combining dataframe transformations over multiple tables and machine learning pipelines. DataOps can be combined to form complex data plans that can be used to train and tune machine learning models. The DataOps plans can then be exported as Learners (skrub.SkrubLearner), standalone objects that can be used on new data. More detail about the DataOps can be found in the User guide and in the examples.
- The TableReport has been improved with many new features. Series are now supported directly. It is now possible to skip computing column associations and generating plots when the number of columns in the dataframe exceeds a user-defined threshold. Columns with high cardinality and sorted columns are now highlighted in the report.
- selectors, ApplyToCols and ApplyToFrame are now available, providing utilities for selecting columns to which a transformer should be applied in a flexible way. For more details, see the User guide and the example.
- The SquashingScaler has been added: it robustly rescales and smoothly clips numerical columns, enabling more robust handling of numerical columns with neural networks. See the example.
🎆 New Features
- The Skrub DataOps are a new mechanism for building machine-learning pipelines that handle multiple tables and for easily describing their hyperparameter spaces. Main PR: #1233 by Jérôme Dockès. Additional work from other contributors can be found here: Vincent Maladiere provided very important help by trying the DataOps on many use cases and datasets, providing feedback and suggesting improvements, improving the examples (including creating all the figures in the examples) and adding jitter to the parallel coordinate plots; Riccardo Cappuzzo experimented with the DataOps, suggested improvements and improved the examples; Gaël Varoquaux, Guillaume Lemaitre, Adrin Jalali, Olivier Grisel and others participated through many discussions in defining the requirements and the public API. See the examples for an introduction.
- The selectors module provides utilities for selecting columns to which a transformer should be applied in a flexible way. The module was created in #895 by Jérôme Dockès and added to the public API in #1341 by Jérôme Dockès.
- The DropUninformative transformer is now available. This transformer employs different heuristics to detect columns that are not likely to bring useful information for training a model. The current implementation includes detection of columns that contain only a single value (constant columns), only missing values, or all unique values (such as IDs). #1313 by Riccardo Cappuzzo.
- get_config(), set_config() and config_context() are now available to configure settings for dataframes display and expressions. patch_display() and unpatch_display() are deprecated and will be removed in the next release of skrub. #1427 by Vincent Maladiere. The global configuration includes the parameter `cardinality_threshold`, which controls the threshold value used to warn users if they have high-cardinality columns in their dataset. #1498 by rouk1. Additionally, the parameter `float_precision` controls the number of significant digits displayed for floating-point values in reports. #1470 by George S.
- Added the SquashingScaler, a transformer that robustly rescales and smoothly clips numerical columns, enabling more robust handling of numerical columns with neural networks. #1310 by Vincent Maladiere and David Holzmüller.
- `datasets.toy_order()` is now available to create a toy dataframe and corresponding targets for examples. #1485 by Antoine Canaguier-Durand.
- ApplyToCols and ApplyToFrame are now available to apply transformers on a set of columns independently and jointly, respectively. #1478 by Vincent Maladiere.
Changes
- ⚠️ The default high-cardinality encoder for both TableVectorizer and tabular_learner() (now tabular_pipeline()) has been changed from GapEncoder to StringEncoder. #1354 by Riccardo Cappuzzo.
- The tabular_learner function has been deprecated in favor of tabular_pipeline() to honor its scikit-learn pipeline cultural heritage, and to remove the ambiguity with the Data Ops Learner. #1493 by Vincent Maladiere.
- StringEncoder now exposes the stop_words argument, which is passed to the underlying vectorizer (TfidfVectorizer, or HashingVectorizer). #1415 by Vincent Maladiere.
- A new parameter `max_association_columns` has been added to the TableReport to skip association computation when the number of columns exceeds the specified value. #1304 by Victoria Shevchenko.
- The packaging dependency was removed. #1307 by Jovan Stojanovic.
- TextEncoder, StringEncoder and GapEncoder now compute the total standard deviation norm during training, which is a global constant, and normalize the vector outputs by performing element-wise division on all entries. #1274 by Vincent Maladiere.
- The `DropIfTooMa...