Releases: Desbordante/desbordante-core
Desbordante 2.4.1
Hi everyone! This is a minor release, which adds support for Python 3.14.
Desbordante 2.4.0
Hi everyone! Almost seven months have passed since our last release, and we are excited to introduce the new version of Desbordante. This update is packed with features and includes the following user-facing improvements:
-
Validation of Constant Denial Constraints (DCs): We have enhanced the validation functionality for denial constraints by adding the ability to handle constant DCs. Additionally, we included examples demonstrating how to use the validation of this pattern, including data cleaning examples that utilize exception discovery capabilities.
-
Graph Functional Dependencies (GFDs) Discovery: Desbordante now includes an algorithm for searching GFDs within datasets. Preliminary experiments have shown that processing a couple of million vertices can be done in a reasonable time on a standard computer. An example of its usage has been added.
-
Validation of Differential Dependencies (DDs): We have introduced basic functionality for validating DDs. Currently, it does not support passing user-defined metrics, only the built-in ones can be used. As usual for validation tasks, there is an exception discovery feature, along with an example of how to use it.
-
Validation of Conditional Functional Dependencies (CFDs): We have added the capability to validate CFDs from the core of Desbordante. The new version operates on partitions and should be advantageous when validating multiple CFDs. In the future, we plan to introduce a straightforward algorithm (without index construction) that will be optimal for validating a single CFD. A separate usage example has been included, along with the ability to discover exceptions.
-
Validation of Matching Dependencies (MDs): We have added validation for the most expressive type of pattern, which includes exception discovery capabilities. This pattern is capable of capturing subtle inconsistencies in data by utilizing various matching functions. Custom user functions are supported. An example demonstrating data cleaning techniques using this pattern has been added.
-
Revamped Example for Fuzzy Algebraic Constraints (AC): The example for fuzzy algebraic constraints has been completely redesigned. Now, users can understand the algorithm and its parameters without referring to the article, along with a clear example illustrating the underlying concepts.
-
Colab Notebooks with Examples: We aim to lower the entry barrier for potential users of Desbordante, and to this end, we now provide notebooks with examples in addition to examples in the Desbordante-core repository. These notebooks can be run in Google Colab (without needing to install our pip package on your machine) to familiarize users with the profiler and help them choose the appropriate pattern. Ultimately, we plan to include clickable links in the README.md to the notebook versions of the examples. In this release, over ten patterns have received their example versions in the form of notebooks.
-
Serialization of Found Patterns: We have added the ability to serialize certain types of patterns and plan to implement this across all supported patterns in the future. Besides the obvious benefits, serialization serves as the foundation for Reflejo — a profiling module that operates in digest mode. Reflejo is similar to traditional profilers in that it takes a dataset, a search profile, and returns the identified patterns. Unlike the core package, Reflejo focuses on providing an overview of the dataset rather than user-directed spot checks. Additionally, it offers various functionalities such as pattern search tuning based on time limits, management of profiling results, and failure response scenarios. We will be releasing this pip package soon, so stay tuned for announcements.
Finally, we have published our first major article about Desbordante, which provides a high-level overview of the tool, presents its architecture, and outlines our vision and future plans. You can read it here: https://dl.acm.org/doi/10.1145/3703323.3703725.
Desbordante 2.3.2
Hello everyone! This release primarily focuses on bug fixes and improvements related to Numerical Association Rule (NAR) discovery. Here are the key changes:
- Optional Seed Parameter: Introduced an optional seed parameter in the NAR discovery algorithm to ensure repeatability across subsequent invocations.
- Consistent NAR Discovery Example: Updated the NAR discovery example to guarantee consistent results across different machines and invocation.
- Improved Python Bindings: Fixed the IND, AIND, and SFD Python bindings to prevent further invocations from disrupting previously found pattern instances.
- CMake Version Requirement: The README now clearly states the requirement for CMake version 3.25.
- Build System Upgrade: Transitioned from Make to Ninja for faster build times.
- MacOS Performance Improvements: Several algorithms (FDep, FastADC, FastOD) have received a significant speed boost, achieving performance parity with their Linux counterparts.
Desbordante 2.3.1
This release only fixes pypi.org metadata.
The changelog of v.2.3.0 can be seen here.
Desbordante 2.3.0
Hello everyone! We are excited to share that with this release, we've added support for macOS, making it possible to install the Desbordante-core pip package on macOS via PyPi for CPython versions 3.8 through 3.13, as well as for PyPy versions 3.7 through 3.10. Additionally, we have introduced support for CPython 3.13 on Linux. This release also includes several important updates and new features:
-
Numerical Association Rule Discovery: We have added support for numerical association rule discovery, a pattern specifically designed for extracting knowledge from data. Currently, we include an approximate discovery algorithm based on an evolutionary approach, which may miss some rules. A usage example is provided for reference.
-
EulerFD Algorithm: A novel algorithm, EulerFD, has been introduced for the approximate discovery of exact functional dependencies. This algorithm is very fast but may miss some dependencies and introduce additional (non-existing) ones. To the best of our knowledge, it is currently the fastest existing approximate algorithm for this task. Two usage examples for approximate functional dependency discovery algorithms are included.
-
Approximate Denial Constraints: We have added support for the discovery of approximate denial constraints. This recently developed pattern allows us to extract facts from data expressed by a Boolean formula. Discovery is performed using the FastADC algorithm, which is currently the fastest algorithm for this task. A usage example is provided.
-
Parallelization of HyFD Algorithm: The HyFD algorithm, one of the most efficient algorithms for exact functional dependency discovery, has been parallelized. As a result, we have achieved up to a 10x improvement in running times.
-
Improved Differential Dependency Discovery Algorithm: The SPLIT differential dependency discovery algorithm has been enhanced for improved performance, resulting in faster execution times (up to 17x). Also, its memory consumption has been significantly reduced (25x), allowing it to process larger datasets.
Desbordante 2.2.1
This release only fixes pypi.org metadata.
The changelog of v.2.2.0 can be seen here.
Desbordante 2.2.0
Almost a year passed from our initial release of the core package. Over this time we have experienced a great deal of interest, getting 18K downloads and 400 stars. Therefore, we have decided to make this release a bit special by trying to introduce as many new interesting patterns into Desbordante as possible. While there are several of them, the obvious star of this release is matching dependencies—a pattern which will greatly help in data deduplication, data cleaning and many other data quality tasks.
Overall, this release contains pattern implementations accumulated in over half a year. We hope you will find it useful!
Changes:
- Added discovery of matching dependencies—a very expressive type of dependency, capable of capturing subtle inconsistencies in data by using various matching functions.
- Added discovery of many types of approximate functional dependencies. Before, we defined the error value of an approximate FD to be calculated using the g1 metric. Our new definition permits use of any error metric, as the alternative metrics are currently gaining popularity. Therefore, we are expanding the number of supported metrics in Desbordante and in this release we added discovery for the
$\mu+$ ,$\tau$ ,$pdep$ , and$\rho$ metrics. - Added discovery of soft functional dependencies and corellations.
- Added validation of variable heterogeneous denial constraints.
- Added discovery and validation of approximate inclusion dependencies (using the
$g^{‘}_3$ metric). Inclusion dependencies can help users to recover foreign keys, or to find joinable columns in a table or a collection of tables. Supporting an approximate version of this pattern will allow users to perform these tasks even when dealing with dirty data. - Added validation of probabilistic functional dependencies.
- Examples were reorganized into three categories: 1) basic, which showcase a single pattern, 2) advanced, which illustrate various pattern nuances, and 3) expert, which demonstrate instances of complex programs that aim to provide tangible benefits for end-users by solving real-life problems using pattern discovery or validation.
Miscellaneous:
- Added HPIValid algorithm for discovery of UCCs. To the best of our knowledge, it is currently the most performant algorithm for this task, therefore we made it the default one.
- Added examples for UCC and AUCC mining.
- Fixed the AR example and added the support output to the Python bindings.
- Fixed lifetime issues with FD and UCC objects.
All novel patterns are coming with usage examples. Please note that the console version of Desbordante will be updated a bit later.
Desbordante 2.1.0
Release Notes
This minor release serves as a necessary step for isolating code of the console interface and moving it into a separate repository. Our final goal is to create a dedicated Python package called desbordante-cli, which will be implemented purely in Python. It will depend on the core desbordante package that contains the C++ code for pattern mining and validation.
As such, we plan to make minor releases of the core package in the future, followed by the console ones. These releases will contain fewer features, but will come out a lot more frequently. The idea here is to make a release as soon as each individual algorithm is ready rather than accumulating several of them as we did previously. Once a sufficient number of features have been accumulated, a major release will be published, primarily for promotion purposes. It will not provide any new functionality, but will include all the accumulated changes since the last major release.
Changes:
- We have added support for a novel class of algorithms — the dynamic ones. The idea is to track changes in the dataset in order to update their result on-the-fly rather than processing the whole table again. As a result, they can be up to several orders of magnitude faster than classic (static) ones in some situations. Along with devising dynamic infrastructure, we have implemented the first dynamic algorithm — a dynamic functional dependency validator. A Python interface and an example are provided.
- We have added support for discovery of differential dependencies. Differential dependency is a relatively novel type of pattern which is very handy for detecting a particular relationship between two column sets. It can be seen as an extension of functional dependency which works well on dirty data. See the article about the pattern for more information. Its implemented discovery algorithm (Split) comes with a Python interface and an example.
- Discovery of association rules is now available via the Python and console interfaces. An example is also available.
Miscellaneous:
- Greatly improved the metric functional dependency verification example.
- Added approximate inclusion dependency discovery algorithms to the C++ core. Python interface, console interface, and an example are still in development.
- Fixed Python bindings for association rules: the AR objects can be properly copied now.
- Extended simple statistics module with ten string-related statistics; they are available via the Python interface.
- Fixed a CLI-breaking bug related to the CFD discovery algorithm.
- Improved column type deduction in the C++ core.
Desbordante 2.0.0
Release Notes
This major release brings a lot of improvements. Its primary focus is Desbordante’s core: we add several new primitives for pattern discovery.
Changes:
- New feature: discovery of exact order dependencies. This primitive allows you to discover patterns related to orderings of columns, e.g. pay increasing with grade. It is available with two different axiomatizations — set-based and list-based. The latter is faster, but may miss some dependencies, while the former is more accurate, but computationally more demanding. Note that they present dependencies in different formats.
- New feature: discovery of probabilistic functional dependencies for both existing metrics: PerTuple and PerValue. This primitive helps in discovering a special case of approximate functional dependencies that better detects multiple violations in a small set of clusters. Provided examples illustrate the differences between existing AFD formulation and those PFDs, as well as show some of the potential use cases of PFDs.
- New feature: discovery of inclusion dependencies. This primitive can help users to recover primary key — foreign key relationships, or to find joinable columns in a table or a collection of tables. It is available as an exact algorithm (Spider) and as an approximate one (Faida), with Faida potentially producing errors but being much faster.
- New feature: we extend the set of supported data types by adding graphs. We started with supporting graph functional dependencies (GFD), and Desbordante can now validate GFDs. GFDs allow users to define patterns in graphs, specifying conditions both on graph structure and node content. Graph dependencies can be a bit tricky, so we provide illustrated examples.
- We’ve made discovery of conditional functional dependencies available in Python. This primitive can be considered as:
- An another way to define approximate functional dependencies, which, unlike other approaches, offers rich semantics (context), helping in understanding complex cases when the exact FD does not hold;
- An AFDs discovery algorithm which provides control over how frequent and how consistent this pattern is;
- A building block for many existing data repair algorithms.
- We’ve also made validation of approximate unique column combinations available in Python. This primitive is suitable for defining keys in tables and for detecting partial duplicates over a subset of columns. As is usually the case with any validation primitive, we additionally provide discovery of exceptions and computation of improved thresholds.
For all introduced primitives, we provide descriptive examples. All primitives are supported in the console version of Desbordante, with the help file containing references to papers in which these primitives are described.
Miscellaneous:
- We have established a github organization and gathered all repositories related to our project in one place.
- We have extended the coverage of the option for limiting the maximum size of the left-hand side to all functional dependency discovery algorithms. This should allow users to speed up the FD discovery if they do not need dependencies with large LHSes.
- We’ve added many new example scripts. Since the project is currently under-documented, we hope this will be helpful for our potential users. You can see them here.
- To improve our overall documentation level, we have also published several guides — see the papers section.
Desbordante 1.1.0
Release Notes
Key enhancements of this minor release concern Python bindings. Namely, we've organized our algorithms into intuitive Python submodules based on primitives and we've provided default algorithms for each one, simplifying usage.
Detailed changes are the following:
- Every primitive available in the library now gets its own submodule, mining and verifying are kept separately:
- Every primitive’s submodule contains the structures relevant to it. For example, the UCC class may now be accessed as
desbordante.ucc.UCC. - All algorithms for mining/verifying a primitive are located in the respective primitive’s
algorithmssubmodule. For example, the UccVerifier algorithm may now be accessed asdesbordante.ucc_verification.algorithms.UccVerifier. The same holds true for simple statistics. The algorithm to extract them may be accessed asdesbordante.statistics.algorithms.DataStats. - Every
algorithmssubmodule has a default algorithm for ease of use (example:desbordante.fd.algorithms.Default)
- Every primitive’s submodule contains the structures relevant to it. For example, the UCC class may now be accessed as
- Restored exceptions for metric dependency verification
- Various enhancements to the FD class:
- FD str representation now uses column names instead of indices
- Added FD methods to facilitate easy conversion to Python structures
- Added hashing and equality methods (i.e. FDs can now be inserted into sets)
- The “table” option no longer gets special treatment:
- Removed
.load_data(path, separator, has_header, **kwargs)overload - The option can now be set like a normal one (ex:
algo.load_data(table=(path, separator, has_header), …)oralgo.load_data(table=dataframe, …))
- Removed
- The names and descriptions of options available for an algorithm are now listed in its docstring
- Fixed bug with error option for afd_verification