git clone [email protected]:haneensa/ProvEnhance.git
$ BUILD_BENCHMARK=1 BUILD_TPCH=1 BUILD_LINEAGE=1 BUILD_PYTHON=1 make $ cd tools/pythonpkg $ BUILD_LINEAGE=true python3 setup.py install
cd ProvEnhance/scripts/baselines/provsql
$ ./smokedduck/dbtoaster.sh
$ python3 smokedduck/test_flights.py
$ ./smokedduck/nsf_benchmark.sh
$ ./smokedduck/benchmark.sh
$ ./smokedduck/benchmark_forward_vs_backward.sh
$ python3 fade_scripts/ablation_abalysis.py
$ python3 fade_scripts/dbt.py
$ python3 fade_scripts/search_plot.py
$ python3 fade_scripts/optimizer.py
$ python3 fade_scripts/plot_scale.py
SmokedDuck is a fork of DuckDB instrumented to provide fine-grained (row) provenance also known as lineage. It is designed to capture lineage with low overhead and query lineage quickly. We're released this experimental prototype to support researchers and academics in using lineage to support provenance-powered use cases.
Please reach out to us if you're considering building something interesting with SmokedDuck or if you run into any issues. You can reach us quickly by creating a GitHub issue or emailing us (Eugene Wu, [email protected]; Haneen Mohammed, [email protected]; Charlie Summers, [email protected]).
If you run into performance issues, also please reach out to us. This version of SmokedDuck is a rebuild of a version we wrote a paper on (currently under review) and we haven't re-implemented all performance improvements yet, particularly for lineage querying. We would love to prioritize any issues that are meaningful to folks attempting to use this project.
See the Roadmap section below for the current set of priorities.
- Support Window functions.
- Automatically create a ticket when a new major DuckDB version is released so we don't fall behind.
- Full testing over all operator code paths to ensure complete lineage coverage.
- Track captured lineage metadata so that we need to disable the optimizer during lineage querying.
- Automatically disable short-circuiting when capturing lineage to avoid breaking provenance model semantics.
- Port optimized lineage querying (using low cost lineage indexes) to this version of SmokedDuck.
- Create a Lineage Buffer Pool and explore compressed lineage to ensure we can stay within a user-specified budget of memory/disk.
- Support additional aggregation types including custom aggregation functions in KSemirings.
DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs), and more. For more information on using DuckDB, please refer to the DuckDB documentation.
We currently support running SmokedDuck in Python. If you would like to use another language, please contact us.
pip install smokedduckimport smokedduck
import pandas as pd
# Creating connection
con = smokedduck.connect(':default:')
# Loading example data
p1 = pd.DataFrame({'a': [42, 43, 44, 45], 'b': ['a', 'b', 'a', 'b']})
p2 = pd.DataFrame({'b': ['a', 'a', 'c', 'b'], 'c': [4, 5, 6, 7]})
con.execute('create table t1 as (select * from p1)')
con.execute('create table t2 as (select * from p2)')
# Executing base query
con.execute('SELECT t1.b, sum(a + c) FROM t1 join (select b, avg(c) as c from t2 group by b) as t2 on t1.b = t2.b group by t1.b', capture_lineage='lineage').df()
# Printing lineage that was captured from base query
print(con.lineage().df())Output
t1 t2 out_index
0 0 3 0
1 0 1 0
2 2 2 1
3 0 0 0
The SmokedDuck object returned from smokedduck.connect(...) complies with the same interface as DuckDB's DuckDBPyConnection so check out it's interface for more details. We edit and expand upon the interface in the following ways:
execute(self, query: str, capture_lineage: str = None, parameters: object = None, multiple_parameter_sets: bool = False) -> duckdb.DuckDBPyRelation: changes the existing DuckDBexecute(...)method with the ability to pass in a provenance model to thecapture_lineageparameter as a string. See the Provenance Models section below to see the details. Also returns a DuckDBPyRelation instead of a DuckDBPyConnection.
lineage(self) -> duckdb.DuckDBPyConnection: queries the last executed query with thelineageprovenance model.why(self) -> duckdb.DuckDBPyConnection: queries the last executed query with thewhyprovenance model.polynomial(self) -> duckdb.DuckDBPyConnection: queries the last executed query with thepolynomialprovenance model.ksemimodule(self) -> duckdb.DuckDBPyConnection: queries the last executed query with theksemimoduleprovenance model.backward(self, backward_ids: list, model: str = 'lineage') -> duckdb.DuckDBPyConnection: executes a backward lineage query where the output indexes are filtered by the indexes in the passedbackward_idslist. The provenance model defaults tolineagebut can be overwritten.forward(self, forward_table: str, forward_ids: list, model: str = 'lineage') -> duckdb.DuckDBPyConnection: executes a forward lineage query where the input indexes from theforward_tableare filtered by the indexes in the passedforward_idslist. The provenance model defaults tolineagebut can be overwritten.
The following details how each provenance model affects capturing and querying:
| Provenance Model | Reference | Capture Mechanics | Querying Mechanics |
|---|---|---|---|
lineage |
Yingwei Cui, Jennifer Widom, and Janet L. Wiener. 2000. Tracing the lineage of view data in a warehousing environment. ACM Trans. Database Syst. 25, 2 (June 2000), 179–227. https://doi.org/10.1145/357775.357777 | Lineage captures the index of every tuple that contributes to each row in the output result set. | Lineage querying returns a column for each joined table in the base query. The cell value is the index of the row that contributed to the out_index. If multiple rows from a table contribute to an output such as when there's an aggregation, multiple rows are returned by the lineage query (similar to Perm). Self-joins lead to multiple columns that correspond to that table. |
why |
Peter Buneman, Sanjeev Khanna, and Wang Chiew Tan. 2001. Why and Where: A Characterization of Data Provenance. In Proceedings of the 8th International Conference on Database Theory (ICDT '01). Springer-Verlag, Berlin, Heidelberg, 316–330. | Why provenance captures the same data as Lineage and only differs in querying. | Why provenance produces a single row per out_index, creating a list of lists containing input indexes. Each internal list identifies one set of witnesses that contribute to the query. Base query aggregation results in multiple internal lists. |
polynomial |
Todd J. Green, Grigoris Karvounarakis, and Val Tannen. 2007. Provenance semirings. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS '07). Association for Computing Machinery, New York, NY, USA, 31–40. https://doi.org/10.1145/1265530.1265535 | Polynomial (referring to Provenance Polynomials) captures the same data as Lineage and only differs in querying. | Polynomial returns how each out_index was created in the form of a polynomial. Items multiplied together must co-occur, and those added together can either occur in order for the out_index to exist. In general, joins produce multiplication and aggregations product addition. |
ksemimodule |
Yael Amsterdamer, Daniel Deutch, and Val Tannen. 2011. Provenance for aggregate queries. In Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS '11). Association for Computing Machinery, New York, NY, USA, 153–164. https://doi.org/10.1145/1989284.1989302 | K-Semimodule captures the same data as Lineage, and also captures all query intermediates that are materialized before an aggregation. | Querying K-Semimodule lineage results in a partial recalculation of a base query aggregate over whichever input or output tuples are selected. We're expanding the set of aggregate functions and complexity of queries that are supported. |
Note that all provenance models turn off DuckDB short-circuiting to ensure complete lineage is captured. In particular, semi and anti joins are turned off within the optimizer, since otherwise not all input indexes that match the join condition would be considered. Let us know if you're interested in a provenance capture mode that supports short-circuiting.
For CSV files and Parquet files, data import is as simple as referencing the file in the FROM clause:
SELECT * FROM 'myfile.csv';
SELECT * FROM 'myfile.parquet';Refer to our Data Import section for more information.
The website contains a reference of functions and SQL constructs available in DuckDB.
For development, DuckDB requires CMake, Python3 and a C++11 compliant compiler. Run make in the root directory to compile the sources. For development, use make debug to build a non-optimized debug version. You should run make unit and make allunit to verify that your version works properly after making changes. To test performance, you can run BUILD_BENCHMARK=1 BUILD_TPCH=1 make and then perform several standard benchmarks from the root directory by executing ./build/release/benchmark/benchmark_runner. The detail of benchmarks is in our Benchmark Guide.
For SmokedDuck development, you can build SmokedDuck into a local virtualenv with the following commands:
# Create virtualenv
python -m venv venvPythonLineage
source venvPythonLineage/bin/activate
# Install other useful packages
pip install numpy
pip install pandas
# Build SmokedDuck locally
cd tools/pythonpkg
BUILD_LINEAGE=true python setup.py install
# Run SmokedDuck
cd ../..
python smokedduck/minimal_test.py