merge_adjacent_files not compacting a table with lots of single row files #536

@h2o1

Description

What happens?

We have a situation where some of our DuckLake tables in our application do not appear to get compacted when calling ducklake_merge_adjacent_files.

One of the tables we have been testing with contains about 2838 rows. Looking through the ducklake_data_files table and filtering by that table's table_id, we see 2555 data files linked to the table, most of them containing only 1 row, with the notable exception of the initial file, which contained 157 rows. There are no entries in ducklake_deleted_files associated with the table, and none of the existing files have lightweight snapshot entries in the partial_file_info column. We also do not have a custom target_file_size set. Yet for some reason the table refuses to be compacted, and as a result queries running against this table experience significant slowdown.

I have attempted to reproduce the issue but have not been able to do so far. I have created tables with approximately the same number of rows and the same distribution among the files, but while I have sometimes seen a table compact only "partially" (in the sense that a handful of single-row files survive the compaction), I have not been able to reproduce a case where no files at all get compacted.

I have attached the rows of the ducklake_data_files table that pertain to the table that we investigated.

ducklake_data_file_tb_13.csv

To Reproduce

Unfortunately I do not have a fully reproducible example at this point, but I am still trying to get one and will update here if I succeed.

OS:

macOS 15.5

DuckDB Version:

1.4.1

DuckLake Version:

f134ad8

DuckDB Client:

Python

Hardware:

No response

Full Name:

Oliver Hsu

Affiliation:

Ascend.io

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

No - Other reason (please specify in the issue body)

Did you include all code required to reproduce the issue?

Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

Yes, I have
