Skip to content

Conversation

@ychin
Copy link
Contributor

@ychin ychin commented Dec 6, 2025

This is a somewhat rare issue when using histogram to diff files, as the algorithm will generate a diff output that looks redundant and wrong to a human. I provided a synthetic example in the commit message, but for one from the real world, do the following command in the Git repo:

git show -U0 --diff-algorithm=histogram 2c8999027c -- po/ga.po

Scroll to the line "@@ -7239,3 +5831,5 @@", and we can see the following diff hunk:

-#: builtin/diff.c
-msgid "Not a git repository"
-msgstr "Ní stór git"
+msgid "cannot come back to cwd"
+msgstr "ní féidir teacht ar ais chuig cwd"
+
+msgid "Not a git repository"
+msgstr "Ní stór git é"

We can see that the "Not a git repository" line is identical on both sides, which means it should not have been in the diff results to begin with. Under other diff algorithms (or histogram diff with this fix), said line is not considered to be part of the diff.

Also, when I was implementing this, an alternative I was considering was to add a bespoke linear-time algorithm to remove matching lines on both sides. Just calling the fall-back diff seems the easiest and cleanest and so I went with that.

@ychin ychin force-pushed the xdiff-fix-compact-remove-redundant-lines branch from a23cc27 to 9fd5d5b Compare December 6, 2025 12:25
@ychin
Copy link
Contributor Author

ychin commented Dec 6, 2025

/preview

@gitgitgadget-git
Copy link

Preview email sent as [email protected]

@ychin ychin force-pushed the xdiff-fix-compact-remove-redundant-lines branch from 9fd5d5b to e87e267 Compare December 6, 2025 20:24
After a diff algorithm has been run, the compaction phase
(xdl_change_compact()) shifts and merges change groups to produce a
cleaner output. However, this shifting could create a new matched group
where both sides now have matching lines. This results in a
wrong-looking diff output which contains redundant lines that are the
same on both files.

Fix this by detecting this situation, and re-diff the texts on each side
to find similar lines, using the fall-back Myer's diff. Only do this for
histogram diff as it's the only algorithm where this is relevant. Below
contains an example, and more details.

For an example, consider two files below:

    file1:
        A

        A
        A
        A

        A
        A
        A

    file2:
        A

        A
        x
        A

        A
        A
        A

When using Myer's diff, the algorithm finds that only the "x" has been
changed, and produces a final diff result (these are line diffs, but
using word-diff syntax for ease of presentation):

        A A[-A-]{+x+}A AAA

When using histogram diff, the algorithm first discovers the LCS "A
AAA", which it uses as anchor, then produces an intermediate diff:

        {+A Ax+}A AAA[- AAA-].

This is a longer diff than Myer's, but it's still self-consistent.
However, the compaction phase attempts to shift the first file's diff
group upwards (note that this shift crosses the anchor that histogram
had used), leading to the final results for histogram diff:

        [-A AA-]{+A Ax+}A AAA

This is a technically correct patch but looks clearly redundant to a
human as the first 3 lines should not be in the diff.

The fix would detect that a shift has caused matching to a new group,
and re-diff the "A AA" and "A Ax" parts, which results in "A A"
correctly re-marked as unchanged. This creates the now correct histogram
diff:

        A A[-A-]{+x+}A AAA

This issue is not applicable to Myer's diff algorithm as it already
generates a minimal diff, which means a shift cannot result in a smaller
diff output (the default Myer's diff in xdiff is not guaranteed to be
minimal for performance reasons, but it typically does a good enough
job).

It's also not applicable to patience diff, because it uses only unique
lines as anchor for its splits, and falls back to Myer's diff within
each split. Shifting requires both ends having the same lines, and
therefore cannot cross the unique line boundaries established by the
patience algorithm. In contrast histogram diff uses non-unique lines as
anchors, and therefore shifting can cross over them.

This issue is rare in a normal repository. Below is a table of
repositories (`git log --no-merges -p --histogram -1000`), showing how
many times a re-diff was done and how many times it resulted in finding
matching lines (therefore addressing this issue) with the fix. In
general it is fewer than 1% of diff's that exhibit this offending
behavior:

| Repo (1k commits)  | Re-diff | Found matching lines |
|--------------------|---------|----------------------|
| llvm-project       |  45     | 11                   |
| vim                | 110     |  9                   |
| git                |  18     |  2                   |
| WebKit             | 168     |  1                   |
| ripgrep            |  22     |  1                   |
| cpython            |  32     |  0                   |
| vscode             |  13     |  0                   |

Signed-off-by: Yee Cheng Chin <[email protected]>
@ychin ychin force-pushed the xdiff-fix-compact-remove-redundant-lines branch from e87e267 to 34a370e Compare December 6, 2025 20:49
@ychin
Copy link
Contributor Author

ychin commented Dec 6, 2025

/submit

@gitgitgadget-git
Copy link

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-git-2120/ychin/xdiff-fix-compact-remove-redundant-lines-v1

To fetch this version to local tag pr-git-2120/ychin/xdiff-fix-compact-remove-redundant-lines-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-git-2120/ychin/xdiff-fix-compact-remove-redundant-lines-v1

@gitgitgadget-git
Copy link

This patch series was integrated into seen via c4a7679.

@gitgitgadget-git gitgitgadget-git bot added the seen label Dec 7, 2025
@gitgitgadget-git
Copy link

This branch is now known as yc/histogram-hunk-shift-fix.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via a72eca9.

@gitgitgadget-git
Copy link

There was a status update in the "Cooking" section about the branch yc/histogram-hunk-shift-fix on the Git mailing list:

The final clean-up phase of the diff output could turn the result of
histogram diff algorithm suboptimal, which has been corrected.

Comments?
source: <[email protected]>

@gitgitgadget-git
Copy link

This patch series was integrated into seen via df81430.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 0e463b3.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via ab92a48.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 05158b4.

@gitgitgadget-git
Copy link

There was a status update in the "Cooking" section about the branch yc/histogram-hunk-shift-fix on the Git mailing list:

The final clean-up phase of the diff output could turn the result of
histogram diff algorithm suboptimal, which has been corrected.

Comments?
source: <[email protected]>

@gitgitgadget-git
Copy link

This patch series was integrated into seen via a1f5dba.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 3347cb0.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via b0b360c.

@gitgitgadget-git
Copy link

There was a status update in the "Cooking" section about the branch yc/histogram-hunk-shift-fix on the Git mailing list:

The final clean-up phase of the diff output could turn the result of
histogram diff algorithm suboptimal, which has been corrected.

Comments?
source: <[email protected]>

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 83c0974.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 0e3214e.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 9f0f0c2.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via e9cb01f.

@gitgitgadget-git
Copy link

There was a status update in the "Cooking" section about the branch yc/histogram-hunk-shift-fix on the Git mailing list:

The final clean-up phase of the diff output could turn the result of
histogram diff algorithm suboptimal, which has been corrected.

Comments?
source: <[email protected]>

@dhukilabi-lab
Copy link

****

@dhukilabi-lab
Copy link

This is a somewhat rare issue when using histogram to diff files, as the algorithm will generate a diff output that looks redundant and wrong to a human. I provided a synthetic example in the commit message, but for one from the real world, do the following command in the Git repo:

git show -U0 --diff-algorithm=histogram 2c8999027c -- po/ga.po

Scroll to the line "@@ -7239,3 +5831,5 @@", and we can see the following diff hunk:

-#: builtin/diff.c
-msgid "Not a git repository"
-msgstr "Ní stór git"
+msgid "cannot come back to cwd"
+msgstr "ní féidir teacht ar ais chuig cwd"
+
+msgid "Not a git repository"
+msgstr "Ní stór git é"

We can see that the "Not a git repository" line is identical on both sides, which means it should not have been in the diff results to begin with. Under other diff algorithms (or histogram diff with this fix), said line is not considered to be part of the diff.

Also, when I was implementing this, an alternative I was considering was to add a bespoke linear-time algorithm to remove matching lines on both sides. Just calling the fall-back diff seems the easiest and cleanest and so I went with that.

This link send my Gmail

@dhukilabi-lab
Copy link

Send link my Gmail

@gitgitgadget-git
Copy link

There was a status update in the "Cooking" section about the branch yc/histogram-hunk-shift-fix on the Git mailing list:

The final clean-up phase of the diff output could turn the result of
histogram diff algorithm suboptimal, which has been corrected.

Comments?
source: <[email protected]>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants