A better chunking algorithm? #1665

rgmz · 2024-12-11T03:42:51Z

Description:

This is a follow-up to #1652.

It attempts to 'intelligently' chunk inputs on subsequent newlines \n\n, rather than indiscriminately breaking text apart. The assumption is that it's safe(r) for a chunk to end at (1), rather than (2) or (3). Of course, the buffer cannot grow indefinitely or the original issue would regress; it will only search for consecutive newlines up to a maxPeekSize.

  // TODO: optimization could be introduced here
  if mimetype, err := filetype.Match(buf[:n]); err != nil {
    return nil
  } else if mimetype.MIME.Type == "application" {
    return nil // skip binary files
  }

>>> (1) <<<  // If the chunk doesn't end in a newli>>> (2) <<<ne, peek|maxPeekSize| until we find one.
  // This hopefully avoids splitting
>>> (3) <<< // See: https://github.com/gitleaks/gitleaks/issues/1651
  var (

The original issue with 60 private keys is solved by this change:

$ ./gitleaks dir /tmp/60keys.txt --no-color

10:22PM INF scan completed in 88ms
10:22PM WRN leaks found: 60

Thoughts

Needs to be tested on Windows or files with CRLF.
Reading one byte at a time might be slow (idk, seems pretty fast)
It might make sense for whitespace not to reset the counter. That way empty lines can match the criteria.
```
  }
   
// Foo
```

Checklist:

Does your PR pass tests?
Have you written new tests for your changes?
Have you lint your code locally prior to submission?

zricethezav · 2024-12-12T16:00:56Z

@rgmz

Needs to be tested on Windows or files with CRLF.

Ideally I would have a windows machine to test on. We could just merge without testing on windows and if an issue arrises patch it real quick.

It might make sense for whitespace not to reset the counter. That way empty lines can match the criteria.

Could you elaborate on this a little bit?

rgmz · 2024-12-12T16:18:12Z

Could you elaborate on this a little bit?

It's common to have blank lines only consisting of whitespace.

Foo\n
<tab>\n
Bar

The current logic resets the counter if \n isn't followed by another \n. In my mind, the above is as good of a place for the chunk to end as an empty line.

Foo\n
\n
Bar

zricethezav · 2024-12-12T16:27:11Z

Gotcha, I agree with you that the condition for the counter should be expanded. It doesn't have to be in this PR, we can always follow up.

feat(dir): better chunking

a58258e

zricethezav approved these changes Dec 12, 2024

View reviewed changes

zricethezav merged commit 0bf13fc into gitleaks:master Dec 12, 2024
1 check passed

rgmz deleted the feat/selective-chunking branch December 12, 2024 16:27

This was referenced Dec 20, 2024

Gitleaks detection is inconsistent in large files #1432

Closed

Refactor chunking algorithm changes #1676

Merged

alayne222 pushed a commit to alayne222/gitleaks that referenced this pull request May 28, 2025

feat(dir): better chunking (gitleaks#1665)

be7186b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

A better chunking algorithm? #1665

A better chunking algorithm? #1665

Uh oh!

rgmz commented Dec 11, 2024 •

edited

Loading

Uh oh!

zricethezav commented Dec 12, 2024

Uh oh!

rgmz commented Dec 12, 2024

Uh oh!

zricethezav commented Dec 12, 2024

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

A better chunking algorithm? #1665

A better chunking algorithm? #1665

Uh oh!

Conversation

rgmz commented Dec 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description:

Thoughts

Checklist:

Uh oh!

zricethezav commented Dec 12, 2024

Uh oh!

rgmz commented Dec 12, 2024

Uh oh!

zricethezav commented Dec 12, 2024

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rgmz commented Dec 11, 2024 •

edited

Loading