[fileconsumer] Gzip offset accuracy #39656

@djaglowski

I think the problem is that we need to track two things but only have one offset.

  1. An offset, measured in compressed bytes, where the section reader should start decompression from. This offset should only move when an entire section has been 100% consumed. (Tracked by the current Offset.)
  2. An offset, measured in uncompressed bytes, relative to the start of the section reader's buffer. This should be updated after every emitted token so that it points exactly at the end of that token. (Let's call it SecondaryOffset.)
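
To make the two offsets concrete, here is a minimal sketch of resuming a read from such a pair. It assumes the primary Offset always lands on the start of a gzip member (e.g. the file is a series of appended gzip streams), since decompression cannot start at an arbitrary compressed byte. The names `offsets`, `resume`, `remaining`, and `gz` are hypothetical and are not part of fileconsumer.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// offsets is a hypothetical pair mirroring the two values described above.
type offsets struct {
	Offset          int64 // compressed bytes; assumed to be the start of a gzip member
	SecondaryOffset int64 // uncompressed bytes already emitted past that point
}

// resume opens a gzip reader at o.Offset and discards the uncompressed
// bytes that were already emitted as tokens on a previous visit.
func resume(src io.ReaderAt, size int64, o offsets) (io.Reader, error) {
	section := io.NewSectionReader(src, o.Offset, size-o.Offset)
	zr, err := gzip.NewReader(section)
	if err != nil {
		return nil, err
	}
	if _, err := io.CopyN(io.Discard, zr, o.SecondaryOffset); err != nil {
		return nil, err
	}
	return zr, nil
}

// remaining returns everything that has not been emitted yet.
func remaining(file []byte, o offsets) (string, error) {
	r, err := resume(bytes.NewReader(file), int64(len(file)), o)
	if err != nil {
		return "", err
	}
	rest, err := io.ReadAll(r)
	return string(rest), err
}

// gz compresses s into a single gzip member (test helper for this sketch).
func gz(s string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(s)); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	first := gz("a\nb\n")
	second := gz("c\nd\ne")
	file := append(append([]byte{}, first...), second...)

	// The first member is fully consumed, and "c\n" and "d\n" (4 bytes)
	// of the second member have been emitted; "e" is a partial token.
	o := offsets{Offset: int64(len(first)), SecondaryOffset: 4}
	rest, err := remaining(file, o)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", rest) // prints "e"
}
```

Note that resuming this way still re-decompresses and throws away SecondaryOffset bytes on every visit, which is why the primary Offset needs to advance whenever it safely can.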

The remaining problem is that any time we reach EOF in the middle of a partial token, we don't have enough information to update the primary Offset. The next time we visit the file, we'd just create another section reader that spans the entire file again.

I think the solution to this problem is to change the strategy for creating section readers. If we create two small section readers, tokenize through the first, and find the first end of token in the second, then we can emit a token with the last bytes of the first section and the first bytes of the second. More importantly, we'll also know that the first section is 100% consumed which allows us to update the primary Offset. The size of these two buffers would have to grow dynamically because sometimes we'd find no end of token in either reader.
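
A rough sketch of the two-window bookkeeping might look like the following. To keep it short it ignores decompression entirely and works on plain newline-delimited bytes, so it only illustrates how the windows and the two offsets interact; `consumePair`, `result`, `readAt`, and `newSource` are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// result reports what one pass over a pair of windows produced.
type result struct {
	Tokens    []string
	Offset    int64 // start of the second window; the first window is fully consumed
	Secondary int64 // bytes of the second window already emitted as part of a token
}

// readAt returns up to size bytes starting at off; a short read at EOF is fine.
func readAt(src io.ReaderAt, off, size int64) ([]byte, error) {
	buf := make([]byte, size)
	n, err := src.ReadAt(buf, off)
	if err != nil && err != io.EOF {
		return nil, err
	}
	return buf[:n], nil
}

// consumePair tokenizes the window starting at offset and uses the adjacent
// window only to finish the token that straddles the boundary. It returns
// ok=false, and emits nothing, when the second window holds no end of token;
// the caller can then retry with a larger size or wait for more data.
func consumePair(src io.ReaderAt, offset, size int64) (result, bool, error) {
	first, err := readAt(src, offset, size)
	if err != nil {
		return result{}, false, err
	}
	second, err := readAt(src, offset+int64(len(first)), size)
	if err != nil {
		return result{}, false, err
	}

	var tokens []string
	rest := first
	for {
		i := bytes.IndexByte(rest, '\n')
		if i < 0 {
			break
		}
		tokens = append(tokens, string(rest[:i]))
		rest = rest[i+1:]
	}

	res := result{Offset: offset + int64(len(first))}
	if len(rest) > 0 {
		i := bytes.IndexByte(second, '\n')
		if i < 0 {
			return result{}, false, nil
		}
		// Last bytes of the first window plus first bytes of the second.
		tokens = append(tokens, string(rest)+string(second[:i]))
		res.Secondary = int64(i + 1)
	}
	res.Tokens = tokens
	return res, true, nil
}

// newSource wraps a string as an io.ReaderAt for the example.
func newSource(s string) io.ReaderAt { return strings.NewReader(s) }

func main() {
	res, ok, err := consumePair(newSource("aa\nbbbb\ncc\n"), 0, 5)
	if err != nil {
		panic(err)
	}
	fmt.Println(ok, res.Tokens, res.Offset, res.Secondary) // true [aa bbbb] 5 3
}
```

In the example the windows are `"aa\nbb"` and `"bb\ncc"`: the straddling token `bbbb` is emitted, the primary offset moves to 5, and the secondary offset records that 3 bytes of the second window are already spoken for.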

Anyway, this is complicated, but something along these lines seems necessary if we want to accurately consume gzip files. I'm not familiar enough with other compression formats to say whether a similar strategy could help consume those as well.

Bigger picture, I think all this complexity hints that reader.Reader should be an interface with multiple implementations. We can determine the file type at reader creation and create the appropriate reader type. Each type of reader can manage offsets and consumption as necessary for its type of compression. We either need to keep a shared reader.Metadata for storage, or use fancy unmarshaling to interpret the storage, e.g. check the "type" first, then unmarshal into the corresponding metadata.
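
As an illustration of the "check the type first" option, one possible shape is a small envelope that stores a type tag next to the raw metadata. This is only a sketch using `encoding/json`; the types `metadata`, `plainMetadata`, `gzipMetadata`, and `envelope`, and the JSON field names, are all hypothetical and do not reflect the existing reader.Metadata.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metadata is a hypothetical interface; each reader type owns its own shape.
type metadata interface {
	Kind() string
}

type plainMetadata struct {
	Offset int64 `json:"offset"`
}

type gzipMetadata struct {
	Offset          int64 `json:"offset"`
	SecondaryOffset int64 `json:"secondary_offset"`
}

func (plainMetadata) Kind() string { return "plain" }
func (gzipMetadata) Kind() string  { return "gzip" }

// envelope is what gets persisted: a type tag plus the undecoded payload.
type envelope struct {
	Type string          `json:"type"`
	Data json.RawMessage `json:"data"`
}

func marshalMetadata(m metadata) ([]byte, error) {
	data, err := json.Marshal(m)
	if err != nil {
		return nil, err
	}
	return json.Marshal(envelope{Type: m.Kind(), Data: data})
}

// unmarshalMetadata reads the tag first, then decodes into the matching type.
func unmarshalMetadata(b []byte) (metadata, error) {
	var env envelope
	if err := json.Unmarshal(b, &env); err != nil {
		return nil, err
	}
	switch env.Type {
	case "plain":
		var m plainMetadata
		if err := json.Unmarshal(env.Data, &m); err != nil {
			return nil, err
		}
		return m, nil
	case "gzip":
		var m gzipMetadata
		if err := json.Unmarshal(env.Data, &m); err != nil {
			return nil, err
		}
		return m, nil
	default:
		return nil, fmt.Errorf("unknown metadata type %q", env.Type)
	}
}

func main() {
	stored, err := marshalMetadata(gzipMetadata{Offset: 10, SecondaryOffset: 3})
	if err != nil {
		panic(err)
	}
	m, err := unmarshalMetadata(stored)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s %+v\n", m.Kind(), m)
}
```

The trade-off against a single shared struct is that each reader type gets exactly the fields it needs, at the cost of an extra decoding step and an explicit failure path for unknown types.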

Originally posted by @djaglowski in #38510 (comment)

Labels: never stale, pkg/stanza/fileconsumer