Description
I think the problem is that we need to track two things but only have one offset.
- An offset, measured in compressed bytes, where the section reader should start decompression from. This offset should only move when an entire section has been 100% consumed. (Tracked by the current `Offset`.)
- An offset, measured in uncompressed bytes, relative to the start of the section reader's buffer. This should be updated to match exactly after each emitted token. (Let's call it `SecondaryOffset`; see the sketch after this list.)
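
To make that concrete, here is a minimal sketch of the two fields. The struct and field names are hypothetical, not the actual `reader.Metadata` fields:

```go
// compressedFileState is a hypothetical illustration of the state that would
// need to be persisted for a compressed file.
type compressedFileState struct {
	// Offset is measured in compressed bytes. It marks where the next section
	// reader should begin decompression and only advances once an entire
	// section has been 100% consumed.
	Offset int64

	// SecondaryOffset is measured in uncompressed bytes, relative to the start
	// of the current section reader's buffer. It is updated to match exactly
	// after each emitted token.
	SecondaryOffset int64
}
```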
The problem remains that any time EOF arrives with a partial token, we don't have enough info to update the primary `Offset`. So then the next time we visit the file, we'd just create another section reader that spans the entire file again.
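
Here's a small, self-contained illustration of why the compressed offset isn't observable mid-stream: the decompressor reads ahead from the underlying file, so the count of compressed bytes pulled doesn't line up with the boundary of the last emitted token. Everything here (the `countingReader`, the sample tokens) is hypothetical and assumes newline-delimited tokens; it's only meant to show the mismatch.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// countingReader counts how many compressed bytes the decompressor has pulled
// from the underlying source.
type countingReader struct {
	r io.Reader
	n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.n += int64(n)
	return n, err
}

func main() {
	// Build a small gzip stream whose last token is incomplete (no newline).
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte("token one\ntoken two\npartial tok"))
	zw.Close()

	cr := &countingReader{r: bytes.NewReader(buf.Bytes())}
	zr, err := gzip.NewReader(cr)
	if err != nil {
		panic(err)
	}

	// Consume exactly the first token's worth of uncompressed bytes.
	first := make([]byte, len("token one\n"))
	if _, err := io.ReadFull(zr, first); err != nil {
		panic(err)
	}

	// The compressed-byte count reflects the decompressor's read-ahead, not the
	// boundary of the emitted token, so it can't be stored as a resumable
	// offset measured in compressed bytes.
	fmt.Printf("uncompressed bytes consumed: %d, compressed bytes pulled: %d\n", len(first), cr.n)
}
```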
I think the solution to this problem is to change the strategy for creating section readers. If we create two small section readers, tokenize through the first, and find the first end of token in the second, then we can emit a token containing the last bytes of the first section and the first bytes of the second. More importantly, we'll also know that the first section is 100% consumed, which allows us to update the primary `Offset`. The size of these two buffers would have to grow dynamically, because sometimes we'd find no end of token in either reader.
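
Roughly, I imagine something like the sketch below. This is not actual fileconsumer code: the names are hypothetical, tokens are assumed to be newline-delimited, and decompression is elided so the two-section mechanics are visible. In the real reader the sections would be fed through the decompressor while the offset stays in compressed bytes.

```go
package fileconsumer

import (
	"bytes"
	"io"
)

// consumeSectionPair reads two bounded sections starting at offset, collects
// every complete token in the first, and looks for the first end of token in
// the second. If it finds one, the tokens are emitted, the first section is
// known to be 100% consumed, and the returned offset advances past it. If no
// end of token is found in the second section, nothing is emitted and ok is
// false, signaling the caller to retry with a larger sectionSize.
func consumeSectionPair(f io.ReaderAt, offset, sectionSize int64, emit func([]byte)) (newOffset int64, ok bool, err error) {
	first := make([]byte, sectionSize)
	n1, err := f.ReadAt(first, offset)
	if err != nil && err != io.EOF {
		return offset, false, err
	}
	first = first[:n1]

	// Collect every complete token in the first section, keeping the trailing remainder.
	var tokens [][]byte
	remainder := first
	for {
		i := bytes.IndexByte(remainder, '\n')
		if i < 0 {
			break
		}
		tokens = append(tokens, remainder[:i])
		remainder = remainder[i+1:]
	}

	// Scan the second section only as far as the first end of token.
	second := make([]byte, sectionSize)
	n2, err := f.ReadAt(second, offset+sectionSize)
	if err != nil && err != io.EOF {
		return offset, false, err
	}
	second = second[:n2]

	j := bytes.IndexByte(second, '\n')
	if j < 0 {
		// No end of token in the second section: emit nothing, keep the offset,
		// and let the caller grow the sections and try again.
		return offset, false, nil
	}

	// Emit the complete tokens from the first section, then the token stitched
	// across the section boundary. The first section is now fully consumed, so
	// the primary offset can safely advance past it.
	for _, t := range tokens {
		emit(t)
	}
	emit(append(append([]byte{}, remainder...), second[:j]...))
	return offset + int64(n1), true, nil
}
```

On `ok == false`, the caller would double `sectionSize` and retry, which is the dynamic growth mentioned above.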
Anyways, this is complicated, but something along these lines seems necessary if we want to accurately consume gzip files. I'm not familiar enough with other compression formats to say whether a similar strategy could help consume those as well.
Bigger picture, I think all this complexity hints that `reader.Reader` should be an interface with multiple implementations. We can determine the file type at reader creation and create the appropriate reader type. Each type of reader can manage offsets and consumption as necessary for its type of compression. We either need to keep a shared `reader.Metadata` for storage, or use fancy unmarshaling to interpret the storage, e.g. check the "type" first, then unmarshal into the corresponding metadata.
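
For illustration, that direction might look something like the sketch below. None of these names exist in the codebase, and both the interface shape and the "check the type first, then unmarshal" pattern are just one possible design:

```go
package reader

import "encoding/json"

// Reader abstracts per-file consumption so that each compression type can
// manage offsets and consumption however that format requires.
type Reader interface {
	ReadToEnd() error   // consume as much of the file as possible, emitting tokens
	Metadata() Metadata // state to persist so consumption can resume later
}

// Metadata is the shared storage envelope: Type is checked first, then State
// is unmarshaled into the corresponding per-type metadata.
type Metadata struct {
	Type  string          `json:"type"`  // e.g. "plain" or "gzip"
	State json.RawMessage `json:"state"` // reader-specific offsets
}

// plainState needs only one offset; gzipState carries the two offsets
// discussed at the top of this issue.
type plainState struct {
	Offset int64 `json:"offset"` // uncompressed bytes consumed
}

type gzipState struct {
	Offset          int64 `json:"offset"`           // compressed bytes: start of next section
	SecondaryOffset int64 `json:"secondary_offset"` // uncompressed bytes within the section
}

// newFromMetadata checks the type first, then unmarshals into the
// corresponding state and builds the matching Reader implementation.
func newFromMetadata(m Metadata) (Reader, error) {
	switch m.Type {
	case "gzip":
		var s gzipState
		if err := json.Unmarshal(m.State, &s); err != nil {
			return nil, err
		}
		return &gzipReader{state: s}, nil
	default:
		var s plainState
		if err := json.Unmarshal(m.State, &s); err != nil {
			return nil, err
		}
		return &plainReader{state: s}, nil
	}
}

// Stub implementations; the real ones would hold the file handle, splitter, etc.
type plainReader struct{ state plainState }

func (r *plainReader) ReadToEnd() error { return nil } // plain consumption elided
func (r *plainReader) Metadata() Metadata {
	b, _ := json.Marshal(r.state)
	return Metadata{Type: "plain", State: b}
}

type gzipReader struct{ state gzipState }

func (r *gzipReader) ReadToEnd() error { return nil } // gzip consumption elided
func (r *gzipReader) Metadata() Metadata {
	b, _ := json.Marshal(r.state)
	return Metadata{Type: "gzip", State: b}
}
```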
Originally posted by @djaglowski in #38510 (comment)