Description
I think the problem is that we need to track two things but only have one offset.
- An offset, measured in compressed bytes, where the section reader should start decompression from. This offset should only move when an entire section has been 100% consumed. (Tracked by the current `Offset`.)
- An offset, measured in uncompressed bytes, relative to the start of the section reader's buffer. This should be updated to match exactly after each emitted token. (Let's call it `SecondaryOffset`; see the sketch after this list.)
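
To make that concrete, here is a minimal sketch of the two fields. The struct and field names are hypothetical, not the actual `reader.Metadata` fields:

```go
// compressedFileState is a hypothetical illustration of the state that would
// need to be persisted for a compressed file.
type compressedFileState struct {
	// Offset is measured in compressed bytes. It marks where the next section
	// reader should begin decompression and only advances once an entire
	// section has been 100% consumed.
	Offset int64

	// SecondaryOffset is measured in uncompressed bytes, relative to the start
	// of the current section reader's buffer. It is updated to match exactly
	// after each emitted token.
	SecondaryOffset int64
}
```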
The problem remains that any time EOF arrives with a partial token, we don't have enough info to update the primary `Offset`. So then the next time we visit the file, we'd just create another section reader that spans the entire file again.
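
Here's a small, self-contained illustration of why the compressed offset isn't observable mid-stream: the decompressor reads ahead from the underlying file, so the count of compressed bytes pulled doesn't line up with the boundary of the last emitted token. Everything here (the `countingReader`, the sample tokens) is hypothetical and assumes newline-delimited tokens; it's only meant to show the mismatch.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// countingReader counts how many compressed bytes the decompressor has pulled
// from the underlying source.
type countingReader struct {
	r io.Reader
	n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.n += int64(n)
	return n, err
}

func main() {
	// Build a small gzip stream whose last token is incomplete (no newline).
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte("token one\ntoken two\npartial tok"))
	zw.Close()

	cr := &countingReader{r: bytes.NewReader(buf.Bytes())}
	zr, err := gzip.NewReader(cr)
	if err != nil {
		panic(err)
	}

	// Consume exactly the first token's worth of uncompressed bytes.
	first := make([]byte, len("token one\n"))
	if _, err := io.ReadFull(zr, first); err != nil {
		panic(err)
	}

	// The compressed-byte count reflects the decompressor's read-ahead, not the
	// boundary of the emitted token, so it can't be stored as a resumable
	// offset measured in compressed bytes.
	fmt.Printf("uncompressed bytes consumed: %d, compressed bytes pulled: %d\n", len(first), cr.n)
}
```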
I think the solution to this problem is to change the strategy for creating section readers. If we create two small section readers, tokenize through the first, and find the first end of token in the second, then we can emit a token containing the last bytes of the first section and the first bytes of the second. More importantly, we'll also know that the first section is 100% consumed, which allows us to update the primary `Offset`. The size of these two buffers would have to grow dynamically, because sometimes we'd find no end of token in either reader.
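
Roughly, I imagine something like the sketch below. This is not actual fileconsumer code: the names are hypothetical, tokens are assumed to be newline-delimited, and decompression is elided so the two-section mechanics are visible. In the real reader the sections would be fed through the decompressor while the offset stays in compressed bytes.

```go
package fileconsumer

import (
	"bytes"
	"io"
)

// consumeSectionPair reads two bounded sections starting at offset, collects
// every complete token in the first, and looks for the first end of token in
// the second. If it finds one, the tokens are emitted, the first section is
// known to be 100% consumed, and the returned offset advances past it. If no
// end of token is found in the second section, nothing is emitted and ok is
// false, signaling the caller to retry with a larger sectionSize.
func consumeSectionPair(f io.ReaderAt, offset, sectionSize int64, emit func([]byte)) (newOffset int64, ok bool, err error) {
	first := make([]byte, sectionSize)
	n1, err := f.ReadAt(first, offset)
	if err != nil && err != io.EOF {
		return offset, false, err
	}
	first = first[:n1]

	// Collect every complete token in the first section, keeping the trailing remainder.
	var tokens [][]byte
	remainder := first
	for {
		i := bytes.IndexByte(remainder, '\n')
		if i < 0 {
			break
		}
		tokens = append(tokens, remainder[:i])
		remainder = remainder[i+1:]
	}

	// Scan the second section only as far as the first end of token.
	second := make([]byte, sectionSize)
	n2, err := f.ReadAt(second, offset+sectionSize)
	if err != nil && err != io.EOF {
		return offset, false, err
	}
	second = second[:n2]

	j := bytes.IndexByte(second, '\n')
	if j < 0 {
		// No end of token in the second section: emit nothing, keep the offset,
		// and let the caller grow the sections and try again.
		return offset, false, nil
	}

	// Emit the complete tokens from the first section, then the token stitched
	// across the section boundary. The first section is now fully consumed, so
	// the primary offset can safely advance past it.
	for _, t := range tokens {
		emit(t)
	}
	emit(append(append([]byte{}, remainder...), second[:j]...))
	return offset + int64(n1), true, nil
}
```

On `ok == false`, the caller would double `sectionSize` and retry, which is the dynamic growth mentioned above.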
Anyways, this is complicated, but something along these lines seems necessary if we want to accurately consume gzip files. I'm not familiar enough with other compression formats to say whether a similar strategy could help consume those as well.
Bigger picture, I think all this complexity hints that `reader.Reader` should be an interface with multiple implementations. We can determine the file type at reader creation and create the appropriate reader type. Each type of reader can manage offsets and consumption as necessary for its type of compression. We either need to keep a shared `reader.Metadata` for storage, or use fancy unmarshaling to interpret the storage, e.g. check the "type" first, then unmarshal into the corresponding metadata.
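
For illustration, that direction might look something like the sketch below. None of these names exist in the codebase, and both the interface shape and the "check the type first, then unmarshal" pattern are just one possible design:

```go
package reader

import "encoding/json"

// Reader abstracts per-file consumption so that each compression type can
// manage offsets and consumption however that format requires.
type Reader interface {
	ReadToEnd() error   // consume as much of the file as possible, emitting tokens
	Metadata() Metadata // state to persist so consumption can resume later
}

// Metadata is the shared storage envelope: Type is checked first, then State
// is unmarshaled into the corresponding per-type metadata.
type Metadata struct {
	Type  string          `json:"type"`  // e.g. "plain" or "gzip"
	State json.RawMessage `json:"state"` // reader-specific offsets
}

// plainState needs only one offset; gzipState carries the two offsets
// discussed at the top of this issue.
type plainState struct {
	Offset int64 `json:"offset"` // uncompressed bytes consumed
}

type gzipState struct {
	Offset          int64 `json:"offset"`           // compressed bytes: start of next section
	SecondaryOffset int64 `json:"secondary_offset"` // uncompressed bytes within the section
}

// newFromMetadata checks the type first, then unmarshals into the
// corresponding state and builds the matching Reader implementation.
func newFromMetadata(m Metadata) (Reader, error) {
	switch m.Type {
	case "gzip":
		var s gzipState
		if err := json.Unmarshal(m.State, &s); err != nil {
			return nil, err
		}
		return &gzipReader{state: s}, nil
	default:
		var s plainState
		if err := json.Unmarshal(m.State, &s); err != nil {
			return nil, err
		}
		return &plainReader{state: s}, nil
	}
}

// Stub implementations; the real ones would hold the file handle, splitter, etc.
type plainReader struct{ state plainState }

func (r *plainReader) ReadToEnd() error { return nil } // plain consumption elided
func (r *plainReader) Metadata() Metadata {
	b, _ := json.Marshal(r.state)
	return Metadata{Type: "plain", State: b}
}

type gzipReader struct{ state gzipState }

func (r *gzipReader) ReadToEnd() error { return nil } // gzip consumption elided
func (r *gzipReader) Metadata() Metadata {
	b, _ := json.Marshal(r.state)
	return Metadata{Type: "gzip", State: b}
}
```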
Originally posted by @djaglowski in #38510 (comment)