-
Notifications
You must be signed in to change notification settings - Fork 731
feat: 4184 gguf parser (ai artifact cataloger) part 1 #4279
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Christopher Phillips <[email protected]>
Signed-off-by: Christopher Phillips <[email protected]>
Signed-off-by: Christopher Phillips <[email protected]>
Signed-off-by: Christopher Phillips <[email protected]>
Signed-off-by: Christopher Phillips <[email protected]>
Signed-off-by: Christopher Phillips <[email protected]>
| pkg.ELFBinaryPackageNoteJSONPayload{}, | ||
| pkg.ElixirMixLockEntry{}, | ||
| pkg.ErlangRebarLockEntry{}, | ||
| pkg.GGUFFileMetadata{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we've scrubbed the terminology "metadata" as a suffix unless it's accurate (see the metadata type naming rules of thumb in the dev docs). What is the essence of what's being captured in this struct? (leaving this open-ended now as I review the rest of the PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're going with GGUFFileHeader here
| assert.Equal(t, metadata.ModelFormat, extractedMetadata.ModelFormat) | ||
| assert.Equal(t, metadata.ModelName, extractedMetadata.ModelName) | ||
| assert.Equal(t, metadata.ModelVersion, extractedMetadata.ModelVersion) | ||
| assert.Equal(t, metadata.License, extractedMetadata.License) | ||
| assert.Equal(t, metadata.Architecture, extractedMetadata.Architecture) | ||
| assert.Equal(t, metadata.Quantization, extractedMetadata.Quantization) | ||
| assert.Equal(t, metadata.Parameters, extractedMetadata.Parameters) | ||
| assert.Equal(t, metadata.GGUFVersion, extractedMetadata.GGUFVersion) | ||
| assert.Equal(t, metadata.TensorCount, extractedMetadata.TensorCount) | ||
| assert.Equal(t, metadata.Hash, extractedMetadata.Hash) | ||
| assert.Equal(t, metadata.TruncatedHeader, extractedMetadata.TruncatedHeader) | ||
| assert.Equal(t, metadata.Header, extractedMetadata.Header) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, why not a cmp.Diff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function was dropped in the first PR review pass. For ones that were not dropped I'll go back and refactor to use cmp.Diff
syft/pkg/type.go
Outdated
| case HomebrewPkg: | ||
| return "homebrew" | ||
| case ModelPkg: | ||
| return "generic/model" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks like it was preemtively added, but isn't needed yet since the cataloger isn't emitting purls for gguf files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed from case statement - keeping the ModelPkg type
| Package aiartifact provides concrete Cataloger implementations for AI artifacts and machine learning models, | ||
| including support for GGUF (GPT-Generated Unified Format) model files. | ||
| */ | ||
| package aiartifact |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| package aiartifact | |
| package ai |
or model , but having artifact here seems unnecessary (that is, we don't have an pythonartifact package, we have a python package).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
model seems more right since it aligns with the type and is a little more generic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've renamed this to model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind - we talked offline about this "Alex 'I'm sold on AI'"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was surprised to see lots of manual parsing -- did https://github.com/gpustack/gguf-parser-go not have what we needed?
| } | ||
|
|
||
| if kvCount > maxKVPairs { | ||
| log.Warnf("GGUF file has suspicious KV count: %d (max: %d)", kvCount, maxKVPairs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| log.Warnf("GGUF file has suspicious KV count: %d (max: %d)", kvCount, maxKVPairs) | |
| log.Debugf("GGUF file has suspicious KV count: %d (max: %d)", kvCount, maxKVPairs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are more spots in the parser that are warns, but should really be debugs. A warning is to let the user know about something that they might have control to do something about. This is more informational.
| } | ||
|
|
||
| if tensorCount > maxTensors { | ||
| log.Warnf("GGUF file has suspicious tensor count: %d (max: %d)", tensorCount, maxTensors) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| log.Warnf("GGUF file has suspicious tensor count: %d (max: %d)", tensorCount, maxTensors) | |
| log.Debugf("GGUF file has suspicious tensor count: %d (max: %d)", tensorCount, maxTensors) |
Signed-off-by: Christopher Phillips <[email protected]>
This comment has been minimized.
This comment has been minimized.
Signed-off-by: Christopher Phillips <[email protected]>
Signed-off-by: Christopher Phillips <[email protected]>
* main: chore(deps): update tools to latest versions (#4302) chore(deps): bump github.com/github/go-spdx/v2 from 2.3.3 to 2.3.4 (#4301) chore(deps): bump github/codeql-action from 4.30.8 to 4.30.9 (#4299) support universal (fat) mach-o binary files (#4278) chore(deps): bump sigstore/cosign-installer from 3.10.0 to 4.0.0 (#4296) chore(deps): bump anchore/sbom-action from 0.20.7 to 0.20.8 (#4297) convert posix path back to windows (#4285) Remove duplicate image source providers (#4289) chore(deps): bump anchore/sbom-action from 0.20.6 to 0.20.7 (#4293) feat: add option to fetch remote licenses for pnpm-lock.yaml files (#4286) Add PDM parser (#4234) chore(deps): update tools to latest versions (#4291) fix: panic during java archive maven resolution (#4290) Extract zip archive with multiple entries (#4283) chore: update to use old configuration on new cosign (#4287) chore(deps): update anchore dependencies (#4282) chore(deps): bump github.com/mholt/archives from 0.1.3 to 0.1.5 (#4280) add docs to configs (#4281)
Signed-off-by: Christopher Phillips <[email protected]>
Signed-off-by: Christopher Phillips <[email protected]>
Signed-off-by: Christopher Phillips <[email protected]>
| } | ||
|
|
||
| // Compute SHA256 hash | ||
| hash := sha256.Sum256(jsonBytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe swap for xxh64?
| "github.com/anchore/syft/syft/pkg/cataloger/generic" | ||
| ) | ||
|
|
||
| const unknownGGUFData = "unknown" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in these cases we should leave the version blank, the post-cataloging compliance checks will cover this (which are user configurable)
| result := make(map[string]interface{}) | ||
|
|
||
| // Limit KV pairs to avoid bloat | ||
| const maxKVPairs = 200 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should probably be a cataloger configuration (not app config)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jk -- after we chatted about it we should lean on the fact that this is capped at 50 MB. We should not limit this to 200 entries.
| headerReader := &ggufHeaderReader{reader: reader} | ||
| headerData, err := headerReader.readHeader() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| headerReader := &ggufHeaderReader{reader: reader} | |
| headerData, err := headerReader.readHeader() | |
| headerData, err := readHeader(reader) |
probably dont need the extra type
| "github.com/anchore/syft/syft/pkg/cataloger/internal/pkgtest" | ||
| ) | ||
|
|
||
| func TestGGUFCataloger_Globs(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these should be using the same patterns for glob tests we have today (test wiring of regexes to catalogers, dont test functionality) ... e.g.
| ExpectsResolverContentQueries(test.expected). |
| } | ||
|
|
||
| // createTestGGUFInDir creates a minimal test GGUF file in the specified directory | ||
| func createTestGGUFInDir(t *testing.T, dir, filename string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now this is dead code
| expectedRelationships: nil, | ||
| }, | ||
| { | ||
| name: "catalog multiple GGUF files", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really a useful test -- this is asserting that the parser function and generic cataloger work well together
| expectedRelationships: nil, | ||
| }, | ||
| { | ||
| name: "catalog GGUF in nested directories", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really a globs test, can be deleted
| // Create a valid GGUF | ||
| validData := newTestGGUFBuilder(). | ||
| withVersion(3). | ||
| withTensorCount(100). | ||
| withStringKV("general.architecture", "llama"). | ||
| withStringKV("general.name", "valid-model"). | ||
| build() | ||
| os.WriteFile(filepath.Join(dir, "valid.gguf"), validData, 0644) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be deleted
| tester.TestCataloger(t, NewGGUFCataloger()) | ||
| } | ||
|
|
||
| func TestGGUFCataloger_Name(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont need this test
| assert.Equal(t, "gguf-cataloger", cataloger.Name()) | ||
| } | ||
|
|
||
| func TestGGUFCataloger_EmptyDirectory(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be deleted
| tester.TestCataloger(t, NewGGUFCataloger()) | ||
| } | ||
|
|
||
| func TestGGUFCataloger_MixedFiles(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be deleted
| tester.TestCataloger(t, NewGGUFCataloger()) | ||
| } | ||
|
|
||
| func TestGGUFCataloger_CaseInsensitiveGlob(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be refactored as a glob test
| Parameters: 8030000000, | ||
| GGUFVersion: 3, | ||
| TensorCount: 291, | ||
| Header: map[string]any{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be exercised here
| // Header contains the remaining key-value pairs from the GGUF header that are not already | ||
| // represented as typed fields above. This preserves additional metadata fields for reference | ||
| // (namespaced with general.*, llama.*, etc.) while avoiding duplication. | ||
| Header map[string]interface{} `json:"header,omitempty" cyclonedx:"header"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of questions:
- can this be map[string]string? This isn't using the
remainstruct tag, so it feels like there is some room here to enforce this as such - we've tried to scrub (but not everywhere) map types and use the key-value type instead to enforce stable ordering while preserving the order we found the data in; for example in the java type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that Header is a good enough name here
| // ModelName is the name of the model (from general.name or filename) | ||
| ModelName string `json:"modelName" cyclonedx:"modelName"` | ||
|
|
||
| // ModelVersion is the version of the model (if available in header, else "unknown") | ||
| ModelVersion string `json:"modelVersion,omitempty" cyclonedx:"modelVersion"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could probably be raised up on pkg.Package only and be omitted entirely from this struct
| // If model name is not in metadata, use filename | ||
| if syftMetadata.ModelName == "" { | ||
| syftMetadata.ModelName = extractModelNameFromPath(reader.Path()) | ||
| } | ||
|
|
||
| // If version is still unknown, try to infer from name | ||
| if syftMetadata.ModelVersion == unknownGGUFData { | ||
| syftMetadata.ModelVersion = extractVersionFromName(syftMetadata.ModelName) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these could be persisted in a local var and passed into the package contructor instead of being on the metadata struct
| // ModelFormat is always "gguf" | ||
| ModelFormat string `json:"modelFormat" cyclonedx:"modelFormat"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should remove this
|
|
||
| // extractVersionFromName tries to extract version from model name | ||
| func extractVersionFromName(_ string) string { | ||
| // Look for version patterns like "v1.0", "1.5b", "3.0", etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this wasn't implemented
| License string `json:"license,omitempty" cyclonedx:"license"` | ||
|
|
||
| // GGUFVersion is the GGUF format version (e.g., 3) | ||
| GGUFVersion uint32 `json:"ggufVersion" cyclonedx:"ggufVersion"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this could be at the top of the struct
| } | ||
|
|
||
| // computeMetadataHash computes a stable hash of the metadata for use as a global identifier | ||
| func computeMetadataHash(metadata *pkg.GGUFFileHeader) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from an earlier conversation I think this is meant to be a hash of all of the KV pairs in sorted order. Then you can tell if the exact model (when given just the header and nothing else) was used in multiple places (in multiple SBOMs for instance). We should be careful to use all KV pairs, not any truncated list.
| FileSize int64 `json:"fileSize,omitempty" cyclonedx:"fileSize"` | ||
|
|
||
| // Hash is a content hash of the metadata (for stable global identifiers across remotes) | ||
| Hash string `json:"hash,omitempty" cyclonedx:"hash"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should beef up teh description for this to explain that this is a hash of all k-v pairs. The name should probably change to convey this too (maybe). And this should be colocated in the struct to the Header field
| if metadata.Hash == "" { | ||
| metadata.Hash = computeMetadataHash(metadata) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really teh responsibility of the caller
| TensorCount: ggufFile.Header.TensorCount, | ||
| Header: convertGGUFMetadataKVs(ggufFile.Header.MetadataKV), | ||
| TruncatedHeader: false, // We read the full header | ||
| Hash: "", // Will be computed in newGGUFPackage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Hash: "", // Will be computed in newGGUFPackage | |
| Hash: computeHeaderHash(ggufFile.Header.MetadataKV), // Will be computed in newGGUFPackage |
| GGUFVersion: uint32(ggufFile.Header.Version), | ||
| TensorCount: ggufFile.Header.TensorCount, | ||
| Header: convertGGUFMetadataKVs(ggufFile.Header.MetadataKV), | ||
| TruncatedHeader: false, // We read the full header |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't being set (this is determined earlier but not used)
| Header map[string]interface{} `json:"header,omitempty" cyclonedx:"header"` | ||
|
|
||
| // TruncatedHeader indicates if the header was truncated during parsing (for very large headers) | ||
| TruncatedHeader bool `json:"truncatedHeader,omitempty" cyclonedx:"truncatedHeader"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| TruncatedHeader bool `json:"truncatedHeader,omitempty" cyclonedx:"truncatedHeader"` | |
| Truncated bool `json:"truncated,omitempty" cyclonedx:"truncated"` |
the field comment can explain this.
Description
Add new AI artifact cataloger for Syft that detects and parses GGUF (GPT-Generated Unified Format) model files. This includes a proper package type, parser, and metadata structure so Syft can identify and describe .gguf models like any other package in the
artifactssection of the syft sbom.What’s New
ModelPkg Type = "model"Happy to renamespace the cataloger from
aiartifact-> anything that comes up in this reviewIntegration hooks added in package_tasks.go and packagemetadata/names.go. Cataloger will work across both directory and image scans.
Output Support
Syft JSON: includes full gguf-file-metadata with all parsed fields.
CycloneDX 1.6 ML-BOM: emits machine-learning-model components with GGUF metadata encoded as properties.
SPDX: Separate PR
Demo
TODO
Fast Follow
Type of change
Checklist: