Skip to content

Conversation

@spiffcs
Copy link
Contributor

@spiffcs spiffcs commented Oct 14, 2025

Description

Add new AI artifact cataloger for Syft that detects and parses GGUF (GPT-Generated Unified Format) model files. This includes a proper package type, parser, and metadata structure so Syft can identify and describe .gguf models like any other package in the artifacts section of the syft sbom.

What’s New

  • ModelPkg type added to syft/pkg/type.go ModelPkg Type = "model"
  • GGUFFileMetadata struct introduced (syft/pkg/gguf.go) with fields for model format, architecture, quantization, tensor counts, etc. A general holding of key value pairs is also included for values seen not as frequently.
  • Header parser (cataloger/aiartifact/parse_gguf.go) that reads the GGUF magic number, extracts key metadata (name, license, architecture, etc.), and applies safety limits for corrupted files.
  • Cataloger (cataloger/aiartifact/cataloger.go) that locates .gguf files, reads just the header (up to 10 MB), and creates type=model packages tagged with ai-artifact, model, gguf, and ml.

Happy to renamespace the cataloger from aiartifact -> anything that comes up in this review

Integration hooks added in package_tasks.go and packagemetadata/names.go. Cataloger will work across both directory and image scans.

Output Support

Syft JSON: includes full gguf-file-metadata with all parsed fields.
CycloneDX 1.6 ML-BOM: emits machine-learning-model components with GGUF metadata encoded as properties.
SPDX: Separate PR

Demo

TODO

Fast Follow

  • OCI Support for Docker Model Registry
  • Hugging Face Source support

Type of change

  • New feature (non-breaking change which adds functionality)
  • Documentation (updates the documentation)

Checklist:

  • I have added unit tests that cover changed behavior
  • I have tested my code in common scenarios and confirmed there are no regressions
  • I have added comments to my code, particularly in hard-to-understand sections

Signed-off-by: Christopher Phillips <[email protected]>
Signed-off-by: Christopher Phillips <[email protected]>
Signed-off-by: Christopher Phillips <[email protected]>
@github-actions github-actions bot added the json-schema Changes the json schema label Oct 14, 2025
@spiffcs spiffcs changed the title feat: 4184 gguf parser (ai artifact cataloger) feat: 4184 gguf parser (ai artifact cataloger) part 1 Oct 14, 2025
@spiffcs spiffcs linked an issue Oct 14, 2025 that may be closed by this pull request
Signed-off-by: Christopher Phillips <[email protected]>
Signed-off-by: Christopher Phillips <[email protected]>
@spiffcs spiffcs marked this pull request as ready for review October 14, 2025 10:09
pkg.ELFBinaryPackageNoteJSONPayload{},
pkg.ElixirMixLockEntry{},
pkg.ErlangRebarLockEntry{},
pkg.GGUFFileMetadata{},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we've scrubbed the terminology "metadata" as a suffix unless it's accurate (see the metadata type naming rules of thumb in the dev docs). What is the essence of what's being captured in this struct? (leaving this open-ended now as I review the rest of the PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're going with GGUFFileHeader here

Comment on lines 173 to 184
assert.Equal(t, metadata.ModelFormat, extractedMetadata.ModelFormat)
assert.Equal(t, metadata.ModelName, extractedMetadata.ModelName)
assert.Equal(t, metadata.ModelVersion, extractedMetadata.ModelVersion)
assert.Equal(t, metadata.License, extractedMetadata.License)
assert.Equal(t, metadata.Architecture, extractedMetadata.Architecture)
assert.Equal(t, metadata.Quantization, extractedMetadata.Quantization)
assert.Equal(t, metadata.Parameters, extractedMetadata.Parameters)
assert.Equal(t, metadata.GGUFVersion, extractedMetadata.GGUFVersion)
assert.Equal(t, metadata.TensorCount, extractedMetadata.TensorCount)
assert.Equal(t, metadata.Hash, extractedMetadata.Hash)
assert.Equal(t, metadata.TruncatedHeader, extractedMetadata.TruncatedHeader)
assert.Equal(t, metadata.Header, extractedMetadata.Header)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, why not a cmp.Diff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function was dropped in the first PR review pass. For ones that were not dropped I'll go back and refactor to use cmp.Diff

syft/pkg/type.go Outdated
case HomebrewPkg:
return "homebrew"
case ModelPkg:
return "generic/model"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like it was preemtively added, but isn't needed yet since the cataloger isn't emitting purls for gguf files

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed from case statement - keeping the ModelPkg type

Package aiartifact provides concrete Cataloger implementations for AI artifacts and machine learning models,
including support for GGUF (GPT-Generated Unified Format) model files.
*/
package aiartifact
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
package aiartifact
package ai

or model , but having artifact here seems unnecessary (that is, we don't have an pythonartifact package, we have a python package).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model seems more right since it aligns with the type and is a little more generic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've renamed this to model

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind - we talked offline about this "Alex 'I'm sold on AI'"

Copy link
Contributor

@wagoodman wagoodman Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was surprised to see lots of manual parsing -- did https://github.com/gpustack/gguf-parser-go not have what we needed?

}

if kvCount > maxKVPairs {
log.Warnf("GGUF file has suspicious KV count: %d (max: %d)", kvCount, maxKVPairs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.Warnf("GGUF file has suspicious KV count: %d (max: %d)", kvCount, maxKVPairs)
log.Debugf("GGUF file has suspicious KV count: %d (max: %d)", kvCount, maxKVPairs)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are more spots in the parser that are warns, but should really be debugs. A warning is to let the user know about something that they might have control to do something about. This is more informational.

}

if tensorCount > maxTensors {
log.Warnf("GGUF file has suspicious tensor count: %d (max: %d)", tensorCount, maxTensors)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.Warnf("GGUF file has suspicious tensor count: %d (max: %d)", tensorCount, maxTensors)
log.Debugf("GGUF file has suspicious tensor count: %d (max: %d)", tensorCount, maxTensors)

Signed-off-by: Christopher Phillips <[email protected]>
@github-actions

This comment has been minimized.

* main:
  chore(deps): update tools to latest versions (#4302)
  chore(deps): bump github.com/github/go-spdx/v2 from 2.3.3 to 2.3.4 (#4301)
  chore(deps): bump github/codeql-action from 4.30.8 to 4.30.9 (#4299)
  support universal (fat) mach-o binary files (#4278)
  chore(deps): bump sigstore/cosign-installer from 3.10.0 to 4.0.0 (#4296)
  chore(deps): bump anchore/sbom-action from 0.20.7 to 0.20.8 (#4297)
  convert posix path back to windows (#4285)
  Remove duplicate image source providers (#4289)
  chore(deps): bump anchore/sbom-action from 0.20.6 to 0.20.7 (#4293)
  feat: add option to fetch remote licenses for pnpm-lock.yaml files (#4286)
  Add PDM parser (#4234)
  chore(deps): update tools to latest versions (#4291)
  fix: panic during java archive maven resolution (#4290)
  Extract zip archive with multiple entries (#4283)
  chore: update to use old configuration on new cosign (#4287)
  chore(deps): update anchore dependencies (#4282)
  chore(deps): bump github.com/mholt/archives from 0.1.3 to 0.1.5 (#4280)
  add docs to configs (#4281)
Signed-off-by: Christopher Phillips <[email protected]>
Signed-off-by: Christopher Phillips <[email protected]>
Signed-off-by: Christopher Phillips <[email protected]>
}

// Compute SHA256 hash
hash := sha256.Sum256(jsonBytes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe swap for xxh64?

"github.com/anchore/syft/syft/pkg/cataloger/generic"
)

const unknownGGUFData = "unknown"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in these cases we should leave the version blank, the post-cataloging compliance checks will cover this (which are user configurable)

result := make(map[string]interface{})

// Limit KV pairs to avoid bloat
const maxKVPairs = 200
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should probably be a cataloger configuration (not app config)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jk -- after we chatted about it we should lean on the fact that this is capped at 50 MB. We should not limit this to 200 entries.

Comment on lines +28 to +29
headerReader := &ggufHeaderReader{reader: reader}
headerData, err := headerReader.readHeader()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
headerReader := &ggufHeaderReader{reader: reader}
headerData, err := headerReader.readHeader()
headerData, err := readHeader(reader)

probably dont need the extra type

"github.com/anchore/syft/syft/pkg/cataloger/internal/pkgtest"
)

func TestGGUFCataloger_Globs(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should be using the same patterns for glob tests we have today (test wiring of regexes to catalogers, dont test functionality) ... e.g.

ExpectsResolverContentQueries(test.expected).

}

// createTestGGUFInDir creates a minimal test GGUF file in the specified directory
func createTestGGUFInDir(t *testing.T, dir, filename string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now this is dead code

expectedRelationships: nil,
},
{
name: "catalog multiple GGUF files",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really a useful test -- this is asserting that the parser function and generic cataloger work well together

expectedRelationships: nil,
},
{
name: "catalog GGUF in nested directories",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really a globs test, can be deleted

Comment on lines +262 to +269
// Create a valid GGUF
validData := newTestGGUFBuilder().
withVersion(3).
withTensorCount(100).
withStringKV("general.architecture", "llama").
withStringKV("general.name", "valid-model").
build()
os.WriteFile(filepath.Join(dir, "valid.gguf"), validData, 0644)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be deleted

tester.TestCataloger(t, NewGGUFCataloger())
}

func TestGGUFCataloger_Name(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont need this test

assert.Equal(t, "gguf-cataloger", cataloger.Name())
}

func TestGGUFCataloger_EmptyDirectory(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be deleted

tester.TestCataloger(t, NewGGUFCataloger())
}

func TestGGUFCataloger_MixedFiles(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be deleted

tester.TestCataloger(t, NewGGUFCataloger())
}

func TestGGUFCataloger_CaseInsensitiveGlob(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be refactored as a glob test

Parameters: 8030000000,
GGUFVersion: 3,
TensorCount: 291,
Header: map[string]any{},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be exercised here

// Header contains the remaining key-value pairs from the GGUF header that are not already
// represented as typed fields above. This preserves additional metadata fields for reference
// (namespaced with general.*, llama.*, etc.) while avoiding duplication.
Header map[string]interface{} `json:"header,omitempty" cyclonedx:"header"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of questions:

  • can this be map[string]string? This isn't using the remain struct tag, so it feels like there is some room here to enforce this as such
  • we've tried to scrub (but not everywhere) map types and use the key-value type instead to enforce stable ordering while preserving the order we found the data in; for example in the java type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that Header is a good enough name here

Comment on lines +10 to +14
// ModelName is the name of the model (from general.name or filename)
ModelName string `json:"modelName" cyclonedx:"modelName"`

// ModelVersion is the version of the model (if available in header, else "unknown")
ModelVersion string `json:"modelVersion,omitempty" cyclonedx:"modelVersion"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could probably be raised up on pkg.Package only and be omitted entirely from this struct

Comment on lines +77 to +85
// If model name is not in metadata, use filename
if syftMetadata.ModelName == "" {
syftMetadata.ModelName = extractModelNameFromPath(reader.Path())
}

// If version is still unknown, try to infer from name
if syftMetadata.ModelVersion == unknownGGUFData {
syftMetadata.ModelVersion = extractVersionFromName(syftMetadata.ModelName)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these could be persisted in a local var and passed into the package contructor instead of being on the metadata struct

Comment on lines +7 to +8
// ModelFormat is always "gguf"
ModelFormat string `json:"modelFormat" cyclonedx:"modelFormat"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should remove this


// extractVersionFromName tries to extract version from model name
func extractVersionFromName(_ string) string {
// Look for version patterns like "v1.0", "1.5b", "3.0", etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this wasn't implemented

License string `json:"license,omitempty" cyclonedx:"license"`

// GGUFVersion is the GGUF format version (e.g., 3)
GGUFVersion uint32 `json:"ggufVersion" cyclonedx:"ggufVersion"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this could be at the top of the struct

}

// computeMetadataHash computes a stable hash of the metadata for use as a global identifier
func computeMetadataHash(metadata *pkg.GGUFFileHeader) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from an earlier conversation I think this is meant to be a hash of all of the KV pairs in sorted order. Then you can tell if the exact model (when given just the header and nothing else) was used in multiple places (in multiple SBOMs for instance). We should be careful to use all KV pairs, not any truncated list.

FileSize int64 `json:"fileSize,omitempty" cyclonedx:"fileSize"`

// Hash is a content hash of the metadata (for stable global identifiers across remotes)
Hash string `json:"hash,omitempty" cyclonedx:"hash"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should beef up teh description for this to explain that this is a hash of all k-v pairs. The name should probably change to convey this too (maybe). And this should be colocated in the struct to the Header field

Comment on lines +15 to +17
if metadata.Hash == "" {
metadata.Hash = computeMetadataHash(metadata)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really teh responsibility of the caller

TensorCount: ggufFile.Header.TensorCount,
Header: convertGGUFMetadataKVs(ggufFile.Header.MetadataKV),
TruncatedHeader: false, // We read the full header
Hash: "", // Will be computed in newGGUFPackage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Hash: "", // Will be computed in newGGUFPackage
Hash: computeHeaderHash(ggufFile.Header.MetadataKV), // Will be computed in newGGUFPackage

GGUFVersion: uint32(ggufFile.Header.Version),
TensorCount: ggufFile.Header.TensorCount,
Header: convertGGUFMetadataKVs(ggufFile.Header.MetadataKV),
TruncatedHeader: false, // We read the full header
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't being set (this is determined earlier but not used)

Header map[string]interface{} `json:"header,omitempty" cyclonedx:"header"`

// TruncatedHeader indicates if the header was truncated during parsing (for very large headers)
TruncatedHeader bool `json:"truncatedHeader,omitempty" cyclonedx:"truncatedHeader"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TruncatedHeader bool `json:"truncatedHeader,omitempty" cyclonedx:"truncatedHeader"`
Truncated bool `json:"truncated,omitempty" cyclonedx:"truncated"`

the field comment can explain this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

json-schema Changes the json schema

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: add support for cataloging GGUF models

3 participants