Work Unit 005: Ref Inspection in libs/refs #127

@jmgilman

Behavioral Goal

As a sow user evaluating external refs,
I need to inspect OCI refs before downloading them
so that I can understand their contents, validate their structure, and make informed decisions about whether to install them without consuming unnecessary bandwidth.

Success Criteria

  1. Users can run sow refs inspect <url> and see file list, metadata, and validation status within 3 seconds
  2. Inspection uses < 10KB bandwidth regardless of ref size (TOC + manifest only)
  3. Invalid refs are clearly identified before any installation attempt
  4. Directory tree and size estimates help users understand ref contents

Existing Code Context

Explanatory Context

This work unit creates the Inspector component within the new libs/refs module established in Work Unit 003. The inspection capability is a key differentiator for OCI refs over git refs - it enables users to preview contents before committing to a full download.

The implementation builds on the OCI client from Work Unit 003, which wraps github.com/jmgilman/go/oci. The OCI client provides a ListFiles operation that leverages the estargz (seekable tar.gz) format to download only the Table of Contents (TOC) without retrieving file contents. This is the key enabler for bandwidth-efficient inspection.

The existing refs architecture uses a clean interface pattern. The RefType interface (cli/internal/refs/types.go) defines operations like Cache(), Update(), IsStale(). However, inspection is a new capability not present in the current interface - it's specific to OCI refs because git refs require cloning to inspect contents.

The inspection workflow consists of two estargz operations:

  1. ListFiles: Downloads only the TOC (~few KB) which contains file paths, sizes, and modes
  2. Selective Extract: Downloads only .sow-ref.yaml (~1-5KB) for metadata display

Schema validation uses CUE, following the pattern in libs/project/state/validate.go. The validator loads embedded schemas at init time and uses schema.Unify(value).Validate() for structural validation.

Reference List

Core Interface Patterns:

  • libs/exec/executor.go:26-47 - Interface pattern with clear method contracts and go:generate for mocks
  • libs/git/client.go:14-45 - GitHubClient interface demonstrating operation grouping
  • cli/internal/refs/types.go:23-79 - RefType interface that inspection complements

Validation Patterns:

  • libs/project/state/validate.go:46-68 - CUE schema validation pattern with Unify and Validate
  • libs/project/state/validate.go:22-38 - Init-time schema loading from embedded FS
  • libs/schemas/ - Embedded CUE schemas directory

CLI Command Patterns:

  • cli/cmd/refs/add.go:24-72 - Cobra command setup with flags
  • cli/cmd/refs/add.go:74-127 - RunE implementation using manager pattern
  • cli/cmd/refs/add.go:129-163 - Output formatting with Printf and emoji indicators

Manager Patterns:

  • cli/internal/refs/manager.go:12-40 - CacheManager struct and factory functions
  • cli/internal/refs/manager.go:53-91 - Install method showing type inference and caching flow

Existing Documentation Context

Design Document (oci-refs-design.md)

The design document's Inspector section (lines 291-307) specifies the inspection workflow and responsibilities:

  • Call ListFiles to retrieve file list from estargz TOC
  • Parse .sow-ref.yaml via selective extraction
  • Display file count, total size estimate, directory tree
  • Show metadata (title, description, classifications, tags)
  • Validate structure before user commits to download

The Architecture Overview diagrams (lines 143-167) show the consumption flow for inspection:

OCI Registry → ListFiles (TOC only) → Display: "Ref contains 15 files, 2.3MB"

The Non-Functional Requirements (lines 99-107) establish:

  • NFR2: Inspection completes in < 3 seconds
  • NFR6: Security including path traversal protection

Cross-cutting Concepts (arc42-08-concepts-oci-refs.md)

Section 2 details Selective Extraction with estargz:

  • Inspection downloads TOC only (~5-20KB) via ListFiles
  • Selectively extracts .sow-ref.yaml (~1-5KB)
  • Total bandwidth: typically < 10KB
  • Structure validation occurs before suggesting full download

This establishes the key efficiency property: users can evaluate refs without bandwidth cost proportional to ref size.

Discovery Analysis (analysis.md)

Section 5 (OCI Library Integration) confirms:

  • github.com/jmgilman/go/oci provides ListFiles for TOC-only download
  • Expected API: ListFiles(ctx, ref) returns file list from estargz TOC

Section 9.4 (CLI Output) specifies output conventions:

  • Use cmd.Printf() with emoji indicators: success, warning, error

Dependencies

| Work Unit | Dependency Type | Reason |
|-----------|-----------------|--------|
| 003 | Hard prerequisite | Inspector uses OCI client for ListFiles and selective extraction |
| 002 | Hard prerequisite | Inspector validates .sow-ref.yaml against CUE schema |
| 004 | None | Packaging is independent (publishing vs consuming) |
| 006 | Consumer | Installation may call inspection internally for validation |
| 007 | Consumer | CLI commands invoke Inspector API |

Interface Design

Inspector Interface

// Inspector provides pre-download inspection of OCI refs.
// It enables bandwidth-efficient preview by downloading only
// the estargz TOC and manifest, not full file contents.
type Inspector interface {
    // Inspect retrieves metadata and file listing from an OCI ref
    // without downloading the full image contents.
    //
    // This operation uses estargz ListFiles to download only the TOC,
    // then selectively extracts .sow-ref.yaml for metadata.
    // Total bandwidth: typically < 10KB regardless of ref size.
    //
    // The ref parameter accepts OCI URLs:
    //   - ghcr.io/org/repo:tag
    //   - oci://ghcr.io/org/repo:tag
    //   - ghcr.io/org/repo@sha256:...
    //
    // Returns InspectResult with file listing, metadata, and validation status.
    // Returns error if ref doesn't exist or network fails.
    Inspect(ctx context.Context, ref string) (*InspectResult, error)
}
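
The three accepted ref forms listed in the Inspect comment could be normalized before any network call roughly as follows. This is a minimal sketch: ParsedRef and ParseRef are hypothetical names (Work Unit 003 may already provide equivalent parsing), and rejecting refs that carry neither a tag nor a digest is an assumption, not something this work unit specifies.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ParsedRef is a hypothetical normalized form of an OCI ref URL.
type ParsedRef struct {
	Repository string // e.g. "ghcr.io/org/repo"
	Tag        string // set for tag refs
	Digest     string // set for digest refs, e.g. "sha256:..."
}

// ParseRef accepts "host/org/repo:tag", "oci://host/org/repo:tag" and
// "host/org/repo@sha256:...". It is an illustrative helper only.
func ParseRef(raw string) (ParsedRef, error) {
	s := strings.TrimPrefix(strings.TrimSpace(raw), "oci://")
	if s == "" {
		return ParsedRef{}, errors.New("empty ref")
	}
	if strings.Contains(s, "://") {
		return ParsedRef{}, fmt.Errorf("unsupported scheme in ref %q", raw)
	}

	// Digest form: everything after "@" must be a non-empty sha256 digest.
	if repo, digest, ok := strings.Cut(s, "@"); ok {
		if repo == "" || !strings.HasPrefix(digest, "sha256:") || len(digest) == len("sha256:") {
			return ParsedRef{}, fmt.Errorf("invalid digest ref %q", raw)
		}
		return ParsedRef{Repository: repo, Digest: digest}, nil
	}

	// Tag form: the tag separator is a colon after the last slash, so a
	// registry port such as "localhost:5000/repo" is not mistaken for a tag.
	slash := strings.LastIndex(s, "/")
	colon := strings.LastIndex(s, ":")
	if colon > slash {
		repo, tag := s[:colon], s[colon+1:]
		if repo == "" || tag == "" {
			return ParsedRef{}, fmt.Errorf("invalid tag ref %q", raw)
		}
		return ParsedRef{Repository: repo, Tag: tag}, nil
	}

	// Assumption: refs without a tag or digest are rejected rather than
	// defaulting to "latest".
	return ParsedRef{}, fmt.Errorf("ref %q has no tag or digest", raw)
}

func main() {
	for _, raw := range []string{
		"ghcr.io/org/repo:tag",
		"oci://ghcr.io/org/repo:tag",
		"ghcr.io/org/repo",
	} {
		ref, err := ParseRef(raw)
		fmt.Printf("%q -> %+v (err: %v)\n", raw, ref, err)
	}
}
```

Parsing up front keeps malformed input from ever reaching the registry, which also supports the URL validation item under Security Considerations.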

Data Structures

// InspectResult contains all information gathered during ref inspection.
type InspectResult struct {
    // Ref is the original ref URL as provided
    Ref string

    // Digest is the SHA256 digest of the OCI image
    Digest string

    // Files is the complete file listing from the estargz TOC
    Files []FileEntry

    // TotalSize is the estimated total size in bytes
    // Calculated by summing file sizes from TOC
    TotalSize int64

    // FileCount is the number of files in the ref
    FileCount int

    // Manifest is the parsed .sow-ref.yaml content
    // Nil if manifest doesn't exist or couldn't be parsed
    Manifest *RefManifest

    // Valid indicates whether the manifest passed schema validation
    Valid bool

    // ValidationErrors contains detailed validation failure messages
    // Empty if Valid is true
    ValidationErrors []string
}

// FileEntry represents a single file from the estargz TOC.
type FileEntry struct {
    // Path is the file path relative to ref root
    Path string

    // Size is the file size in bytes
    Size int64

    // Mode is the file mode (type and permission bits)
    Mode os.FileMode
}

// RefManifest represents the parsed .sow-ref.yaml content.
// This mirrors the CUE schema from Work Unit 002.
type RefManifest struct {
    SchemaVersion string          `yaml:"schema_version"`
    Ref           RefIdentity     `yaml:"ref"`
    Content       RefContent      `yaml:"content"`
    Provenance    *RefProvenance  `yaml:"provenance,omitempty"`
    Packaging     *RefPackaging   `yaml:"packaging,omitempty"`
    Hints         *RefHints       `yaml:"hints,omitempty"`
    Metadata      map[string]any  `yaml:"metadata,omitempty"`
}

type RefIdentity struct {
    Title string `yaml:"title"`
    Link  string `yaml:"link"`
}

type RefContent struct {
    Description     string           `yaml:"description"`
    Summary         string           `yaml:"summary,omitempty"`
    Classifications []Classification `yaml:"classifications"`
    Tags            []string         `yaml:"tags"`
}

type Classification struct {
    Type        string `yaml:"type"`
    Description string `yaml:"description"`
}

type RefProvenance struct {
    Authors []string `yaml:"authors,omitempty"`
    Created string   `yaml:"created,omitempty"`
    Updated string   `yaml:"updated,omitempty"`
    Source  string   `yaml:"source,omitempty"`
    License string   `yaml:"license,omitempty"`
}

type RefPackaging struct {
    Exclude []string `yaml:"exclude,omitempty"`
}

type RefHints struct {
    SuggestedQueries []string `yaml:"suggested_queries,omitempty"`
    PrimaryFiles     []string `yaml:"primary_files,omitempty"`
}

Implementation Approach

High-Level Flow

Inspect(ctx, ref)
    │
    ├─1─► Parse and validate ref URL
    │     (reuse URL parsing from Work Unit 003)
    │
    ├─2─► Call OCI client ListFiles(ctx, ref)
    │     → Downloads estargz TOC only (~few KB)
    │     → Returns []FileEntry with paths, sizes, modes
    │
    ├─3─► Build directory tree representation
    │     → Calculate total size from file entries
    │     → Count files
    │
    ├─4─► Selective extract .sow-ref.yaml
    │     → Uses OCI client ExtractFile or similar
    │     → Downloads only this one file (~1-5KB)
    │
    ├─5─► Parse YAML and validate against CUE schema
    │     → Use validator from Work Unit 002
    │     → Capture validation errors if any
    │
    └─6─► Return InspectResult
          → File list, size, count
          → Parsed manifest (or nil)
          → Valid flag and any errors
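
Steps 2 through 6 of the flow above can be sketched against a fake client. Everything here is illustrative: the ociClient interface, its ListFiles and ExtractFile signatures, the use of fs.ErrNotExist for a missing manifest, and the validate callback are assumptions standing in for the real OCI client (Work Unit 003) and CUE validator (Work Unit 002). InspectResult is trimmed to the fields the sketch needs, and URL parsing (step 1) is omitted.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io/fs"
)

const manifestPath = ".sow-ref.yaml"

type FileEntry struct {
	Path string
	Size int64
}

// InspectResult is a reduced version of the struct in Data Structures.
type InspectResult struct {
	Ref, Digest      string
	Files            []FileEntry
	TotalSize        int64
	FileCount        int
	Valid            bool
	ValidationErrors []string
}

// ociClient is a hypothetical view of the Work Unit 003 client.
type ociClient interface {
	ListFiles(ctx context.Context, ref string) ([]FileEntry, string, error)
	ExtractFile(ctx context.Context, ref, path string) ([]byte, error)
}

type inspector struct {
	client   ociClient
	validate func(manifest []byte) []string // returns validation messages
}

func (i *inspector) Inspect(ctx context.Context, ref string) (*InspectResult, error) {
	// Step 2: TOC only. A failure here yields an error, never a partial result.
	files, digest, err := i.client.ListFiles(ctx, ref)
	if err != nil {
		return nil, fmt.Errorf("list files for %s: %w", ref, err)
	}
	// Step 3: derive count and total size from the TOC entries.
	res := &InspectResult{Ref: ref, Digest: digest, Files: files, FileCount: len(files)}
	for _, f := range files {
		res.TotalSize += f.Size
	}
	// Step 4: fetch only the manifest; a missing manifest degrades gracefully.
	data, err := i.client.ExtractFile(ctx, ref, manifestPath)
	switch {
	case errors.Is(err, fs.ErrNotExist):
		res.ValidationErrors = []string{manifestPath + " not found"}
		return res, nil
	case err != nil:
		return nil, fmt.Errorf("extract %s: %w", manifestPath, err)
	}
	// Steps 5 and 6: validate and report.
	res.ValidationErrors = i.validate(data)
	res.Valid = len(res.ValidationErrors) == 0
	return res, nil
}

// fakeClient serves files from memory, as a mocked OCI client would in tests.
type fakeClient struct{ files map[string][]byte }

func (f fakeClient) ListFiles(_ context.Context, _ string) ([]FileEntry, string, error) {
	entries := make([]FileEntry, 0, len(f.files))
	for p, b := range f.files {
		entries = append(entries, FileEntry{Path: p, Size: int64(len(b))})
	}
	return entries, "sha256:fake", nil
}

func (f fakeClient) ExtractFile(_ context.Context, _, path string) ([]byte, error) {
	b, ok := f.files[path]
	if !ok {
		return nil, fs.ErrNotExist
	}
	return b, nil
}

// inspectDemo wires the fake client to a validator that accepts any manifest.
func inspectDemo(files map[string][]byte) (*InspectResult, error) {
	ins := &inspector{
		client:   fakeClient{files: files},
		validate: func([]byte) []string { return nil },
	}
	return ins.Inspect(context.Background(), "ghcr.io/org/repo:tag")
}

func main() {
	res, err := inspectDemo(map[string][]byte{"README.md": []byte("hello")})
	fmt.Printf("%+v (err: %v)\n", res, err)
}
```

Keeping the client behind a small interface is what lets the unit tests in the Testing Strategy run without a registry.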

Key Behaviors

  1. TOC-Only Download: The ListFiles call MUST NOT download file contents. It retrieves only the estargz table of contents which contains metadata about files without their actual data.

  2. Single-File Extraction: After getting the TOC, extract only .sow-ref.yaml. This is a targeted download of ~1-5KB, not a full image pull.

  3. Graceful Degradation: If .sow-ref.yaml doesn't exist or is malformed:

    • Still return file list and size information
    • Set Valid = false
    • Include descriptive error in ValidationErrors
    • User can still make informed decision

  4. Digest Capture: The OCI client should return the image digest from the registry. This enables digest pinning on subsequent install.

Error Handling

| Scenario | Behavior |
|----------|----------|
| Network failure | Return error with clear message; don't return partial result |
| Ref not found (404) | Return error: "ref not found: " |
| No .sow-ref.yaml | Return result with Valid=false, Manifest=nil, descriptive message in ValidationErrors |
| Invalid YAML syntax | Return result with Valid=false, parse error in ValidationErrors |
| Schema validation fails | Return result with Valid=false, field-level errors in ValidationErrors |
| Auth required | Return error: "authentication required for " |
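
One way to make the hard-failure rows above distinguishable to callers is a pair of sentinel errors wrapped with context. The sketch below is an assumption-heavy illustration: ErrRefNotFound, ErrAuthRequired and classifyStatus are hypothetical names, placing the ref after each message is a guess at the intended wording, and whether the OCI library exposes registry HTTP status codes at all is not confirmed.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Hypothetical sentinels that callers can match with errors.Is.
var (
	ErrRefNotFound  = errors.New("ref not found")
	ErrAuthRequired = errors.New("authentication required")
)

// classifyStatus maps a registry HTTP status to a wrapped sentinel error.
// It assumes the status code is available to the inspector.
func classifyStatus(ref string, status int) error {
	switch status {
	case http.StatusOK:
		return nil
	case http.StatusNotFound:
		return fmt.Errorf("%w: %s", ErrRefNotFound, ref)
	case http.StatusUnauthorized, http.StatusForbidden:
		return fmt.Errorf("%w for %s", ErrAuthRequired, ref)
	default:
		return fmt.Errorf("unexpected registry status %d for %s", status, ref)
	}
}

func main() {
	err := classifyStatus("ghcr.io/org/private:v1", http.StatusUnauthorized)
	fmt.Println(err)
	fmt.Println(errors.Is(err, ErrAuthRequired))
}
```

Wrapping with %w keeps the message readable for CLI output while still letting the installation flow in Work Unit 006 branch on the error kind.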

Testing Strategy

Unit Tests

Inspector logic tests (with mocked OCI client):

  • Successful inspection with valid manifest
  • Inspection of ref without .sow-ref.yaml
  • Inspection with malformed YAML
  • Inspection with schema validation failures
  • File counting and size calculation accuracy
  • Directory tree building

FileEntry parsing:

  • Various file modes (regular, directory, symlink)
  • Path normalization
  • Size calculation overflow protection

Integration Tests

With test OCI registry:

  • End-to-end inspection of published test ref
  • Verify bandwidth usage (< 10KB)
  • Verify timing (< 3 seconds)
  • Test with refs of varying sizes (1MB, 10MB, 100MB)
  • Verify same result regardless of ref size
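
The bandwidth check above could be implemented by wrapping the HTTP transport and counting response body bytes. The sketch below shows the counting technique against a local httptest server; countingTransport and measure are hypothetical names, and whether the OCI client accepts a custom http.Client or transport is an assumption that would need confirming against the library.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
)

// countingTransport records how many response body bytes the caller reads.
// Headers are not counted.
type countingTransport struct {
	base  http.RoundTripper
	bytes atomic.Int64
}

func (c *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := c.base.RoundTrip(req)
	if err != nil {
		return nil, err
	}
	resp.Body = &countingBody{ReadCloser: resp.Body, n: &c.bytes}
	return resp, nil
}

// countingBody adds every successful read to the shared counter.
type countingBody struct {
	io.ReadCloser
	n *atomic.Int64
}

func (b *countingBody) Read(p []byte) (int, error) {
	n, err := b.ReadCloser.Read(p)
	b.n.Add(int64(n))
	return n, err
}

// measure serves payload from a local test server and returns the number of
// body bytes observed by the counting transport.
func measure(payload string) (int64, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		_, _ = io.WriteString(w, payload)
	}))
	defer srv.Close()

	ct := &countingTransport{base: http.DefaultTransport}
	client := &http.Client{Transport: ct}

	resp, err := client.Get(srv.URL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return 0, err
	}
	return ct.bytes.Load(), nil
}

func main() {
	n, err := measure(strings.Repeat("x", 4096))
	fmt.Println(n, err)
}
```

In an integration test the same transport would sit under the OCI client, and the assertion would compare the final counter against the 10KB budget for each ref size.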

Authentication scenarios:

  • Anonymous access to public ref
  • Authenticated access to private ref
  • Clear error for unauthorized access

Benchmark Tests

  • Measure ListFiles latency for various ref sizes
  • Verify bandwidth is constant regardless of ref size
  • Compare inspection time vs full download time

Performance Requirements

| Metric | Target | Rationale |
|--------|--------|-----------|
| Inspection time | < 3 seconds | Design doc NFR2 |
| Bandwidth | < 10KB | TOC (~5KB) + manifest (~5KB max) |
| Memory | O(file count) | Only store file entries, not contents |

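Performance MUST be independent of the total byte size of the ref. A 1GB ref should inspect about as fast as a 1KB ref with a similar number of files, because file contents are never downloaded. Only the TOC, which grows with file count rather than file size, and the manifest are fetched.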


Security Considerations

  1. Path Traversal: Validate that file paths from the TOC do not contain ../ sequences
  2. Size Limits: Reject TOC if file count > 10,000 (DoS protection)
  3. Manifest Size: Reject .sow-ref.yaml if > 1MB (malicious manifest protection)
  4. URL Validation: Reject malformed or dangerous URLs before network calls
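
The first two checks could look like the sketch below. validatePath and validateTOC are hypothetical names. Checking whole path segments rather than the substring "../" avoids rejecting harmless names such as "a..b", and also rejecting absolute paths is an extra assumption beyond what the list above states.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// maxFileCount is the DoS limit from the Security Considerations list.
const maxFileCount = 10_000

// validatePath rejects empty paths, absolute paths (an added assumption)
// and any path with a ".." segment.
func validatePath(p string) error {
	if p == "" {
		return errors.New("empty path")
	}
	if strings.HasPrefix(p, "/") {
		return fmt.Errorf("absolute path %q", p)
	}
	for _, seg := range strings.Split(p, "/") {
		if seg == ".." {
			return fmt.Errorf("path %q contains a parent-directory segment", p)
		}
	}
	return nil
}

// validateTOC applies the file count limit first, then checks every path.
func validateTOC(paths []string) error {
	if len(paths) > maxFileCount {
		return fmt.Errorf("TOC lists %d files, limit is %d", len(paths), maxFileCount)
	}
	for _, p := range paths {
		if err := validatePath(p); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println(validateTOC([]string{"docs/README.md", ".sow-ref.yaml"}))
	fmt.Println(validateTOC([]string{"docs/../../etc/passwd"}))
}
```

Running the count check before the per-path loop keeps the cost of rejecting an oversized TOC constant.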

CLI Output Format

The CLI command (sow refs inspect) in Work Unit 007 will consume this API. Expected output format:

$ sow refs inspect ghcr.io/myorg/go-standards:v1.0.0

✓ Ref: ghcr.io/myorg/go-standards:v1.0.0
  Digest: sha256:abc123def456...

  Files: 23 files, 2.3 MB total

  Directory Structure:
    docs/
      README.md (12 KB)
      guide.md (45 KB)
      api/
        reference.md (120 KB)
    examples/
      demo.go (3 KB)
    .sow-ref.yaml (2 KB)

  Metadata:
    Title: Go Team Standards
    Link: go-standards
    Description: Team Go coding conventions and best practices.
    Classifications: guidelines
    Tags: golang, conventions, testing
    License: MIT

  Status: ✓ Valid manifest

For invalid refs:

$ sow refs inspect ghcr.io/myorg/bad-ref:v1.0.0

⚠ Ref: ghcr.io/myorg/bad-ref:v1.0.0
  Digest: sha256:xyz789...

  Files: 5 files, 150 KB total

  Status: ✗ Invalid manifest
    - ref.title: required field missing
    - content.classifications: must have at least one entry
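
The Directory Structure portion of the output above can be produced from the flat TOC paths by building a small tree and printing it recursively. This is a minimal sketch: buildTree, render, renderTree and humanSize are hypothetical helpers, sizes use 1024-based units (an assumption), entries are printed in plain alphabetical order (which differs from the ordering in the sample output), and the input is assumed to contain regular files only.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type FileEntry struct {
	Path string
	Size int64
}

// node is a directory (children only) or a file (isFile with size).
type node struct {
	children map[string]*node
	size     int64
	isFile   bool
}

// buildTree turns flat slash-separated paths into a nested tree.
func buildTree(files []FileEntry) *node {
	root := &node{children: map[string]*node{}}
	for _, f := range files {
		cur := root
		parts := strings.Split(strings.Trim(f.Path, "/"), "/")
		for i, part := range parts {
			child, ok := cur.children[part]
			if !ok {
				child = &node{children: map[string]*node{}}
				cur.children[part] = child
			}
			if i == len(parts)-1 {
				child.isFile = true
				child.size = f.Size
			}
			cur = child
		}
	}
	return root
}

// humanSize formats bytes with 1024-based units, truncating KB values.
func humanSize(b int64) string {
	switch {
	case b < 1024:
		return fmt.Sprintf("%d B", b)
	case b < 1024*1024:
		return fmt.Sprintf("%d KB", b/1024)
	default:
		return fmt.Sprintf("%.1f MB", float64(b)/(1024*1024))
	}
}

// render writes the tree with two spaces of indentation per level.
func render(n *node, indent string, sb *strings.Builder) {
	names := make([]string, 0, len(n.children))
	for name := range n.children {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		child := n.children[name]
		if child.isFile {
			fmt.Fprintf(sb, "%s%s (%s)\n", indent, name, humanSize(child.size))
			continue
		}
		fmt.Fprintf(sb, "%s%s/\n", indent, name)
		render(child, indent+"  ", sb)
	}
}

// renderTree is a convenience wrapper returning the rendered tree as text.
func renderTree(files []FileEntry) string {
	var sb strings.Builder
	render(buildTree(files), "", &sb)
	return sb.String()
}

func main() {
	fmt.Print(renderTree([]FileEntry{
		{Path: "docs/README.md", Size: 12 * 1024},
		{Path: "docs/api/reference.md", Size: 120 * 1024},
		{Path: ".sow-ref.yaml", Size: 2 * 1024},
	}))
}
```

Since the tree is built purely from InspectResult.Files, the rendering can live in the CLI layer (Work Unit 007) without further registry calls.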

Implementation Standards

All code produced in this work unit MUST adhere to the following standards:

Code Quality Standards

  • STYLE.md Compliance: All Go code must follow the conventions documented in .standards/STYLE.md
  • TESTING.md Compliance: All tests must follow the patterns documented in .standards/TESTING.md
  • golangci-lint: Code must pass golangci-lint run with zero errors before completion

Required Dependencies

  • OCI Operations: Use github.com/jmgilman/go/oci for all OCI registry operations
    • List files (TOC-only): client.ListFiles(ctx, reference) - downloads only estargz TOC
    • Filtered list: client.ListFilesWithFilter(ctx, reference, patterns...)
    • The library provides bandwidth-efficient inspection via estargz format
  • Filesystem Abstractions: Use github.com/jmgilman/go/fs/core and github.com/jmgilman/go/fs/billy for all file system operations
    • Pass oci.WithFilesystem(fsys) for testability
    • Use billy.NewMemoryFS() in unit tests

Verification Checklist

Before marking this work unit complete, verify:

  • golangci-lint run ./libs/refs/... passes with zero errors
  • All code follows STYLE.md conventions (functional options, error wrapping, etc.)
  • All tests follow TESTING.md patterns (table-driven tests, test helpers, etc.)
  • Unit tests use memory filesystem via billy.NewMemoryFS() where applicable

Acceptance Criteria

  1. Inspector interface defined in libs/refs/inspector.go
  2. InspectResult, FileEntry, RefManifest types implemented
  3. Implementation uses OCI client ListFiles for TOC-only download
  4. Implementation selectively extracts only .sow-ref.yaml
  5. Schema validation uses CUE validator from Work Unit 002
  6. Graceful handling when manifest missing or invalid
  7. Unit tests with mocked OCI client achieve >80% coverage
  8. Integration test verifies < 10KB bandwidth
  9. Integration test verifies < 3 second completion
  10. Security validations (path traversal, size limits) implemented
  11. Mock generation via go:generate directive

Out of Scope

  • CLI command implementation → Work Unit 007
  • Full image download → Work Unit 006 (Installation)
  • Publishing/packaging → Work Unit 004
  • OCI client implementation → Work Unit 003
  • CUE schema definition → Work Unit 002
  • Cache management → Work Unit 006

References

| Document | Relevance |
|----------|-----------|
| .sow/knowledge/designs/oci-refs/oci-refs-design.md lines 291-307 | Inspector component specification |
| .sow/knowledge/designs/oci-refs/oci-refs-design.md lines 143-167 | Inspection flow diagram |
| .sow/knowledge/designs/oci-refs/arc42-08-concepts-oci-refs.md Section 2 | Selective extraction concept |
| .sow/project/discovery/analysis.md Sections 5, 9.4 | OCI library API, CLI output conventions |
| libs/exec/executor.go | Interface pattern reference |
| libs/project/state/validate.go | CUE validation pattern |
| cli/cmd/refs/add.go | CLI command pattern |
