Architecture

How all the projects and datasets fit together

Anchore’s open source security tooling consists of several interconnected tools that work together to detect vulnerabilities and ensure license compliance in software packages. This page explains how these tools interact and how data flows through the system.

The Anchore OSS ecosystem includes five main tools that, at the 30,000 ft view, work together as follows:

---
config:
  layout: dagre
  look: handDrawn
  theme: default
  flowchart:
    curve: linear
---
flowchart TD
    vunnel["***Vunnel***<br><small>Downloads and normalizes<br>security feeds</small>"]:::Ash
    grypedb["***Grype DB***<br><small>Converts feeds to<br>SQLite database</small>"]:::Ash
    grype["***Grype***<br><small>Matches vulnerabilities<br>from SBOM + database</small>"]:::Ash
    syft["***Syft***<br><small>Generates SBOMs from<br>scan targets</small>"]:::Ash
    grant["***Grant***<br><small>Analyzes licenses<br>from SBOM</small>"]:::Ash

    vunnel --> grypedb --> grype
    syft --> grype & grant

    vunnel@{ shape: event}
    grypedb@{ shape: event}
    grype@{ shape: event}
    syft@{ shape: event}
    grant@{ shape: event}

    classDef Ash stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#e1ffe1, color:#000000

Zooming in to the 20,000 ft view, here’s how data flows through the same system:

---
config:
  layout: dagre
  look: handDrawn
  theme: default
  flowchart:
    curve: linear
---
flowchart TB

  feed1["NVD Feed"]
  feed2["Alpine Feed"]
  feed3["... (20+ feeds)"]

  subgraph anchore["<b>Anchore Infrastructure</b>"]
    vunnel["Vunnel"]
    grypedb["Grype DB"]
    cache["Daily DB"]
    vunnel --> grypedb --> cache
  end


  subgraph user["<b>User Environment</b>"]
    targets["Image, filesystem,<br>PURLs, directory, ..."]
    local["DB Cache"]

    syft["Syft"]
    sbom["SBOM"]

    targets --> syft --> sbom

    grype["Grype"]
    vulns["Vulnerability+Package<br>Matches"]
    grant["Grant"]
    licenses["License Compliance<br>Report"]

    grype --> vulns
    grant --> licenses

    sbom --> grype
    sbom --> grant
    local --> grype
  end

  feed1 --> vunnel
  feed2 --> vunnel
  feed3 -.-> vunnel

  cache -. "<i>download</i>" .-> local

  feed1:::ExternalSource@{ shape: cloud}
  feed2:::ExternalSource@{ shape: cloud}
  feed3:::ExternalSource@{ shape: cloud}
  vunnel:::Application@{ shape: event}
  grypedb:::Application@{ shape: event}
  grype:::Application@{ shape: event}
  syft:::Application@{ shape: event}
  grant:::Application@{ shape: event}

  targets:::AnalysisInput
  cache:::Database@{ shape: db}
  local:::Database@{ shape: db}
  sbom:::Document@{ shape: doc}
  vulns:::Document@{ shape: doc}
  licenses:::Document@{ shape: doc}

  style anchore fill:none, stroke:#333333, stroke-width:2px, stroke-dasharray:5 5
  style user fill:none, stroke:#333333, stroke-width:2px, stroke-dasharray:5 5

  classDef AnalysisInput stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#f0f8ff, color:#000000
  classDef ExternalSource stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#f0f8ff, color:#000000
  classDef Application stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#e1ffe1, color:#000000
  classDef Document stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#fff9c4, color:#000000
  classDef Database stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#fff9c4, color:#000000

1 - Go CLI patterns

The common patterns used in our Go-based CLIs

This document explains how the Go-based Anchore OSS tools are organized, covering the package structure, core architectural concepts, and where key functionality is implemented.

Use this as a reference when trying to familiarize yourself with the overall structure of Syft, Grype, or other applications.

CLI

The cmd package uses the Clio framework (built on top of spf13/cobra and spf13/viper) to manage flag/argument parsing, configuration, and command execution.

All flags, arguments, and configuration values are represented in the application as structs. Each command tends to get its own struct containing all options the command needs to function. Common options or sets of options can be defined independently and reused across commands, composed within each command struct that needs them.

Options that represent flags are registered with the AddFlags method defined on the command struct (or on each option struct used within the command struct). If elements of a command or option struct need additional processing before being used in the application, define a PostLoad method on the struct to mutate them.

In terms of what is executed when: all processing is done within the selected cobra command’s PreRun hook, wrapping any potential user-provided hook. This means that all of this fits nicely into the existing cobra command lifecycle.
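As an illustration, here is a self-contained sketch of the options-struct pattern described above. The FlagSet interface and method signatures are simplified stand-ins, not Clio's actual API:

```go
package main

import (
	"fmt"
	"strings"
)

// FlagSet is a simplified stand-in for the flag-registration interface a
// framework like Clio provides (hypothetical, not the real API).
type FlagSet interface {
	StringVarP(p *string, name, short, usage string)
}

// OutputOptions is a reusable option set that can be composed into commands.
type OutputOptions struct {
	Format string `yaml:"format"`
}

// AddFlags registers the flags this option set contributes to a command.
func (o *OutputOptions) AddFlags(flags FlagSet) {
	flags.StringVarP(&o.Format, "output", "o", "report output format")
}

// PostLoad normalizes and validates values after config files, environment
// variables, and flags have all been merged.
func (o *OutputOptions) PostLoad() error {
	o.Format = strings.ToLower(o.Format)
	switch o.Format {
	case "json", "table":
		return nil
	}
	return fmt.Errorf("unsupported output format: %q", o.Format)
}

// ScanOptions composes the shared OutputOptions with command-specific fields.
type ScanOptions struct {
	OutputOptions `yaml:",inline"`
	Target        string `yaml:"target"`
}
```

Because PostLoad runs after all configuration sources are merged, validation logic lives in one place regardless of whether a value came from a flag, an environment variable, or a config file.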

See the sign command in Quill for a small example of all of this together.

The reason for this approach is to smooth over the rough edges between cobra and viper, which offer multiple ways to configure and use functionality, and to provide a single way to specify any input into the application. Being prescriptive about these approaches has allowed us to take many shared concerns that used to require a lot of boilerplate when creating an application and put them into one framework: Clio.

Execution flow

The following diagrams show the execution of a typical Anchore application at different levels of detail, using the scan command in Syft as a representative example:

sequenceDiagram
    actor user as User
    participant syft as Syft Application
    participant cmd as Command Handler (Cobra)
    participant lib as Library

    user->>syft: syft scan alpine:latest
    syft->>cmd: Execute
    cmd->>cmd: Initialize & Load Configuration
    cmd->>lib: Execute Scan Logic
    lib->>cmd: SBOM
    cmd-->>user: Display/Write SBOM

sequenceDiagram
    actor user as User

    box rgba(0,0,0,.1) Syft Application
      participant main as main.go
      participant cliApp as cli.Application()
      participant clio as Clio Framework
    end

    box rgba(0,0,0,.1) Command Handler
      participant cobra as Command PreRunE
      participant opts as Command Options
      participant runE as Command RunE
    end

    participant lib as Library

    user->>main: syft scan alpine:latest

    Note over main,clio: Syft Application (initialization)
    main->>cliApp: Create app with ID
    cliApp->>clio: clio.New(config)
    clio-->>cliApp: app instance

    Note over cliApp,cobra: Build Command Tree
    cliApp->>cliApp: commands.Scan(app)
    cliApp->>clio: app.SetupCommand(&cobra.Command, opts)
    Note over clio: Bind config sources to options struct
    clio-->>cliApp: configured scanCmd

    cliApp->>cliApp: commands.Root(app, scanCmd)
    cliApp->>clio: app.SetupRootCommand(&cobra.Command, opts)
    clio-->>cliApp: rootCmd with scanCmd attached

    main->>clio: app.Run()
    clio->>cobra: rootCmd.Execute()

    Note over cobra,runE: Command Handler (execution)
    cobra->>cobra: Parse args → "scan alpine:latest"
    cobra->>opts: Load config (files/env/flags)
    cobra->>opts: opts.PostLoad() validation
    cobra->>runE: RunE(cmd, args)

    runE->>lib: Execute Scan Logic
    lib-->>runE: SBOM

    Note over runE: Result Output
    runE-->>user: SBOM output

Package structure

Many of the Anchore OSS tools have the following setup (or very similar):

  • /cmd/NAME/ - CLI application layer. This is the entry point for the command-line tool and wires up much of the functionality from the public API.

    ./cmd/NAME/
    │   ├── cli/
    │   │   ├── cli.go          // where all commands are wired up
    │   │   ├── commands/       // all command implementations
    │   │   ├── options/        // all command flags and configuration options
    │   │   └── ui/             // all handlers for events that are shown on the UI
    │   └── main.go             // entrypoint for the application
    ...
    
  • /NAME/ - Public library API. This is how API users interact with the underlying capabilities without coupling to the application configuration, specific presentation on the terminal, or high-level workflows.

The internalization philosophy

Applications extensively use internal/ packages at multiple levels to minimize the public API surface area. The codebase follows the guiding principle "internalize anything you can": expose only what library consumers truly need.

Take, for example, the various internal packages within Syft:

/internal/               # Project-wide internals (bus, log, etc...)
/syft/internal/          # Syft library-specific internals (relationships, evidence)
/cmd/syft/internal/      # CLI-specific internals (options, UI handlers)
/syft/source/internal/   # Package-specific internals (source resolution details)
/syft/pkg/cataloger/<ecosystem>/internal/  # Cataloger-specific internals

This multi-level approach allows Syft to expose a minimal, stable public API while keeping implementation details flexible and changeable. Go's tooling prevents importing internal/ packages from outside their parent subtree, which enforces clean separation of concerns.

Core facilities

The bus system

The bus system, under /internal/bus/ within the target application, is an event publishing mechanism that enables progress reporting and UI updates without coupling the library to any specific user interface implementation.

The bus follows a strict one-way communication pattern: the library publishes events but never subscribes to them. The intention is that functionality is NOT fulfilled by listening to events on the bus and taking action. Only the application layer (CLI) subscribes to events for display. This keeps the library completely decoupled from UI concerns.

You can think of the bus as a structured extension of the logger, allowing for publishing not just strings or maps of strings, but enabling publishing objects that can yield additional telemetry on-demand, fueling richer interactions.

This enables library consumers to implement any UI they want (terminal UI, web UI, no UI) by subscribing to events and handling them appropriately. The library has zero knowledge of how events are used, maintaining a clean separation between business logic and presentation.

The bus is implemented as a singleton with a global publisher that can be set by library consumers:

var publisher partybus.Publisher

func Set(p partybus.Publisher) {
    publisher = p
}

func Publish(e partybus.Event) {
    if publisher != nil {
        publisher.Publish(e)
    }
}

The library calls bus.Publish() throughout cataloging operations. If no publisher is set, events are silently discarded. This makes events truly optional.
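To make the one-way pattern concrete, here is a self-contained sketch of both sides: a publish-only bus package and a channel-backed subscriber on the application side. The types are hypothetical stand-ins (the real implementation uses wagoodman/go-partybus):

```go
package main

// Event is a minimal stand-in for a partybus-style event.
type Event struct {
	Type  string
	Value any
}

// Publisher is the only capability the library side needs.
type Publisher interface {
	Publish(Event)
}

// --- library side (internal/bus): publish-only, optional ---

var publisher Publisher

// Set installs a publisher; until then, events are discarded.
func Set(p Publisher) { publisher = p }

// Publish emits an event if (and only if) a publisher was installed.
func Publish(e Event) {
	if publisher != nil {
		publisher.Publish(e)
	}
}

// --- application side: a channel-backed subscriber ---

type ChannelBus struct{ C chan Event }

func NewChannelBus() *ChannelBus { return &ChannelBus{C: make(chan Event, 16)} }

func (b *ChannelBus) Publish(e Event) { b.C <- e }
```

The library only ever calls Publish; it never reads from the channel, preserving the strict one-way flow described above.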

Event streams

Picking the right “level” for events is key. Libraries should not assume that events can be read “quickly” off the bus. At the same time, to remain lively and useful, consumers of the bus should be able to get information at a rate they choose. A common pattern is to publish a “start” event (for example, “cataloging started”) along with a read-only, thread-safe object that the caller can poll for progress or status information.

sequenceDiagram
    participant CMD as cmd/<br/>(CLI Layer)
    participant Bus as internal/bus/<br/>(Event Bus)
    participant Lib as lib/<br/>(Library Layer)
    participant Progress as Progress Object

    CMD->>Bus: Subscribe()
    CMD->>+Lib: PerformOperation()
    Lib->>Progress: Create progress object
    Lib->>Bus: Publish(StartEvent, progress)
    Bus->>CMD: StartEvent

    loop Poll until complete
        CMD->>Progress: Size(), Current(), Stage(), Error()
        Progress-->>CMD: status (Error: nil)
    end

    Lib-->>-CMD: Return result
    CMD->>Progress: Error()
    Progress-->>CMD: ErrCompleted

This prevents the library from accidentally becoming a “firehose” that overwhelms subscribers trying to convey timely information. When subscribers cannot keep up with the volume of events emitted by the library, the information being displayed tends to go stale and become useless anyway. At the same time, there is a lot of value in responding to events instead of polling for all information.

This pattern captures the best of both worlds: an event-driven system with a consumer-driven update cadence.
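The polled progress object from the diagram can be sketched as follows. This is a minimal illustration under assumed names, not Syft's actual progress types:

```go
package main

import "sync/atomic"

// Progress is a sketch of the read-only, thread-safe object published with a
// "start" event: the library writes, subscribers poll at their own cadence.
type Progress struct {
	total   int64
	current atomic.Int64
	done    atomic.Bool
}

func NewProgress(total int64) *Progress { return &Progress{total: total} }

// Library side: cheap atomic updates, never blocked by slow subscribers.
func (p *Progress) Increment()   { p.current.Add(1) }
func (p *Progress) SetComplete() { p.done.Store(true) }

// Subscriber side: poll whenever an update is wanted.
func (p *Progress) Size() int64    { return p.total }
func (p *Progress) Current() int64 { return p.current.Load() }
func (p *Progress) Complete() bool { return p.done.Load() }
```

Because updates are atomic writes and reads, the library's cataloging loop never waits on a UI, and a slow subscriber simply observes fewer intermediate states.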

The logging system

The logging system, under /internal/log/ within the target application, provides structured logging throughout Anchore’s applications with an injectable logger interface. This allows library consumers to integrate the application’s logging into their own logging infrastructure. An adapter for logrus is already implemented for this interface, and we’re happy to accept contributions for other concrete logger adapters.

The logging system is implemented as a singleton with global functions (log.Info, log.Debug, etc.). Library consumers inject their logger by calling the public API function syft.SetLogger(yourLoggerHere).

By default, Syft uses a discard logger (no-op) that silently ignores all log messages. This ensures the library produces no output unless a logger is explicitly provided.

All loggers are automatically wrapped with a redaction layer when you call SetLogger(). The wrapping is applied internally by the logging system, which removes sensitive information (like authentication tokens) from log output. This happens transparently within the application CLI; API users, however, need to explicitly register secrets to be redacted.
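Putting those pieces together, a minimal sketch of an injectable logger with a discard default and a redaction wrapper might look like this. The interface and names are illustrative, not the real anchore/go-logger API:

```go
package main

import "strings"

// Logger is a sketch of an injectable logger interface (illustrative names).
type Logger interface {
	Info(args ...any)
}

// discard is the default: the library stays silent unless a logger is injected.
type discard struct{}

func (discard) Info(...any) {}

// redacting wraps any injected logger and masks registered secrets.
type redacting struct {
	wrapped Logger
	secrets []string
}

func (r redacting) Info(args ...any) {
	out := make([]any, len(args))
	for i, a := range args {
		s, ok := a.(string)
		if !ok {
			out[i] = a
			continue
		}
		for _, secret := range r.secrets {
			s = strings.ReplaceAll(s, secret, "*******")
		}
		out[i] = s
	}
	r.wrapped.Info(out...)
}

// memory is a capturing logger used here only for demonstration.
type memory struct{ lines []string }

func (m *memory) Info(args ...any) {
	for _, a := range args {
		if s, ok := a.(string); ok {
			m.lines = append(m.lines, s)
		}
	}
}

var log Logger = discard{}

// SetLogger mirrors the pattern above: every injected logger is wrapped
// with redaction before use.
func SetLogger(l Logger, secrets ...string) {
	log = redacting{wrapped: l, secrets: secrets}
}
```

The library code calls the package-level log functions unconditionally; whether anything is emitted, and whether secrets are masked, is decided entirely by the consumer's injection.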

Releasing

Each application uses goreleaser to build and publish releases, as orchestrated by a release workflow.

The release workflow can be triggered with make release from a local checkout of the repository. Chronicle is used to automatically generate release notes based on GitHub issues and PR titles/labels, using the same information to determine the next version for the release.

With each repo, we tend to publish the following (though some details may vary slightly between repos):

  • a tag with the version (e.g., v0.50.0)
  • binaries for Linux, Mac, and Windows, uploaded as GitHub release assets (note, we sign and notarize Mac binaries with Quill)
  • Docker images, pushed to Docker Hub and ghcr.io registries
  • updates to Homebrew taps

We ensure the same tool versions are used locally and in CI by using Binny, orchestrated with make and task.

2 - Syft

Architecture and design of the Syft SBOM tool

Code organization

At a high level, this is the package structure of Syft:

./cmd/syft/                 // main entrypoint
│   └── ...
└── syft/                   // the "core" syft library
    ├── format/             // contains code to encode or decode to and from SBOM formats
    ├── pkg/                // contains code to catalog packages from a source
    ├── sbom/               // contains the definition of an SBOM
    └── source/             // contains code to create a source object for some input type (e.g. container image, directory, etc)

Syft’s core library is implemented in the syft package and subpackages. The major packages work together in a pipeline:

  • The syft/source package produces a source.Source object that can be used to catalog a directory, container, and other source types.
  • The syft package knows how to take a source.Source object and catalog it to produce an sbom.SBOM object.
  • The syft/format package contains the ability to encode an sbom.SBOM object to and from different SBOM formats (such as SPDX and CycloneDX).

This design creates a clear flow: source → catalog → format:

sequenceDiagram
    actor User
    participant CLI
    participant Resolve as Source Resolution
    participant Catalog as SBOM Creation
    participant Format as Format Output

    User->>CLI: syft scan <target>
    CLI->>CLI: Parse configuration

    CLI->>Resolve: Resolve input (image/dir/file)
    Note over Resolve: Tries: File→Directory→OCI→Docker→Podman→Containerd→Registry
    Resolve-->>CLI: source.Source

    CLI->>Catalog: Create SBOM from source
    Note over Catalog: Task-based cataloging engine
    Catalog-->>CLI: sbom.SBOM struct

    CLI->>Format: Write to format(s)
    Note over Format: Parallel: SPDX, CycloneDX, Syft JSON, etc.
    Format-->>User: SBOM file(s)

The next diagram shows the task-based architecture and execution phases. Tasks are selected by tags (image/directory/installed) and organized into serial phases, with parallel execution within each phase.

sequenceDiagram
    participant CLI as scan.go
    participant GetSource as Source Providers
    participant CreateSBOM as syft.CreateSBOM
    participant Config as CreateSBOMConfig
    participant Executor as Task Executor
    participant Builder as sbomsync.Builder
    participant Resolver as file.Resolver

    Note over CLI,GetSource: Source Resolution
    CLI->>GetSource: GetSource(userInput, cfg)
    GetSource->>GetSource: Try providers until success
    GetSource-->>CLI: source.Source + file.Resolver

    Note over CLI,Builder: SBOM Creation (task-based architecture)
    CLI->>CreateSBOM: CreateSBOM(ctx, source, cfg)
    CreateSBOM->>Config: makeTaskGroups(srcMetadata)

    Note over Config: Task Selection & Organization
    Config->>Config: Select catalogers by tags<br/>(image/directory/installed)
    Config->>Config: Organize into execution phases
    Config-->>CreateSBOM: [][]Task (grouped by phase)

    CreateSBOM->>Builder: Initialize thread-safe builder

    Note over CreateSBOM,Executor: Phase 1: Environment Detection
    CreateSBOM->>Executor: Execute environment tasks
    Executor->>Resolver: Read OS release files
    Executor->>Builder: SetLinuxDistribution()

    Note over CreateSBOM,Executor: Phase 2: Package + File Cataloging
    CreateSBOM->>Executor: Execute package & file tasks
    par Parallel Task Execution
        Executor->>Resolver: Read package manifests
        Executor->>Builder: AddPackages()
    and
        Executor->>Resolver: Read file metadata
        Executor->>Builder: Add file artifacts
    end

    Note over CreateSBOM,Executor: Phase 3: Post-Processing
    CreateSBOM->>Executor: Execute relationship tasks
    Executor->>Builder: AddRelationships()
    CreateSBOM->>Executor: Execute cleanup tasks

    CreateSBOM-->>CLI: *sbom.SBOM

    Note over CLI: Format Output
    CLI->>CLI: Write multi-format output

The Package object

The pkg.Package object is a core data structure that represents a software package.

Key fields include:

  • FoundBy: the name of the cataloger that discovered this package (e.g. python-pip-cataloger).
  • Locations: the set of paths and layer IDs that were parsed to discover this package.
  • Language: the language of the package (e.g. python).
  • Type: a high-level categorization of the ecosystem the package resides in. For instance, even if the package is an egg, wheel, or requirements.txt reference, it is still logically a “python” package. Not all package types align with a language (e.g. rpm) but it is common.
  • Metadata: specialized data for specific location(s) parsed. This should contain as much raw information as seems useful, kept as flat as possible using the raw names and values from the underlying source material.

Additional package Metadata

Packages can have specialized metadata that is specific to the package type and source of information. This metadata is stored in the Metadata field of the pkg.Package struct as an any type, allowing for flexibility in the data stored.

When pkg.Package is serialized, an additional MetadataType field is shown to help consumers understand the data shape of the Metadata field.
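For illustration, here is a sketch of how that serialization might look. The field and type names are simplified, not Syft's exact JSON model:

```go
package main

import "encoding/json"

// ApkDBEntry is a sample specialized metadata type (illustrative fields).
type ApkDBEntry struct {
	Package string `json:"package"`
	Version string `json:"version"`
}

// Package shows the pattern: Metadata is `any`, and a MetadataType
// discriminator tells consumers what shape to expect.
type Package struct {
	Name         string `json:"name"`
	MetadataType string `json:"metadataType,omitempty"`
	Metadata     any    `json:"metadata,omitempty"`
}

func encode(p Package) string {
	b, _ := json.Marshal(p)
	return string(b)
}
```

A consumer reading the JSON can switch on metadataType before decoding the metadata object into a concrete type.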

By convention the MetadataType value follows these rules:

  • Only use lowercase letters, numbers, and hyphens. Use hyphens to separate words.
  • Anchor the name in the ecosystem, language, or packaging tooling. For language ecosystems, prefix with the language/framework/runtime. For instance dart-pubspec-lock is better than pubspec-lock. For OS package managers this is not necessary (e.g. apk-db-entry is good, but alpine-apk-db-entry is redundant).
  • Be as specific as possible to what the data represents. For instance ruby-gem is NOT a good MetadataType value, but ruby-gemspec is, since Ruby gem information can come from a gemspec file or a Gemfile.lock, which are very different.
  • Describe WHAT the data is, NOT HOW it’s used. For instance r-description-installed-file is not good since it’s trying to convey how we use the DESCRIPTION file. Instead simply describe what the DESCRIPTION file is: r-description.
  • Use the lock suffix to distinguish between manifest files that loosely describe package version requirements vs files that strongly specify one and only one version of a package (“lock” files). These should only be used with respect to package managers that have the guide and lock distinction, but would not be appropriate otherwise (e.g. rpm does not have a guide vs lock, so lock should NOT be used to describe a db entry).
  • Use the archive suffix to indicate a package archive (e.g. rpm file, apk file) that describes the contents of the package. For example an RPM file would have a rpm-archive metadata type (not to be confused with an RPM DB record entry which would be rpm-db-entry).
  • Use the entry suffix to indicate information about a package found as a single entry within a file that has multiple package entries. If found within a DB or flat-file store for an OS package manager, use db-entry.
  • Should NOT contain the phrase package, though exceptions are allowed if the canonical name literally has the phrase package in it.
  • Should NOT have a file suffix unless the canonical name has the term “file”, such as a pipfile or gemfile.
  • Should NOT contain the exact filename+extensions. For instance pipfile.lock shouldn’t be in the name; instead describe what the file is: python-pipfile-lock.
  • Should NOT contain the phrase metadata, unless the canonical name has this term.
  • Should represent a single use case. For example, trying to describe Hackage metadata with a single HackageMetadata struct is not allowed since it represents 3 mutually exclusive use cases: stack.yaml, stack.lock, or cabal.project. Each should have its own struct and MetadataType.

The goal is to provide a consistent naming scheme that is easy to understand. If the rules don’t apply in your situation, use your best judgement.

When the underlying parsed data represents multiple files, there are two approaches:

  • Use the primary file to represent all the data. For instance, though the dpkg-cataloger looks at multiple files, it’s the status file that gets represented.
  • Nest each individual file’s data under the Metadata field. For instance, the java-archive-cataloger may find information from pom.xml, pom.properties, and MANIFEST.MF. The metadata is java-metadata with each possibility as a nested optional field.

Package Catalogers

Catalogers are the mechanism by which Syft identifies and constructs packages given a targeted list of files.

For example, a cataloger can ask Syft for all package-lock.json files in order to parse and raise up JavaScript packages (see file globs and file parser functions for examples).

There is a generic cataloger implementation that can be leveraged to quickly create new catalogers by specifying file globs and parser functions (browse the source code for syft catalogers for example usage).
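The glob-to-parser idea can be sketched in a few lines. This is a toy illustration; the real implementation in syft/pkg/cataloger/generic handles full-path globs, file resolution via the resolver, and much more (here we match against basenames only, since path.Match does not support `**`):

```go
package main

import "path"

// Package is a minimal stand-in for a cataloged package.
type Package struct {
	Name string
	Type string
}

// parserFn raises packages from the contents of one matched file.
type parserFn func(contents string) []Package

// genericCataloger maps file globs to parser functions.
type genericCataloger struct {
	name    string
	parsers map[string]parserFn
}

func newGenericCataloger(name string) *genericCataloger {
	return &genericCataloger{name: name, parsers: map[string]parserFn{}}
}

func (c *genericCataloger) withParserByGlob(glob string, p parserFn) *genericCataloger {
	c.parsers[glob] = p
	return c
}

// catalog applies every parser whose glob matches a file's basename.
func (c *genericCataloger) catalog(files map[string]string) []Package {
	var pkgs []Package
	for filePath, contents := range files {
		for glob, parse := range c.parsers {
			if ok, _ := path.Match(glob, path.Base(filePath)); ok {
				pkgs = append(pkgs, parse(contents)...)
			}
		}
	}
	return pkgs
}
```

A new cataloger then amounts to registering a glob and a parser, with no file-walking logic of its own.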

Design principles

From a high level, catalogers have the following properties:

  • They are independent of one another. The Java cataloger has no idea of the processes, assumptions, or results of the Python cataloger, for example.

  • They do not know what source is being analyzed. Are we analyzing a local directory? An image? If so, the squashed representation or all layers? The catalogers do not know the answers to these questions; they know only that there is an interface to query for file paths and contents from an underlying “source” being scanned.

  • Packages created by the cataloger should not be mutated after they are created. There is one exception made for adding CPEs to a package after the cataloging phase, but that will most likely be moved back into the cataloger in the future.

Naming conventions

Cataloger names should be unique and named with these rules in mind:

  • Must end with -cataloger
  • Use lowercase letters, numbers, and hyphens only
  • Use hyphens to separate words
  • Catalogers for language ecosystems should start with the language name (e.g. python-)
  • Distinguish between when the cataloger is searching for evidence of installed packages vs declared packages. For example, there are two different gemspec-based catalogers: ruby-gemspec-cataloger and ruby-installed-gemspec-cataloger, where the latter requires that the gemspec is found within a specifications directory (meaning it was installed, not just at the root of a source repo).

File search and selection

All catalogers are provided an instance of the file.Resolver to interface with the image and search for files. The implementations of these abstractions leverage stereoscope to perform searching. Here is a rough outline of how that works:

  1. A stereoscope file.Index is searched based on the input given (a path, glob, or MIME type). The index is relatively fast to search, but requires results to be filtered down to the files that exist in the specific layer(s) of interest. This is done automatically by the filetree.Searcher abstraction. This abstraction will fall back to searching directly against the raw filetree.FileTree if the index does not contain the file(s) of interest. Note: the filetree.Searcher is used by the file.Resolver abstraction.

  2. Once the set of files is returned from the filetree.Searcher the results are filtered down further to return the most unique file results. For example, you may have requested files by a glob that returns multiple results. These results are deduplicated by real file, so if a result contains two references to the same file (one accessed via symlink and one accessed via the real path), then the real path reference is returned and the symlink reference is filtered out. If both were accessed by symlink then the first (by lexical order) is returned. This is done automatically by the file.Resolver abstraction.

  3. By the time results reach the pkg.Cataloger you are guaranteed to have a set of unique files that exist in the layer(s) of interest (relative to what the resolver supports).
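The deduplication in step 2 can be sketched as follows. This illustrates only the selection rules; it is not the actual resolver code:

```go
package main

import "sort"

// fileRef pairs the path used to reach a file with its resolved real path.
type fileRef struct {
	AccessPath string // may be a symlink
	RealPath   string // resolved on-disk path
}

// dedupe keeps one reference per real file: the real-path access wins over
// any symlink, and between symlinks the lexically first access path wins.
func dedupe(refs []fileRef) []fileRef {
	byReal := map[string]fileRef{}
	for _, r := range refs {
		existing, seen := byReal[r.RealPath]
		switch {
		case !seen:
			byReal[r.RealPath] = r
		case r.AccessPath == r.RealPath:
			byReal[r.RealPath] = r // real-path access wins
		case existing.AccessPath != existing.RealPath && r.AccessPath < existing.AccessPath:
			byReal[r.RealPath] = r // both symlinks: keep lexically first
		}
	}
	out := make([]fileRef, 0, len(byReal))
	for _, r := range byReal {
		out = append(out, r)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].RealPath < out[j].RealPath })
	return out
}
```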

CLI and core API

The CLI (in the cmd/syft/ package) and the core library API (in the syft/ package) are separate layers with a clear boundary. Application level concerns always reside with the CLI, while the core library focuses on SBOM generation logic. That means that there is an application configuration (e.g. cmd/syft/cli) and a separate library configuration, and when the CLI uses the library API, it must adapt its configuration to the library’s configuration types. In that adapter, the CLI layer defers to API-level defaults as much as possible so there is a single source of truth for default behavior.
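The adapter idea can be sketched as follows, with hypothetical type names standing in for the real configuration types:

```go
package main

// Library side: owns its defaults, the single source of truth.
type CatalogConfig struct {
	Parallelism int
	Scope       string
}

func DefaultCatalogConfig() CatalogConfig {
	return CatalogConfig{Parallelism: 4, Scope: "squashed"}
}

// CLI side: zero values mean "not set by the user".
type cliOptions struct {
	Parallelism int
	Scope       string
}

// toLibraryConfig starts from library defaults and overrides only what
// the user set, so default behavior is never duplicated in the CLI layer.
func (o cliOptions) toLibraryConfig() CatalogConfig {
	cfg := DefaultCatalogConfig()
	if o.Parallelism > 0 {
		cfg.Parallelism = o.Parallelism
	}
	if o.Scope != "" {
		cfg.Scope = o.Scope
	}
	return cfg
}
```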

See the Syft repository on GitHub for detailed API example usage.

3 - Grype

Architecture and design of the Grype vulnerability scanner

Code organization

At a high level, this is the package structure of Grype:

./cmd/grype/                // main entrypoint
│   └── ...
└── grype/                  // the "core" grype library
    ├── db/                 // vulnerability database management, schemas, readers, and writers
    │   ├── v5/             // v5 database schema
    │   └── v6/             // v6 database schema
    ├── match/              // core types for matches and result processing
    ├── matcher/            // vulnerability matching strategies
    │   ├── stock/          // default matcher (ecosystem + CPE)
    │   └── <ecosystem>/    // ecosystem-specific matchers (java, dpkg, rpm, etc.)
    ├── pkg/                // types for package representation (wraps Syft packages)
    ├── search/             // search criteria and strategies
    ├── version/            // version comparison across formats
    ├── vulnerability/      // core types for vulnerabilities and provider interface
    └── presenter/          // output formatters (JSON, table, etc.)

The grype package and subpackages implement Grype’s core library. The major packages work together in a pipeline:

  • The grype/pkg package wraps Syft packages and prepares them as match candidates, augmenting them with upstream package information and CPEs.
  • The grype/matcher package contains matching strategies that search for vulnerabilities matching specific package types.
  • The grype/db package manages the vulnerability database and provides query interfaces for matchers.
  • The grype/vulnerability package defines vulnerability data structures and the Provider interface for database queries.
  • The grype/search package implements search strategies (ecosystem, distro, CPE) and criteria composition.
  • The grype/presenter package formats match results into various output formats.

This design creates a clear flow: SBOM → package preparation → matching → results:

sequenceDiagram
    actor User
    participant CLI
    participant DB as Database
    participant Prep as Package Prep
    participant Match as Matching Engine
    participant Post as Post-Processing
    participant Format as Presenter

    User->>CLI: grype <target>
    CLI->>CLI: Parse configuration

    Note over CLI: Input Phase
    alt SBOM provided
        CLI->>CLI: Load SBOM from file
    else Scan target
        CLI->>CLI: Generate SBOM with Syft
    end

    Note over CLI,Prep: Preparation Phase
    CLI->>DB: Load vulnerability database
    DB-->>CLI: Database provider

    CLI->>Prep: Prepare packages for matching
    Note over Prep: Wrap Syft packages<br/>Add upstream packages<br/>Generate CPEs<br/>Filter overlaps
    Prep-->>CLI: Match candidates

    Note over CLI,Match: Matching Phase
    CLI->>Match: FindMatches(match candidates, provider)
    Note over Match: Group by package type<br/>Select matchers<br/>Execute in parallel
    Match-->>CLI: Raw matches + ignore filters

    Note over CLI,Post: Post-Processing Phase
    CLI->>Post: Process matches
    Note over Post: Apply ignore filters<br/>Apply user ignore rules<br/>Apply VEX statements<br/>Deduplicate results
    Post-->>CLI: Final matches

    Note over CLI,Format: Output Phase
    CLI->>Format: Format results
    Format-->>User: Vulnerability report

This diagram zooms into the Matching Phase from the high-level diagram, showing how the matching engine executes parallel matcher searches against the database. Components are grouped in boxes to show how they map to the high-level participants.

sequenceDiagram
    participant CLI as grype/main

    box rgba(200, 220, 240, 0.3) Matching Engine
        participant Matcher as VulnerabilityMatcher
        participant M as Matcher<br/>(Stock, Java, Dpkg, etc.)
    end

    participant Search as Search Strategies

    box rgba(220, 240, 200, 0.3) Database
        participant Provider as DB Provider
        participant DB as SQLite
    end

    Note over CLI,DB: Matching Phase (expanded from high-level view)
    CLI->>Matcher: FindMatches(match candidates, provider)

    Matcher->>Matcher: Group candidates by package type

    Note over Matcher,M: Each matcher runs in parallel with ecosystem-specific logic

    loop For each package type (stock, java, dpkg, etc.)
        Matcher->>M: Match(packages for this type)
        M->>Search: Build search criteria<br/>(ecosystem, distro, or CPE-based)
        Search->>Provider: SearchForVulnerabilities(criteria)
        Provider->>DB: Query vulnerability_handles
        DB-->>Provider: Matching handles
        Provider->>Provider: Compare versions against constraints
        Provider->>DB: Check unaffected_package_handles
        DB-->>Provider: Unaffected records
        Provider->>DB: Load blobs for confirmed matches
        DB-->>Provider: Vulnerability details
        Provider-->>Search: Confirmed matches
        Search-->>M: Filtered matches
        M-->>Matcher: Matches + ignore filters
    end

    Matcher->>Matcher: Collect matches from all matchers
    Matcher-->>CLI: Raw matches + ignore filters

    Note over CLI: Continues to Post-Processing Phase (see high-level view)

Relationship to Syft

Grype uses Syft’s SBOM generation capabilities rather than reimplementing package cataloging. The integration happens at two levels:

  1. External SBOMs: You can provide an SBOM file generated by Syft (or any SPDX/CycloneDX SBOM), and Grype consumes it directly.
  2. Inline scanning: When you provide a scan target (like a container image or directory), Grype invokes Syft internally to generate an SBOM, then immediately matches it against vulnerabilities.

The grype/pkg package wraps syft/pkg.Package objects and augments them with matching-specific data:

  • Upstream packages: For packages built from source (like Debian or RPM packages), Grype adds references to the source package so it can search both the binary package name and source package name.
  • CPE generation: Grype generates Common Platform Enumeration (CPE) identifiers for packages based on their metadata, enabling CPE-based matching as a fallback strategy.
  • Distro context: Grype preserves the Linux distribution information from Syft to enable distro-specific vulnerability matching.

This wrapping pattern maintains a clear architectural boundary. Syft focuses on finding packages, while Grype focuses on finding vulnerabilities in those packages.

Package representation

The grype/pkg package converts Syft packages into Grype match candidates. The pkg.FromCollection() function performs this conversion:

  1. Wraps each Syft package in a grype.Package that preserves the original package data.
  2. Adds upstream packages for packages that have source package relationships (e.g., a Debian binary package has a source package).
  3. Generates CPEs based on package metadata (name, version, vendor, product).
  4. Filters overlapping packages for comprehensive distros (like Debian or RPM-based distros) where you might have both installed packages and package files, preferring the installed packages.

The grype.Package type maintains a reference to the original syft.Package while augmenting it with:

  • Upstreams []UpstreamPackage: Source packages to search in addition to the binary package.
  • CPEs []syftPkg.CPE: Generated CPE identifiers for fallback matching.

This design preserves the complete SBOM information while preparing packages for the matching process. Matchers receive these enhanced packages and decide which attributes to use for searching.
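As a sketch of this wrapping pattern — with invented stand-ins for syft/pkg.Package and grype.Package, and a deliberately naive CPE generator (not the real APIs) — the conversion might look like:

```go
package main

import "fmt"

// Hypothetical shapes standing in for syft/pkg.Package and grype.Package;
// field names are illustrative, not the real APIs.
type syftPackage struct {
	Name, Version, SourceName string // SourceName is set for e.g. Debian binary packages
}

type upstreamPackage struct{ Name string }

type grypePackage struct {
	syftPackage                     // keep the original package data
	Upstreams []upstreamPackage     // source packages to search as well
	CPEs      []string              // generated CPE identifiers for fallback matching
}

// fromCollection mirrors the conversion steps described above
// (a sketch of what pkg.FromCollection does, not its real implementation).
func fromCollection(pkgs []syftPackage) []grypePackage {
	var out []grypePackage
	for _, p := range pkgs {
		g := grypePackage{syftPackage: p}
		if p.SourceName != "" && p.SourceName != p.Name {
			g.Upstreams = append(g.Upstreams, upstreamPackage{Name: p.SourceName})
		}
		// naive CPE generation from name/version, for illustration only
		g.CPEs = append(g.CPEs, fmt.Sprintf("cpe:2.3:a:%s:%s:%s:*:*:*:*:*:*:*", p.Name, p.Name, p.Version))
		out = append(out, g)
	}
	return out
}

func main() {
	candidates := fromCollection([]syftPackage{{Name: "openssl", Version: "1.1.1", SourceName: "openssl-src"}})
	fmt.Println(candidates[0].Upstreams[0].Name) // the source package is searched too
}
```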

Data flow

The data flow through Grype follows these steps:

  1. SBOM ingestion: Load an SBOM from a file or generate one by scanning a target.
  2. Package conversion: Convert Syft packages into grype.Package match candidates, adding upstream packages, CPEs, and filtering overlapping packages.
  3. Matcher selection: Group packages by type (e.g., Java, dpkg, npm) and select appropriate matchers.
  4. Parallel matching: Execute matchers in parallel, each querying the database with search criteria specific to their package types.
  5. Result aggregation: Collect matches from all matchers and apply deduplication using ignore filters.
  6. Post-processing: Apply user-configured ignore rules, VEX (Vulnerability Exploitability eXchange) statements, and optional CVE normalization.
  7. Output formatting: Format the final matches using the selected presenter (JSON, table, SARIF, etc.).

The database sits at the center of this flow. All matchers query the same database provider, but they use different search strategies based on their package types.

Vulnerability database

Grype uses a SQLite database to store vulnerability data. The database design prioritizes query performance and storage efficiency.

To allow any DB schema to interoperate with the high-level Grype engine, each schema must implement a Provider interface. This lets DB-specific schemas be adapted to the core Grype types.

v6 Schema design

The overall design of the v6 database schema is heavily influenced by the OSV schema, so if you are familiar with OSV, many of the entities / concepts will feel similar.

The database uses a blob + handle pattern:

  • Handles: Small, indexed records containing anything you might want to search by (package name, vulnerability id, provider name, etc.). Grype stores these in tables optimized for fast lookups. These tables point to blobs for full details. See the Grype DB SQL schemas for more details on handle table structures.

  • Blobs: Full JSON documents containing complete vulnerability details. Grype stores these separately and loads them only when a match is made. See the Grype DB blob schemas for more details on blob structures.

This separation allows Grype to quickly query millions of vulnerability records without loading full vulnerability details until necessary.
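The split can be illustrated with a small in-memory Go sketch (field and type names here are invented; the real store is SQLite):

```go
package main

import "fmt"

// A hypothetical, in-memory illustration of the blob + handle split.
type vulnHandle struct {
	Name     string // searchable: CVE/advisory ID
	Provider string // searchable: data source
	BlobID   int    // pointer to the full record, loaded lazily
}

type vulnBlob struct {
	Description string
	References  []string
}

// findByName is the cheap "phase 1" search over small handle records only.
func findByName(handles []vulnHandle, name string) []vulnHandle {
	var out []vulnHandle
	for _, h := range handles {
		if h.Name == name {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	handles := []vulnHandle{
		{Name: "CVE-2023-0001", Provider: "nvd", BlobID: 1},
		{Name: "CVE-2023-0002", Provider: "nvd", BlobID: 2},
	}
	blobs := map[int]vulnBlob{
		1: {Description: "example flaw", References: []string{"https://example.invalid/advisory"}},
		2: {Description: "another flaw"},
	}

	// Phase 2: load the full blob only for confirmed matches.
	for _, h := range findByName(handles, "CVE-2023-0001") {
		fmt.Println(h.Name, "->", blobs[h.BlobID].Description)
	}
}
```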

Key tables include:

  • vulnerability_handles: Searchable vulnerability records, indexed by name (CVE/advisory ID), status (active, withdrawn, etc.), published/modified/withdrawn dates, and provider ID. Each references a blob containing full vulnerability details (description, references, aliases, severities).

  • affected_package_handles: Links vulnerabilities, packages, and (optionally) operating systems. The referenced blob contains version constraints (for example, “vulnerable in 1.0.0 to 1.2.5”) and fix information. Used when the package ecosystem is known (npm, python, gem, etc.).

  • unaffected_package_handles: Explicitly marks package versions that are NOT vulnerable. Same structure as affected_package_handles but represents exemptions. These are applied on top of any discovered affected records to remove matches (thus reducing false positives).

  • affected_cpe_handles: Links vulnerabilities and explicit CPEs, useful when a CPE cannot be resolved to a clear package ecosystem.

  • packages: Stores unique ecosystem + name combinations (for example, ecosystem=‘npm’, name=‘lodash’).

  • operating_systems: Stores OS release information with name, major/minor version, codename, and channel (for example, RHEL EUS versus mainline). Provides context for distro-specific package matching.

  • cpes: Stores parsed CPE 2.3 components (part, vendor, product, edition, etc.). Version constraints are stored in blobs, not in this table.

  • blobs: Complete vulnerability, package, and decorator details as compressed JSON. There are three blob types:

    • VulnerabilityBlob (full vulnerability data)
    • PackageBlob (version ranges and fixes)
    • KnownExploitedVulnerabilityBlob (KEV catalog data).

Additional decorator tables enhance vulnerability information:

  • known_exploited_vulnerability_handles: Links CVE identifiers to a blob containing CISA KEV catalog data (date added, vendor, product, required action, ransomware campaign use).

  • epss_handles: Stores EPSS (Exploit Prediction Scoring System) data with CVE identifier, EPSS score (0-1 probability), and percentile ranking.

  • cwe_handles: Maps CVE identifiers to CWE (Common Weakness Enumeration) IDs with source and type information.

The schema also includes a package_cpes junction table creating many-to-many relationships between packages and CPEs. When a CPE can be resolved to a package (via this table), vulnerabilities use affected_package_handles. When a CPE cannot be resolved, vulnerabilities use affected_cpe_handles instead.

Grype versions the database schema (currently v6). When the schema changes, users download a new database file that Grype automatically detects and uses.

Data organization

Relationships between tables enable efficient querying:

  1. Matchers create search criteria (package name, version, distro, etc.).
  2. The database provider queries the appropriate handle tables with these criteria.
  3. The grype/version package filters handles by version constraints.
  4. The provider loads the corresponding vulnerability blob for confirmed matches.
  5. The complete vulnerability record returns to the matcher.

Version constraints in the database use multi-version constraint syntax, allowing a single record to express complex version ranges like “affected in 1.0.0 to 1.2.5 and 2.0.0 to 2.1.3”.
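A toy Go illustration of evaluating such a multi-range constraint follows. The inclusive-lower/exclusive-upper bounds and the naive dotted-numeric comparison are assumptions of this sketch, not Grype's actual semantics (the grype/version package has real, format-aware comparers):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersions does a naive dotted-numeric comparison (illustrative only).
func compareVersions(a, b string) int {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var ai, bi int
		if i < len(as) {
			ai, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			bi, _ = strconv.Atoi(bs[i])
		}
		if ai != bi {
			if ai < bi {
				return -1
			}
			return 1
		}
	}
	return 0
}

// rangeSpec is one "from X to Y" clause; a record may carry several, and a
// version is affected if ANY clause matches (an OR across ranges).
type rangeSpec struct{ from, to string } // inclusive from, exclusive to (assumed)

func isAffected(version string, ranges []rangeSpec) bool {
	for _, r := range ranges {
		if compareVersions(version, r.from) >= 0 && compareVersions(version, r.to) < 0 {
			return true
		}
	}
	return false
}

func main() {
	// "affected in 1.0.0 to 1.2.5 and 2.0.0 to 2.1.3"
	ranges := []rangeSpec{{"1.0.0", "1.2.5"}, {"2.0.0", "2.1.3"}}
	fmt.Println(isAffected("1.1.0", ranges)) // true: within the first range
	fmt.Println(isAffected("1.3.0", ranges)) // false: between the two ranges
}
```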

Matching engine

The matching engine orchestrates vulnerability matching across different package types. The core component is the VulnerabilityMatcher, which:

  1. Groups packages by type: Java packages go to the Java matcher, dpkg packages to the dpkg matcher, etc.
  2. Selects matchers: Each matcher declares which package types it handles.
  3. Executes matching: Matchers run in parallel, querying the database with their specific search strategies.
  4. Collects results: Matches from all matchers are aggregated.
  5. Applies ignore filters: Matchers can mark certain matches to be ignored by other matchers, preventing duplicate reporting.

The ignore filter mechanism is important. For example, the dpkg matcher searches both the binary package name and the source package name. When it finds a match via the source package, it creates an ignore filter so the stock matcher doesn’t report the same vulnerability using a CPE match. This prevents duplicate matches for the same vulnerability.
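The hand-off described above can be sketched as follows (the types and the filter shape are invented for illustration; Grype's real filters operate on richer match objects):

```go
package main

import "fmt"

type match struct {
	VulnID  string
	Package string
	ByCPE   bool // true when the match came from the stock/CPE matcher
}

// ignoreFilter returns true when a match should be dropped.
type ignoreFilter func(match) bool

// A dpkg-style matcher that matched a vulnerability via the source package can
// emit a filter suppressing duplicate CPE-based matches for the same pair.
func sourceMatchFilter(vulnID, pkg string) ignoreFilter {
	return func(m match) bool {
		return m.ByCPE && m.VulnID == vulnID && m.Package == pkg
	}
}

func applyFilters(matches []match, filters []ignoreFilter) []match {
	var kept []match
	for _, m := range matches {
		ignored := false
		for _, f := range filters {
			if f(m) {
				ignored = true
				break
			}
		}
		if !ignored {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	matches := []match{
		{VulnID: "CVE-2023-0001", Package: "bash", ByCPE: false}, // from the dpkg matcher
		{VulnID: "CVE-2023-0001", Package: "bash", ByCPE: true},  // duplicate from the stock matcher
	}
	kept := applyFilters(matches, []ignoreFilter{sourceMatchFilter("CVE-2023-0001", "bash")})
	fmt.Println(len(kept)) // prints 1: the CPE duplicate was dropped
}
```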

Matchers

Each matcher implements the Matcher interface. This allows Grype to support multiple matching strategies for different package ecosystems.

The process of making a match involves several steps:

  1. Candidate creation: Matchers create match candidates when database records meet search criteria.
  2. Version comparison: The grype/version package compares the package version against the vulnerability’s version constraints.
  3. Unaffected check: If the database has an explicit “not affected” record for this version, the match is rejected.
  4. Match creation: Confirmed matches become Match objects with confidence scores (the scores are currently unused).
  5. Ignore filter check: Matches are checked against ignore filters from other matchers.
  6. User ignore rules: Matches are checked against user-configured ignore rules.

Search strategies

Matchers determine what to search for based on package type and available metadata. Grype supports three main search strategies:

  • Ecosystem search: Queries vulnerabilities by package name and version within a specific package ecosystem (npm, pypi, gem, etc.). Search fields include ecosystem, package name, and version. The database returns handles where the package name matches and version constraints include the specified version.

  • Distro search: Queries vulnerabilities by Linux distribution, package name, and version for OS packages managed by apt, yum, or apk. Search fields include distro name and version (for example, debian:10), package name, and version. Also understands distro channels like RHEL EUS versus mainline.

  • CPE matching: Fallback strategy when ecosystem or distro matching isn’t applicable, using CPE identifiers in the format cpe:2.3:a:vendor:product:version:.... Search fields include CPE components (part, vendor, product). Broader and less precise than ecosystem matching, used primarily when ecosystem data isn’t available.

Search criteria system

The grype/search package provides a criteria system that matchers use to express search requirements. Criteria can be combined with AND and OR operators:

  • AND(ecosystem("npm"), packageName("lodash"), version("4.17.20"))
  • OR(distro("debian:10"), distro("debian:11"))

The database provider translates these criteria into SQL queries against the handle tables. This abstraction allows matchers to express complex search requirements without writing SQL directly.

Ideally, matchers orchestrate search criteria at a high level, letting each criterion type handle its own concerns; the vulnerability provider ultimately translates criteria into efficient database queries.
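A hypothetical Go rendering of composable criteria follows. Here criteria are evaluated directly against records, whereas the real grype/search provider compiles them into SQL:

```go
package main

import "fmt"

// record is an invented view of a handle row, for the sketch only.
type record struct {
	Ecosystem, PackageName, Distro string
}

// criterion mirrors the idea of search criteria as composable predicates.
type criterion func(record) bool

func and(cs ...criterion) criterion {
	return func(r record) bool {
		for _, c := range cs {
			if !c(r) {
				return false
			}
		}
		return true
	}
}

func or(cs ...criterion) criterion {
	return func(r record) bool {
		for _, c := range cs {
			if c(r) {
				return true
			}
		}
		return false
	}
}

func ecosystem(v string) criterion   { return func(r record) bool { return r.Ecosystem == v } }
func packageName(v string) criterion { return func(r record) bool { return r.PackageName == v } }
func distro(v string) criterion      { return func(r record) bool { return r.Distro == v } }

func main() {
	c := and(ecosystem("npm"), packageName("lodash"))
	fmt.Println(c(record{Ecosystem: "npm", PackageName: "lodash"})) // true

	d := or(distro("debian:10"), distro("debian:11"))
	fmt.Println(d(record{Distro: "debian:11"})) // true
}
```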

Version comparison

Grype supports multiple version formats because different ecosystems have different versioning schemes. The grype/version package provides format-specific version comparers, falling back to a “catch-all” fuzzy comparer when the format cannot be determined.

Each format has its own constraint parser that understands ecosystem-specific constraint syntax. The version comparison system detects the appropriate format based on the package type, then uses the correct comparer to evaluate version constraints from the database.

The records from the Grype DB specify which version format to use on one side of the comparison, and the package type determines the format on the other side. If no specific format can be determined, or the two formats are incompatible, the fuzzy comparer is used as a last resort.
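The fallback selection might be sketched like this (the format names and dispatch table are invented; grype/version's real comparers are far more nuanced):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// versionComparer pairs a format name with a comparison function.
type versionComparer struct {
	Format string
	Cmp    func(a, b string) int
}

var dottedComparer = versionComparer{
	Format: "dotted",
	Cmp: func(a, b string) int { // naive numeric dotted comparison
		as, bs := strings.Split(a, "."), strings.Split(b, ".")
		for i := 0; i < len(as) || i < len(bs); i++ {
			var ai, bi int
			if i < len(as) {
				ai, _ = strconv.Atoi(as[i])
			}
			if i < len(bs) {
				bi, _ = strconv.Atoi(bs[i])
			}
			if ai != bi {
				if ai < bi {
					return -1
				}
				return 1
			}
		}
		return 0
	},
}

var fuzzyComparer = versionComparer{
	Format: "fuzzy",
	Cmp:    strings.Compare, // crude lexical last resort
}

var byPackageType = map[string]versionComparer{
	"npm": dottedComparer,
	"deb": dottedComparer,
}

// comparerFor falls back to the fuzzy comparer when the package type has no
// registered format, mirroring the "catch-all" behavior described above.
func comparerFor(pkgType string) versionComparer {
	if c, ok := byPackageType[pkgType]; ok {
		return c
	}
	return fuzzyComparer
}

func main() {
	fmt.Println(comparerFor("npm").Format)                 // dotted
	fmt.Println(comparerFor("unknown").Format)             // fuzzy
	fmt.Println(comparerFor("npm").Cmp("1.10.0", "1.9.0")) // 1: numeric, not lexical
}
```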

4 - Grype DB

Architecture and design of the Grype vulnerability database build system

Overview

grype-db is essentially an application that extracts information from upstream vulnerability data providers, transforms it into smaller records targeted for Grype consumption, and loads the individual records into a new SQLite DB.

flowchart LR
    subgraph pull["Pull"]
        A[Pull vuln data<br/>from upstream]
    end

    subgraph build["Build"]
        B[Transform entries]
        C[Load entries<br/>into new DB]
    end

    subgraph package["Package"]
        D[Package DB]
    end

    A --> B --> C --> D

    style pull stroke-dasharray: 5 5, fill:none
    style build stroke-dasharray: 5 5, fill:none
    style package stroke-dasharray: 5 5, fill:none

Multi-Schema Support Architecture

What makes grype-db unique compared to a typical ETL job is the extra responsibility of transforming the most recent vulnerability data shape (defined in the vunnel repo) into all supported DB schema versions.

From the perspective of the Daily DB Publisher workflow, the (abridged) execution looks something like this:

%%{ init: { 'flowchart': { 'curve': 'linear' } } }%%
flowchart LR
    A[Pull vulnerability data]

    B5[Build v5 DB]
    C5[Package v5 DB]
    D5[Publish v5]

    B6[Build v6 DB]
    C6[Package v6 DB]
    D6[Publish v6]

    A --- B5 --> C5 --> D5
    A --- B6 --> C6 --> D6

Core Abstractions

In order to support multiple DB schemas easily from a code-organization perspective, the following abstractions exist:

  • Provider - Responsible for providing raw vulnerability data files that are cached locally for later processing.

  • Processor - Responsible for unmarshalling any entries given by the Provider, passing them into Transformers, and returning any resulting entries. Note: the object definition is schema-agnostic but instances are schema-specific since Transformers are dependency-injected into this object.

  • Transformer (v5, v6) - Takes raw data entries of a specific vunnel-defined schema and transforms the data into schema-specific entries to later be written to the database. Note: the object definition is schema-specific, encapsulating grypeDB/v# specific objects within schema-agnostic Entry objects.

  • Entry - Encapsulates schema-specific database records produced by Processors/Transformers (from the provider data) and accepted by Writers.

  • Writer (v5, v6) - Takes Entry objects and writes them to a backing store (today a SQLite database). Note: the object definition is schema-specific and typically references grypeDB/v# schema-specific writers.
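Wired together, these abstractions form a small pipeline. The following Go sketch uses invented interface shapes and toy implementations, not the actual pkg/data definitions:

```go
package main

import (
	"fmt"
	"strings"
)

// Minimal stand-ins for the pkg/data abstractions.
type entry struct{ Schema, Payload string }

type provider interface{ Fetch() []string }           // raw cached records
type transformer interface{ Transform(string) entry } // vunnel shape -> schema-specific entry
type writer interface{ Write(entry) }

// processor is schema-agnostic; a schema-specific transformer is injected.
type processor struct{ t transformer }

func (p processor) Process(raw []string) []entry {
	var out []entry
	for _, r := range raw {
		out = append(out, p.t.Transform(r))
	}
	return out
}

// toy implementations
type stubProvider struct{}

func (stubProvider) Fetch() []string { return []string{"CVE-2023-0001|nvd"} }

type v6Transformer struct{}

func (v6Transformer) Transform(raw string) entry {
	return entry{Schema: "v6", Payload: strings.ToUpper(raw)}
}

type printWriter struct{}

func (printWriter) Write(e entry) { fmt.Println(e.Schema, e.Payload) }

func main() {
	var p provider = stubProvider{}
	proc := processor{t: v6Transformer{}}
	var w writer = printWriter{}
	for _, e := range proc.Process(p.Fetch()) {
		w.Write(e)
	}
}
```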

Data Flow

All the above abstractions are defined in the pkg/data Go package and are used together commonly in the following flow:

%%{ init: { 'flowchart': { 'curve': 'linear' } } }%%
flowchart LR
    A["data.Provider"]

    subgraph processor["data.Processor"]
        direction LR
        B["unmarshaller"]
        C["v# data.Transformer"]
        B --> C
    end

    D["data.Writer"]
    E["grypeDB/v#/writer.Write"]

    A -->|"cache file"| processor
    processor -->|"[]data.Entry"| D --> E

    style processor fill:none

Where there is:

  • A data.Provider for each upstream data source (canonical, redhat, github, NIST, etc.)
  • A data.Processor for every vunnel-defined data shape (github, os, msrc, nvd, etc., defined in the vunnel repo)
  • A data.Transformer for every processor and DB schema version pairing
  • A data.Writer for every DB schema version

Code Organization

From a Go package organization perspective, the above abstractions are organized as follows:

grype-db/
└── pkg
    ├── data                      # common data structures and objects that define the ETL flow
    ├── process
    │    ├── processors           # common data.Processors to call common unmarshallers and pass entries into data.Transformers
    │    ├── v5                   # schema v5 (legacy, active)
    │    │    ├── processors.go   # wires up all common data.Processors to v5-specific data.Transformers
    │    │    ├── writer.go       # v5-specific store writer
    │    │    └── transformers    # v5-specific transformers
    │    └── v6                   # schema v6 (current, active)
    │         ├── processors.go   # wires up all common data.Processors to v6-specific data.Transformers
    │         ├── writer.go       # v6-specific store writer
    │         └── transformers    # v6-specific transformers
    └── provider                  # common code to pull, unmarshal, and cache upstream vuln data into local files
        └── ...

Note: Historical schema versions (v1-v4) have been removed from the codebase.

DB Structure and Definitions

The definitions of what goes into the database and how to access it (both reads and writes) live in the public grype repo under the grype/db package. Responsibilities of grype (not grype-db) include (but are not limited to):

  • What tables are in the database
  • What columns are in each table
  • How each record should be serialized for writing into the database
  • How records should be read/written from/to the database
  • Providing rich objects for dealing with schema-specific data structures
  • The name of the SQLite DB file within an archive
  • The definition of a listing file and listing file entries

The purpose of grype-db is to use the definitions from grype/db and the upstream vulnerability data to create DB archives and make them publicly available for consumption via Grype.

DB Distribution Files

Grype DB currently supports two active schema versions, each with a different distribution mechanism:

  • Schema v5 (legacy): Supports Grype v0.87.0+
  • Schema v6 (current): Supports Grype main branch

Historical schemas (v1-v4) are no longer supported and their code has been removed from the codebase.

Schema v5: listing.json

The listing.json file is a legacy distribution mechanism used for schema v5 (and historically v1-v4):

  • Location: databases/listing.json
  • Structure: Contains URLs to DB archives organized by schema version, ordered by latest-date-first
  • Format: { "available": { "1": [...], "2": [...], "5": [...] } }
  • Update Process: Re-generated daily by the grype-db publisher workflow through a separate listing update step

Schema v6+: latest.json

The latest.json file is the modern distribution mechanism used for schema v6 and future versions:

  • Location: databases/v{major}/latest.json (e.g., v6/latest.json, v7/latest.json)
  • Structure: Contains metadata and URL for the single latest DB archive for that major schema version
  • Format: { "url": "...", "built": "...", "checksum": "...", "schemaVersion": 6 }
  • Update Process: Generated and uploaded atomically with each DB build (no separate update step)

This dual-distribution approach allows Grype to maintain backward compatibility with v5 while providing a more efficient distribution mechanism for v6 and future versions.
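Since latest.json is a flat JSON document, consuming it is straightforward. This Go sketch parses the documented fields (the struct and function names are illustrative, and the sample values are invented):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// latestDoc mirrors the documented latest.json shape; the JSON keys come from
// the format shown above, while the Go names are this sketch's own.
type latestDoc struct {
	URL           string `json:"url"`
	Built         string `json:"built"`
	Checksum      string `json:"checksum"`
	SchemaVersion int    `json:"schemaVersion"`
}

func parseLatest(data []byte) (latestDoc, error) {
	var d latestDoc
	err := json.Unmarshal(data, &d)
	return d, err
}

func main() {
	raw := []byte(`{"url":"https://grype.anchore.io/databases/v6/example.tar.zst",
	                "built":"2024-01-01T00:00:00Z",
	                "checksum":"sha256:abc123",
	                "schemaVersion":6}`)
	d, err := parseLatest(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(d.SchemaVersion, d.URL)
}
```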

Implementation Notes:

  • Distribution file definitions reside in the grype repo, while the grype-db repo is responsible for generating DBs and creating/updating these distribution files
  • As long as Grype has been configured to point to the correct distribution file URL, the DBs can be stored separately, replaced with a service returning the distribution file contents, or mirrored for systems behind an air gap

Daily Workflows

There are two workflows that drive getting a new Grype DB out to OSS users:

  1. The daily data sync workflow, which uses vunnel to pull upstream vulnerability data.
  2. The daily DB publisher workflow, which builds and publishes a Grype DB from the data obtained in the daily data sync workflow.

Daily Data Sync Workflow

This workflow takes the upstream vulnerability data (from canonical, redhat, debian, NVD, etc.), processes it, and writes the results to OCI repos.

%%{ init: { 'flowchart': { 'curve': 'linear' } } }%%
flowchart LR
    A1["Pull alpine"] --> B1["Publish to ghcr.io/anchore/grype-db/data/alpine:&lt;date&gt;"]
    A2["Pull amazon"] --> B2["Publish to ghcr.io/anchore/grype-db/data/amazon:&lt;date&gt;"]
    A3["Pull debian"] --> B3["Publish to ghcr.io/anchore/grype-db/data/debian:&lt;date&gt;"]
    A4["Pull github"] --> B4["Publish to ghcr.io/anchore/grype-db/data/github:&lt;date&gt;"]
    A5["Pull nvd"] --> B5["Publish to ghcr.io/anchore/grype-db/data/nvd:&lt;date&gt;"]
    A6["..."] --> B6["... repeat for all upstream providers ..."]

    style A6 fill:none,stroke:none
    style B6 fill:none,stroke:none

Once all providers have been updated, a single vulnerability cache OCI repo is updated with all of the latest vulnerability data at ghcr.io/anchore/grype-db/data:<date>. The DB publisher workflow consumes this repo downstream to create Grype DBs.

The in-repo .grype-db.yaml and .vunnel.yaml configurations are used to define the upstream data sources, how to obtain them, and where to put the results locally.

Daily DB Publishing Workflow

This workflow takes the latest vulnerability data cache, builds a Grype DB, and publishes it for general consumption:

%%{ init: { 'flowchart': { 'curve': 'linear' } } }%%
flowchart LR
    subgraph pull["1. Pull"]
        A["Pull vuln data<br/>(from the daily<br/>sync workflow<br/>output)"]
    end

    subgraph generate["2. Generate Databases"]
        B5["Build v5 DB"]
        C5["Package v5 DB"]
        D5["Upload Archive"]

        B6["Build v6 DB"]
        C6["Package v6 DB<br/>(includes latest.json)"]
        D6["Upload Archive<br/>+ latest.json"]

        B5 --> C5 --> D5
        B6 --> C6 --> D6
    end

    subgraph listing["3. Update Listing (v5 only)"]
        F["Update listing.json"]
    end

    A --- B5
    A --- B6

    D5 --- F
    D6 -.->|"No listing update<br/>needed for v6"| G[Done]

    style pull stroke-dasharray: 5 5, fill:none
    style generate stroke-dasharray: 5 5, fill:none
    style listing stroke-dasharray: 5 5, fill:none
    style G fill:none,stroke:none

The manager/ directory contains all code responsible for driving the Daily DB Publisher workflow, generating DBs for all supported schema versions (currently v5 and v6) and making them available to the public.

1. Pull

Download the latest vulnerability data from various upstream data sources into a local directory. The destination for the provider data is in the data/vunnel directory.

2. Generate

Build databases for all supported schema versions based on the latest vulnerability data and upload them to Cloudflare R2 (S3-compatible storage).

Supported Schemas (see schema-info.json):

  • Schema v5 (legacy)
  • Schema v6 (current)

Build and Upload Process:

Each DB undergoes the following steps:

  1. Build: Transform vulnerability data into the schema-specific format
  2. Package: Create a compressed archive (.tar.zst)
  3. Validate: Smoke test with Grype by comparing against the previous release using vulnerability-match-labels
  4. Upload: Only DBs that pass validation are uploaded

Storage Location:

  • Distribution base URL: https://grype.anchore.io/databases/...
  • Schema-specific paths:
    • v5: databases/<archive-name>.tar.zst
    • v6: databases/v6/<archive-name>.tar.zst + databases/v6/latest.json

Key Difference:

  • v5: Only the DB archive is uploaded; discoverability happens in the next step
  • v6: Both the DB archive AND latest.json are uploaded atomically, making the DB immediately discoverable

3. Update Listing (v5 Only)

This step only applies to schema v5.

Generate and upload a new listing.json file to Cloudflare R2 based on the existing listing file and newly discovered DB archives.

The listing file is tested against installations of Grype to ensure scans can successfully discover and download the DB. The scan must have a non-zero count of matches to pass validation.

Once the listing file has been uploaded to databases/listing.json, user-facing Grype v5 installations can discover and download the new DB.

Note: Schema v6 does not require this step because the latest.json file is generated and uploaded atomically with the DB archive in step 2, with a 5-minute cache TTL for fast updates.

For more details on:

  • How Vunnel processes vulnerability data, see the Vunnel Architecture page
  • How quality gates validate database builds, see the Quality Gates section

5 - Vunnel

Architecture and design of the Vunnel vulnerability data processing tool

Overview

Vunnel is a CLI tool that downloads and processes vulnerability data from various sources (in the codebase, these are called “providers”).

flowchart LR
    subgraph input[ ]
        alpine_data(((<b>Alpine Sec DB</b><br/><small>secdb.alpinelinux.org</small>)))
        rhel_data(((<b>RedHat CSAF</b><br/><small>redhat.com/security</small>)))
        nvd_data(((<b>NVD Data</b><br/><small>services.nvd.nist</small>)))
        other_data((("...")))
    end

    subgraph vunnel["<b>Vunnel</b>"]
        alpine_provider[Alpine Provider]
        rhel_provider[RHEL Provider]
        nvd_provider[NVD Provider]
        other_provider[(...)]
    end

    subgraph output[ ]
        alpine_out[./data/alpine/]
        rhel_out[./data/rhel/]
        nvd_out[./data/nvd/]
        other_out[...]
    end

    alpine_data -->|download| alpine_provider
    rhel_data -->|download| rhel_provider
    nvd_data -->|download| nvd_provider

    alpine_provider -->|write| alpine_out
    rhel_provider -->|write| rhel_out
    nvd_provider -->|write| nvd_out


    vunnel:::Application

    style other_data fill:none,stroke:none
    style other_provider fill:none,stroke:none
    style other_out fill:none,stroke:none
    style output fill:none,stroke:none
    style input fill:none,stroke:none

    alpine_data:::ExternalSource@{ shape: cloud }
    rhel_data:::ExternalSource@{ shape: cloud }
    nvd_data:::ExternalSource@{ shape: cloud }

    alpine_provider:::Provider
    rhel_provider:::Provider
    nvd_provider:::Provider

    alpine_out:::Database@{ shape: db }
    rhel_out:::Database@{ shape: db }
    nvd_out:::Database@{ shape: db }

    classDef ExternalSource stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#f0f8ff, color:#000000
    classDef Application fill:#e1ffe1,stroke:#424242,stroke-width:1px
    classDef Provider fill:#none,stroke:#424242,stroke-width:1px
    classDef Database stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#fff9c4, color:#000000

Conceptually, one or more invocations of Vunnel will produce a single data directory which Grype DB uses to create a Grype database:

flowchart LR
    subgraph vunnel_runs[ ]
        vunnel_alpine[<b>vunnel run alpine</b>]
        vunnel_rhel[<b>vunnel run rhel</b>]
        vunnel_nvd[<b>vunnel run nvd</b>]
        vunnel_other[(...)]
    end

    subgraph data[ ]
        alpine_data[./data/alpine/]
        rhel_data[./data/rhel/]
        nvd_data[./data/nvd/]
        other_data[...]
    end

    db_processor[Grype-DB]

    subgraph db_out[ ]
        sqlite_db[vulnerability.db<br/><small>sqlite</small>]
    end

    vunnel_alpine -->|write| alpine_data
    vunnel_rhel -->|write| rhel_data
    vunnel_nvd -->|write| nvd_data

    alpine_data -->|read| db_processor
    rhel_data -->|read| db_processor
    nvd_data -->|read| db_processor

    db_processor -->|write| sqlite_db

    db_processor:::Application
    vunnel_alpine:::Application
    vunnel_rhel:::Application
    vunnel_nvd:::Application
    sqlite_db:::Database@{ shape: db }

    alpine_data:::Database@{ shape: db }
    rhel_data:::Database@{ shape: db }
    nvd_data:::Database@{ shape: db }

    style vunnel_other fill:none,stroke:none
    style other_data fill:none,stroke:none
    style vunnel_runs fill:none,stroke:none
    style data fill:none,stroke:none
    style db_out fill:none,stroke:none

    classDef Application fill:#e1ffe1,stroke:#424242,stroke-width:1px
    classDef Database stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#fff9c4, color:#000000

Integration with Grype DB

The Vunnel CLI tool is optimized to run a single provider at a time, not orchestrating multiple providers at once. Grype DB is the tool that collates output from multiple providers and produces a single database, and is ultimately responsible for orchestrating multiple Vunnel calls to prepare the input data:

grype-db pull

flowchart LR
    config["<code><b># .grype-db.yaml</b></code><br><code>providers:</code><br><code>  - alpine</code><br><code>  - rhel</code><br><code>  - nvd</code><br><code>  - ...</code>"]
    pull[grype-db pull]

    subgraph vunnel_runs[ ]
        vunnel_alpine[<b>vunnel run alpine</b>]
        vunnel_rhel[<b>vunnel run rhel</b>]
        vunnel_nvd[<b>vunnel run nvd</b>]
        vunnel_other[<b>vunnel run ...</b>]
    end

    subgraph data[ ]
        data_out[(./data/)]
    end

    config -->|read| pull
    pull -->|execute| vunnel_alpine
    pull -->|execute| vunnel_rhel
    pull -->|execute| vunnel_nvd
    pull -.->|execute| vunnel_other

    vunnel_alpine -->|write| data_out
    vunnel_rhel -->|write| data_out
    vunnel_nvd -->|write| data_out
    vunnel_other -.->|write| data_out

    pull:::Application
    vunnel_alpine:::Application
    vunnel_rhel:::Application
    vunnel_nvd:::Application
    vunnel_other:::Application

    config:::AnalysisInput@{ shape: document }
    data_out:::Database@{ shape: db }

    style vunnel_runs fill:none,stroke:none
    style data fill:none,stroke:none

    classDef AnalysisInput stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#f0f8ff, color:#000000
    classDef Application fill:#e1ffe1,stroke:#424242,stroke-width:1px
    classDef Database stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#fff9c4, color:#000000

grype-db build

flowchart LR
    subgraph data[ ]
        data_in[(./data/)]
    end

    build[grype-db build]

    subgraph db_out[ ]
        db[(vulnerability.db<br/><small>sqlite</small>)]
    end

    data_in -->|read| build
    build -->|write| db

    build:::Application
    data_in:::Database@{ shape: db }
    db:::Database@{ shape: db }

    style data fill:none,stroke:none
    style db_out fill:none,stroke:none

    classDef Application fill:#e1ffe1,stroke:#424242,stroke-width:1px
    classDef Database stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#fff9c4, color:#000000

grype-db package

flowchart LR
    subgraph db_in[ ]
        db[vulnerability.db<br/><small>sqlite</small>]
    end

    package[grype-db package]

    subgraph archive_out[ ]
        archive[[vulnerability-db-DATE.tar.gz]]
    end

    db -->|read| package
    package -->|write| archive

    package:::Application
    db:::Database@{ shape: db }
    archive:::Database@{ shape: document }

    style db_in fill:none,stroke:none
    style archive_out fill:none,stroke:none

    classDef Application fill:#e1ffe1,stroke:#424242,stroke-width:1px
    classDef Database stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#fff9c4, color:#000000

For more information about how Grype DB uses Vunnel, see the Grype DB Architecture page.

Provider Architecture

A “Provider” is the core abstraction for Vunnel and represents a single source of vulnerability data. Vunnel is a CLI wrapper around multiple vulnerability data providers.

Provider Requirements

All provider implementations should:

  • Live under src/vunnel/providers in their own directory (e.g. the NVD provider code is under src/vunnel/providers/nvd/...)
  • Have a class that implements the Provider interface
  • Be centrally registered with a unique name under src/vunnel/providers/__init__.py
  • Be independent of other vulnerability providers' data — that is, the debian provider CANNOT reach into the NVD provider's data directory to look up information (such as severity)
  • Follow the workspace conventions for downloaded provider inputs, produced results, and tracking of metadata

Workspace Conventions

Each provider has a “workspace” directory within the “vunnel root” directory (defaults to ./data) named after the provider.

data/                       # the "vunnel root" directory
└── alpine/                 # the provider workspace directory
    ├── input/              # any file that needs to be downloaded and referenced should be stored here
    ├── results/            # schema-compliant vulnerability results (1 record per file)
    ├── checksums           # listing of result file checksums (xxh64 algorithm)
    └── metadata.json       # metadata about the input and result files

The metadata.json and checksums are written out after all results are written to results/. An example metadata.json:

{
  "provider": "amazon",
  "urls": ["https://alas.aws.amazon.com/AL2022/alas.rss"],
  "listing": {
    "digest": "dd3bb0f6c21f3936",
    "path": "checksums",
    "algorithm": "xxh64"
  },
  "timestamp": "2023-01-01T21:20:57.504194+00:00",
  "schema": {
    "version": "1.0.0",
    "url": "https://raw.githubusercontent.com/anchore/vunnel/main/schema/provider-workspace-state/schema-1.0.0.json"
  }
}

Where:

  • provider: the name of the provider that generated the results
  • urls: the URLs that were referenced to generate the results
  • listing: the path to the checksums listing file that lists all of the results, the checksum of that file, and the algorithm used to checksum the file (and the same algorithm used for all contained checksums)
  • timestamp: the point in time when the results were generated or last updated
  • schema: the data shape that the current file conforms to
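The field descriptions above can be checked mechanically. Here is a minimal sketch of such a check — the validate_metadata helper is hypothetical, not part of vunnel; the field names come from the example metadata.json:

```python
# Hypothetical validator for the metadata.json shape described above.
import json

REQUIRED = ("provider", "urls", "listing", "timestamp", "schema")

def validate_metadata(text: str) -> dict:
    """Parse metadata.json text and verify the documented fields exist."""
    doc = json.loads(text)
    missing = [k for k in REQUIRED if k not in doc]
    if missing:
        raise ValueError(f"metadata.json missing fields: {missing}")
    # the listing must name the checksums file, its digest, and the algorithm
    for key in ("digest", "path", "algorithm"):
        if key not in doc["listing"]:
            raise ValueError(f"listing missing {key!r}")
    return doc

example = """{
  "provider": "amazon",
  "urls": ["https://alas.aws.amazon.com/AL2022/alas.rss"],
  "listing": {"digest": "dd3bb0f6c21f3936", "path": "checksums", "algorithm": "xxh64"},
  "timestamp": "2023-01-01T21:20:57.504194+00:00",
  "schema": {"version": "1.0.0", "url": "https://raw.githubusercontent.com/anchore/vunnel/main/schema/provider-workspace-state/schema-1.0.0.json"}
}"""

doc = validate_metadata(example)
```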

Result Format

All results from a provider are handled by a common base class helper (provider.Provider.results_writer()), which is driven by the application configuration (e.g. JSON flat files or SQLite database). The data shape of the results is self-describing via an envelope with a schema reference.

For example:

{
  "schema": "https://raw.githubusercontent.com/anchore/vunnel/main/schema/vulnerability/os/schema-1.0.0.json",
  "identifier": "3.3/cve-2015-8366",
  "item": {
    "Vulnerability": {
      "Severity": "Unknown",
      "NamespaceName": "alpine:3.3",
      "FixedIn": [
        {
          "VersionFormat": "apk",
          "NamespaceName": "alpine:3.3",
          "Name": "libraw",
          "Version": "0.17.1-r0"
        }
      ],
      "Link": "http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-8366",
      "Description": "",
      "Metadata": {},
      "Name": "CVE-2015-8366",
      "CVSS": []
    }
  }
}

Where:

  • The schema field is a URL to the schema that describes the data shape of the item field
  • The identifier field should have a unique identifier within the context of the provider results
  • The item field is the actual vulnerability data, and the shape of this field is defined by the schema

Note that the identifier is 3.3/cve-2015-8366 and not just cve-2015-8366 in order to uniquely identify cve-2015-8366 as applied to the alpine 3.3 distro version among other records in the results directory.

Currently only JSON payloads are supported.
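The envelope structure can be sketched in a few lines of Python. The make_envelope helper is hypothetical (vunnel's results_writer handles this internally), and the record below is abbreviated from the example above:

```python
# Hypothetical sketch of wrapping a vulnerability record in the
# self-describing schema/identifier/item envelope shown above.

def make_envelope(schema_url: str, identifier: str, item: dict) -> dict:
    """Wrap a vulnerability record with its schema URL and unique identifier."""
    return {"schema": schema_url, "identifier": identifier, "item": item}

# abbreviated record from the example above
record = {"Vulnerability": {"Name": "CVE-2015-8366", "NamespaceName": "alpine:3.3"}}

envelope = make_envelope(
    "https://raw.githubusercontent.com/anchore/vunnel/main/schema/vulnerability/os/schema-1.0.0.json",
    "3.3/cve-2015-8366",  # namespaced so the CVE is unique per distro version
    record,
)
```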

Vulnerability Schemas

The vulnerability schemas supported within the vunnel repo are defined under the schema/ directory of the repository (for example, schema/vulnerability/os/ in the result example above).

If at any point a breaking change needs to be made to a provider (even if the schema remains the same), you can set the __version__ attribute on the provider class to a new integer value (incrementing from 1). This indicates that the cached input/results are not compatible with the output of the current version of the provider; the next invocation of the provider will delete the previous input and results before running.

Provider Configuration

Each provider has a configuration object defined next to the provider class. This object is used in the vunnel application configuration and is passed as input to the provider class. Take the debian provider configuration for example:

from dataclasses import dataclass, field

from vunnel import provider, result

@dataclass
class Config:
    runtime: provider.RuntimeConfig = field(
        default_factory=lambda: provider.RuntimeConfig(
            result_store=result.StoreStrategy.SQLITE,
            existing_results=provider.ResultStatePolicy.DELETE_BEFORE_WRITE,
        ),
    )
    request_timeout: int = 125

Configuration Requirements

Every provider configuration must:

  • Be a dataclass
  • Have a runtime field of type provider.RuntimeConfig

The runtime field is used to configure common behaviors of the provider that are enforced within the vunnel.provider.Provider subclass.

Runtime Configuration Options

  • on_error: what to do when the provider fails

    • action: choose to fail, skip, or retry when the failure occurs
    • retry_count: the number of times to retry the provider before failing (only applicable when action is retry)
    • retry_delay: the number of seconds to wait between retries (only applicable when action is retry)
    • input: what to do about the input data directory on failure (such as keep or delete)
    • results: what to do about the results data directory on failure (such as keep or delete)
  • existing_results: what to do when the provider is run again and the results directory already exists

    • delete-before-write: delete the existing results just before writing the first processed (new) result
    • delete: delete existing results before running the provider
    • keep: keep the existing results
  • existing_input: what to do when the provider is run again and the input directory already exists

    • delete: delete the existing input before running the provider
    • keep: keep the existing input
  • result_store: where to store the results

    • sqlite: store results in key-value form in a SQLite database, where keys are the record identifiers and values are the JSON vulnerability records
    • flat-file: store results in JSON files named after the record identifiers

Any provider-specific config options can be added to the configuration object as needed (such as request_timeout, which is a common field).
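The on_error retry semantics can be modeled in a few lines. This is a self-contained sketch of the documented policy (fail, skip, or retry with retry_count and retry_delay), not vunnel's implementation:

```python
# Self-contained model of the on_error policy described above:
# retry up to retry_count times with retry_delay seconds between
# attempts, or fail/skip immediately depending on the action.
import time

def run_with_policy(fn, action="retry", retry_count=3, retry_delay=0.0):
    attempts = 0
    while True:
        try:
            return fn()
        except Exception:
            if action == "skip":
                return None  # swallow the failure and move on
            if action != "retry" or attempts >= retry_count:
                raise  # action == "fail", or retries exhausted
            attempts += 1
            time.sleep(retry_delay)

# a provider step that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("download failed")
    return "ok"
```

With action set to retry and retry_count of at least 2, the flaky step above eventually succeeds; with action set to fail it would raise on the first error.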

For more details on how Grype DB uses Vunnel output, see the Grype DB Architecture page.

Next Steps