Anchore’s open source security tooling consists of several interconnected tools that work together to detect vulnerabilities and ensure license compliance in software packages. This page explains how these tools interact and how data flows through the system.
The Anchore OSS ecosystem includes five main tools that, at a 30,000-foot view, work together as follows:
Chronicle, for automatically generating release notes based on GitHub issues and PR titles/labels.
Quill, for signing and notarizing release binaries for macOS.
This document explains how all of the Golang-based Anchore OSS tools are organized, covering the
package structure, common core architectural concepts, and where key functionality is implemented.
Use this as a reference when trying to familiarize yourself with the overall structure of Syft, Grype, or other applications.
CLI
The cmd package uses the Clio framework (built on top of spf13/cobra and spf13/viper) to manage flag/argument parsing, configuration, and command execution.
All flags, arguments, and configuration values are represented in the application as structs.
Each command tends to get its own struct with all options the command needs to function.
Common options (or sets of options) can be defined independently and reused across commands, composed within each command struct that needs them.
Options that represent flags are registered via the AddFlags method defined on the command struct (or on each option struct used within the command struct).
If additional processing needs to be done to elements of command or option structs before they are used in the application, define a PostLoad method on the struct to mutate the elements as needed.
In terms of what is executed when: all processing is done within the selected cobra command’s PreRun hook, wrapping any potential user-provided hook.
This means that all of this fits nicely into the existing cobra command lifecycle.
See the sign command in Quill for a small example of all of this together.
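As a concrete illustration, here is a minimal, self-contained sketch of the options-struct pattern. All type and field names here are hypothetical, and the cobra/Clio wiring (AddFlags registration, config binding) is omitted so the sketch stays runnable on its own:

```go
package main

import (
	"fmt"
	"strings"
)

// OutputOptions is a hypothetical reusable option set that multiple
// commands could compose into their own option structs.
type OutputOptions struct {
	Format string // e.g. "table" or "json"
}

// ScanOptions is a hypothetical per-command struct: one struct holding
// everything the command needs to function.
type ScanOptions struct {
	OutputOptions        // composed reusable option set
	Source        string // positional argument
}

// PostLoad mutates/validates fields after flags, env vars, and config
// files have been loaded (Clio invokes this during cobra's PreRun).
func (o *ScanOptions) PostLoad() error {
	o.Format = strings.ToLower(o.Format) // normalize user input
	switch o.Format {
	case "table", "json":
		return nil
	}
	return fmt.Errorf("unsupported format: %q", o.Format)
}

func main() {
	opts := &ScanOptions{OutputOptions: OutputOptions{Format: "JSON"}, Source: "alpine:latest"}
	if err := opts.PostLoad(); err != nil {
		panic(err)
	}
	fmt.Println(opts.Format) // normalized to "json"
}
```

The key point is that validation and normalization live on the struct itself, so the command's RunE receives options that are already loaded and sanity-checked.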
The reason for this approach is to smooth over the rough edges between cobra and viper, which have multiple ways to configure and use functionality, and provide a single way to specify any input into the application.
Being prescriptive about these approaches has allowed us to take many shared concerns that used to be a lot of boilerplate when creating an application and put them into one framework: Clio.
Execution flow
The following diagrams show the execution of a typical Anchore application at different levels of detail, using the scan command in Syft as a representative example:
sequenceDiagram
actor user as User
participant syft as Syft Application
participant cmd as Command Handler (Cobra)
participant lib as Library
user->>syft: syft scan alpine:latest
syft->>cmd: Execute
cmd->>cmd: Initialize & Load Configuration
cmd->>lib: Execute Scan Logic
lib->>cmd: SBOM
cmd-->>user: Display/Write SBOM
sequenceDiagram
actor user as User
box rgba(0,0,0,.1) Syft Application
participant main as main.go
participant cliApp as cli.Application()
participant clio as Clio Framework
end
box rgba(0,0,0,.1) Command Handler
participant cobra as Command PreRunE
participant opts as Command Options
participant runE as Command RunE
end
participant lib as Library
user->>main: syft scan alpine:latest
Note over main,clio: Syft Application (initialization)
main->>cliApp: Create app with ID
cliApp->>clio: clio.New(config)
clio-->>cliApp: app instance
Note over cliApp,cobra: Build Command Tree
cliApp->>cliApp: commands.Scan(app)
cliApp->>clio: app.SetupCommand(&cobra.Command, opts)
Note over clio: Bind config sources to options struct
clio-->>cliApp: configured scanCmd
cliApp->>cliApp: commands.Root(app, scanCmd)
cliApp->>clio: app.SetupRootCommand(&cobra.Command, opts)
clio-->>cliApp: rootCmd with scanCmd attached
main->>clio: app.Run()
clio->>cobra: rootCmd.Execute()
Note over cobra,runE: Command Handler (execution)
cobra->>cobra: Parse args → "scan alpine:latest"
cobra->>opts: Load config (files/env/flags)
cobra->>opts: opts.PostLoad() validation
cobra->>runE: RunE(cmd, args)
runE->>lib: Execute Scan Logic
lib-->>runE: SBOM
Note over runE: Result Output
runE-->>user: SBOM output
Package structure
Many of the Anchore OSS tools have the following setup (or very similar):
/cmd/NAME/ - CLI application layer.
This is the entry point for the command-line tool and wires up much of the functionality from the public API.
./cmd/NAME/
│ ├── cli/
│ │ ├── cli.go // where all commands are wired up
│ │ ├── commands/ // all command implementations
│ │ ├── options/ // all command flags and configuration options
│ │ └── ui/ // all handlers for events that are shown on the UI
│ └── main.go // entrypoint for the application
...
/NAME/ - Public library API.
This is how API users interact with the underlying capabilities without coupling to the application configuration, specific presentation on the terminal, or high-level workflows.
The internalization philosophy
Applications extensively use internal/ packages at multiple levels to minimize the public API surface area.
The codebase follows the guiding principle “internalize anything you can” - expose only what library consumers truly need.
Take, for example, the various internal packages within Syft.
This multi-level approach allows Syft to expose a minimal, stable public API while keeping implementation details flexible and changeable.
Go’s toolchain prevents importing internal/ packages from outside their parent directory tree, which enforces clean separation of concerns.
Core facilities
The bus system
The bus system, under /internal/bus/ within the target application, is an event publishing mechanism that enables progress reporting and UI updates
without coupling the library to any specific user interface implementation.
The bus follows a strict one-way communication pattern: the library publishes events but never subscribes to them.
The intention is that functionality is NOT fulfilled by listening to events on the bus and taking action.
Only the application layer (CLI) subscribes to events for display.
This keeps the library completely decoupled from UI concerns.
You can think of the bus as a structured extension of the logger, allowing for publishing not just strings or maps of strings,
but enabling publishing objects that can yield additional telemetry on-demand, fueling richer interactions.
This enables library consumers to implement any UI they want (terminal UI, web UI, no UI) by subscribing to events and handling them appropriately.
The library has zero knowledge of how events are used, maintaining a clean separation between business logic and presentation.
The bus is implemented as a singleton with a global publisher that can be set by library consumers:
var publisher partybus.Publisher

func Set(p partybus.Publisher) {
    publisher = p
}

func Publish(e partybus.Event) {
    if publisher != nil {
        publisher.Publish(e)
    }
}
The library calls bus.Publish() throughout cataloging operations. If no publisher is set, events are silently discarded.
This makes events truly optional.
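A self-contained sketch of this optional-publisher pattern, using stand-in Event and Publisher types rather than the real partybus API:

```go
package main

import "fmt"

// Event and Publisher are simplified stand-ins for the partybus types.
type Event struct {
	Type  string
	Value any
}

type Publisher interface {
	Publish(e Event)
}

// The bus package pattern: a package-level publisher that is optional.
var publisher Publisher

func Set(p Publisher) { publisher = p }

func Publish(e Event) {
	if publisher != nil {
		publisher.Publish(e) // forwarded to whatever the application registered
	}
	// with no publisher set, the event is silently discarded
}

// recorder is a trivial subscriber owned by the application (CLI) layer.
type recorder struct{ events []Event }

func (r *recorder) Publish(e Event) { r.events = append(r.events, e) }

func main() {
	Publish(Event{Type: "cataloging-started"}) // discarded: no publisher yet

	r := &recorder{}
	Set(r)
	Publish(Event{Type: "cataloging-started"}) // now observed by the CLI layer
	fmt.Println(len(r.events))
}
```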
Event streams
Picking the right “level” for events is key. Libraries should not assume that events can be read “quickly” off the bus.
At the same time, to remain lively and useful, we want consumers of the bus to be able to get information
at a rate they choose. A common pattern is to publish a “start” event (for example, “cataloging started”) along
with a read-only, thread-safe object that can be polled by the caller to get progress or status-based
information out.
sequenceDiagram
participant CMD as cmd/<br/>(CLI Layer)
participant Bus as internal/bus/<br/>(Event Bus)
participant Lib as lib/<br/>(Library Layer)
participant Progress as Progress Object
CMD->>Bus: Subscribe()
CMD->>+Lib: PerformOperation()
Lib->>Progress: Create progress object
Lib->>Bus: Publish(StartEvent, progress)
Bus->>CMD: StartEvent
loop Poll until complete
CMD->>Progress: Size(), Current(), Stage(), Error()
Progress-->>CMD: status (Error: nil)
end
Lib-->>-CMD: Return result
CMD->>Progress: Error()
Progress-->>CMD: ErrCompleted
This prevents the library from accidentally becoming a “firehose” that overwhelms subscribers trying to convey
timely information. When subscribers cannot keep up with the volume of events emitted from the library, the very
information being displayed tends to go stale and become useless anyway. At the same time, there is a lot of value in
responding to events instead of polling for all information.
This pattern balances the best of both worlds: an event-driven system with a consumer-driven update
cadence.
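The polled progress object might look roughly like this. This is a simplified, stdlib-only sketch; the real implementations use shared progress types, and the operation would run concurrently while the consumer polls:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// ErrCompleted is the sentinel a consumer checks for to know the operation ended.
var ErrCompleted = errors.New("completed")

// Progress is a read-only, thread-safe status object published alongside a
// "start" event; consumers poll it at whatever rate they choose.
type Progress struct {
	current atomic.Int64
	total   int64
	mu      sync.Mutex
	err     error
}

func (p *Progress) Current() int64 { return p.current.Load() }
func (p *Progress) Size() int64    { return p.total }

func (p *Progress) Error() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.err
}

// runOperation is the library side: it advances progress and marks completion.
func runOperation(p *Progress) {
	for i := int64(0); i < p.total; i++ {
		p.current.Add(1) // each unit of work bumps the counter
	}
	p.mu.Lock()
	p.err = ErrCompleted // signals subscribers that polling can stop
	p.mu.Unlock()
}

func main() {
	p := &Progress{total: 5}
	runOperation(p) // in real code this runs in a goroutine while the UI polls
	fmt.Println(p.Current(), errors.Is(p.Error(), ErrCompleted))
}
```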
The logging system
The logging system, under /internal/log/ within the target application, provides structured logging throughout Anchore’s applications with an injectable logger interface.
This allows library consumers to integrate the application’s logging into their own logging infrastructure.
An adapter from logrus to this interface is already implemented, and we’re happy to take contributions of adapters for other concrete loggers.
The logging system is implemented as a singleton with global functions (log.Info, log.Debug, etc.).
Library consumers inject their logger by calling the public API function syft.SetLogger(yourLoggerHere).
By default, Syft uses a discard logger (no-op) that silently ignores all log messages.
This ensures the library produces no output unless a logger is explicitly provided.
All loggers are automatically wrapped with a redaction layer when you call SetLogger().
The wrapping is applied internally by the logging system, which removes sensitive information (such as authentication tokens) from log output.
This happens transparently within the application CLI; API users, however, will need to explicitly register secrets to be redacted.
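A sketch of the injectable logger with automatic redaction. The Logger interface and SetLogger signature here are simplified stand-ins, not the real syft API:

```go
package main

import (
	"fmt"
	"strings"
)

// Logger is a minimal stand-in for the injectable logger interface.
type Logger interface {
	Info(msg string)
}

// discard is the default: the library produces no output unless a
// logger is explicitly injected.
type discard struct{}

func (discard) Info(string) {}

// redacting wraps any injected logger and scrubs registered secrets.
type redacting struct {
	next    Logger
	secrets []string
}

func (r redacting) Info(msg string) {
	for _, s := range r.secrets {
		msg = strings.ReplaceAll(msg, s, "*******")
	}
	r.next.Info(msg)
}

var log Logger = discard{}

// SetLogger applies the redaction wrapper internally, so callers never
// interact with the redaction layer directly.
func SetLogger(l Logger, secrets ...string) {
	log = redacting{next: l, secrets: secrets}
}

// memLogger captures output so we can observe the redaction.
type memLogger struct{ lines []string }

func (m *memLogger) Info(msg string) { m.lines = append(m.lines, msg) }

func main() {
	m := &memLogger{}
	SetLogger(m, "s3cr3t-token")
	log.Info("authenticating with token s3cr3t-token")
	fmt.Println(m.lines[0]) // the token never reaches the log output
}
```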
Releasing
Each application uses goreleaser to build and publish releases, as orchestrated by a release workflow.
The release workflow can be triggered with make release from a local checkout of the repository.
Chronicle is used to automatically generate release notes based on GitHub issues and PR titles/labels,
using the same information to determine the next version for the release.
With each repo, we tend to publish (but some details may vary slightly between repos):
a tag with the version (e.g., v0.50.0)
binaries for Linux, Mac, and Windows, uploaded as GitHub release assets (note, we sign and notarize Mac binaries with Quill)
Docker images, pushed to Docker Hub and ghcr.io registries
updates to Homebrew taps
We ensure the same tool versions are used locally and in CI by using Binny, orchestrated with make and task.
Syft
Architecture and design of the Syft SBOM tool
Note
See the Golang CLI Patterns for common structures and frameworks used in Syft and across other Anchore open source projects.
At a high level, this is the package structure of Syft:
./cmd/syft/ // main entrypoint
│ └── ...
└── syft/ // the "core" syft library
├── format/ // contains code to encode or decode to and from SBOM formats
├── pkg/ // contains code to catalog packages from a source
├── sbom/ // contains the definition of an SBOM
└── source/ // contains code to create a source object for some input type (e.g. container image, directory, etc)
Syft’s core library is implemented in the syft package and subpackages.
The major packages work together in a pipeline:
The syft/source package produces a source.Source object that can be used to catalog a directory, container, and other source types.
The syft package knows how to take a source.Source object and catalog it to produce an sbom.SBOM object.
The syft/format package contains the ability to encode an sbom.SBOM object to and from different SBOM formats (such as SPDX and CycloneDX).
This design creates a clear flow: source → catalog → format:
sequenceDiagram
actor User
participant CLI
participant Resolve as Source Resolution
participant Catalog as SBOM Creation
participant Format as Format Output
User->>CLI: syft scan <target>
CLI->>CLI: Parse configuration
CLI->>Resolve: Resolve input (image/dir/file)
Note over Resolve: Tries: File→Directory→OCI→Docker→Podman→Containerd→Registry
Resolve-->>CLI: source.Source
CLI->>Catalog: Create SBOM from source
Note over Catalog: Task-based cataloging engine
Catalog-->>CLI: sbom.SBOM struct
CLI->>Format: Write to format(s)
Note over Format: Parallel: SPDX, CycloneDX, Syft JSON, etc.
Format-->>User: SBOM file(s)
The following diagram shows the task-based architecture and execution phases.
Tasks are selected by tags (image/directory/installed) and organized into serial phases, with parallel execution within each phase:
sequenceDiagram
participant CLI as scan.go
participant GetSource as Source Providers
participant CreateSBOM as syft.CreateSBOM
participant Config as CreateSBOMConfig
participant Executor as Task Executor
participant Builder as sbomsync.Builder
participant Resolver as file.Resolver
Note over CLI,GetSource: Source Resolution
CLI->>GetSource: GetSource(userInput, cfg)
GetSource->>GetSource: Try providers until success
GetSource-->>CLI: source.Source + file.Resolver
Note over CLI,Builder: SBOM Creation (task-based architecture)
CLI->>CreateSBOM: CreateSBOM(ctx, source, cfg)
CreateSBOM->>Config: makeTaskGroups(srcMetadata)
Note over Config: Task Selection & Organization
Config->>Config: Select catalogers by tags<br/>(image/directory/installed)
Config->>Config: Organize into execution phases
Config-->>CreateSBOM: [][]Task (grouped by phase)
CreateSBOM->>Builder: Initialize thread-safe builder
Note over CreateSBOM,Executor: Phase 1: Environment Detection
CreateSBOM->>Executor: Execute environment tasks
Executor->>Resolver: Read OS release files
Executor->>Builder: SetLinuxDistribution()
Note over CreateSBOM,Executor: Phase 2: Package + File Cataloging
CreateSBOM->>Executor: Execute package & file tasks
par Parallel Task Execution
Executor->>Resolver: Read package manifests
Executor->>Builder: AddPackages()
and
Executor->>Resolver: Read file metadata
Executor->>Builder: Add file artifacts
end
Note over CreateSBOM,Executor: Phase 3: Post-Processing
CreateSBOM->>Executor: Execute relationship tasks
Executor->>Builder: AddRelationships()
CreateSBOM->>Executor: Execute cleanup tasks
CreateSBOM-->>CLI: *sbom.SBOM
Note over CLI: Format Output
CLI->>CLI: Write multi-format output
The Package object
The pkg.Package object is a core data structure that represents a software package.
Key fields include:
FoundBy: the name of the cataloger that discovered this package (e.g. python-pip-cataloger).
Locations: the set of paths and layer IDs that were parsed to discover this package.
Language: the language of the package (e.g. python).
Type: a high-level categorization of the ecosystem the package resides in. For instance, even if the package is an egg, wheel, or requirements.txt reference, it is still logically a “python” package. Not all package types align with a language (e.g. rpm) but it is common.
Metadata: specialized data for specific location(s) parsed. This should contain as much raw information as seems useful, kept as flat as possible using the raw names and values from the underlying source material.
Additional package Metadata
Packages can have specialized metadata that is specific to the package type and source of information.
This metadata is stored in the Metadata field of the pkg.Package struct as an any type, allowing for flexibility in the data stored.
When pkg.Package is serialized, an additional MetadataType field is included to help consumers understand the shape of the data in the Metadata field.
By convention the MetadataType value follows these rules:
Only use lowercase letters, numbers, and hyphens. Use hyphens to separate words.
Anchor the name in the ecosystem, language, or packaging tooling. For language ecosystems, prefix with the language/framework/runtime. For instance dart-pubspec-lock is better than pubspec-lock. For OS package managers this is not necessary (e.g. apk-db-entry is good, but alpine-apk-db-entry is redundant).
Be as specific as possible to what the data represents. For instance ruby-gem is NOT a good MetadataType value, but ruby-gemspec is, since Ruby gem information can come from a gemspec file or a Gemfile.lock, which are very different.
Describe WHAT the data is, NOT HOW it’s used. For instance r-description-installed-file is not good since it’s trying to convey how we use the DESCRIPTION file. Instead simply describe what the DESCRIPTION file is: r-description.
Use the lock suffix to distinguish between manifest files that loosely describe package version requirements vs files that strongly specify one and only one version of a package (“lock” files). These should only be used with respect to package managers that have the guide and lock distinction, but would not be appropriate otherwise (e.g. rpm does not have a guide vs lock, so lock should NOT be used to describe a db entry).
Use the archive suffix to indicate a package archive (e.g. rpm file, apk file) that describes the contents of the package. For example an RPM file would have a rpm-archive metadata type (not to be confused with an RPM DB record entry which would be rpm-db-entry).
Use the entry suffix to indicate information about a package found as a single entry within a file that has multiple package entries. If found within a DB or flat-file store for an OS package manager, use db-entry.
Should NOT contain the phrase package, though exceptions are allowed if the canonical name literally has the phrase package in it.
Should NOT have a file suffix unless the canonical name has the term “file”, such as a pipfile or gemfile.
Should NOT contain the exact filename+extensions. For instance pipfile.lock shouldn’t be in the name; instead describe what the file is: python-pipfile-lock.
Should NOT contain the phrase metadata, unless the canonical name has this term.
Should represent a single use case. For example, trying to describe Hackage metadata with a single HackageMetadata struct is not allowed since it represents 3 mutually exclusive use cases: stack.yaml, stack.lock, or cabal.project. Each should have its own struct and MetadataType.
The goal is to provide a consistent naming scheme that is easy to understand. If the rules don’t apply in your situation, use your best judgement.
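For illustration, some of these rules can be mechanized. The following is a hypothetical (not official) checker for a subset of them, namely the character rules and the banned phrases:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// lowercase/number words separated by single hyphens
var nameRE = regexp.MustCompile(`^[a-z0-9]+(-[a-z0-9]+)*$`)

// CheckMetadataType is an illustrative linter for a subset of the
// MetadataType naming rules described above.
func CheckMetadataType(name string) error {
	if !nameRE.MatchString(name) {
		return fmt.Errorf("%q: use only lowercase letters, numbers, and hyphens", name)
	}
	// the rules forbid these phrases (with rare canonical-name exceptions)
	for _, banned := range []string{"package", "metadata"} {
		if strings.Contains(name, banned) {
			return fmt.Errorf("%q: should not contain the phrase %q", name, banned)
		}
	}
	return nil
}

func main() {
	for _, name := range []string{"dart-pubspec-lock", "Pipfile.Lock", "rpm-package"} {
		fmt.Println(name, CheckMetadataType(name))
	}
}
```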
When the underlying parsed data represents multiple files, there are two approaches:
Use the primary file to represent all the data. For instance, though the dpkg-cataloger looks at multiple files, it’s the status file that gets represented.
Nest each individual file’s data under the Metadata field. For instance, the java-archive-cataloger may find information from pom.xml, pom.properties, and MANIFEST.MF. The metadata is java-metadata with each possibility as a nested optional field.
Package Catalogers
Catalogers are the mechanism by which Syft identifies and constructs packages given a targeted list of files.
For example, a cataloger can ask Syft for all package-lock.json files in order to parse and raise up JavaScript packages (see file globs and file parser functions for examples).
There is a generic cataloger implementation that can be leveraged to
quickly create new catalogers by specifying file globs and parser functions (browse the source code for syft catalogers for example usage).
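A toy version of the generic cataloger idea, with stand-in types in place of the real syft interfaces (globs map to parser functions, and every package produced is stamped with the cataloger's name):

```go
package main

import (
	"fmt"
	"path"
)

// Package is a simplified stand-in for pkg.Package.
type Package struct {
	Name    string
	FoundBy string
}

// parserFn turns the contents of one matched file into packages.
type parserFn func(contents string) []Package

// GenericCataloger pairs file globs with parser functions.
type GenericCataloger struct {
	name    string
	parsers map[string]parserFn // glob -> parser
}

// Catalog runs each parser against every file whose base name matches
// that parser's glob (a real resolver would do the searching).
func (c GenericCataloger) Catalog(files map[string]string) []Package {
	var pkgs []Package
	for glob, parse := range c.parsers {
		for p, contents := range files {
			if ok, _ := path.Match(glob, path.Base(p)); ok {
				for _, pkg := range parse(contents) {
					pkg.FoundBy = c.name // attribute every package to this cataloger
					pkgs = append(pkgs, pkg)
				}
			}
		}
	}
	return pkgs
}

func main() {
	c := GenericCataloger{
		name: "javascript-lock-cataloger",
		parsers: map[string]parserFn{
			"package-lock.json": func(string) []Package { return []Package{{Name: "lodash"}} },
		},
	}
	pkgs := c.Catalog(map[string]string{"/app/package-lock.json": "{}"})
	fmt.Println(pkgs[0].Name, pkgs[0].FoundBy)
}
```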
Design principles
From a high level, catalogers have the following properties:
They are independent of one another. The Java cataloger has no idea of the processes, assumptions,
or results of the Python cataloger, for example.
They do not know what source is being analyzed. Are we analyzing a local directory? An image?
If so, the squashed representation or all layers? The catalogers do not know the answers to these questions.
Only that there is an interface to query for file paths and contents from an underlying “source” being scanned.
Packages created by the cataloger should not be mutated after they are created. There is one exception made
for adding CPEs to a package after the cataloging phase, but that will most likely be moved back into the cataloger in the future.
Naming conventions
Cataloger names should be unique and named with these rules in mind:
Must end with -cataloger
Use lowercase letters, numbers, and hyphens only
Use hyphens to separate words
Catalogers for language ecosystems should start with the language name (e.g. python-)
Distinguish between when the cataloger is searching for evidence of installed packages vs declared packages. For example, there are two different gemspec-based catalogers: ruby-gemspec-cataloger and ruby-installed-gemspec-cataloger, where the latter requires that the gemspec is found within a specifications directory (meaning it was installed, not just at the root of a source repo).
File search and selection
All catalogers are provided an instance of the file.Resolver to interface with the image and search for files.
The implementations for these abstractions leverage stereoscope to perform searching.
Here is a rough outline of how that works:
A stereoscope file.Index is searched based on the input given (a path, glob, or MIME type). The index is relatively
fast to search, but requires results to be filtered down to the files that exist in the specific layer(s) of interest.
This is done automatically by the filetree.Searcher abstraction, which will fall back to searching
directly against the raw filetree.FileTree if the index does not contain the file(s) of interest.
Note: the filetree.Searcher is used by the file.Resolver abstraction.
Once the set of files is returned from the filetree.Searcher, the results are filtered down further to return
only unique file results. For example, you may have requested files by a glob that returns multiple results.
These results are filtered down to deduplicate by real files, so if a result contains two references to the same file
(one accessed via symlink and one accessed via the real path), then the real path reference is returned and the symlink
reference is filtered out. If both were accessed by symlink then the first (by lexical order) is returned.
This is done automatically by the file.Resolver abstraction.
By the time results reach the pkg.Cataloger you are guaranteed to have a set of unique files that exist in the
layer(s) of interest (relative to what the resolver supports).
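The symlink deduplication step can be sketched as follows (simplified stand-in types; the real logic lives in the file.Resolver implementations):

```go
package main

import (
	"fmt"
	"sort"
)

// Ref is a simplified file reference: the path it was accessed by and
// the real path it resolves to after following symlinks.
type Ref struct {
	AccessPath string
	RealPath   string
}

// dedupe keeps one reference per real file, preferring the real-path
// access and otherwise the lexically-first symlink, mirroring the
// resolver behavior described above.
func dedupe(refs []Ref) []Ref {
	best := map[string]Ref{}
	for _, r := range refs {
		cur, seen := best[r.RealPath]
		switch {
		case !seen:
			best[r.RealPath] = r
		case r.AccessPath == r.RealPath:
			best[r.RealPath] = r // the real-path reference wins
		case cur.AccessPath != cur.RealPath && r.AccessPath < cur.AccessPath:
			best[r.RealPath] = r // both are symlinks: keep lexically-first
		}
	}
	var out []Ref
	for _, r := range best {
		out = append(out, r)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].RealPath < out[j].RealPath })
	return out
}

func main() {
	out := dedupe([]Ref{
		{AccessPath: "/bin/sh", RealPath: "/bin/busybox"},      // via symlink
		{AccessPath: "/bin/busybox", RealPath: "/bin/busybox"}, // via real path
	})
	fmt.Println(out[0].AccessPath)
}
```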
CLI and core API
The CLI (in the cmd/syft/ package) and the core library API (in the syft/ package) are separate layers with a clear boundary.
Application level concerns always reside with the CLI, while the core library focuses on SBOM generation logic.
That means that there is an application configuration (e.g. cmd/syft/cli) and a separate library configuration, and when the CLI uses
the library API, it must adapt its configuration to the library’s configuration types. In that adapter, the CLI layer
defers to API-level defaults as much as possible so there is a single source of truth for default behavior.
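A sketch of such an adapter, with hypothetical config types (the "squashed"/"all-layers" scope values mirror Syft's scope notion, but the structs themselves are illustrative):

```go
package main

import "fmt"

// AppConfig is a hypothetical application-level configuration (CLI layer).
type AppConfig struct {
	Scope string // empty means "use the library default"
	Quiet bool   // CLI-only concern, never reaches the library
}

// LibConfig is a hypothetical library-level configuration with its own defaults.
type LibConfig struct {
	Scope string
}

// DefaultLibConfig is the single source of truth for default behavior.
func DefaultLibConfig() LibConfig { return LibConfig{Scope: "squashed"} }

// toLibConfig adapts CLI config to library config, deferring to the
// library defaults whenever the user did not override a value.
func (a AppConfig) toLibConfig() LibConfig {
	cfg := DefaultLibConfig()
	if a.Scope != "" {
		cfg.Scope = a.Scope
	}
	return cfg
}

func main() {
	fmt.Println(AppConfig{}.toLibConfig().Scope)                    // library default
	fmt.Println(AppConfig{Scope: "all-layers"}.toLibConfig().Scope) // user override
}
```

Because the CLI only overrides what the user explicitly set, changing a default in the library changes it everywhere, including the CLI.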
Grype
Architecture and design of the Grype vulnerability scanner
Note
See the Golang CLI Patterns for common structures and frameworks used in Grype and across other Anchore open source projects.
Code organization
At a high level, this is the package structure of Grype:
./cmd/grype/ // main entrypoint
│ └── ...
└── grype/ // the "core" grype library
  ├── db/ // vulnerability database management, schemas, readers, and writers
  │ ├── v5/ // v5 database schema
  │ └── v6/ // v6 database schema
  ├── match/ // core types for matches and result processing
  ├── matcher/ // vulnerability matching strategies
  │ ├── stock/ // default matcher (ecosystem + CPE)
  │ └── <ecosystem>/ // ecosystem-specific matchers (java, dpkg, rpm, etc.)
  ├── pkg/ // types for package representation (wraps Syft packages)
  ├── search/ // search criteria and strategies
  ├── version/ // version comparison across formats
  ├── vulnerability/ // core types for vulnerabilities and provider interface
  └── presenter/ // output formatters (JSON, table, etc.)
The grype package and subpackages implement Grype’s core library. The major packages work together in a pipeline:
The grype/pkg package wraps Syft packages and prepares them as match candidates, augmenting them with upstream package information and CPEs.
The grype/matcher package contains matching strategies that search for vulnerabilities matching specific package types.
The grype/db package manages the vulnerability database and provides query interfaces for matchers.
The grype/vulnerability package defines vulnerability data structures and the Provider interface for database queries.
The grype/search package implements search strategies (ecosystem, distro, CPE) and criteria composition.
The grype/presenter package formats match results into various output formats.
This design creates a clear flow: SBOM → package preparation → matching → results:
sequenceDiagram
actor User
participant CLI
participant DB as Database
participant Prep as Package Prep
participant Match as Matching Engine
participant Post as Post-Processing
participant Format as Presenter
User->>CLI: grype <target>
CLI->>CLI: Parse configuration
Note over CLI: Input Phase
alt SBOM provided
CLI->>CLI: Load SBOM from file
else Scan target
CLI->>CLI: Generate SBOM with Syft
end
Note over CLI,Prep: Preparation Phase
CLI->>DB: Load vulnerability database
DB-->>CLI: Database provider
CLI->>Prep: Prepare packages for matching
Note over Prep: Wrap Syft packages<br/>Add upstream packages<br/>Generate CPEs<br/>Filter overlaps
Prep-->>CLI: Match candidates
Note over CLI,Match: Matching Phase
CLI->>Match: FindMatches(match candidates, provider)
Note over Match: Group by package type<br/>Select matchers<br/>Execute in parallel
Match-->>CLI: Raw matches + ignore filters
Note over CLI,Post: Post-Processing Phase
CLI->>Post: Process matches
Note over Post: Apply ignore filters<br/>Apply user ignore rules<br/>Apply VEX statements<br/>Deduplicate results
Post-->>CLI: Final matches
Note over CLI,Format: Output Phase
CLI->>Format: Format results
Format-->>User: Vulnerability report
This diagram zooms into the Matching Phase from the high-level diagram, showing how the matching engine executes parallel matcher searches against the database. Components are grouped in boxes to show how they map to the high-level participants.
sequenceDiagram
participant CLI as grype/main
box rgba(200, 220, 240, 0.3) Matching Engine
participant Matcher as VulnerabilityMatcher
participant M as Matcher<br/>(Stock, Java, Dpkg, etc.)
end
participant Search as Search Strategies
box rgba(220, 240, 200, 0.3) Database
participant Provider as DB Provider
participant DB as SQLite
end
Note over CLI,DB: Matching Phase (expanded from high-level view)
CLI->>Matcher: FindMatches(match candidates, provider)
Matcher->>Matcher: Group candidates by package type
Note over Matcher,M: Each matcher runs in parallel with ecosystem-specific logic
loop For each package type (stock, java, dpkg, etc.)
Matcher->>M: Match(packages for this type)
M->>Search: Build search criteria<br/>(ecosystem, distro, or CPE-based)
Search->>Provider: SearchForVulnerabilities(criteria)
Provider->>DB: Query vulnerability_handles
DB-->>Provider: Matching handles
Provider->>Provider: Compare versions against constraints
Provider->>DB: Check unaffected_package_handles
DB-->>Provider: Unaffected records
Provider->>DB: Load blobs for confirmed matches
DB-->>Provider: Vulnerability details
Provider-->>Search: Confirmed matches
Search-->>M: Filtered matches
M-->>Matcher: Matches + ignore filters
end
Matcher->>Matcher: Collect matches from all matchers
Matcher-->>CLI: Raw matches + ignore filters
Note over CLI: Continues to Post-Processing Phase (see high-level view)
Relationship to Syft
Grype uses Syft’s SBOM generation capabilities rather than reimplementing package cataloging. The integration happens at two levels:
External SBOMs: You can provide an SBOM file generated by Syft (or any SPDX/CycloneDX SBOM), and Grype consumes it directly.
Inline scanning: When you provide a scan target (like a container image or directory), Grype invokes Syft internally to generate an SBOM, then immediately matches it against vulnerabilities.
The grype/pkg package wraps syft/pkg.Package objects and augments them with matching-specific data:
Upstream packages: For packages built from source (like Debian or RPM packages), Grype adds references to the source package so it can search both the binary package name and source package name.
CPE generation: Grype generates Common Platform Enumeration (CPE) identifiers for packages based on their metadata, enabling CPE-based matching as a fallback strategy.
Distro context: Grype preserves the Linux distribution information from Syft to enable distro-specific vulnerability matching.
This wrapping pattern maintains a clear architectural boundary. Syft focuses on finding packages, while Grype focuses on finding vulnerabilities in those packages.
Package representation
The grype/pkg package converts Syft packages into Grype match candidates. The pkg.FromCollection() function performs this conversion:
Wraps each Syft package in a grype.Package that preserves the original package data.
Adds upstream packages for packages that have source package relationships (e.g., a Debian binary package has a source package).
Generates CPEs based on package metadata (name, version, vendor, product).
Filters overlapping packages for comprehensive distros (like Debian or RPM-based distros) where you might have both installed packages and package files, preferring the installed packages.
The grype.Package type maintains a reference to the original syft.Package while augmenting it with:
Upstreams []UpstreamPackage: Source packages to search in addition to the binary package.
CPEs []syftPkg.CPE: Generated CPE identifiers for fallback matching.
This design preserves the complete SBOM information while preparing packages for the matching process. Matchers receive these enhanced packages and decide which attributes to use for searching.
Data flow
The data flow through Grype follows these steps:
SBOM ingestion: Load an SBOM from a file or generate one by scanning a target.
Package conversion: Convert Syft packages into grype.Package match candidates, adding upstream packages, CPEs, and filtering overlapping packages.
Matcher selection: Group packages by type (e.g., Java, dpkg, npm) and select appropriate matchers.
Parallel matching: Execute matchers in parallel, each querying the database with search criteria specific to their package types.
Result aggregation: Collect matches from all matchers and apply deduplication using ignore filters.
Output formatting: Format the final matches using the selected presenter (JSON, table, SARIF, etc.).
The database sits at the center of this flow. All matchers query the same database provider, but they use different search strategies based on their package types.
Vulnerability database
Grype uses a SQLite database to store vulnerability data. The database design prioritizes query performance and storage efficiency.
To interoperate any DB schema with the high-level Grype engine, each schema must implement a Provider interface.
This allows DB-specific schemas to be adapted to the core Grype types.
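A minimal sketch of that provider seam (the interface and method names here are illustrative, not the actual grype/vulnerability API):

```go
package main

import "fmt"

// Vulnerability and Criteria are simplified stand-ins for the core Grype types.
type Vulnerability struct {
	ID string
}

type Criteria struct {
	PackageName string
}

// Provider is the seam between the matching engine and any DB schema:
// each schema (v5, v6, ...) adapts itself to an interface like this.
type Provider interface {
	FindVulnerabilities(c Criteria) ([]Vulnerability, error)
}

// mapProvider is a toy in-memory "schema" adapted to the interface.
type mapProvider map[string][]Vulnerability

func (m mapProvider) FindVulnerabilities(c Criteria) ([]Vulnerability, error) {
	return m[c.PackageName], nil
}

// countMatches stands in for the engine, which depends only on Provider
// and never on a concrete schema.
func countMatches(p Provider, name string) int {
	vulns, _ := p.FindVulnerabilities(Criteria{PackageName: name})
	return len(vulns)
}

func main() {
	db := mapProvider{"lodash": {{ID: "CVE-2021-23337"}}}
	fmt.Println(countMatches(db, "lodash"), countMatches(db, "left-pad"))
}
```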
v6 Schema design
The overall design of the v6 database schema is heavily influenced by the OSV schema,
so if you are familiar with OSV, many of the entities / concepts will feel similar.
The database uses a blob + handle pattern:
Handles: Small, indexed records containing anything you might want to search by (package name, vulnerability id, provider name, etc.).
Grype stores these in tables optimized for fast lookups. These tables point to blobs for full details.
See the Grype DB SQL schemas for more details on handle table structures.
Blobs: Full JSON documents containing complete vulnerability details.
Grype stores these separately and loads them only when a match is made.
See the Grype DB blob schemas for more details on blob structures.
This separation allows Grype to quickly query millions of vulnerability records without loading full vulnerability details until necessary.
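The handle/blob split can be sketched in miniature. Here, in-memory slices and maps stand in for the SQLite tables, and the record is hypothetical; the point is that searching touches only the small handle rows, and the full JSON blob is deserialized only on a confirmed match:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VulnerabilityHandle is the small, indexed row used for searching;
// it points at the full details by blob ID.
type VulnerabilityHandle struct {
	Name   string // e.g. a CVE/advisory ID, indexed for fast lookup
	BlobID int
}

// VulnerabilityBlob is the full JSON document, loaded lazily.
type VulnerabilityBlob struct {
	Description string `json:"description"`
}

type store struct {
	handles []VulnerabilityHandle
	blobs   map[int][]byte // compressed JSON in the real schema
}

// search touches only the handle "table".
func (s *store) search(name string) *VulnerabilityHandle {
	for i := range s.handles {
		if s.handles[i].Name == name {
			return &s.handles[i]
		}
	}
	return nil
}

// loadBlob deserializes the full details only when a match is made.
func (s *store) loadBlob(h *VulnerabilityHandle) (VulnerabilityBlob, error) {
	var b VulnerabilityBlob
	err := json.Unmarshal(s.blobs[h.BlobID], &b)
	return b, err
}

func main() {
	s := &store{
		handles: []VulnerabilityHandle{{Name: "CVE-2024-0001", BlobID: 1}},
		blobs:   map[int][]byte{1: []byte(`{"description":"example"}`)},
	}
	if h := s.search("CVE-2024-0001"); h != nil {
		b, _ := s.loadBlob(h)
		fmt.Println(b.Description)
	}
}
```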
Key tables include:
vulnerability_handles: Searchable vulnerability records indexed by name (CVE/advisory ID), status (active, withdrawn, etc.), published/modified/withdrawn dates, and provider ID.
References a blob containing full vulnerability details (description, references, aliases, severities).
affected_package_handles: Links vulnerabilities, packages, and (optionally) operating systems.
The referenced blob contains version constraints (for example, “vulnerable in 1.0.0 to 1.2.5”) and fix information.
Used when the package ecosystem is known (npm, python, gem, etc.).
unaffected_package_handles: Explicitly marks package versions that are NOT vulnerable.
Same structure as affected_package_handles but represents exemptions.
These are applied on top of any discovered affected records to remove matches (thus reducing false positives).
affected_cpe_handles: Links vulnerabilities and explicit CPEs, useful when a CPE cannot be resolved to a clear package ecosystem.
packages: Stores unique ecosystem + name combinations (for example, ecosystem=‘npm’, name=‘lodash’).
operating_systems: Stores OS release information with name, major/minor version, codename, and channel (for example, RHEL EUS versus mainline).
Provides context for distro-specific package matching.
cpes: Stores parsed CPE 2.3 components (part, vendor, product, edition, etc.).
Version constraints are stored in blobs, not in this table.
blobs: Complete vulnerability, package, and decorator details as compressed JSON.
Decorator data is carried by three additional handle tables:
known_exploited_vulnerability_handles: Links CVE identifiers to a blob containing CISA KEV catalog data (date added, vendor, product, required action, ransomware campaign use).
epss_handles: Stores EPSS (Exploit Prediction Scoring System) data with CVE identifier, EPSS score (0-1 probability), and percentile ranking.
cwe_handles: Maps CVE identifiers to CWE (Common Weakness Enumeration) IDs with source and type information.
The schema also includes a package_cpes junction table creating many-to-many relationships between packages and CPEs.
When a CPE can be resolved to a package (via this table), vulnerabilities use affected_package_handles.
When a CPE cannot be resolved, vulnerabilities use affected_cpe_handles instead.
Grype versions the database schema (currently v6). When the schema changes, users download a new database file that Grype automatically detects and uses.
Data organization
Relationships between tables enable efficient querying. A typical lookup proceeds as follows:
The database provider queries the appropriate handle tables with the matcher’s search criteria.
The grype/version package filters handles by version constraints.
The provider loads the corresponding vulnerability blob for confirmed matches.
The complete vulnerability record returns to the matcher.
Version constraints in the database use multi-version constraint syntax, allowing a single record to express complex version ranges like “affected in 1.0.0 to 1.2.5 and 2.0.0 to 2.1.3”.
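To make the idea concrete, here is a sketch of evaluating such a multi-range constraint. The constraint syntax shown is invented for illustration and is not Grype's actual format:

```python
# Evaluate a multi-range constraint such as ">=1.0.0,<1.2.5 || >=2.0.0,<2.1.3":
# OR across ranges, AND within a range. Syntax is illustrative only.
OPS = {
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
}

def parse_version(v):
    return tuple(int(p) for p in v.split("."))

def satisfies(version, constraint):
    ver = parse_version(version)
    for alternative in constraint.split("||"):          # OR across ranges
        checks = []
        for clause in alternative.strip().split(","):   # AND within a range
            clause = clause.strip()
            for op in (">=", "<=", ">", "<"):           # longest operators first
                if clause.startswith(op):
                    checks.append(OPS[op](ver, parse_version(clause[len(op):])))
                    break
        if checks and all(checks):
            return True
    return False
```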
Matching engine
The matching engine orchestrates vulnerability matching across different package types. The core component is the VulnerabilityMatcher, which:
Groups packages by type: Java packages go to the Java matcher, dpkg packages to the dpkg matcher, etc.
Selects matchers: Each matcher declares which package types it handles.
Executes matching: Matchers run in parallel, querying the database with their specific search strategies.
Collects results: Matches from all matchers are aggregated.
Applies ignore filters: Matchers can mark certain matches to be ignored by other matchers, preventing duplicate reporting.
The ignore filter mechanism is important. For example, the dpkg matcher searches both the binary package name and the source package name. When it finds a match via the source package, it creates an ignore filter so the stock matcher doesn’t report the same vulnerability using a CPE match. This prevents duplicate matches for the same vulnerability.
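A hypothetical Python sketch of this hand-off (all names invented; the real matchers are Go):

```python
# The dpkg matcher finds a match via the source package and registers an
# ignore filter so a later stock/CPE matcher skips the same vulnerability.

def dpkg_matcher(pkg, db, ignore_filters):
    matches = []
    for name in filter(None, (pkg["name"], pkg.get("source"))):
        for vuln_id in db.get(("deb", name), []):
            matches.append({"vuln_id": vuln_id, "package": pkg["name"]})
            # Tell downstream matchers not to re-report this vulnerability.
            ignore_filters.append(
                lambda m, v=vuln_id, p=pkg["name"]: m["vuln_id"] == v and m["package"] == p
            )
    return matches

def apply_filters(matches, ignore_filters):
    return [m for m in matches if not any(f(m) for f in ignore_filters)]

# Demo: the match is found via the source package "openssl"...
db = {("deb", "openssl"): ["CVE-2024-0001"]}
filters = []
dpkg_matches = dpkg_matcher({"name": "libssl3", "source": "openssl"}, db, filters)
# ...so when a stock matcher later reports the same finding, it is filtered out.
stock_matches = apply_filters([{"vuln_id": "CVE-2024-0001", "package": "libssl3"}], filters)
```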
Matchers
Each matcher implements the Matcher interface.
This allows Grype to support multiple matching strategies for different package ecosystems.
The process of making a match involves several steps:
Candidate creation: Matchers create match candidates when database records meet search criteria.
Version comparison: The grype/version package compares the package version against the vulnerability’s version constraints.
Unaffected check: If the database has an explicit “not affected” record for this version, the match is rejected.
Match creation: Confirmed matches become Match objects with confidence scores (the scores are currently unused).
Ignore filter check: Matches are checked against ignore filters from other matchers.
User ignore rules: Matches are checked against user-configured ignore rules.
Search strategies
Matchers determine what to search for based on package type and available metadata. Grype supports three main search strategies:
Ecosystem search: Queries vulnerabilities by package name and version within a specific package ecosystem (npm, pypi, gem, etc.). Search fields include ecosystem, package name, and version. The database returns handles where the package name matches and version constraints include the specified version.
Distro search: Queries vulnerabilities by Linux distribution, package name, and version for OS packages managed by apt, yum, or apk. Search fields include distro name and version (for example, debian:10), package name, and version. Also understands distro channels like RHEL EUS versus mainline.
CPE matching: Fallback strategy when ecosystem or distro matching isn’t applicable, using CPE identifiers in the format cpe:2.3:a:vendor:product:version:.... Search fields include CPE components (part, vendor, product). Broader and less precise than ecosystem matching, used primarily when ecosystem data isn’t available.
Search criteria system
The grype/search package provides a criteria system that matchers use to express search requirements.
Criteria can be combined with AND and OR operators to express compound searches.
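As an illustration, a composable criteria system might look like the following sketch (names are hypothetical; the real grype/search package is Go):

```python
# Composable search criteria with AND (&) and OR (|) combinators.

class Criteria:
    def __init__(self, test):
        self.test = test

    def __and__(self, other):
        return Criteria(lambda rec: self.test(rec) and other.test(rec))

    def __or__(self, other):
        return Criteria(lambda rec: self.test(rec) or other.test(rec))

def by_ecosystem(eco):
    return Criteria(lambda rec: rec["ecosystem"] == eco)

def by_package_name(name):
    return Criteria(lambda rec: rec["name"] == name)

def search(records, criteria):
    # A real provider would translate criteria into SQL against the handle
    # tables; here we simply filter in memory.
    return [r for r in records if criteria.test(r)]
```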
The database provider translates these criteria into SQL queries against the handle tables.
This abstraction allows matchers to express complex search requirements without writing SQL directly.
Ideally, matchers orchestrate search criteria at a high level, letting each specific criteria type handle its own needs.
It’s the vulnerability provider that ultimately translates criteria into efficient database queries.
Version comparison
Grype supports multiple version formats because different ecosystems have different versioning schemes.
The grype/version package provides format-specific version comparers,
falling back to a “catch all” fuzzy comparer when the format cannot be determined.
Each format has its own constraint parser that understands ecosystem-specific constraint syntax.
The version comparison system detects the appropriate format based on the package type,
then uses the correct comparer to evaluate version constraints from the database.
The records from the Grype DB specify which version format to use on one side of the comparison, and the package type determines the format on the other side.
If no specific format is found, or the formats are incompatible (essentially do not match), the fuzzy comparer is used as a last resort.
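A simplified sketch of this selection logic, with hypothetical comparers and a digit-extracting fuzzy fallback:

```python
# Pick a version comparer from the package type, falling back to a fuzzy
# comparer when no format-specific comparer applies. Names are illustrative.
import re

def semver_compare(a, b):
    pa = tuple(int(x) for x in a.split("."))
    pb = tuple(int(x) for x in b.split("."))
    return (pa > pb) - (pa < pb)

def fuzzy_compare(a, b):
    # Last resort: compare whatever digit runs can be extracted.
    pa = tuple(int(x) for x in re.findall(r"\d+", a))
    pb = tuple(int(x) for x in re.findall(r"\d+", b))
    return (pa > pb) - (pa < pb)

COMPARERS = {"npm": semver_compare, "python": semver_compare}

def compare(pkg_type, a, b):
    return COMPARERS.get(pkg_type, fuzzy_compare)(a, b)
```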
Related architecture
Golang CLI Patterns - Common structures and frameworks used across Anchore OSS projects
Syft Architecture - SBOM generation architecture that Grype builds upon
4 - Grype DB
Architecture and design of the Grype vulnerability database build system
Overview
grype-db is essentially an application that extracts information from upstream vulnerability data providers, transforms it into smaller records targeted for Grype consumption, and loads the individual records into a new SQLite DB.
flowchart LR
subgraph pull["Pull"]
A[Pull vuln data<br/>from upstream]
end
subgraph build["Build"]
B[Transform entries]
C[Load entries<br/>into new DB]
end
subgraph package["Package"]
D[Package DB]
end
A --> B --> C --> D
style pull stroke-dasharray: 5 5, fill:none
style build stroke-dasharray: 5 5, fill:none
style package stroke-dasharray: 5 5, fill:none
Multi-Schema Support Architecture
What makes grype-db unique compared to a typical ETL job is the extra responsibility of needing to transform the most recent vulnerability data shape (defined in the vunnel repo) to all supported DB schema versions.
From the perspective of the Daily DB Publisher workflow, (abridged) execution looks something like this:
In order to support multiple DB schemas easily from a code-organization perspective, the following abstractions exist:
Provider - Responsible for providing raw vulnerability data files that are cached locally for later processing.
Processor - Responsible for unmarshalling any entries given by the Provider, passing them into Transformers, and returning any resulting entries. Note: the object definition is schema-agnostic but instances are schema-specific since Transformers are dependency-injected into this object.
Transformer (v5, v6) - Takes raw data entries of a specific vunnel-defined schema and transforms the data into schema-specific entries to later be written to the database. Note: the object definition is schema-specific, encapsulating grypeDB/v# specific objects within schema-agnostic Entry objects.
Entry - Encapsulates schema-specific database records produced by Processors/Transformers (from the provider data) and accepted by Writers.
Writer (v5, v6) - Takes Entry objects and writes them to a backing store (today a SQLite database). Note: the object definition is schema-specific and typically references grypeDB/v# schema-specific writers.
Data Flow
All the above abstractions are defined in the pkg/data Go package and are used together commonly in the following flow:
%%{ init: { 'flowchart': { 'curve': 'linear' } } }%%
flowchart LR
A["data.Provider"]
subgraph processor["data.Processor"]
direction LR
B["unmarshaller"]
C["v# data.Transformer"]
B --> C
end
D["data.Writer"]
E["grypeDB/v#/writer.Write"]
A -->|"cache file"| processor
processor -->|"[]data.Entry"| D --> E
style processor fill:none
Where there is:
A data.Provider for each upstream data source (e.g. canonical, redhat, github, NIST, etc.)
A data.Processor for every vunnel-defined data shape (github, os, msrc, nvd, etc… defined in the vunnel repo)
A data.Transformer for every processor and DB schema version pairing
A data.Writer for every DB schema version
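The wiring of these abstractions can be sketched as follows (Python stand-ins for the Go objects; all names beyond the concepts above are invented):

```python
# ETL sketch: a provider yields cached raw entries, a processor matched to the
# entry's data shape applies its injected transformer, and a writer persists
# the resulting schema-specific entries.

def run_build(provider, processors, writer):
    for raw in provider():
        for processor in processors:
            if processor["handles"](raw):
                # The transformer is schema-specific and dependency-injected.
                writer(processor["transform"](raw))

# Toy wiring: one "os"-shaped processor with a v6-style transformer.
def toy_provider():
    yield {"shape": "os", "cve": "CVE-2024-0002", "package": "bash"}

def os_transform(raw):
    return {"schema": "v6", "vuln_id": raw["cve"], "package": raw["package"]}

written = []
run_build(
    toy_provider,
    [{"handles": lambda r: r["shape"] == "os", "transform": os_transform}],
    written.append,
)
```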
Code Organization
From a Go package organization perspective, the above abstractions are organized as follows:
grype-db/
└── pkg
├── data # common data structures and objects that define the ETL flow
├── process
│ ├── processors # common data.Processors to call common unmarshallers and pass entries into data.Transformers
│ ├── v5 # schema v5 (legacy, active)
│ │ ├── processors.go # wires up all common data.Processors to v5-specific data.Transformers
│ │ ├── writer.go # v5-specific store writer
│ │ └── transformers # v5-specific transformers
│ └── v6 # schema v6 (current, active)
│ ├── processors.go # wires up all common data.Processors to v6-specific data.Transformers
│ ├── writer.go # v6-specific store writer
│ └── transformers # v6-specific transformers
└── provider # common code to pull, unmarshal, and cache upstream vuln data into local files
└── ...
Note: Historical schema versions (v1-v4) have been removed from the codebase.
DB Structure and Definitions
The definitions of what goes into the database and how to access it (both reads and writes) live in the public grype repo under the grype/db package. Responsibilities of grype (not grype-db) include (but are not limited to):
What tables are in the database
What columns are in each table
How each record should be serialized for writing into the database
How records should be read/written from/to the database
Providing rich objects for dealing with schema-specific data structures
The name of the SQLite DB file within an archive
The definition of a listing file and listing file entries
The purpose of grype-db is to use the definitions from grype/db and the upstream vulnerability data to create DB archives and make them publicly available for consumption via Grype.
DB Distribution Files
Grype DB currently supports two active schema versions, each with a different distribution mechanism:
Schema v5 (legacy): Supports Grype v0.87.0+
Schema v6 (current): Supports the Grype main branch
Historical schemas (v1-v4) are no longer supported and their code has been removed from the codebase.
Schema v5: listing.json
The listing.json file is a legacy distribution mechanism used for schema v5 (and historically v1-v4):
Location: databases/listing.json
Structure: Contains URLs to DB archives organized by schema version, ordered by latest-date-first
Update Process: Regenerated and uploaded in a separate step after each DB build (see the “Update Listing” step below)
This dual-distribution approach allows Grype to maintain backward compatibility with v5 while providing a more efficient distribution mechanism for v6 and future versions.
Implementation Notes:
Distribution file definitions reside in the grype repo, while the grype-db repo is responsible for generating DBs and creating/updating these distribution files
As long as Grype has been configured to point to the correct distribution file URL, the DBs can be stored separately, replaced with a service returning the distribution file contents, or mirrored for systems behind an air gap
Daily Workflows
There are two workflows that drive getting a new Grype DB out to OSS users:
The daily data sync workflow, which pulls the latest vulnerability data from upstream sources and publishes it to OCI repositories.
The daily DB publisher workflow, which builds and publishes a Grype DB from the data obtained in the daily data sync workflow.
Daily Data Sync Workflow
This workflow takes the upstream vulnerability data (from canonical, redhat, debian, NVD, etc), processes it, and writes the results to OCI repos.
%%{ init: { 'flowchart': { 'curve': 'linear' } } }%%
flowchart LR
A1["Pull alpine"] --> B1["Publish to ghcr.io/anchore/grype-db/data/alpine:<date>"]
A2["Pull amazon"] --> B2["Publish to ghcr.io/anchore/grype-db/data/amazon:<date>"]
A3["Pull debian"] --> B3["Publish to ghcr.io/anchore/grype-db/data/debian:<date>"]
A4["Pull github"] --> B4["Publish to ghcr.io/anchore/grype-db/data/github:<date>"]
A5["Pull nvd"] --> B5["Publish to ghcr.io/anchore/grype-db/data/nvd:<date>"]
A6["..."] --> B6["... repeat for all upstream providers ..."]
style A6 fill:none,stroke:none
style B6 fill:none,stroke:none
Once all providers have been updated, a single vulnerability cache OCI repo is updated with all of the latest vulnerability data at ghcr.io/anchore/grype-db/data:<date>. This repo is what is used downstream by the DB publisher workflow to create Grype DBs.
The in-repo .grype-db.yaml and .vunnel.yaml configurations are used to define the upstream data sources, how to obtain them, and where to put the results locally.
Daily DB Publishing Workflow
This workflow takes the latest vulnerability data cache, builds a Grype DB, and publishes it for general consumption:
The manager/ directory contains all code responsible for driving the Daily DB Publisher workflow, generating DBs for all supported schema versions (currently v5 and v6) and making them available to the public.
1. Pull
Download the latest vulnerability data from various upstream data sources into a local directory. The destination for the provider data is in the data/vunnel directory.
2. Generate
Build databases for all supported schema versions based on the latest vulnerability data and upload them to Cloudflare R2 (S3-compatible storage).
v5: Only the DB archive is uploaded; discoverability happens in the next step
v6: Both the DB archive AND latest.json are uploaded atomically, making the DB immediately discoverable
3. Update Listing (v5 Only)
This step only applies to schema v5.
Generate and upload a new listing.json file to Cloudflare R2 based on the existing listing file and newly discovered DB archives.
The listing file is tested against installations of Grype to ensure scans can successfully discover and download the DB. The scan must have a non-zero count of matches to pass validation.
Once the listing file has been uploaded to databases/listing.json, user-facing Grype v5 installations can discover and download the new DB.
Note: Schema v6 does not require this step because the latest.json file is generated and uploaded atomically with the DB archive in step 2, with a 5-minute cache TTL for fast updates.
Conceptually, one or more invocations of Vunnel will produce a single data directory which Grype DB uses to create a Grype database:
flowchart LR
subgraph vunnel_runs[ ]
vunnel_alpine[<b>vunnel run alpine</b>]
vunnel_rhel[<b>vunnel run rhel</b>]
vunnel_nvd[<b>vunnel run nvd</b>]
vunnel_other[(...)]
end
subgraph data[ ]
alpine_data[./data/alpine/]
rhel_data[./data/rhel/]
nvd_data[./data/nvd/]
other_data[...]
end
db_processor[Grype-DB]
subgraph db_out[ ]
sqlite_db[vulnerability.db<br/><small>sqlite</small>]
end
vunnel_alpine -->|write| alpine_data
vunnel_rhel -->|write| rhel_data
vunnel_nvd -->|write| nvd_data
alpine_data -->|read| db_processor
rhel_data -->|read| db_processor
nvd_data -->|read| db_processor
db_processor -->|write| sqlite_db
db_processor:::Application
vunnel_alpine:::Application
vunnel_rhel:::Application
vunnel_nvd:::Application
sqlite_db:::Database@{ shape: db }
alpine_data:::Database@{ shape: db }
rhel_data:::Database@{ shape: db }
nvd_data:::Database@{ shape: db }
style vunnel_other fill:none,stroke:none
style other_data fill:none,stroke:none
style vunnel_runs fill:none,stroke:none
style data fill:none,stroke:none
style db_out fill:none,stroke:none
classDef Application fill:#e1ffe1,stroke:#424242,stroke-width:1px
classDef Database stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#fff9c4, color:#000000
Integration with Grype DB
The Vunnel CLI tool is optimized to run a single provider at a time, not orchestrating multiple providers at once. Grype DB is the tool that collates output from multiple providers and produces a single database, and is ultimately responsible for orchestrating multiple Vunnel calls to prepare the input data:
flowchart LR
subgraph data[ ]
data_in[(./data/)]
end
build[grype-db build]
subgraph db_out[ ]
db[(vulnerability.db<br/><small>sqlite</small>)]
end
data_in -->|read| build
build -->|write| db
build:::Application
data_in:::Database@{ shape: db }
db:::Database@{ shape: db }
style data fill:none,stroke:none
style db_out fill:none,stroke:none
classDef Application fill:#e1ffe1,stroke:#424242,stroke-width:1px
classDef Database stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#fff9c4, color:#000000
grype-db package
flowchart LR
subgraph db_in[ ]
db[vulnerability.db<br/><small>sqlite</small>]
end
package[grype-db package]
subgraph archive_out[ ]
archive[[vulnerability-db-DATE.tar.gz]]
end
db -->|read| package
package -->|write| archive
package:::Application
db:::Database@{ shape: db }
archive:::Database@{ shape: document }
style db_in fill:none,stroke:none
style archive_out fill:none,stroke:none
classDef Application fill:#e1ffe1,stroke:#424242,stroke-width:1px
classDef Database stroke-width:1px, stroke-dasharray:none, stroke:#424242, fill:#fff9c4, color:#000000
For more information about how Grype DB uses Vunnel see the Grype DB Architecture page.
Provider Architecture
Vunnel is a CLI wrapper around multiple vulnerability data providers. A “Provider” is Vunnel’s core abstraction and represents a single source of vulnerability data.
Provider Requirements
All provider implementations should:
Live under src/vunnel/providers in their own directory (e.g. the NVD provider code is under src/vunnel/providers/nvd/...)
Be independent of other vulnerability providers’ data — that is, the debian provider CANNOT reach into the NVD provider’s data directory to look up information (such as severity)
Follow the workspace conventions for downloaded provider inputs, produced results, and tracking of metadata
Workspace Conventions
Each provider has a “workspace” directory within the “vunnel root” directory (defaults to ./data) named after the provider.
data/                    # the "vunnel root" directory
└── alpine/              # the provider workspace directory
    ├── input/           # any file that needs to be downloaded and referenced should be stored here
    ├── results/         # schema-compliant vulnerability results (1 record per file)
    ├── checksums        # listing of result file checksums (xxh64 algorithm)
    └── metadata.json    # metadata about the input and result files
The metadata.json and checksums are written out after all results are written to results/. An example metadata.json:
provider: the name of the provider that generated the results
urls: the URLs that were referenced to generate the results
listing: the path to the checksums listing file that lists all of the results, the checksum of that file, and the algorithm used to checksum the file (and the same algorithm used for all contained checksums)
timestamp: the point in time when the results were generated or last updated
schema: the data shape that the current file conforms to
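Based on the fields above, a metadata.json might look roughly like this sketch (all values and exact field shapes are illustrative; consult the vunnel repo for the authoritative schema):

```json
{
  "provider": "alpine",
  "urls": ["https://secdb.alpinelinux.org/"],
  "listing": {
    "path": "checksums",
    "digest": "xxh64:0123456789abcdef",
    "algorithm": "xxh64"
  },
  "timestamp": "2024-01-01T00:00:00Z",
  "schema": "https://example.invalid/provider-workspace-state-schema.json"
}
```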
Result Format
All results from a provider are handled by a common base-class helper (provider.Provider.results_writer()), with the storage backend driven by the application configuration (e.g. JSON flat files or a SQLite database). The data shape of the results is self-describing via an envelope with a schema reference.
The schema field is a URL to the schema that describes the data shape of the item field
The identifier field should be unique within the context of the provider’s results
The item field is the actual vulnerability data, and the shape of this field is defined by the schema
Note that the identifier is 3.3/cve-2015-8366 and not just cve-2015-8366 in order to uniquely identify cve-2015-8366 as applied to the alpine 3.3 distro version among other records in the results directory.
Currently only JSON payloads are supported.
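Putting the envelope fields together, a single result record might look roughly like this (the schema URL and the shape of the item field are illustrative placeholders; the real item shape is defined by the referenced schema):

```json
{
  "schema": "https://example.invalid/os-vulnerability-schema.json",
  "identifier": "3.3/cve-2015-8366",
  "item": {
    "vulnerability": {
      "id": "CVE-2015-8366",
      "severity": "High"
    }
  }
}
```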
Vulnerability Schemas
Possible vulnerability schemas supported within the vunnel repo are:
If at any point a breaking change needs to be made to a provider (and say the schema remains the same), then you can set the __version__ attribute on the provider class to a new integer value (incrementing from 1 onwards). This is a way to indicate that the cached input/results are not compatible with the output of the current version of the provider, in which case the next invocation of the provider will delete the previous input and results before running.
Provider Configuration
Each provider has a configuration object defined next to the provider class. This object is used in the vunnel application configuration and is passed as input to the provider class. Take the debian provider configuration for example:
from dataclasses import dataclass, field

from vunnel import provider, result


@dataclass
class Config:
    runtime: provider.RuntimeConfig = field(
        default_factory=lambda: provider.RuntimeConfig(
            result_store=result.StoreStrategy.SQLITE,
            existing_results=provider.ResultStatePolicy.DELETE_BEFORE_WRITE,
        ),
    )
    request_timeout: int = 125
Configuration Requirements
Every provider configuration must:
Be a dataclass
Have a runtime field of type provider.RuntimeConfig
The runtime field is used to configure common behaviors of the provider that are enforced within the vunnel.provider.Provider subclass.
Runtime Configuration Options
on_error: what to do when the provider fails
action: choose to fail, skip, or retry when the failure occurs
retry_count: the number of times to retry the provider before failing (only applicable when action is retry)
retry_delay: the number of seconds to wait between retries (only applicable when action is retry)
input: what to do about the input data directory on failure (such as keep or delete)
results: what to do about the results data directory on failure (such as keep or delete)
existing_results: what to do when the provider is run again and the results directory already exists
delete-before-write: delete the existing results just before writing the first processed (new) result
delete: delete existing results before running the provider
keep: keep the existing results
existing_input: what to do when the provider is run again and the input directory already exists
delete: delete the existing input before running the provider
keep: keep the existing input
result_store: where to store the results
sqlite: store results in key-value form in a SQLite database, where keys are the record identifiers and values are the JSON vulnerability records
flat-file: store results in JSON files named after the record identifiers
Any provider-specific config options can be added to the configuration object as needed (such as request_timeout, which is a common field).
Related Architecture
For more details on how Grype DB uses Vunnel output, see the Grype DB Architecture page.