Skip to content

[feature] Add index to improve package search performance #2739

@ANIRUDH-333

Description

@ANIRUDH-333

Problem Description

Queries that search for artifacts containing specific package versions are slow on large GUAC databases. This performance issue stems from a missing index on a key table, forcing the database to perform a full Sequential Scan on a table with millions of rows. This significantly impacts the usability of GUAC for dependency and vulnerability lookups at scale.

During performance analysis on a database with over 6 million entries in bill_of_materials_included_software_packages, we identified a primary bottleneck in the query execution plan.

Analysis & Evidence

The EXPLAIN plan for a typical package search query revealed a highly inefficient operation: a Seq Scan on bill_of_materials_included_software_packages, which is the largest bottleneck, scanning over 6.7 million rows.

See EXPLAIN plan BEFORE adding indexes

QUERY PLAN
Gather  (cost=1300.19..219481.05 rows=105 width=25)
  Workers Planned: 2
  ->  Hash Join  (cost=300.19..218470.55 rows=44 width=25)
        Hash Cond: (bom.artifact_id = a.id)
        ->  Nested Loop  (cost=298.49..218463.58 rows=2006 width=41)
              ->  Hash Join  (cost=298.06..217456.86 rows=2006 width=41)
                    Hash Cond: (bom_pkg.package_version_id = pv.id)
                    ->  Parallel Seq Scan on bill_of_materials_included_software_packages bom_pkg  (cost=0.00..191915.30 rows=6725830 width=32)
                    ->  Hash  (cost=298.02..298.02 rows=3 width=41)
                          ->  Hash Join  (cost=22.32..298.02 rows=3 width=41)
                                Hash Cond: (pv.name_id = pn.id)
                                Join Filter: ((((pn.name)::text = 'stdlib'::text) AND ((pv.version)::text = '1.20'::text)) OR ((pn.name)::text = 'openssl'::text) OR (((pn.name)::text = 'ant'::text) AND ((pv.version)::text = '1.10.14'::text)))
                                ->  Seq Scan on package_versions pv  (cost=0.00..251.23 rows=9323 width=43)
                                ->  Hash  (cost=22.28..22.28 rows=3 width=30)
                                      ->  Bitmap Heap Scan on package_names pn  (cost=12.86..22.28 rows=3 width=30)
                                            Recheck Cond: (((name)::text = 'stdlib'::text) OR ((name)::text = 'openssl'::text) OR ((name)::text = 'ant'::text))
                                            ->  BitmapOr  (cost=12.86..12.86 rows=3 width=0)
                                                  ->  Bitmap Index Scan on packagename_name_namespace_type  (cost=0.00..4.29 rows=1 width=0)
                                                        Index Cond: ((name)::text = 'stdlib'::text)
                                                  ->  Bitmap Index Scan on packagename_name_namespace_type  (cost=0.00..4.29 rows=1 width=0)
                                                        Index Cond: ((name)::text = 'openssl'::text)
                                                  ->  Bitmap Index Scan on packagename_name_namespace_type  (cost=0.00..4.29 rows=1 width=0)
                                                        Index Cond: ((name)::text = 'ant'::text)
              ->  Index Scan using bill_of_materials_pkey on bill_of_materials bom  (cost=0.43..0.50 rows=1 width=32)
                    Index Cond: (id = bom_pkg.bill_of_materials_id)
        ->  Hash  (cost=1.69..1.69 rows=1 width=16)
              ->  Seq Scan on artifacts a  (cost=0.00..1.69 rows=1 width=16)
                    Filter: (((algorithm)::text = 'sha256'::text) AND ((digest)::text = '2b3dd23a9614ad08a6af9faab29e2e9685c5bfd3088e769b7a5cdb9d76018349'::text))

See EXPLAIN plan AFTER adding indexes

QUERY PLAN
Hash Join  (cost=24.89..3447.70 rows=107 width=25)
  Hash Cond: (bom.artifact_id = a.id)
  ->  Nested Loop  (cost=23.18..3433.02 rows=4943 width=41)
        ->  Nested Loop  (cost=22.75..959.28 rows=4943 width=41)
              ->  Hash Join  (cost=22.32..298.02 rows=3 width=41)
                    Hash Cond: (pv.name_id = pn.id)
                    Join Filter: ((((pn.name)::text = 'stdlib'::text) AND ((pv.version)::text = '1.20'::text)) OR ((pn.name)::text = 'openssl'::text) OR (((pn.name)::text = 'ant'::text) AND ((pv.version)::text = '1.10.14'::text)))
                    ->  Seq Scan on package_versions pv  (cost=0.00..251.23 rows=9323 width=43)
                    ->  Hash  (cost=22.28..22.28 rows=3 width=30)
                          ->  Bitmap Heap Scan on package_names pn  (cost=12.86..22.28 rows=3 width=30)
                                Recheck Cond: (((name)::text = 'stdlib'::text) OR ((name)::text = 'openssl'::text) OR ((name)::text = 'ant'::text))
                                ->  BitmapOr  (cost=12.86..12.86 rows=3 width=0)
                                      ->  Bitmap Index Scan on packagename_name_namespace_type  (cost=0.00..4.29 rows=1 width=0)
                                            Index Cond: ((name)::text = 'stdlib'::text)
                                      ->  Bitmap Index Scan on packagename_name_namespace_type  (cost=0.00..4.29 rows=1 width=0)
                                            Index Cond: ((name)::text = 'openssl'::text)
                                      ->  Bitmap Index Scan on packagename_name_namespace_type  (cost=0.00..4.29 rows=1 width=0)
                                            Index Cond: ((name)::text = 'ant'::text)
              ->  Index Scan using idx_bom_included_packages_pkg_ver_id on bill_of_materials_included_software_packages bom_pkg  (cost=0.43..162.12 rows=5830 width=32)
                    Index Cond: (package_version_id = pv.id)
        ->  Index Scan using bill_of_materials_pkey on bill_of_materials bom  (cost=0.43..0.50 rows=1 width=32)
              Index Cond: (id = bom_pkg.bill_of_materials_id)
  ->  Hash  (cost=1.69..1.69 rows=1 width=16)
        ->  Seq Scan on artifacts a  (cost=0.00..1.69 rows=1 width=16)
              Filter: (((algorithm)::text = 'sha256'::text) AND ((digest)::text = 
'2b3dd23a9614ad08a6af9faab29e2e9685c5bfd3088e769b7a5cdb9d76018349'::text))

Proposed Solution

To resolve this performance issue, we propose adding an index on the package_version_id column of the bill_of_materials_included_software_packages table.

CREATE INDEX idx_bom_included_packages_pkg_ver_id
ON bill_of_materials_included_software_packages (package_version_id);

Impact

Adding this index provides a significant performance improvement for a fundamental GUAC use case. The estimated query cost dropped from ~219,000 to ~3,400 (a ~60x improvement).

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestscaleIssues related to scaling GUAC operations

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions