Skip to content

Fix transitive dependency detection for Python CVE analysis#242

Open
etsien wants to merge 1 commit into
mainfrom
transitive-dependency-fix
Open

Fix transitive dependency detection for Python CVE analysis#242
etsien wants to merge 1 commit into
mainfrom
transitive-dependency-fix

Conversation

@etsien
Copy link
Copy Markdown
Collaborator

@etsien etsien commented May 29, 2026

Summary

The vulnerability analysis pipeline was producing false-positive code_not_present verdicts
for CVEs affecting transitive Python dependencies (e.g., jinja2 pulled in by ansible, urllib3
pulled in by requests). The agent searched the application source for import jinja2, found
nothing because the app never imports it directly, and concluded the library was absent.

Three related problems caused this:

Manifest discovery gap. PythonDependencyTreeBuilder.install_dependencies opened
requirements.txt unconditionally and raised FileNotFoundError for any project that uses
pyproject.toml, setup.py, uv.lock, poetry.lock, setup.cfg, or Pipfile. The
exception was silently swallowed by the VDB builder, leaving the code index empty and making
every transitive CVE a false positive for those projects.

Site-packages not indexed. Even when installation succeeded, the packages in
transitive_env/lib/*/site-packages/ were outside the repo tree and never added to the
document index. Code Keyword Search and the Call Chain Analyzer (CCA) had no visibility into
the source of installed packages such as jinja2 or urllib3.

Installed package list unavailable. There was no indexed record of what was actually
installed in the container environment, so the agent had no way to confirm library presence
short of finding a direct import in the application source.

Changes

src/exploit_iq_commons/utils/dep_tree.py

  • detect_ecosystem now recognises uv.lock, poetry.lock, setup.cfg, and Pipfile as
    Python project indicators, so the installation step is triggered for projects that use those
    formats.

  • PythonDependencyTreeBuilder.install_dependencies is restructured around a manifest fallback
    chain. It tries each format in priority order and logs which one succeeded. The
    requirements.txt path is unchanged in behaviour. Lock-file projects (uv.lock,
    poetry.lock) are handled via uv export. pyproject.toml/setup.py/setup.cfg projects
    use uv pip install .. Pipfile projects use pipenv requirements.

  • After installation, _write_installed_packages runs pip list --format=freeze and writes
    the result to installed_packages.txt in the repository root. This file is a freeze-format
    snapshot of everything installed in the venv.

  • PythonDependencyTreeBuilder.build_tree no longer hard-codes a read of requirements.txt
    to determine direct dependencies. The new _get_direct_dependencies helper reads
    requirements.txt when present, and falls back to the indent-level-0 packages in the
    deptree output when it is not.

src/exploit_iq_commons/utils/source_code_git_loader.py

  • SourceCodeGitLoader.yield_blobs unconditionally adds installed_packages.txt to the
    include set when the file is present. This makes Code Keyword Search able to confirm library
    presence (e.g., urllib3==2.2.1) without needing to find a direct import in application
    source.

  • The new _add_site_packages_blobs static method scans transitive_env/lib/*/site-packages/
    and adds Python source files to the include set for any package directory containing at most
    150 .py files. This lets the CCA and Code Keyword Search trace call chains across package
    boundaries (for example, confirming that an operator's use of Ansible transitively reaches
    the vulnerable jinja2 template path). Packages exceeding the file-count limit,
    .dist-info/.egg-info metadata directories, ansible_collections, test directories, and
    __pycache__ are excluded.

@etsien etsien requested review from RedTanny and zvigrinberg May 30, 2026 01:56
tmihalac added a commit to tmihalac/vulnerability-analysis that referenced this pull request May 31, 2026
-Small param name fix
-Reverted some python dependencies changes since there are handled better in RHEcosystemAppEng#242
-Show actual synthetic question text instead of constant "(synthetic)" placeholder

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac
Copy link
Copy Markdown

A small suggestion, add this as well:

if "UV_CACHE_DIR" not in os.environ:
            fallback_cache = manifest_path / ".uv_cache"
            try:
                fallback_cache.mkdir(exist_ok=True)
                os.environ["UV_CACHE_DIR"] = str(fallback_cache)
            except OSError:
                pass

When uv runs without UV_CACHE_DIR set, it defaults to ~/.cache/uv. In
OpenShift containers the home directory is often read-only (non-root UID,
restricted SecurityContextConstraints), so uv venv or uv pip install fails
with a permission error trying to create its cache.

The fallback creates .uv_cache inside manifest_path (the cloned repo
directory, which is writable on the PVC) and sets UV_CACHE_DIR to point
there. The try/except OSError handles the edge case where even that
directory can't be created.

tmihalac added a commit that referenced this pull request May 31, 2026
#240)

* Fix Go method regex regression and add patch-based function enrichment

  Go segmenter regex fixes:
  - Fix parse_all_methods regex that failed to match Go methods
  with
    pointer parameters (e.g. *wire.AckFrame) — old param pattern
    [,a-zA-Z0-9\s\[\].]* excluded '*', new pattern [^)]* matches
  all
    param types
  - Fix receiver pattern to support bracket types
  [a-zA-Z0-9\s\*.\[\]]+
    and underscores in method names [a-zA-Z_][a-zA-Z0-9_]*
  - Fix return type group to use (\(?[^{]*\)?)? instead of the
    (\(?(?:[^{}]|\{[^}]*\})*\)?)? variant which greedily consumed
    function bodies through the \{[^}]*\} alternation, causing
  short
    functions (e.g. GetLowestPacketNotConfirmedAcked) to swallow
  the
    next method's signature
  - Result: 31 methods extracted from sent_packet_handler.go (was
  26/27),
    including detectAndRemoveAckedPackets which is critical for
    CVE-2025-29785

  Patch-based vulnerable function enrichment (intel_utils.py):
  - Add enrich_vulnerable_functions_from_patch() — when GHSA/OSV
  have no
    vulnerable_functions, extract them from GitHub fix commit
  diffs
  - Parse function names from unified diff hunk headers and added
  lines
    using ecosystem-aware regex patterns (Go, Python, Java, JS, C)
  - Prioritize GHSA references tagged as type=FIX
  - Authenticate GitHub API calls via existing GHSA_API_KEY env
  var
  - Limit to 3 commits max to avoid excessive API calls

  Go enrichment pipeline refactoring (cve_agent.py):
  - Move enrich_go_candidates and validate_go_vendor_packages from
    reachability_agent.py to intel_utils.py for reuse across
  agents
  - Call Go enrichment and patch enrichment in _process_steps
  before
    dispatching checklist questions
  - Batch-dispatch all questions to routing LLM via asyncio.gather
    instead of sequential per-step dispatch
  - Inject synthetic reachability question when no checklist
  question
    targets reachability but candidate packages exist
  - Handle routing failures gracefully (log warning, skip failed
  steps)
  - Fix _postprocess_results index-out-of-bounds for synthetic
  questions

  CCA debug logging (chain_of_calls_retriever.py):
  - Add structured logging throughout DFS traversal: query
  parsing, tree
    lookup, __find_initial_function (doc counts, found/not-found),
    each DFS iteration (function, package, path length, caller
  found),
    backtracking, and final result
  - Fix potential IndexError: check parent_parents is non-empty
  before
    accessing [0] in __find_caller_function_dfs
  - Fix CCA query parsing for Go sub-packages containing '/' in
  the
    function portion (e.g. internal/ackhandler.func)

  Go parser debug logging (golang_functions_parsers.py):
  - Add logging to search_for_called_function, __check_identifier,
    and __trace_down_package for tracing Go method resolution
  - Fix __check_identifier to pass receiver chain (parts[:-1])
  instead
    of full identifier expression to __trace_down_package
  - Fix struct field type extraction to handle fields without
  trailing
    space (find(" ") returning -1)

  Other changes:
  - Add FL debug log showing package doc count before fuzzy
  matching
  - Expand full_text_search to widen to 500 results when top-50
  returns
    only dependency docs and no application docs
  - Relax Go transitive search test assertion from exact path
  length to
    len > 1 with root package check

* Removed debug logging and fixed tests

* Removed unused logger

* Fix IUA tantivy DocAddress bug and add synthetic reachability safety net

  - Fix Import Usage Analyzer: use all_query() to get proper
  DocAddress objects
    instead of broken range(num_docs) iteration (tool was
  non-functional)
  - Relax synthetic reachability question condition to fire when
  candidate_packages
    exists, even without vulnerable_functions
  - Consolidate Go import regex into single pattern handling
  aliased/grouped imports
  - Fix _find_usage_in_file to extract short names from Go
  slash-separated paths
  - Always run patch-based function enrichment, not only when vuln
   functions empty

* Fix patch enrichment accuracy regression with test file filtering

  - Skip test files at file level in
  enrich_vulnerable_functions_from_patch
    (_TEST_FILE_RE matches _test.go, *Test.java, test_*.py,
  *.test.js,
    *_test.c, src/test/ across all ecosystems)
  - Restore vulnerable_functions.add() — patch-extracted functions
   from
    non-test files are safe for Rule 9 enforcement
  - Restore conditional enrichment pattern in _process_steps so
  enriched
    functions reach precomputed_intel

* Changed redpanda pull policy to IfNotPresent

* CCA Go sub-package disambiguation
  - Same-package shortcut returned True without verifying the caller
    was in the callee package (time.Parse matching strvals.Parse)
  - Function name regex matched suffixes (MustParse matching Parse)
    — added negative lookbehind for word boundary
  - Type resolution matched types globally without checking the caller
    imports the callee package (pattern.Parser matching jwt.Parser)

* Fix Go CCA false-positive chains, demote OSV/patch enrichment to hints

  - Demote OSV-enriched data to critical_context hints instead of
  candidate_packages/vulnerable_functions
  - Demote patch-extracted functions to critical_context hints instead of
  vulnerable_functions
  - Fix reachability_agent already_enriched check to match new hint format
  - Handle missing requirements.txt in Python dep tree builder (fallback to
  pyproject.toml/setup.py)
  - Set UV_CACHE_DIR fallback for container permission issues
  - Add PythonDependencyTreeBuilder manifest detection and fallback tests
  - Update intel_utils tests to assert hint-only enrichment

* Changes following code review

-Small param name fix
-Reverted some python dependencies changes since there are handled better in #242
-Show actual synthetic question text instead of constant "(synthetic)" placeholder

* Changes following code review

-Removed incorrect tests

* Changes following code review

Fall back to reachability agent on routing failure instead of dropping questions

  - Upgrade routing failure log level from warning to error
  - Replace failed routings with default reachability QuestionRouting instead
  of removing them from routed_steps

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@zvigrinberg
Copy link
Copy Markdown
Collaborator

A small suggestion, add this as well:

if "UV_CACHE_DIR" not in os.environ:
            fallback_cache = manifest_path / ".uv_cache"
            try:
                fallback_cache.mkdir(exist_ok=True)
                os.environ["UV_CACHE_DIR"] = str(fallback_cache)
            except OSError:
                pass

When uv runs without UV_CACHE_DIR set, it defaults to ~/.cache/uv. In OpenShift containers the home directory is often read-only (non-root UID, restricted SecurityContextConstraints), so uv venv or uv pip install fails with a permission error trying to create its cache.

The fallback creates .uv_cache inside manifest_path (the cloned repo directory, which is writable on the PVC) and sets UV_CACHE_DIR to point there. The try/except OSError handles the edge case where even that directory can't be created.

@etsien This is important...
But just instead of mutating the UV_CACHE_DIR environment variable of the agent process globally, it's better to pass it as env var to the subprocess.run that invokes the uv package manager binary ( in order to prevent potential race conditions and errors between threads that run python analysis jobs concurrently).

Copy link
Copy Markdown
Collaborator

@zvigrinberg zvigrinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@etsien This is a great idea... good job.
Did you test it successfully for failed scenarios?
Please see @tmihalac comment and mine following one.

@etsien
Copy link
Copy Markdown
Collaborator Author

etsien commented Jun 1, 2026

A small suggestion, add this as well:

if "UV_CACHE_DIR" not in os.environ:
            fallback_cache = manifest_path / ".uv_cache"
            try:
                fallback_cache.mkdir(exist_ok=True)
                os.environ["UV_CACHE_DIR"] = str(fallback_cache)
            except OSError:
                pass

When uv runs without UV_CACHE_DIR set, it defaults to ~/.cache/uv. In OpenShift containers the home directory is often read-only (non-root UID, restricted SecurityContextConstraints), so uv venv or uv pip install fails with a permission error trying to create its cache.
The fallback creates .uv_cache inside manifest_path (the cloned repo directory, which is writable on the PVC) and sets UV_CACHE_DIR to point there. The try/except OSError handles the edge case where even that directory can't be created.

@etsien This is important... But just instead of mutating the UV_CACHE_DIR environment variable of the agent process globally, it's better to pass it as env var to the subprocess.run that invokes the uv package manager binary ( in order to prevent potential race conditions and errors between threads that run python analysis jobs concurrently).

ok will do

@etsien
Copy link
Copy Markdown
Collaborator Author

etsien commented Jun 1, 2026

@etsien This is a great idea... good job. Did you test it successfully for failed scenarios? Please see @tmihalac comment and mine following one.

Yes, I had 4 python samples consistently fail, silent failures, because it can't unpack the imports. After the changes, they are now finding and indexing the 3rd party packages to trace the dependencies. This is reflected in the agent traces inside the checklist.

@etsien
Copy link
Copy Markdown
Collaborator Author

etsien commented Jun 1, 2026

/test-heavy

@tmihalac
Copy link
Copy Markdown

tmihalac commented Jun 2, 2026

@etsien Following up on the UV_CACHE_DIR discussion and @zvigrinberg's feedback — we hit this exact issue in the latest integration test run and traced the full failure chain. The problem goes deeper than just the cache directory:

What happened (integration test log evidence)

  1. Pickle cache warmcollect_documents() returns cached docs → install_dependencies() never runs
  2. uv pip install deptree fails silently — Permission denied creating ~/.cache/uv (no UV_CACHE_DIR set)
  3. transitive_env/bin/python: not found — the venv was created by a previous pod; bin/python is a symlink to that pod's interpreter path, which doesn't exist in the new pod
  4. Python version mismatch — when re-creating the venv, uv venv downloads the latest Python (3.13) but existing packages on the PVC are under python3.12/site-packages/. deptree with 3.13 only sees deptree + setuptools, not werkzeug/flask
  5. deptree returns wrong outputsupported_packages missing real packages → FL says "Package 'werkzeug' not found" → agent can't call CCA → false code_not_present verdict

Suggested additions to build_tree

Since install_dependencies is skipped when the pickle cache is warm, the fixes must live in build_tree (or helpers it calls):

def _ensure_uv_cache_dir(self, manifest_path: Path):
    """Set UV_CACHE_DIR if not already configured.

    In OpenShift containers the default ~/.cache/uv is often read-only.
    This is normally set during install_dependencies(), but when the pickle
    cache is warm that method is skipped entirely.
    """
    if "UV_CACHE_DIR" not in os.environ:
        fallback_cache = manifest_path / ".uv_cache"
        try:
            fallback_cache.mkdir(exist_ok=True)
            os.environ["UV_CACHE_DIR"] = str(fallback_cache)
            logger.info("Set UV_CACHE_DIR=%s (home dir may be read-only)", fallback_cache)
        except OSError:
            logger.warning("Could not create UV_CACHE_DIR at %s", fallback_cache)

@staticmethod
def _detect_existing_site_packages_version(manifest_path: Path) -> str | None:
    """Detect the Python version from an existing site-packages directory.

    When re-creating a broken venv, we must match the Python version that
    the packages were originally installed with (e.g. python3.12), otherwise
    the new interpreter won't see the existing site-packages.
    """
    lib_dir = manifest_path / TRANSITIVE_ENV_NAME / "lib"
    if not lib_dir.exists():
        return None
    for entry in sorted(lib_dir.iterdir()):
        if entry.is_dir() and entry.name.startswith("python") and (entry / "site-packages").exists():
            version = entry.name.replace("python", "")
            logger.info("Detected existing site-packages for Python %s", version)
            return version
    return None

def _ensure_venv_python(self, manifest_path: Path) -> str:
    """Ensure transitive_env has a working python binary.

    When the pickle cache is warm, install_dependencies() is skipped and
    the venv from a previous pod may have a broken bin/python symlink
    (points to the old pod's interpreter path). Re-create the venv
    in-place so deptree can run; existing site-packages on the PVC
    are preserved by uv venv (verified — it only replaces bin/ and metadata).
    """
    venv_python = f"{manifest_path}/{TRANSITIVE_ENV_NAME}/bin/python"
    if Path(venv_python).exists():
        return venv_python
    logger.warning("Venv python not found at %s — re-creating venv for deptree", venv_python)
    # Must match the Python version of existing site-packages, not project metadata
    python_version = self._detect_existing_site_packages_version(manifest_path)
    if not python_version:
        python_version = self.determine_python_version(str(manifest_path))
    if python_version:
        cmd = f"cd {manifest_path} && uv venv {TRANSITIVE_ENV_NAME} --python {python_version}"
    else:
        cmd = f"cd {manifest_path} && uv venv {TRANSITIVE_ENV_NAME}"
    run_command(cmd)
    return venv_python

def _build_flat_tree_from_manifest(self, manifest_path: Path) -> str:
    """Build a flat deptree-format string from requirements.txt as a fallback
    when deptree cannot run (e.g. venv binary missing after pickle cache hit)."""
    lines = []
    try:
        with open(manifest_path / PYTHON_MANIFEST, "r") as manifest:
            for line in manifest:
                line = line.strip()
                if line and not PythonLanguageFunctionsParser.is_comment_line(line):
                    lines.append(line)
    except FileNotFoundError:
        logger.error("requirements.txt not found at %s — cannot build fallback tree", manifest_path)
        return ""
    if lines:
        logger.warning("Using requirements.txt fallback for dependency tree (%d packages)", len(lines))
    return os.linesep.join(lines)

def build_tree(self, manifest_path: Path) -> defaultdict[Any, list]:
    self._ensure_uv_cache_dir(manifest_path)
    venv_python = self._ensure_venv_python(manifest_path)
    run_command(f'uv pip install "setuptools<81" deptree --python {venv_python}')
    dependencies = run_command(f"{venv_python} -m deptree")
    if not dependencies or not dependencies.strip():
        logger.error("deptree returned empty output — ...")
        dependencies = self._build_flat_tree_from_manifest(manifest_path)
    # ... rest unchanged

Re @zvigrinberg's point about not mutating os.environ globally — agreed that passing env to subprocess.run is cleaner. For _ensure_uv_cache_dir the os.environ approach works because all uv calls in the same process benefit, but if you prefer subprocess-level env, that works too.

The _ensure_venv_python + _detect_existing_site_packages_version parts are critical — without them, any run where the pickle cache survives a pod restart will fail for Python CVEs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants