Fix transitive dependency detection for Python CVE analysis #242
- Small param name fix
- Reverted some python dependencies changes since they are handled better in RHEcosystemAppEng#242
- Show actual synthetic question text instead of constant "(synthetic)" placeholder

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
A small suggestion, add this as well: When uv runs without UV_CACHE_DIR set, it defaults to ~/.cache/uv. In The fallback creates .uv_cache inside manifest_path (the cloned repo
#240)

* Fix Go method regex regression and add patch-based function enrichment

Go segmenter regex fixes:
- Fix parse_all_methods regex that failed to match Go methods with pointer parameters (e.g. *wire.AckFrame); old param pattern [,a-zA-Z0-9\s\[\].]* excluded '*', new pattern [^)]* matches all param types
- Fix receiver pattern to support bracket types [a-zA-Z0-9\s\*.\[\]]+ and underscores in method names [a-zA-Z_][a-zA-Z0-9_]*
- Fix return type group to use (\(?[^{]*\)?)? instead of the (\(?(?:[^{}]|\{[^}]*\})*\)?)? variant which greedily consumed function bodies through the \{[^}]*\} alternation, causing short functions (e.g. GetLowestPacketNotConfirmedAcked) to swallow the next method's signature
- Result: 31 methods extracted from sent_packet_handler.go (was 26/27), including detectAndRemoveAckedPackets which is critical for CVE-2025-29785

Patch-based vulnerable function enrichment (intel_utils.py):
- Add enrich_vulnerable_functions_from_patch(): when GHSA/OSV have no vulnerable_functions, extract them from GitHub fix commit diffs
- Parse function names from unified diff hunk headers and added lines using ecosystem-aware regex patterns (Go, Python, Java, JS, C)
- Prioritize GHSA references tagged as type=FIX
- Authenticate GitHub API calls via existing GHSA_API_KEY env var
- Limit to 3 commits max to avoid excessive API calls

Go enrichment pipeline refactoring (cve_agent.py):
- Move enrich_go_candidates and validate_go_vendor_packages from reachability_agent.py to intel_utils.py for reuse across agents
- Call Go enrichment and patch enrichment in _process_steps before dispatching checklist questions
- Batch-dispatch all questions to routing LLM via asyncio.gather instead of sequential per-step dispatch
- Inject synthetic reachability question when no checklist question targets reachability but candidate packages exist
- Handle routing failures gracefully (log warning, skip failed steps)
- Fix _postprocess_results index-out-of-bounds for synthetic questions

CCA debug logging (chain_of_calls_retriever.py):
- Add structured logging throughout DFS traversal: query parsing, tree lookup, __find_initial_function (doc counts, found/not-found), each DFS iteration (function, package, path length, caller found), backtracking, and final result
- Fix potential IndexError: check parent_parents is non-empty before accessing [0] in __find_caller_function_dfs
- Fix CCA query parsing for Go sub-packages containing '/' in the function portion (e.g. internal/ackhandler.func)

Go parser debug logging (golang_functions_parsers.py):
- Add logging to search_for_called_function, __check_identifier, and __trace_down_package for tracing Go method resolution
- Fix __check_identifier to pass receiver chain (parts[:-1]) instead of full identifier expression to __trace_down_package
- Fix struct field type extraction to handle fields without trailing space (find(" ") returning -1)

Other changes:
- Add FL debug log showing package doc count before fuzzy matching
- Expand full_text_search to widen to 500 results when top-50 returns only dependency docs and no application docs
- Relax Go transitive search test assertion from exact path length to len > 1 with root package check

* Removed debug logging and fixed tests

* Removed unused logger

* Fix IUA tantivy DocAddress bug and add synthetic reachability safety net
- Fix Import Usage Analyzer: use all_query() to get proper DocAddress objects instead of broken range(num_docs) iteration (tool was non-functional)
- Relax synthetic reachability question condition to fire when candidate_packages exists, even without vulnerable_functions
- Consolidate Go import regex into single pattern handling aliased/grouped imports
- Fix _find_usage_in_file to extract short names from Go slash-separated paths
- Always run patch-based function enrichment, not only when vuln functions empty

* Fix patch enrichment accuracy regression with test file filtering
- Skip test files at file level in enrich_vulnerable_functions_from_patch (_TEST_FILE_RE matches _test.go, *Test.java, test_*.py, *.test.js, *_test.c, src/test/ across all ecosystems)
- Restore vulnerable_functions.add(): patch-extracted functions from non-test files are safe for Rule 9 enforcement
- Restore conditional enrichment pattern in _process_steps so enriched functions reach precomputed_intel

* Changed redpanda pull policy to IfNotPresent

* CCA Go sub-package disambiguation
- Same-package shortcut returned True without verifying the caller was in the callee package (time.Parse matching strvals.Parse)
- Function name regex matched suffixes (MustParse matching Parse); added negative lookbehind for word boundary
- Type resolution matched types globally without checking the caller imports the callee package (pattern.Parser matching jwt.Parser)

* Fix Go CCA false-positive chains, demote OSV/patch enrichment to hints
- Demote OSV-enriched data to critical_context hints instead of candidate_packages/vulnerable_functions
- Demote patch-extracted functions to critical_context hints instead of vulnerable_functions
- Fix reachability_agent already_enriched check to match new hint format
- Handle missing requirements.txt in Python dep tree builder (fallback to pyproject.toml/setup.py)
- Set UV_CACHE_DIR fallback for container permission issues
- Add PythonDependencyTreeBuilder manifest detection and fallback tests
- Update intel_utils tests to assert hint-only enrichment

* Changes following code review
- Small param name fix
- Reverted some python dependencies changes since they are handled better in #242
- Show actual synthetic question text instead of constant "(synthetic)" placeholder

* Changes following code review
- Removed incorrect tests

* Changes following code review
Fall back to reachability agent on routing failure instead of dropping questions
- Upgrade routing failure log level from warning to error
- Replace failed routings with default reachability QuestionRouting instead of removing them from routed_steps

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@etsien This is important...
ok will do
Yes, I had 4 python samples consistently fail, silent failures, because it can't unpack the imports. After the changes, they are now finding and indexing the 3rd party packages to trace the dependencies. This is reflected in the agent traces inside the checklist.
/test-heavy |
@etsien Following up on the What happened (integration test log evidence)
Suggested additions to
Summary
The vulnerability analysis pipeline was producing false-positive code_not_present verdicts for CVEs affecting transitive Python dependencies (e.g., jinja2 pulled in by ansible, urllib3 pulled in by requests). The agent searched the application source for import jinja2, found nothing because the app never imports it directly, and concluded the library was absent.
Three related problems caused this:
Manifest discovery gap.
PythonDependencyTreeBuilder.install_dependencies opened requirements.txt unconditionally and raised FileNotFoundError for any project that uses pyproject.toml, setup.py, uv.lock, poetry.lock, setup.cfg, or Pipfile. The exception was silently swallowed by the VDB builder, leaving the code index empty and making every transitive CVE a false positive for those projects.
Site-packages not indexed. Even when installation succeeded, the packages in transitive_env/lib/*/site-packages/ were outside the repo tree and never added to the document index. Code Keyword Search and the Call Chain Analyzer (CCA) had no visibility into the source of installed packages such as jinja2 or urllib3.
Installed package list unavailable. There was no indexed record of what was actually
installed in the container environment, so the agent had no way to confirm library presence
short of finding a direct import in the application source.
Changes
src/exploit_iq_commons/utils/dep_tree.py
detect_ecosystem now recognises uv.lock, poetry.lock, setup.cfg, and Pipfile as Python project indicators, so the installation step is triggered for projects that use those formats.
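The indicator check described above could look roughly like the following. This is a minimal sketch, not the code in dep_tree.py: the function name looks_like_python_project and the PYTHON_INDICATORS tuple are hypothetical, and the real detect_ecosystem may inspect more than the repository root.

```python
from pathlib import Path

# Manifest names taken from this PR description. The tuple name and the
# root-only lookup are assumptions made for this sketch.
PYTHON_INDICATORS = (
    "requirements.txt",
    "pyproject.toml",
    "setup.py",
    "setup.cfg",
    "uv.lock",
    "poetry.lock",
    "Pipfile",
)


def looks_like_python_project(repo_dir):
    """Return True if any known Python manifest exists at the repo root."""
    root = Path(repo_dir)
    return any((root / name).is_file() for name in PYTHON_INDICATORS)
```

With a check like this, a repository that ships only uv.lock or Pipfile is still classified as Python, so the installation step runs.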
PythonDependencyTreeBuilder.install_dependencies is restructured around a manifest fallback chain. It tries each format in priority order and logs which one succeeded. The requirements.txt path is unchanged in behaviour. Lock-file projects (uv.lock, poetry.lock) are handled via uv export. pyproject.toml / setup.py / setup.cfg projects use uv pip install . (installing the project directory itself). Pipfile projects use pipenv requirements.
After installation, _write_installed_packages runs pip list --format=freeze and writes the result to installed_packages.txt in the repository root. This file is a freeze-format snapshot of everything installed in the venv.
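The fallback chain can be pictured as an ordered table of (manifest, strategy) pairs that is walked until a manifest is found. The sketch below only selects a strategy and does not run anything. The name choose_install_strategy and the exact ordering are assumptions, since the description says only that formats are tried "in priority order"; the strategy strings repeat the commands named above.

```python
from pathlib import Path

# (manifest file, strategy named in the PR description).
# The ORDER is an assumption made for this sketch.
FALLBACK_CHAIN = [
    ("requirements.txt", "requirements.txt install (unchanged path)"),
    ("uv.lock", "uv export"),
    ("poetry.lock", "uv export"),
    ("pyproject.toml", "uv pip install ."),
    ("setup.py", "uv pip install ."),
    ("setup.cfg", "uv pip install ."),
    ("Pipfile", "pipenv requirements"),
]


def choose_install_strategy(repo_dir):
    """Return (manifest, strategy) for the first manifest found, else None."""
    root = Path(repo_dir)
    for manifest, strategy in FALLBACK_CHAIN:
        if (root / manifest).is_file():
            return manifest, strategy
    # No known manifest: the caller decides how to report this instead of
    # letting a FileNotFoundError be swallowed further up.
    return None
```

Returning None for "nothing found" keeps the failure explicit, which matters here because the original bug was an exception that disappeared silently.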
PythonDependencyTreeBuilder.build_tree no longer hard-codes a read of requirements.txt to determine direct dependencies. The new _get_direct_dependencies helper reads requirements.txt when present, and falls back to the indent-level-0 packages in the deptree output when it is not.
src/exploit_iq_commons/utils/source_code_git_loader.py
SourceCodeGitLoader.yield_blobs unconditionally adds installed_packages.txt to the include set when the file is present. This makes Code Keyword Search able to confirm library presence (e.g., urllib3==2.2.1) without needing to find a direct import in application source.
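To illustrate why an indexed installed_packages.txt is enough to confirm presence, here is a small sketch that parses freeze-format text and looks up a package. The helpers parse_freeze and is_installed are hypothetical and not part of this PR. In the pipeline the lookup happens through Code Keyword Search over the indexed file, not through a parser like this.

```python
def parse_freeze(text):
    """Map lower-cased package name -> pinned version from freeze-format text."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries without a pinned version
        # (for example "name @ file:///..." lines).
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, _, version = line.partition("==")
        packages[name.strip().lower()] = version.strip()
    return packages


def is_installed(text, package):
    """Case-insensitive presence check; does not normalise '-' versus '_'."""
    return package.lower() in parse_freeze(text)
```

For example, given the text "Jinja2==3.1.4\nurllib3==2.2.1\n", is_installed(text, "jinja2") is true even though no application file contains import jinja2.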
The new _add_site_packages_blobs static method scans transitive_env/lib/*/site-packages/ and adds Python source files to the include set for any package directory containing at most 150 .py files. This lets the CCA and Code Keyword Search trace call chains across package boundaries (for example, confirming that an operator's use of Ansible transitively reaches the vulnerable jinja2 template path). Packages exceeding the file-count limit, .dist-info / .egg-info metadata directories, ansible_collections, test directories, and __pycache__ are excluded.
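A rough sketch of the scan described above, under stated assumptions: the function name collect_site_package_sources, the test directory names ("tests", "test"), and counting .py files after test directories are filtered out are all guesses, and top-level single-file modules are ignored. The actual _add_site_packages_blobs may differ on each of these points.

```python
from pathlib import Path

MAX_PY_FILES = 150  # per-package limit stated in the PR description
# Directory names skipped in this sketch; the real exclusion list may differ.
EXCLUDED_DIR_NAMES = {"ansible_collections", "tests", "test", "__pycache__"}
METADATA_SUFFIXES = (".dist-info", ".egg-info")


def collect_site_package_sources(env_dir):
    """Return .py files from small-enough packages under lib/*/site-packages/."""
    selected = []
    for site_packages in sorted(Path(env_dir).glob("lib/*/site-packages")):
        package_dirs = sorted(p for p in site_packages.iterdir() if p.is_dir())
        for package_dir in package_dirs:
            name = package_dir.name
            if name in EXCLUDED_DIR_NAMES or name.endswith(METADATA_SUFFIXES):
                continue
            # Keep files whose parent directories contain no excluded name.
            py_files = [
                f
                for f in package_dir.rglob("*.py")
                if not EXCLUDED_DIR_NAMES.intersection(
                    f.relative_to(package_dir).parts[:-1]
                )
            ]
            # Assumption: the cap is applied after the exclusions above.
            if len(py_files) <= MAX_PY_FILES:
                selected.extend(sorted(py_files))
    return selected
```

The per-package cap bounds index growth: a large framework is skipped entirely, while small libraries such as jinja2 or urllib3 have their source indexed.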