Fix Xygeni parser deduplicating repeated SAST/Secrets findings in the same file#15003
Fix Xygeni parser deduplicating repeated SAST/Secrets findings in the same file#15003lmrb-1968 wants to merge 2 commits into
Conversation
Dedup on the per-occurrence issueId instead of uniqueHash (which Xygeni reuses across occurrences of the same value), keeping uniqueHash as vuln_id_from_tool. SCA is unchanged. Adds regression tests and updates the parser docs.
valentijnscholten
left a comment
There was a problem hiding this comment.
Thanks for looking into this and the PR. The changes does mean that users will have to take a one time "hit" as existing findings will be closed when a reimport occurs where the unique_id_from_tool will now no longer match. Do you see any way around it? If not we'll need some notes in the upgrade notes for 2.59.1 (version number to be determined).
Another approach would be to create a V2 parser like we have done in other places. My feeling is for this parser it would be OK to take the one time hit.
|
Thanks, @valentijnscholten. I don't see a clean way around the one-time hit. The old findings were stored with I agree the one-time hit is acceptable here rather than a V2 parser, since the old behavior was hiding real findings. I've added an upgrade note to |
This is a bug fix for an existing parser (the Xygeni parser added in #14769), which falls under the "Acceptable examples" in CONTRIBUTING (bug fix for an existing parser), so no separate pre-approval issue is filed.
Description
The Xygeni parser deduplicated away legitimate findings when the same secret value or code pattern appeared more than once in a single file, so only one occurrence showed up after import.
Root cause: Xygeni reuses a single
uniqueHashacross every occurrence of the same value in a file (it hashes the value, not the location), while giving each occurrence a distinctissueIdthat encodes the file path and line. The SAST and Secrets scan types deduplicate onunique_id_from_tool, which was set touniqueHash— so occurrences after the first were marked as duplicates and hidden.Fix: for SAST and Secrets, key
unique_id_from_toolon the per-occurrenceissueId(falling back touniqueHashifissueIdis absent), and keepuniqueHashasvuln_id_from_toolso it still groups occurrences of the same value. Each occurrence is now its own Finding.SCA is intentionally left unchanged: there
uniqueHash(CVE#package:version) is unique per finding, whileissueId(SCA.CVE-…) collides across packages, souniqueHashremains the correct dedup key.No model, settings, or migration changes.
Test results
Extended
unittests/tools/test_xygeni_parser.py:unique_id_from_tool(issueId) values and addedvuln_id_from_tool == uniqueHashchecks.test_secrets_repeated_in_same_file_stay_distinct— a secret repeated on lines 9 and 29 of one file now yields two findings with distinctunique_id_from_tool.test_sast_repeated_detector_in_same_file_stays_distinct— a detector flagged on four lines now yields four distinct findings.Against the checked-in fixtures the parser now produces 61/61 distinct ids for Secrets and 501/501 for SAST (previously several collapsed). Ruff (0.15.15) passes on all changed files.
Documentation
Updated
docs/content/supported_tools/parsers/file/xygeni.md— the Deduplication section now shows the per-scan-typeunique_id_from_tool/vuln_id_from_toolmapping and explains why SAST/Secrets key onissueIdwhile SCA keys onuniqueHash.Checklist
dev. (Rebased onto the latestbugfix, the correct base for a bug fix.)bugfixbranch.bugfix)