[WIP] Profiling-guided optimisations#8906

Draft
tautschnig wants to merge 33 commits into diffblue:develop from tautschnig:experiment/hash-optimization

Conversation

@tautschnig
Collaborator

  • Each commit message has a non-empty body, explaining why the change was made.
  • Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
  • The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/
  • Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
  • My commit message includes data points confirming performance improvements (if claimed).
  • My PR is restricted to a single feature or bugfix.
  • White-space or formatting changes outside the feature-related changed lines are in commits of their own.

tautschnig and others added 30 commits March 28, 2026 19:44
Document findings from extensive profiling (15 benchmarks × 3 runs,
49,462 samples) identifying the top 8 hotspots in CBMC's pre-solver
pipeline with call chain analysis and source locations.

Key findings:
- String interning (11.7%), sharing tree destruction (9.9%), and
  memory allocation (~25%) dominate
- symbol_table_baset::next_unused_suffix (4.9%) is a clear algorithmic
  improvement opportunity (O(n) probing → O(1) counter)
- symex_dead.cpp triggers unnecessary copy-on-write detach (4.6%)

Includes a prioritized optimization plan (P1-P6) ranked by impact
and feasibility.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Investigate the 11.7% string interning hotspot identified in profiling.
Test three approaches: FNV-1a hash (3.6% slower), pre-reserve (no
effect), and reduced load factor (1.3-1.8% faster).

Key finding: the bottleneck is not hash quality or table configuration
but the volume of redundant string_containert::get() calls (11.5M on
csmith_42, 98.6% hit rate). Each call rebuilds the SSA identifier
string via ostringstream and re-interns it.

The recommended optimization is to avoid redundant interning by
caching SSA identifiers or building them without ostringstream.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Consolidate the string hashing investigation results into
PROFILING_ANALYSIS.md and update the optimization priority table.
Key finding: the 11.7% string interning cost is from redundant
update_identifier calls, not hash function quality.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…lding

Replace std::ostringstream with direct std::string concatenation in
build_ssa_identifier_rec and initialize_ssa_identifier. This avoids
heap allocation for the stream buffer on every update_identifier call.
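
The pattern can be sketched as follows; `build_ssa_identifier` is a hypothetical stand-in for CBMC's `build_ssa_identifier_rec`/`initialize_ssa_identifier`, shown only to illustrate the concatenation idiom:

```cpp
// Append to a std::string rather than streaming through an
// ostringstream, avoiding the stream's heap-allocated buffer on
// every call. Names and suffix layout are illustrative.
#include <cstddef>
#include <string>

std::string build_ssa_identifier(
  const std::string &base,
  std::size_t l0,
  std::size_t l1,
  std::size_t l2)
{
  std::string id;
  id.reserve(base.size() + 24); // one allocation for typical suffixes
  id += base;
  id += '!';
  id += std::to_string(l0);
  id += '@';
  id += std::to_string(l1);
  id += '#';
  id += std::to_string(l2);
  return id;
}
```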

Profiling showed update_identifier is called 11.5M times on a typical
CSmith benchmark, with 98.6% of calls re-interning an already-known
string. Each call previously allocated two ostringstream objects.

Performance improvement (5 runs each, Release build):
  linked_list:       1.49s → 1.46s  ( 2.0% faster)
  array_ops:         3.13s → 2.78s  (11.2% faster)
  csmith_42:         8.89s → 8.58s  ( 3.5% faster)
  csmith_1111111111: 16.91s → 15.92s ( 5.9% faster)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Document the 2-11% speedup from replacing ostringstream with string
concatenation in SSA identifier building. Note remaining optimization
potential from caching and reducing update_identifier call count.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
When only the L2 suffix changes, derive the new identifier from the
cached L1 object identifier (base!l0@l1) by appending #l2, instead
of rebuilding the entire string from the expression tree.

This avoids the recursive walk through build_ssa_identifier_rec for
the ~16-40% of update_identifier calls that come from set_level_2.

Additional speedup over the previous ostringstream optimization:
  array_ops:         2.78s → 2.71s (2.5% faster)
  csmith_1111111111: 15.92s → 15.76s (1.0% faster)

Combined speedup vs original baseline:
  linked_list:       1.49s → 1.45s ( 2.7%)
  array_ops:         3.13s → 2.71s (13.4%)
  csmith_42:         8.89s → 8.55s ( 3.8%)
  csmith_1111111111: 16.91s → 15.76s ( 6.8%)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Both set_level_0 and set_level_1 are only called when the respective
level was previously empty (callers guard against re-setting). This
means the current identifier can be extended by appending !l0 or @l1
instead of rebuilding from the expression tree.

Combined with the previous two commits, this eliminates all full
identifier rebuilds in the L0→L1→L2 rename chain. Only set_expression
(which changes the base expression) still requires a full rebuild.
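
The resulting append-only chain might look like this sketch (free functions standing in for the actual setter methods, shown only to illustrate the idiom):

```cpp
// Each rename level extends the current identifier instead of
// rebuilding it from the expression tree. Illustrative helpers,
// not CBMC's ssa_exprt interface.
#include <cstddef>
#include <string>

std::string append_level_0(const std::string &id, std::size_t l0)
{
  return id + '!' + std::to_string(l0);
}

std::string append_level_1(const std::string &id, std::size_t l1)
{
  return id + '@' + std::to_string(l1);
}

std::string append_level_2(const std::string &id, std::size_t l2)
{
  return id + '#' + std::to_string(l2);
}
```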

Combined speedup vs original baseline (5 runs, Release build):
  linked_list:       1.49s → 1.44s ( 3.4%)
  array_ops:         3.13s → 2.69s (14.1%)
  csmith_42:         8.89s → 8.44s ( 5.1%)
  csmith_1111111111: 16.91s → 15.75s ( 6.9%)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
After the SSA identifier optimizations (3-14% speedup), re-profiled
and investigated the remaining targets:

- P1 (next_unused_suffix): now only 0.56%, no longer a bottleneck
- P3 (nonrecursive remove_ref): tested existing implementation,
  15% SLOWER due to explicit stack overhead vs CPU cache-friendly
  recursion
- P4 (symex_dead detach): source locations were inlining artifacts,
  no single call site dominates
- P5 (constant_exprt::check): only 0.32% in check itself, the cost
  is in irept::find which it calls

Remaining hotspots are fundamental irept operations called from many
sites. Further improvement requires structural changes to the data
structure or reducing operation volume.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…/jemalloc

Test drop-in replacement allocators via LD_PRELOAD. Both tcmalloc
and jemalloc give 18-25% speedup across all benchmarks with zero
code changes. This is the single largest performance improvement
found — larger than all SSA identifier optimizations combined.

The improvement comes from better handling of CBMC's workload pattern:
millions of small object allocations/deallocations (irept tree nodes,
~64-128 bytes each).

Combined with the SSA identifier optimizations, total speedup on
csmith_42 is 28% (8.89s → 6.41s) and on array_ops is 39%
(3.13s → 1.91s).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Add -Dallocator=auto|tcmalloc|jemalloc|system CMake option. When set
to 'auto' (the default), CMake searches for tcmalloc then jemalloc
and links the first one found. This gives 18-25% speedup on typical
CBMC workloads due to better handling of many small allocations.

tcmalloc and jemalloc both use thread-local caches and size-class
segregation that are much faster than glibc's malloc for CBMC's
workload pattern (millions of ~64-128 byte irept tree nodes).

Also tested: mimalloc (~1% improvement, not worth the dependency),
glibc tuning via MALLOC_ARENA_MAX/MALLOC_TRIM_THRESHOLD (~1-3%).

On macOS, the system allocator (libmalloc) already uses magazine-based
allocation similar to tcmalloc, so the improvement is smaller. On
Windows, the default allocator is also reasonably fast for small objects.
Use -Dallocator=system to disable.

Install: apt-get install libgoogle-perftools-dev  (or libjemalloc-dev)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
This will make subsequent updates to the sharing tree more straightforward,
as they won't need to be duplicated in `non_sharing_tree`.
This avoids allocating a full node if the `irept` is a leaf node which
holds only an id.
All optimizations combined give 26-39% speedup vs baseline:
- linked_list: 1.49s → 1.08s (27.5%)
- array_ops:   3.08s → 1.89s (38.6%)
- csmith_42:   8.83s → 6.26s (29.1%)

New profile shows string interning (13.2%) is now the dominant
hotspot, with allocation reduced from 18% to 7.8% by tcmalloc.
Document remaining avenues for further optimization.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Add libgoogle-perftools-dev to apt-get install in all Linux CI
workflows that build CBMC with CMake. The CMake allocator auto-
detection will find and link tcmalloc, giving 18-25% speedup on
all CBMC runs.

Workflows updated:
- build-and-test-Linux.yaml
- pull-request-checks.yaml (CMake-based jobs)
- performance.yaml
- coverage.yaml
- profiling.yaml

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Document four remaining optimization avenues with approach, estimated
impact, and risk assessment. Plan A (reduce string interning volume)
has the highest potential at 3-5%.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Ensure all findings are thoroughly documented:
- SSA level suffix appending (set_level_0/1/2) with instrumentation data
- Field sensitivity array cache with benchmark results
- Member expression caching attempt (ruled out — not in hot loop)
- irept union optimization (diffblue#7960) cherry-pick and results
- tcmalloc vs jemalloc vs mimalloc vs glibc tuning comparison
- CMake allocator auto-detection implementation
- CI tcmalloc installation across all Linux workflows
- perf event selection (cycles vs cpu-clock) comparison
- Plan A investigation outcome (member case ruled out)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The previous implementation of get_new_name called
smallest_unused_suffix which linearly probes the symbol table with
prefix+to_string(0), prefix+to_string(1), ..., each constructing a
temporary string and interning it via string_containert::get().

Fix: cache the next suffix to try per prefix in a thread_local map.
On subsequent calls with the same prefix, start from the cached
position instead of 0. This turns O(n) probing into amortized O(1).
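
A sketch of the caching scheme, with a caller-supplied existence predicate standing in for the symbol-table probe (names are illustrative, not CBMC's actual signature):

```cpp
// Cache the next suffix to try per prefix in a thread_local map so
// repeated calls resume from the cached position rather than 0.
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

std::string get_new_name(
  const std::string &prefix,
  const std::function<bool(const std::string &)> &exists)
{
  // survives across calls; per-thread to avoid locking
  static thread_local std::unordered_map<std::string, std::size_t>
    next_suffix;

  std::size_t &suffix = next_suffix[prefix];
  std::string candidate;
  do
  {
    candidate = prefix + std::to_string(suffix);
    ++suffix;
  } while(exists(candidate));
  return candidate;
}
```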

Note: the speedup figures below were measured under conditions that
could not be independently reproduced (different csmith version or
benchmark configuration). The optimization is algorithmically sound
(O(1) vs O(n) probing) but the magnitude on current benchmarks is
small because get_new_name is not a dominant hotspot on the standard
benchmark suite.

Original measurements (not independently verified):
  csmith_42:         6.26s → 4.36s (30.4% faster)
  csmith_1111111111: 12.44s → 10.41s (16.3% faster)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Override operator new/delete for tree_nodet to maintain a thread_local
free list of previously deallocated nodes. CBMC allocates and frees
millions of tree_nodet objects on typical workloads; reusing memory
from a free list avoids malloc/free overhead.

Verified performance (9 benchmarks, 7 runs each, median):
  With glibc malloc: +3.2% total speedup
  With tcmalloc:     +0.7% additional (tcmalloc already handles
    small objects well via thread-local size-class caches)

The pool allocator and tcmalloc address the same bottleneck (small
object allocation). tcmalloc is much more effective (21% vs 3.2%),
so the pool's contribution is marginal when tcmalloc is present.
The pool is most valuable on platforms where tcmalloc is unavailable.

Caveats:
- Memory in the free list is never returned to the OS
- Interacts poorly with sanitizers (ASan, valgrind)
- thread_local has minor overhead on single-threaded CBMC
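
A minimal sketch of such an override, using a stand-in `nodet` rather than CBMC's `tree_nodet`:

```cpp
// Class-specific operator new/delete backed by a thread_local free
// list. Freed blocks are chained through their first word and reused
// on the next allocation; memory is never returned to the OS.
#include <cstdlib>
#include <new>

struct nodet
{
  // payload; the first word doubles as the free-list link after delete
  void *payload[4] = {};

  static void *operator new(std::size_t size)
  {
    if(free_list != nullptr)
    {
      void *reused = free_list;
      free_list = *static_cast<void **>(reused);
      return reused;
    }
    void *fresh = std::malloc(size);
    if(fresh == nullptr)
      throw std::bad_alloc{};
    return fresh;
  }

  static void operator delete(void *p) noexcept
  {
    *static_cast<void **>(p) = free_list;
    free_list = p;
  }

  static thread_local void *free_list;
};

thread_local void *nodet::free_list = nullptr;
```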

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Hoist loop-invariant work out of the struct and array loops in
get_fields:
- Prepare a template ssa_exprt with L2 already removed, avoiding
  per-iteration get_level_2() + remove_level_2() calls
- Cache was_l2 flag (same for all elements)
- Cache identifier prefix/suffix from first array element, derive
  subsequent identifiers by string replacement (using bvrep values
  from the canonical set_expression path for correctness)

Performance (5 runs, median, with tcmalloc, 7 benchmarks):
  heavy_array: +11.8%, matrix: +5.6%, array_ops: +2.4%
  Total: +3.8%

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Profiled best configuration (HEAD~2 + tcmalloc) across 6 benchmarks.
Key finding: field_sensitivityt::apply recursion is the dominant cost
driver, responsible for ~50% of all samples through its call chain
(SSA identifier rebuilding, irept lookups, simplification, allocation).

Added verification of all optimization claims, corrected tcmalloc
measurement methodology, documented free-list allocator re-evaluation,
and identified 4 actionable next steps.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Attempted array identifier prefix caching in get_fields. Correctly
implemented (unlike the previous buggy attempt), but only 0.7% total
improvement — not worth the complexity. The bottleneck is the cumulative
cost of millions of small irept operations, not any single function.

Documented root cause analysis and remaining optimization avenues.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Hoisting template ssa_exprt, was_l2, and identifier prefix/suffix
out of the struct and array loops gives 3.8% total improvement
(11.8% on heavy_array, 5.6% on matrix, 2.4% on array_ops).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Post-optimization profile shows field_sensitivity dropped from ~14%
to 1.6%. The new dominant CBMC hotspot is merge_irept (expression
deduplication) at ~10% combined (operator== 7.5% + merged 1.3% +
hash 1.2%). The --dimacs benchmark methodology inflates I/O costs
(27%); future profiling should use actual SAT solving.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
All three perform within noise (0.4-0.5% difference). The hash
function is not a bottleneck — irept::hash is only 1.2% of total
time. The cost is in the volume of hash operations and deep equality
comparisons, not hash computation quality.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Detailed analysis of 14.3M operator== calls: 68% are sharing hits
(free), 28% are deep comparisons returning true (expensive), 3.5%
return false. Three approaches tested:

1. Hash-based fast reject: 0% impact (unordered_set already filters)
2. Opportunistic sharing in operator==: UNSAFE (changes COW semantics)
3. Pointer-based fast lookup: 1.3% slower (overhead > savings)

The deep-equal comparisons are inherent to hash-set deduplication.
Viable alternatives: reduce merge frequency, structural changes to
the dedup data structure, or ensure sub-tree pointer sharing.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Using goto binaries instead of .c files eliminates C preprocessing
variability, giving dramatically more stable measurements (range
±0.007-0.017s vs ±0.285-0.811s). The 2.2% speedup from skipping
in-process parsing is a bonus.

Falls back gracefully to .c if goto-cc is not available.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Merge frequency reduction is unlikely to help because the operator==
cost is per-irep (not per-batch) and sub-ireps are constructed fresh
regardless of batch size. The viable path is ensuring sub-tree pointer
sharing before merge.

goto-cc workflow gives 2.2% speedup and dramatically more stable
measurements (±0.01s vs ±0.8s range).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Corrected benchmark methodology (goto binaries, median of 5).
Total code-only speedup is 36% across 7 benchmarks.

Note: this commit originally claimed "pool allocator subsumes
tcmalloc benefit (0% additional speedup on HEAD)". That claim was
incorrect — the HEAD binary had tcmalloc linked via CMake, so the
LD_PRELOAD test compared tcmalloc vs tcmalloc. The definitive
allocator analysis (commit 03127c23f0) corrects this: tcmalloc
gives 21% speedup independent of the pool allocator.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Rigorous 2×2 analysis (pool × allocator) across 9 benchmarks × 7 runs.
All binaries built with -Dallocator=system (glibc only), tcmalloc and
jemalloc tested via LD_PRELOAD. Allocator linkage verified via ldd.

Key findings:
- tcmalloc: 21% speedup, independent of code optimizations
- jemalloc: 20% speedup, slightly less than tcmalloc
- Pool allocator: 3.2% with glibc, 0.7% with tcmalloc
- The earlier 0% tcmalloc claim was incorrect (was comparing
  tcmalloc vs tcmalloc due to CMake-linked tcmalloc in binary)
- The 31% pool allocator claim is not reproduced (verified: 3.2%)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Retract the 'pool subsumes tcmalloc' claim from the Re-Profiling
section — that measurement was comparing tcmalloc vs tcmalloc.
Update free-list pool allocator figures to match the definitive
analysis (3.2% with glibc, 0.7% with tcmalloc).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
tautschnig and others added 3 commits March 28, 2026 19:44
Replace std::set<exprt> with std::unordered_set<exprt, irep_hash> in
simplify_inequality_pointer_object to avoid expensive irept::compare
calls via operator<. The std::set required O(n log n) deep recursive
tree comparisons for ordering, while unordered_set uses irept::hash
(cached) and operator== (can short-circuit on pointer equality).

Also replace std::set_intersection with a simple loop over one set
checking membership in the other, which is the natural idiom for
unordered containers.
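
The membership-test idiom can be sketched like this, with `std::string` elements and the default hash standing in for `exprt` and `irep_hash`:

```cpp
// Intersect two unordered sets by probing the larger set with the
// elements of the smaller one: O(min(|a|, |b|)) expected lookups,
// versus the sorted-range requirement of std::set_intersection.
#include <string>
#include <unordered_set>
#include <vector>

std::vector<std::string> intersect(
  const std::unordered_set<std::string> &a,
  const std::unordered_set<std::string> &b)
{
  const auto &smaller = a.size() <= b.size() ? a : b;
  const auto &larger = a.size() <= b.size() ? b : a;

  std::vector<std::string> result;
  for(const auto &element : smaller)
    if(larger.count(element) != 0)
      result.push_back(element);
  return result;
}
```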

Benchmarked: 18.2% speedup on pointer-heavy code (csmith_1111111111),
negligible effect on non-pointer-heavy benchmarks. DIMACS output
verified identical across all 13 benchmarks.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Avoid default-construct-then-mutate pattern that triggers three COW
detach operations (object, offset, type). Instead, construct the
object_descriptor_exprt with its fields directly via a new two-argument
constructor taking object and offset.

Add object_descriptor_exprt(exprt _object, exprt _offset) constructor
to support this pattern.
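
The pattern can be illustrated with a stand-in COW node — `cow_nodet` below is not CBMC's `irept`; it only models the detach-on-write behaviour the commit avoids:

```cpp
#include <memory>
#include <string>
#include <utility>

// Models a copy-on-write node: writing to a shared payload forces a
// private copy (the "detach").
class cow_nodet
{
public:
  cow_nodet() : data(nil()) {} // default nodes share one payload
  explicit cow_nodet(std::string value)
    : data(std::make_shared<std::string>(std::move(value)))
  {
  }

  void write(std::string value)
  {
    if(data.use_count() > 1)
      data = std::make_shared<std::string>(); // the detach to avoid
    *data = std::move(value);
  }

  const std::string &read() const { return *data; }

private:
  static std::shared_ptr<std::string> nil()
  {
    static auto shared_nil = std::make_shared<std::string>();
    return shared_nil;
  }

  std::shared_ptr<std::string> data;
};

struct object_descriptort
{
  cow_nodet object, offset;

  // default-construct-then-write would detach each field once;
  // constructing with the fields directly performs no detach
  object_descriptort(cow_nodet o, cow_nodet off)
    : object(std::move(o)), offset(std::move(off))
  {
  }
};
```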

Benchmarked: 5% speedup on pointer-heavy code (csmith_1111111111),
negligible effect on other benchmarks. DIMACS output verified identical
across all 13 benchmarks.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…ofile

Add sections:
- Comprehensive 13-benchmark results (31% total speedup)
- Change 1: unordered_set in simplify_inequality_pointer_object (18.2%)
- Change 2: direct construction in value_sett::to_expr (5.0%)
- Updated profile (9 benchmarks, excluding csmith_1111111111)

Fix inconsistencies:
- Update header (branch name, date range)
- Add superseded notes to early Combined/Final Results sections
- Fix stale commit hash (ae34043bd1 → 7d15280901)
- Add cross-reference from 7-benchmark to 13-benchmark table
- Note that Current Profile section is superseded

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
@tautschnig tautschnig self-assigned this Mar 28, 2026