[WIP] Profiling-guided optimisations#8906

Draft
tautschnig wants to merge 33 commits into diffblue:develop from tautschnig:experiment/hash-optimization

Conversation

@tautschnig
Collaborator

  • Each commit message has a non-empty body, explaining why the change was made.
  • Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
  • The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/
  • Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
  • My commit message includes data points confirming performance improvements (if claimed).
  • My PR is restricted to a single feature or bugfix.
  • White-space or formatting changes outside the feature-related changed lines are in commits of their own.

tautschnig and others added 30 commits March 28, 2026 19:44
Document findings from extensive profiling (15 benchmarks × 3 runs,
49,462 samples) identifying the top 8 hotspots in CBMC's pre-solver
pipeline with call chain analysis and source locations.

Key findings:
- String interning (11.7%), sharing tree destruction (9.9%), and
  memory allocation (~25%) dominate
- symbol_table_baset::next_unused_suffix (4.9%) is a clear algorithmic
  improvement opportunity (O(n) probing → O(1) counter)
- symex_dead.cpp triggers unnecessary copy-on-write detach (4.6%)

Includes a prioritized optimization plan (P1-P6) ranked by impact
and feasibility.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Investigate the 11.7% string interning hotspot identified in profiling.
Test three approaches: FNV-1a hash (3.6% slower), pre-reserve (no
effect), and reduced load factor (1.3-1.8% faster).

Key finding: the bottleneck is not hash quality or table configuration
but the volume of redundant string_containert::get() calls (11.5M on
csmith_42, 98.6% hit rate). Each call rebuilds the SSA identifier
string via ostringstream and re-interns it.

The recommended optimization is to avoid redundant interning by
caching SSA identifiers or building them without ostringstream.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Consolidate the string hashing investigation results into
PROFILING_ANALYSIS.md and update the optimization priority table.
Key finding: the 11.7% string interning cost is from redundant
update_identifier calls, not hash function quality.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…lding

Replace std::ostringstream with direct std::string concatenation in
build_ssa_identifier_rec and initialize_ssa_identifier. This avoids
heap allocation for the stream buffer on every update_identifier call.
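
The pattern can be sketched as follows; `build_ssa_identifier` is a hypothetical stand-in for CBMC's `build_ssa_identifier_rec`/`initialize_ssa_identifier`, shown only to illustrate the concatenation idiom:

```cpp
// Append to a std::string rather than streaming through an
// ostringstream, avoiding the stream's heap-allocated buffer on
// every call. Names and suffix layout are illustrative.
#include <cstddef>
#include <string>

std::string build_ssa_identifier(
  const std::string &base,
  std::size_t l0,
  std::size_t l1,
  std::size_t l2)
{
  std::string id;
  id.reserve(base.size() + 24); // one allocation for typical suffixes
  id += base;
  id += '!';
  id += std::to_string(l0);
  id += '@';
  id += std::to_string(l1);
  id += '#';
  id += std::to_string(l2);
  return id;
}
```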

Profiling showed update_identifier is called 11.5M times on a typical
CSmith benchmark, with 98.6% of calls re-interning an already-known
string. Each call previously allocated two ostringstream objects.

Performance improvement (5 runs each, Release build):
  linked_list:       1.49s → 1.46s  ( 2.0% faster)
  array_ops:         3.13s → 2.78s  (11.2% faster)
  csmith_42:         8.89s → 8.58s  ( 3.5% faster)
  csmith_1111111111: 16.91s → 15.92s ( 5.9% faster)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Document the 2-11% speedup from replacing ostringstream with string
concatenation in SSA identifier building. Note remaining optimization
potential from caching and reducing update_identifier call count.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
When only the L2 suffix changes, derive the new identifier from the
cached L1 object identifier (base!l0@l1) by appending #l2, instead
of rebuilding the entire string from the expression tree.

This avoids the recursive walk through build_ssa_identifier_rec for
the ~16-40% of update_identifier calls that come from set_level_2.

Additional speedup over the previous ostringstream optimization:
  array_ops:         2.78s → 2.71s (2.5% faster)
  csmith_1111111111: 15.92s → 15.76s (1.0% faster)

Combined speedup vs original baseline:
  linked_list:       1.49s → 1.45s ( 2.7%)
  array_ops:         3.13s → 2.71s (13.4%)
  csmith_42:         8.89s → 8.55s ( 3.8%)
  csmith_1111111111: 16.91s → 15.76s ( 6.8%)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Both set_level_0 and set_level_1 are only called when the respective
level was previously empty (callers guard against re-setting). This
means the current identifier can be extended by appending !l0 or @l1
instead of rebuilding from the expression tree.

Combined with the previous two commits, this eliminates all full
identifier rebuilds in the L0→L1→L2 rename chain. Only set_expression
(which changes the base expression) still requires a full rebuild.
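
The resulting append-only chain might look like this sketch (free functions standing in for the actual setter methods, shown only to illustrate the idiom):

```cpp
// Each rename level extends the current identifier instead of
// rebuilding it from the expression tree. Illustrative helpers,
// not CBMC's ssa_exprt interface.
#include <cstddef>
#include <string>

std::string append_level_0(const std::string &id, std::size_t l0)
{
  return id + '!' + std::to_string(l0);
}

std::string append_level_1(const std::string &id, std::size_t l1)
{
  return id + '@' + std::to_string(l1);
}

std::string append_level_2(const std::string &id, std::size_t l2)
{
  return id + '#' + std::to_string(l2);
}
```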

Combined speedup vs original baseline (5 runs, Release build):
  linked_list:       1.49s → 1.44s ( 3.4%)
  array_ops:         3.13s → 2.69s (14.1%)
  csmith_42:         8.89s → 8.44s ( 5.1%)
  csmith_1111111111: 16.91s → 15.75s ( 6.9%)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
After the SSA identifier optimizations (3-14% speedup), re-profiled
and investigated the remaining targets:

- P1 (next_unused_suffix): now only 0.56%, no longer a bottleneck
- P3 (nonrecursive remove_ref): tested existing implementation,
  15% SLOWER due to explicit stack overhead vs CPU cache-friendly
  recursion
- P4 (symex_dead detach): source locations were inlining artifacts,
  no single call site dominates
- P5 (constant_exprt::check): only 0.32% in check itself, the cost
  is in irept::find which it calls

Remaining hotspots are fundamental irept operations called from many
sites. Further improvement requires structural changes to the data
structure or reducing operation volume.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…/jemalloc

Test drop-in replacement allocators via LD_PRELOAD. Both tcmalloc
and jemalloc give 18-25% speedup across all benchmarks with zero
code changes. This is the single largest performance improvement
found — larger than all SSA identifier optimizations combined.

The improvement comes from better handling of CBMC's workload pattern:
millions of small object allocations/deallocations (irept tree nodes,
~64-128 bytes each).

Combined with the SSA identifier optimizations, total speedup on
csmith_42 is 28% (8.89s → 6.41s) and on array_ops is 39%
(3.13s → 1.91s).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Add -Dallocator=auto|tcmalloc|jemalloc|system CMake option. When set
to 'auto' (the default), CMake searches for tcmalloc then jemalloc
and links the first one found. This gives 18-25% speedup on typical
CBMC workloads due to better handling of many small allocations.

tcmalloc and jemalloc both use thread-local caches and size-class
segregation that are much faster than glibc's malloc for CBMC's
workload pattern (millions of ~64-128 byte irept tree nodes).

Also tested: mimalloc (~1% improvement, not worth the dependency),
glibc tuning via MALLOC_ARENA_MAX/MALLOC_TRIM_THRESHOLD (~1-3%).

On macOS, the system allocator (libmalloc) already uses magazine-based
allocation similar to tcmalloc, so the improvement is smaller. On
Windows, the default allocator is also reasonably fast for small objects.
Use -Dallocator=system to disable.

Install: apt-get install libgoogle-perftools-dev  (or libjemalloc-dev)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
This will make subsequent updates to the sharing tree more straightforward,
as they won't need to be duplicated in `non_sharing_tree`.
This avoids allocating a full node if the `irept` is a leaf node which
holds only an id.
All optimizations combined give 26-39% speedup vs baseline:
- linked_list: 1.49s → 1.08s (27.5%)
- array_ops:   3.08s → 1.89s (38.6%)
- csmith_42:   8.83s → 6.26s (29.1%)

New profile shows string interning (13.2%) is now the dominant
hotspot, with allocation reduced from 18% to 7.8% by tcmalloc.
Document remaining avenues for further optimization.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Add libgoogle-perftools-dev to apt-get install in all Linux CI
workflows that build CBMC with CMake. The CMake allocator auto-
detection will find and link tcmalloc, giving 18-25% speedup on
all CBMC runs.

Workflows updated:
- build-and-test-Linux.yaml
- pull-request-checks.yaml (CMake-based jobs)
- performance.yaml
- coverage.yaml
- profiling.yaml

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Document four remaining optimization avenues with approach, estimated
impact, and risk assessment. Plan A (reduce string interning volume)
has the highest potential at 3-5%.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Ensure all findings are thoroughly documented:
- SSA level suffix appending (set_level_0/1/2) with instrumentation data
- Field sensitivity array cache with benchmark results
- Member expression caching attempt (ruled out — not in hot loop)
- irept union optimization (diffblue#7960) cherry-pick and results
- tcmalloc vs jemalloc vs mimalloc vs glibc tuning comparison
- CMake allocator auto-detection implementation
- CI tcmalloc installation across all Linux workflows
- perf event selection (cycles vs cpu-clock) comparison
- Plan A investigation outcome (member case ruled out)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The previous implementation of get_new_name called
smallest_unused_suffix which linearly probes the symbol table with
prefix+to_string(0), prefix+to_string(1), ..., each constructing a
temporary string and interning it via string_containert::get().

Fix: cache the next suffix to try per prefix in a thread_local map.
On subsequent calls with the same prefix, start from the cached
position instead of 0. This turns O(n) probing into amortized O(1).
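
A sketch of the caching scheme, with a caller-supplied existence predicate standing in for the symbol-table probe (names are illustrative, not CBMC's actual signature):

```cpp
// Cache the next suffix to try per prefix in a thread_local map so
// repeated calls resume from the cached position rather than 0.
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

std::string get_new_name(
  const std::string &prefix,
  const std::function<bool(const std::string &)> &exists)
{
  // survives across calls; per-thread to avoid locking
  static thread_local std::unordered_map<std::string, std::size_t>
    next_suffix;

  std::size_t &suffix = next_suffix[prefix];
  std::string candidate;
  do
  {
    candidate = prefix + std::to_string(suffix);
    ++suffix;
  } while(exists(candidate));
  return candidate;
}
```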

Note: the speedup figures below were measured under conditions that
could not be independently reproduced (different csmith version or
benchmark configuration). The optimization is algorithmically sound
(O(1) vs O(n) probing) but the magnitude on current benchmarks is
small because get_new_name is not a dominant hotspot on the standard
benchmark suite.

Original measurements (not independently verified):
  csmith_42:         6.26s → 4.36s (30.4% faster)
  csmith_1111111111: 12.44s → 10.41s (16.3% faster)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Override operator new/delete for tree_nodet to maintain a thread_local
free list of previously deallocated nodes. CBMC allocates and frees
millions of tree_nodet objects on typical workloads; reusing memory
from a free list avoids malloc/free overhead.

Verified performance (9 benchmarks, 7 runs each, median):
  With glibc malloc: +3.2% total speedup
  With tcmalloc:     +0.7% additional (tcmalloc already handles
    small objects well via thread-local size-class caches)

The pool allocator and tcmalloc address the same bottleneck (small
object allocation). tcmalloc is much more effective (21% vs 3.2%),
so the pool's contribution is marginal when tcmalloc is present.
The pool is most valuable on platforms where tcmalloc is unavailable.

Caveats:
- Memory in the free list is never returned to the OS
- Interacts poorly with sanitizers (ASan, valgrind)
- thread_local has minor overhead on single-threaded CBMC
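
A minimal sketch of such an override, using a stand-in `nodet` rather than CBMC's `tree_nodet`:

```cpp
// Class-specific operator new/delete backed by a thread_local free
// list. Freed blocks are chained through their first word and reused
// on the next allocation; memory is never returned to the OS.
#include <cstdlib>
#include <new>

struct nodet
{
  // payload; the first word doubles as the free-list link after delete
  void *payload[4] = {};

  static void *operator new(std::size_t size)
  {
    if(free_list != nullptr)
    {
      void *reused = free_list;
      free_list = *static_cast<void **>(reused);
      return reused;
    }
    void *fresh = std::malloc(size);
    if(fresh == nullptr)
      throw std::bad_alloc{};
    return fresh;
  }

  static void operator delete(void *p) noexcept
  {
    *static_cast<void **>(p) = free_list;
    free_list = p;
  }

  static thread_local void *free_list;
};

thread_local void *nodet::free_list = nullptr;
```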

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Hoist loop-invariant work out of the struct and array loops in
get_fields:
- Prepare a template ssa_exprt with L2 already removed, avoiding
  per-iteration get_level_2() + remove_level_2() calls
- Cache was_l2 flag (same for all elements)
- Cache identifier prefix/suffix from first array element, derive
  subsequent identifiers by string replacement (using bvrep values
  from the canonical set_expression path for correctness)

Performance (5 runs, median, with tcmalloc, 7 benchmarks):
  heavy_array: +11.8%, matrix: +5.6%, array_ops: +2.4%
  Total: +3.8%

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Profiled best configuration (HEAD~2 + tcmalloc) across 6 benchmarks.
Key finding: field_sensitivityt::apply recursion is the dominant cost
driver, responsible for ~50% of all samples through its call chain
(SSA identifier rebuilding, irept lookups, simplification, allocation).

Added verification of all optimization claims, corrected tcmalloc
measurement methodology, documented free-list allocator re-evaluation,
and identified 4 actionable next steps.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Attempted array identifier prefix caching in get_fields. Correctly
implemented (unlike the previous buggy attempt), but only 0.7% total
improvement — not worth the complexity. The bottleneck is the cumulative
cost of millions of small irept operations, not any single function.

Documented root cause analysis and remaining optimization avenues.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Hoisting template ssa_exprt, was_l2, and identifier prefix/suffix
out of the struct and array loops gives 3.8% total improvement
(11.8% on heavy_array, 5.6% on matrix, 2.4% on array_ops).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Post-optimization profile shows field_sensitivity dropped from ~14%
to 1.6%. The new dominant CBMC hotspot is merge_irept (expression
deduplication) at ~10% combined (operator== 7.5% + merged 1.3% +
hash 1.2%). The --dimacs benchmark methodology inflates I/O costs
(27%); future profiling should use actual SAT solving.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
All three perform within noise (0.4-0.5% difference). The hash
function is not a bottleneck — irept::hash is only 1.2% of total
time. The cost is in the volume of hash operations and deep equality
comparisons, not hash computation quality.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Detailed analysis of 14.3M operator== calls: 68% are sharing hits
(free), 28% are deep comparisons returning true (expensive), 3.5%
return false. Three approaches tested:

1. Hash-based fast reject: 0% impact (unordered_set already filters)
2. Opportunistic sharing in operator==: UNSAFE (changes COW semantics)
3. Pointer-based fast lookup: 1.3% slower (overhead > savings)

The deep-equal comparisons are inherent to hash-set deduplication.
Viable alternatives: reduce merge frequency, structural changes to
the dedup data structure, or ensure sub-tree pointer sharing.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Using goto binaries instead of .c files eliminates C preprocessing
variability, giving dramatically more stable measurements (range
±0.007-0.017s vs ±0.285-0.811s). The 2.2% speedup from skipping
in-process parsing is a bonus.

Falls back gracefully to .c if goto-cc is not available.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Merge frequency reduction is unlikely to help because the operator==
cost is per-irep (not per-batch) and sub-ireps are constructed fresh
regardless of batch size. The viable path is ensuring sub-tree pointer
sharing before merge.

goto-cc workflow gives 2.2% speedup and dramatically more stable
measurements (±0.01s vs ±0.8s range).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Corrected benchmark methodology (goto binaries, median of 5).
Total code-only speedup is 36% across 7 benchmarks.

Note: this commit originally claimed "pool allocator subsumes
tcmalloc benefit (0% additional speedup on HEAD)". That claim was
incorrect — the HEAD binary had tcmalloc linked via CMake, so the
LD_PRELOAD test compared tcmalloc vs tcmalloc. The definitive
allocator analysis (commit 03127c23f0) corrects this: tcmalloc
gives 21% speedup independent of the pool allocator.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Rigorous 2×2 analysis (pool × allocator) across 9 benchmarks × 7 runs.
All binaries built with -Dallocator=system (glibc only), tcmalloc and
jemalloc tested via LD_PRELOAD. Allocator linkage verified via ldd.

Key findings:
- tcmalloc: 21% speedup, independent of code optimizations
- jemalloc: 20% speedup, slightly less than tcmalloc
- Pool allocator: 3.2% with glibc, 0.7% with tcmalloc
- The earlier 0% tcmalloc claim was incorrect (was comparing
  tcmalloc vs tcmalloc due to CMake-linked tcmalloc in binary)
- The 31% pool allocator claim is not reproduced (verified: 3.2%)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Retract the 'pool subsumes tcmalloc' claim from the Re-Profiling
section — that measurement was comparing tcmalloc vs tcmalloc.
Update free-list pool allocator figures to match the definitive
analysis (3.2% with glibc, 0.7% with tcmalloc).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
tautschnig and others added 3 commits March 28, 2026 19:44
Replace std::set<exprt> with std::unordered_set<exprt, irep_hash> in
simplify_inequality_pointer_object to avoid expensive irept::compare
calls via operator<. The std::set required O(n log n) deep recursive
tree comparisons for ordering, while unordered_set uses irept::hash
(cached) and operator== (can short-circuit on pointer equality).

Also replace std::set_intersection with a simple loop over one set
checking membership in the other, which is the natural idiom for
unordered containers.
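
The membership-test idiom can be sketched like this, with `std::string` elements and the default hash standing in for `exprt` and `irep_hash`:

```cpp
// Intersect two unordered sets by probing the larger set with the
// elements of the smaller one: O(min(|a|, |b|)) expected lookups,
// versus the sorted-range requirement of std::set_intersection.
#include <string>
#include <unordered_set>
#include <vector>

std::vector<std::string> intersect(
  const std::unordered_set<std::string> &a,
  const std::unordered_set<std::string> &b)
{
  const auto &smaller = a.size() <= b.size() ? a : b;
  const auto &larger = a.size() <= b.size() ? b : a;

  std::vector<std::string> result;
  for(const auto &element : smaller)
    if(larger.count(element) != 0)
      result.push_back(element);
  return result;
}
```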

Benchmarked: 18.2% speedup on pointer-heavy code (csmith_1111111111),
negligible effect on non-pointer-heavy benchmarks. DIMACS output
verified identical across all 13 benchmarks.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Avoid default-construct-then-mutate pattern that triggers three COW
detach operations (object, offset, type). Instead, construct the
object_descriptor_exprt with its fields directly via a new two-argument
constructor taking object and offset.

Add object_descriptor_exprt(exprt _object, exprt _offset) constructor
to support this pattern.
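
The pattern can be illustrated with a stand-in COW node — `cow_nodet` below is not CBMC's `irept`; it only models the detach-on-write behaviour the commit avoids:

```cpp
#include <memory>
#include <string>
#include <utility>

// Models a copy-on-write node: writing to a shared payload forces a
// private copy (the "detach").
class cow_nodet
{
public:
  cow_nodet() : data(nil()) {} // default nodes share one payload
  explicit cow_nodet(std::string value)
    : data(std::make_shared<std::string>(std::move(value)))
  {
  }

  void write(std::string value)
  {
    if(data.use_count() > 1)
      data = std::make_shared<std::string>(); // the detach to avoid
    *data = std::move(value);
  }

  const std::string &read() const { return *data; }

private:
  static std::shared_ptr<std::string> nil()
  {
    static auto shared_nil = std::make_shared<std::string>();
    return shared_nil;
  }

  std::shared_ptr<std::string> data;
};

struct object_descriptort
{
  cow_nodet object, offset;

  // default-construct-then-write would detach each field once;
  // constructing with the fields directly performs no detach
  object_descriptort(cow_nodet o, cow_nodet off)
    : object(std::move(o)), offset(std::move(off))
  {
  }
};
```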

Benchmarked: 5% speedup on pointer-heavy code (csmith_1111111111),
negligible effect on other benchmarks. DIMACS output verified identical
across all 13 benchmarks.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…ofile

Add sections:
- Comprehensive 13-benchmark results (31% total speedup)
- Change 1: unordered_set in simplify_inequality_pointer_object (18.2%)
- Change 2: direct construction in value_sett::to_expr (5.0%)
- Updated profile (9 benchmarks, excluding csmith_1111111111)

Fix inconsistencies:
- Update header (branch name, date range)
- Add superseded notes to early Combined/Final Results sections
- Fix stale commit hash (ae34043bd1 → 7d15280901)
- Add cross-reference from 7-benchmark to 13-benchmark table
- Note that Current Profile section is superseded

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
@tautschnig tautschnig self-assigned this Mar 28, 2026