Fast MPS parser for free-format MPS files by aliceb-nv · Pull Request #1429 · NVIDIA/cuopt

aliceb-nv · 2026-06-12T15:33:41Z

This PR introduces an experimental, opt-in fast parser for well-formed MPS files, which takes advantage of available parallelism and SIMD extensions on the target machine. This is mostly intended for huge (>1GB) MPS files, e.g. for distributed solves.

The parser relies on parallel I/O overlapping with MPS section parsing workers which start parsing as soon as read/decoded data is made available, in order to hide latency on slow NFS filesystems.
LZ4 is added as a supported compression format, with decompression performed in parallel. LZ4 tends to perform unusually well on MPS data (~10-30% compression ratios) due to its very regular nature and common prefixes in row/column names, and it decompresses at a few GB/s.

This feature is exposed via a new CLI flag --mps-reader, which can take either default (existing parser), or experimental-fast (this PR) as its argument.

LP, MIP, QP, and MPS SOCP formats are supported.

Results, on a corpus of 42 instances with sizes ranging from 1.55GB to 50.5GB:

NFS reference vs NFS fast:
  - mean: 17.10x
  - median: 18.08x
  - range: 6.72x to 29.05x

NFS reference vs NFS LZ4-fast:
  - mean: 34.98x
  - median: 35.28x
  - range: 11.80x to 97.99x

NVMe reference vs NVMe fast:
  - mean: 25.59x
  - median: 23.74x
  - range: 11.81x to 45.62x

Results for some >100GB files, NFS storage, cold cache (reference parser times out at 1 hour):

| Name | Size | Runtime |
| --- | --- | --- |
| psr_100.mps | 58 GB | 104s |
| psr_100.mps.lz4 | 8.6 GB | 40s |
| design_match.mps | 131.8 GB | 209.5s |
| design_match.mps.lz4 | 44 GB | 60.4s |
| tsp-gaia-100m.mps | 397 GB | 620s |
| tsp-gaia-100m.mps.lz4 | 83 GB | 362s |

Description

Issue

Checklist

I am familiar with the Contributing Guidelines.
Testing
- New or existing tests cover these changes
- Added tests
- Created an issue to follow-up
- NA
Documentation
- The documentation is up to date with these changes
- Added new documentation
- NA


copy-pr-bot · 2026-06-12T15:33:45Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-12T15:49:03Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c1a45c0b-7f03-47b1-acb7-ccddb3e26f69

📥 Commits

Reviewing files that changed from the base of the PR and between 1990c06 and d7358f6.

📒 Files selected for processing (1)

cpp/src/io/experimental_mps_fast/file_reader.cpp

🚧 Files skipped from review as they are similar to previous changes (1)

cpp/src/io/experimental_mps_fast/file_reader.cpp

📝 Walkthrough


Adds an opt-in experimental fast MPS reader with SIMD tokenization and Eisel–Lemire FP64 conversion, mmap-backed input streams, multi-threaded LZ4 frame decoding, a phase-based section scanner, and extended test coverage with CMake wiring for reader selection and optional LZ4 support.

Changes

Experimental Fast MPS Parser with LZ4 Support

Layer / File(s)	Summary
Build wiring and reader selection API `cpp/CMakeLists.txt`, `cpp/src/CMakeLists.txt`, `cpp/src/io/CMakeLists.txt`, `cpp/include/cuopt/linear_programming/io/parser.hpp`, `cpp/src/io/parser.cpp`, `cpp/cuopt_cli.cpp`, `cpp/src/io/utilities/error.hpp`	Adds `simde` via CPM with `INTERFACE_INCLUDE_DIRECTORIES` target and optional `CUOPT_PARSER_WITH_LZ4` configuration. Exports `MPS_FAST_SRC_FILES` and applies CPU-specific compile flags (BMI2/AVX2/SSE4.2) for x86-64. Introduces `mps_reader_type_t` enum and 3-argument `read(path, mps_reader_type_t, fixed_mps_format)` dispatch with case-insensitive extension matching. CLI adds `--mps-reader` option with default/experimental-fast choices. Centralizes parser error formatting via `mps_parser_throw` (JSON error) and `mps_parser_fail` (varargs printf).
FP64 conversion and cursor-based tokenization `cpp/src/io/experimental_mps_fast/fast_fp64_parser.hpp`, `cpp/src/io/experimental_mps_fast/fast_parse_primitives.hpp`, `cpp/src/io/experimental_mps_fast/fast_parser.hpp`	Constexpr Eisel–Lemire power-of-10 lookup table using custom 256-bit arithmetic. SWAR digit parsing up to 19 significant digits with sign, mantissa, fractional count, and base-10 exponent tracking; fast-eligibility flags gate optimized paths. Fallback to `std::strtod` with 32-byte buffer check and `d`/`D` normalization. Cursor abstraction over character buffers with SIMD/scalar whitespace scanning, end-of-line handling, comment skipping, error reporting. Token extraction via SIMD with 32B loads and short-buffer fallbacks. Two-field optimized read. Numeric parsing via `fp64::parse_fp64_advance`, plus fast `-1`/`1` path. Section/comment acceptance helpers.
MPS section scanner and phase registry `cpp/src/io/experimental_mps_fast/mps_section_scanner.hpp`, `cpp/src/io/experimental_mps_fast/mps_section_scanner.cpp`	`mps_phase_registry_t` with atomic-backed phase ranges, readiness acquire/release semantics, and optional OpenMP event attachment. `mps_section_block_scanner_t` observes LZ4-decoded blocks out-of-order, identifies MPS section headers via SIMD newline+non-blank column-1 masks followed by scalar prefix validation, records earliest section hits, advances contiguous decoded-byte frontier with overlapping boundary rescans, and publishes phase ranges (mandatory header/rows/columns; optional rhs/bounds/ranges with present/absent semantics).
Stream interfaces, memory management, and utilities `cpp/src/io/experimental_mps_fast/file_reader.hpp`, `cpp/src/io/experimental_mps_fast/file_reader.cpp`, `cpp/src/io/experimental_mps_fast/mmap_region.hpp`, `cpp/src/io/experimental_mps_fast/hash_table_smallstr.hpp`, `cpp/src/io/experimental_mps_fast/nvtx_ranges.hpp`, `cpp/src/utilities/perf_counters.hpp`	CRTP `input_stream_base_t<Derived>` and `input_stream_view_t` unify accessor interface. `mmap_region_t` RAII move-only wrapper for anonymous/aligned/fixed-address mappings with prefix/suffix unmap support. `parallel_error_latch_t` captures first exception for cooperative thread stopping. `scoped_thread_group` and `parallel_for_indexed` support bounded work distribution with exception propagation and optional NVTX thread naming. `smallstr_hash_table_t` optimizes row-name lookups with short-key inline storage (≤28 bytes) and long-key `std::unordered_map` fallback; supports serial and partitioned bucket layouts with optional probe-count instrumentation. NVTX color selection and RAII range instrumentation. Linux `perf_event_open` counter collection for cycles/instructions/cache/branch/DTLB with per-thread snapshots and aggregated IPC/miss-rate reporting.
Raw and memory input stream execution `cpp/src/io/experimental_mps_fast/file_reader.cpp`	`raw_input_stream_t` mmap-backs file inputs with optional `O_DIRECT` for large non-NFS files, parallelizes bounded `pread` window reads with `EINVAL`→buffered fallback per-window, publishes decoded ranges to section scanner. `memory_input_stream_t` wraps caller-provided buffers with padding enforcement. Utilities: file-size/page-size caching, `EINTR`-tolerant `pread_full` (fails on partial reads), NFS path detection via magic file-system ID, extension-based `FileReadMethod` normalization with `Lz4` validation, best-effort cache dropping via `posix_fadvise`.
LZ4 frame decoding pipeline and compressed file dispatch `cpp/src/io/experimental_mps_fast/lz4_file_reader.cpp`, `cpp/src/io/file_to_string.cpp`, `cpp/src/io/file_to_string.hpp`	Runtime `dlopen`/`dlsym` liblz4 loading with minimal frame ABI. LZ4 frame header parsing validates magic/version, requires independent blocks, extracts optional content-size/dict-id/checksum flags, and computes per-block max size. Parallel 3-stage pipeline: readers (resident-window `pread` with readiness signaling via condition variable), metadata scanner (block-stream walk, per-block payload staging with zero-copy or crossing-buffer copies, optional checksum consumption, frame-end verification), decoders (`LZ4_decompress_safe_runtime` or `memcpy`, per-window decode-ref counting for release, ready-frontier publishing). Case-insensitive extension dispatch in `file_to_string` routes `.lz4`/`.gz`/`.bz2` to respective decompressors.
Default parser refinement and parameterized test coverage `cpp/src/io/mps_parser.cpp`, `cpp/tests/linear_programming/parser_test.cpp`, `cpp/tests/linear_programming/CMakeLists.txt`	Minor fix to default parser objective-name handling: assign only when empty; treat subsequent differing names as ignored (instead of broader `else` branch). Existing MPS/QPS parser fixtures converted from `TEST_F` to `TEST_P` to run against both default and fast experimental readers via parameterization. New test verifies `read(..., mps_reader_type_t::fast_experimental)` dispatch for `.qps` inputs. CMake wires `MPS_FAST_PARSER_TEST` target with experimental fast parser sources, includes directory, and simde linking.
Fast FP64 parser numeric validation `cpp/tests/linear_programming/experimental_mps_fast/fast_fp64_parser_test.cpp`	Gtest suite validating `fp64::parse_fp64_advance` bitwise equivalence to `std::strtod` across fixed numeric string cases (including `D`-exponent form), cursor position advancement to token end, malformed numeric suffixes, and 100k deterministic randomized tokens with sign and optional scientific-notation variation.
Fast parser comprehensive edge case and integration tests `cpp/tests/linear_programming/experimental_mps_fast/fast_parser_edge_test.cpp`	Temporary-file fixtures with strict bitwise model comparison (via `std::bit_cast` and `memcmp` on vectors). Scanner boundary detection under arbitrary byte-offset block partitioning; rejection of malformed unknown-record scenarios. BOUNDS defaults, last-statement-wins duplicate semantics, bounds-only variable ordering, integer marker (`INTORG`/`INTEND`) type/bound assignment, scientific-notation fidelity, CRLF equivalence, comment placement across section records, `OBJNAME` objective selection. Large/repeated MPS structure stress tests. Compressed-format parity (`.lz4`/`.gz`/`.bz2`) generation via external CLIs (when available) and full bitwise structural equality validation. QMATRIX/QC-matrix edge cases, malformed QC definition/reference rejection, and unsupported quadratic-record rejection.
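The "SWAR digit parsing" mentioned in the FP64 row can be illustrated with the widely used multiply-and-shift trick for converting eight ASCII digits at once. This is only a sketch of the general technique, not the PR's code: the names `all_digits8` and `parse_eight_digits_swar` are hypothetical, a little-endian host is assumed, and the digit check is a plain loop kept for readability.

```cpp
#include <cstdint>
#include <cstring>

// Returns true when all 8 bytes at p are ASCII digits ('0'..'9').
// Plain scalar loop for clarity; a real fast path would likely vectorize this.
inline bool all_digits8(const char* p)
{
  for (int i = 0; i < 8; ++i) {
    if (p[i] < '0' || p[i] > '9') { return false; }
  }
  return true;
}

// Converts exactly 8 ASCII digits into an integer using SWAR arithmetic.
// Assumption: little-endian host, so p[0] ends up in the least significant byte.
inline std::uint32_t parse_eight_digits_swar(const char* p)
{
  std::uint64_t v;
  std::memcpy(&v, p, sizeof(v));
  // Strip the ASCII high nibble, then merge adjacent digits into 2-digit values.
  v = (v & 0x0F0F0F0F0F0F0F0FULL) * 2561ULL >> 8;
  // Merge adjacent 2-digit values into 4-digit values.
  v = (v & 0x00FF00FF00FF00FFULL) * 6553601ULL >> 16;
  // Merge the two 4-digit values into the final 8-digit value.
  v = (v & 0x0000FFFF0000FFFFULL) * 42949672960001ULL >> 32;
  return static_cast<std::uint32_t>(v);
}
```

A caller would check `all_digits8` first and fall back to a byte-at-a-time loop for shorter or non-digit runs.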

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

rg20
chris-maes
Bubullzz

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

coderabbitai

Actionable comments posted: 12

🧹 Nitpick comments (2)

cpp/tests/linear_programming/parser_test.cpp (1)
2553-2675: ⚡ Quick win

Add direct coverage for the new fast-reader rejection branches.

These dispatch tests still only exercise success paths. They never assert the two new guards in read(...): rejecting fast_experimental with fixed_mps_format=true, and rejecting .qps* when the fast reader is selected. A small pair of EXPECT_THROW cases would lock down the new CLI/API contract.

As per coding guidelines, "Add test coverage for edge cases (empty, infeasible, unbounded, degenerate) when adding new code paths."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/linear_programming/parser_test.cpp` around lines 2553 - 2675, Add
two negative tests exercising the new fast-reader rejection branches: (1) call
read<int,double> (or use dispatch_parse) with fast_experimental enabled while
passing fixed_mps_format=true and EXPECT_THROW(std::logic_error) to cover the
"reject fast_experimental with fixed_mps_format" guard (reference the read
function signature and any CLI flag/parameter used to enable
fast_experimental/fixed_mps_format in your API), and (2) attempt to parse a
".qps" (or ".qps.gz"/".qps.bz2") file while forcing the fast reader and
EXPECT_THROW(std::logic_error) to cover the "reject .qps* when fast reader
selected" branch; add these as new TEST cases next to the existing dispatch
tests (e.g., alongside read, qps_extension_dispatches_to_mps_parser) so they run
with the other parser dispatch tests.
Source: Coding guidelines
cpp/tests/linear_programming/experimental_mps_fast/fast_fp64_parser_test.cpp (1)

117-149: ⚡ Quick win

Add explicit overflow/underflow and subnormal boundary cases.

The current suite never exercises the hardest FP64 paths: overflow, underflow-to-zero, and subnormal boundaries. Because the randomized generator caps exponents at [-30, 30], it also won’t cover the fallback/equivalence edge cases this parser is most likely to regress on. Please add a few fixed cases like max-finite overflow, min-normal neighbors, and min-subnormal rounding boundaries.

Based on learnings: "Tests must validate numerical correctness, not just run without error" and "Add test coverage for edge cases (empty, infeasible, unbounded, degenerate) when adding new code paths."

Also applies to: 168-176

Source: Coding guidelines
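As a rough illustration of the boundary cases requested above, the table below pairs decimal strings with the IEEE-754 double they should round to, using `std::strtod` as the reference oracle. The struct and helper names are hypothetical; the PR's `fp64::parse_fp64_advance` is not called here because its exact signature is not shown in this thread, so a real test would run it against the same table.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <limits>

// Hypothetical test-table entry: input text and the expected double.
struct fp64_boundary_case {
  const char* text;
  double expected;
};

static const fp64_boundary_case fp64_boundary_cases[] = {
  {"1.7976931348623157e308", std::numeric_limits<double>::max()},        // largest finite
  {"1e309", std::numeric_limits<double>::infinity()},                    // overflow
  {"2.2250738585072014e-308", std::numeric_limits<double>::min()},       // smallest normal
  {"4.9e-324", std::numeric_limits<double>::denorm_min()},               // smallest subnormal
  {"3e-324", std::numeric_limits<double>::denorm_min()},                 // rounds up to it
  {"2e-324", 0.0},                                                       // rounds down to zero
  {"1e-400", 0.0},                                                       // underflow to zero
};

// Bit pattern of a double, so that comparisons are exact.
inline std::uint64_t bits_of(double v)
{
  std::uint64_t b;
  std::memcpy(&b, &v, sizeof(b));
  return b;
}

// True when the reference parser (std::strtod, "C" locale assumed)
// produces exactly the expected bit pattern.
inline bool reference_matches(const fp64_boundary_case& c)
{
  return bits_of(std::strtod(c.text, nullptr)) == bits_of(c.expected);
}
```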

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/include/cuopt/linear_programming/io/parser.hpp`:
- Around line 20-25: The documentation for the experimental fast MPS reader is
incorrect: it claims LP/MIP/QP support but the fast path throws on any .qps*
input when mps_reader_type_t::fast_experimental is selected; update the docs to
accurately reflect supported formats (remove QP/.qps claim) wherever the enum
mps_reader_type_t and the dispatch comment block are defined (the enum
declaration and the doc comment above the MPS reader dispatcher), or
alternatively implement QP/.qps support in the fast reader; pick the
documentation fix unless you also add QP handling.
- Around line 155-180: The code unconditionally accepts and advertises “.lz4”
extensions even when LZ4 support may be compiled out; update the suffix checks
and the supported-extensions error message to be gated by the same compile-time
flag used in the build (e.g., MPS_PARSER_WITH_LZ4/CUOPT_PARSER_WITH_LZ4).
Concretely, wrap the checks that call lower.ends_with(".mps.lz4"), ".qps.lz4",
and ".lp.lz4" and the inclusion of ".mps.lz4/.qps.lz4/.lp.lz4" in the thrown
message with `#ifdef` MPS_PARSER_WITH_LZ4 (or the project’s appropriate macro),
and when the macro is not defined ensure those suffixes are not matched and not
listed in the supported extensions string referenced by
read_mps_fast_experimental, read_mps, and read_lp routing logic.

In `@cpp/src/io/experimental_mps_fast/fast_fp64_parser.hpp`:
- Around line 417-427: parse_fp64_advance currently accepts partially-parsed
numbers because it trusts parse_decimal_advance even when it stops early; change
it so that after a successful parse_decimal_advance(p, end, dec) you verify the
parser consumed the entire token by checking that p == end, and if not return
fallback_strtod(std::string_view(start, (size_t)(p - start))); only when p ==
end should you call assemble_fp64(dec) and return its value (with the existing
NaN check that falls back to fallback_strtod). This uses the existing symbols
parse_fp64_advance, parse_decimal_advance, assemble_fp64, and fallback_strtod.
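The "consumed the entire token" check described above can be sketched in isolation. The helper below is hypothetical and uses `std::strtod` instead of the PR's `parse_decimal_advance`; it only demonstrates rejecting a token when the parser stops before the token end.

```cpp
#include <cstdlib>
#include <optional>
#include <string>
#include <string_view>

// Parses `token` as a double only if every character is consumed.
// Note: std::strtod also skips leading whitespace and accepts forms such as
// "inf" or hex floats, which a strict MPS reader may want to reject separately.
inline std::optional<double> parse_whole_token(std::string_view token)
{
  if (token.empty()) { return std::nullopt; }
  const std::string buf(token);  // std::strtod needs a NUL-terminated buffer
  char* end = nullptr;
  const double value = std::strtod(buf.c_str(), &end);
  // A partial parse leaves `end` before the end of the token.
  if (end != buf.c_str() + buf.size()) { return std::nullopt; }
  return value;
}
```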

In `@cpp/src/io/experimental_mps_fast/file_reader.cpp`:
- Around line 41-46: The case-sensitive suffix checks in path_has_suffix (and
the similar checks at 274-296) cause inconsistency with file_to_string.cpp which
lowercases filenames before detecting .lz4/.gz/.bz2; normalize the filename to a
lowercase copy once before calling effective_file_read_method()/before any
suffix checks and use that lowercase string for all suffix comparisons (or
update path_has_suffix to perform case-insensitive comparison), so files like
MODEL.MPS.LZ4 are detected as compressed by parse_mps_fast_file and the legacy
reader alike.
- Around line 97-108: get_file_size(const std::string& path) currently opens fd
then calls get_file_size(fd, path) and closes fd only on the success path,
leaking the descriptor if get_file_size(fd, path) throws; wrap the raw fd in a
small RAII/scoped guard (or use a unique_fd/ScopeExit) immediately after ::open
so the descriptor is closed on all exit paths, then call get_file_size(fd, path)
using the RAII handle and let the guard close the fd when it goes out of scope.
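A minimal sketch of the scoped-descriptor idea, assuming a POSIX system. `scoped_fd` and `file_size_of` are hypothetical names, not the PR's helpers; the point is only that the descriptor is closed on every exit path, including when an exception is thrown.

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical owner of a file descriptor; closes it when it goes out of scope.
class scoped_fd {
 public:
  explicit scoped_fd(int fd) noexcept : fd_(fd) {}
  scoped_fd(const scoped_fd&)            = delete;
  scoped_fd& operator=(const scoped_fd&) = delete;
  ~scoped_fd()
  {
    if (fd_ >= 0) { ::close(fd_); }
  }
  int get() const noexcept { return fd_; }

 private:
  int fd_;
};

// Returns the file size in bytes; the descriptor is released even if fstat fails.
inline std::size_t file_size_of(const std::string& path)
{
  scoped_fd fd(::open(path.c_str(), O_RDONLY | O_CLOEXEC));
  if (fd.get() < 0) { throw std::runtime_error("cannot open " + path); }
  struct stat st {};
  if (::fstat(fd.get(), &st) != 0) { throw std::runtime_error("cannot stat " + path); }
  return static_cast<std::size_t>(st.st_size);
}
```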

In `@cpp/src/io/experimental_mps_fast/file_reader.hpp`:
- Around line 156-183: The helper parallel_for_indexed currently accepts
thread_count==0 and silently does no work; validate thread_count at the start of
parallel_for_indexed (before reserving/spawning threads) and either clamp it to
at least 1 or throw an exception on zero. Update the function to check the
thread_count parameter (the one passed into parallel_for_indexed) immediately
and then proceed to use the validated value with scoped_thread_group workers,
next, and the existing worker lambda so no silent no-op occurs.

In `@cpp/src/io/experimental_mps_fast/lz4_file_reader.cpp`:
- Around line 536-557: Wrap the entire body of read_window in a try { ... }
catch(...) block so any exception (e.g., from new char[w.size], pread_full, or
other statements) is caught and forwarded by calling
fail_and_notify(std::current_exception()); retain the existing mps_parser_fail
usage for pread errors but remove any paths that let exceptions escape
read_window (ensure all exception flows end up invoking
fail_and_notify(std::current_exception()), using the function names read_window,
pread_full, mps_parser_fail, and fail_and_notify to locate the changes).

In `@cpp/src/io/experimental_mps_fast/mmap_region.hpp`:
- Around line 76-100: anonymous_aligned currently unmaps prefix/suffix using
byte counts which can produce non-page-aligned munmap calls (EINVAL) and
leak/steal pages; fix by making the helper either (A) retain the original raw
mapping and raw_size in mmap_region_t and unmap that entire raw mapping in the
destructor/reset(), or (B) compute prefix and suffix rounded up/down to the
system page size (use sysconf(_SC_PAGESIZE) or getpagesize()) so every munmap
boundary is page-aligned before calling ::munmap; implement RAII by adding
raw_ptr/raw_size members to mmap_region_t, unmapping in its destructor and
reset() and checking munmap return values, and update anonymous_aligned to
populate those members rather than attempting partial unmaps at arbitrary byte
offsets.
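Option (B) above can be sketched as follows, assuming a POSIX system and an alignment that is a multiple of the page size. The names `aligned_mapping` and `map_anonymous_aligned` are hypothetical. Because the raw mapping is page-aligned and the requested size is rounded up to whole pages, both trimmed ranges start and end on page boundaries.

```cpp
#include <sys/mman.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>

// Hypothetical result type: an aligned anonymous mapping and its size in bytes.
struct aligned_mapping {
  void* ptr;
  std::size_t size;
};

// Over-allocates by `alignment`, then trims the unaligned prefix and the
// leftover suffix. Returns {nullptr, 0} on invalid arguments or mmap failure.
// munmap return values are ignored here for brevity; real code should check them.
inline aligned_mapping map_anonymous_aligned(std::size_t bytes, std::size_t alignment)
{
  const std::size_t page = static_cast<std::size_t>(::sysconf(_SC_PAGESIZE));
  if (bytes == 0 || alignment == 0 || alignment % page != 0) { return {nullptr, 0}; }

  const std::size_t size     = (bytes + page - 1) / page * page;  // whole pages
  const std::size_t raw_size = size + alignment;
  void* raw = ::mmap(nullptr, raw_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (raw == MAP_FAILED) { return {nullptr, 0}; }

  const auto raw_addr              = reinterpret_cast<std::uintptr_t>(raw);
  const std::uintptr_t aligned_addr = (raw_addr + alignment - 1) / alignment * alignment;
  const std::size_t prefix         = aligned_addr - raw_addr;    // multiple of page
  const std::size_t suffix         = raw_size - prefix - size;   // multiple of page

  if (prefix != 0) { ::munmap(raw, prefix); }
  if (suffix != 0) { ::munmap(reinterpret_cast<void*>(aligned_addr + size), suffix); }
  return {reinterpret_cast<void*>(aligned_addr), size};
}
```

The caller releases the region with a single `munmap(ptr, size)`, which is what an RAII wrapper's destructor would do.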

In `@cpp/src/io/experimental_mps_fast/mps_section_scanner.cpp`:
- Around line 125-130: mps_phase_registry_t::publish_endata currently overwrites
endata_begin_ and endata_present_ even after endata_ready_ was set, creating
races with readers that read the plain members after an acquire on
endata_ready_; change publish_endata so it performs a single-shot publication:
only update endata_begin_ and endata_present_ when endata_ready_ was not yet set
(use an atomic test-and-set or compare_exchange on endata_ready_ to detect first
publication) and return without mutating the payload if endata_ready_ is already
true; ensure this mirrors the behavior of publish() and apply the same fix to
the other occurrence mentioned (lines ~455-459) so readers of
endata_begin()/endata_present() see a single, immutable publication.
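One way to get the single-shot behaviour described above is a small three-state flag: the first publisher claims the slot with a compare-exchange, writes the payload, and then releases it to readers. The class below is a hypothetical stand-in for the registry's ENDATA fields, not the PR's `mps_phase_registry_t`.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical single-shot publication slot.
class endata_slot {
 private:
  enum : int { empty, writing, ready };

 public:
  // Returns true only for the first caller; later calls leave the payload untouched.
  bool publish(std::size_t begin, bool present)
  {
    int expected = empty;
    if (!state_.compare_exchange_strong(expected, writing, std::memory_order_acq_rel)) {
      return false;
    }
    begin_   = begin;
    present_ = present;
    state_.store(ready, std::memory_order_release);  // payload becomes visible to readers
    return true;
  }

  // Readers must observe is_ready() == true before touching begin()/present().
  bool is_ready() const { return state_.load(std::memory_order_acquire) == ready; }
  std::size_t begin() const { return begin_; }
  bool present() const { return present_; }

 private:
  std::atomic<int> state_{empty};
  std::size_t begin_ = 0;
  bool present_      = false;
};
```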

In `@cpp/src/io/utilities/error.hpp`:
- Around line 37-40: mps_parser_throw currently injects msg verbatim into a JSON
string which can produce invalid JSON when msg contains quotes, backslashes or
newlines; add a small JSON-escaping helper (e.g., json_escape(const
std::string&)) that replaces backslash, quote and control chars (e.g., \n, \r,
\t) with their JSON-escaped forms and use it when building the thrown
std::logic_error message (replace std::string(msg) with json_escape(msg));
reference mps_parser_throw and error_to_string when updating the construction so
the thrown payload remains {"MPS_PARSER_ERROR_TYPE": "...", "msg": "escaped
text"}.
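A minimal sketch of such a `json_escape` helper, assuming the message is plain bytes and only the characters JSON requires to be escaped need handling. How it is wired into `mps_parser_throw` is left out, since that function's body is not shown here.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Escapes quotes, backslashes and control characters so that the result can be
// embedded between double quotes in a JSON document.
inline std::string json_escape(std::string_view in)
{
  std::string out;
  out.reserve(in.size());
  for (const char ch : in) {
    switch (ch) {
      case '"': out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\n': out += "\\n"; break;
      case '\r': out += "\\r"; break;
      case '\t': out += "\\t"; break;
      default:
        if (static_cast<unsigned char>(ch) < 0x20) {
          // Remaining control characters use the \u00XX form.
          char buf[8];
          std::snprintf(buf, sizeof(buf), "\\u%04x",
                        static_cast<unsigned>(static_cast<unsigned char>(ch)));
          out += buf;
        } else {
          out += ch;
        }
    }
  }
  return out;
}
```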

In `@cpp/src/utilities/perf_counters.hpp`:
- Around line 11-15: This header is missing direct includes for utilities it
uses: add `#include` <utility> for std::pair, `#include` <cstring> for std::strlen
and std::strncmp, and `#include` <cstdlib> for std::strtol so perf_counters.hpp is
self-contained; update the include block (which currently has <array>, <cerrno>,
<cstdint>, <cstdio>, <vector>) to also include those three headers to satisfy
IWYU and the self-contained-header rule.

In
`@cpp/tests/linear_programming/experimental_mps_fast/fast_parser_edge_test.cpp`:
- Around line 94-129: The test helper check_models_match_reference_bitwise
currently uses EXPECT_EQ to compare floating-point vectors (A_, b_, c_,
variable_lower_bounds_, variable_upper_bounds_, constraint_lower_bounds_,
constraint_upper_bounds_), which compares values not IEEE-754 bit patterns;
change those comparisons to element-wise bit-wise comparisons (use the existing
bits() helper or std::bit_cast<uint64_t> on each double) so each corresponding
element in parser_model_t::A_, mps_data_model_t::A_, and the b_/c_/bound vectors
is compared by its bit representation; update the assertions for A_, b_, c_,
variable_lower_bounds_, variable_upper_bounds_, constraint_lower_bounds_, and
constraint_upper_bounds_ in check_models_match_reference_bitwise to iterate
elements and EXPECT_EQ(bits(ref[i]), bits(fast[i])) (with clear context strings)
instead of EXPECT_EQ on the whole vector.
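The difference between value comparison and bit-pattern comparison can be shown with a small helper. The function names are hypothetical; `std::memcpy` is used so the sketch does not depend on C++20, and `std::bit_cast<std::uint64_t>` would be the equivalent there.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// IEEE-754 bit pattern of a double.
inline std::uint64_t fp64_bits(double v)
{
  std::uint64_t b;
  std::memcpy(&b, &v, sizeof(b));
  return b;
}

// Returns -1 when both vectors are bitwise identical, otherwise the index of
// the first differing element (or the shorter length when the sizes differ).
inline std::ptrdiff_t first_bitwise_mismatch(const std::vector<double>& ref,
                                             const std::vector<double>& got)
{
  const std::size_t n = ref.size() < got.size() ? ref.size() : got.size();
  for (std::size_t i = 0; i < n; ++i) {
    if (fp64_bits(ref[i]) != fp64_bits(got[i])) { return static_cast<std::ptrdiff_t>(i); }
  }
  return ref.size() == got.size() ? -1 : static_cast<std::ptrdiff_t>(n);
}
```

For example, `{1.0, 0.0}` and `{1.0, -0.0}` compare equal with `operator==` but differ at index 1 under this check, which is the kind of discrepancy a value comparison would hide.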

---

Nitpick comments:
In `@cpp/tests/linear_programming/parser_test.cpp`:
- Around line 2553-2675: Add two negative tests exercising the new fast-reader
rejection branches: (1) call read<int,double> (or use dispatch_parse) with
fast_experimental enabled while passing fixed_mps_format=true and
EXPECT_THROW(std::logic_error) to cover the "reject fast_experimental with
fixed_mps_format" guard (reference the read function signature and any CLI
flag/parameter used to enable fast_experimental/fixed_mps_format in your API),
and (2) attempt to parse a ".qps" (or ".qps.gz"/".qps.bz2") file while forcing
the fast reader and EXPECT_THROW(std::logic_error) to cover the "reject .qps*
when fast reader selected" branch; add these as new TEST cases next to the
existing dispatch tests (e.g., alongside read,
qps_extension_dispatches_to_mps_parser) so they run with the other parser
dispatch tests.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ae230bf3-b5ba-42f6-a034-5f88021e23b3

📥 Commits

Reviewing files that changed from the base of the PR and between 1ec39b3 and fe0aa31.

📒 Files selected for processing (27)

cpp/CMakeLists.txt
cpp/cuopt_cli.cpp
cpp/include/cuopt/linear_programming/io/parser.hpp
cpp/src/CMakeLists.txt
cpp/src/io/CMakeLists.txt
cpp/src/io/experimental_mps_fast/fast_fp64_parser.hpp
cpp/src/io/experimental_mps_fast/fast_parse_primitives.hpp
cpp/src/io/experimental_mps_fast/fast_parser.cpp
cpp/src/io/experimental_mps_fast/fast_parser.hpp
cpp/src/io/experimental_mps_fast/file_reader.cpp
cpp/src/io/experimental_mps_fast/file_reader.hpp
cpp/src/io/experimental_mps_fast/hash_table_smallstr.hpp
cpp/src/io/experimental_mps_fast/lz4_file_reader.cpp
cpp/src/io/experimental_mps_fast/mmap_region.hpp
cpp/src/io/experimental_mps_fast/mps_section_scanner.cpp
cpp/src/io/experimental_mps_fast/mps_section_scanner.hpp
cpp/src/io/experimental_mps_fast/nvtx_ranges.hpp
cpp/src/io/file_to_string.cpp
cpp/src/io/file_to_string.hpp
cpp/src/io/mps_parser.cpp
cpp/src/io/parser.cpp
cpp/src/io/utilities/error.hpp
cpp/src/utilities/perf_counters.hpp
cpp/tests/linear_programming/CMakeLists.txt
cpp/tests/linear_programming/experimental_mps_fast/fast_fp64_parser_test.cpp
cpp/tests/linear_programming/experimental_mps_fast/fast_parser_edge_test.cpp
cpp/tests/linear_programming/parser_test.cpp

Bubullzz · 2026-06-12T15:50:49Z

I can confirm I extensively used this code to parse several big mps instances (> 50GB) and the results were bitwise equal with the original parser.

aliceb-nv · 2026-06-15T16:00:24Z

/ok to test 1990c06

aliceb-nv · 2026-06-15T16:08:19Z

/ok to test d7358f6

aliceb-nv · 2026-06-15T17:06:04Z

/ok to test b1abd6f

nguidotti

Thanks for the hard work, Alice! A fast parser is a great improvement.
I have a few general comments:

- Is there a reason for not using AVX512? Nowadays, hardware support is quite good in newer processors.
- I have never worked with ARM SIMD, but NEON seems to be the most common choice. Because of Grace/Vera, I think it is important for us to support this architecture.
- Unless you see a large performance improvement from your custom implementation, I would use the STL methods when possible for simplicity (e.g., for malloc or reading/parsing outside the hot path).

nguidotti · 2026-06-12T17:32:34Z

+  constexpr uint32_t mul_u32(uint32_t m)
+  {
+    unsigned __int128 carry = 0;
+    for (uint64_t& v : limb) {


You can use pragma omp simd to give a hint to the compiler (although this should be easily auto-vectorizable)

This is a constexpr-only path

nguidotti · 2026-06-12T17:36:19Z

+
+inline constexpr auto fast_fp64_parse_lut = make_power_table();
+
+inline constexpr std::array<double, 23> small_powers = {


If you want something a little fancier, you can use compile-time index sequences (https://cppreference.com/cpp/utility/integer_sequence)

Template metaprogramming tricks are awesome but tend to tank compile times a lot more than what modern constexpr expressivity allows :)

nguidotti · 2026-06-18T09:29:51Z

  });
-  if (lower.ends_with(".mps") || lower.ends_with(".mps.gz") || lower.ends_with(".mps.bz2") ||
-      lower.ends_with(".qps") || lower.ends_with(".qps.gz") || lower.ends_with(".qps.bz2")) {
+  if (lower.ends_with(".mps.lz4") || lower.ends_with(".mps.bz2") || lower.ends_with(".mps.gz") ||


Could you create a function that returns the filename without the compression (if it exists)? This will simplify the logic here
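A helper along the lines suggested here might look like the sketch below. The name `strip_compression_suffix` is hypothetical, and the suffix list (`.lz4`, `.gz`, `.bz2`) is taken from the extensions mentioned in this PR. It lowercases the path and removes one trailing compression suffix, so the caller only has to test for `.mps`, `.qps`, or `.lp`.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <string>
#include <string_view>

// Returns the lowercased path with a trailing .lz4/.gz/.bz2 removed, if present.
inline std::string strip_compression_suffix(std::string_view path)
{
  std::string lower(path);
  std::transform(lower.begin(), lower.end(), lower.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  static constexpr std::array<std::string_view, 3> suffixes = {".lz4", ".gz", ".bz2"};
  for (const std::string_view s : suffixes) {
    if (lower.size() > s.size() &&
        std::string_view(lower).substr(lower.size() - s.size()) == s) {
      lower.resize(lower.size() - s.size());
      break;  // strip at most one compression layer
    }
  }
  return lower;
}
```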

nguidotti · 2026-06-18T09:58:48Z

+        });
+      }
+
+#pragma omp taskwait


You do not need the taskwait here. There is an implicit barrier at the end of the single and the parallel section that waits for all tasks to be completed before proceeding

nguidotti · 2026-06-18T10:02:26Z

+
+// Contract every input stream fed to parse_mps_fast_stream must satisfy.
+template <typename Stream>
+concept InputStream = requires(Stream stream)


I think the InputStream concept is not needed here since it is only used once and the templated method is enclosed in this file.

nguidotti · 2026-06-18T10:05:03Z

+  return page_size;
+}
+
+bool pread_full(int fd, char* dst, std::size_t bytes, std::size_t offset)


Is there a function in the STL that does this? Seems like a common method

Not really for this scenario :( The goal here is to submit parallel chunk reads to a single file descriptor. The C stdlib only really provides serial stateful options (FILE*, or std::ifstream in C++)

nguidotti · 2026-06-18T10:06:55Z

+// are named "<thread_name_prefix><worker-id>" when a prefix is supplied.
+// OMP just doesn't really play well with blocking pread()
+template <typename Body>
+void parallel_for_indexed(std::size_t count,


In OpenMP, you already have dynamic scheduling policies that dynamically balance the load across the threads

In this scenario (blocking I/O pread()s), the simplest option was just to stick to std::thread. These threads strictly do I/O, so it won't really lead to contention either way.

nguidotti · 2026-06-18T10:11:50Z

@@ -0,0 +1,194 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights


How does this differ from running perf from outside?

Convenience (directly logged, not having to tweak my existing scripts)
If need be this can be dropped, naturally

It is fine to keep it. I was just curious whether there is a difference.

aliceb-nv · 2026-06-18T12:21:24Z

Regarding AVX512 - it is still not consistently present on modern consumer CPUs (either no support at all, or only a subset), and the AMD implementations are essentially still using 256bit ALUs in a trench coat; so I did not feel comfortable making this a requirement. I also don't expect the improvements to be huge without a significant refactor - most lines in MPS files are >16B and <32B wide, which fit AVX2 perfectly. AVX2 is also basically guaranteed to exist on any modern x86 CPU released in the last 10 years, consumer or server

SVE2 support could be awesome :) Would be lovely to benchmark. As it stands, NEON is guaranteed to be present on all aarch64 CPUs and that's what SIMDe uses under the hood to translate the intel-style intrinsics.

All custom implementations used here were motivated by benchmarking. std::from_chars came close to the fast fp64 parser used here on modern GCC libstdc++, but I did not feel too comfortable using it directly since if it falls back to a locale-based parser (or if compiled with any other libstdc++), performance would tank significantly and seemingly unpredictably to the customer.

nguidotti · 2026-06-18T14:31:54Z

Fair enough. Thanks for explaining it to me!

aliceb-nv added 14 commits June 10, 2026 08:33

port fast mps parser tp tree

4d2ec82

thread count cap

68daf3d

fix crashes, more opti

eb0e285

improved iee754 compliant float parsing, warn on nnz > INT_MAX

91742cd

decode performance metrics

be97a05

lots of cleanup

1e4d7c9

moved perf counters

8e01e28

extend the lz4 decompression to the regular parser, more cleanup and …

94bfbc7

…refactor

further cleanup

62c8dcd

cleanup for clarity

79e958e

more cleanup, fix som eedge case failures

2614137

ai review comments

9185d7a

ai review

a1e14d5

Comments on the build flags

fe0aa31

aliceb-nv added this to the 26.08 milestone Jun 12, 2026

aliceb-nv requested review from a team as code owners June 12, 2026 15:33

aliceb-nv requested a review from Iroy30 June 12, 2026 15:33

aliceb-nv added the non-breaking (Introduces a non-breaking change) and improvement (Improves an existing functionality) labels Jun 12, 2026

aliceb-nv requested review from kaatish, msarahan and rg20 June 12, 2026 15:33

coderabbitai Bot reviewed Jun 12, 2026


aliceb-nv added 2 commits June 13, 2026 04:36

gate O_DIRECT behind non-nfs, add missing license notices

cfaccc3

fix bitwise comps, more cleanup and comments

7220838

AI review comments

1990c06

coderabbitai Bot mentioned this pull request Jun 15, 2026

Escape msg before embedding in JSON payload in mps_parser_throw #1436

Open

fix sloppy fix

d7358f6

Merge branch 'main' into fast-mps-parser-final

b1abd6f

nguidotti reviewed Jun 18, 2026


