
Structured Event logging #1930

Open

hansent wants to merge 16 commits into main from feature/gcp-serverless-structured-logging

Conversation

@hansent (Collaborator) commented Jan 20, 2026

Summary

Adds a structured logging system that captures model lifecycle events for observability and analysis. This enables understanding cache hit rates, model thrashing patterns, and memory usage in production.

Key features:

  • JSON structured logs compatible with GCP Cloud Logging
  • 7 event types covering the full model lifecycle
  • Tiered sampling to control log volume/costs (high-volume events sampled, lifecycle events always logged)
  • Tiered memory measurement (basic vs detailed mode)
  • Request context propagation (correlates events across workflow steps)
  • Support for both legacy model loading and inference-models paths

Event types:

  • request_received - Direct inference request (sampled)
  • workflow_request_received - Workflow execution started (sampled)
  • model_cache_status - Cache hit/miss per request (sampled)
  • model_loaded_to_disk - Model artifacts downloaded (always logged)
  • model_loaded_to_memory - Model loaded to GPU/CPU (always logged)
  • model_evicted - Model removed from cache (always logged)
  • inference_completed - Inference finished with timing (sampled)

Configuration (env vars):

  • STRUCTURED_LOGGING_ENABLED - Master switch (default: True when GCP_SERVERLESS)
  • STRUCTURED_LOGGING_SAMPLE_RATE - Sample rate for high-volume events (default: 1.0)
  • STRUCTURED_LOGGING_DETAILED_MEMORY - Enable full memory snapshots (default: False)
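
For illustration, here is a rough sketch of what a single structured log line could look like once emitted to stdout for Cloud Logging. Only the event type names and context fields above come from this PR; the remaining field names and values are assumptions.

import json

# Hypothetical example payload; field names beyond the documented event types
# and context fields (request_id, workflow_instance_id, step_name) are made up
# for illustration.
example_event = {
    "event_type": "model_loaded_to_memory",
    "request_id": "3f2a9c1e-0b7d-4f7e-9a21-d41c8e7b5a10",
    "workflow_instance_id": None,
    "step_name": None,
    "model_id": "my-project/3",
    "device": "cuda:0",
    "load_time_seconds": 2.41,
}
print(json.dumps(example_event))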

Test plan

  • Run server locally with STRUCTURED_LOGGING_ENABLED=true and verify JSON logs appear in stdout
  • Make direct inference request → verify request_received, model_cache_status, model_loaded_to_memory, inference_completed events
  • Make second request to same model → verify cache_hit: true in events
  • Execute workflow → verify workflow_request_received and events include workflow_instance_id, step_name
  • Test with USE_INFERENCE_EXP_MODELS=true → verify model_loaded_to_disk event logged via callback
  • Test with STRUCTURED_LOGGING_DETAILED_MEMORY=true → verify memory snapshot fields populated
  • Test with STRUCTURED_LOGGING_SAMPLE_RATE=0.1 → verify ~10% of high-volume events logged

hansent and others added 3 commits January 20, 2026 13:39
Introduces a separate structured logging system for GCP serverless deployments
that captures model lifecycle events to understand cache hit rates, model
thrashing, and memory usage patterns.

New module: inference/core/gcp_logging/
- events.py: 7 event dataclasses (request_received, workflow_request_received,
  model_cache_status, model_loaded_to_disk, model_loaded_to_memory,
  model_evicted, inference_completed)
- context.py: Request context tracking via ContextVar
- memory.py: Tiered memory measurement (basic vs detailed mode)
- logger.py: GCPServerlessLogger singleton with sampling support

Environment variables:
- GCP_LOGGING_ENABLED: Master switch (default: True when GCP_SERVERLESS)
- GCP_LOGGING_SAMPLE_RATE: Sample rate for high-volume events (default: 1.0)
- GCP_LOGGING_DETAILED_MEMORY: Enable detailed memory introspection (default: False)

Instrumentation points:
- HTTP middleware: Sets up request context with correlation ID
- Workflow executor: Adds step context for workflow invocations
- ModelManager: Logs cache status, model loads, and inference completion
- Fixed size cache: Logs eviction events with lifetime and inference count
- Roboflow model: Logs disk download events

Features:
- Tiered sampling: High-volume events sampled, lifecycle events always logged
- Tiered memory: Basic mode (cheap) vs detailed mode (GPU sync + syscalls)
- Context propagation: request_id, workflow_instance_id, step_name flow through
- Cache hit tracking: inference_completed includes whether model was pre-loaded

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
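
To make the tiered sampling described above concrete, here is a minimal sketch (not the actual logger.py implementation; the class name and emit signature are illustrative): lifecycle events bypass sampling, while high-volume events are kept with probability equal to the sample rate.

import json
import logging
import random


class StructuredEventLoggerSketch:
    # Illustrative only; the real logger class in this PR may differ in
    # naming, structure, and output formatting.
    def __init__(self, enabled: bool, sample_rate: float = 1.0):
        self.enabled = enabled
        self.sample_rate = sample_rate
        self._log = logging.getLogger("structured_events")

    def emit(self, event: dict, sampled: bool = False) -> None:
        if not self.enabled:
            return
        if sampled and random.random() > self.sample_rate:
            return  # high-volume event dropped according to the sample rate
        # Lifecycle events (sampled=False) always reach this line.
        self._log.info(json.dumps(event))
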
When USE_INFERENCE_EXP_MODELS is enabled, the model is loaded through
the InferenceExpObjectDetectionModelAdapter which wraps the inference_models
AutoModel. This change detects that adapter pattern and:

- Sets backend to "inference-models" or "inference-models/{backend}"
- Extracts device info from the underlying exp_model

This ensures ModelLoadedToMemoryEvent correctly identifies models loaded
through the inference-models path vs the legacy ONNX path.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
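
A hedged sketch of the detection described in this commit; the adapter attribute names (exp_model, backend, device) are assumptions used purely for illustration, not the adapter's real API.

def resolve_backend_and_device(model):
    # If the model is wrapped by the inference-models adapter, report the
    # inference-models backend; otherwise fall back to the legacy ONNX path.
    exp_model = getattr(model, "exp_model", None)  # hypothetical attribute name
    if exp_model is None:
        return "onnx", None
    backend = getattr(exp_model, "backend", None)
    device = getattr(exp_model, "device", None)
    name = f"inference-models/{backend}" if backend else "inference-models"
    return name, str(device) if device is not None else None
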
When USE_INFERENCE_EXP_MODELS is enabled, models are loaded via the
inference-models library's AutoModel.from_pretrained(). This path
bypasses the legacy download code, so ModelLoadedToDiskEvent wasn't
being logged.

This adds GCPLoggingModelAccessManager - a custom ModelAccessManager
that hooks into inference-models download callbacks:
- on_model_package_access_granted: Start tracking download
- on_file_created: Accumulate file sizes and count
- on_model_loaded: Log ModelLoadedToDiskEvent with totals

The adapter only creates the access manager when GCP logging is enabled.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
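
Roughly, the callback-based access manager described above could look like the following sketch. Only the three hook names come from the commit message; the method signatures, the accumulated fields, and the emit call are assumptions.

import time


class LoggingModelAccessManagerSketch:
    # Sketch only; the real class plugs into inference-models' ModelAccessManager
    # interface, whose exact signatures are not shown here.
    def __init__(self, logger, model_id):
        self._logger = logger
        self._model_id = model_id
        self._download_start = None
        self._file_count = 0
        self._total_bytes = 0

    def on_model_package_access_granted(self, *args, **kwargs):
        # Start tracking the download window.
        self._download_start = time.time()

    def on_file_created(self, path, size_bytes=0, **kwargs):
        # Accumulate file sizes and counts as artifacts land on disk.
        self._file_count += 1
        self._total_bytes += int(size_bytes)

    def on_model_loaded(self, *args, **kwargs):
        # Emit ModelLoadedToDiskEvent-style data with the accumulated totals.
        elapsed = time.time() - self._download_start if self._download_start else None
        self._logger.emit(
            {
                "event_type": "model_loaded_to_disk",
                "model_id": self._model_id,
                "file_count": self._file_count,
                "total_bytes": self._total_bytes,
                "download_seconds": elapsed,
            },
            sampled=False,  # lifecycle events are always logged
        )
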
@grzegorz-roboflow (Collaborator) commented:

bugbot review

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

hansent and others added 3 commits February 2, 2026 09:09
…ability (#1962)

* Rename GCP logging to generic structured logging

Rename gcp_logging module to structured_logging since the serverless
deployment is no longer GCP-exclusive and this feature is useful in
other contexts.

Changes:
- Rename directory gcp_logging -> structured_logging
- Rename env vars: GCP_LOGGING_* -> STRUCTURED_LOGGING_*
- Rename classes: GCPServerlessLogger -> StructuredLogger,
  GCPRequestContext -> RequestContext, BaseGCPEvent -> BaseEvent,
  GCPLoggingModelAccessManager -> LoggingModelAccessManager
- Rename functions: set/get/clear/update_gcp_context ->
  set/get/clear/update_request_context
- Rename singleton: gcp_logger -> structured_logger
- Update all imports and usages across consumer files
- Update documentation to reflect generic naming

https://claude.ai/code/session_01LXU2GBiYWHQM7Qi7hPYTPT

* Rename StructuredLogger to StructuredEventLogger for clarity

Make the naming more explicit by using StructuredEventLogger instead of
just StructuredLogger, to clearly indicate this is an event logger.

- Rename class: StructuredLogger -> StructuredEventLogger
- Rename singleton: structured_logger -> structured_event_logger
- Update all imports and usages across consumer files

https://claude.ai/code/session_01LXU2GBiYWHQM7Qi7hPYTPT

---------

Co-authored-by: Claude <noreply@anthropic.com>
@hansent changed the title from "gcp serverless structured logging" to "Structured Event logging" on Feb 2, 2026
hansent and others added 3 commits February 2, 2026 13:17
…leware (#1964)

Decouple the structured logging request context setup from GCPServerlessMiddleware
so it runs whenever STRUCTURED_LOGGING_ENABLED=True, regardless of deployment
environment.

- Create new StructuredLoggingMiddleware that sets up RequestContext with
  request_id (from correlation ID) and hashed api_key
- Keep GCPServerlessMiddleware focused on GCP-specific concerns only
  (execution_id header, processing time header)
- Register StructuredLoggingMiddleware when STRUCTURED_LOGGING_ENABLED is true

https://claude.ai/code/session_01XHjBRQV5q7Nyq7eNEiTAaH

Co-authored-by: Claude <noreply@anthropic.com>
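
A rough sketch of such a middleware, assuming Starlette's BaseHTTPMiddleware; the header name, hashing scheme, and the keyword arguments passed to set_request_context are illustrative assumptions rather than the PR's exact code.

import hashlib

from starlette.middleware.base import BaseHTTPMiddleware

from inference.core.structured_logging.context import (
    clear_request_context,
    set_request_context,
)


class StructuredLoggingMiddlewareSketch(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        api_key = request.query_params.get("api_key")
        hashed_key = (
            hashlib.sha256(api_key.encode()).hexdigest()[:16] if api_key else None
        )
        # CorrelationIdMiddleware is assumed to have populated this header already.
        request_id = request.headers.get("x-request-id")
        set_request_context(request_id=request_id, api_key_hash=hashed_key)
        try:
            return await call_next(request)
        finally:
            clear_request_context()
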
@hansent marked this pull request as ready for review February 2, 2026 19:31
@hansent (Collaborator, Author) commented Feb 2, 2026

bugbot review

@cursor cursor bot commented Feb 2, 2026

Skipping Bugbot: Unable to authenticate your request. Please make sure Bugbot is properly installed and configured for this repository.

Comment on lines +103 to +108
state = _measure_detailed_state()
return MemorySnapshot(
    gpu_free=state.get("gpu_free"),
)


Contributor left a comment

⚡️Codeflash found 1,813% (18.13x) speedup for measure_memory_for_eviction in inference/core/structured_logging/memory.py

⏱️ Runtime: 8.28 milliseconds → 433 microseconds (best of 89 runs)

📝 Explanation and details

The optimized code achieves an 1812% speedup by eliminating unnecessary work in measure_memory_for_eviction() when detailed=True.

Key Optimization

Inlined the critical path: Instead of calling _measure_detailed_state() which collects comprehensive GPU metrics (memory_allocated(), memory_reserved(), gpu_free, gpu_total) and system memory info via psutil, the optimized version directly captures only gpu_free - the single field actually used by the caller.

What Changed

  1. Removed intermediate function call: Eliminated the call to _measure_detailed_state() and its associated function call overhead
  2. Reduced GPU synchronization: Cut GPU memory queries from 4 operations (mem_get_info(), memory_allocated(), memory_reserved(), plus total from mem_get_info()) down to just 1 (mem_get_info() for gpu_free)
  3. Eliminated psutil syscalls: Removed Process().memory_info().rss and virtual_memory().available calls entirely
  4. Simplified data flow: Directly return MemorySnapshot(gpu_free=gpu_free) instead of building an intermediate dictionary and extracting from it

Performance Impact

Based on the line profiler, the original code spent:

  • ~3.55s importing torch (99% of time in _measure_detailed_state)
  • ~37ms importing psutil
  • ~17ms on psutil memory queries

The optimized version eliminates all psutil operations and reduces GPU memory queries by 75%, keeping only the essential gpu_free measurement.

Context from Function References

The function is called during model eviction in WithFixedSizeCache.add_model() - a hot path executed when models are swapped in/out of memory. The optimization is particularly valuable here because:

  • Model eviction can happen frequently under memory pressure or capacity constraints
  • The function is called in a tight loop (for _ in range(3)) during eviction
  • Only gpu_free is actually logged in the ModelEvictedEvent, making the other measurements pure overhead

Test Results Analysis

The optimization excels across all scenarios:

  • Fast path preserved: detailed=False remains unchanged (no regression)
  • Biggest gains in realistic scenarios: Tests with actual torch/psutil mocking show 100-33000% speedups
  • Scales well: Repeated calls test shows 3490% speedup, confirming the optimization benefits accumulate in the eviction loop scenario
  • Memory-constrained environments: When torch is available but psutil is missing, the speedup is dramatic (33243%) since it was doing unnecessary work before

Correctness verification report:

Test Status
⏪ Replay Tests 🔘 None Found
⚙️ Existing Unit Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
🌀 Generated Regression Tests 309 Passed
📊 Tests Coverage 100.0%
Generated Regression Tests:
import sys
import types
from types import ModuleType, SimpleNamespace
from typing import Optional
from unittest.mock import Mock

# imports
import pytest  # used for our unit tests

# function to test
from inference.core.structured_logging.events import MemorySnapshot
from inference.core.structured_logging.memory import measure_memory_for_eviction

# unit tests


# Helper utilities to construct fake external modules (torch, psutil) for deterministic testing.
def _make_fake_torch(
    mem_info_sequence=None, is_available=True, alloc_value=0, reserved_value=0
):
    """
    Create a fake 'torch' module with a cuda submodule that provides:
      - is_available(): returns the provided is_available boolean
      - mem_get_info(): returns a tuple (gpu_free, gpu_total). If mem_info_sequence is provided,
        values are popped from it for each call to simulate changing GPU free values.
      - memory_allocated() and memory_reserved(): return fixed values for determinism.
    """
    torch_mod = ModuleType("torch")
    cuda_mod = ModuleType("torch.cuda")

    # make mem_info_sequence mutable list if provided
    if mem_info_sequence is not None:
        seq = list(mem_info_sequence)

        def mem_get_info():
            # pop from front to simulate changing readings across calls
            if not seq:
                # fallback deterministic value if exhausted
                return (0, 0)
            return (seq.pop(0), 999999)

    else:

        def mem_get_info():
            return (123456, 789012)  # default deterministic tuple

    def is_available_fn():
        return bool(is_available)

    def memory_allocated_fn():
        return int(alloc_value)

    def memory_reserved_fn():
        return int(reserved_value)

    # attach functions to cuda submodule
    cuda_mod.mem_get_info = mem_get_info
    cuda_mod.is_available = is_available_fn
    cuda_mod.memory_allocated = memory_allocated_fn
    cuda_mod.memory_reserved = memory_reserved_fn

    # attach cuda submodule to torch module
    torch_mod.cuda = cuda_mod
    return torch_mod


def _make_fake_psutil(rss_value=0, available_value=0):
    """
    Create a fake 'psutil' module with:
      - Process() returning an object whose memory_info().rss equals rss_value
      - virtual_memory() returning an object with attribute available == available_value
    """
    psutil_mod = ModuleType("psutil")

    class _Proc:
        def memory_info(self_inner):
            # return a simple object with rss attribute
            return SimpleNamespace(rss=int(rss_value))

    def Process():
        return _Proc()

    def virtual_memory():
        return SimpleNamespace(available=int(available_value))

    psutil_mod.Process = Process
    psutil_mod.virtual_memory = virtual_memory
    return psutil_mod


def test_returns_none_when_not_detailed():
    # Call with detailed=False should be a fast path returning None
    codeflash_output = measure_memory_for_eviction(detailed=False)
    result = codeflash_output  # 771ns -> 782ns (1.41% slower)


def test_detailed_with_fake_torch_and_psutil_returns_expected_gpu_free(monkeypatch):
    # Prepare deterministic fake modules
    fake_torch = _make_fake_torch(
        mem_info_sequence=[4444], is_available=True, alloc_value=111, reserved_value=222
    )
    fake_psutil = _make_fake_psutil(rss_value=555, available_value=666)

    # Inject into sys.modules so import inside _measure_detailed_state picks them up
    monkeypatch.setitem(sys.modules, "torch", fake_torch)
    monkeypatch.setitem(sys.modules, "psutil", fake_psutil)

    # Call measured function with detailed=True; should reflect our fake GPU free value
    codeflash_output = measure_memory_for_eviction(detailed=True)
    snapshot = codeflash_output  # 11.4μs -> 5.29μs (116% faster)


def test_torch_present_but_cuda_not_available_psutil_present(monkeypatch):
    # Fake torch with cuda.is_available returning False
    fake_torch = _make_fake_torch(is_available=False)
    fake_psutil = _make_fake_psutil(rss_value=1000, available_value=2000)

    monkeypatch.setitem(sys.modules, "torch", fake_torch)
    monkeypatch.setitem(sys.modules, "psutil", fake_psutil)

    # When CUDA is not available, GPU keys should not be added; snapshot.gpu_free is None
    codeflash_output = measure_memory_for_eviction(detailed=True)
    snapshot = codeflash_output  # 6.30μs -> 3.07μs (106% faster)


def test_torch_available_psutil_missing(monkeypatch):
    fake_torch = _make_fake_torch(mem_info_sequence=[7777], is_available=True)
    monkeypatch.setitem(sys.modules, "torch", fake_torch)
    # Ensure psutil import raises ImportError by removing any real psutil
    monkeypatch.delitem(sys.modules, "psutil", raising=False)

    codeflash_output = measure_memory_for_eviction(detailed=True)
    snapshot = codeflash_output  # 1.25ms -> 3.74μs (33243% faster)


def test_large_scale_repeated_calls_are_consistent(monkeypatch):
    # Prepare a sequence of GPU free values for multiple calls (below 1000 elements as required)
    values = list(range(1000, 1010))  # 10 calls, small and deterministic
    fake_torch = _make_fake_torch(mem_info_sequence=list(values), is_available=True)
    fake_psutil = _make_fake_psutil(rss_value=123, available_value=456)

    monkeypatch.setitem(sys.modules, "torch", fake_torch)
    monkeypatch.setitem(sys.modules, "psutil", fake_psutil)

    # Call measure_memory_for_eviction multiple times and ensure we get the sequence back in order
    observed = []
    for _ in range(len(values)):
        codeflash_output = measure_memory_for_eviction(detailed=True)
        snap = codeflash_output  # 41.7μs -> 16.3μs (156% faster)
        observed.append(snap.gpu_free)


def test_mem_get_info_returns_non_int_types_are_passed_through(monkeypatch):
    # mem_get_info returns a tuple where gpu_free is a string (to simulate unexpected but possible API shape)
    def mem_get_info_str():
        return ("9999_as_str", "total_dummy")

    fake_torch = ModuleType("torch")
    cuda_mod = ModuleType("torch.cuda")
    cuda_mod.is_available = lambda: True
    cuda_mod.mem_get_info = mem_get_info_str
    cuda_mod.memory_allocated = lambda: 0
    cuda_mod.memory_reserved = lambda: 0
    fake_torch.cuda = cuda_mod

    fake_psutil = _make_fake_psutil(rss_value=1, available_value=2)

    monkeypatch.setitem(sys.modules, "torch", fake_torch)
    monkeypatch.setitem(sys.modules, "psutil", fake_psutil)

    codeflash_output = measure_memory_for_eviction(detailed=True)
    snapshot = codeflash_output  # 7.73μs -> 3.10μs (150% faster)


def test_repeated_calls_with_stable_fake_modules_are_idempotent(monkeypatch):
    fake_torch = _make_fake_torch(
        mem_info_sequence=None, is_available=True, alloc_value=10, reserved_value=20
    )
    fake_psutil = _make_fake_psutil(rss_value=321, available_value=654)
    monkeypatch.setitem(sys.modules, "torch", fake_torch)
    monkeypatch.setitem(sys.modules, "psutil", fake_psutil)

    # Call twice and ensure same values are returned
    codeflash_output = measure_memory_for_eviction(detailed=True)
    s1 = codeflash_output  # 7.75μs -> 3.27μs (137% faster)
    codeflash_output = measure_memory_for_eviction(detailed=True)
    s2 = codeflash_output  # 4.45μs -> 1.56μs (185% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unittest.mock import MagicMock, patch

# imports
import pytest
from inference.core.structured_logging.events import MemorySnapshot
from inference.core.structured_logging.memory import measure_memory_for_eviction


class TestBasicFunctionality:
    """Test suite for basic functionality of measure_memory_for_eviction."""

    def test_measure_memory_detailed_false_returns_none(self):
        """
        Test that when detailed=False, the function returns None.
        This is the default behavior and should not perform any memory measurement.
        """
        codeflash_output = measure_memory_for_eviction(detailed=False)
        result = codeflash_output  # 792ns -> 811ns (2.34% slower)

    def test_measure_memory_detailed_false_is_default(self):
        """
        Test that detailed=False is the default parameter value.
        Calling without arguments should return None.
        """
        codeflash_output = measure_memory_for_eviction()
        result = codeflash_output  # 521ns -> 521ns (0.000% faster)

    def test_measure_memory_detailed_true_returns_memory_snapshot(self):
        """
        Test that when detailed=True, the function returns a MemorySnapshot object.
        This requires actual memory measurement capability.
        """
        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 165μs -> 11.0μs (1396% faster)

    def test_return_type_is_memory_snapshot_or_none(self):
        """
        Test that the return type is always either MemorySnapshot or None.
        """
        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 149μs -> 8.04μs (1753% faster)

    def test_memory_snapshot_has_gpu_free_attribute(self):
        """
        Test that MemorySnapshot contains gpu_free attribute when available.
        """
        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 147μs -> 7.49μs (1870% faster)
        if result is not None:
            pass


class TestEdgeCases:
    """Test suite for edge cases and unusual conditions."""

    def test_multiple_calls_consistent_behavior(self):
        """
        Test that multiple calls with detailed=False always return None.
        """
        for _ in range(5):
            codeflash_output = measure_memory_for_eviction(detailed=False)
            result = codeflash_output  # 1.83μs -> 1.86μs (1.61% slower)

    def test_boolean_true_explicitly_passed(self):
        """
        Test that explicitly passing True for detailed parameter works correctly.
        """
        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 148μs -> 8.04μs (1747% faster)

    def test_boolean_false_explicitly_passed(self):
        """
        Test that explicitly passing False for detailed parameter returns None.
        """
        codeflash_output = measure_memory_for_eviction(detailed=False)
        result = codeflash_output  # 721ns -> 721ns (0.000% faster)

    def test_detailed_parameter_type_coercion(self):
        """
        Test that truthy/falsy values work as detailed parameter.
        0 should be falsy (like False), 1 should be truthy (like True).
        """
        # Test with 0 (falsy)
        codeflash_output = measure_memory_for_eviction(detailed=0)
        result = codeflash_output  # 742ns -> 751ns (1.20% slower)

        # Test with 1 (truthy)
        codeflash_output = measure_memory_for_eviction(detailed=1)
        result = codeflash_output  # 149μs -> 7.61μs (1858% faster)

    def test_gpu_free_value_when_torch_unavailable(self):
        """
        Test behavior when torch is not available.
        The function should still work but gpu_free may be None.
        """
        with patch("torch.cuda.is_available", side_effect=ImportError):
            codeflash_output = measure_memory_for_eviction(detailed=True)
            result = codeflash_output  # 173μs -> 30.6μs (465% faster)

    def test_gpu_free_value_when_psutil_unavailable(self):
        """
        Test behavior when psutil is not available.
        The function should still work without system memory info.
        """
        with patch("psutil.Process", side_effect=ImportError):
            codeflash_output = measure_memory_for_eviction(detailed=True)
            result = codeflash_output  # 29.8μs -> 9.35μs (219% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_memory_snapshot_created_with_gpu_free_value(self, mock_measure):
        """
        Test that MemorySnapshot is properly created with gpu_free from state dict.
        """
        # Mock the detailed state to return known values
        mock_measure.return_value = {
            "gpu_free": 12345678,
            "gpu_total": 16000000000,
            "gpu_allocated": 3654322,
            "gpu_reserved": 3900000,
        }

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 19.5μs -> 8.44μs (131% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_memory_snapshot_with_empty_state(self, mock_measure):
        """
        Test that MemorySnapshot handles empty state dict gracefully.
        """
        # Mock the detailed state to return empty dict
        mock_measure.return_value = {}

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 19.0μs -> 8.30μs (129% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_memory_snapshot_with_none_gpu_free(self, mock_measure):
        """
        Test that MemorySnapshot handles None gpu_free value correctly.
        """
        mock_measure.return_value = {"gpu_free": None}

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 18.7μs -> 8.35μs (124% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_memory_snapshot_with_zero_gpu_free(self, mock_measure):
        """
        Test that MemorySnapshot handles zero gpu_free value (out of memory condition).
        """
        mock_measure.return_value = {"gpu_free": 0}

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 18.3μs -> 8.06μs (126% faster)


class TestLargeScale:
    """Test suite for performance and scalability with large data samples."""

    def test_repeated_calls_detailed_false(self):
        """
        Test that repeated calls with detailed=False remain fast and don't accumulate memory.
        Performs 100 calls to ensure no performance degradation.
        """
        for i in range(100):
            codeflash_output = measure_memory_for_eviction(detailed=False)
            result = codeflash_output  # 26.5μs -> 26.8μs (1.15% slower)

    def test_repeated_calls_detailed_true(self):
        """
        Test that repeated calls with detailed=True work consistently.
        Performs 10 calls (limited due to GPU sync overhead).
        """
        results = []
        for i in range(10):
            codeflash_output = measure_memory_for_eviction(detailed=True)
            result = codeflash_output  # 1.23ms -> 34.4μs (3490% faster)
            results.append(result)

        # All results should be None or MemorySnapshot
        for result in results:
            pass

    def test_concurrent_behavior_simulation(self):
        """
        Test that calling the function in sequence (simulating concurrent usage)
        works correctly. Uses 50 calls alternating between detailed=True and False.
        """
        for i in range(50):
            detailed = i % 2 == 0
            codeflash_output = measure_memory_for_eviction(detailed=detailed)
            result = codeflash_output  # 2.95ms -> 84.3μs (3395% faster)

            if detailed:
                pass
            else:
                pass

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_large_gpu_free_values(self, mock_measure):
        """
        Test handling of very large GPU memory values (e.g., for 80GB GPUs).
        """
        # 80GB GPU with 50GB free
        large_gpu_free = 80 * 1024 * 1024 * 1024
        mock_measure.return_value = {"gpu_free": large_gpu_free}

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 20.6μs -> 8.46μs (143% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_small_gpu_free_values(self, mock_measure):
        """
        Test handling of very small GPU memory values.
        """
        # 1KB free (extreme low memory condition)
        small_gpu_free = 1024
        mock_measure.return_value = {"gpu_free": small_gpu_free}

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 19.4μs -> 8.24μs (136% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_multiple_measurement_states(self, mock_measure):
        """
        Test with multiple different state dictionaries to simulate state changes.
        """
        states = [
            {"gpu_free": 1000000},
            {"gpu_free": 2000000},
            {"gpu_free": 3000000},
            {"gpu_free": 4000000},
        ]

        for state in states:
            mock_measure.return_value = state
            codeflash_output = measure_memory_for_eviction(detailed=True)
            result = codeflash_output  # 46.5μs -> 18.2μs (155% faster)

    def test_parameter_variations_100_calls(self):
        """
        Test 100 rapid calls with varying detailed parameter values.
        """
        call_count = 0
        none_count = 0
        snapshot_count = 0

        for i in range(100):
            detailed = i % 10 == 0  # Vary the parameter
            codeflash_output = measure_memory_for_eviction(detailed=detailed)
            result = codeflash_output  # 1.29ms -> 63.4μs (1928% faster)
            call_count += 1

            if result is None:
                none_count += 1
            elif isinstance(result, MemorySnapshot):
                snapshot_count += 1


class TestIntegration:
    """Integration tests combining multiple aspects of the function."""

    def test_detailed_false_then_true(self):
        """
        Test sequential calls with different detailed values.
        """
        codeflash_output = measure_memory_for_eviction(detailed=False)
        result1 = codeflash_output  # 721ns -> 731ns (1.37% slower)

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result2 = codeflash_output  # 145μs -> 7.11μs (1946% faster)

    def test_detailed_true_then_false(self):
        """
        Test sequential calls in opposite order.
        """
        codeflash_output = measure_memory_for_eviction(detailed=True)
        result1 = codeflash_output  # 143μs -> 7.32μs (1864% faster)

        codeflash_output = measure_memory_for_eviction(detailed=False)
        result2 = codeflash_output  # 631ns -> 381ns (65.6% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_state_dict_with_all_possible_keys(self, mock_measure):
        """
        Test with a complete state dictionary containing all possible keys.
        """
        complete_state = {
            "gpu_allocated": 5000000,
            "gpu_reserved": 6000000,
            "gpu_free": 10000000,
            "gpu_total": 16000000000,
            "process_rss_bytes": 2000000000,
            "system_available_bytes": 8000000000,
        }
        mock_measure.return_value = complete_state

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 20.1μs -> 8.26μs (143% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_state_dict_with_partial_keys(self, mock_measure):
        """
        Test with a state dictionary containing only gpu_free.
        """
        partial_state = {"gpu_free": 5000000}
        mock_measure.return_value = partial_state

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 19.0μs -> 8.10μs (135% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To test or edit this optimization locally, run: git merge codeflash/optimize-pr1930-2026-02-02T20.10.49

Suggested change

Before:
    state = _measure_detailed_state()
    return MemorySnapshot(
        gpu_free=state.get("gpu_free"),
    )

After:
    gpu_free = None
    try:
        import torch

        if torch.cuda.is_available():
            gpu_free, _ = torch.cuda.mem_get_info()
    except ImportError:
        pass
    return MemorySnapshot(gpu_free=gpu_free)


    api_key=self.api_key,
    allow_untrusted_packages=ALLOW_INFERENCE_EXP_UNTRUSTED_MODELS,
    allow_direct_local_storage_loading=False,
    model_access_manager=model_access_manager,
@hansent (Collaborator, Author) commented:

@PawelPeczek-Roboflow I see that, as part of 1.0rc, there are many new adapters.

I can make a similar change in all of those for 1.0, unless you have a better idea?

@grzegorz-roboflow (Collaborator) left a comment:

first batch of comments

    clear_request_context,
    set_request_context,
    structured_event_logger,
)
@grzegorz-roboflow (Collaborator) commented:

dispatch is executed on each request; I would vote for moving these imports out of this method.

@hansent (Collaborator, Author) replied:

done

if GCP_SERVERLESS:
    app.add_middleware(GCPServerlessMiddleware)
if STRUCTURED_LOGGING_ENABLED:
    app.add_middleware(StructuredLoggingMiddleware)
@grzegorz-roboflow (Collaborator) commented:

Please add a comment here indicating that this middleware relies on CorrelationIdMiddleware having already executed.
Or maybe move this to where CorrelationIdMiddleware is set, maybe even add this middleware explicitly before CorrelationIdMiddleware (twice) so it's clear those two always go together, as sketched below.
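
A small sketch of the second suggestion, reusing the names from the snippet under review and assuming CorrelationIdMiddleware comes from the asgi-correlation-id package. With Starlette, the middleware added last is outermost and runs first on the request path, so adding CorrelationIdMiddleware after StructuredLoggingMiddleware guarantees the correlation ID exists before the request context is set up.

if GCP_SERVERLESS:
    app.add_middleware(GCPServerlessMiddleware)
if STRUCTURED_LOGGING_ENABLED:
    # Registered right next to CorrelationIdMiddleware so the ordering
    # dependency is visible in one place.
    app.add_middleware(StructuredLoggingMiddleware)
app.add_middleware(CorrelationIdMiddleware)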


logger.debug("Downloading model artifacts from Roboflow API")

download_start_time = time.time()
@grzegorz-roboflow (Collaborator) commented:

Looks like this is only used inside the if structured_event_logger.enabled: branch; maybe capturing it there would save us a few CPU cycles (see the sketch below).
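
A minimal sketch of that suggestion, reusing the names from the snippet under review:

# Only pay for the timestamp when the structured event will actually be emitted.
download_start_time = None
if structured_event_logger.enabled:
    download_start_time = time.time()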

@hansent (Collaborator, Author) commented Feb 16, 2026

Holding this until after the 1.0 release is done, since we'll need to update some of the hooks to log model load times for the new adapters.
