Conversation
Introduces a separate structured logging system for GCP serverless deployments that captures model lifecycle events to understand cache hit rates, model thrashing, and memory usage patterns.

New module: inference/core/gcp_logging/
- events.py: 7 event dataclasses (request_received, workflow_request_received, model_cache_status, model_loaded_to_disk, model_loaded_to_memory, model_evicted, inference_completed)
- context.py: Request context tracking via ContextVar
- memory.py: Tiered memory measurement (basic vs detailed mode)
- logger.py: GCPServerlessLogger singleton with sampling support

Environment variables:
- GCP_LOGGING_ENABLED: Master switch (default: True when GCP_SERVERLESS)
- GCP_LOGGING_SAMPLE_RATE: Sample rate for high-volume events (default: 1.0)
- GCP_LOGGING_DETAILED_MEMORY: Enable detailed memory introspection (default: False)

Instrumentation points:
- HTTP middleware: Sets up request context with correlation ID
- Workflow executor: Adds step context for workflow invocations
- ModelManager: Logs cache status, model loads, and inference completion
- Fixed size cache: Logs eviction events with lifetime and inference count
- Roboflow model: Logs disk download events

Features:
- Tiered sampling: High-volume events sampled, lifecycle events always logged
- Tiered memory: Basic mode (cheap) vs detailed mode (GPU sync + syscalls)
- Context propagation: request_id, workflow_instance_id, step_name flow through
- Cache hit tracking: inference_completed includes whether model was pre-loaded

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
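To make the moving parts above concrete, here is a minimal sketch of the event/logger split. The event fields, the emit() signature, and the logger internals are assumptions for illustration only, not the actual contents of events.py or logger.py.

```python
import json
import logging
import random
import time
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ModelCacheStatusEvent:
    # one of the seven lifecycle event dataclasses listed above; fields are assumed
    model_id: Optional[str] = None
    cache_hit: Optional[bool] = None
    event_type: str = "model_cache_status"
    timestamp: float = field(default_factory=time.time)


class GCPServerlessLogger:
    """Emits events as structured JSON lines, sampling only high-volume events."""

    def __init__(self, enabled: bool = True, sample_rate: float = 1.0):
        self.enabled = enabled
        self.sample_rate = sample_rate
        self._logger = logging.getLogger("gcp_serverless_events")

    def emit(self, event, high_volume: bool = False) -> None:
        if not self.enabled:
            return
        # lifecycle events are always logged; high-volume events are sampled
        if high_volume and random.random() > self.sample_rate:
            return
        self._logger.info(json.dumps(asdict(event)))


# hypothetical usage with a 10% sample rate for high-volume events
logger = GCPServerlessLogger(sample_rate=0.1)
logger.emit(ModelCacheStatusEvent(model_id="yolov8n-640", cache_hit=True), high_volume=True)
```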
When USE_INFERENCE_EXP_MODELS is enabled, the model is loaded through
the InferenceExpObjectDetectionModelAdapter which wraps the inference_models
AutoModel. This change detects that adapter pattern and:
- Sets backend to "inference-models" or "inference-models/{backend}"
- Extracts device info from the underlying exp_model
This ensures ModelLoadedToMemoryEvent correctly identifies models loaded
through the inference-models path vs the legacy ONNX path.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
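As a rough illustration of the detection logic this commit describes, something along these lines; the exp_model, backend, and device attribute names are assumptions, not the adapter's real API.

```python
def describe_model_backend(model) -> dict:
    """Derive backend/device labels for ModelLoadedToMemoryEvent (illustrative only)."""
    exp_model = getattr(model, "exp_model", None)
    if exp_model is not None:
        # model came in through the inference-models adapter path
        inner_backend = getattr(exp_model, "backend", None)
        backend = f"inference-models/{inner_backend}" if inner_backend else "inference-models"
        device = str(getattr(exp_model, "device", "unknown"))
    else:
        # legacy ONNX path
        backend = "onnx"
        device = "unknown"
    return {"backend": backend, "device": device}
```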
When USE_INFERENCE_EXP_MODELS is enabled, models are loaded via the inference-models library's AutoModel.from_pretrained(). This path bypasses the legacy download code, so ModelLoadedToDiskEvent wasn't being logged.

This adds GCPLoggingModelAccessManager - a custom ModelAccessManager that hooks into inference-models download callbacks:
- on_model_package_access_granted: Start tracking download
- on_file_created: Accumulate file sizes and count
- on_model_loaded: Log ModelLoadedToDiskEvent with totals

The adapter only creates the access manager when GCP logging is enabled.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
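The three callback names come from the commit message above; everything else in the sketch below (signatures, constructor, and the emit helper) is an assumption for illustration - the real inference-models ModelAccessManager interface may differ.

```python
import os
import time


class GCPLoggingModelAccessManager:
    """Tracks download callbacks so ModelLoadedToDiskEvent can be logged with totals."""

    def __init__(self, event_logger):
        self._event_logger = event_logger
        self._download_start = None
        self._total_bytes = 0
        self._file_count = 0

    def on_model_package_access_granted(self, model_id: str) -> None:
        # start tracking the download for this model package
        self._download_start = time.time()

    def on_file_created(self, file_path: str) -> None:
        # accumulate file sizes and count as artifacts land on disk
        self._total_bytes += os.path.getsize(file_path)
        self._file_count += 1

    def on_model_loaded(self, model_id: str) -> None:
        duration = time.time() - (self._download_start or time.time())
        # hypothetical emit helper; the real event construction differs
        self._event_logger.emit_model_loaded_to_disk(
            model_id=model_id,
            file_count=self._file_count,
            total_bytes=self._total_bytes,
            download_duration_s=duration,
        )
```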
bugbot review
Cursor Bugbot has reviewed your changes and found 1 potential issue.
…ability (#1962)

* Rename GCP logging to generic structured logging

Rename gcp_logging module to structured_logging since the serverless deployment is no longer GCP-exclusive and this feature is useful in other contexts.

Changes:
- Rename directory gcp_logging -> structured_logging
- Rename env vars: GCP_LOGGING_* -> STRUCTURED_LOGGING_*
- Rename classes: GCPServerlessLogger -> StructuredLogger, GCPRequestContext -> RequestContext, BaseGCPEvent -> BaseEvent, GCPLoggingModelAccessManager -> LoggingModelAccessManager
- Rename functions: set/get/clear/update_gcp_context -> set/get/clear/update_request_context
- Rename singleton: gcp_logger -> structured_logger
- Update all imports and usages across consumer files
- Update documentation to reflect generic naming

https://claude.ai/code/session_01LXU2GBiYWHQM7Qi7hPYTPT

* Rename StructuredLogger to StructuredEventLogger for clarity

Make the naming more explicit by using StructuredEventLogger instead of just StructuredLogger, to clearly indicate this is an event logger.

- Rename class: StructuredLogger -> StructuredEventLogger
- Rename singleton: structured_logger -> structured_event_logger
- Update all imports and usages across consumer files

https://claude.ai/code/session_01LXU2GBiYWHQM7Qi7hPYTPT

---------

Co-authored-by: Claude <noreply@anthropic.com>
…leware (#1964)

Decouple the structured logging request context setup from GCPServerlessMiddleware so it runs whenever STRUCTURED_LOGGING_ENABLED=True, regardless of deployment environment.

- Create new StructuredLoggingMiddleware that sets up RequestContext with request_id (from correlation ID) and hashed api_key
- Keep GCPServerlessMiddleware focused on GCP-specific concerns only (execution_id header, processing time header)
- Register StructuredLoggingMiddleware when STRUCTURED_LOGGING_ENABLED is true

https://claude.ai/code/session_01XHjBRQV5q7Nyq7eNEiTAaH

Co-authored-by: Claude <noreply@anthropic.com>
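A minimal sketch of what such a middleware could look like, assuming Starlette's BaseHTTPMiddleware; the header name, the set_request_context keyword arguments, the import path, and the API-key hashing are illustrative assumptions rather than the actual implementation.

```python
import hashlib

from starlette.middleware.base import BaseHTTPMiddleware

from inference.core.structured_logging.context import clear_request_context, set_request_context


class StructuredLoggingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # the correlation ID header is assumed to be populated by CorrelationIdMiddleware upstream
        request_id = request.headers.get("x-request-id")
        api_key = request.query_params.get("api_key")
        hashed_key = hashlib.sha256(api_key.encode()).hexdigest() if api_key else None
        set_request_context(request_id=request_id, api_key_hash=hashed_key)
        try:
            return await call_next(request)
        finally:
            # clear the ContextVar so state never leaks between requests
            clear_request_context()
```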
bugbot review

Skipping Bugbot: Unable to authenticate your request. Please make sure Bugbot is properly installed and configured for this repository.
```python
state = _measure_detailed_state()
return MemorySnapshot(
    gpu_free=state.get("gpu_free"),
)
```
⚡️Codeflash found 1,813% (18.13x) speedup for measure_memory_for_eviction in inference/core/structured_logging/memory.py
⏱️ Runtime : 8.28 milliseconds → 433 microseconds (best of 89 runs)
📝 Explanation and details
The optimized code achieves a 1,812% speedup by eliminating unnecessary work in measure_memory_for_eviction() when detailed=True.
Key Optimization
Inlined the critical path: Instead of calling _measure_detailed_state() which collects comprehensive GPU metrics (memory_allocated(), memory_reserved(), gpu_free, gpu_total) and system memory info via psutil, the optimized version directly captures only gpu_free - the single field actually used by the caller.
What Changed
- Removed intermediate function call: Eliminated the call to `_measure_detailed_state()` and its associated function call overhead
- Reduced GPU synchronization: Cut GPU memory queries from 4 operations (`mem_get_info()`, `memory_allocated()`, `memory_reserved()`, plus total from `mem_get_info()`) down to just 1 (`mem_get_info()` for `gpu_free`)
- Eliminated psutil syscalls: Removed `Process().memory_info().rss` and `virtual_memory().available` calls entirely
- Simplified data flow: Directly return `MemorySnapshot(gpu_free=gpu_free)` instead of building an intermediate dictionary and extracting from it
Performance Impact
Based on the line profiler, the original code spent:
- ~3.55s importing torch (99% of time in `_measure_detailed_state`)
- ~37ms importing psutil
- ~17ms on psutil memory queries
The optimized version eliminates all psutil operations and reduces GPU memory queries by 75%, keeping only the essential gpu_free measurement.
Context from Function References
The function is called during model eviction in `WithFixedSizeCache.add_model()` - a hot path executed when models are swapped in/out of memory. The optimization is particularly valuable here because:
- Model eviction can happen frequently under memory pressure or capacity constraints
- The function is called in a tight loop (`for _ in range(3)`) during eviction
- Only `gpu_free` is actually logged in the `ModelEvictedEvent`, making the other measurements pure overhead (see the sketch below)
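To make that call site concrete, an illustrative fragment; the ModelEvictedEvent constructor fields and the logger's emit() method are assumed, not taken from the codebase.

```python
from inference.core.structured_logging.events import ModelEvictedEvent
from inference.core.structured_logging.memory import measure_memory_for_eviction


def _log_eviction(event_logger, evicted_model_id: str, detailed: bool) -> None:
    # runs inside the eviction loop in WithFixedSizeCache.add_model()
    snapshot = measure_memory_for_eviction(detailed=detailed)
    event_logger.emit(
        ModelEvictedEvent(
            model_id=evicted_model_id,
            gpu_free=snapshot.gpu_free if snapshot else None,  # the only field consumed
        )
    )
```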
Test Results Analysis
The optimization excels across all scenarios:
- Fast path preserved: `detailed=False` remains unchanged (no regression)
- Biggest gains in realistic scenarios: Tests with actual torch/psutil mocking show 100-33000% speedups
- Scales well: Repeated calls test shows 3490% speedup, confirming the optimization benefits accumulate in the eviction loop scenario
- Memory-constrained environments: When torch is available but psutil is missing, the speedup is dramatic (33243%) since it was doing unnecessary work before
✅ Correctness verification report:
| Test | Status |
|---|---|
| ⏪ Replay Tests | 🔘 None Found |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 309 Passed |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests
import sys
import types
from types import ModuleType, SimpleNamespace
from typing import Optional
from unittest.mock import Mock
# imports
import pytest # used for our unit tests
# function to test
from inference.core.structured_logging.events import MemorySnapshot
from inference.core.structured_logging.memory import measure_memory_for_eviction
# unit tests
# Helper utilities to construct fake external modules (torch, psutil) for deterministic testing.
def _make_fake_torch(
mem_info_sequence=None, is_available=True, alloc_value=0, reserved_value=0
):
"""
Create a fake 'torch' module with a cuda submodule that provides:
- is_available(): returns the provided is_available boolean
- mem_get_info(): returns a tuple (gpu_free, gpu_total). If mem_info_sequence is provided,
values are popped from it for each call to simulate changing GPU free values.
- memory_allocated() and memory_reserved(): return fixed values for determinism.
"""
torch_mod = ModuleType("torch")
cuda_mod = ModuleType("torch.cuda")
# make mem_info_sequence mutable list if provided
if mem_info_sequence is not None:
seq = list(mem_info_sequence)
def mem_get_info():
# pop from front to simulate changing readings across calls
if not seq:
# fallback deterministic value if exhausted
return (0, 0)
return (seq.pop(0), 999999)
else:
def mem_get_info():
return (123456, 789012) # default deterministic tuple
def is_available_fn():
return bool(is_available)
def memory_allocated_fn():
return int(alloc_value)
def memory_reserved_fn():
return int(reserved_value)
# attach functions to cuda submodule
cuda_mod.mem_get_info = mem_get_info
cuda_mod.is_available = is_available_fn
cuda_mod.memory_allocated = memory_allocated_fn
cuda_mod.memory_reserved = memory_reserved_fn
# attach cuda submodule to torch module
torch_mod.cuda = cuda_mod
return torch_mod
def _make_fake_psutil(rss_value=0, available_value=0):
"""
Create a fake 'psutil' module with:
- Process() returning an object whose memory_info().rss equals rss_value
- virtual_memory() returning an object with attribute available == available_value
"""
psutil_mod = ModuleType("psutil")
class _Proc:
def memory_info(self_inner):
# return a simple object with rss attribute
return SimpleNamespace(rss=int(rss_value))
def Process():
return _Proc()
def virtual_memory():
return SimpleNamespace(available=int(available_value))
psutil_mod.Process = Process
psutil_mod.virtual_memory = virtual_memory
return psutil_mod
def test_returns_none_when_not_detailed():
# Call with detailed=False should be a fast path returning None
codeflash_output = measure_memory_for_eviction(detailed=False)
result = codeflash_output # 771ns -> 782ns (1.41% slower)
def test_detailed_with_fake_torch_and_psutil_returns_expected_gpu_free(monkeypatch):
# Prepare deterministic fake modules
fake_torch = _make_fake_torch(
mem_info_sequence=[4444], is_available=True, alloc_value=111, reserved_value=222
)
fake_psutil = _make_fake_psutil(rss_value=555, available_value=666)
# Inject into sys.modules so import inside _measure_detailed_state picks them up
monkeypatch.setitem(sys.modules, "torch", fake_torch)
monkeypatch.setitem(sys.modules, "psutil", fake_psutil)
# Call measured function with detailed=True; should reflect our fake GPU free value
codeflash_output = measure_memory_for_eviction(detailed=True)
snapshot = codeflash_output # 11.4μs -> 5.29μs (116% faster)
def test_torch_present_but_cuda_not_available_psutil_present(monkeypatch):
# Fake torch with cuda.is_available returning False
fake_torch = _make_fake_torch(is_available=False)
fake_psutil = _make_fake_psutil(rss_value=1000, available_value=2000)
monkeypatch.setitem(sys.modules, "torch", fake_torch)
monkeypatch.setitem(sys.modules, "psutil", fake_psutil)
# When CUDA is not available, GPU keys should not be added; snapshot.gpu_free is None
codeflash_output = measure_memory_for_eviction(detailed=True)
snapshot = codeflash_output # 6.30μs -> 3.07μs (106% faster)
def test_torch_available_psutil_missing(monkeypatch):
fake_torch = _make_fake_torch(mem_info_sequence=[7777], is_available=True)
monkeypatch.setitem(sys.modules, "torch", fake_torch)
# Ensure psutil import raises ImportError by removing any real psutil
monkeypatch.delitem(sys.modules, "psutil", raising=False)
codeflash_output = measure_memory_for_eviction(detailed=True)
snapshot = codeflash_output # 1.25ms -> 3.74μs (33243% faster)
def test_large_scale_repeated_calls_are_consistent(monkeypatch):
# Prepare a sequence of GPU free values for multiple calls (below 1000 elements as required)
values = list(range(1000, 1010)) # 10 calls, small and deterministic
fake_torch = _make_fake_torch(mem_info_sequence=list(values), is_available=True)
fake_psutil = _make_fake_psutil(rss_value=123, available_value=456)
monkeypatch.setitem(sys.modules, "torch", fake_torch)
monkeypatch.setitem(sys.modules, "psutil", fake_psutil)
# Call measure_memory_for_eviction multiple times and ensure we get the sequence back in order
observed = []
for _ in range(len(values)):
codeflash_output = measure_memory_for_eviction(detailed=True)
snap = codeflash_output # 41.7μs -> 16.3μs (156% faster)
observed.append(snap.gpu_free)
def test_mem_get_info_returns_non_int_types_are_passed_through(monkeypatch):
# mem_get_info returns a tuple where gpu_free is a string (to simulate unexpected but possible API shape)
def mem_get_info_str():
return ("9999_as_str", "total_dummy")
fake_torch = ModuleType("torch")
cuda_mod = ModuleType("torch.cuda")
cuda_mod.is_available = lambda: True
cuda_mod.mem_get_info = mem_get_info_str
cuda_mod.memory_allocated = lambda: 0
cuda_mod.memory_reserved = lambda: 0
fake_torch.cuda = cuda_mod
fake_psutil = _make_fake_psutil(rss_value=1, available_value=2)
monkeypatch.setitem(sys.modules, "torch", fake_torch)
monkeypatch.setitem(sys.modules, "psutil", fake_psutil)
codeflash_output = measure_memory_for_eviction(detailed=True)
snapshot = codeflash_output # 7.73μs -> 3.10μs (150% faster)
def test_repeated_calls_with_stable_fake_modules_are_idempotent(monkeypatch):
fake_torch = _make_fake_torch(
mem_info_sequence=None, is_available=True, alloc_value=10, reserved_value=20
)
fake_psutil = _make_fake_psutil(rss_value=321, available_value=654)
monkeypatch.setitem(sys.modules, "torch", fake_torch)
monkeypatch.setitem(sys.modules, "psutil", fake_psutil)
# Call twice and ensure same values are returned
codeflash_output = measure_memory_for_eviction(detailed=True)
s1 = codeflash_output # 7.75μs -> 3.27μs (137% faster)
codeflash_output = measure_memory_for_eviction(detailed=True)
s2 = codeflash_output # 4.45μs -> 1.56μs (185% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from unittest.mock import MagicMock, patch
# imports
import pytest
from inference.core.structured_logging.events import MemorySnapshot
from inference.core.structured_logging.memory import measure_memory_for_eviction
class TestBasicFunctionality:
"""Test suite for basic functionality of measure_memory_for_eviction."""
def test_measure_memory_detailed_false_returns_none(self):
"""
Test that when detailed=False, the function returns None.
This is the default behavior and should not perform any memory measurement.
"""
codeflash_output = measure_memory_for_eviction(detailed=False)
result = codeflash_output # 792ns -> 811ns (2.34% slower)
def test_measure_memory_detailed_false_is_default(self):
"""
Test that detailed=False is the default parameter value.
Calling without arguments should return None.
"""
codeflash_output = measure_memory_for_eviction()
result = codeflash_output # 521ns -> 521ns (0.000% faster)
def test_measure_memory_detailed_true_returns_memory_snapshot(self):
"""
Test that when detailed=True, the function returns a MemorySnapshot object.
This requires actual memory measurement capability.
"""
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 165μs -> 11.0μs (1396% faster)
def test_return_type_is_memory_snapshot_or_none(self):
"""
Test that the return type is always either MemorySnapshot or None.
"""
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 149μs -> 8.04μs (1753% faster)
def test_memory_snapshot_has_gpu_free_attribute(self):
"""
Test that MemorySnapshot contains gpu_free attribute when available.
"""
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 147μs -> 7.49μs (1870% faster)
if result is not None:
pass
class TestEdgeCases:
"""Test suite for edge cases and unusual conditions."""
def test_multiple_calls_consistent_behavior(self):
"""
Test that multiple calls with detailed=False always return None.
"""
for _ in range(5):
codeflash_output = measure_memory_for_eviction(detailed=False)
result = codeflash_output # 1.83μs -> 1.86μs (1.61% slower)
def test_boolean_true_explicitly_passed(self):
"""
Test that explicitly passing True for detailed parameter works correctly.
"""
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 148μs -> 8.04μs (1747% faster)
def test_boolean_false_explicitly_passed(self):
"""
Test that explicitly passing False for detailed parameter returns None.
"""
codeflash_output = measure_memory_for_eviction(detailed=False)
result = codeflash_output # 721ns -> 721ns (0.000% faster)
def test_detailed_parameter_type_coercion(self):
"""
Test that truthy/falsy values work as detailed parameter.
0 should be falsy (like False), 1 should be truthy (like True).
"""
# Test with 0 (falsy)
codeflash_output = measure_memory_for_eviction(detailed=0)
result = codeflash_output # 742ns -> 751ns (1.20% slower)
# Test with 1 (truthy)
codeflash_output = measure_memory_for_eviction(detailed=1)
result = codeflash_output # 149μs -> 7.61μs (1858% faster)
def test_gpu_free_value_when_torch_unavailable(self):
"""
Test behavior when torch is not available.
The function should still work but gpu_free may be None.
"""
with patch("torch.cuda.is_available", side_effect=ImportError):
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 173μs -> 30.6μs (465% faster)
def test_gpu_free_value_when_psutil_unavailable(self):
"""
Test behavior when psutil is not available.
The function should still work without system memory info.
"""
with patch("psutil.Process", side_effect=ImportError):
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 29.8μs -> 9.35μs (219% faster)
@patch("inference.core.structured_logging.memory._measure_detailed_state")
def test_memory_snapshot_created_with_gpu_free_value(self, mock_measure):
"""
Test that MemorySnapshot is properly created with gpu_free from state dict.
"""
# Mock the detailed state to return known values
mock_measure.return_value = {
"gpu_free": 12345678,
"gpu_total": 16000000000,
"gpu_allocated": 3654322,
"gpu_reserved": 3900000,
}
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 19.5μs -> 8.44μs (131% faster)
@patch("inference.core.structured_logging.memory._measure_detailed_state")
def test_memory_snapshot_with_empty_state(self, mock_measure):
"""
Test that MemorySnapshot handles empty state dict gracefully.
"""
# Mock the detailed state to return empty dict
mock_measure.return_value = {}
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 19.0μs -> 8.30μs (129% faster)
@patch("inference.core.structured_logging.memory._measure_detailed_state")
def test_memory_snapshot_with_none_gpu_free(self, mock_measure):
"""
Test that MemorySnapshot handles None gpu_free value correctly.
"""
mock_measure.return_value = {"gpu_free": None}
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 18.7μs -> 8.35μs (124% faster)
@patch("inference.core.structured_logging.memory._measure_detailed_state")
def test_memory_snapshot_with_zero_gpu_free(self, mock_measure):
"""
Test that MemorySnapshot handles zero gpu_free value (out of memory condition).
"""
mock_measure.return_value = {"gpu_free": 0}
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 18.3μs -> 8.06μs (126% faster)
class TestLargeScale:
"""Test suite for performance and scalability with large data samples."""
def test_repeated_calls_detailed_false(self):
"""
Test that repeated calls with detailed=False remain fast and don't accumulate memory.
Performs 100 calls to ensure no performance degradation.
"""
for i in range(100):
codeflash_output = measure_memory_for_eviction(detailed=False)
result = codeflash_output # 26.5μs -> 26.8μs (1.15% slower)
def test_repeated_calls_detailed_true(self):
"""
Test that repeated calls with detailed=True work consistently.
Performs 10 calls (limited due to GPU sync overhead).
"""
results = []
for i in range(10):
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 1.23ms -> 34.4μs (3490% faster)
results.append(result)
# All results should be None or MemorySnapshot
for result in results:
pass
def test_concurrent_behavior_simulation(self):
"""
Test that calling the function in sequence (simulating concurrent usage)
works correctly. Uses 50 calls alternating between detailed=True and False.
"""
for i in range(50):
detailed = i % 2 == 0
codeflash_output = measure_memory_for_eviction(detailed=detailed)
result = codeflash_output # 2.95ms -> 84.3μs (3395% faster)
if detailed:
pass
else:
pass
@patch("inference.core.structured_logging.memory._measure_detailed_state")
def test_large_gpu_free_values(self, mock_measure):
"""
Test handling of very large GPU memory values (e.g., for 80GB GPUs).
"""
# 80GB GPU with 50GB free
large_gpu_free = 80 * 1024 * 1024 * 1024
mock_measure.return_value = {"gpu_free": large_gpu_free}
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 20.6μs -> 8.46μs (143% faster)
@patch("inference.core.structured_logging.memory._measure_detailed_state")
def test_small_gpu_free_values(self, mock_measure):
"""
Test handling of very small GPU memory values.
"""
# 1KB free (extreme low memory condition)
small_gpu_free = 1024
mock_measure.return_value = {"gpu_free": small_gpu_free}
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 19.4μs -> 8.24μs (136% faster)
@patch("inference.core.structured_logging.memory._measure_detailed_state")
def test_multiple_measurement_states(self, mock_measure):
"""
Test with multiple different state dictionaries to simulate state changes.
"""
states = [
{"gpu_free": 1000000},
{"gpu_free": 2000000},
{"gpu_free": 3000000},
{"gpu_free": 4000000},
]
for state in states:
mock_measure.return_value = state
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 46.5μs -> 18.2μs (155% faster)
def test_parameter_variations_100_calls(self):
"""
Test 100 rapid calls with varying detailed parameter values.
"""
call_count = 0
none_count = 0
snapshot_count = 0
for i in range(100):
detailed = i % 10 == 0 # Vary the parameter
codeflash_output = measure_memory_for_eviction(detailed=detailed)
result = codeflash_output # 1.29ms -> 63.4μs (1928% faster)
call_count += 1
if result is None:
none_count += 1
elif isinstance(result, MemorySnapshot):
snapshot_count += 1
class TestIntegration:
"""Integration tests combining multiple aspects of the function."""
def test_detailed_false_then_true(self):
"""
Test sequential calls with different detailed values.
"""
codeflash_output = measure_memory_for_eviction(detailed=False)
result1 = codeflash_output # 721ns -> 731ns (1.37% slower)
codeflash_output = measure_memory_for_eviction(detailed=True)
result2 = codeflash_output # 145μs -> 7.11μs (1946% faster)
def test_detailed_true_then_false(self):
"""
Test sequential calls in opposite order.
"""
codeflash_output = measure_memory_for_eviction(detailed=True)
result1 = codeflash_output # 143μs -> 7.32μs (1864% faster)
codeflash_output = measure_memory_for_eviction(detailed=False)
result2 = codeflash_output # 631ns -> 381ns (65.6% faster)
@patch("inference.core.structured_logging.memory._measure_detailed_state")
def test_state_dict_with_all_possible_keys(self, mock_measure):
"""
Test with a complete state dictionary containing all possible keys.
"""
complete_state = {
"gpu_allocated": 5000000,
"gpu_reserved": 6000000,
"gpu_free": 10000000,
"gpu_total": 16000000000,
"process_rss_bytes": 2000000000,
"system_available_bytes": 8000000000,
}
mock_measure.return_value = complete_state
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 20.1μs -> 8.26μs (143% faster)
@patch("inference.core.structured_logging.memory._measure_detailed_state")
def test_state_dict_with_partial_keys(self, mock_measure):
"""
Test with a state dictionary containing only gpu_free.
"""
partial_state = {"gpu_free": 5000000}
mock_measure.return_value = partial_state
codeflash_output = measure_memory_for_eviction(detailed=True)
result = codeflash_output # 19.0μs -> 8.10μs (135% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To test or edit this optimization locally: git merge codeflash/optimize-pr1930-2026-02-02T20.10.49
```diff
-    state = _measure_detailed_state()
-    return MemorySnapshot(
-        gpu_free=state.get("gpu_free"),
-    )
+    gpu_free = None
+    try:
+        import torch
+
+        if torch.cuda.is_available():
+            gpu_free, _ = torch.cuda.mem_get_info()
+    except ImportError:
+        pass
+    return MemorySnapshot(gpu_free=gpu_free)
```
```python
    api_key=self.api_key,
    allow_untrusted_packages=ALLOW_INFERENCE_EXP_UNTRUSTED_MODELS,
    allow_direct_local_storage_loading=False,
    model_access_manager=model_access_manager,
```
@PawelPeczek-Roboflow seeing that as part of 1.0rc there are many new adapters.
I can make the same change in all of those for 1.0, similar to here, unless you have a better idea?
grzegorz-roboflow left a comment
first batch of comments
```python
    clear_request_context,
    set_request_context,
    structured_event_logger,
)
```
dispatch is executed with each request; I would vote for moving those imports out of this method
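Roughly what that suggestion amounts to - hoisting the imports to module level so they are resolved once rather than on every dispatch() call (module paths assumed from the earlier description):

```python
from inference.core.structured_logging.context import (
    clear_request_context,
    set_request_context,
)
from inference.core.structured_logging.logger import structured_event_logger
```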
```python
if GCP_SERVERLESS:
    app.add_middleware(GCPServerlessMiddleware)
if STRUCTURED_LOGGING_ENABLED:
    app.add_middleware(StructuredLoggingMiddleware)
```
Please add a comment here indicating that this middleware relies on CorrelationIdMiddleware having already executed.
Or maybe move this to where CorrelationIdMiddleware is set, maybe even add this middleware explicitly before CorrelationIdMiddleware (twice) so it's clear those two always go together.
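For illustration, one way to make the coupling explicit when registering the middleware (assumes StructuredLoggingMiddleware and STRUCTURED_LOGGING_ENABLED are imported from the server's own modules, which are not shown here):

```python
from asgi_correlation_id import CorrelationIdMiddleware

# With Starlette/FastAPI, the middleware added last is outermost and runs first on
# each request, so adding CorrelationIdMiddleware after StructuredLoggingMiddleware
# guarantees the correlation ID exists before the request context is populated.
if STRUCTURED_LOGGING_ENABLED:
    app.add_middleware(StructuredLoggingMiddleware)
app.add_middleware(CorrelationIdMiddleware)
```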
```python
logger.debug("Downloading model artifacts from Roboflow API")

download_start_time = time.time()
```
looks like this is only used inside `if structured_event_logger.enabled:` - maybe capturing it there would save us a few CPU cycles
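Sketch of the suggested change (illustrative only; variable names follow the diff above):

```python
# only pay for time.time() when the structured event logger is actually enabled
download_start_time = None
if structured_event_logger.enabled:
    download_start_time = time.time()

# ... download model artifacts from the Roboflow API ...

if structured_event_logger.enabled and download_start_time is not None:
    download_duration = time.time() - download_start_time
```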
holding this until after the 1.0 release is done since we'll need to update some of the hooks to log model load times for new adapters
Summary

Adds a structured logging system that captures model lifecycle events for observability and analysis. This enables understanding cache hit rates, model thrashing patterns, and memory usage in production.

Key features:
- Tiered sampling: high-volume events are sampled, lifecycle events are always logged
- Tiered memory measurement: basic mode (cheap) vs detailed mode (GPU sync + syscalls)
- Context propagation: request_id, workflow_instance_id, and step_name flow through the request
- Cache hit tracking: inference_completed records whether the model was pre-loaded

Event types:
- request_received, workflow_request_received, model_cache_status, model_loaded_to_disk, model_loaded_to_memory, model_evicted, inference_completed

Configuration (env vars):
- GCP_LOGGING_ENABLED: master switch (default: True when GCP_SERVERLESS)
- GCP_LOGGING_SAMPLE_RATE: sample rate for high-volume events (default: 1.0)
- GCP_LOGGING_DETAILED_MEMORY: enable detailed memory introspection (default: False)

Test plan