
Structured Event logging #1930

Open

hansent wants to merge 16 commits into main from feature/gcp-serverless-structured-logging

Conversation

@hansent (Collaborator) commented Jan 20, 2026

Summary

Adds a structured logging system that captures model lifecycle events for observability and analysis. This enables understanding cache hit rates, model thrashing patterns, and memory usage in production.

Key features:

  • JSON structured logs compatible with GCP Cloud Logging
  • 7 event types covering the full model lifecycle
  • Tiered sampling to control log volume/costs (high-volume events sampled, lifecycle events always logged)
  • Tiered memory measurement (basic vs detailed mode)
  • Request context propagation (correlates events across workflow steps)
  • Support for both legacy model loading and inference-models paths

Event types:

  • request_received - Direct inference request (sampled)
  • workflow_request_received - Workflow execution started (sampled)
  • model_cache_status - Cache hit/miss per request (sampled)
  • model_loaded_to_disk - Model artifacts downloaded (always logged)
  • model_loaded_to_memory - Model loaded to GPU/CPU (always logged)
  • model_evicted - Model removed from cache (always logged)
  • inference_completed - Inference finished with timing (sampled)

Configuration (env vars):

  • STRUCTURED_LOGGING_ENABLED - Master switch (default: True when GCP_SERVERLESS)
  • STRUCTURED_LOGGING_SAMPLE_RATE - Sample rate for high-volume events (default: 1.0)
  • STRUCTURED_LOGGING_DETAILED_MEMORY - Enable full memory snapshots (default: False)
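
For illustration, here is a rough sketch of what a single structured log line could look like once emitted to stdout for Cloud Logging. Only the event type names and context fields above come from this PR; the remaining field names and values are assumptions.

import json

# Hypothetical example payload; field names beyond the documented event types
# and context fields (request_id, workflow_instance_id, step_name) are made up
# for illustration.
example_event = {
    "event_type": "model_loaded_to_memory",
    "request_id": "3f2a9c1e-0b7d-4f7e-9a21-d41c8e7b5a10",
    "workflow_instance_id": None,
    "step_name": None,
    "model_id": "my-project/3",
    "device": "cuda:0",
    "load_time_seconds": 2.41,
}
print(json.dumps(example_event))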

Test plan

  • Run server locally with STRUCTURED_LOGGING_ENABLED=true and verify JSON logs appear in stdout
  • Make direct inference request → verify request_received, model_cache_status, model_loaded_to_memory, inference_completed events
  • Make second request to same model → verify cache_hit: true in events
  • Execute workflow → verify workflow_request_received and events include workflow_instance_id, step_name
  • Test with USE_INFERENCE_EXP_MODELS=true → verify model_loaded_to_disk event logged via callback
  • Test with STRUCTURED_LOGGING_DETAILED_MEMORY=true → verify memory snapshot fields populated
  • Test with STRUCTURED_LOGGING_SAMPLE_RATE=0.1 → verify ~10% of high-volume events logged

hansent and others added 3 commits January 20, 2026 13:39
Introduces a separate structured logging system for GCP serverless deployments
that captures model lifecycle events to understand cache hit rates, model
thrashing, and memory usage patterns.

New module: inference/core/gcp_logging/
- events.py: 7 event dataclasses (request_received, workflow_request_received,
  model_cache_status, model_loaded_to_disk, model_loaded_to_memory,
  model_evicted, inference_completed)
- context.py: Request context tracking via ContextVar
- memory.py: Tiered memory measurement (basic vs detailed mode)
- logger.py: GCPServerlessLogger singleton with sampling support

Environment variables:
- GCP_LOGGING_ENABLED: Master switch (default: True when GCP_SERVERLESS)
- GCP_LOGGING_SAMPLE_RATE: Sample rate for high-volume events (default: 1.0)
- GCP_LOGGING_DETAILED_MEMORY: Enable detailed memory introspection (default: False)

Instrumentation points:
- HTTP middleware: Sets up request context with correlation ID
- Workflow executor: Adds step context for workflow invocations
- ModelManager: Logs cache status, model loads, and inference completion
- Fixed size cache: Logs eviction events with lifetime and inference count
- Roboflow model: Logs disk download events

Features:
- Tiered sampling: High-volume events sampled, lifecycle events always logged
- Tiered memory: Basic mode (cheap) vs detailed mode (GPU sync + syscalls)
- Context propagation: request_id, workflow_instance_id, step_name flow through
- Cache hit tracking: inference_completed includes whether model was pre-loaded

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
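
To make the tiered sampling described above concrete, here is a minimal sketch (not the actual logger.py implementation; the class name and emit signature are illustrative): lifecycle events bypass sampling, while high-volume events are kept with probability equal to the sample rate.

import json
import logging
import random


class StructuredEventLoggerSketch:
    # Illustrative only; the real logger class in this PR may differ in
    # naming, structure, and output formatting.
    def __init__(self, enabled: bool, sample_rate: float = 1.0):
        self.enabled = enabled
        self.sample_rate = sample_rate
        self._log = logging.getLogger("structured_events")

    def emit(self, event: dict, sampled: bool = False) -> None:
        if not self.enabled:
            return
        if sampled and random.random() > self.sample_rate:
            return  # high-volume event dropped according to the sample rate
        # Lifecycle events (sampled=False) always reach this line.
        self._log.info(json.dumps(event))
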
When USE_INFERENCE_EXP_MODELS is enabled, the model is loaded through
the InferenceExpObjectDetectionModelAdapter which wraps the inference_models
AutoModel. This change detects that adapter pattern and:

- Sets backend to "inference-models" or "inference-models/{backend}"
- Extracts device info from the underlying exp_model

This ensures ModelLoadedToMemoryEvent correctly identifies models loaded
through the inference-models path vs the legacy ONNX path.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
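
A hedged sketch of the detection described in this commit; the adapter attribute names (exp_model, backend, device) are assumptions used purely for illustration, not the adapter's real API.

def resolve_backend_and_device(model):
    # If the model is wrapped by the inference-models adapter, report the
    # inference-models backend; otherwise fall back to the legacy ONNX path.
    exp_model = getattr(model, "exp_model", None)  # hypothetical attribute name
    if exp_model is None:
        return "onnx", None
    backend = getattr(exp_model, "backend", None)
    device = getattr(exp_model, "device", None)
    name = f"inference-models/{backend}" if backend else "inference-models"
    return name, str(device) if device is not None else None
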
When USE_INFERENCE_EXP_MODELS is enabled, models are loaded via the
inference-models library's AutoModel.from_pretrained(). This path
bypasses the legacy download code, so ModelLoadedToDiskEvent wasn't
being logged.

This adds GCPLoggingModelAccessManager - a custom ModelAccessManager
that hooks into inference-models download callbacks:
- on_model_package_access_granted: Start tracking download
- on_file_created: Accumulate file sizes and count
- on_model_loaded: Log ModelLoadedToDiskEvent with totals

The adapter only creates the access manager when GCP logging is enabled.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
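
Roughly, the callback-based access manager described above could look like the following sketch. Only the three hook names come from the commit message; the method signatures, the accumulated fields, and the emit call are assumptions.

import time


class LoggingModelAccessManagerSketch:
    # Sketch only; the real class plugs into inference-models' ModelAccessManager
    # interface, whose exact signatures are not shown here.
    def __init__(self, logger, model_id):
        self._logger = logger
        self._model_id = model_id
        self._download_start = None
        self._file_count = 0
        self._total_bytes = 0

    def on_model_package_access_granted(self, *args, **kwargs):
        # Start tracking the download window.
        self._download_start = time.time()

    def on_file_created(self, path, size_bytes=0, **kwargs):
        # Accumulate file sizes and counts as artifacts land on disk.
        self._file_count += 1
        self._total_bytes += int(size_bytes)

    def on_model_loaded(self, *args, **kwargs):
        # Emit ModelLoadedToDiskEvent-style data with the accumulated totals.
        elapsed = time.time() - self._download_start if self._download_start else None
        self._logger.emit(
            {
                "event_type": "model_loaded_to_disk",
                "model_id": self._model_id,
                "file_count": self._file_count,
                "total_bytes": self._total_bytes,
                "download_seconds": elapsed,
            },
            sampled=False,  # lifecycle events are always logged
        )
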
@grzegorz-roboflow (Collaborator) commented:

bugbot review

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

hansent and others added 3 commits February 2, 2026 09:09
…ability (#1962)

* Rename GCP logging to generic structured logging

Rename gcp_logging module to structured_logging since the serverless
deployment is no longer GCP-exclusive and this feature is useful in
other contexts.

Changes:
- Rename directory gcp_logging -> structured_logging
- Rename env vars: GCP_LOGGING_* -> STRUCTURED_LOGGING_*
- Rename classes: GCPServerlessLogger -> StructuredLogger,
  GCPRequestContext -> RequestContext, BaseGCPEvent -> BaseEvent,
  GCPLoggingModelAccessManager -> LoggingModelAccessManager
- Rename functions: set/get/clear/update_gcp_context ->
  set/get/clear/update_request_context
- Rename singleton: gcp_logger -> structured_logger
- Update all imports and usages across consumer files
- Update documentation to reflect generic naming

https://claude.ai/code/session_01LXU2GBiYWHQM7Qi7hPYTPT

* Rename StructuredLogger to StructuredEventLogger for clarity

Make the naming more explicit by using StructuredEventLogger instead of
just StructuredLogger, to clearly indicate this is an event logger.

- Rename class: StructuredLogger -> StructuredEventLogger
- Rename singleton: structured_logger -> structured_event_logger
- Update all imports and usages across consumer files

https://claude.ai/code/session_01LXU2GBiYWHQM7Qi7hPYTPT

---------

Co-authored-by: Claude <noreply@anthropic.com>
@hansent changed the title from "gcp serverless structured logging" to "Structured Event logging" on Feb 2, 2026
hansent and others added 3 commits February 2, 2026 13:17
…leware (#1964)

Decouple the structured logging request context setup from GCPServerlessMiddleware
so it runs whenever STRUCTURED_LOGGING_ENABLED=True, regardless of deployment
environment.

- Create new StructuredLoggingMiddleware that sets up RequestContext with
  request_id (from correlation ID) and hashed api_key
- Keep GCPServerlessMiddleware focused on GCP-specific concerns only
  (execution_id header, processing time header)
- Register StructuredLoggingMiddleware when STRUCTURED_LOGGING_ENABLED is true

https://claude.ai/code/session_01XHjBRQV5q7Nyq7eNEiTAaH

Co-authored-by: Claude <noreply@anthropic.com>
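
A rough sketch of such a middleware, assuming Starlette's BaseHTTPMiddleware; the header name, hashing scheme, and the keyword arguments passed to set_request_context are illustrative assumptions rather than the PR's exact code.

import hashlib

from starlette.middleware.base import BaseHTTPMiddleware

from inference.core.structured_logging.context import (
    clear_request_context,
    set_request_context,
)


class StructuredLoggingMiddlewareSketch(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        api_key = request.query_params.get("api_key")
        hashed_key = (
            hashlib.sha256(api_key.encode()).hexdigest()[:16] if api_key else None
        )
        # CorrelationIdMiddleware is assumed to have populated this header already.
        request_id = request.headers.get("x-request-id")
        set_request_context(request_id=request_id, api_key_hash=hashed_key)
        try:
            return await call_next(request)
        finally:
            clear_request_context()
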
@hansent marked this pull request as ready for review February 2, 2026 19:31
@hansent (Collaborator, Author) commented Feb 2, 2026

bugbot review

@cursor cursor bot commented Feb 2, 2026

Skipping Bugbot: Unable to authenticate your request. Please make sure Bugbot is properly installed and configured for this repository.

Comment on lines +103 to +108
state = _measure_detailed_state()
return MemorySnapshot(
    gpu_free=state.get("gpu_free"),
)


Contributor left a comment

⚡️Codeflash found 1,813% (18.13x) speedup for measure_memory_for_eviction in inference/core/structured_logging/memory.py

⏱️ Runtime: 8.28 milliseconds → 433 microseconds (best of 89 runs)

📝 Explanation and details

The optimized code achieves an 1812% speedup by eliminating unnecessary work in measure_memory_for_eviction() when detailed=True.

Key Optimization

Inlined the critical path: Instead of calling _measure_detailed_state() which collects comprehensive GPU metrics (memory_allocated(), memory_reserved(), gpu_free, gpu_total) and system memory info via psutil, the optimized version directly captures only gpu_free - the single field actually used by the caller.

What Changed

  1. Removed intermediate function call: Eliminated the call to _measure_detailed_state() and its associated function call overhead
  2. Reduced GPU synchronization: Cut GPU memory queries from 4 operations (mem_get_info(), memory_allocated(), memory_reserved(), plus total from mem_get_info()) down to just 1 (mem_get_info() for gpu_free)
  3. Eliminated psutil syscalls: Removed Process().memory_info().rss and virtual_memory().available calls entirely
  4. Simplified data flow: Directly return MemorySnapshot(gpu_free=gpu_free) instead of building an intermediate dictionary and extracting from it

Performance Impact

Based on the line profiler, the original code spent:

  • ~3.55s importing torch (99% of time in _measure_detailed_state)
  • ~37ms importing psutil
  • ~17ms on psutil memory queries

The optimized version eliminates all psutil operations and reduces GPU memory queries by 75%, keeping only the essential gpu_free measurement.

Context from Function References

The function is called during model eviction in WithFixedSizeCache.add_model() - a hot path executed when models are swapped in/out of memory. The optimization is particularly valuable here because:

  • Model eviction can happen frequently under memory pressure or capacity constraints
  • The function is called in a tight loop (for _ in range(3)) during eviction
  • Only gpu_free is actually logged in the ModelEvictedEvent, making the other measurements pure overhead

Test Results Analysis

The optimization excels across all scenarios:

  • Fast path preserved: detailed=False remains unchanged (no regression)
  • Biggest gains in realistic scenarios: Tests with actual torch/psutil mocking show 100-33000% speedups
  • Scales well: Repeated calls test shows 3490% speedup, confirming the optimization benefits accumulate in the eviction loop scenario
  • Memory-constrained environments: When torch is available but psutil is missing, the speedup is dramatic (33243%) since it was doing unnecessary work before

Correctness verification report:

Test Status
⏪ Replay Tests 🔘 None Found
⚙️ Existing Unit Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
🌀 Generated Regression Tests 309 Passed
📊 Tests Coverage 100.0%
Generated Regression Tests:
import sys
import types
from types import ModuleType, SimpleNamespace
from typing import Optional
from unittest.mock import Mock

# imports
import pytest  # used for our unit tests

# function to test
from inference.core.structured_logging.events import MemorySnapshot
from inference.core.structured_logging.memory import measure_memory_for_eviction

# unit tests


# Helper utilities to construct fake external modules (torch, psutil) for deterministic testing.
def _make_fake_torch(
    mem_info_sequence=None, is_available=True, alloc_value=0, reserved_value=0
):
    """
    Create a fake 'torch' module with a cuda submodule that provides:
      - is_available(): returns the provided is_available boolean
      - mem_get_info(): returns a tuple (gpu_free, gpu_total). If mem_info_sequence is provided,
        values are popped from it for each call to simulate changing GPU free values.
      - memory_allocated() and memory_reserved(): return fixed values for determinism.
    """
    torch_mod = ModuleType("torch")
    cuda_mod = ModuleType("torch.cuda")

    # make mem_info_sequence mutable list if provided
    if mem_info_sequence is not None:
        seq = list(mem_info_sequence)

        def mem_get_info():
            # pop from front to simulate changing readings across calls
            if not seq:
                # fallback deterministic value if exhausted
                return (0, 0)
            return (seq.pop(0), 999999)

    else:

        def mem_get_info():
            return (123456, 789012)  # default deterministic tuple

    def is_available_fn():
        return bool(is_available)

    def memory_allocated_fn():
        return int(alloc_value)

    def memory_reserved_fn():
        return int(reserved_value)

    # attach functions to cuda submodule
    cuda_mod.mem_get_info = mem_get_info
    cuda_mod.is_available = is_available_fn
    cuda_mod.memory_allocated = memory_allocated_fn
    cuda_mod.memory_reserved = memory_reserved_fn

    # attach cuda submodule to torch module
    torch_mod.cuda = cuda_mod
    return torch_mod


def _make_fake_psutil(rss_value=0, available_value=0):
    """
    Create a fake 'psutil' module with:
      - Process() returning an object whose memory_info().rss equals rss_value
      - virtual_memory() returning an object with attribute available == available_value
    """
    psutil_mod = ModuleType("psutil")

    class _Proc:
        def memory_info(self_inner):
            # return a simple object with rss attribute
            return SimpleNamespace(rss=int(rss_value))

    def Process():
        return _Proc()

    def virtual_memory():
        return SimpleNamespace(available=int(available_value))

    psutil_mod.Process = Process
    psutil_mod.virtual_memory = virtual_memory
    return psutil_mod


def test_returns_none_when_not_detailed():
    # Call with detailed=False should be a fast path returning None
    codeflash_output = measure_memory_for_eviction(detailed=False)
    result = codeflash_output  # 771ns -> 782ns (1.41% slower)


def test_detailed_with_fake_torch_and_psutil_returns_expected_gpu_free(monkeypatch):
    # Prepare deterministic fake modules
    fake_torch = _make_fake_torch(
        mem_info_sequence=[4444], is_available=True, alloc_value=111, reserved_value=222
    )
    fake_psutil = _make_fake_psutil(rss_value=555, available_value=666)

    # Inject into sys.modules so import inside _measure_detailed_state picks them up
    monkeypatch.setitem(sys.modules, "torch", fake_torch)
    monkeypatch.setitem(sys.modules, "psutil", fake_psutil)

    # Call measured function with detailed=True; should reflect our fake GPU free value
    codeflash_output = measure_memory_for_eviction(detailed=True)
    snapshot = codeflash_output  # 11.4μs -> 5.29μs (116% faster)


def test_torch_present_but_cuda_not_available_psutil_present(monkeypatch):
    # Fake torch with cuda.is_available returning False
    fake_torch = _make_fake_torch(is_available=False)
    fake_psutil = _make_fake_psutil(rss_value=1000, available_value=2000)

    monkeypatch.setitem(sys.modules, "torch", fake_torch)
    monkeypatch.setitem(sys.modules, "psutil", fake_psutil)

    # When CUDA is not available, GPU keys should not be added; snapshot.gpu_free is None
    codeflash_output = measure_memory_for_eviction(detailed=True)
    snapshot = codeflash_output  # 6.30μs -> 3.07μs (106% faster)


def test_torch_available_psutil_missing(monkeypatch):
    fake_torch = _make_fake_torch(mem_info_sequence=[7777], is_available=True)
    monkeypatch.setitem(sys.modules, "torch", fake_torch)
    # Ensure psutil import raises ImportError by removing any real psutil
    monkeypatch.delitem(sys.modules, "psutil", raising=False)

    codeflash_output = measure_memory_for_eviction(detailed=True)
    snapshot = codeflash_output  # 1.25ms -> 3.74μs (33243% faster)


def test_large_scale_repeated_calls_are_consistent(monkeypatch):
    # Prepare a sequence of GPU free values for multiple calls (below 1000 elements as required)
    values = list(range(1000, 1010))  # 10 calls, small and deterministic
    fake_torch = _make_fake_torch(mem_info_sequence=list(values), is_available=True)
    fake_psutil = _make_fake_psutil(rss_value=123, available_value=456)

    monkeypatch.setitem(sys.modules, "torch", fake_torch)
    monkeypatch.setitem(sys.modules, "psutil", fake_psutil)

    # Call measure_memory_for_eviction multiple times and ensure we get the sequence back in order
    observed = []
    for _ in range(len(values)):
        codeflash_output = measure_memory_for_eviction(detailed=True)
        snap = codeflash_output  # 41.7μs -> 16.3μs (156% faster)
        observed.append(snap.gpu_free)


def test_mem_get_info_returns_non_int_types_are_passed_through(monkeypatch):
    # mem_get_info returns a tuple where gpu_free is a string (to simulate unexpected but possible API shape)
    def mem_get_info_str():
        return ("9999_as_str", "total_dummy")

    fake_torch = ModuleType("torch")
    cuda_mod = ModuleType("torch.cuda")
    cuda_mod.is_available = lambda: True
    cuda_mod.mem_get_info = mem_get_info_str
    cuda_mod.memory_allocated = lambda: 0
    cuda_mod.memory_reserved = lambda: 0
    fake_torch.cuda = cuda_mod

    fake_psutil = _make_fake_psutil(rss_value=1, available_value=2)

    monkeypatch.setitem(sys.modules, "torch", fake_torch)
    monkeypatch.setitem(sys.modules, "psutil", fake_psutil)

    codeflash_output = measure_memory_for_eviction(detailed=True)
    snapshot = codeflash_output  # 7.73μs -> 3.10μs (150% faster)


def test_repeated_calls_with_stable_fake_modules_are_idempotent(monkeypatch):
    fake_torch = _make_fake_torch(
        mem_info_sequence=None, is_available=True, alloc_value=10, reserved_value=20
    )
    fake_psutil = _make_fake_psutil(rss_value=321, available_value=654)
    monkeypatch.setitem(sys.modules, "torch", fake_torch)
    monkeypatch.setitem(sys.modules, "psutil", fake_psutil)

    # Call twice and ensure same values are returned
    codeflash_output = measure_memory_for_eviction(detailed=True)
    s1 = codeflash_output  # 7.75μs -> 3.27μs (137% faster)
    codeflash_output = measure_memory_for_eviction(detailed=True)
    s2 = codeflash_output  # 4.45μs -> 1.56μs (185% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unittest.mock import MagicMock, patch

# imports
import pytest
from inference.core.structured_logging.events import MemorySnapshot
from inference.core.structured_logging.memory import measure_memory_for_eviction


class TestBasicFunctionality:
    """Test suite for basic functionality of measure_memory_for_eviction."""

    def test_measure_memory_detailed_false_returns_none(self):
        """
        Test that when detailed=False, the function returns None.
        This is the default behavior and should not perform any memory measurement.
        """
        codeflash_output = measure_memory_for_eviction(detailed=False)
        result = codeflash_output  # 792ns -> 811ns (2.34% slower)

    def test_measure_memory_detailed_false_is_default(self):
        """
        Test that detailed=False is the default parameter value.
        Calling without arguments should return None.
        """
        codeflash_output = measure_memory_for_eviction()
        result = codeflash_output  # 521ns -> 521ns (0.000% faster)

    def test_measure_memory_detailed_true_returns_memory_snapshot(self):
        """
        Test that when detailed=True, the function returns a MemorySnapshot object.
        This requires actual memory measurement capability.
        """
        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 165μs -> 11.0μs (1396% faster)

    def test_return_type_is_memory_snapshot_or_none(self):
        """
        Test that the return type is always either MemorySnapshot or None.
        """
        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 149μs -> 8.04μs (1753% faster)

    def test_memory_snapshot_has_gpu_free_attribute(self):
        """
        Test that MemorySnapshot contains gpu_free attribute when available.
        """
        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 147μs -> 7.49μs (1870% faster)
        if result is not None:
            pass


class TestEdgeCases:
    """Test suite for edge cases and unusual conditions."""

    def test_multiple_calls_consistent_behavior(self):
        """
        Test that multiple calls with detailed=False always return None.
        """
        for _ in range(5):
            codeflash_output = measure_memory_for_eviction(detailed=False)
            result = codeflash_output  # 1.83μs -> 1.86μs (1.61% slower)

    def test_boolean_true_explicitly_passed(self):
        """
        Test that explicitly passing True for detailed parameter works correctly.
        """
        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 148μs -> 8.04μs (1747% faster)

    def test_boolean_false_explicitly_passed(self):
        """
        Test that explicitly passing False for detailed parameter returns None.
        """
        codeflash_output = measure_memory_for_eviction(detailed=False)
        result = codeflash_output  # 721ns -> 721ns (0.000% faster)

    def test_detailed_parameter_type_coercion(self):
        """
        Test that truthy/falsy values work as detailed parameter.
        0 should be falsy (like False), 1 should be truthy (like True).
        """
        # Test with 0 (falsy)
        codeflash_output = measure_memory_for_eviction(detailed=0)
        result = codeflash_output  # 742ns -> 751ns (1.20% slower)

        # Test with 1 (truthy)
        codeflash_output = measure_memory_for_eviction(detailed=1)
        result = codeflash_output  # 149μs -> 7.61μs (1858% faster)

    def test_gpu_free_value_when_torch_unavailable(self):
        """
        Test behavior when torch is not available.
        The function should still work but gpu_free may be None.
        """
        with patch("torch.cuda.is_available", side_effect=ImportError):
            codeflash_output = measure_memory_for_eviction(detailed=True)
            result = codeflash_output  # 173μs -> 30.6μs (465% faster)

    def test_gpu_free_value_when_psutil_unavailable(self):
        """
        Test behavior when psutil is not available.
        The function should still work without system memory info.
        """
        with patch("psutil.Process", side_effect=ImportError):
            codeflash_output = measure_memory_for_eviction(detailed=True)
            result = codeflash_output  # 29.8μs -> 9.35μs (219% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_memory_snapshot_created_with_gpu_free_value(self, mock_measure):
        """
        Test that MemorySnapshot is properly created with gpu_free from state dict.
        """
        # Mock the detailed state to return known values
        mock_measure.return_value = {
            "gpu_free": 12345678,
            "gpu_total": 16000000000,
            "gpu_allocated": 3654322,
            "gpu_reserved": 3900000,
        }

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 19.5μs -> 8.44μs (131% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_memory_snapshot_with_empty_state(self, mock_measure):
        """
        Test that MemorySnapshot handles empty state dict gracefully.
        """
        # Mock the detailed state to return empty dict
        mock_measure.return_value = {}

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 19.0μs -> 8.30μs (129% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_memory_snapshot_with_none_gpu_free(self, mock_measure):
        """
        Test that MemorySnapshot handles None gpu_free value correctly.
        """
        mock_measure.return_value = {"gpu_free": None}

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 18.7μs -> 8.35μs (124% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_memory_snapshot_with_zero_gpu_free(self, mock_measure):
        """
        Test that MemorySnapshot handles zero gpu_free value (out of memory condition).
        """
        mock_measure.return_value = {"gpu_free": 0}

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 18.3μs -> 8.06μs (126% faster)


class TestLargeScale:
    """Test suite for performance and scalability with large data samples."""

    def test_repeated_calls_detailed_false(self):
        """
        Test that repeated calls with detailed=False remain fast and don't accumulate memory.
        Performs 100 calls to ensure no performance degradation.
        """
        for i in range(100):
            codeflash_output = measure_memory_for_eviction(detailed=False)
            result = codeflash_output  # 26.5μs -> 26.8μs (1.15% slower)

    def test_repeated_calls_detailed_true(self):
        """
        Test that repeated calls with detailed=True work consistently.
        Performs 10 calls (limited due to GPU sync overhead).
        """
        results = []
        for i in range(10):
            codeflash_output = measure_memory_for_eviction(detailed=True)
            result = codeflash_output  # 1.23ms -> 34.4μs (3490% faster)
            results.append(result)

        # All results should be None or MemorySnapshot
        for result in results:
            pass

    def test_concurrent_behavior_simulation(self):
        """
        Test that calling the function in sequence (simulating concurrent usage)
        works correctly. Uses 50 calls alternating between detailed=True and False.
        """
        for i in range(50):
            detailed = i % 2 == 0
            codeflash_output = measure_memory_for_eviction(detailed=detailed)
            result = codeflash_output  # 2.95ms -> 84.3μs (3395% faster)

            if detailed:
                pass
            else:
                pass

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_large_gpu_free_values(self, mock_measure):
        """
        Test handling of very large GPU memory values (e.g., for 80GB GPUs).
        """
        # 80GB GPU with 50GB free
        large_gpu_free = 80 * 1024 * 1024 * 1024
        mock_measure.return_value = {"gpu_free": large_gpu_free}

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 20.6μs -> 8.46μs (143% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_small_gpu_free_values(self, mock_measure):
        """
        Test handling of very small GPU memory values.
        """
        # 1KB free (extreme low memory condition)
        small_gpu_free = 1024
        mock_measure.return_value = {"gpu_free": small_gpu_free}

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 19.4μs -> 8.24μs (136% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_multiple_measurement_states(self, mock_measure):
        """
        Test with multiple different state dictionaries to simulate state changes.
        """
        states = [
            {"gpu_free": 1000000},
            {"gpu_free": 2000000},
            {"gpu_free": 3000000},
            {"gpu_free": 4000000},
        ]

        for state in states:
            mock_measure.return_value = state
            codeflash_output = measure_memory_for_eviction(detailed=True)
            result = codeflash_output  # 46.5μs -> 18.2μs (155% faster)

    def test_parameter_variations_100_calls(self):
        """
        Test 100 rapid calls with varying detailed parameter values.
        """
        call_count = 0
        none_count = 0
        snapshot_count = 0

        for i in range(100):
            detailed = i % 10 == 0  # Vary the parameter
            codeflash_output = measure_memory_for_eviction(detailed=detailed)
            result = codeflash_output  # 1.29ms -> 63.4μs (1928% faster)
            call_count += 1

            if result is None:
                none_count += 1
            elif isinstance(result, MemorySnapshot):
                snapshot_count += 1


class TestIntegration:
    """Integration tests combining multiple aspects of the function."""

    def test_detailed_false_then_true(self):
        """
        Test sequential calls with different detailed values.
        """
        codeflash_output = measure_memory_for_eviction(detailed=False)
        result1 = codeflash_output  # 721ns -> 731ns (1.37% slower)

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result2 = codeflash_output  # 145μs -> 7.11μs (1946% faster)

    def test_detailed_true_then_false(self):
        """
        Test sequential calls in opposite order.
        """
        codeflash_output = measure_memory_for_eviction(detailed=True)
        result1 = codeflash_output  # 143μs -> 7.32μs (1864% faster)

        codeflash_output = measure_memory_for_eviction(detailed=False)
        result2 = codeflash_output  # 631ns -> 381ns (65.6% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_state_dict_with_all_possible_keys(self, mock_measure):
        """
        Test with a complete state dictionary containing all possible keys.
        """
        complete_state = {
            "gpu_allocated": 5000000,
            "gpu_reserved": 6000000,
            "gpu_free": 10000000,
            "gpu_total": 16000000000,
            "process_rss_bytes": 2000000000,
            "system_available_bytes": 8000000000,
        }
        mock_measure.return_value = complete_state

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 20.1μs -> 8.26μs (143% faster)

    @patch("inference.core.structured_logging.memory._measure_detailed_state")
    def test_state_dict_with_partial_keys(self, mock_measure):
        """
        Test with a state dictionary containing only gpu_free.
        """
        partial_state = {"gpu_free": 5000000}
        mock_measure.return_value = partial_state

        codeflash_output = measure_memory_for_eviction(detailed=True)
        result = codeflash_output  # 19.0μs -> 8.10μs (135% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To test or edit this optimization locally, run: git merge codeflash/optimize-pr1930-2026-02-02T20.10.49

Suggested change

Before:
    state = _measure_detailed_state()
    return MemorySnapshot(
        gpu_free=state.get("gpu_free"),
    )

After:
    gpu_free = None
    try:
        import torch

        if torch.cuda.is_available():
            gpu_free, _ = torch.cuda.mem_get_info()
    except ImportError:
        pass
    return MemorySnapshot(gpu_free=gpu_free)


    api_key=self.api_key,
    allow_untrusted_packages=ALLOW_INFERENCE_EXP_UNTRUSTED_MODELS,
    allow_direct_local_storage_loading=False,
    model_access_manager=model_access_manager,
@hansent (Collaborator, Author) commented:

@PawelPeczek-Roboflow I see that, as part of 1.0rc, there are many new adapters.

I can make a similar change in all of those for 1.0, unless you have a better idea?

@grzegorz-roboflow (Collaborator) left a comment:

first batch of comments

    clear_request_context,
    set_request_context,
    structured_event_logger,
)
@grzegorz-roboflow (Collaborator) commented:

dispatch is executed on each request; I would vote for moving these imports out of this method.

@hansent (Collaborator, Author) replied:

done

if GCP_SERVERLESS:
    app.add_middleware(GCPServerlessMiddleware)
if STRUCTURED_LOGGING_ENABLED:
    app.add_middleware(StructuredLoggingMiddleware)
@grzegorz-roboflow (Collaborator) commented:

Please add a comment here indicating that this middleware relies on CorrelationIdMiddleware having already executed.
Or maybe move this to where CorrelationIdMiddleware is set, maybe even add this middleware explicitly before CorrelationIdMiddleware (twice) so it's clear those two always go together, as sketched below.
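
A small sketch of the second suggestion, reusing the names from the snippet under review and assuming CorrelationIdMiddleware comes from the asgi-correlation-id package. With Starlette, the middleware added last is outermost and runs first on the request path, so adding CorrelationIdMiddleware after StructuredLoggingMiddleware guarantees the correlation ID exists before the request context is set up.

if GCP_SERVERLESS:
    app.add_middleware(GCPServerlessMiddleware)
if STRUCTURED_LOGGING_ENABLED:
    # Registered right next to CorrelationIdMiddleware so the ordering
    # dependency is visible in one place.
    app.add_middleware(StructuredLoggingMiddleware)
app.add_middleware(CorrelationIdMiddleware)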


logger.debug("Downloading model artifacts from Roboflow API")

download_start_time = time.time()
@grzegorz-roboflow (Collaborator) commented:

Looks like this is only used inside the if structured_event_logger.enabled: branch; maybe capturing it there would save us a few CPU cycles (see the sketch below).
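
A minimal sketch of that suggestion, reusing the names from the snippet under review:

# Only pay for the timestamp when the structured event will actually be emitted.
download_start_time = None
if structured_event_logger.enabled:
    download_start_time = time.time()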

@hansent (Collaborator, Author) commented Feb 16, 2026

Holding this until after the 1.0 release is done, since we'll need to update some of the hooks to log model load times for the new adapters.
