feat(runtime): TRT-RTX runtime controls via context managers by tp5uiuc · Pull Request #4330 · pytorch/TensorRT

tp5uiuc · 2026-06-09T22:43:10Z

Description

Refs #4310, design discussion at #4323.

Moves cuda_graph_strategy, dynamic_shapes_kernel_specialization_strategy, and runtime_cache off CompilationSettings onto runtime context managers — toggle them without recompiling.

Type of change

New feature (non-breaking change which adds functionality)
Breaking change (the three CompilationSettings fields above move to RuntimeSettings / runtime CMs; old compile-time kwargs are no longer accepted)
This change requires a documentation update

Public API

torch_tensorrt.runtime.runtime_config(target, **overrides) — pool CM applying any RuntimeSettings field to every TRT submodule under target.
torch_tensorrt.runtime.runtime_cache(target, path_or_stream) — shared IRuntimeCache across one or more modules. Accepts str, os.PathLike, or a file-like (io.BytesIO, opened handles).
torch_tensorrt.runtime.enable_cudagraphs(target, *, cuda_graph_strategy=...) — RTX cuda-graph strategy + outer cudagraph wrap in one CM (the strategy is RTX-only; non-RTX builds raise on kwarg use).
torch_tensorrt.runtime.set_dynamic_shapes_kernel_strategy(target, strategy) — sugar wrapper for the dynamic-shapes field.
module.runtime_settings = RuntimeSettings(...) — direct assignment after compile.

User guide: docsrc/user_guide/runtime_performance/runtime_settings.rst.

Examples

Post-compile setter + cudagraph capture with strategy in one CM:

import torch_tensorrt as torchtrt
from torch_tensorrt.runtime import RuntimeSettings, enable_cudagraphs

mod = torchtrt.compile(model, inputs=inputs)
mod.runtime_settings = RuntimeSettings(runtime_cache="/var/cache/jit.bin")

with enable_cudagraphs(mod, cuda_graph_strategy="whole_graph_capture") as wrapped:
    out = wrapped(x)

Putting it all together — shared kernel cache across two modules, dynamic-shapes override + cudagraph capture on the first, the second consuming the first's output under the same cache:

from torch_tensorrt.runtime import (
    runtime_cache,
    runtime_config,
    enable_cudagraphs,
)

with runtime_cache([mod1, mod2], "/var/cache/jit.bin") as rc:
    with (
        runtime_config(
            mod1,
            runtime_cache=rc,
            dynamic_shapes_kernel_specialization_strategy="eager",
        ) as modr,
        enable_cudagraphs(modr, cuda_graph_strategy="whole_graph_capture") as cg,
    ):
        outputs = cg(*inputs)
    mod2(*outputs)

For stream-backed caches (io.BytesIO, opened files), caller-owned RuntimeCache lifetimes, sharing one cache across many modules, and other advanced patterns, see the Runtime Settings user guide at docsrc/user_guide/runtime_performance/runtime_settings.rst.

Architecture

flowchart TB
    classDef api fill:#cce5ff,stroke:#004085,color:#004085
    classDef settings fill:#fff3cd,stroke:#856404,color:#856404
    classDef module fill:#d4edda,stroke:#155724,color:#155724
    classDef py fill:#e7d6f7,stroke:#553375,color:#553375
    classDef cpp fill:#ffe0b3,stroke:#a14400,color:#7a3500
    classDef facade fill:#f5e6cc,stroke:#6b4423,color:#3a2200,stroke-width:3px

    %% Layer 1 — Public API
    A1["runtime_cache CM"]:::api
    A2["runtime_config CM"]:::api
    A3["enable_cudagraphs<br/>(cuda_graph_strategy=...)"]:::api
    A4["mod.runtime_settings = rs"]:::api

    %% Layer 2 — Data model
    RS["RuntimeSettings dataclass<br/>cuda_graph_strategy<br/>dynamic_shapes_kernel_specialization_strategy<br/>runtime_cache : None | str | RuntimeCache"]:::settings
    A1 --> RS
    A2 --> RS
    A3 --> RS
    A4 --> RS

    %% Layer 3 — Module (owner of implicit handle)
    MOD["TorchTensorRTModule<br/>_implicit_cache_handle : RuntimeCache<br/>_resolve_runtime_cache: builds + warm-loads disk → pending<br/>_send_to_engine"]:::module
    RS --> MOD

    %% Layer 3.5 — User-facing facade (sits BETWEEN module and the runtime split)
    RC{{"⭐ RuntimeCache &mdash; USER-FACING FACADE<br/>py/torch_tensorrt/runtime/_runtime_cache.py<br/>path / autosave_on_del<br/>load · save · load_from_stream · save_to_stream<br/>has_cache · is_cpp_runtime · ensure_cache<br/><i>same API regardless of runtime — forwards to ._handle</i>"}}:::facade
    MOD -. "owns" .-> RC

    %% Layer 4 — Runtime branch
    BR{cpp runtime<br/>available?}:::module
    MOD --> BR

    %% Layer 5/6/7 — Side-by-side language columns, each with engine → shim → inner handle
    subgraph PY ["Python runtime path"]
        direction TB
        PYENG["_TRTEngine<br/>.context (lazy @property)<br/>.update_runtime_settings(rs)"]:::py
        PYTRC["TRTRuntimeConfig (Python shim)<br/>_runtime_config.py<br/>owns trt.IRuntimeConfig (lazy)"]:::py
        PYINNER["<b>_RuntimeCacheHandle</b><br/>(python-rt inner — port of cpp class)<br/>_cache : trt.IRuntimeCache<br/>_pending_warm_bytes (drained on first ensure_materialized)<br/>_lock (mirrors cpp state_mu_)"]:::py
        PYENG --> PYTRC
        PYTRC -. "ensure_cache → setRuntimeCache" .-> PYINNER
    end

    subgraph CPP ["C++ runtime path"]
        direction TB
        CPPENG["torch.classes.tensorrt.Engine<br/>.update_runtime_settings(int, int, cache)"]:::cpp
        CPPTRC["TRTRuntimeConfig (C++ struct)<br/>core/runtime/TRTRuntimeConfig.{h,cpp}<br/>owns nvinfer1::IRuntimeConfig"]:::cpp
        CPPINNER["<b>torch.classes.tensorrt.RuntimeCacheHandle</b><br/>(cpp-rt inner — torchbind class, used directly, no wrapper)<br/>core/runtime/RuntimeSettings.{h,cpp}<br/>trt_handle_ : shared_ptr&lt;IRuntimeCache&gt;<br/>pending_warm_bytes_ (drained on ensure_materialized)<br/>state_mu_ (mirrors python _lock)"]:::cpp
        CPPENG --> CPPTRC
        CPPTRC -. "ensure_materialized → setRuntimeCache" .-> CPPINNER
    end

    BR -- "No" --> PYENG
    BR -- "Yes" --> CPPENG

    %% The facade uniformly exposes whichever inner is appropriate — same API surface either side.
    RC == "_handle (python rt)" ==> PYINNER
    RC == "_handle (cpp rt)" ==> CPPINNER

Color key

🟦 Blue — Public API entry points
🟨 Amber — RuntimeSettings dataclass (data model)
🟩 Green — TorchTensorRTModule orchestration (owns the implicit handle)
🟫 Tan ⭐ — RuntimeCache user-facing facade (thick border; runtime-agnostic API; the only handle users touch)
🟪 Purple — Python runtime path: _TRTEngine → Python shim → _RuntimeCacheHandle (inner)
🟧 Orange — C++ runtime path: torchbind engine → C++ struct → torch.classes.tensorrt.RuntimeCacheHandle (inner)

How to read the diagram. Settings flow top → down. The two language columns are mirror images of each other (engine → shim → inner cache handle); the bold inner-handle nodes (_RuntimeCacheHandle in purple, torch.classes.tensorrt.RuntimeCacheHandle in orange) are 1:1 ports — same public surface (serialize / deserialize / has_cache / ensure_materialized), both with a pending-bytes stash and their own lock guarding the GIL-releasing create-cache race. The RuntimeCache facade (the thick-bordered ⭐ node) is the only handle users touch; it forwards every call (load, save, has_cache, etc.) uniformly to whichever inner ._handle references — bold double-arrows on either side show the dispatch.

Implementation

RuntimeSettings dataclass + TRTRuntimeConfig shim (Python + a mirroring C++ struct in core/runtime/) own the live IRuntimeConfig. All ENABLED_FEATURES.tensorrt_rtx gates live inside the shim.
RuntimeCache is a facade wrapping either _RuntimeCacheHandle (Python-rt port of the cpp class) or torch.classes.tensorrt.RuntimeCacheHandle (cpp torchbind, used directly). Both implement a common _RuntimeCacheHandleProtocol; the facade forwards without isinstance branching.
Deferred materialization on both inners: deserialize stashes bytes into a pending buffer if the underlying IRuntimeCache is not yet created; the first ensure_materialized call (driven by the python or cpp _apply_settings) creates the cache and drains the pending bytes atomically. Disk bytes for engine-implicit handles are pre-loaded into the pending buffer at handle construction time (_TorchTensorRTModule._resolve_runtime_cache) — one warm-load callsite covers both runtimes.
Filelocked atomic-rename disk persistence (load / save) plus load_from_stream / save_to_stream primitives. Engine-implicit handles autosave on __del__.
TorchTensorRTModule._implicit_cache_handle is the canonical owner; RuntimeCache.is_cpp_runtime() lets external callers detect which inner is in use.
IExecutionContext is strictly lazy on both runtimes. Python exposes engine.context as a write-protected @property; the C++ engine exposes a single exec_ctx() getter. Runtime knobs are NOT serialized into the engine tuple.
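
The deferred-materialization behaviour described above can be sketched in a few lines. This is a minimal stand-in, not the PR's `_RuntimeCacheHandle`: the class name `PendingBytesHandle`, the dict used in place of `IRuntimeCache`, and the `create_cache` callback are all assumptions made for illustration.

```python
import threading


class PendingBytesHandle:
    """Toy stand-in for an inner cache handle (hypothetical, not the real class)."""

    def __init__(self):
        self._cache = None      # stands in for the underlying IRuntimeCache
        self._pending = None    # bytes stashed before the cache exists
        self._lock = threading.Lock()

    def deserialize(self, data: bytes) -> None:
        if not data:
            return  # empty input: nothing to load
        with self._lock:
            if self._cache is None:
                self._pending = bytes(data)      # defer until materialization
            else:
                self._cache["blob"] = bytes(data)

    def ensure_materialized(self, create_cache):
        # Create the cache once and drain pending bytes under the same lock.
        with self._lock:
            if self._cache is None:
                self._cache = create_cache()
                if self._pending is not None:
                    self._cache["blob"] = self._pending
                    self._pending = None
            return self._cache

    def has_cache(self) -> bool:
        with self._lock:
            return self._cache is not None


created = []


def create():
    created.append(1)
    return {}


handle = PendingBytesHandle()
handle.deserialize(b"warm")            # cache not created yet, so bytes are stashed
stashed_only = not handle.has_cache()
cache = handle.ensure_materialized(create)
handle.ensure_materialized(create)     # second call reuses the existing cache
```

Holding one lock across both the create and the drain is what keeps the two steps from interleaving with a concurrent `deserialize` in this sketch.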

Tests

Added tests/py/dynamo/runtime/: test_000_runtime_cache.py, test_001_cuda_graph_strategy.py, test_001_dynamic_shapes_kernel_strategy.py, test_004_runtime_settings.py. The build's selected runtime determines whether the cpp or Python inner path runs; whitebox introspection tests skip on the other side.

Verified locally on TRT-RTX 1.5.0.114 (A100): cpp-rt 41 passed / 20 skipped / 0 failed; python-rt (libs hidden to force the python path) 58 passed / 3 skipped / 0 failed.

Notes

The cudagraphs wrapper's warm_up() materializes the engine's context with whatever settings are in effect at that moment. enable_cudagraphs(target, cuda_graph_strategy=...) applies the strategy before the wrapper's warm-up, preserving the "one createExecutionContext per setup" invariant.
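
To make the ordering concern concrete, here is a toy model of the invariant. It assumes only what the note states: a settings change drops any cached context, and the first use after that builds a new one. `CountingEngine` and `warm_up` are hypothetical names, not the PR's classes.

```python
class CountingEngine:
    """Toy engine that counts how often its execution context is built."""

    def __init__(self):
        self.settings = {}
        self.creates = 0
        self._context = None

    def update_settings(self, **overrides):
        self.settings.update(overrides)
        self._context = None  # state-only change; any cached context is dropped

    def context(self):
        if self._context is None:
            self.creates += 1
            self._context = dict(self.settings)  # snapshot of settings in effect
        return self._context


def warm_up(engine):
    # Stands in for the cudagraphs wrapper's warm-up: first use builds the context.
    return engine.context()


# Strategy applied before warm-up: the context is built once.
good = CountingEngine()
good.update_settings(cuda_graph_strategy="whole_graph_capture")
warm_up(good)
warm_up(good)

# Warm-up first, strategy afterwards: the context has to be rebuilt.
bad = CountingEngine()
warm_up(bad)
bad.update_settings(cuda_graph_strategy="whole_graph_capture")
warm_up(bad)
```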

Checklist:

My code follows the style guidelines of this project (You can use the linters)
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas and hacks
I have made corresponding changes to the documentation
I have added tests to verify my fix or my feature
New and existing unit tests pass locally with my changes
I have added the relevant labels to my PR so that relevant reviewers are notified

Refs pytorch#4310. Moves ``cuda_graph_strategy``, ``dynamic_shapes_kernel_specialization_strategy``, and ``runtime_cache`` off ``CompilationSettings`` and onto runtime-mode context managers, so callers toggle them without recompiling.

Public API
----------
- ``torch_tensorrt.runtime.runtime_config(target, **overrides)`` -- pool CM that applies any ``RuntimeSettings`` field to every TRT submodule.
- ``torch_tensorrt.runtime.runtime_cache(target, path_or_stream)`` -- attaches a shared ``IRuntimeCache`` across one or more modules. Accepts ``str``, ``os.PathLike``, or a file-like (``io.BytesIO``, opened file handles, etc.).
- Sugar: ``set_cuda_graph_strategy`` / ``set_dynamic_shapes_kernel_strategy``.
- ``module.runtime_settings = RuntimeSettings(...)`` for direct assignment.
- Compile-time hint via ``torchtrt.compile(..., runtime_settings=...)`` primes the engine without an extra CM enter/exit.

Implementation
--------------
- ``RuntimeSettings`` dataclass + ``TRTRuntimeConfig`` shim (Python + a mirroring C++ struct in ``core/runtime/``) own the live ``IRuntimeConfig`` and apply settings. All ``ENABLED_FEATURES.tensorrt_rtx`` gates live inside the shim; callers in ``_TRTEngine`` and ``_TorchTensorRTModule`` stay uniform.
- ``RuntimeCacheHandle`` (Python wrapper + C++ torchbind sibling) owns the per-engine ``IRuntimeCache`` plus filelocked atomic-rename disk persistence. Three construction modes: engine-implicit (``autosave_on_del=True``), runtime CM (``autosave_on_del=False``, explicit save on ``__exit__``), and user-built (default ``autosave_on_del=False``).
- Stream support: ``load_from_stream`` / ``save_to_stream`` are the byte primitives; the path-mode ``load`` / ``save`` delegate to them.
- ``TorchTensorRTModule._implicit_cache_handle`` is the single owner across Python and cpp runtimes; ``TRTRuntimeConfig`` is a pure-execution shim.
- C++ strategy fields are typed ``enum class : int32_t`` mirroring the ``nvinfer1`` enum integers; ``int32_t`` crosses the torchbind boundary for ABI stability, with reverse-lookup helpers for logging.
- Lazy ``IExecutionContext`` creation in ``TRTEngine``; runtime knobs are NOT serialized into the engine tuple (per the issue contract).

Tests
-----
65 new tests under ``tests/py/dynamo/runtime/``, parameterized over python and cpp runtimes where applicable. Covers compile-time hints, CM enter/exit, settings-swap save semantics, file-handle and ``BytesIO`` round-trip for the shared cache, and the lazy-context regression.
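
The relationship between the stream primitives and the path-mode ``load`` / ``save`` can be sketched as below. This is an illustrative outline using only the standard library, not the PR's code: the function bodies are assumptions, and the file lock mentioned in the commit message is deliberately left out because the standard library has no portable lock primitive.

```python
import io
import os
import tempfile


def save_to_stream(data: bytes, stream) -> None:
    """Byte primitive: write the serialized cache to any binary file-like."""
    stream.write(data)


def load_from_stream(stream) -> bytes:
    """Byte primitive: read serialized cache bytes from any binary file-like."""
    return stream.read()


def save(data: bytes, path: str) -> None:
    """Path mode: write to a temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            save_to_stream(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers see either the old file or the new one, never a partial write
        # (atomic on POSIX when both paths are on the same filesystem).
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load(path: str) -> bytes:
    """Path mode: a missing file is treated as an empty cache."""
    if not os.path.exists(path):
        return b""
    with open(path, "rb") as f:
        return load_from_stream(f)


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "jit.bin")
    missing = load(target)
    save(b"v1", target)
    save(b"v2", target)  # overwrites via rename
    on_disk = load(target)
    leftovers = [n for n in os.listdir(d) if n.endswith(".tmp")]

buf = io.BytesIO()
save_to_stream(b"v2", buf)
buf.seek(0)
from_stream = load_from_stream(buf)
```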

…pile-time hint

Python (``_TRTEngine``):
- ``self.context`` becomes a ``@property`` backed by a private ``self._context`` field. Reads lazily materialize on first access; the property has no setter (write raises ``AttributeError``) so external code cannot stash an arbitrary context.
- Add ``invalidate_context()`` (drops the cached context; next read rebuilds) and ``has_context()`` (probes without triggering creation).
- ``_setup_engine`` no longer creates the context. Distributed engines still materialize eagerly (mirrors cpp) by reading ``self.context`` before the NCCL barrier.
- All five recreate sites (``update_runtime_settings``, device-memory budget setter, ``use_dynamically_allocated_resources``, ``disable_profiling``, internal) collapse to ``invalidate_context()``.
- All read sites (forward, ``infer_outputs``, ``enable_profiling``, ``setup_nccl_comm``, ``_is_monolithic_capturable``) are unchanged -- the property's lazy semantics absorb the laziness.

C++ (``TRTEngine``):
- ``exec_ctx`` field moves from public to private (renamed ``exec_ctx_``).
- Single public getter ``exec_ctx()`` returns a raw pointer, lazy-creating via the existing private ``recreate_execution_context()``. Drop public ``ensure_execution_context()`` -- the getter IS the ensure.
- Rename ``invalidate_execution_context()`` to ``invalidate_exec_ctx()``; add ``has_exec_ctx()`` for null-safe introspection.
- All 5-6 call sites in ``TRTEngine.cpp`` collapse to ``exec_ctx()->...``; ``execute_engine.cpp`` swaps ``->exec_ctx->`` for ``->exec_ctx()->``.

Drop the compile-time ``runtime_settings`` kwarg:
- The kwarg existed to dodge an old 2-create regression on cpp; with both runtimes strictly lazy, that motivation is gone. Users apply settings via ``mod.runtime_settings = rs`` after compile, or use a runtime CM.
- Removed from ``compile``, ``compile_module``, ``convert_module``, ``TorchTensorRTModule.__init__``, ``_TRTEngine.__init__``.
- Documented composition contract on ``set_cuda_graph_strategy`` and ``enable_cudagraphs`` docstrings: nest ``with runtime_config(...) as m:`` outside ``with enable_cudagraphs(m) as w:`` so settings are applied state-only before the wrapper's warm-up materializes the context.

Tests:
- Two assertions for ``engine.context is not None`` flip to ``engine.has_context()`` so they probe the lazy field without forcing materialization.
- Tests that passed ``runtime_settings=...`` to ``torchtrt.compile`` switch to a small ``_apply_runtime_settings(compiled, rs)`` helper that walks the compiled module and assigns ``mod.runtime_settings = rs`` per inner ``TorchTensorRTModule``.
- ``test_one_context_create_with_default_settings`` now expects 0 contexts at setup on both runtimes (was 0 cpp / 1 python).
- ``test_one_context_create_with_compile_time_settings`` was redundant once the hint is gone; replaced with ``test_post_compile_settings_then_execute_is_one_create``.

All 33 runtime tests pass on TRT-RTX 1.5.0.103 (20 skips unchanged).
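
The Python-side pattern described in this commit (a read-only lazy property plus ``has_context()`` and ``invalidate_context()``) looks roughly like the following. It is a generic sketch of the pattern, not the ``_TRTEngine`` implementation; ``LazyContextEngine`` and its ``factory`` argument are hypothetical.

```python
class LazyContextEngine:
    """Generic lazy, write-protected property (illustrative only)."""

    def __init__(self, factory):
        self._factory = factory  # builds the context on demand
        self._context = None

    @property
    def context(self):
        # No setter is defined, so assignment raises AttributeError.
        if self._context is None:
            self._context = self._factory()
        return self._context

    def has_context(self) -> bool:
        # Probe without triggering creation.
        return self._context is not None

    def invalidate_context(self) -> None:
        # Drop the cached context; the next read rebuilds it.
        self._context = None


engine = LazyContextEngine(factory=object)
before = engine.has_context()
first = engine.context
same = engine.context is first

try:
    engine.context = "arbitrary"
    write_error = None
except AttributeError as exc:
    write_error = exc

engine.invalidate_context()
dropped = not engine.has_context()
rebuilt = engine.context is not first
```

Probing with ``has_context()`` instead of ``context is not None`` matters in tests for exactly this reason: the second form would itself build the context.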

Two stale references on the Python-runtime branch of TorchTensorRTModule were left over from earlier in this PR and broke every test that runs a compiled module on the Python runtime:

- ``setup_engine`` and ``set_extra_state`` were calling ``torch.ops.tensorrt.execute_engine_python``, but the custom op is registered as ``tensorrt::execute_engine`` (in _TRTEngine.py). The ``_python`` suffix was an intermediate name during PR pytorch#4222's dev cycle and never made it to main. Fixed both call sites to use the single shared op name, matching the C++-runtime branch and the docstring at the top of the class.
- ``set_extra_state`` was still passing ``runtime_settings=self._runtime_settings`` to ``TRTEngine.__init__``, but the previous follow-up commit dropped that kwarg. Engine now constructs with default settings (matching what the caller assigned to ``self._runtime_settings`` two lines above) and applies any non-default settings via the post-load setter, same as the live ``setup_engine`` path.

tp5uiuc

Technical note: why `runtime_cache()` needs module arguments

runtime_cache(...) exists to attach a shared IRuntimeCache to one or more engines for the duration of a with block. The "shared" + "for the duration of" parts both require knowing which engines.

A standalone "just give me a cache" API also exists: construct RuntimeCache(path="...") directly, then pass it in via mod.runtime_settings = RuntimeSettings(runtime_cache=handle) (single module). The CM is the convenience wrapper that does all three things (construct, attach, auto-load/save) in one block.

Three reasons it can't be standalone:

The cache has to be wired to engines, not just constructed as in the example above.
The "shared across modules" semantic only exists with multiple targets listed in the first argument.
Bootstrap depends on a module's engine. This is the most annoying reason, but still a technically valid one. The TRT-RTX APIs do not currently let us create a standalone runtime cache; we need an engine to bootstrap it. The CM walks target.named_modules() to find a TorchTensorRTModule whose engine it can use to query runtime properties (engine.runtime_config, the cpp IRuntimeConfig, etc.) and create a runtime cache. Having modules to attach the cache to makes this much easier to manage.

Round of fixes off the latest PR review feedback:

Blocking
- Initialize ``_implicit_cache_handle`` in ``TorchTensorRTModule.__init__`` so the slot exists on every construction path, not just ``setup_engine``. Drops a regression where ``set_extra_state`` (post-load) skipped the init and any subsequent ``mod.runtime_settings = ...`` raised AttributeError. Removes the matching ``# type: ignore[has-type]``.

High-priority
- ``_to_torchbind_handle`` now rejects a mixed-runtime case loudly: a Python ``RuntimeCacheHandle`` with a live pybind ``IRuntimeCache`` but no torchbind sibling crossing into the C++ runtime path would silently orphan the cache.
- Also gates the str -> torchbind path on a truthy-string check (mirrors ``_materialize_implicit_handle``) so ``""`` doesn't construct a no-op torchbind handle.
- ``runtime_settings.setter`` now reconciles the handle that ``_dispatch_runtime_settings_to_engine`` substituted in, matching the ``setup_engine`` post-condition (``self._runtime_settings`` agrees with what the engine actually saw).

Medium
- ``runtime`` Bazel cc_library now exports ``TRTRuntimeConfig.h`` alongside ``TRTEngine.h`` for symmetry with ``runtime_base`` and the ``include_files`` filegroup.
- ``_RuntimeCacheContextManager.__exit__`` synchronizes CUDA before save (avoids the detach-then-save race against a concurrent execute), and wraps save in try/except+warn so a transient filesystem failure on exit can't mask the with-block's actual exception.
- ``to_*_strategy`` now take ``int64_t`` and bounds-check before narrowing, so an out-of-range Python caller can't slip past the check via a silent ``int32_t`` overflow.

Low
- Drop the explicit local_defines select on ``ENABLE_FEATURE_DISABLE_RUNTIME_ALLOCATION`` in ``core/runtime/BUILD`` (no longer needed on TRT-RTX 1.5+).
- ``.size()`` -> ``std::size(...)`` on the std::array uses in ``RuntimeSettings.cpp`` for consistency.
- Trim docstrings on ``enable_cudagraphs`` and ``set_cuda_graph_strategy`` -- composition guidance moves to the runtime-settings doc page.
- Drop ``_NON_RTX_WARNING_EMITTED`` module global; emit unconditionally so test isolation can assert on it. Users wanting once-per-process behavior filter via ``warnings.simplefilter("once", UserWarning)``.
- Document multi-process lost-update semantics of the default ``runtime_cache`` path in the dataclass docstring.
- Skip the ``num_execution_contexts_created == 0`` assertion for NCCL engines in ``test_one_context_create_with_default_settings``: NCCL engines eagerly bind the comm at setup, which materializes the context. Comment in ``_TRTEngine`` flags the same divergence.
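
The "save failure on exit must not mask the with-block's exception" item can be illustrated with a small context manager. This is a sketch of the general technique only: ``SaveOnExit`` and ``failing_save`` are hypothetical, and the CUDA synchronization step that the real ``__exit__`` performs before saving is omitted.

```python
import warnings


class SaveOnExit:
    """Context manager that saves on exit and downgrades save errors to warnings."""

    def __init__(self, save):
        self._save = save

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            self._save()
        except Exception as save_error:
            # Warn instead of raising, so an in-flight exception is not replaced.
            warnings.warn(f"runtime cache save failed: {save_error}", UserWarning)
        return False  # never swallow the with-block's own exception


def failing_save():
    raise OSError("disk full")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    try:
        with SaveOnExit(failing_save):
            raise ValueError("forward failed")
    except ValueError as exc:
        surfaced = exc
```

Without the try/except in ``__exit__``, the caller would see the ``OSError`` from the save, with the original ``ValueError`` only attached as context.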

Originally this PR hard-removed ``cuda_graph_strategy``, ``dynamic_shapes_kernel_specialization_strategy``, and ``runtime_cache_path`` from ``torch_tensorrt.compile()``. That left two issues:

- **B1**: the model-suite tests (``test_cuda_graph_strategy_models.py``, ``test_dynamic_shapes_kernel_strategy_models.py``, ``test_runtime_cache_models.py``) still passed those kwargs to ``compile()``; they were silently dropped via ``**kwargs`` and the test intent was lost.
- **B3**: downstream callers of ``torch_tensorrt.compile(model, ..., cuda_graph_strategy=...)`` would see the same silent drop: no deprecation warning, no error.

Fix combines both halves:

1. Re-accept the three kwargs via ``**kwargs`` in ``compile()`` with a single ``DeprecationWarning``, and route them through ``mod.runtime_settings = RuntimeSettings(...)`` after ``compile_module`` returns. ``runtime_cache_path`` → ``runtime_cache`` rename is applied inside the shim.
2. Port the three model-suite test files to the new pattern (post-compile ``_apply_runtime_settings(compiled, RuntimeSettings(...))``), matching the runtime-suite tests. This keeps them clean of deprecation warnings while serving as the canonical example of the new API.
3. **L1**: drop the now-stale ``runtime_cache_path`` parameter line from ``MutableTorchTensorRTModule.__init__``'s docstring. Touched up a parallel docstring in ``_TRTEngine`` (``Set cuda_graph_strategy at compile time`` -> ``Apply RuntimeSettings(...) via the runtime_config CM or mod.runtime_settings setter``).

Reverts the ``compile()`` deprecation shim from the previous commit. Tests in the model and runtime suites have all been ported to the post-compile ``mod.runtime_settings = RuntimeSettings(...)`` pattern; no in-tree caller of ``compile()`` still passes the old kwargs. Going straight to hard-removal is cleaner than carrying the warning machinery.

tp5uiuc

Placeholder.

CI fix
- ``RuntimeSettings.cpp`` was relying on ``ATen/core/Tensor.h`` transitively pulling in factory functions; torch_l4t (Jetpack) only ships the minimal ``Tensor.h``, so ``at::empty`` failed to resolve. Add ``ATen/ATen.h`` explicitly. Unblocks the Jetpack lane on this PR.

Follow-up review items
- **F1**: ``set_extra_state`` now also resets ``_implicit_cache_handle`` to ``None`` alongside the ``_runtime_settings`` reset. Otherwise a stale wrapper from prior use survives ``load_state_dict`` and the next setter could silently write the fresh engine's cache bytes to the old path.
- **F2**: added the requested invariant comment above the conditional reconciliation in ``runtime_settings.setter`` so a future reviewer doesn't mistake the ``if`` for an unconditional merge.

Tests pinning the fixes
- ``TestSaveLoadRuntimeSettingsRoundTrip``: save -> load -> setter must not raise. Catches a regression of B2 if the ``_implicit_cache_handle = None`` init ever moves back out of ``__init__``.
- ``TestNestedRuntimeConfigCudagraphs``: asserts the central perf invariant of the runtime CM + cudagraphs composition -- nested form (runtime CM outside) yields one ``createExecutionContext`` call, inverted form yields two. Pins the contract documented in the PR description.
- ``TestToTorchbindHandleOrphanGuard``: exercises the H1 raise path so a future change to the silent-orphan fallback gets caught.

…ytorch#5) The class-level skipIf on ``TestRuntimeCacheStreamSupport`` only fenced the round-trip tests (which legitimately can't run on cpp because the ``IRuntimeCache`` materializes lazily on context creation and bytes loaded before that don't survive). The first-run flavor -- CM enter -> forward -> exit-with-bytes-in-buf -- works the same on both runtimes and exercises the cpp dispatch glue (handle construction, attach to torchbind engine, save-on-exit). No content assertion on the saved bytes -- workload-dependent on cpp. The non-raising exit is the contract the test protects.

The prior ``test_setter_after_save_load_does_not_raise`` used ``torch.save`` / ``torch.load`` and so didn't actually exercise the B2 bug path. Pickle goes through ``nn.Module.__setstate__`` (wholesale ``__dict__`` restore); ``set_extra_state`` is never called. So the test would have passed even with the B2 fix reverted, because the slot was preserved as a regular ``__dict__`` entry on the saved side. Swap to the ``state_dict`` / ``load_state_dict`` round-trip, which is the path that actually goes through ``set_extra_state``. Renamed class to ``TestStateDictRoundTripRuntimeSettings`` and updated the docstring to be honest about what it pins down.

Follow-up to the LOG_WARN pass on non-RTX paths. The TRT-RTX branch of ``RuntimeCacheHandle::serialize()`` had one more silent failure: when ``trt_handle_->serialize()`` returns ``nullptr`` (TRT-level allocation or internal failure), we returned an empty tensor without surfacing the error. Add a ``LOG_WARNING`` so the host-memory allocation failure is visible. The pre-materialize ``!trt_handle_`` branch above stays silent on purpose: it fires on normal lifecycle states (autosave-on-del before any forward, CM exit pre-execute) and would be noise as a warning. ``deserialize()``'s ``data.numel() == 0`` early-return likewise stays silent: the Python wrapper already filters empty inputs upstream.

Follow-up to the prior LOG_WARN pass. Per reviewer follow-up, the two remaining silent branches surface as warnings too:

- ``RuntimeCacheHandle::serialize()`` with ``!trt_handle_`` (wrapper exists but the underlying ``IRuntimeCache`` was never materialized -- e.g. autosave-on-del before any forward, CM exit pre-execute). Previously returned empty silently; users had no signal that the saved file was empty by design vs by bug.
- ``RuntimeCacheHandle::deserialize()`` with an empty input tensor. Reachable via direct torchbind calls that bypass the Python ``load_from_stream`` filter; useful for catching accidental empty loads.

Addresses reviewer follow-up: the ``def_pickle`` comment on the torchbind ``RuntimeCacheHandle`` claimed the underlying ``IRuntimeCache`` is GPU-side state. It is CPU-side -- the cache holds host-memory kernel-compilation metadata, not device buffers. Correct the wording so the rationale for persisting only the ``path`` (no in-memory bytes) matches reality.

Addresses reviewer comment ("Make both code-paths similar"). The dynamic-shapes-kernel-specialization strategy path emits a ``LOG_DEBUG`` on every successful set; the cuda_graph_strategy path only warned on failure with no success-side debug log, so the two paths read asymmetrically. Add the success-branch ``LOG_DEBUG`` to ``setCudaGraphStrategy`` so both strategy attachments produce a uniform "X set to <value>" debug trail. The failure ``LOG_WARNING`` stays -- ``setCudaGraphStrategy`` returns bool unlike its DS counterpart, so the genuine failure signal is preserved.

Per reviewer comment: both ``DynamicShapesKernelSpecializationStrategy`` and ``CudaGraphStrategy`` already have an ``operator<<`` overload that forwards to ``to_string()``. Use the streaming overload directly in ``to_str()`` instead of the explicit ``.to_string()`` calls -- shorter and avoids the redundant conversion at the print site.

Per reviewer comment ("can be constexpr"). The ``to_string()`` methods on ``DynamicShapesKernelSpecializationStrategy`` and ``CudaGraphStrategy`` are pure index lookups over compile-time-known arrays, so they should be constexpr; this lets a future constant-folded ``static_assert(s.to_string() == "lazy")`` work and avoids a function call at print sites. The reverse-lookup arrays (``kDsStrategyNames`` / ``kCgStrategyNames``) move from the .cpp anonymous namespace into the header as ``inline constexpr`` so the inline ``to_string()`` definitions can see them at compile time. ``from_underlying`` / ``from_string`` in the .cpp still reference the same arrays via the header. No behavior change at runtime; the change is purely "values become usable in constant expressions".

Per reviewer suggestion: the four validator call sites (``DynamicShapesKernelSpecializationStrategy`` / ``CudaGraphStrategy``, each with ``from_underlying`` + ``from_string``) repeated the "|"-joined name tail (``"lazy|eager|none"``, ``"disabled|whole_graph_capture"``) in literal form. Adding or renaming a strategy required touching two strings per type.

Introduce ``constexpr``-fold-friendly helper templates in the anonymous namespace:

- ``join_string_views(sep, parts)`` for ``"a|b|c"``.
- ``format_expected_strategy(names)`` for ``"(expected 0..N-1 mapping to a|b|c)"``.
- ``format_expected_name(names)`` for ``"(expected a|b|c)"`` (the name-only variant used by ``from_string``).

Validator messages now render from ``kDsStrategyNames`` / ``kCgStrategyNames`` directly, so a new strategy value requires only one array edit.
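
The idea of rendering every validator message from a single name array, with the range check done before any narrowing, can be shown with a Python analog of the C++ helpers. The function names mirror the commit message, but the bodies, the ``ValueError`` type, and the assumption that a strategy's integer value equals its position in the array are illustrative guesses rather than the PR's code.

```python
# Single source of truth per strategy type (order assumed to match the enum values).
DS_STRATEGY_NAMES = ("lazy", "eager", "none")
CG_STRATEGY_NAMES = ("disabled", "whole_graph_capture")


def join_names(sep, parts):
    return sep.join(parts)


def format_expected_strategy(names):
    return f"(expected 0..{len(names) - 1} mapping to {join_names('|', names)})"


def format_expected_name(names):
    return f"(expected {join_names('|', names)})"


def from_underlying(value: int, names):
    # Check the full-width integer first, so a large value cannot wrap into range.
    if not 0 <= value < len(names):
        raise ValueError(f"invalid strategy {value} {format_expected_strategy(names)}")
    return names[value]


def from_string(name: str, names):
    if name not in names:
        raise ValueError(f"invalid strategy '{name}' {format_expected_name(names)}")
    return names.index(name)


ds_message = format_expected_strategy(DS_STRATEGY_NAMES)
cg_message = format_expected_name(CG_STRATEGY_NAMES)
```

Adding a strategy then means appending one entry to the relevant tuple; both messages and both validators pick it up.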

Reviewer suggestion: instead of threading a ``nvinfer1::IExecutionContext*`` parameter through ``setup_input_tensors`` / ``create_output_tensors`` / ``create_output_allocator``, keep the original signatures and hoist ``auto* ctx = compiled_engine->exec_ctx();`` at the top of each helper. Lower diff to the call sites; pays one extra ``exec_ctx()`` call per helper invocation (still vastly fewer than the ~20 per-call rate the original code paid). The top-level hoist inside ``execute_engine`` is unchanged.

Per reviewer follow-up on commit ``f0b488f05``: the cleaner shape uses ``std::cbegin`` / ``std::cend`` / ``std::next`` iterators and an explicit ``N == 0`` early-return, so the loop body avoids the per-iteration "is this the first?" branch. Also pulls in ``<iterator>`` for ``std::cbegin`` / ``std::cend`` / ``std::next``.

Per reviewer follow-up on commit ``13472dd29``: making ``to_string()`` constexpr required exposing ``kDsStrategyNames`` / ``kCgStrategyNames`` as ``inline constexpr`` in the public header. Keep the implementation detail confined to the .cpp translation unit instead. Reverts the header to declaration-only ``to_string()``; moves the arrays + the ``to_string()`` bodies back into the .cpp anonymous namespace. The validator templates (``join_string_views`` / ``format_expected_strategy``) introduced in ``f0b488f05`` keep working unchanged -- they share the same anonymous namespace.

CI gcc 13 rejected the direct ``static_cast<nvinfer1::Strategy>(wrapper)`` calls in ``TRTRuntimeConfig::ensure_initialized``: error: invalid 'static_cast' from type 'DynamicShapesKernelSpecializationStrategy' to type 'nvinfer1::DynamicShapesKernelSpecializationStrategy' The wrapper class has ``operator Value()`` returning the nested enum, but ``static_cast<enum>`` allows only one user-defined conversion in the chain, and the wrapper -> Value -> int -> nvinfer1::enum path uses one UDC plus an extra enum-to-enum conversion that ``static_cast`` does not auto-apply on gcc 13. Insert an explicit ``.to_underlying()`` (constexpr ``int32_t``) before the ``static_cast``: the chain becomes wrapper -> int32_t (member call) -> nvinfer1::enum (single static_cast). Same byte-for-byte mapping; no runtime cost. Matching note also updated in the header comment so the next reader doesn't try the bare cast again.

Escalate the ``!host_mem`` branch of ``RuntimeCacheHandle::serialize()`` from ``LOG_WARNING + empty()`` to ``TORCHTRT_CHECK``. This is a TRT-internal failure (host-memory allocation or internal error in ``IRuntimeCache::serialize()``), not a normal lifecycle state, so it should propagate as a ``RuntimeError`` instead of degrading silently.

Asymmetry vs the sibling ``!trt_handle_`` branch is intentional:

- ``!trt_handle_`` -- cache wrapper exists but never materialized (autosave-on-del before any forward, CM exit pre-execute). Normal state; keeps the LOG_WARNING + empty().
- ``!host_mem`` -- TRT itself failed to produce serialized bytes from a live cache. Exceptional; throw.

Matches the pattern at ``TRTRuntimeConfig.cpp:44`` where the materialize-side null check uses ``TORCHTRT_CHECK``. Implicit callers (CM ``__exit__`` autosave, ``RuntimeCache.__del__`` autosave) already wrap save in try/except + logger.warning, so this escalation only surfaces to explicit ``rc.save()`` callers -- which is precisely who should hear about the failure.

tp5uiuc · 2026-06-14T22:19:36Z

CI status on `d8c32e85a`

Pass: all 8 RTX builds (Linux/Windows × full/Python-only × cu130/cu132). The gcc-13 cast issue from the prior CI run is fixed by 3e396268d.

Remaining failures (16 total), none caused by this PR:

| Workflow | Failure | Diagnosis |
| --- | --- | --- |
| RTX Python-only runtime tests (×4) | `test_004_weight_streaming.py::test_weight_streaming_manual` (L110) and `::test_weight_streaming_multi_rt` (L203): `assert 183411 == 0`, `assert 38632 <= 1` | TRT rejects `setWeightStreamingBudgetV2` with `Error Code 3: mExecutionContextCounter.use_count() == 1`. Pre-existing: the same two tests failed at the same lines on PR #4337 / main merge. |
| Python-only (non-RTX) runtime tests (×4) | Same `test_004_weight_streaming.py` failures | Same root cause as above. |
| L0 dynamo converter tests (×2) | 5 fails / 1977 pass, all in `test_cumsum_aten.py::TestCumsumConverter::test_cumsum_*`: `RuntimeError: TensorRT build_serialized_network returned None` | TRT-level engine-build issue with cumsum; unrelated to runtime-settings code. |
| L0 torchscript tests (×2) | `TestInputTypeDefaultsFP32Model` worker crashes + `TestTorchTensorRTModule::test_get_layer_info: AssertionError: Key Bindings is missing` | TorchScript backend (deprecated path); this PR doesn't touch TorchScript. Looks like infra/OOM and pre-existing. |
| L2 dynamo compile tests (×2, Windows-only) | `models/test_cross_runtime_serde.py::{test_save_cpp_load_python, test_save_python_load_python, test_save_python_load_cpp}`: `AssertionError: C++ runtime should be disabled` | Pre-existing Windows test-infra bug: `_cross_runtime_load_helper.py:_hide_so_files()` globs Linux-only `libtorchtrt`, missing Windows `torchtrt.dll`. |

Will rebase/retry once these pre-existing failures are tracked separately. Happy to dig further into any of them on request.
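For the Windows-only row, one possible shape of a platform-aware fix is sketched below. It does not reproduce `_hide_so_files()`; `runtime_library_patterns` and `select_runtime_libraries` are hypothetical helpers, and the exact glob patterns are assumptions based only on the `libtorchtrt` / `torchtrt.dll` names mentioned in the diagnosis.

```python
import fnmatch


def runtime_library_patterns(platform: str) -> list[str]:
    """Return glob patterns for the C++ runtime library on a platform.

    Hypothetical helper; `platform` follows the `sys.platform` convention.
    """
    if platform.startswith("win"):
        return ["torchtrt*.dll"]
    return ["libtorchtrt*.so*"]


def select_runtime_libraries(filenames, platform):
    """Keep only the filenames that match the platform's patterns."""
    patterns = runtime_library_patterns(platform)
    return [
        name
        for name in filenames
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)
    ]
```

With a Linux-only pattern the Windows DLL is never selected, so it is never hidden, which would be consistent with the `C++ runtime should be disabled` assertion firing only on Windows.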

narendasan · 2026-06-15T19:13:33Z

+            return self._cache
+
+
+class RuntimeCache:


@cehongwang @tp5uiuc should we be using the opaque types trick to support python only here?

This could be a separate PR, but for now we should at least catch the case where it won't work in Python-only mode, using the @needs_torch_tensorrt_runtime decorator.

Cannot attach a Python-side RuntimeCache (with a live pybind IRuntimeCache) to a C++ runtime engine: the cache would be orphaned. Reconstruct the handle on the C++ side, or serialize/deserialize the cache bytes explicitly.

Looks like the reason why python is disabled is because it doesn't work with the C++ runtime engine. If it is an opaque object, we still cannot link it to the C++ engine right?

Thanks Naren and Adrian. I think Adrian's understanding is right: making _RuntimeCacheHandle an OpaqueBase would only change which layer rejects the cross-runtime case, not whether it's possible. Today, the cpp engine attaches the cache through TRT's C++ API (nvinfer1::IRuntimeConfig::setRuntimeCache) and needs a real nvinfer1::IRuntimeCache*. The torchbind handle holds one minted from cpp's IRuntimeConfig::createRuntimeCache(); the Python handle holds one minted from pybind's IRuntimeConfig.create_runtime_cache(), which is a different nvinfer1::IRuntimeCache* in process memory, owned by the pybind layer.

From what I understand, the torch dispatcher can route objects through op signatures; it can't translate the pybind-owned C++ object to a torchbind-owned one. So an opaque-types refactor would let us pass _RuntimeCacheHandle through a torch.ops.tensorrt.update_runtime_settings signature when both sides are python rt, but the cpp impl behind the op would still need a torchbind RuntimeCacheHandle argument and would still reject the pybind-backed one.

So my intention is to leave the cache as-is in this PR — the current Protocol + explicit guard form is functionally equivalent and cheaper.

Naren: I think part of the confusion, which your comment made clear, comes from having the torchbind_handle argument, which signals that you might be able to mix runtimes. It is not needed, and it may mislead readers into thinking that "having a C++ runtime cache and passing it to a python-only runtime" is a valid strategy. That is not possible today, so I will remove the torchbind_handle argument altogether. The constructor would then read

if torch_tensorrt.ENABLED_FEATURES.torch_tensorrt_runtime:
    self._handle: _RuntimeCacheHandleProtocol = (
        torch.classes.tensorrt.RuntimeCacheHandle(path)
    )
else:
    self._handle = _RuntimeCacheHandle(path=path)

so folks using it won't be any the wiser, and can't easily mix runtimes either. This goes one step further in hardening the APIs against misuse.
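The "Protocol + implicit backing" shape can be illustrated with a self-contained sketch. All class names here are hypothetical stand-ins (no torch or TensorRT import), and the feature flag is passed as a constructor argument purely so the sketch can be exercised without a real runtime; the constructor quoted above reads it from ``ENABLED_FEATURES`` instead.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class CacheHandleProtocol(Protocol):
    """Minimal surface both backings are assumed to share (hypothetical)."""

    def backend(self) -> str: ...


class CppBackedHandle:
    """Stand-in for the torchbind-backed handle."""

    def __init__(self, path: str):
        self.path = path

    def backend(self) -> str:
        return "cpp"


class PythonBackedHandle:
    """Stand-in for the pure-Python handle."""

    def __init__(self, path: str):
        self.path = path

    def backend(self) -> str:
        return "python"


class RuntimeCacheSketch:
    """Picks its backing from one flag; callers cannot inject a handle."""

    def __init__(self, path: str, *, cpp_runtime_enabled: bool):
        # Exactly one backing is reachable per process configuration, so a
        # mixed-runtime cache cannot be constructed through this class.
        self._handle: CacheHandleProtocol = (
            CppBackedHandle(path) if cpp_runtime_enabled else PythonBackedHandle(path)
        )

    @property
    def backend(self) -> str:
        return self._handle.backend()
```

Because there is no handle parameter, the only way to get a different backing is to run in a process with a different runtime configuration.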

Now, about the portability story for the runtime cache: we don't serialize it into the .pt2, and we want folks to attach a new runtime cache in their inference process. Cross-runtime use of the runtime cache works via serialization/deserialization of the cache itself. I did this intentionally because
a) the runtime cache contains kernels which can technically be re-seeded (and hence is not essential)
b) the kernels are usually SM/OS/version etc. specific, and this may corrupt engines compiled on one machine and running on another machine with a different OS

So the workflow today is (assuming machine A is w/ C++ runtime and machine B is Python-only)

Save .pt2 on machine A and load on machine B : engine works & cache does not travel with the artifact.

To get caching, copy cache.bin separately and set mod.runtime_settings = RuntimeSettings(runtime_cache="cache.bin") after load or use the context manager

If no caches are specified, we get a default runtime cache path which will reseed all kernels

I made the torchbind constructor change in af2ab79

I see so the interop has to come from serializing and de-serializing from disk given a path?

That's correct Naren, and is intentional. The API should make it easy I think.

# ─── Machine A: cpp-rt host, compile + populate cache ──────────────────
import torch
import torch_tensorrt as torchtrt
from torch_tensorrt.runtime import RuntimeSettings, runtime_cache


class SmallConvModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 16, 3)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))


model = SmallConvModel().eval().cuda()
inputs = [torch.randn(1, 3, 32, 32, device="cuda")]

# 1. Compile -- no cache yet, the engine just builds.
exp_program = torch.export.export(model, tuple(inputs))
compiled = torchtrt.dynamo.compile(
    exp_program, inputs=inputs, min_block_size=1
)

# 2. Attach a runtime cache via the CM and run inference. The CM
#    populates the cache file under filelock on ``__exit__``.
with runtime_cache(compiled, "/shared/path/cache.bin"):
    _ = compiled(*inputs)
# /shared/path/cache.bin now holds JIT'd kernels for SmallConvModel.

# 3. Save the engine artifact. The .pt2 deliberately does NOT bundle
#    the cache -- kernels are SM/OS/version-specific and bundling them
#    would risk corrupting the engine on a non-matching target.
torchtrt.save(compiled, "/shared/path/model.pt2", inputs=inputs)

And then on machine B (this can be machine A too, but with a different runtime)

# ─── Machine B: any rt (cpp or python-only), load + reuse cache ────────
import torch
import torch_tensorrt as torchtrt
from torch_tensorrt.runtime import runtime_cache

# 1. Load the engine. Works WITHOUT the cache -- kernels just reseed
#    on first inference if no cache is attached.
mod = torchtrt.load("/shared/path/model.pt2")
inputs = [torch.randn(1, 3, 32, 32, device="cuda")]

# 2. Attach + read + (re-)save the cache. The CM loads disk bytes into
#    pending state on enter, runs the engine with the cache attached,
#    and saves any newly JIT'd kernels back to the same file on exit.
with runtime_cache(mod, "/shared/path/cache.bin"):
    _ = mod(*inputs)

If we really need a cross-runtime portability story for the runtime cache, we can add it later on (this would be: access the current C++ runtime cache -> serialize it to an in-memory buffer -> create a new runtime cache for the python runtime -> deserialize from that buffer -> attach to the python runtime).
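A rough sketch of that in-memory handoff follows, using a toy cache class rather than the real torchbind or pybind objects. ``CacheSketch``, its line-based wire format, and ``transfer_cache`` are all hypothetical; the real cache bytes are opaque and produced by TRT.

```python
import io


class CacheSketch:
    """Hypothetical stand-in for a runtime cache owned by one runtime."""

    def __init__(self, owner: str):
        self.owner = owner
        self.kernels: dict[str, bytes] = {}

    def serialize(self) -> bytes:
        # Toy wire format: one "name=hex" entry per line.
        lines = [f"{name}={blob.hex()}" for name, blob in sorted(self.kernels.items())]
        return "\n".join(lines).encode("ascii")

    def deserialize(self, data: bytes) -> None:
        for line in data.decode("ascii").splitlines():
            name, _, hexblob = line.partition("=")
            self.kernels[name] = bytes.fromhex(hexblob)


def transfer_cache(source: CacheSketch, target_owner: str) -> CacheSketch:
    """Copy cache contents across runtimes through an in-memory buffer."""
    buffer = io.BytesIO()
    buffer.write(source.serialize())    # serialize the current cache
    buffer.seek(0)
    target = CacheSketch(target_owner)  # fresh cache for the other runtime
    target.deserialize(buffer.read())   # deserialize from the buffer
    return target                       # caller would attach this to the engine
```

The two caches never share an underlying object; only bytes cross the boundary, which is the same property the on-disk ``cache.bin`` workflow relies on.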

narendasan

Think this is mostly good, just some open questions

``RuntimeCache.__init__`` now picks the backing implicitly based on ``ENABLED_FEATURES.torch_tensorrt_runtime``: the torchbind sibling ``torch.classes.tensorrt.RuntimeCacheHandle(path)`` when the C++ runtime is loaded, the pure-Python ``_RuntimeCacheHandle`` otherwise. The external ``torchbind_handle=`` kwarg is removed -- nothing outside the two pre-existing call sites passed one, and both of those sites only used it as a stash for a torchbind they had just minted from ``path``.

Consequences:

- Call sites in ``_TorchTensorRTModule._resolve_runtime_cache`` and the runtime ``_RuntimeCacheContextManager.__enter__`` collapse from conditional 2-branch construction (mint torchbind vs not) to a single ``RuntimeCache(path=...)`` call. The ``isinstance(engine, TRTEngine)`` peek in the CM goes away with it, along with the now-unused ``TRTEngine`` import.
- The mixed-runtime case ("python-rt-backed ``RuntimeCache`` carrying a live pybind ``IRuntimeCache`` attached to a cpp engine") can no longer arise by construction. Drop the orphan-hazard branches from ``_to_torchbind_handle`` -- any ``RuntimeCache`` constructed in a cpp-rt process is already torchbind-backed, so the function unwraps ``rc._handle`` directly. The associated regression test ``TestToTorchbindHandleOrphanGuard`` is removed (the case it guarded is structurally impossible now).

No behavior change for any normal-flow caller; the construction asymmetry was internal-only.

Replaces the ``*std::cbegin(parts)`` first-element access in ``join_string_views`` with the equivalent ``parts.front()`` member call. Same generated code; reads more directly.

meta-cla Bot added the cla signed label Jun 9, 2026

github-actions Bot requested a review from cehongwang June 9, 2026 22:43

tp5uiuc marked this pull request as draft June 9, 2026 23:00

tp5uiuc commented Jun 9, 2026


Comment thread core/runtime/BUILD Outdated

tp5uiuc self-assigned this Jun 9, 2026

tp5uiuc added 2 commits June 9, 2026 22:59