fix: handle HTTP 413 by splitting and retrying in OTLP HTTP exporters #5032
Krishnachaitanyakc wants to merge 9 commits into open-telemetry:main
Conversation
Do other languages' OTLP HTTP exporters do something similar? It'd be good to see how other languages handle this.
I checked how other OTel language SDKs handle HTTP 413.
OTLP Specification (otlp/#failures): Only 429, 502, 503, and 504 are listed as retryable. 413 is not mentioned at all, and the spec says "All other 4xx or 5xx response status codes MUST NOT be retried."
No other SDK currently handles 413 with batch splitting. This PR would make Python the first to implement this recovery strategy. Also, the spec says 4xx codes "MUST NOT be retried", but splitting and retrying with a smaller payload is not the same as retrying the same request. It's a distinct recovery strategy. The current behavior in all SDKs is to silently drop the entire batch, which causes data loss.
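To make the distinction between "retry the same request" and "split and send smaller requests" concrete, here is a minimal Python sketch. The helper and constant names are hypothetical and are not the exporter's API; only the status codes come from the spec text quoted above.

```python
from http import HTTPStatus

# Status codes the OTLP spec lists as retryable (as quoted in the comment above).
RETRYABLE_STATUS_CODES = frozenset({429, 502, 503, 504})


def classify_response(status_code: int, batch_len: int) -> str:
    """Return "ok", "retry", "split", or "drop" for an export response.

    Hypothetical helper for illustration only.
    """
    if 200 <= status_code < 300:
        return "ok"
    if status_code in RETRYABLE_STATUS_CODES:
        return "retry"  # resend the same payload after backoff
    if status_code == HTTPStatus.REQUEST_ENTITY_TOO_LARGE and batch_len > 1:
        return "split"  # send two smaller payloads instead of the same one
    return "drop"  # every other 4xx/5xx response is not retried
```

Under this sketch a 413 for a single-item batch still ends in "drop", since there is nothing left to split.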
Thank you for starting this, @Krishnachaitanyakc, and for checking the spec and other instrumentors. The scope of #4533 is vague, especially given the current state of OTel Python's OTLP HTTP vs gRPC, span/metrics/logs export. I'm going to comment there.
Force-pushed from a3b2ef6 to 322b31b
If we decide we want to support this, we should make it opt-in only since it's not part of the spec. We can do this with an experimental environment variable or a private keyword-only argument.
…rying When a backend returns HTTP 413 (Payload Too Large), the trace and log exporters now split the batch in half and recursively retry each half. This prevents silent data loss when batch sizes exceed backend limits. The splitting includes deadline guards to prevent infinite recursion, short-circuits on first-half failure to avoid wasting time on the second half, and drops individual items that are genuinely too large. Fixes open-telemetry#4533
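The split-and-retry flow described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `export_with_split` and the `send` callable (which posts a batch and returns the HTTP status code) are hypothetical names, and ordinary retry/backoff handling is left out.

```python
import time
from typing import Callable, Sequence

PAYLOAD_TOO_LARGE = 413


def export_with_split(
    items: Sequence[object],
    send: Callable[[Sequence[object]], int],
    deadline: float,
) -> bool:
    """Send items; on 413, halve the batch and try each half in turn.

    `send` is a hypothetical callable that POSTs a batch and returns the
    HTTP status code. Returns True only if every item was accepted.
    """
    if not items:
        return True
    if time.monotonic() >= deadline:
        return False  # deadline guard: stop recursing once time is up
    status = send(items)
    if 200 <= status < 300:
        return True
    if status != PAYLOAD_TOO_LARGE:
        return False  # other failures are not handled in this sketch
    if len(items) == 1:
        return False  # a single item that is still too large is dropped
    mid = len(items) // 2
    # Short-circuit: skip the second half if the first half fails.
    return export_with_split(items[:mid], send, deadline) and export_with_split(
        items[mid:], send, deadline
    )
```

Because `and` short-circuits, a failure in the first half means the second half is never sent, which matches the behavior the commit message describes.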
- Add CHANGELOG.md entry for the 413 splitting feature
- Apply ruff format to source files (line wrapping adjustments)
- Rename loop variable 'i' to 'idx' to satisfy pylint naming convention
Relax assertAlmostEqual tolerance from 2 decimal places (0.005) to 1 (0.05) in timeout tests. The _export_batch refactoring adds a serialization step between deadline calculation and the HTTP POST, consuming a few extra milliseconds that exceed the tight tolerance on slow runtimes like PyPy on Windows.
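For readers unsure where 0.005 and 0.05 come from: `assertAlmostEqual` rounds the absolute difference to `places` decimal places and compares the result with zero. The sketch below mirrors that check; the helper name and the 20 ms figure are made up for illustration.

```python
def almost_equal(first: float, second: float, places: int) -> bool:
    """Mirror the check unittest's assertAlmostEqual performs for `places`."""
    return round(abs(first - second), places) == 0


# Hypothetical measurement: the POST saw 20 ms less timeout than configured.
configured, observed = 1.0, 0.98
print(almost_equal(configured, observed, places=2))  # False
print(almost_equal(configured, observed, places=1))  # True
```

A 20 ms drift therefore fails at 2 places but passes at 1, which is the kind of slack the relaxed tolerance provides.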
…l flow
- Add _MAX_BISECTS=5 to cap recursive splitting depth
- Combine 413 guard with len>1 and remaining_bisects>0 check so single-item 413 falls through to the existing non-retryable path
- Check self._shutdown alongside deadline in the 413 handler
- Add tests for max bisect depth exhaustion and shutdown during 413
Add pylint disable comment, matching the pattern used in test_otlp_metrics_exporter.py.
The HTTP 413 payload splitting behavior is not part of the OpenTelemetry specification. Gate it behind the experimental environment variable OTEL_PYTHON_EXPERIMENTAL_OTLP_RETRY_ON_413 (must be set to "true" to enable). When unset, 413 responses are treated as non-retryable errors. Also refactors the control flow per review feedback: the bisectable flag is computed alongside retryable, checked after the retry-exit block, and the splitting logic is moved to after line 257 in the original code.
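A minimal sketch of the opt-in gate, assuming the variable is read from the process environment. The function name is hypothetical, and the whitespace trimming and case-insensitive comparison are assumptions; the commit message only says the value must be "true".

```python
import os
from typing import Mapping

_RETRY_ON_413_ENV = "OTEL_PYTHON_EXPERIMENTAL_OTLP_RETRY_ON_413"


def retry_on_413_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the experimental variable is set to "true".

    Hypothetical helper. Trimming and lower-casing the value is an
    assumption made for this sketch.
    """
    return environ.get(_RETRY_ON_413_ENV, "").strip().lower() == "true"
```

Passing the environment as a parameter keeps the check easy to test without mutating `os.environ`.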
- Add pylint disable for too-many-public-methods on TestOTLPSpanExporter (21 methods exceeds limit of 20, matching existing log exporter fix)
- Reset DuplicateFilter state between log exporter tests via setUp() to prevent log suppression from 413 tests bleeding into test_export_no_collector_available
Force-pushed from 680a86d to ee1a1d8
Summary
When a backend returns HTTP 413 (Payload Too Large), the OTLP HTTP trace and log exporters now split the batch in half and recursively retry each half, preventing silent data loss when batch sizes exceed backend limits.
Fixes #4533
Changes
- _is_payload_too_large() helper in _common/__init__.py
- export() delegates to _export_batch() in both trace and log exporters
- _export_batch() handles 413 responses with binary splitting
Notes
- … max_export_batch_size and _split_metrics_data(). Reactive 413 handling for metrics is deferred to a follow-up since metric data has a nested protobuf structure that requires different splitting logic.
- … RESOURCE_EXHAUSTED) and would need separate handling in a future PR.
Test plan