If Redpanda encounters an error while writing a record to the Iceberg table, it writes the record by default to a separate dead-letter queue (DLQ) Iceberg table named `<topic-name>~dlq`. In `value_schema_id_prefix` and `value_schema_latest` modes, the following can cause errors when translating records to the Iceberg table format:
- Redpanda cannot find the embedded schema ID in the Schema Registry.
- Redpanda fails to translate one or more schema data types to an Iceberg type.
- In `value_schema_id_prefix` mode, you do not use the Schema Registry wire format with the magic byte.
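
The Schema Registry wire format mentioned above frames each value as a magic byte (`0x00`), a four-byte big-endian schema ID, and then the serialized payload. The following Python sketch (not Redpanda code; the schema ID value is illustrative) shows the framing that `value_schema_id_prefix` mode expects:

```python
import struct

MAGIC_BYTE = 0x00  # first byte of every wire-format framed message

def encode_wire_format(schema_id: int, payload: bytes) -> bytes:
    # Magic byte (1 byte) + schema ID (4 bytes, big-endian) + payload.
    return struct.pack(">BI", MAGIC_BYTE, schema_id) + payload

def decode_wire_format(record: bytes) -> tuple:
    # Reject records that were not produced with the wire format.
    if len(record) < 5 or record[0] != MAGIC_BYTE:
        raise ValueError("not Schema Registry wire format")
    (schema_id,) = struct.unpack(">I", record[1:5])
    return schema_id, record[5:]

framed = encode_wire_format(42, b'{"user_id":2324}')
assert decode_wire_format(framed) == (42, b'{"user_id":2324}')
```

A bare value, such as the plain JSON produced in the example below, fails this check, which is why it lands in the DLQ table.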
The DLQ table itself uses the `key_value` schema, consisting of two columns: the record metadata including the key, and a binary column for the record's value.
NOTE: Topic property misconfiguration, such as xref:manage:iceberg/specify-iceberg-schema.adoc#override-value-schema-latest-default[overriding the default behavior of `value_schema_latest` mode] but not specifying the fully qualified Protobuf message name, does not cause records to be written to the DLQ table. Instead, Redpanda pauses the topic data translation to the Iceberg table until you fix the misconfiguration.
=== Inspect DLQ table
You can inspect the DLQ table for records that failed to write to the Iceberg table, and you can take further action on these records, such as transforming and reprocessing them, or debugging issues that occurred upstream.
The following example produces a record to a topic named `ClickEvent` and does not use the Schema Registry wire format that includes the magic byte and schema ID:
[,bash,role=no-copy]
----
echo '"key1" {"user_id":2324,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:23:59.380Z"}' | rpk topic produce ClickEvent --format='%k %v\n'
----

Querying the DLQ table returns the record that was not translated:
[,sql]
----
SELECT
    value
FROM <catalog-name>."ClickEvent~dlq"; -- Fully qualified table name
----

The data is in binary format, and the first byte is not `0x00`, indicating that it was not produced with a schema.
=== Reprocess DLQ records
You can apply a transformation and reprocess the record in your data lakehouse to the original Iceberg table. In this case, you have a JSON value represented as a UTF-8 binary. Depending on your query engine, you might need to decode the binary value first before extracting the JSON fields. Some engines may automatically decode the binary value for you:
.ClickHouse SQL example to reprocess DLQ record
[,sql]
----
SELECT
    CAST(jsonExtractString(json, 'user_id') AS Int32) AS user_id,
    jsonExtractString(json, 'event_type') AS event_type,
    jsonExtractString(json, 'ts') AS ts
FROM (
    SELECT
        CAST(value AS String) AS json
    FROM <catalog-name>.`ClickEvent~dlq` -- Ensure that the table name is properly parsed
)
----

You can now insert the transformed record back into the main Iceberg table. Redpanda recommends employing a strategy for exactly-once processing to avoid duplicates when reprocessing records.
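
One way to approximate exactly-once reprocessing is to key each DLQ record by its source partition and offset, and skip identities that an earlier attempt already wrote. A minimal sketch (the record fields and the `insert_fn` callback are hypothetical, not a Redpanda API):

```python
def reprocess_dlq(dlq_records, processed_ids, insert_fn):
    # processed_ids is a set of (partition, offset) pairs persisted alongside
    # the target table, so a retried job skips rows already inserted.
    inserted = 0
    for rec in dlq_records:
        ident = (rec["partition"], rec["offset"])
        if ident in processed_ids:
            continue  # duplicate delivery: already written to the main table
        insert_fn(rec)  # write the transformed row to the main Iceberg table
        processed_ids.add(ident)
        inserted += 1
    return inserted

seen = set()
out = []
records = [
    {"partition": 0, "offset": 5, "value": "a"},
    {"partition": 0, "offset": 5, "value": "a"},  # duplicate delivery
    {"partition": 1, "offset": 9, "value": "b"},
]
assert reprocess_dlq(records, seen, out.append) == 2
```

Rerunning the same job with the persisted `processed_ids` set inserts nothing new, which is the property you want from a retry.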
=== Drop invalid records
ifndef::env-cloud[]
To disable the default behavior and drop an invalid record, set the xref:reference:properties/topic-properties.adoc#redpanda-iceberg-invalid-record-action[`redpanda.iceberg.invalid.record.action`] topic property to `drop`. You can also configure the default cluster-wide behavior for invalid records by setting the `iceberg_invalid_record_action` property.
endif::[]
ifdef::env-cloud[]
To disable the default behavior and drop an invalid record, set the `redpanda.iceberg.invalid.record.action` topic property to `drop`. You can also configure the default cluster-wide behavior for invalid records by setting the `iceberg_invalid_record_action` property.
endif::[]
== Performance considerations
When you enable Iceberg for any substantial workload and start translating topic data to the Iceberg format, you may see your cluster's CPU utilization increase significantly. If this additional workload overwhelms the brokers and causes the Iceberg table lag to exceed the configured target lag, Redpanda automatically applies backpressure to producers to prevent the tables from lagging further. This keeps Iceberg tables caught up with the volume of incoming data, but at the cost of the cluster's ingress throughput.
You may need to increase the size of your Redpanda cluster to accommodate the additional workload. To ensure that your cluster is sized appropriately, contact the Redpanda Customer Success team.
=== Use custom partitioning
ifndef::env-cloud[]
To improve query performance, consider implementing custom https://iceberg.apache.org/docs/nightly/partitioning/[partitioning^] for the Iceberg topic. Use the xref:reference:properties/topic-properties.adoc#redpanda-iceberg-partition-spec[`redpanda.iceberg.partition.spec`] topic property to define the partitioning scheme:
endif::[]
ifdef::env-cloud[]
To improve query performance, consider implementing custom https://iceberg.apache.org/docs/nightly/partitioning/[partitioning^] for the Iceberg topic. Use the `redpanda.iceberg.partition.spec` topic property to define the partitioning scheme:
endif::[]
[,bash]
----
# Create a new topic with five topic partitions, replication factor 3, and custom table partitioning for Iceberg
rpk topic create ClickEvent -p 5 -r 3 \
  -c redpanda.iceberg.mode=value_schema_latest \
  -c redpanda.iceberg.partition.spec='(<partition-key>)'
----

Valid `<partition-key>` values include a source column name or a transformation of a column. The columns referenced can be Redpanda-defined (such as `redpanda.timestamp`) or user-defined based on a schema that you register for the topic. The Iceberg table stores records that share different partition key values in separate files based on this specification.
For example:
* To partition the table by a single key, such as a column `col1`, use: `redpanda.iceberg.partition.spec=(col1)`.
* To partition by multiple columns, use a comma-separated list: `redpanda.iceberg.partition.spec=(col1, col2)`.
* To partition by the year of a timestamp column `ts1`, and a string column `col1`, use: `redpanda.iceberg.partition.spec=(year(ts1), col1)`.
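
To see why records that share the same transformed key land together, the grouping behind a `(year(ts1), col1)` spec can be illustrated with a toy sketch (plain Python, not Redpanda or Iceberg code; Iceberg's actual `year` transform stores years since 1970 rather than a calendar year):

```python
from collections import defaultdict
from datetime import datetime

def partition_key(record):
    # (year(ts1), col1): apply the year transform to ts1, keep col1 as-is.
    return (datetime.fromisoformat(record["ts1"]).year, record["col1"])

records = [
    {"ts1": "2024-11-25T20:23:59", "col1": "BUTTON_CLICK"},
    {"ts1": "2024-03-02T08:00:00", "col1": "BUTTON_CLICK"},
    {"ts1": "2023-01-10T12:30:00", "col1": "PAGE_VIEW"},
]

groups = defaultdict(list)  # each key maps to its own set of data files
for r in records:
    groups[partition_key(r)].append(r)

assert sorted(groups) == [(2023, "PAGE_VIEW"), (2024, "BUTTON_CLICK")]
```

Both 2024 records share one partition key, so a query filtered to that year and event type can skip the other partition's files entirely.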
To learn more about how partitioning schemes can affect query performance, and for details on the partitioning specification such as allowed transforms, see the https://iceberg.apache.org/spec/#partitioning[Apache Iceberg documentation^].
[TIP]
====
* Partition by columns that you frequently use in queries. Columns with relatively few unique values, also known as low cardinality, are also good candidates for partitioning.
* If you must partition based on columns with high cardinality, for example timestamps, use Iceberg's available transforms such as extracting the year, month, or day to avoid creating too many partitions. Too many partitions can be detrimental to performance because more files need to be scanned and managed.
====
=== Avoid high column count
A high column count or schema field count adds overhead when translating topics to the Iceberg table format, and small message sizes can also increase CPU utilization. To minimize the performance impact on your cluster, keep column counts low and message sizes large for Iceberg topics.
== Next steps
* xref:manage:iceberg/use-iceberg-catalogs.adoc[]
* xref:manage:iceberg/migrate-to-iceberg-topics.adoc[Migrate existing Iceberg integrations to Iceberg Topics]
* xref:manage:iceberg/iceberg-performance-tuning.adoc[Tune Performance for Iceberg Topics]