HIVE-29341:Iceberg: [V3] NULL initial default for STRUCT column by ramitg254 · Pull Request #6565 · apache/hive

ramitg254 · 2026-06-25T12:47:26Z

What changes were proposed in this pull request?

support initial default for struct

Why are the changes needed?

currently initial default specified for struct isn't get read and stays null.

Does this PR introduce any user-facing change?

yes initial default values will be seen in previous rows

How was this patch tested?

unit tests and ci results

ayushtkn · 2026-06-25T14:27:13Z

+public enum DefaultMetadataPolicy {
+  WRITE_DEFAULT,
+  WRITE_AND_INITIAL_DEFAULT
+}


I don't think you need this at all, For ORC InitialDefault will be null and the end result will be same, at best we can directly pass isOrcTable if required. I won't though

yes, corrected it avoided passing this argument and applied initial defaults to struct via a method created in HiveSchemaUtil.java

(Earlier, I was trying to apply intial defualts directly in convert internal step so introduced this enum because passing isOrc boolean value as a method argument to the existing methods doesn't seem right so introduced this new argument to describe the purpose of adding this argument more clearly )

Change-Id: Iefec29f3d2e320cd94e5944651d65e85471b5cb8

Change-Id: Id2df9e65a64f876600882acffb6df2f628559e1e

ayushtkn

thanx @ramitg254 for the fix, have dropped some comments

ayushtkn · 2026-07-01T01:33:56Z

+    if (field.initialDefaultLiteral() == null && field.writeDefaultLiteral() != null) {
+      builder.withInitialDefault(field.writeDefaultLiteral());
+    }


why are we using writeDefault when initialDefault isn't there?

this method is only required for the sake of only cloning the type in case of adding column as mentioned in https://github.com/apache/hive/pull/6565/changes/BASE..c96b0695f3a0540b568be3cf1c59e59687260593#r3503931068
when write default will always be same as the initial default

Doesn't make sense to me at all and anybody seeing in future will think of it is as a bug only. that we landed up setting writeDefault in Initial Default

ayushtkn · 2026-07-01T01:36:11Z

+   * Backfills struct column that is null on read using nested {@code initialDefault} metadata.
+   * This applies to rows written before {@code ADD COLUMNS} added the struct.
+   * Spec allows struct defaults as {@code {}} (see https://iceberg.apache.org/spec/#default-values), but
+   * {@code UpdateSchema} add column only supports primitives today;


add column only supports primitives today;

Whaat? Then how come we added struct column & the entire initialDefault initiative here?

I meant here was underlying primitives of struct, corrected the comment

ayushtkn · 2026-07-01T01:43:51Z

+   */
+  public static void backfillStructInitialDefaults(Record iceRecord, List<Types.NestedField> columns) {
+    for (Types.NestedField field : columns) {
+      if (field.type().isStructType() && iceRecord.getField(field.name()) == null) {


can we use fieldId here? I am doubting the case where there was a field with same name which was dropped earlier and then added again as struct.

this works fine in the case you mentioned because the columns passed here are readSchema.columns() which is the current schema columns and chose field.name() over field.id() as Record has direct getter to find field value via get getField which is not the case with fieldId we need loop through all columns and match id to find the field itself,
also I have optimized it please check the latest one.
and verifed as well via these set of queries showing correct results:

set hive.vectorized.execution.enabled=false; CREATE TABLE ice_drop_read ( id INT) STORED BY ICEBERG stored as parquet TBLPROPERTIES ('format-version'='3'); INSERT INTO ice_drop_read (id) VALUES (1); ALTER TABLE ice_drop_read ADD COLUMNS (meta INT DEFAULT 99); INSERT INTO ice_drop_read (id, meta) VALUES (2, 10); ALTER TABLE ice_drop_read DROP COLUMN meta; ALTER TABLE ice_drop_read ADD COLUMNS (meta STRUCT<x:INT, y:INT> DEFAULT '{"x":10,"y":20}'); INSERT INTO ice_drop_read (id) VALUES (3); SELECT * FROM ice_drop_read ORDER BY id;

ayushtkn · 2026-07-01T01:46:01Z

  }
+
+  @Test
+  void testBackfillStructInitialDefaults() {


these tests aren't required, we can have the q test, that is enough

done, removed

ayushtkn · 2026-07-01T01:47:04Z

+    boolean needsBackfill = readSchema.columns().stream()
+        .filter(field -> field.type().isStructType())
+        .anyMatch(field -> !HiveSchemaUtil.getStructInitialDefaults(field.type().asStructType()).isEmpty());
+    if (!needsBackfill) {
+      return iterable;
+    }
+    return CloseableIterable.transform(iterable, row -> {
+      if (row instanceof Record curIceRecord) {
+        HiveSchemaUtil.backfillStructInitialDefaults(curIceRecord, readSchema.columns());
+      }
+      return row;
+    });


How many times are we walking over the schema here? This is pretty confusing, not sure what you are trying to achieve here.

For needsBackfill you go through the schema, call HiveSchemaUtil.getStructInitialDefaults populate the enitre map & then throw it?

couple of redundancy I introduced there, optimized it please check https://github.com/apache/hive/pull/6565/changes/BASE..c96b0695f3a0540b568be3cf1c59e59687260593#r3504074084

ayushtkn · 2026-07-01T01:48:34Z

+    if (!needsBackfill) {
+      return iterable;
+    }
+    return CloseableIterable.transform(iterable, row -> {


The initialDefault value will be same for all the records, missing the field. buildStructWithInitialDefaults recursively parses the schema and calls convertToWriteType() to build the default record from scratch for every single missing struct on every single row. For a table with millions of rows, this continuous object creation and recursive schema parsing will severely degrade read performance.

my bad, optimized it to build default struct only once and then set the field of the row record for that particular field

ayushtkn · 2026-07-01T02:02:45Z

-      Type type = HiveSchemaUtil.convert(TypeInfoUtils.getTypeInfoFromTypeString(addedCol.getType()), defaultValue);
-      Literal<Object> defaultVal = Optional.ofNullable(defaultValue).filter(v -> !type.isStructType())
-          .map(v -> Expressions.lit(HiveSchemaUtil.getDefaultValue(v, type))).orElse(null);
-
+      Type baseType = HiveSchemaUtil.convert(TypeInfoUtils.getTypeInfoFromTypeString(addedCol.getType()), defaultValue);
+      final Type resolvedType = (!isORc && baseType.isStructType()) ?
+          HiveSchemaUtil.applyInitialDefaultsToStruct(baseType) :
+          baseType;
+      Literal<Object> defaultVal = Optional.ofNullable(defaultValue).filter(v -> !resolvedType.isStructType())
+          .map(v -> Expressions.lit(HiveSchemaUtil.getDefaultValue(v, resolvedType))).orElse(null);
      // ORC doesn't have support for initialDefault from iceberg layer, we only need to set default for writeDefault.
-      updateSchema.addColumn(addedCol.getName(), type, addedCol.getComment(), isORc ? null : defaultVal);
+      updateSchema.addColumn(addedCol.getName(), resolvedType, addedCol.getComment(), isORc ? null : defaultVal);


I am not catching what are u trying to achieve here? This used to have the default

Type type = HiveSchemaUtil.convert(TypeInfoUtils.getTypeInfoFromTypeString(addedCol.getType()), defaultValue);

Maybe just adding initialDefault along with writeDefault should have solved the problem.

As of now it only sets writeDefault while converting the schema

Type type = HiveSchemaUtil.convert(TypeInfoUtils.getTypeInfoFromTypeString(addedCol.getType()), defaultValue);

You are doing parsing and all multiple times

yes that was the earlier approach I was using but isOrc argument needs to be passed there in convert to determine initial default needs to be set or not for inner struct fields which doesn't looks good as it will introduce concern "which metadata should be attached while building column type" which is not objective of convert , so building a copy of type already built with the already present metadata and initial defaults of underlying fields if col is of struct type and isOrc is false

this approach ain't nice either, it looks overcomplicated to me

one thing that would be nicer if it was possible is

set initial defaults in

hive/iceberg/iceberg-catalog/src/main/java/org/apache/iceberg/hive/HiveSchemaConverter.java

Line 88 in ada3134

if (icebergDefaultValue != null) {

which is underlying method where convert falls into

and in handleAddColumns do something like:

Type type = HiveSchemaUtil.convert(TypeInfoUtils.getTypeInfoFromTypeString(addedCol.getType()), isORc ? null : defaultValue);

and later after adding column via updateSchema.addColumn would have updated it's default via updateSchema.updateColumnDefault which would only update write default keeping the initial default as null as we specified in type in case of orc with default value.

but this is not possible because when we use updateSchema.updateColumnDefault after updateSchema.addColumn in a single transaction it is not able to find nested field in case of struct, verified via using already present handleDefaultValues method which recursively checks nested fields and is already used in case of updating default for handleChangeColumn
so this is not possible in a single transaction

so alternatives we have are :

set initial default in convert step as mentioned and then later on in case isOrc is true and defaultValue!=null as well then explicitly set initial default value as null for those fields which will trigger a type rebuild as well

pass isOrc as a parameter in convert and in the underlying method we can add the conditional to set initial default based on it is true or false

@ayushtkn WDYT?

I will go with 2, and not call it isORC, shouldAddInitialDefault, the value for which will be isOrc

ayushtkn · 2026-07-01T02:10:00Z

    return hasAnyDefault ? nestedRecord : null;
  }

+  private static Record buildStructWithInitialDefaults(Types.StructType structType) {


this is identical to buildStructWithDefaults just the intialDefault vs writeDefault, refactor!!!!

done merged to a single method

ayushtkn · 2026-07-01T02:22:28Z

+    StructColumnVector structCol = (StructColumnVector) col;
+    List<String> fieldNames = structTypeInfo.getAllStructFieldNames();
+    List<TypeInfo> fieldTypes = structTypeInfo.getAllStructFieldTypeInfos();
+    Map<String, Object> fieldDefaults = defaultValue instanceof Map ? (Map<String, Object>) defaultValue : null;


why not an empty map rather than null, else you are checking fieldDefaults != null in the entire loop

ayushtkn · 2026-07-01T02:24:27Z

+    col.isRepeating = true;
+    col.noNulls = true;
+    col.isNull[0] = false;
+
+    if (typeInfo.getCategory() == ObjectInspector.Category.PRIMITIVE) {
+      fillPrimitive(col, (PrimitiveTypeInfo) typeInfo, fieldDefault);
+    } else if (typeInfo.getCategory() == ObjectInspector.Category.STRUCT) {
+      fillStruct(col, (StructTypeInfo) typeInfo, fieldDefault);


this is getting duped with the above // Case 2: We have a default → fill with constant value refactor rather than duplicating

done, refractored

Change-Id: If7fc697834eda9924177d648a9e96e8e149d7c7d

ayushtkn · 2026-07-02T17:17:33Z


  @Test
-  public void testSimpleSchemaConvertToIcebergSchema() {
+  void testSimpleSchemaConvertToIcebergSchema() {


these changes are out of scope

Change-Id: Ia7939110bf55576cef6f7cdca329bc49454dfd48

sonarqubecloud · 2026-07-02T21:17:17Z

Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

ayushtkn

Thanx @ramitg254 for the work here.
Changes LGTM. There is one spotless warning around line length, I believe that can be fixed

asf-ci-hive added tests pending tests failed and removed tests pending labels Jun 25, 2026

ayushtkn reviewed Jun 25, 2026

View reviewed changes

ramitg254 force-pushed the HIVE-29341 branch from a407016 to b47603c Compare June 25, 2026 16:53

asf-ci-hive added tests pending tests failed and removed tests failed tests pending labels Jun 25, 2026

ramitg254 changed the title ~~HIVE-29341:Iceberg: [V3] NULL initial default for STRUCT column~~ [WIP]HIVE-29341:Iceberg: [V3] NULL initial default for STRUCT column Jun 26, 2026

asf-ci-hive added tests unstable and removed tests pending labels Jun 26, 2026

ramitg254 force-pushed the HIVE-29341 branch from b47603c to c5432c8 Compare June 29, 2026 06:55

asf-ci-hive added tests pending and removed tests unstable labels Jun 29, 2026

HIVE-29341:Iceberg: [V3] NULL initial default for STRUCT column

6751ec5

Change-Id: Iefec29f3d2e320cd94e5944651d65e85471b5cb8

ramitg254 force-pushed the HIVE-29341 branch from c5432c8 to 6751ec5 Compare June 29, 2026 06:56

asf-ci-hive added tests failed tests pending and removed tests pending tests failed labels Jun 29, 2026

ramitg254 requested a review from ayushtkn June 29, 2026 07:05

ramitg254 changed the title ~~[WIP]HIVE-29341:Iceberg: [V3] NULL initial default for STRUCT column~~ HIVE-29341:Iceberg: [V3] NULL initial default for STRUCT column Jun 29, 2026

asf-ci-hive added tests unstable tests pending and removed tests pending tests unstable labels Jun 29, 2026

refractor and early return

c96b069

Change-Id: Id2df9e65a64f876600882acffb6df2f628559e1e

asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels Jun 30, 2026

ayushtkn reviewed Jul 1, 2026

View reviewed changes

asf-ci-hive added tests pending tests failed and removed tests passed tests pending labels Jul 1, 2026

ramitg254 force-pushed the HIVE-29341 branch from 5124e25 to 402e933 Compare July 1, 2026 08:34

refractored and optimized

50ca727

Change-Id: If7fc697834eda9924177d648a9e96e8e149d7c7d

ramitg254 force-pushed the HIVE-29341 branch from 402e933 to 50ca727 Compare July 1, 2026 08:55

asf-ci-hive added tests pending and removed tests failed labels Jul 1, 2026

ramitg254 requested a review from ayushtkn July 1, 2026 09:45

asf-ci-hive added tests unstable tests pending tests passed and removed tests pending tests unstable labels Jul 1, 2026

ayushtkn reviewed Jul 2, 2026

View reviewed changes

refractor

f196c9c

Change-Id: Ia7939110bf55576cef6f7cdca329bc49454dfd48

asf-ci-hive added tests pending and removed tests passed labels Jul 2, 2026

asf-ci-hive added tests unstable and removed tests pending labels Jul 2, 2026

ayushtkn approved these changes Jul 3, 2026

View reviewed changes

Uh oh!

Conversation

ramitg254 commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ayushtkn left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ayushtkn Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ramitg254 Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ramitg254 Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud Bot commented Jul 2, 2026

Quality Gate passed

Uh oh!

ayushtkn left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

ramitg254 commented Jun 25, 2026 •

edited

Loading

ayushtkn Jul 1, 2026 •

edited

Loading

ramitg254 Jul 1, 2026 •

edited

Loading

ramitg254 Jul 2, 2026 •

edited

Loading