Skip to content

HIVE-29341:Iceberg: [V3] NULL initial default for STRUCT column#6565

Open
ramitg254 wants to merge 5 commits into
apache:masterfrom
ramitg254:HIVE-29341
Open

HIVE-29341:Iceberg: [V3] NULL initial default for STRUCT column#6565
ramitg254 wants to merge 5 commits into
apache:masterfrom
ramitg254:HIVE-29341

Conversation

@ramitg254

@ramitg254 ramitg254 commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

support initial default for struct

Why are the changes needed?

currently initial default specified for struct isn't get read and stays null.

Does this PR introduce any user-facing change?

yes initial default values will be seen in previous rows

How was this patch tested?

unit tests and ci results

Comment on lines +29 to +32
public enum DefaultMetadataPolicy {
WRITE_DEFAULT,
WRITE_AND_INITIAL_DEFAULT
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this at all, For ORC InitialDefault will be null and the end result will be same, at best we can directly pass isOrcTable if required. I won't though

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, corrected it avoided passing this argument and applied initial defaults to struct via a method created in HiveSchemaUtil.java

(Earlier, I was trying to apply intial defualts directly in convert internal step so introduced this enum because passing isOrc boolean value as a method argument to the existing methods doesn't seem right so introduced this new argument to describe the purpose of adding this argument more clearly )

@ramitg254 ramitg254 changed the title HIVE-29341:Iceberg: [V3] NULL initial default for STRUCT column [WIP]HIVE-29341:Iceberg: [V3] NULL initial default for STRUCT column Jun 26, 2026
Change-Id: Iefec29f3d2e320cd94e5944651d65e85471b5cb8
@ramitg254 ramitg254 requested a review from ayushtkn June 29, 2026 07:05
@ramitg254 ramitg254 changed the title [WIP]HIVE-29341:Iceberg: [V3] NULL initial default for STRUCT column HIVE-29341:Iceberg: [V3] NULL initial default for STRUCT column Jun 29, 2026
Change-Id: Id2df9e65a64f876600882acffb6df2f628559e1e

@ayushtkn ayushtkn left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanx @ramitg254 for the fix, have dropped some comments

Comment on lines +168 to +170
if (field.initialDefaultLiteral() == null && field.writeDefaultLiteral() != null) {
builder.withInitialDefault(field.writeDefaultLiteral());
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we using writeDefault when initialDefault isn't there?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is only required for the sake of only cloning the type in case of adding column as mentioned in https://github.com/apache/hive/pull/6565/changes/BASE..c96b0695f3a0540b568be3cf1c59e59687260593#r3503931068
when write default will always be same as the initial default

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't make sense to me at all and anybody seeing in future will think of it is as a bug only. that we landed up setting writeDefault in Initial Default

* Backfills struct column that is null on read using nested {@code initialDefault} metadata.
* This applies to rows written before {@code ADD COLUMNS} added the struct.
* Spec allows struct defaults as {@code {}} (see https://iceberg.apache.org/spec/#default-values), but
* {@code UpdateSchema} add column only supports primitives today;

@ayushtkn ayushtkn Jul 1, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add column only supports primitives today;

Whaat? Then how come we added struct column & the entire initialDefault initiative here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant here was underlying primitives of struct, corrected the comment

*/
public static void backfillStructInitialDefaults(Record iceRecord, List<Types.NestedField> columns) {
for (Types.NestedField field : columns) {
if (field.type().isStructType() && iceRecord.getField(field.name()) == null) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use fieldId here? I am doubting the case where there was a field with same name which was dropped earlier and then added again as struct.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this works fine in the case you mentioned because the columns passed here are readSchema.columns() which is the current schema columns and chose field.name() over field.id() as Record has direct getter to find field value via get getField which is not the case with fieldId we need loop through all columns and match id to find the field itself,
also I have optimized it please check the latest one.
and verifed as well via these set of queries showing correct results:

set hive.vectorized.execution.enabled=false;
CREATE TABLE ice_drop_read (
  id INT)
STORED BY ICEBERG stored as parquet
TBLPROPERTIES ('format-version'='3');
INSERT INTO ice_drop_read (id) VALUES (1);
ALTER TABLE ice_drop_read ADD COLUMNS (meta INT DEFAULT 99);
INSERT INTO ice_drop_read (id, meta) VALUES (2, 10);
ALTER TABLE ice_drop_read DROP COLUMN meta;
ALTER TABLE ice_drop_read ADD COLUMNS (meta STRUCT<x:INT, y:INT> DEFAULT '{"x":10,"y":20}');
INSERT INTO ice_drop_read (id) VALUES (3);
SELECT * FROM ice_drop_read ORDER BY id;

}

@Test
void testBackfillStructInitialDefaults() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests aren't required, we can have the q test, that is enough

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, removed

Comment on lines +181 to +192
boolean needsBackfill = readSchema.columns().stream()
.filter(field -> field.type().isStructType())
.anyMatch(field -> !HiveSchemaUtil.getStructInitialDefaults(field.type().asStructType()).isEmpty());
if (!needsBackfill) {
return iterable;
}
return CloseableIterable.transform(iterable, row -> {
if (row instanceof Record curIceRecord) {
HiveSchemaUtil.backfillStructInitialDefaults(curIceRecord, readSchema.columns());
}
return row;
});

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many times are we walking over the schema here? This is pretty confusing, not sure what you are trying to achieve here.

For needsBackfill you go through the schema, call HiveSchemaUtil.getStructInitialDefaults populate the enitre map & then throw it?

@ramitg254 ramitg254 Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (!needsBackfill) {
return iterable;
}
return CloseableIterable.transform(iterable, row -> {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initialDefault value will be same for all the records, missing the field. buildStructWithInitialDefaults recursively parses the schema and calls convertToWriteType() to build the default record from scratch for every single missing struct on every single row. For a table with millions of rows, this continuous object creation and recursive schema parsing will severely degrade read performance.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad, optimized it to build default struct only once and then set the field of the row record for that particular field

Comment on lines +747 to +754
Type type = HiveSchemaUtil.convert(TypeInfoUtils.getTypeInfoFromTypeString(addedCol.getType()), defaultValue);
Literal<Object> defaultVal = Optional.ofNullable(defaultValue).filter(v -> !type.isStructType())
.map(v -> Expressions.lit(HiveSchemaUtil.getDefaultValue(v, type))).orElse(null);

Type baseType = HiveSchemaUtil.convert(TypeInfoUtils.getTypeInfoFromTypeString(addedCol.getType()), defaultValue);
final Type resolvedType = (!isORc && baseType.isStructType()) ?
HiveSchemaUtil.applyInitialDefaultsToStruct(baseType) :
baseType;
Literal<Object> defaultVal = Optional.ofNullable(defaultValue).filter(v -> !resolvedType.isStructType())
.map(v -> Expressions.lit(HiveSchemaUtil.getDefaultValue(v, resolvedType))).orElse(null);
// ORC doesn't have support for initialDefault from iceberg layer, we only need to set default for writeDefault.
updateSchema.addColumn(addedCol.getName(), type, addedCol.getComment(), isORc ? null : defaultVal);
updateSchema.addColumn(addedCol.getName(), resolvedType, addedCol.getComment(), isORc ? null : defaultVal);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not catching what are u trying to achieve here? This used to have the default

 Type type = HiveSchemaUtil.convert(TypeInfoUtils.getTypeInfoFromTypeString(addedCol.getType()), defaultValue);

Maybe just adding initialDefault along with writeDefault should have solved the problem.

As of now it only sets writeDefault while converting the schema

 Type type = HiveSchemaUtil.convert(TypeInfoUtils.getTypeInfoFromTypeString(addedCol.getType()), defaultValue);

You are doing parsing and all multiple times

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that was the earlier approach I was using but isOrc argument needs to be passed there in convert to determine initial default needs to be set or not for inner struct fields which doesn't looks good as it will introduce concern "which metadata should be attached while building column type" which is not objective of convert , so building a copy of type already built with the already present metadata and initial defaults of underlying fields if col is of struct type and isOrc is false

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this approach ain't nice either, it looks overcomplicated to me

@ramitg254 ramitg254 Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one thing that would be nicer if it was possible is

Type type = HiveSchemaUtil.convert(TypeInfoUtils.getTypeInfoFromTypeString(addedCol.getType()),
          isORc ? null : defaultValue);

and later after adding column via updateSchema.addColumn would have updated it's default via updateSchema.updateColumnDefault which would only update write default keeping the initial default as null as we specified in type in case of orc with default value.

but this is not possible because when we use updateSchema.updateColumnDefault after updateSchema.addColumn in a single transaction it is not able to find nested field in case of struct, verified via using already present handleDefaultValues method which recursively checks nested fields and is already used in case of updating default for handleChangeColumn
so this is not possible in a single transaction

so alternatives we have are :

  1. set initial default in convert step as mentioned and then later on in case isOrc is true and defaultValue!=null as well then explicitly set initial default value as null for those fields which will trigger a type rebuild as well

  2. pass isOrc as a parameter in convert and in the underlying method we can add the conditional to set initial default based on it is true or false

@ayushtkn WDYT?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will go with 2, and not call it isORC, shouldAddInitialDefault, the value for which will be isOrc

return hasAnyDefault ? nestedRecord : null;
}

private static Record buildStructWithInitialDefaults(Types.StructType structType) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is identical to buildStructWithDefaults just the intialDefault vs writeDefault, refactor!!!!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done merged to a single method

StructColumnVector structCol = (StructColumnVector) col;
List<String> fieldNames = structTypeInfo.getAllStructFieldNames();
List<TypeInfo> fieldTypes = structTypeInfo.getAllStructFieldTypeInfos();
Map<String, Object> fieldDefaults = defaultValue instanceof Map ? (Map<String, Object>) defaultValue : null;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not an empty map rather than null, else you are checking fieldDefaults != null in the entire loop

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +98 to +105
col.isRepeating = true;
col.noNulls = true;
col.isNull[0] = false;

if (typeInfo.getCategory() == ObjectInspector.Category.PRIMITIVE) {
fillPrimitive(col, (PrimitiveTypeInfo) typeInfo, fieldDefault);
} else if (typeInfo.getCategory() == ObjectInspector.Category.STRUCT) {
fillStruct(col, (StructTypeInfo) typeInfo, fieldDefault);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is getting duped with the above // Case 2: We have a default → fill with constant value refactor rather than duplicating

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, refractored

Change-Id: If7fc697834eda9924177d648a9e96e8e149d7c7d

@Test
public void testSimpleSchemaConvertToIcebergSchema() {
void testSimpleSchemaConvertToIcebergSchema() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these changes are out of scope

Change-Id: Ia7939110bf55576cef6f7cdca329bc49454dfd48
@sonarqubecloud

sonarqubecloud Bot commented Jul 2, 2026

Copy link
Copy Markdown

@ayushtkn ayushtkn left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanx @ramitg254 for the work here.
Changes LGTM. There is one spotless warning around line length, I believe that can be fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants