HIVE-29341:Iceberg: [V3] NULL initial default for STRUCT column#6565
HIVE-29341:Iceberg: [V3] NULL initial default for STRUCT column#6565ramitg254 wants to merge 5 commits into
Conversation
| public enum DefaultMetadataPolicy { | ||
| WRITE_DEFAULT, | ||
| WRITE_AND_INITIAL_DEFAULT | ||
| } |
There was a problem hiding this comment.
I don't think you need this at all, For ORC InitialDefault will be null and the end result will be same, at best we can directly pass isOrcTable if required. I won't though
There was a problem hiding this comment.
yes, corrected it avoided passing this argument and applied initial defaults to struct via a method created in HiveSchemaUtil.java
(Earlier, I was trying to apply intial defualts directly in convert internal step so introduced this enum because passing isOrc boolean value as a method argument to the existing methods doesn't seem right so introduced this new argument to describe the purpose of adding this argument more clearly )
Change-Id: Iefec29f3d2e320cd94e5944651d65e85471b5cb8
Change-Id: Id2df9e65a64f876600882acffb6df2f628559e1e
ayushtkn
left a comment
There was a problem hiding this comment.
thanx @ramitg254 for the fix, have dropped some comments
| if (field.initialDefaultLiteral() == null && field.writeDefaultLiteral() != null) { | ||
| builder.withInitialDefault(field.writeDefaultLiteral()); | ||
| } |
There was a problem hiding this comment.
why are we using writeDefault when initialDefault isn't there?
There was a problem hiding this comment.
this method is only required for the sake of only cloning the type in case of adding column as mentioned in https://github.com/apache/hive/pull/6565/changes/BASE..c96b0695f3a0540b568be3cf1c59e59687260593#r3503931068
when write default will always be same as the initial default
There was a problem hiding this comment.
Doesn't make sense to me at all and anybody seeing in future will think of it is as a bug only. that we landed up setting writeDefault in Initial Default
| * Backfills struct column that is null on read using nested {@code initialDefault} metadata. | ||
| * This applies to rows written before {@code ADD COLUMNS} added the struct. | ||
| * Spec allows struct defaults as {@code {}} (see https://iceberg.apache.org/spec/#default-values), but | ||
| * {@code UpdateSchema} add column only supports primitives today; |
There was a problem hiding this comment.
add column only supports primitives today;
Whaat? Then how come we added struct column & the entire initialDefault initiative here?
There was a problem hiding this comment.
I meant here was underlying primitives of struct, corrected the comment
| */ | ||
| public static void backfillStructInitialDefaults(Record iceRecord, List<Types.NestedField> columns) { | ||
| for (Types.NestedField field : columns) { | ||
| if (field.type().isStructType() && iceRecord.getField(field.name()) == null) { |
There was a problem hiding this comment.
can we use fieldId here? I am doubting the case where there was a field with same name which was dropped earlier and then added again as struct.
There was a problem hiding this comment.
this works fine in the case you mentioned because the columns passed here are readSchema.columns() which is the current schema columns and chose field.name() over field.id() as Record has direct getter to find field value via get getField which is not the case with fieldId we need loop through all columns and match id to find the field itself,
also I have optimized it please check the latest one.
and verifed as well via these set of queries showing correct results:
set hive.vectorized.execution.enabled=false;
CREATE TABLE ice_drop_read (
id INT)
STORED BY ICEBERG stored as parquet
TBLPROPERTIES ('format-version'='3');
INSERT INTO ice_drop_read (id) VALUES (1);
ALTER TABLE ice_drop_read ADD COLUMNS (meta INT DEFAULT 99);
INSERT INTO ice_drop_read (id, meta) VALUES (2, 10);
ALTER TABLE ice_drop_read DROP COLUMN meta;
ALTER TABLE ice_drop_read ADD COLUMNS (meta STRUCT<x:INT, y:INT> DEFAULT '{"x":10,"y":20}');
INSERT INTO ice_drop_read (id) VALUES (3);
SELECT * FROM ice_drop_read ORDER BY id;
| } | ||
|
|
||
| @Test | ||
| void testBackfillStructInitialDefaults() { |
There was a problem hiding this comment.
these tests aren't required, we can have the q test, that is enough
| boolean needsBackfill = readSchema.columns().stream() | ||
| .filter(field -> field.type().isStructType()) | ||
| .anyMatch(field -> !HiveSchemaUtil.getStructInitialDefaults(field.type().asStructType()).isEmpty()); | ||
| if (!needsBackfill) { | ||
| return iterable; | ||
| } | ||
| return CloseableIterable.transform(iterable, row -> { | ||
| if (row instanceof Record curIceRecord) { | ||
| HiveSchemaUtil.backfillStructInitialDefaults(curIceRecord, readSchema.columns()); | ||
| } | ||
| return row; | ||
| }); |
There was a problem hiding this comment.
How many times are we walking over the schema here? This is pretty confusing, not sure what you are trying to achieve here.
For needsBackfill you go through the schema, call HiveSchemaUtil.getStructInitialDefaults populate the enitre map & then throw it?
There was a problem hiding this comment.
couple of redundancy I introduced there, optimized it please check https://github.com/apache/hive/pull/6565/changes/BASE..c96b0695f3a0540b568be3cf1c59e59687260593#r3504074084
| if (!needsBackfill) { | ||
| return iterable; | ||
| } | ||
| return CloseableIterable.transform(iterable, row -> { |
There was a problem hiding this comment.
The initialDefault value will be same for all the records, missing the field. buildStructWithInitialDefaults recursively parses the schema and calls convertToWriteType() to build the default record from scratch for every single missing struct on every single row. For a table with millions of rows, this continuous object creation and recursive schema parsing will severely degrade read performance.
There was a problem hiding this comment.
my bad, optimized it to build default struct only once and then set the field of the row record for that particular field
| Type type = HiveSchemaUtil.convert(TypeInfoUtils.getTypeInfoFromTypeString(addedCol.getType()), defaultValue); | ||
| Literal<Object> defaultVal = Optional.ofNullable(defaultValue).filter(v -> !type.isStructType()) | ||
| .map(v -> Expressions.lit(HiveSchemaUtil.getDefaultValue(v, type))).orElse(null); | ||
|
|
||
| Type baseType = HiveSchemaUtil.convert(TypeInfoUtils.getTypeInfoFromTypeString(addedCol.getType()), defaultValue); | ||
| final Type resolvedType = (!isORc && baseType.isStructType()) ? | ||
| HiveSchemaUtil.applyInitialDefaultsToStruct(baseType) : | ||
| baseType; | ||
| Literal<Object> defaultVal = Optional.ofNullable(defaultValue).filter(v -> !resolvedType.isStructType()) | ||
| .map(v -> Expressions.lit(HiveSchemaUtil.getDefaultValue(v, resolvedType))).orElse(null); | ||
| // ORC doesn't have support for initialDefault from iceberg layer, we only need to set default for writeDefault. | ||
| updateSchema.addColumn(addedCol.getName(), type, addedCol.getComment(), isORc ? null : defaultVal); | ||
| updateSchema.addColumn(addedCol.getName(), resolvedType, addedCol.getComment(), isORc ? null : defaultVal); |
There was a problem hiding this comment.
I am not catching what are u trying to achieve here? This used to have the default
Type type = HiveSchemaUtil.convert(TypeInfoUtils.getTypeInfoFromTypeString(addedCol.getType()), defaultValue);
Maybe just adding initialDefault along with writeDefault should have solved the problem.
As of now it only sets writeDefault while converting the schema
Type type = HiveSchemaUtil.convert(TypeInfoUtils.getTypeInfoFromTypeString(addedCol.getType()), defaultValue);
You are doing parsing and all multiple times
There was a problem hiding this comment.
yes that was the earlier approach I was using but isOrc argument needs to be passed there in convert to determine initial default needs to be set or not for inner struct fields which doesn't looks good as it will introduce concern "which metadata should be attached while building column type" which is not objective of convert , so building a copy of type already built with the already present metadata and initial defaults of underlying fields if col is of struct type and isOrc is false
There was a problem hiding this comment.
this approach ain't nice either, it looks overcomplicated to me
There was a problem hiding this comment.
one thing that would be nicer if it was possible is
- set initial defaults in
which is underlying method where convert falls into - and in
handleAddColumnsdo something like:
Type type = HiveSchemaUtil.convert(TypeInfoUtils.getTypeInfoFromTypeString(addedCol.getType()),
isORc ? null : defaultValue);
and later after adding column via updateSchema.addColumn would have updated it's default via updateSchema.updateColumnDefault which would only update write default keeping the initial default as null as we specified in type in case of orc with default value.
but this is not possible because when we use updateSchema.updateColumnDefault after updateSchema.addColumn in a single transaction it is not able to find nested field in case of struct, verified via using already present handleDefaultValues method which recursively checks nested fields and is already used in case of updating default for handleChangeColumn
so this is not possible in a single transaction
so alternatives we have are :
-
set initial default in convert step as mentioned and then later on in case isOrc is true and defaultValue!=null as well then explicitly set initial default value as null for those fields which will trigger a type rebuild as well
-
pass isOrc as a parameter in convert and in the underlying method we can add the conditional to set initial default based on it is true or false
@ayushtkn WDYT?
There was a problem hiding this comment.
I will go with 2, and not call it isORC, shouldAddInitialDefault, the value for which will be isOrc
| return hasAnyDefault ? nestedRecord : null; | ||
| } | ||
|
|
||
| private static Record buildStructWithInitialDefaults(Types.StructType structType) { |
There was a problem hiding this comment.
this is identical to buildStructWithDefaults just the intialDefault vs writeDefault, refactor!!!!
There was a problem hiding this comment.
done merged to a single method
| StructColumnVector structCol = (StructColumnVector) col; | ||
| List<String> fieldNames = structTypeInfo.getAllStructFieldNames(); | ||
| List<TypeInfo> fieldTypes = structTypeInfo.getAllStructFieldTypeInfos(); | ||
| Map<String, Object> fieldDefaults = defaultValue instanceof Map ? (Map<String, Object>) defaultValue : null; |
There was a problem hiding this comment.
why not an empty map rather than null, else you are checking fieldDefaults != null in the entire loop
| col.isRepeating = true; | ||
| col.noNulls = true; | ||
| col.isNull[0] = false; | ||
|
|
||
| if (typeInfo.getCategory() == ObjectInspector.Category.PRIMITIVE) { | ||
| fillPrimitive(col, (PrimitiveTypeInfo) typeInfo, fieldDefault); | ||
| } else if (typeInfo.getCategory() == ObjectInspector.Category.STRUCT) { | ||
| fillStruct(col, (StructTypeInfo) typeInfo, fieldDefault); |
There was a problem hiding this comment.
this is getting duped with the above // Case 2: We have a default → fill with constant value refactor rather than duplicating
Change-Id: If7fc697834eda9924177d648a9e96e8e149d7c7d
|
|
||
| @Test | ||
| public void testSimpleSchemaConvertToIcebergSchema() { | ||
| void testSimpleSchemaConvertToIcebergSchema() { |
|
ayushtkn
left a comment
There was a problem hiding this comment.
Thanx @ramitg254 for the work here.
Changes LGTM. There is one spotless warning around line length, I believe that can be fixed



What changes were proposed in this pull request?
support initial default for struct
Why are the changes needed?
currently initial default specified for struct isn't get read and stays null.
Does this PR introduce any user-facing change?
yes initial default values will be seen in previous rows
How was this patch tested?
unit tests and ci results