Skip to content

[AURON #2291] Make auron-flink-assembly the structural deployment jar#2292

Open
weiqingy wants to merge 7 commits into
apache:masterfrom
weiqingy:AURON-2291-impl
Open

[AURON #2291] Make auron-flink-assembly the structural deployment jar#2292
weiqingy wants to merge 7 commits into
apache:masterfrom
weiqingy:AURON-2291-impl

Conversation

@weiqingy
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #2291

Rationale for this change

auron-flink-assembly is the single Flink deployment jar. It already bundles both Auron's StreamExecCalc override and Flink's stock StreamExecCalc (transitive from flink-table-planner_2.12), but with no class-level filter the surviving copy was decided by maven-shade's class-overlap ordering — not a guarantee. This makes Auron's override win structurally, so activation no longer depends on incidental shade/classpath ordering.

What changes are included in this PR?

  • Exclude Flink's stock StreamExecCalc from the shaded jar so only Auron's override remains.
  • Add ASF NOTICE/LICENSE for the embedded modified Flink content, plus a full audit of bundled non-ALv2 dependencies (icu4j, protobuf, kryo, minlog, janino, commons-compiler, zstd-jni, jts-core, checker-qual).
  • Add AssemblyJarStructureIT, a build-time test asserting the jar contains exactly one StreamExecCalc and it is Auron's.
  • Run the structural test in the Flink CI workflow.

Are there any user-facing changes?

The recommended deployment becomes a single structural jar swap (remove flink-table-planner-loader-*.jar from $FLINK_HOME/lib/, add auron-flink-assembly.jar). The user-facing deployment documentation is tracked separately in #1892.

How was this patch tested?

AssemblyJarStructureIT (build-time, runs in CI). Manually verified the loader to non-loader jar swap on a Flink 1.18 home running a Calc SQL with native conversion active.

weiqingy added 3 commits May 30, 2026 01:24
…embly jar

The auron-flink-assembly shaded jar bundles both Auron's StreamExecCalc override
(from auron-flink-planner) and Flink's stock copy (transitive from
flink-table-planner_2.12). Add a maven-shade-plugin filter that excludes the
stock class so only Auron's override remains, making activation a structural
guarantee instead of depending on shade's class-overlap ordering.
…y jar

The auron-flink-assembly jar bundles a modified copy of Apache Flink's
flink-table-planner (Auron's StreamExecCalc substituted in) plus a large
dependency closure. Add ASF-compliant license documentation:

- META-INF/NOTICE: Auron product header (via ApacheNoticeResourceTransformer),
  the bundled-modified-Flink statement with an ALv2 section 4(b) note naming
  StreamExecCalc, and kafka-clients' required jar-root attribution folded in.
- Root LICENSE: Apache License 2.0 plus blocks for every bundled non-ALv2
  dependency (icu4j Unicode/ICU; protobuf-java, kryo, minlog, janino,
  commons-compiler BSD-3; zstd-jni BSD-2; jts-core EPL-2.0; checker-qual MIT).
- Shade transformers and filters to aggregate dependency notices and drop the
  colliding jar-root NOTICE/LICENSE files of kafka-clients and icu4j.
AssemblyJarStructureIT (maven-failsafe, verify phase) opens the built
auron-flink-assembly jar and asserts it is structurally correct: exactly one
StreamExecCalc class, that it is Auron's override (byte-level check for the
FlinkAuronCalcOperator marker, which stock Flink's class lacks), that the full
Flink planner content is bundled, and that META-INF/NOTICE attributes both
Apache Auron and Apache Flink. Guards against the stock StreamExecCalc winning
the shade.
Add the auron-flink-assembly module to the Flink CI job and run
AssemblyJarStructureIT in the verify phase. A single `mvn verify -pl
auron-flink-extension/auron-flink-assembly -am` builds the native library and
all upstream Flink modules (so auron-core's test-jar is produced for
auron-flink-runtime), runs their tests, then verifies the shaded assembly jar.
@weiqingy
Copy link
Copy Markdown
Contributor Author

Hi @Tartarus0zm, could you please help review this PR when you get a chance? Thank you!

Copy link
Copy Markdown
Contributor

@Tartarus0zm Tartarus0zm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weiqingy Thanks for your contribution, I think this is a very rigorous strategy, I completely agree.
But I haven't seen any targeted tests for StreamExecCalc. In the future, when we rewrite other Exec nodes, developers will need to follow this convention, but we don't have a validation mechanism in place. It looks like we'll have to rely solely on a handful of committers doing code reviews to ensure it works, and that's what worries me.

  • Can we define an interface, something like FlinkAuronStreamExec, that our rewritten Flink StreamExecCalc would need to implement? Then, going forward, any support we add for other Exec nodes would also have to comply with this constraint? That way we can write dedicated tests in AssemblyJarStructureIT.
  • Can we add a check in the auron-flink-planner module that scans all rewritten classes under org/apache/flink/table/planner/plan/nodes/exec and verifies they all implement the FlinkAuronStreamExec interface? This would ensure no developer forgets to do so.
  • We can add a reminder in the FlinkAuronStreamExec docs, developers also need to configure the filter strategy in auron-flink-assembly/pom.xml
    What do you think, does this work for you?

weiqingy added 2 commits May 31, 2026 23:09
…ker interface

Adding a new Auron override of a Flink ExecNode requires shipping a class at
Flink's fully-qualified class name and excluding the stock copy from the
assembly shade filter. Nothing failed the build if a step was missed, so a
forgotten exclude could let Flink's stock class win the shade and the override
go silently inactive.

Introduce FlinkAuronExecNode, a marker interface every override implements, and
a registry resource (META-INF/auron/shadowed-flink-execnodes.txt) listing the
override class names. ShadowedExecNodeRegistryTest scans the planner's exec-node
package and fails the build when an override does not implement the marker or is
absent from the registry, making the contract self-enforcing. StreamExecCalc is
the first registered override.
…y shade

AssemblyJarStructureIT previously hard-coded a single StreamExecCalc check. As
more Flink ExecNode overrides are added, each needs its own shade <exclude>; a
forgotten one lets Flink's stock class win the shade and the override silently
fall back, with no failing test.

Drive the structural test off the override registry instead: for every class
named in META-INF/auron/shadowed-flink-execnodes.txt, assert the assembly jar
holds exactly one copy and it carries the FlinkAuronExecNode marker. A missing
<exclude> drops the marker from the surviving class and fails the test. The
assembly pom filter comment now documents that per-override step.
@weiqingy
Copy link
Copy Markdown
Contributor Author

weiqingy commented Jun 1, 2026

@weiqingy Thanks for your contribution, I think this is a very rigorous strategy, I completely agree. But I haven't seen any targeted tests for StreamExecCalc. In the future, when we rewrite other Exec nodes, developers will need to follow this convention, but we don't have a validation mechanism in place. It looks like we'll have to rely solely on a handful of committers doing code reviews to ensure it works, and that's what worries me.

  • Can we define an interface, something like FlinkAuronStreamExec, that our rewritten Flink StreamExecCalc would need to implement? Then, going forward, any support we add for other Exec nodes would also have to comply with this constraint? That way we can write dedicated tests in AssemblyJarStructureIT.
  • Can we add a check in the auron-flink-planner module that scans all rewritten classes under org/apache/flink/table/planner/plan/nodes/exec and verifies they all implement the FlinkAuronStreamExec interface? This would ensure no developer forgets to do so.
  • We can add a reminder in the FlinkAuronStreamExec docs, developers also need to configure the filter strategy in auron-flink-assembly/pom.xml
    What do you think, does this work for you?

@Tartarus0zm Thanks for the careful read and review. I've implemented all three of your suggestions, and backed the last one with a test rather than a doc so the part your note worried about isn't left to committer vigilance. Pushed in the two new commits.

1. Marker interface. Added FlinkAuronExecNode — a pure marker that every override implements; StreamExecCalc is the first. One naming deviation: I called it FlinkAuronExecNode rather than FlinkAuronStreamExec, because the scan and registry below span org/apache/flink/table/planner/plan/nodes/exec (so a future …/exec/batch override is covered by the same machinery). Happy to rename it to the stream-scoped form if you'd prefer to keep it narrow for now.

2. Scan check in auron-flink-planner. ShadowedExecNodeRegistryTest scans every compiled class under …/nodes/exec/** and fails the build if any concrete override doesn't implement FlinkAuronExecNode. So "forgot the marker" can't reach a green CI.

3. The pom.xml filter step — enforced, not just documented. The marker's Javadoc records the full contract (implement the marker → register the class → add the shade <exclude>), but I wanted that last step guarded by a test. There's now a registry resource, META-INF/auron/shadowed-flink-execnodes.txt, listing the override class names, and AssemblyJarStructureIT reads it from the built jar and asserts each registered override survived the shade as a single class carrying the marker. If someone adds an override but forgets the <exclude>, Flink's stock copy wins the collision, the surviving class has no marker, and the assembly test fails. The registry is the bridge between the two checks: the assembly module has neither Flink nor the planner classes on its test classpath, so it needs an explicit list of what to verify — and the planner scan in (2) keeps that list honest (a registered-but-missing or unregistered-but-present class fails there).

Net effect: each of the three "forgot a step" mistakes now fails a specific test instead of resting on review. Does this match what you had in mind? And let me know your preference on the marker name.

@weiqingy
Copy link
Copy Markdown
Contributor Author

weiqingy commented Jun 1, 2026

The 1 failing check is unrelated to this PR — a transient CI-runner network failure in the Spark 4.1 test track, which this PR doesn't touch.

  • Test spark-4.1 JDK21 Scala-2.13 / Build Auron JAR: 461/462 Spark tests passed. The single failure is AuronParquetV2PartitionDiscoverySuiteSPARK-23436, SPARK-36861: invalid Dates should be inferred as String in partition inference. Root cause is RemoteClassLoaderError caused by Failed to connect to runnervm…cloudapp.net:35141 / RejectedExecutionException: event executor terminated — the Spark executor lost its RPC connection to the driver mid-test, so codegen couldn't fetch a class and the write task aborted (TASK_WRITE_FAILED). That's runner infra, not a logic failure (54 connect failures + 70 "event executor terminated" in the log). Evidence: failing job.

This PR only changes Flink modules (auron-flink-extension/auron-flink-planner, auron-flink-extension/auron-flink-assembly, and flink.yml) — none of it is on the Spark test classpath, and the Test Flink 1.18 job (the one this PR affects) passed.

Copy link
Copy Markdown
Contributor

@Tartarus0zm Tartarus0zm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution again! LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ship a shaded auron-flink-planner jar that replaces flink-table-planner.jar

2 participants