[spark] Add union read for lake-enabled log tables#2956

Open
fresh-borzoni wants to merge 3 commits into apache:main from fresh-borzoni:spark-union-read

@fresh-borzoni
Contributor

@fresh-borzoni fresh-borzoni commented Mar 29, 2026

Summary

closes #2983

Adds batch reads for lake-enabled log tables. When a table has datalake enabled, a read combines lake storage (Paimon/Iceberg) with the Fluss log tail. Lake and log reads are planned as separate Spark partitions: lake tasks read from lake storage without Fluss connections, and log tail tasks reuse the existing reader. The read falls back to a pure log read when no snapshot exists. Union read is only enabled in FULL startup mode.
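
To make the planning rule above concrete, here is a minimal, self-contained Java sketch. It is not the code in this PR: `UnionReadPlanSketch`, `LakeSnapshot`, `Split`, and `plan` are hypothetical names, and the sketch assumes that a lake snapshot records, per bucket, the log offset up to which data has been tiered, so that the log tail can start from there. Offset `0` stands in for "wherever the existing log planning would start".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

public class UnionReadPlanSketch {

    /** Hypothetical view of a lake snapshot: data files plus the tiered log offset per bucket. */
    static final class LakeSnapshot {
        final List<String> dataFiles;
        final Map<Integer, Long> tieredLogOffsets;

        LakeSnapshot(List<String> dataFiles, Map<Integer, Long> tieredLogOffsets) {
            this.dataFiles = dataFiles;
            this.tieredLogOffsets = tieredLogOffsets;
        }
    }

    /** One planned partition: either a lake file or the log tail of one bucket. */
    static final class Split {
        final boolean fromLake;
        final String lakeFile;   // null for log tail splits
        final int bucket;        // -1 for lake splits
        final long startOffset;  // -1 for lake splits

        private Split(boolean fromLake, String lakeFile, int bucket, long startOffset) {
            this.fromLake = fromLake;
            this.lakeFile = lakeFile;
            this.bucket = bucket;
            this.startOffset = startOffset;
        }

        static Split lake(String file) {
            return new Split(true, file, -1, -1L);
        }

        static Split logTail(int bucket, long startOffset) {
            return new Split(false, null, bucket, startOffset);
        }

        @Override
        public String toString() {
            return fromLake
                    ? "LakeSplit(" + lakeFile + ")"
                    : "LogTailSplit(bucket=" + bucket + ", from=" + startOffset + ")";
        }
    }

    /** Plans lake and log tail partitions separately; falls back to log-only planning. */
    static List<Split> plan(boolean fullStartupMode, Optional<LakeSnapshot> snapshot, int numBuckets) {
        List<Split> splits = new ArrayList<>();
        // Union read applies only in FULL startup mode and only when a snapshot exists.
        boolean unionRead = fullStartupMode && snapshot.isPresent();
        if (unionRead) {
            // Lake partitions carry only lake file information, no Fluss connection.
            for (String file : snapshot.get().dataFiles) {
                splits.add(Split.lake(file));
            }
        }
        for (int bucket = 0; bucket < numBuckets; bucket++) {
            // Assumption: the log tail starts at the offset the snapshot already covers.
            long start = unionRead
                    ? snapshot.get().tieredLogOffsets.getOrDefault(bucket, 0L)
                    : 0L;
            splits.add(Split.logTail(bucket, start));
        }
        return splits;
    }

    public static void main(String[] args) {
        Map<Integer, Long> offsets = new TreeMap<>();
        offsets.put(0, 100L);
        offsets.put(1, 250L);
        LakeSnapshot snapshot = new LakeSnapshot(Arrays.asList("part-0.parquet", "part-1.parquet"), offsets);

        System.out.println("union read:   " + plan(true, Optional.of(snapshot), 2));
        System.out.println("no snapshot:  " + plan(true, Optional.empty(), 2));
        System.out.println("not FULL:     " + plan(false, Optional.of(snapshot), 2));
    }
}
```

Keeping lake and log tail splits as distinct partition kinds is what lets a lake task run without opening a Fluss connection, while a log tail task can be handed to the existing log reader unchanged.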

Tests cover both Paimon and Iceberg.

Follow-up PRs

  • PK table lake reads (sort-merge with lake snapshot)
  • Streaming with lake bootstrap
  • Filter/partition/limit push-down to lake source
  • DV support for Paimon

@Yohahaha
Contributor

Yohahaha commented Apr 2, 2026

@fresh-borzoni thank you for the patch. I created an issue to track it: #2983.

@Yohahaha
Contributor

Yohahaha commented Apr 2, 2026

> PK table lake reads (sort-merge with lake snapshot)

I will add Spark SQL support for union read on PK tables in #2984.

@fresh-borzoni
Contributor Author

Thanks @Yohahaha! Just a heads up, this PR already implements batch union read for log tables, so #2983 should be covered once it's merged.

Regarding PK table union read (#2984), I was planning to follow up with that, as noted in the PR description.
Happy to collaborate if you're interested, let me know!

@Yohahaha
Contributor

Yohahaha commented Apr 2, 2026

> Thanks @Yohahaha! Just a heads up, this PR already implements batch union read for log tables, so #2983 should be covered once it's merged.

Yeah, you could add "closes #2983" to the PR description so that the corresponding issue is properly linked and closed, like in other PRs.

When Fluss releases a new version, the RM can then easily collect the features in that version's scope.

@Yohahaha
Contributor

Yohahaha commented Apr 2, 2026

> Regarding PK table union read (#2984), I was planning to follow up with that, as noted in the PR description. Happy to collaborate if you're interested, let me know!

@fresh-borzoni I was planning to implement it over the next two weeks. Do you already have a draft PR?

Contributor

@Yohahaha Yohahaha left a comment

left some comments, thank you!

@fresh-borzoni
Contributor Author

@Yohahaha Thank you, feel free to take #2984. I have some starting code, but I'd love to collaborate and would appreciate your help here.

@fresh-borzoni fresh-borzoni force-pushed the spark-union-read branch 2 times, most recently from 37d1a7d to 9682f4e Compare April 2, 2026 22:25
@fresh-borzoni
Contributor Author

@Yohahaha Ty for the review,
Addressed comments, PTAL 🙏

@fresh-borzoni fresh-borzoni requested a review from Yohahaha April 2, 2026 22:46
Development

Successfully merging this pull request may close these issues.

[spark] Batch union read for log table

2 participants