rptest: add list_offsets leader epoch test #30285

Open
nguyen-andrew wants to merge 4 commits into redpanda-data:dev from nguyen-andrew:listoffsets-test

nguyen-andrew commented Apr 24, 2026

This PR adds ducktape coverage for CORE-12505: Redpanda returns the current leader epoch instead of the record's historical epoch on the ListOffsets earliest, timequery-match, and empty-partition paths.

Server-side response shape. test_list_offsets_epoch and test_list_offsets_epoch_empty_partition exercise ListOffsets v4 against both Kafka and Redpanda for the earliest, latest, timequery, and empty-partition paths. The test produces records at a single epoch, advances the leader epoch 3 times via leader restart, and asserts the returned epoch per path. Kafka returns the record's historical epoch, which is the correct behavior we compare against. Redpanda currently returns the current leader epoch. The Redpanda test pins today's buggy behavior, so it will fail once the bug is fixed and can then be flipped to assert the correct behavior. The empty-partition variant drives the same flow with zero records to cover the no-records branch.
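
For reviewers, the per-path expectations can be summarized as a small lookup. This is only an illustrative sketch: `expected_epoch`, the cluster labels, and the path names are hypothetical and are not taken from the test file. It encodes only the earliest and timequery cases, which are stated explicitly, and it assumes each leader restart advances the epoch by exactly one.

```python
def expected_epoch(cluster: str, path: str, record_epoch: int, current_epoch: int) -> int:
    """Return the leader epoch each cluster is described as reporting.

    Hypothetical helper: it covers only the earliest and timequery paths.
    """
    if path not in ("earliest", "timequery"):
        raise ValueError(f"expectation for path {path!r} is not modeled here")
    if cluster == "kafka":
        # Kafka reports the epoch at which the matched record was written.
        return record_epoch
    if cluster == "redpanda":
        # Behavior tracked by CORE-12505: the current leader epoch is returned.
        return current_epoch
    raise ValueError(f"unknown cluster {cluster!r}")


# Records produced at a single epoch, followed by three leader restarts.
# Assumption: every restart bumps the leader epoch by exactly one.
record_epoch = 1
current_epoch = record_epoch + 3

kafka_earliest = expected_epoch("kafka", "earliest", record_epoch, current_epoch)
redpanda_timequery = expected_epoch("redpanda", "timequery", record_epoch, current_epoch)
```

Once the bug is fixed, the Redpanda branch of such a lookup would collapse into the Kafka one.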

Consumer-level end-to-end.

  • test_seek_to_start_poisons_commit reproduces the bug end-to-end via rpk group seek --to start. With a stagnant topic and an epoch gap, the seek writes (0, current_epoch) into __consumer_offsets. An rpk topic consume --group consumer (franz-go AutoCommitMarks) reads all records but commits don't advance. Restart triggers a full replay. The bug applies to any seek that resolves to records at an older leader epoch; --to start is the most reproducible trigger.
  • test_throwaway_hack_mitigates_seek_to_start verifies the --to-file-with-throwaway-group mitigation. The seeded commit is (0, -1). franz-go's EpochOffset.Less comparator clamp lets real marks advance the head. Restart reads zero records.
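
The difference between the two seeded commits can be illustrated with a simplified model of the commit head. This is not franz-go's code: the `EpochOffset` class, `advance_head`, and `replay_marks` below are hypothetical, and the sketch assumes the comparator orders by epoch first and by offset second. Tuples follow the (offset, epoch) order used above.

```python
from typing import NamedTuple


class EpochOffset(NamedTuple):
    offset: int
    epoch: int

    def less(self, other: "EpochOffset") -> bool:
        # Assumed ordering: compare epochs first, then offsets within an epoch.
        return self.epoch < other.epoch or (
            self.epoch == other.epoch and self.offset < other.offset
        )


def advance_head(head: EpochOffset, mark: EpochOffset) -> EpochOffset:
    # Keep the current head unless the new mark orders strictly after it.
    return mark if head.less(mark) else head


def replay_marks(seed: EpochOffset, marks: list[EpochOffset]) -> EpochOffset:
    head = seed
    for mark in marks:
        head = advance_head(head, mark)
    return head


# Ten records written at epoch 1; the leader has since moved to epoch 2.
marks = [EpochOffset(offset=o, epoch=1) for o in range(1, 11)]

# Seek to start seeds (0, current_epoch): every mark orders before the seed.
poisoned = replay_marks(EpochOffset(0, 2), marks)

# Throwaway-group mitigation seeds (0, -1): every real mark orders after it.
mitigated = replay_marks(EpochOffset(0, -1), marks)
```

Under this model the poisoned head stays at (0, 2), which matches the observed "commits don't advance" symptom, while the mitigated head reaches (10, 1).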

Supporting changes. RpkTool.group_seek_to gains optional topics / allow_new_topics kwargs (backward-compatible). RpkConsumer gains an optional clean_shutdown kwarg (default False; opt-in True sends SIGTERM so franz-go can flush a final commit and send LeaveGroup). New OffsetFetchRequest_v5 / OffsetFetchResponse_v5 protocol classes plus a _get_committed helper, since rpk group describe doesn't expose committed_leader_epoch.
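
A rough sketch of how the new optional kwargs might map onto behavior follows. Both functions are hypothetical stand-ins, not the actual RpkTool or RpkConsumer code, and the `--topics` flag spelling is an assumption; only `--to` and `--allow-new-topics` are named in this PR.

```python
from __future__ import annotations

import signal


def shutdown_signal(clean_shutdown: bool = False) -> signal.Signals:
    # Default keeps the existing SIGKILL behavior; opting in sends SIGTERM so
    # the client gets a chance to flush a final commit and leave the group.
    return signal.SIGTERM if clean_shutdown else signal.SIGKILL


def group_seek_args(
    group: str,
    to: str,
    topics: list[str] | None = None,
    allow_new_topics: bool = False,
) -> list[str]:
    # Callers passing only (group, to) get the same arguments as before.
    args = ["group", "seek", group, "--to", to]
    if topics:
        # Assumed flag name for restricting the seek to specific topics.
        args += ["--topics", ",".join(topics)]
    if allow_new_topics:
        args.append("--allow-new-topics")
    return args


legacy = group_seek_args("g1", "start")
fresh_group = group_seek_args("g2", "start", topics=["t"], allow_new_topics=True)
```

Keeping both kwargs optional with defaults that reproduce the old behavior is what makes the change backward-compatible for existing callers.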

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

nguyen-andrew self-assigned this Apr 24, 2026
Copilot AI review requested due to automatic review settings April 24, 2026 02:00
Copilot AI left a comment

Pull request overview

Adds a new rptest that exercises Kafka ListOffsets v4 leader-epoch semantics across earliest/latest/timequery and empty-partition paths, running the same procedure against Apache Kafka (as baseline) and Redpanda.

Changes:

  • Introduces custom ListOffsets v4 request/response schema helpers to query leader_epoch from protocol responses.
  • Adds shared test logic to create a record-epoch vs leader-epoch gap by restarting leaders and then validating returned epochs for different ListOffsets timestamp modes.
  • Adds Redpanda and Apache Kafka test classes that run the same assertions with cluster-specific leader-restart mechanics.

Comment thread tests/rptest/tests/list_offsets_epoch_test.py
Comment thread tests/rptest/tests/list_offsets_epoch_test.py
Comment thread tests/rptest/tests/list_offsets_epoch_test.py Outdated
Comment thread tests/rptest/tests/list_offsets_epoch_test.py Outdated
Exercises ListOffsets v4 for the earliest, latest, timequery, and
empty-partition paths. Produces records at a single epoch, advances
the leader epoch 3 times via leader restart, and asserts the returned
epoch against the expected value for each cluster: Kafka returns the
record's historical epoch for earliest/timequery (the correct
behavior we compare against), while Redpanda currently returns the
current leader epoch. The empty-partition variant drives the same
flow with zero records to cover the no-records branch.

Covers CORE-12505: Redpanda returns the current leader epoch instead
of the record's historical epoch for earliest, timequery, and empty
partition paths. The Redpanda test pins today's buggy behavior so it
will fail once the bug is fixed and can be flipped to assert correct
behavior.
nguyen-andrew and others added 3 commits April 27, 2026 22:00
Adds optional kwargs `topics: list[str] | None` and `allow_new_topics:
bool` (default False) to RpkTool.group_seek_to.  Backward-compatible:
existing callers passing only (group, to) are unchanged.

Needed by the CORE-12505 e2e tests' throwaway-hack flow, which seeks
a fresh consumer group to a topic the group has not yet consumed --
rpk requires --allow-new-topics for that case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `clean_shutdown: bool = False` to RpkConsumer.__init__ and threads
it into stop_node's kill_process call.  Default preserves the existing
SIGKILL behavior; opt in via clean_shutdown=True to send SIGTERM, which
triggers rpk's signal handler at consume.go:118-152 -> franz-go
client.Close() -> final commit + LeaveGroup before the process exits.

Used by the CORE-12505 e2e tests, where sequential consumers in the
same group would otherwise wait ~45s for the prior member's session
timeout before joining.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two ducktape tests on ListOffsetsLeaderEpochRedpandaTest:

- test_seek_to_start_poisons_commit -- end-to-end reproduction of the
  bug via `rpk group seek --to start`.  With a stagnant topic and an
  epoch gap, the seek writes (0, current_epoch) into __consumer_offsets;
  an `rpk topic consume --group` consumer (franz-go AutoCommitMarks)
  reads all records but commits don't advance; restart triggers a full
  replay.  The bug is broader than this single trigger -- it manifests
  for any seek that resolves to records produced at an older leader
  epoch (`--to start`, `--to <past-timestamp>`, or any tool that does
  ListOffsets earliest/timequery -> OffsetCommit).  `--to start` is
  exercised here because it's the most reproducible trigger.

- test_throwaway_hack_mitigates_seek_to_start -- mirror test verifying
  the --to-file-with-throwaway-group hack as a mitigation.  Seeded
  commit is (0, -1); franz-go's EpochOffset.Less comparator clamp lets
  real marks advance the head; restart reads zero records.

Supporting additions in the same file:

- OffsetFetchRequest_v5 / OffsetFetchResponse_v5 protocol classes,
  needed because `rpk group describe` exposes only CURRENT-OFFSET, not
  committed_leader_epoch.
- _get_committed helper on the base class -- issues OffsetFetch v5
  directly via the kafka admin client and returns (offset, epoch).
- _consume_and_wait_for_autocommit helper on the Redpanda subclass --
  wraps RpkConsumer with clean_shutdown=True for graceful LeaveGroup.
- _apply_throwaway_hack helper -- runs the literal 5-step hack with
  try/finally cleanup of the throwaway group and the local seek file.

_setup_topic_with_epoch_gap is parameterized with num_epoch_advances
(default 3 to preserve the existing epoch-correctness tests' behavior;
the new tests pass 1, since the bug only requires
current_epoch > initial_epoch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nguyen-andrew requested a review from pgellert April 27, 2026 22:35