
HBASE-29039 Seek past delete markers instead of skipping one at a time #8001

Draft
junegunn wants to merge 2 commits into apache:master from junegunn:HBASE-29039-alt

Conversation

@junegunn (Member) commented Mar 29, 2026

Context

HBASE-30036 (#7993) consolidates redundant delete markers on flush, preventing them from growing unbounded in HFiles. However, markers still accumulate in the memstore before flush, degrading read performance. HBASE-29039 addresses this from the read path side; both are needed for full coverage. There is an open PR (#6557), but its review has stalled. This is an alternative approach with fewer code changes, which will hopefully make it easier to reach consensus.

Test result

Using the test code in HBASE-30036.

DeleteFamily

[benchmark chart]
  • Substantial read performance improvement before flushes.
  • Without HBASE-30036, delete markers still accumulate in store files.

DeleteColumnContiguous

[benchmark chart]
  • Substantial read performance improvement before flushes.
  • Without HBASE-30036, delete markers still accumulate in store files.

DeleteColumnInterleaved

[benchmark chart]
  • No difference, as expected. Already triggers SEEK_NEXT_COL via the masked put.

Description

When a DeleteColumn or DeleteFamily marker is encountered during a normal user scan, the matcher currently returns SKIP, forcing the scanner to advance one cell at a time. This causes read latency to degrade linearly with the number of accumulated delete markers for the same row or column.

Since these are range deletes that mask all remaining versions of the column, the matcher now seeks past the entire column immediately via columns.getNextRowOrNextColumn(). This is safe because cells arrive in timestamp-descending order, so any puts newer than the delete have already been processed.

For DeleteFamily, also fix getKeyForNextColumn in ScanQueryMatcher to bypass the empty-qualifier guard (HBASE-18471) when the cell is a DeleteFamily marker. Without this, the seek barely advances past the current cell instead of jumping to the first real qualified column.

The optimization is skipped when:

  • seePastDeleteMarkers is true (KEEP_DELETED_CELLS)
  • newVersionBehavior is enabled (sequence IDs determine visibility)
  • the delete marker is not tracked (visibility labels)
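Conceptually, the change can be sketched as follows. This is a simplified, self-contained illustration, not actual HBase code: the MatchCode/CellKind enums and the onDeleteMarker helper are stand-ins for the real ScanQueryMatcher/DeleteTracker plumbing, and both range-delete kinds are collapsed to SEEK_NEXT_COL here (in the actual patch the seek key for DeleteFamily is computed by the fixed getKeyForNextColumn).

```java
// Hypothetical sketch: return a seek code instead of SKIP when a range
// delete marker masks everything that follows in the column/row.
public class DeleteMarkerSeekSketch {
  enum MatchCode { SKIP, SEEK_NEXT_COL }
  enum CellKind { PUT, DELETE, DELETE_COLUMN, DELETE_FAMILY }

  // The three booleans mirror the bail-out conditions listed above.
  static MatchCode onDeleteMarker(CellKind kind, boolean seePastDeleteMarkers,
                                  boolean newVersionBehavior, boolean markerTracked) {
    if (seePastDeleteMarkers || newVersionBehavior || !markerTracked) {
      return MatchCode.SKIP; // optimization disabled: fall back to one-cell skip
    }
    switch (kind) {
      case DELETE_COLUMN: // masks all remaining versions of this column
      case DELETE_FAMILY: // masks everything older in this row
        return MatchCode.SEEK_NEXT_COL;
      default:
        return MatchCode.SKIP; // version delete: masks a single version only
    }
  }

  public static void main(String[] args) {
    System.out.println(onDeleteMarker(CellKind.DELETE_COLUMN, false, false, true)); // SEEK_NEXT_COL
    System.out.println(onDeleteMarker(CellKind.DELETE_FAMILY, false, false, true)); // SEEK_NEXT_COL
    System.out.println(onDeleteMarker(CellKind.DELETE_COLUMN, true, false, true));  // SKIP
  }
}
```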

junegunn marked this pull request as draft March 29, 2026 03:08
@junegunn (Member, Author)

TestVisibilityLabelsWithDeletes is failing, which likely explains the additional changes in #6557. I'll try to fix it, but if it ends up resembling the previous approach, I'll drop this.

When a DeleteColumn or DeleteFamily marker is encountered during a normal
user scan, the matcher currently returns SKIP, forcing the scanner to
advance one cell at a time. This causes read latency to degrade linearly
with the number of accumulated delete markers for the same row or column.

Since these are range deletes that mask all remaining versions of the
column, seek past the entire column immediately via
columns.getNextRowOrNextColumn(). This is safe because cells arrive in
timestamp descending order, so any puts newer than the delete have
already been processed.

For DeleteFamily, also fix getKeyForNextColumn in ScanQueryMatcher to
bypass the empty-qualifier guard (HBASE-18471) when the cell is a
DeleteFamily marker. Without this, the seek barely advances past the
current cell instead of jumping to the first real qualified column.

The optimization is only applied with plain ScanDeleteTracker, and
skipped when:
- seePastDeleteMarkers is true (KEEP_DELETED_CELLS)
- newVersionBehavior is enabled (sequence IDs determine visibility)
- visibility labels are in use (delete/put label mismatch)
@junegunn (Member, Author)

> TestVisibilityLabelsWithDeletes is failing

Fixed by:

-          !seePastDeleteMarkers && !(deletes instanceof NewVersionBehaviorTracker)
+          !seePastDeleteMarkers && deletes.getClass() == ScanDeleteTracker.class

junegunn marked this pull request as ready for review March 29, 2026 03:42
@junegunn (Member, Author) commented Mar 30, 2026

I found a regression with this patch. When scanning across many rows where each row has only one DeleteFamily (or DeleteColumn) marker, scan performance degrades by ~50% compared to master. The seek triggered by this optimization is more expensive than a simple skip when there's nothing to skip over.

The optimization helps when multiple delete markers accumulate for the same row or column. But for the common case of one delete per row, the seek is wasted and the overhead adds up across many rows.

Benchmark data (scan time at 300K iterations, DeleteFamily on different rows):

# T (the table), CF, CQ, and VALUE come from the HBASE-30036 test harness.
benchmark(:DeleteFamilyDifferentRows) do |i|
  row = i.to_s.to_java_bytes
  T.put(Put.new(row).addColumn(CF, CQ, VALUE))  # one put per row
  T.delete(Delete.new(row))                     # one DeleteFamily marker per row
end
[benchmark chart]

One possible approach: only seek on the second (or n-th) delete marker for the same scope. The first one would SKIP as before. If a second one appears (redundant), it signals accumulation and we switch to seek. This way:

  • One delete per row (common case): always skips, no regression
  • Accumulated deletes (the case we're optimizing): first one skips, rest seek

Would this kind of heuristic make sense?
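A minimal sketch of that heuristic, with hypothetical names (the real patch would hook this into the matcher's delete-tracking path):

```java
// Hypothetical sketch: count contiguous range delete markers and only
// switch from SKIP to SEEK once the count reaches a threshold, which
// signals actual accumulation. The counter resets on any other cell.
public class SeekHeuristicSketch {
  static final int SEEK_THRESHOLD = 3; // the "N" discussed above
  private int contiguousDeleteMarkers = 0;

  // Returns true if the scanner should SEEK, false if it should SKIP.
  boolean onRangeDeleteMarker() {
    if (++contiguousDeleteMarkers >= SEEK_THRESHOLD) {
      contiguousDeleteMarkers = 0; // seek issued; start counting again
      return true;
    }
    return false; // below threshold: cheap one-cell skip
  }

  void onOtherCell() {
    contiguousDeleteMarkers = 0; // the run of markers ended
  }

  public static void main(String[] args) {
    SeekHeuristicSketch h = new SeekHeuristicSketch();
    // Common case: one marker per row, counter resets in between, never seeks.
    System.out.println(h.onRangeDeleteMarker()); // false
    h.onOtherCell();
    System.out.println(h.onRangeDeleteMarker()); // false
    h.onOtherCell();
    // Accumulation: the third contiguous marker triggers a seek.
    System.out.println(h.onRangeDeleteMarker()); // false
    System.out.println(h.onRangeDeleteMarker()); // false
    System.out.println(h.onRangeDeleteMarker()); // true
  }
}
```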

Note

On the threshold for switching from skip to seek: based on my benchmarks, a seek is roughly 50% more expensive than a skip, so the cost of a false positive (an unnecessary seek because there are exactly N deletes per Put) depends on the threshold N:

  • N=1: false positive costs SEEK (1.5) vs SKIP (1.0) → 50% overhead
  • N=2: false positive costs SKIP + SEEK (2.5) vs SKIP + SKIP (2.0) → 25% overhead
  • N=3: false positive costs 2 SKIPs + SEEK (3.5) vs 3 SKIPs (3.0) → 17% overhead
  • N=4: false positive costs 3 SKIPs + SEEK (4.5) vs 4 SKIPs (4.0) → 13% overhead
  • N=10: false positive costs 9 SKIPs + SEEK (10.5) vs 10 SKIPs (10.0) → 5% overhead
[benchmark charts]

Higher N reduces the relative overhead of false positives, but delays the benefit when markers are truly accumulating (N-1 extra skips per row before seeking kicks in).

N=2 or N=3 both seem reasonable, but since we're optimizing for the case where many delete markers accumulate, a higher N like 10 would also work. The first few extra skips are negligible when there are hundreds of markers to seek past. Happy to hear thoughts on what makes sense here.
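The numbers above follow directly from the "seek costs ~1.5x a skip" assumption: a false positive pays (N-1) skips plus one seek against a baseline of N skips, so the relative overhead is 0.5/N. A quick arithmetic check:

```java
// Reproduces the overhead table above from the assumption that a seek
// costs ~1.5x a skip: false positive = (N-1) skips + 1 seek, vs N skips.
public class ThresholdOverhead {
  static double overheadPercent(int n) {
    double falsePositive = (n - 1) * 1.0 + 1.5; // (N-1) skips + 1 seek
    double baseline = n * 1.0;                  // N skips
    return (falsePositive / baseline - 1.0) * 100.0; // == 50.0 / N
  }

  public static void main(String[] args) {
    for (int n : new int[] { 1, 2, 3, 4, 10 }) {
      System.out.printf("N=%d: %.0f%%%n", n, overheadPercent(n));
    }
    // Prints: N=1: 50%, N=2: 25%, N=3: 17%, N=4: 13%, N=10: 5%
  }
}
```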

Note

This patch does not compare qualifiers of contiguous delete markers. Doing so (e.g. by exposing a method on ScanDeleteTracker) would prevent cross-column false positives, but would not eliminate false positives entirely: even with qualifier comparison, if a column has exactly N DeleteColumn markers, the seek at the Nth is still a false positive. e.g.

DC(q1) --skip--> DC(q1) --skip--> DC(q1) --seek--> DC(q2) --skip--> DC(q2) --skip--> DC(q2) --seek--> DC(q3)

@junegunn (Member, Author) commented Mar 30, 2026

59ad767 implements the heuristic with N = 3 (i.e. seek once every 3 contiguous delete markers).

The regression in the normal case (no redundant delete markers) is fixed (see HBASE-29039-alt-n3):

[benchmark chart]

The performance benefit with many redundant delete markers remains:

[benchmark charts]

junegunn force-pushed the HBASE-29039-alt branch 2 times, most recently from 6be48a0 to e7dc782, March 30, 2026 23:17
junegunn marked this pull request as draft March 30, 2026 23:24
Seeking is ~50% more expensive than skipping. When each row has only one
DeleteFamily or DeleteColumn marker (common case), the seek overhead
adds up across many rows, causing ~50% scan regression.

Introduce a counter that tracks consecutive range delete markers per row.
Only switch from SKIP to SEEK after seeing SEEK_ON_DELETE_MARKER_THRESHOLD
(default 3) markers, indicating actual accumulation. This preserves skip
performance for the common case while still optimizing the accumulation
case.
