
Regression: S3/GCS disks broken in 25.8.22 by aws-sdk-cpp 1.11.771 — "Response checksums mismatch" #1708

@CarlosFelipeOR

Description


I checked the Altinity Stable Builds lifecycle table, and the Altinity Stable Build version I'm using is still supported.

Type of problem

Bug report - something's broken

Describe the situation

A regression was introduced in Altinity Antalya 25.8.22 by PR #1667 (Antalya 25.8: Bump to 25.8.22), which backported upstream PR ClickHouse/ClickHouse#100582 (Use aws-sdk-cpp 1.11.771).

After the bump, every ClickHouse operation against real AWS S3 or GCS S3-compatible endpoints fails at startup with:

Code: 499. DB::Exception: Response checksums mismatch
This error happened for S3 disk. (S3_ERROR)

This causes the S3 disk's startup access check to fail, which in turn prevents ConfigReloader from loading the storage configuration, hanging the server's config reload until the regression test times out at 600s.

This issue:

  • Reproduces deterministically on real AWS S3 and real GCS — both architectures (x86 and arm64).
  • Does not reproduce on MinIO (MinIO doesn't enforce response checksums).
  • Was detected in our antalya-25.8 MasterCI runs, where every scenario in the S3_Aws_S3_2, S3_Gcs_2, Benchmark_Aws_S3, Benchmark_Gcs, Tiered_Storage_S3Amazon, and Tiered_Storage_S3Gcs jobs fails identically (~30 scenarios per arch).
  • Is already known upstream as ClickHouse/ClickHouse#103232, with an open backport PR ClickHouse/ClickHouse#103542 that has not yet been merged into 25.8.

How to reproduce the behavior

Environment

  • Version: 25.8.22.20001.altinityantalya (any 25.8.22+ build)
  • Storage: real AWS S3 or GCS S3-compatible endpoint (not MinIO)
  • Build type: Any (reproduces on release builds)

Steps

  1. Configure an S3 disk in storage.xml pointing at a real AWS S3 bucket:
<storage_configuration>
    <disks>
        <external>
            <type>s3</type>
            <endpoint>https://s3.<region>.amazonaws.com/<bucket>/data/benchmark/</endpoint>
            <access_key_id><key></access_key_id>
            <secret_access_key><secret></secret_access_key>
        </external>
    </disks>
    <policies>
        <external>
            <volumes><external><disk>external</disk></external></volumes>
        </external>
    </policies>
</storage_configuration>
  2. Start clickhouse-server. The disk's startup access check immediately triggers the failure — no user query is needed.

  3. Equivalent reproduction via our regression suite:

python3 -u s3/regression.py \
  --clickhouse docker://altinity/clickhouse-server:25.8.22.20001.altinityantalya \
  --storage aws_s3 \
  --only '/benchmark/aws_s3/*' \
  --log log.log
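For triage, the failure signature can also be pulled straight out of the server log. The helper below is a hypothetical script of ours (not part of the regression suite); it extracts the attempt counters from every `Response checksums mismatch` retry:

```python
import re

# Hypothetical triage helper (not part of the regression suite): scan a
# clickhouse-server log for the failure signature and return the
# (attempt, max_attempts) pair of every S3 retry that hit the error.
ATTEMPT_RE = re.compile(
    r"Attempt: (\d+)/(\d+).*?Response checksums mismatch",
    re.DOTALL,
)

def checksum_mismatch_attempts(log_text: str) -> list[tuple[int, int]]:
    """Return (attempt, max_attempts) for each 'Response checksums mismatch' hit."""
    return [(int(a), int(n)) for a, n in ATTEMPT_RE.findall(log_text)]
```

On an affected build this returns the full `(1, 4)` through `(4, 4)` sequence for each startup access check; on 25.8.21 it returns an empty list.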

Expected behavior

The S3 disk startup check should succeed and the server should load its storage configuration normally, as it did in 25.8.21 and earlier.


Actual behavior

The following behavior is taken from job Benchmark_Aws_S3 in MasterCI run 25091176162 on commit 5c9d523.

The startup ReadBufferFromS3 GET against the bucket fails with Response checksums mismatch on every retry (1/4 → 4/4):

[clickhouse1] 2026.04.29 07:55:32.591947 [ 678 ] {} <Debug> ReadBufferFromS3:
  Caught exception while reading S3 object.
  Bucket: [masked]:Secret(name='aws_s3_bucket'),
  Key: data/benchmark/kdc/ioxpdznoiwywnvsrcobnauawixgzw,
  Version: Latest, Offset: 0, Attempt: 1/4,
  Message: Code: 499. DB::Exception: Response checksums mismatch
  This error happened for S3 disk. (S3_ERROR)
  (version 25.8.22.20001.altinityantalya (altinity build))

After 4 retries the disk access check fails:

<Error> ConfigReloader: Error updating configuration from '/etc/clickhouse-server/config.xml':
  Code: 347. DB::Exception: Code: 499. DB::Exception: Response checksums mismatch
  This error happened for S3 disk: While checking access for disk external. (S3_ERROR)

Stack trace (relevant frames):

0. Poco::Exception::Exception(String const&, int)
1. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool)
2. DB::Exception::Exception(String const&, int, String, bool)
3. ./src/IO/S3Common.h:39: DB::S3Exception::S3Exception(String const&, Aws::S3::S3Errors)

The test then hangs 10 minutes waiting for ConfigReloader: Loaded config '/etc/clickhouse-server/config.xml', performed update on configuration and dies with ExpectTimeoutError: Timeout 600.000s — this is why every scenario in the suite collapses identically. The server is unusable, not the test logic.


Root cause analysis

The new aws-sdk-cpp 1.11.771 changed two checksum defaults:

  • requestChecksumCalculation = WHEN_SUPPORTED → SDK adds x-amz-checksum-algorithm: CRC32 on every request.
  • responseChecksumValidation = WHEN_SUPPORTED → SDK validates x-amz-checksum-* headers on every response.

GCS's S3-compatible API doesn't fully implement AWS Flexible Checksums, and on real AWS S3 the response-side CRC32 verification fails against ClickHouse's locally computed value. Both end up throwing Response checksums mismatch.
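To make the mechanism concrete, here is a minimal Python sketch of what response-side validation under the two modes amounts to. This is our own illustration, not SDK code: the function names are invented, and we assume (per the AWS Flexible Checksums format) that `x-amz-checksum-crc32` carries the base64-encoded big-endian CRC32 of the body.

```python
import base64
import zlib

def crc32_header(body: bytes) -> str:
    """x-amz-checksum-crc32 value: base64 of the 4-byte big-endian CRC32 of the body."""
    return base64.b64encode(zlib.crc32(body).to_bytes(4, "big")).decode()

def validate_response(body: bytes, headers: dict, mode: str) -> None:
    """Sketch of response checksum validation.

    mode="WHEN_SUPPORTED": validate whenever the server sent a checksum header
    (the new 1.11.771 default). mode="WHEN_REQUIRED": skip validation unless
    explicitly requested (the old behavior the fix restores).
    """
    claimed = headers.get("x-amz-checksum-crc32")
    if claimed is None or mode == "WHEN_REQUIRED":
        return
    if claimed != crc32_header(body):
        raise RuntimeError("Response checksums mismatch")
```

Under WHEN_SUPPORTED, any response whose header does not match the locally computed CRC32 of the received body raises; under WHEN_REQUIRED the same response passes untouched, which is why flipping the default back makes the startup GET succeed again.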

The follow-up patches inside PR #1667 (commits "Fix build", "One more change to adapt to a new SDK", "Fix Md5 checksums calculation") only override ShouldComputeContentMd5() and RequestChecksumRequired() on PutObjectRequest, UploadPartRequest, DeleteObjectRequest, and DeleteObjectsRequest. They do not disable the new SDK-level checksum defaults, which is why GETs (and the read-side access check) still trigger this error. The patch authors themselves left a /// TODO Understand what is it. Maybe we need it... comment on the related override, and a /// FIXME. Variadic arguments? comment on the new vaLog adapter — this slipped through.


Suggested fix

Backport upstream commit 659369ead95 ("Fix very weird issue") — equivalently, merge upstream backport PR ClickHouse/ClickHouse#103542 into the antalya-25.8 branch.

The fix sets, in PocoHTTPClientConfiguration:

checksumConfig.requestChecksumCalculation  = Aws::Client::RequestChecksumCalculation::WHEN_REQUIRED;
checksumConfig.responseChecksumValidation  = Aws::Client::ResponseChecksumValidation::WHEN_REQUIRED;

restoring the pre-1.11.771 behavior. This commit is already present in upstream 25.10.x and 26.3.x, but was never backported to 25.8.

| Version | aws-sdk-cpp 1.11.771 | WHEN_REQUIRED fix | Works on real AWS S3 / GCS |
|---|---|---|---|
| 25.8.21 | no | n/a | yes |
| 25.8.22 (PR #1667) | yes | ❌ missing | no |
| 25.10.x | yes | yes | yes |
| 26.3.x | yes | yes | yes |

Additional context

Related PR

  • Altinity/ClickHouse PR Antalya 25.8: Bump to 25.8.22 #1667 (Antalya 25.8: Bump to 25.8.22) — merge commit 5c9d52363de84ccdd439b7f2e20fae710921b26f, which contains the upstream aws-sdk-cpp 1.11.771 backport that introduces this regression.

Upstream references

CI failures

Affected jobs (every scenario fails identically, both x86 and arm64):

  • S3_Aws_S3_2
  • S3_Gcs_2
  • Benchmark_Aws_S3
  • Benchmark_Gcs
  • Tiered_Storage_S3Amazon
  • Tiered_Storage_S3Gcs

Failing MasterCI runs on antalya-25.8 (both on commit 5c9d523 from PR #1667):

| # | Run | Commit | Date |
|---|---|---|---|
| 1 | 25091176162 | 5c9d523 | 2026-04-29 |
| 2 | 24845940234 | 5c9d523 | 2026-04-23 |

The same scenarios pass on the previous commits (1eed78a, a36e131, 59dcdc0) before PR #1667 landed, and were re-run after #1667 landed and continued to fail — ruling out infrastructure flakiness.
