A regression was introduced in Altinity Antalya 25.8.22 by PR #1667 (Antalya 25.8: Bump to 25.8.22), which backported upstream PR ClickHouse/ClickHouse#100582 (Use aws-sdk-cpp 1.11.771).
After the bump, every ClickHouse operation against real AWS S3 or GCS S3-compatible endpoints fails at startup with:
Code: 499. DB::Exception: Response checksums mismatch
This error happened for S3 disk. (S3_ERROR)
This causes the S3 disk's startup access check to fail, which in turn prevents ConfigReloader from loading the storage configuration, hanging the server's config reload until the regression test times out at 600s.
This issue:
Reproduces deterministically on real AWS S3 and real GCS — both architectures (x86 and arm64).
Does not reproduce on MinIO (MinIO doesn't enforce response checksums).
Was detected in our antalya-25.8 MasterCI runs, where every scenario in the S3_Aws_S3_2, S3_Gcs_2, Benchmark_Aws_S3, Benchmark_Gcs, Tiered_Storage_S3Amazon, and Tiered_Storage_S3Gcs jobs fails identically (~30 scenarios per arch).
<Error> ConfigReloader: Error updating configuration from '/etc/clickhouse-server/config.xml':
Code: 347. DB::Exception: Code: 499. DB::Exception: Response checksums mismatch
This error happened for S3 disk: While checking access for disk external. (S3_ERROR)
The test then hangs 10 minutes waiting for ConfigReloader: Loaded config '/etc/clickhouse-server/config.xml', performed update on configuration and dies with ExpectTimeoutError: Timeout 600.000s — this is why every scenario in the suite collapses identically. The server is unusable, not the test logic.
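The failure signature in the log lines above is distinctive enough to detect mechanically, which could let a harness fail fast instead of waiting out the 600 s timeout. The following is a minimal sketch, not part of the regression suite: the function name `classify_startup_log` and its return labels are hypothetical, and the patterns are taken only from the messages quoted in this report.

```python
import re

# Patterns copied from the log messages quoted in this report.
CHECKSUM_FAILURE = re.compile(
    r"Code: 499\. DB::Exception: Response checksums mismatch"
)
CONFIG_LOADED = re.compile(
    r"ConfigReloader: Loaded config '[^']+', performed update on configuration"
)


def classify_startup_log(log_text: str) -> str:
    """Classify a server log excerpt (hypothetical helper).

    Returns "checksum_mismatch" if the S3 checksum error is present,
    "loaded" if the config reload completed, and "pending" otherwise.
    """
    if CHECKSUM_FAILURE.search(log_text):
        return "checksum_mismatch"
    if CONFIG_LOADED.search(log_text):
        return "loaded"
    return "pending"


SAMPLE_FAILURE = """\
<Error> ConfigReloader: Error updating configuration from '/etc/clickhouse-server/config.xml':
Code: 347. DB::Exception: Code: 499. DB::Exception: Response checksums mismatch
This error happened for S3 disk: While checking access for disk external. (S3_ERROR)
"""

SAMPLE_SUCCESS = (
    "ConfigReloader: Loaded config '/etc/clickhouse-server/config.xml', "
    "performed update on configuration"
)

if __name__ == "__main__":
    print(classify_startup_log(SAMPLE_FAILURE))
    print(classify_startup_log(SAMPLE_SUCCESS))
```

A harness polling the log could stop as soon as the result is no longer "pending", so a broken server is reported within seconds rather than after ten minutes.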
Root cause analysis
The new aws-sdk-cpp 1.11.771 changed two checksum defaults:
requestChecksumCalculation = WHEN_SUPPORTED → SDK adds x-amz-checksum-algorithm: CRC32 on every request.
responseChecksumValidation = WHEN_SUPPORTED → SDK validates x-amz-checksum-* headers on every response.
GCS's S3-compatible API doesn't fully implement AWS Flexible Checksums, and on real AWS S3 the response-side CRC32 verification fails against ClickHouse's locally computed value. Both end up throwing Response checksums mismatch.
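The effect of the response-side default can be illustrated with a toy model. This is a sketch under stated assumptions, not the SDK's code: `ChecksumMode`, `crc32_header_value`, and `validate_response` are hypothetical names, the model assumes the checksum header carries the base64-encoded big-endian CRC32 of the body, and it does not explain why the values disagree on real AWS S3 or GCS. It only shows why a WHEN_SUPPORTED policy surfaces a mismatch that a WHEN_REQUIRED policy would never look at.

```python
import base64
import zlib
from enum import Enum


class ChecksumMode(Enum):
    # Hypothetical stand-ins for the two SDK policy values named above.
    WHEN_SUPPORTED = "WHEN_SUPPORTED"
    WHEN_REQUIRED = "WHEN_REQUIRED"


def crc32_header_value(body: bytes) -> str:
    """Base64 of the big-endian CRC32 (assumed header encoding)."""
    digest = zlib.crc32(body).to_bytes(4, "big")
    return base64.b64encode(digest).decode("ascii")


def validate_response(
    body: bytes,
    headers: dict,
    mode: ChecksumMode,
    operation_requires_checksum: bool = False,
) -> bool:
    """Return True if the body was validated, False if validation was skipped.

    Raises ValueError on a mismatch, mirroring the error text in this report.
    """
    # WHEN_REQUIRED: validate only if the operation itself demands it.
    if mode is ChecksumMode.WHEN_REQUIRED and not operation_requires_checksum:
        return False
    advertised = headers.get("x-amz-checksum-crc32")
    if advertised is None:
        return False  # nothing to validate against
    if advertised != crc32_header_value(body):
        raise ValueError("Response checksums mismatch")
    return True


if __name__ == "__main__":
    body = b"abcd"
    # Header computed over different bytes, so it cannot match the body.
    headers = {"x-amz-checksum-crc32": crc32_header_value(b"abce")}
    print(validate_response(body, headers, ChecksumMode.WHEN_REQUIRED))
    try:
        validate_response(body, headers, ChecksumMode.WHEN_SUPPORTED)
    except ValueError as exc:
        print(exc)
```

In this model a plain GET (an operation that does not require a checksum) is validated only under WHEN_SUPPORTED, which matches the observation that the read-side access check is where the error appears.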
The follow-up patches inside PR #1667 (commits "Fix build", "One more change to adapt to a new SDK", "Fix Md5 checksums calculation") only override ShouldComputeContentMd5() and RequestChecksumRequired() on PutObjectRequest, UploadPartRequest, DeleteObjectRequest, and DeleteObjectsRequest. They do not disable the new SDK-level checksum defaults, which is why GETs (and the read-side access check) still trigger this error. The patch authors themselves left a /// TODO Understand what is it. Maybe we need it... comment on the related override, and a /// FIXME. Variadic arguments? comment on the new vaLog adapter — this slipped through.
Suggested fix
Backport upstream commit 659369ead95 ("Fix very weird issue") — equivalently, merge upstream backport PR ClickHouse/ClickHouse#103542 into the antalya-25.8 branch.
Altinity/ClickHouse PR #1667 (Antalya 25.8: Bump to 25.8.22), merge commit 5c9d52363de84ccdd439b7f2e20fae710921b26f, which contains the upstream aws-sdk-cpp 1.11.771 backport that introduces this regression.
The same scenarios pass on the commits preceding PR #1667 (1eed78a, a36e131, 59dcdc0). Re-runs after #1667 landed continued to fail, which rules out infrastructure flakiness.
✅ I checked the Altinity Stable Builds lifecycle table, and the Altinity Stable Build version I'm using is still supported.
Type of problem
Bug report - something's broken
Describe the situation
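A regression was introduced in Altinity Antalya 25.8.22 by PR #1667 (Antalya 25.8: Bump to 25.8.22), which backported upstream PR ClickHouse/ClickHouse#100582 (Use aws-sdk-cpp 1.11.771).
After the bump, every ClickHouse operation against real AWS S3 or GCS S3-compatible endpoints fails at startup with:
Code: 499. DB::Exception: Response checksums mismatch
This error happened for S3 disk. (S3_ERROR)
This causes the S3 disk's startup access check to fail, which in turn prevents ConfigReloader from loading the storage configuration, hanging the server's config reload until the regression test times out at 600s.
This issue:
Reproduces deterministically on real AWS S3 and real GCS, on both architectures (x86 and arm64).
Does not reproduce on MinIO (MinIO doesn't enforce response checksums).
Was detected in our antalya-25.8 MasterCI runs, where every scenario in the S3_Aws_S3_2, S3_Gcs_2, Benchmark_Aws_S3, Benchmark_Gcs, Tiered_Storage_S3Amazon, and Tiered_Storage_S3Gcs jobs fails identically (~30 scenarios per arch).
How to reproduce the behavior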
Environment
25.8.22.20001.altinityantalya (any 25.8.22+ build)
Steps
Use a storage.xml pointing at a real AWS S3 bucket.
Start clickhouse-server. The disk's startup access check immediately triggers the failure; no user query is needed.
Equivalent reproduction via our regression suite:
python3 -u s3/regression.py \
    --clickhouse docker://altinity/clickhouse-server:25.8.22.20001.altinityantalya \
    --storage aws_s3 \
    --only '/benchmark/aws_s3/*' \
    --log log.log
Expected behavior
The S3 disk startup check should succeed and the server should load its storage configuration normally, as it did in 25.8.21 and earlier.
Actual behavior
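The following behavior is taken from job Benchmark_Aws_S3 in MasterCI run 25091176162 on commit 5c9d523.
The startup ReadBufferFromS3 GET against the bucket fails with Response checksums mismatch on every retry (1/4 → 4/4).
After 4 retries the disk access check fails:
<Error> ConfigReloader: Error updating configuration from '/etc/clickhouse-server/config.xml':
Code: 347. DB::Exception: Code: 499. DB::Exception: Response checksums mismatch
This error happened for S3 disk: While checking access for disk external. (S3_ERROR)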
Stack trace (relevant frames):
The test then hangs 10 minutes waiting for ConfigReloader: Loaded config '/etc/clickhouse-server/config.xml', performed update on configuration and dies with ExpectTimeoutError: Timeout 600.000s, which is why every scenario in the suite collapses identically. The server is unusable, not the test logic.
Root cause analysis
The new aws-sdk-cpp 1.11.771 changed two checksum defaults:
requestChecksumCalculation = WHEN_SUPPORTED → SDK adds x-amz-checksum-algorithm: CRC32 on every request.
responseChecksumValidation = WHEN_SUPPORTED → SDK validates x-amz-checksum-* headers on every response.
GCS's S3-compatible API doesn't fully implement AWS Flexible Checksums, and on real AWS S3 the response-side CRC32 verification fails against ClickHouse's locally computed value. Both end up throwing Response checksums mismatch.
The follow-up patches inside PR #1667 (commits "Fix build", "One more change to adapt to a new SDK", "Fix Md5 checksums calculation") only override ShouldComputeContentMd5() and RequestChecksumRequired() on PutObjectRequest, UploadPartRequest, DeleteObjectRequest, and DeleteObjectsRequest. They do not disable the new SDK-level checksum defaults, which is why GETs (and the read-side access check) still trigger this error. The patch authors themselves left a /// TODO Understand what is it. Maybe we need it... comment on the related override, and a /// FIXME. Variadic arguments? comment on the new vaLog adapter, so this slipped through.
Suggested fix
Backport upstream commit 659369ead95 ("Fix very weird issue"); equivalently, merge upstream backport PR ClickHouse/ClickHouse#103542 into the antalya-25.8 branch.
The fix is applied in PocoHTTPClientConfiguration and restores the pre-1.11.771 behavior. This commit is already present in upstream 25.10.x and 26.3.x, but was never backported to 25.8.
WHEN_REQUIRED fix
Additional context
Related PR
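Altinity/ClickHouse PR #1667 (Antalya 25.8: Bump to 25.8.22), merge commit 5c9d52363de84ccdd439b7f2e20fae710921b26f, which contains the upstream aws-sdk-cpp 1.11.771 backport that introduces this regression.
Upstream references
ClickHouse/ClickHouse#100582 (Use aws-sdk-cpp 1.11.771)
CI failures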
Affected jobs (every scenario fails identically, both x86 and arm64):
S3_Aws_S3_2
S3_Gcs_2
Benchmark_Aws_S3
Benchmark_Gcs
Tiered_Storage_S3Amazon
Tiered_Storage_S3Gcs
Failing MasterCI runs on antalya-25.8 are both on commit 5c9d523 from PR #1667.
The same scenarios pass on the commits preceding PR #1667 (1eed78a, a36e131, 59dcdc0). Re-runs after #1667 landed continued to fail, which rules out infrastructure flakiness.