I have this data
SELECT *
FROM s3Cluster(replicated_cluster, 'http://minio:9000/warehouse/data3/data/**.parquet', 'admin', 'password')
1. │ true │ 1000 │ 456.78 │ Alice │ 2024-01-01 12:00:00.000000 │ 2024-01-01 │ 1970-01-01 12:00:00.000000 │ 2024-01-01 12:00:00.000000 │ 1000 │ 456.78 │ 456.78 │
2. │ false │ 2000 │ 456.78 │ Bob │ 2024-01-01 12:00:00.000000 │ 2024-01-01 │ 1970-01-01 12:00:00.000000 │ 2024-01-01 12:00:00.000000 │ 2000 │ 456.78 │ 456.78 │
3. │ true │ 1000 │ 456.78 │ Alice │ 2024-01-01 12:00:00.000000 │ 2024-01-01 │ 1970-01-01 12:00:00.000000 │ 2024-01-01 12:00:00.000000 │ 1000 │ 456.78 │ 456.78 │
4. │ false │ 2000 │ 456.78 │ Bob │ 2024-01-01 12:00:00.000000 │ 2024-01-01 │ 1970-01-01 12:00:00.000000 │ 2024-01-01 12:00:00.000000 │ 2000 │ 456.78 │ 456.78 │
5. │ true │ 1000 │ 456.78 │ Alice │ 2024-01-01 12:00:00.000000 │ 2024-01-01 │ 1970-01-01 12:00:00.000000 │ 2024-01-01 12:00:00.000000 │ 1000 │ 456.78 │ 456.78 │
6. │ false │ 2000 │ 456.78 │ Bob │ 2024-01-01 12:00:00.000000 │ 2024-01-01 │ 1970-01-01 12:00:00.000000 │ 2024-01-01 12:00:00.000000 │ 2000 │ 456.78 │ 456.78 │
7. │ true │ 1000 │ 456.78 │ Alice │ 2024-01-01 12:00:00.000000 │ 2024-01-01 │ 1970-01-01 12:00:00.000000 │ 2024-01-01 12:00:00.000000 │ 1000 │ 456.78 │ 456.78 │
8. │ false │ 2000 │ 456.78 │ Bob │ 2024-01-01 12:00:00.000000 │ 2024-01-01 │ 1970-01-01 12:00:00.000000 │ 2024-01-01 12:00:00.000000 │ 2000 │ 456.78 │ 456.78 │
└─────────────┴──────────┴────────────┴────────────┴────────────────────────────┴────────────┴────────────────────────────┴────────────────────────────┴─────────────┴───────────┴─────────────┘
8 rows in set. Elapsed: 0.119 sec.
SELECT *
FROM icebergS3Cluster(replicated_cluster, 'http://minio:9000/warehouse/data1', 'admin', 'password')
1. │ true │ 1000 │ 456.78 │ Alice │ 2024-01-01 13:00:00.000000 │ 2024-01-01 │ 43200000000 │ 2024-01-01 12:00:00.000000 │ 1000 │ 456.78 │ 456.78 │
2. │ false │ 2000 │ 456.78 │ Bob │ 2024-01-01 13:00:00.000000 │ 2024-01-01 │ 43200000000 │ 2024-01-01 12:00:00.000000 │ 2000 │ 456.78 │ 456.78 │
3. │ true │ 1000 │ 456.78 │ Alice │ 2024-01-01 13:00:00.000000 │ 2024-01-01 │ 43200000000 │ 2024-01-01 12:00:00.000000 │ 1000 │ 456.78 │ 456.78 │
4. │ false │ 2000 │ 456.78 │ Bob │ 2024-01-01 13:00:00.000000 │ 2024-01-01 │ 43200000000 │ 2024-01-01 12:00:00.000000 │ 2000 │ 456.78 │ 456.78 │
5. │ true │ 1000 │ 456.78 │ Alice │ 2024-01-01 13:00:00.000000 │ 2024-01-01 │ 43200000000 │ 2024-01-01 12:00:00.000000 │ 1000 │ 456.78 │ 456.78 │
6. │ false │ 2000 │ 456.78 │ Bob │ 2024-01-01 13:00:00.000000 │ 2024-01-01 │ 43200000000 │ 2024-01-01 12:00:00.000000 │ 2000 │ 456.78 │ 456.78 │
7. │ true │ 1000 │ 456.78 │ Alice │ 2024-01-01 13:00:00.000000 │ 2024-01-01 │ 43200000000 │ 2024-01-01 12:00:00.000000 │ 1000 │ 456.78 │ 456.78 │
8. │ false │ 2000 │ 456.78 │ Bob │ 2024-01-01 13:00:00.000000 │ 2024-01-01 │ 43200000000 │ 2024-01-01 12:00:00.000000 │ 2000 │ 456.78 │ 456.78 │
└─────────────┴──────────┴────────────┴────────────┴────────────────────────────┴────────────┴─────────────┴────────────────────────────┴─────────────┴───────────┴─────────────┘
8 rows in set. Elapsed: 0.132 sec.
I also have merge tree tables with same exact data and expected join result is calculated using those merge tree tables:
SELECT *
FROM table_3898404a_3a8b_11f1_995b_de7b9eea348f AS t1
ANY INNER JOIN table_38c3eb3c_3a8b_11f1_995b_de7b9eea348f AS t2 ON (t1.boolean_col = t2.boolean_col) AND (t1.string_col = t2.string_col)
ORDER BY tuple(*) ASC
FORMAT TabSeparated
Query id: 46d4e498-66ab-4038-a7cb-21572550a760
false 2000 456.78 Bob 2024-01-01 12:00:00.000000 2024-01-01 1970-01-01 12:00:00.000000 2024-01-01 12:00:00.000000 2000 456.78 456.78 false 2000 456.78 Bob 2024-01-01 13:00:00.000000 2024-01-01 43200000000 2024-01-01 12:00:00.000000 2000 456.78 456.78
true 1000 456.78 Alice 2024-01-01 12:00:00.000000 2024-01-01 1970-01-01 12:00:00.000000 2024-01-01 12:00:00.000000 1000 456.78 456.78 true 1000 456.78 Alice 2024-01-01 13:00:00.000000 2024-01-01 43200000000 2024-01-01 12:00:00.000000 1000 456.78 456.78
2 rows in set. Elapsed: 0.010 sec.
The same join but with object_storage_cluster_join_mode = 'global', object_storage_cluster = 'replicated_cluster' returnes 6 rows instead of 2:
SELECT *
FROM s3Cluster(replicated_cluster, 'http://minio:9000/warehouse/data3/data/**.parquet', 'admin', 'password') AS t1
ANY INNER JOIN icebergS3Cluster(replicated_cluster, 'http://minio:9000/warehouse/data1', 'admin', 'password') AS t2 ON (t1.boolean_col = t2.boolean_col) AND (t1.string_col = t2.string_col)
ORDER BY tuple(*) ASC
SETTINGS object_storage_cluster_join_mode = 'global', object_storage_cluster = 'replicated_cluster'
FORMAT TabSeparated
Query id: af48112f-995c-47fe-b5e1-a5466955f9be
false 2000 456.78 Bob 2024-01-01 12:00:00.000000 2024-01-01 1970-01-01 12:00:00.000000 2024-01-01 12:00:00.000000 2000 456.78 456.78 false 2000 456.78 Bob 2024-01-01 13:00:00.000000 2024-01-01 43200000000 2024-01-01 12:00:00.000000 2000 456.78 456.78
false 2000 456.78 Bob 2024-01-01 12:00:00.000000 2024-01-01 1970-01-01 12:00:00.000000 2024-01-01 12:00:00.000000 2000 456.78 456.78 false 2000 456.78 Bob 2024-01-01 13:00:00.000000 2024-01-01 43200000000 2024-01-01 12:00:00.000000 2000 456.78 456.78
false 2000 456.78 Bob 2024-01-01 12:00:00.000000 2024-01-01 1970-01-01 12:00:00.000000 2024-01-01 12:00:00.000000 2000 456.78 456.78 false 2000 456.78 Bob 2024-01-01 13:00:00.000000 2024-01-01 43200000000 2024-01-01 12:00:00.000000 2000 456.78 456.78
true 1000 456.78 Alice 2024-01-01 12:00:00.000000 2024-01-01 1970-01-01 12:00:00.000000 2024-01-01 12:00:00.000000 1000 456.78 456.78 true 1000 456.78 Alice 2024-01-01 13:00:00.000000 2024-01-01 43200000000 2024-01-01 12:00:00.000000 1000 456.78 456.78
true 1000 456.78 Alice 2024-01-01 12:00:00.000000 2024-01-01 1970-01-01 12:00:00.000000 2024-01-01 12:00:00.000000 1000 456.78 456.78 true 1000 456.78 Alice 2024-01-01 13:00:00.000000 2024-01-01 43200000000 2024-01-01 12:00:00.000000 1000 456.78 456.78
true 1000 456.78 Alice 2024-01-01 12:00:00.000000 2024-01-01 1970-01-01 12:00:00.000000 2024-01-01 12:00:00.000000 1000 456.78 456.78 true 1000 456.78 Alice 2024-01-01 13:00:00.000000 2024-01-01 43200000000 2024-01-01 12:00:00.000000 1000 456.78 456.78
6 rows in set. Elapsed: 0.427 sec.
I have this data
I also have merge tree tables with same exact data and expected join result is calculated using those merge tree tables:
The same join but with object_storage_cluster_join_mode = 'global', object_storage_cluster = 'replicated_cluster' returnes 6 rows instead of 2: