Benchmark first (threads: 1, 2, 4, 8):
1. float16 full precision
2. minmax8, minmax4
3. reranking (compute is negligible, store read is constant; measure what % of total latency it is responsible for)
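A minimal sketch of how such a run could be driven, assuming each configuration (float16, minmax8, minmax4, reranking) is exposed as a Python callable that takes one query. The names `run_benchmark`, `time_call`, `latency_share`, and `search_fn` are hypothetical and only illustrate the plan above; they are not taken from any existing code.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

THREAD_COUNTS = (1, 2, 4, 8)  # thread counts listed in the plan


def time_call(fn, *args):
    """Return the wall-clock seconds taken by one call to fn."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def run_benchmark(search_fn, queries, thread_counts=THREAD_COUNTS):
    """Map each thread count to the mean per-query latency in seconds.

    search_fn is a hypothetical stand-in for one configuration
    (e.g. a float16 search, a minmax8 search, or a rerank step).
    """
    results = {}
    for n_threads in thread_counts:
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            latencies = list(pool.map(lambda q: time_call(search_fn, q), queries))
        results[n_threads] = mean(latencies)
    return results


def latency_share(stage_seconds, total_seconds):
    """Percentage of total latency attributable to one stage."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return 100.0 * stage_seconds / total_seconds
```

For item 3, `latency_share(store_read_seconds, total_seconds)` gives the percentage the store read contributes. Python threads only run in parallel when the search call releases the GIL (native code or I/O), so this harness is meaningful mainly under that assumption.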