Change default threads per block to 256. (!462) · sac-group / sac2c

Merged Thomas Koopman requested to merge thomas/sac2c:cuda-default-threads into develop Nov 28, 2025

NVIDIA GPUs have 64 or 128 execution units per SM and can exploit no more ILP than two instructions per clock, so 256 threads per block (8 warps of 32 threads) is plenty. It should make no difference on large arrays and should help on small arrays (in the multi-GPU paper, we had to manually reduce the thread count for nbody).
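A rough sketch of why a smaller block can help on small arrays: with one thread per element, a launch has ceil(n / threads_per_block) blocks, and each block runs on a single SM, so a launch with fewer blocks than SMs leaves some SMs idle. The SM count (48) and the array size below are illustrative assumptions, not values from this merge request, and the function names are hypothetical.

```python
def num_blocks(n_elements: int, threads_per_block: int) -> int:
    """Blocks needed for one thread per element (ceiling division)."""
    return (n_elements + threads_per_block - 1) // threads_per_block


def busy_sms(n_elements: int, threads_per_block: int, num_sms: int) -> int:
    """Upper bound on SMs that receive at least one block."""
    return min(num_blocks(n_elements, threads_per_block), num_sms)


NUM_SMS = 48        # assumption: SM count of the target GPU
N_ELEMENTS = 16384  # assumption: a "small" array

for tpb in (1024, 256):
    blocks = num_blocks(N_ELEMENTS, tpb)
    sms = busy_sms(N_ELEMENTS, tpb, NUM_SMS)
    print(f"{tpb:4d} threads/block: {blocks:3d} blocks, at most {sms} of {NUM_SMS} SMs busy")
```

Under these assumptions the 1024-thread configuration produces only 16 blocks, while the 256-thread configuration produces 64, enough to give every SM work.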

Benchmarks on an RTX 3070 Ti:

| Benchmark | 1024 threads/block | 256 threads/block |
|---|---|---|
| Yusuf (seconds) | 0.67 | 0.45 |
| Stencil (seconds) | 2.3 | 1.96 |
| Matmult (Gflops) | 790 | 700 |

So matmult is about 11% slower (700 vs. 790 Gflops), but I think that overall 256 threads per block (or a 32 x 8 thread space) is a more sensible default.
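The relative changes implied by the benchmark numbers can be checked with a few lines. This is only arithmetic on the reported values; the helper name `relative_change` is hypothetical. Note that the first two rows are run times (lower is better) and the last is throughput (higher is better).

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change going from `before` to `after` (negative = decrease)."""
    return (after - before) / before


# (name, unit, value at 1024 threads/block, value at 256 threads/block)
results = [
    ("Yusuf", "seconds", 0.67, 0.45),
    ("Stencil", "seconds", 2.3, 1.96),
    ("Matmult", "Gflops", 790.0, 700.0),
]

for name, unit, at_1024, at_256 in results:
    change = relative_change(at_1024, at_256)
    print(f"{name:8s} ({unit}): {at_1024} -> {at_256}, change {change:+.1%}")
```

Run time drops by roughly a third for Yusuf and by about 15% for Stencil, while Matmult throughput drops by about 11%.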

Edited Dec 05, 2025 by Thomas Koopman