Nvidia GPUs have 64 or 128 execution units per SM, and they extract at most two instructions per clock of ILP, so 256 threads per block is plenty to keep an SM busy. It should make no difference on large arrays, and should help on small arrays (in the multi-GPU paper we had to manually reduce the thread count for nbody).
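As a quick sanity check (a minimal sketch; `myKernel` is a placeholder, and I assume no dynamic shared memory), the CUDA occupancy API can report how many blocks of a given size end up resident per SM:

```cuda
#include <cstdio>
#include <initializer_list>
#include <cuda_runtime.h>

// Placeholder kernel; stands in for whatever kernel is actually launched.
__global__ void myKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main() {
    for (int blockSize : {256, 1024}) {
        int numBlocks = 0;
        // Ask the runtime how many blocks of this size can be resident per SM
        // (assuming zero dynamic shared memory).
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel,
                                                      blockSize, 0);
        printf("block size %4d -> %d resident blocks (%d threads) per SM\n",
               blockSize, numBlocks, numBlocks * blockSize);
    }
    return 0;
}
```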
Benchmarks on an RTX 3070 Ti:
| Benchmark | 1024 threads/block | 256 threads/block |
|---|---|---|
| Yusuf (seconds) | 0.67 | 0.45 |
| Stencil (seconds) | 2.3 | 1.96 |
| Matmult (Gflops) | 790 | 700 |
So matmult is a bit slower, but I think a 256-thread (or 32 x 8) threadspace is overall a more sensible default.
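For concreteness, this is roughly what a 32 x 8 default would look like at a launch site (a sketch under assumed names; `scale2d`, `nx`, and `ny` are illustrative, not the actual benchmark code):

```cuda
#include <cuda_runtime.h>

// Illustrative 2D kernel; stands in for a stencil-style kernel.
__global__ void scale2d(float *a, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < nx && y < ny)
        a[y * nx + x] *= 2.0f;
}

int main() {
    const int nx = 4096, ny = 4096;
    float *a;
    cudaMalloc(&a, (size_t)nx * ny * sizeof(float));

    // 32 x 8 = 256 threads per block; grid rounded up to cover the domain.
    dim3 block(32, 8);
    dim3 grid((nx + block.x - 1) / block.x,
              (ny + block.y - 1) / block.y);
    scale2d<<<grid, block>>>(a, nx, ny);
    cudaDeviceSynchronize();

    cudaFree(a);
    return 0;
}
```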