Nvidia GPUs have 64 or 128 execution units per SM, and they exploit no more ILP than two instructions per clock, so 256 threads per block is plenty to keep an SM busy. The smaller block size should make no difference on large arrays and should help on small arrays (in the multi-GPU paper, we had to manually reduce the thread count for nbody).
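To illustrate the small-array case, here is a rough host-side sketch of how block size changes the number of blocks available to spread over SMs. It assumes one element per thread and at most one SM counted per block. The helper names `blocks_for` and `busy_sms`, the array size, and the SM count of 48 are illustrative assumptions, not values taken from the benchmarks.

```cpp
#include <algorithm>

// Blocks needed to cover n elements with one element per thread
// (ceiling division).
constexpr long blocks_for(long n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Upper bound on SMs that can receive at least one block.
// sm_count is an assumed parameter; query the real value at runtime.
constexpr long busy_sms(long n, int threads_per_block, long sm_count) {
    return std::min(blocks_for(n, threads_per_block), sm_count);
}

// Hypothetical small array of 12288 elements on a device assumed to have 48 SMs:
// 1024 threads per block gives 12 blocks, so at most 12 SMs get work,
// while 256 threads per block gives 48 blocks, enough for every SM.
static_assert(busy_sms(12288, 1024, 48) == 12, "1024 threads per block");
static_assert(busy_sms(12288, 256, 48) == 48, "256 threads per block");
```

For large arrays both block sizes produce far more blocks than SMs, which is consistent with the expectation that the change makes no difference there. This sketch ignores register and shared-memory limits, so it is only an upper bound on how many SMs are busy.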
Benchmarks on an RTX 3070 Ti, by threads per block:
| Benchmark | 1024 | 256 |
|---|---|---|
| Yusuf (seconds) | 0.67 | 0.45 |
| Stencil (seconds) | 2.3 | 1.96 |
| Matmul (Gflops) | 290 | 245 |
Oddly, matmul does turn out slower with 256 threads per block (245 vs. 290 Gflops).