sac-group / sac2c · Merge request !462

Change default threads per block to 256.

Open · Thomas Koopman requested to merge thomas/sac2c:cuda-default-threads into develop · Nov 28, 2025

Nvidia GPUs have 64 or 128 execution units per SM, and they have no more instruction-level parallelism than two instructions per clock, so 256 threads per block is plenty to keep an SM busy. It should make no difference on large arrays and should help on small arrays (in the multi-GPU paper we had to manually reduce the thread count for nbody).
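To make the occupancy argument concrete, here is a small sketch of the resident-block arithmetic for one Ampere SM (compute capability 8.6, as on the 3070 Ti benchmarked below, which allows at most 1536 resident threads and 16 resident blocks per SM). The numbers are from the CUDA occupancy tables; register and shared-memory limits are ignored for simplicity.

```python
# Per-SM limits for compute capability 8.6 (e.g. RTX 3070 Ti).
MAX_THREADS_PER_SM = 1536
MAX_BLOCKS_PER_SM = 16

def resident_blocks(threads_per_block):
    """Blocks of this size that fit on one SM at once
    (ignoring register and shared-memory pressure)."""
    return min(MAX_THREADS_PER_SM // threads_per_block, MAX_BLOCKS_PER_SM)

def occupancy(threads_per_block):
    """Fraction of the SM's resident-thread capacity actually used."""
    return resident_blocks(threads_per_block) * threads_per_block / MAX_THREADS_PER_SM

for tpb in (1024, 256):
    print(tpb, resident_blocks(tpb), f"{occupancy(tpb):.0%}")
# 1024-thread blocks: only 1 fits (67% occupancy);
# 256-thread blocks: 6 fit (100% occupancy).
```

With 1024-thread blocks, a single block monopolizes an SM and leaves a third of its thread slots empty; with 256-thread blocks, six blocks coexist and small grids also spread across more SMs.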

Benchmarks on a 3070 Ti:

| Benchmark         | 1024 threads | 256 threads |
| ----------------- | ------------ | ----------- |
| Yusuf (seconds)   | 0.67         | 0.45        |
| Stencil (seconds) | 2.3          | 1.96        |
| Matmult (Gflops)  | 790          | 700         |

So matmult is a bit slower, but overall I think 256 threads (a 32 × 8 thread space) is a more sensible default.
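For a 2D thread space, a 32 × 8 block keeps the x dimension at the warp width of 32 for coalesced accesses. A minimal sketch of the corresponding grid-size arithmetic (the actual mapping of with-loops to blocks is sac2c-internal; `grid_dims` is a hypothetical helper):

```python
def grid_dims(rows, cols, block=(8, 32)):
    """Grid shape covering a rows x cols index space with
    (y, x)-shaped blocks, i.e. ceiling division per dimension."""
    by, bx = block
    return ((rows + by - 1) // by, (cols + bx - 1) // bx)

print(grid_dims(1000, 1000))  # (125, 32): 125 blocks of 8 rows, 32 blocks of 32 columns
```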

Edited Dec 05, 2025 by Thomas Koopman
Source branch: cuda-default-threads