sac-group / sac2c · Merge requests · !462

Draft: Change default threads per block to 256.

Open · Thomas Koopman requested to merge thomas/sac2c:cuda-default-threads into develop · Nov 28, 2025

Nvidia GPUs have 64 or 128 execution units per SM, and they sustain no more ILP than two instructions per clock, so 256 threads per block is plenty to keep an SM busy. On large arrays this should make no difference, and on small arrays it should help (in the multi-GPU paper we had to reduce the thread count manually for nbody).

Benchmarks on a 3070 Ti:

| Benchmark | 1024 threads | 256 threads |
| --- | --- | --- |
| Yusuf (seconds) | 0.67 | 0.45 |
| Stencil (seconds) | 2.3 | 1.96 |
| Matmul (Gflops) | 290 | 245 |

Oddly, matmul does turn out slower with 256 threads, even though the other two benchmarks improve.

Edited Nov 28, 2025 by Thomas Koopman
Source branch: cuda-default-threads