sac-group / sac2c / Merge requests / !109

Fix cuda mech (for cudaManaged) memcpy ICMs

Merged Hans-Nikolai Viessmann requested to merge hans/sac2c:hans-cuda-mechs-fix2 into develop Apr 22, 2019
  • Commits 107
  • Changes 105

Ohhh boy, this MR became bigger than originally intended... ooops.

At this point it is too painful to split the commits into separate MRs.

This MR was originally meant to deal with improving the code generation for the CUDA managed memory case:

The existing ICMs for memcpying data to and from CUDA devices for the managed case performed a basic assignment from the old_value to the new_value. This was done out of expedience, and as it turns out was an unwise decision for the following reasons:

  • The assignment in practice just changes where the pointer points to. As managed is built on top of UVA, this should work fine, but it can have unintended consequences when the underlying pointers are not on the same unit (device or host), which leads to implicit copy operations. This is not good, as the copy operation itself is done in chunks, which causes quite a slowdown.
  • In either CUDA 9.2 or CUDA 10, this simple assignment does not work anymore. I don't know the exact reasons, but perhaps the CUDA devs noted the above problem, and now throw an exception for it.

The solution is to actually use cudaMemcpy, but with the cudaMemcpyDefault flag - this has the effect of implicitly resolving the UVA pointers and figuring out whether it needs to do an H2D, D2H, P2P, or standard memcpy. This has several advantages, such as not having to track the UVA pointers and figure out whether they resolve to host or device memory. Additionally, as we provide the data size to be memcpy'd, we avoid the slower chunk-based transfer overhead.
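
To make the difference concrete, here is a minimal sketch of an explicit copy between two managed buffers. It is not the ICM code from this MR: the helper name `copy_unified` and the `float` element type are made up for illustration, and the trailing `cudaDeviceSynchronize` is a conservative assumption rather than something the generated code necessarily does. Only the CUDA runtime calls themselves (`cudaMemcpy`, `cudaMemcpyDefault`, `cudaDeviceSynchronize`) are real API.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper, NOT a sac2c ICM: copies `count` floats from `src`
// to `dst`. Both pointers may refer to host, device, or managed memory,
// as long as the platform supports unified virtual addressing (UVA).
cudaError_t copy_unified(float *dst, const float *src, std::size_t count)
{
    // cudaMemcpyDefault asks the runtime to infer the transfer direction
    // from the pointer values, so the caller does not need to track where
    // each buffer currently lives. The full byte count is passed up front.
    cudaError_t err = cudaMemcpy(dst, src, count * sizeof(float),
                                 cudaMemcpyDefault);
    if (err != cudaSuccess) {
        return err;
    }

    // Assumption: wait for all outstanding device work before the host
    // touches `dst` again. This is a conservative choice for this sketch.
    return cudaDeviceSynchronize();
}
```

The old approach amounted to `new_value = old_value;`, which only aliases the pointer. The call above instead copies the payload in one request of known size, e.g. `copy_unified(new_value, old_value, n)` for two buffers obtained from `cudaMallocManaged`.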

After beginning to implement this, it became clear that further work was needed, with the following intentions:

  • introduce CUDA synching for some async (and managed) cases
  • fixes for DFmaps and LUTs
  • fixes for hwloc
  • improvement in CI warning/error handling
  • improve profiler for memory (add CUDA profiling)
  • add tests for various facilities in the compiler

Further details can be read in the commit history.

Edited Nov 02, 2020 by Hans-Nikolai Viessmann
Source branch: hans-cuda-mechs-fix2