Ohhh boy, this MR became bigger than originally intended... oops.
At this point it is too painful to split the commits into separate MRs.
This MR was originally meant to deal with improving the code generation for the CUDA managed memory case:
The existing ICMs for memcpying data to and from CUDA devices for the managed case performed a basic assignment from the old_value to the new_value. This was done out of expedience and, as it turns out, was an unwise decision for the following reasons:
- In practice, the assignment just changes where the pointer points. As managed memory is built on top of UVA, this should work fine, but it can have unintended consequences when the underlying pointers are not on the same unit (device or host), which leads to an implicit copy operation. This is not good, as the copy operation itself is done in chunks, which causes quite a slowdown.
- In either CUDA 9.2 or CUDA 10, this simple assignment does not work anymore. I don't know the exact reasons, but perhaps the CUDA devs noticed the above problem and now throw an exception for it.
The solution is to actually use cudaMemcpy, but with the cudaMemcpyDefault flag. This has the effect of implicitly resolving the UVA pointers and figuring out whether it needs to do an H2D, D2H, P2P, or standard memcpy. This has several advantages, such as not having to track the UVA pointers and figure out whether they resolve to host or device memory. Additionally, as we provide the data size to be memcpy'd, we avoid the slower chunk-based transfer overhead.
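As a rough illustration of the idea (not the actual ICM code in this MR), the copy could look something like the sketch below. The helper names `copy_uva` and `managed_copy_demo` are made up for this example, and the buffers are allocated with cudaMallocManaged purely for demonstration; the only point is that the direction argument is cudaMemcpyDefault and the size is passed explicitly.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical helper (not the real ICM): copy `bytes` bytes between two
 * UVA pointers. cudaMemcpyDefault lets the runtime infer the direction
 * (H2H, H2D, D2H or D2D) from the pointer values themselves. */
static cudaError_t copy_uva(void *dst, const void *src, size_t bytes)
{
    return cudaMemcpy(dst, src, bytes, cudaMemcpyDefault);
}

/* Hypothetical demo: allocate two managed buffers, fill one on the host,
 * copy it into the other, and check the result. Returns 0 on success. */
static int managed_copy_demo(size_t n)
{
    int *old_value = NULL;
    int *new_value = NULL;
    size_t bytes = n * sizeof(int);
    int status = 0;

    if (cudaMallocManaged((void **)&old_value, bytes, cudaMemAttachGlobal) != cudaSuccess)
        return 1;
    if (cudaMallocManaged((void **)&new_value, bytes, cudaMemAttachGlobal) != cudaSuccess) {
        cudaFree(old_value);
        return 1;
    }

    for (size_t i = 0; i < n; i++)
        old_value[i] = (int)i;

    /* An explicit, sized copy instead of `new_value = old_value`. */
    if (copy_uva(new_value, old_value, bytes) != cudaSuccess)
        status = 1;

    /* Make sure the device is idle before the host touches managed memory. */
    if (cudaDeviceSynchronize() != cudaSuccess)
        status = 1;

    for (size_t i = 0; status == 0 && i < n; i++)
        if (new_value[i] != (int)i)
            status = 1;

    cudaFree(new_value);
    cudaFree(old_value);
    return status;
}
```

Calling `managed_copy_demo(1024)` from a `main` compiled with nvcc should return 0 on a machine with a UVA-capable device. The two buffers stay distinct allocations throughout, which is the difference from the old pointer assignment.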
After beginning to implement this, it became clear that further work was needed, with the following intentions:
- introduce CUDA syncing for some async (and managed) cases
- fixes for DFmaps and LUTs
- fixes for hwloc
- improve CI warning/error handling
- improve the memory profiler (add CUDA profiling)
- add tests for various facilities in the compiler
Further details can be read in the commit history.