When the EMR loop optimisation (EMRL), which lifts allocations out of loop functions, was applied, the CUDA backend previously still issued host2device transfers and cudaMalloc/cudaFree calls for the lifted allocations, in effect negating the optimisation.
This MR adds a new traversal for the CUDA backend, called the Minimize EMR Transfers (MEMRT) optimisation, which finds functions that have had allocations lifted out (via EMRL) and lifts out the host2device primitives that reference EMRL-lifted variables. The effect is that we perform only one allocation on the device per lifted allocation, and perform no memory transfers within the loop. The MEMRT traversal is run after all other CUDA transfer minimization (see MTRAN - minimize_transfers.c).
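
As a rough illustration of the intended effect, the sketch below models the two situations in plain C. It is not compiler output and does not call the CUDA runtime: every name in it (DeviceStats, host2device, device_free, loop_body_before, loop_body_after, run_before, run_after) is a hypothetical stand-in, and the counters merely model cudaMalloc, cudaFree and host-to-device copies.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counters that model CUDA runtime activity. */
typedef struct {
    int device_allocs; /* models cudaMalloc calls */
    int device_frees;  /* models cudaFree calls */
    int transfers;     /* models host-to-device copies */
} DeviceStats;

/* Models host2device: allocate a "device" buffer and copy the host data in. */
static float *host2device(DeviceStats *s, const float *host, size_t n) {
    float *dev = malloc(n * sizeof *dev);
    assert(dev != NULL);
    memcpy(dev, host, n * sizeof *dev);
    s->device_allocs++;
    s->transfers++;
    return dev;
}

/* Models releasing the device buffer. */
static void device_free(DeviceStats *s, float *dev) {
    free(dev);
    s->device_frees++;
}

/* Without MEMRT: the loop function takes the EMRL-lifted host buffer,
   but still allocates and transfers on every call. */
static void loop_body_before(DeviceStats *s, const float *lifted, size_t n) {
    float *dev = host2device(s, lifted, n);
    /* ... a kernel launch using dev would go here ... */
    device_free(s, dev);
}

/* With MEMRT: the loop function takes the device buffer directly. */
static void loop_body_after(float *dev, size_t n) {
    (void)dev;
    (void)n;
    /* ... a kernel launch using dev would go here ... */
}

/* Caller without MEMRT; n must be greater than zero. */
DeviceStats run_before(int iterations, size_t n) {
    DeviceStats s = {0, 0, 0};
    float *lifted = calloc(n, sizeof *lifted); /* allocation lifted by EMRL */
    assert(lifted != NULL);
    for (int i = 0; i < iterations; i++) {
        loop_body_before(&s, lifted, n);
    }
    free(lifted);
    return s;
}

/* Caller with MEMRT; n must be greater than zero. */
DeviceStats run_after(int iterations, size_t n) {
    DeviceStats s = {0, 0, 0};
    float *lifted = calloc(n, sizeof *lifted); /* allocation lifted by EMRL */
    assert(lifted != NULL);
    float *dev = host2device(&s, lifted, n);   /* transfer lifted out of the loop */
    for (int i = 0; i < iterations; i++) {
        loop_body_after(dev, n);
    }
    device_free(&s, dev);
    free(lifted);
    return s;
}
```

In this model, run_before performs one device allocation and one transfer per iteration, while run_after performs a single allocation and a single transfer regardless of the iteration count, which is the behaviour MEMRT aims for.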