- 22 Nov, 2020 4 commits
-
-
Sven-Bodo Scholz authored
-
Sven-Bodo Scholz authored
-
Sven-Bodo Scholz authored
-
Sven-Bodo Scholz authored
-
- 21 Nov, 2020 3 commits
-
-
Sven-Bodo Scholz authored
-
Sven-Bodo Scholz authored
-
Sven-Bodo Scholz authored
-
- 20 Nov, 2020 1 commit
-
-
Sven-Bodo Scholz authored
-
- 19 Nov, 2020 2 commits
-
-
Sven-Bodo Scholz authored
funs done: MakeArgNode MakeBasetypeArg
-
Sven-Bodo Scholz authored
This is important since the additions of Rouland do not allow for externals to be nested without screwing up the code generation. We can tackle this later!
-
- 18 Nov, 2020 1 commit
-
-
Sven-Bodo Scholz authored
Actually, not much was missing here. However, the treatment of T_hidden received a massive conceptual overhaul. This was triggered by the observation that SACarg needs to be treated like a nested data structure. When ironing that out in compile.c, we can equally well make sure we add proper support for nesting throughout.
-
- 17 Nov, 2020 1 commit
-
-
Sven-Bodo Scholz authored
-
- 08 Nov, 2020 1 commit
-
-
Hans-Nikolai Viessmann authored
Fix mt fold sbs See merge request !129
-
- 07 Nov, 2020 2 commits
-
-
Sven-Bodo Scholz authored
-
Sven-Bodo Scholz authored
streamlined the explanation; added a section on the implementation and extracted the two Handle-functions as helpers
-
- 06 Nov, 2020 3 commits
-
-
Sven-Bodo Scholz authored
and put considerably more detail into the main comment of MTSPMDF. This is now feature complete.
-
Sven-Bodo Scholz authored
started re-writing SPMDF lifting to include a dec_rc on the neutral element after the lifted function
-
Sven-Bodo Scholz authored
By setting the rc in the stack copy of the descriptor to 2 we are safe now. I also injected tracing info so that it is easier to see what is going on just from the trace. As a consequence of this, the MT version now leaks one copy of the neutral element! This needs another fix. Finally, I added some comments in MTRMI to explain what exactly it does and to make that traversal quicker to understand :-)
-
- 05 Nov, 2020 4 commits
-
-
Hans-Nikolai Viessmann authored
[hwloc] fix error on no hwloc.h header See merge request !127
-
Sven-Bodo Scholz authored
-
Hans-Nikolai Viessmann authored
The declarations in cpubind.h still need the hwloc.h header file, regardless of whether we are compiling with HWLOC support. This commit fixes this.
-
Sven-Bodo Scholz authored
-
- 04 Nov, 2020 1 commit
-
-
Sven-Bodo Scholz authored
Hotfix for CUDA profiling See merge request !126
-
- 03 Nov, 2020 6 commits
-
-
Hans-Nikolai Viessmann authored
-
Sven-Bodo Scholz authored
Fix cuda mech (for cudaManaged) memcpy ICMs See merge request !109
-
Hans-Nikolai Viessmann authored
-
Hans-Nikolai Viessmann authored
This traversal was replaced by EMR-related traversals.
-
Hans-Nikolai Viessmann authored
-
Hans-Nikolai Viessmann authored
-
- 02 Nov, 2020 9 commits
-
-
Hans-Nikolai Viessmann authored
-
Hans-Nikolai Viessmann authored
and add some better documentation
-
Hans-Nikolai Viessmann authored
We now also add timers (using GPU timer) to measure the time for certain events on the GPU (kernel launches, memcpys, allocs, etc.).
-
Hans-Nikolai Viessmann authored
When moving to an ad-hoc macro cyclical traversal mechanism, the latest counter value(s) were never stored, meaning that we never cycled to a fix-point. This caused several transfers to be left in place which could otherwise have been elided. This bug also affected the algebraic wlfi traversal.
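The fix-point behaviour described above can be illustrated with a minimal sketch. It is not the sac2c implementation: `elide_one_transfer`, `cycle_to_fixpoint`, and the single integer counter are hypothetical stand-ins, chosen only to show why the latest counter value has to be stored before each cycle so that "nothing changed" can be detected.

```c
#include <stdbool.h>

/* Hypothetical pass: removes at most one redundant transfer per call and
   bumps the running optimisation counter when it does so. */
static void elide_one_transfer(int *transfers, int *opt_counter)
{
    if (*transfers > 0) {
        (*transfers)--;
        (*opt_counter)++;
    }
}

/* Repeat the pass until the counter stops changing (a fix-point) or
   max_cycles is reached. Returns the number of cycles executed. */
static int cycle_to_fixpoint(int *transfers, int max_cycles)
{
    int opt_counter = 0;
    int cycles = 0;
    bool changed = true;

    while (changed && cycles < max_cycles) {
        /* Store the latest counter value before the pass runs; without
           this snapshot there is nothing to compare against afterwards. */
        int before = opt_counter;

        elide_one_transfer(transfers, &opt_counter);
        changed = (opt_counter != before);
        cycles++;
    }
    return cycles;
}
```

Calling `cycle_to_fixpoint` with three removable transfers runs the pass until a cycle makes no further change. If the cycle limit is reached first, some transfers are left in place.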
-
Hans-Nikolai Viessmann authored
On some systems (a change in the Linux kernel, maybe?), filtering over HWLOC_OBJ_OS_DEVICE objects leads to a segfault. For HWLOC we do not rely on any _system_ devices, so leaving the filtering set to NONE should be fine. Also updated .gitignore
-
Hans-Nikolai Viessmann authored
-
Hans-Nikolai Viessmann authored
-
Hans-Nikolai Viessmann authored
We also add the count of kernel calls (might be useful).
-
Hans-Nikolai Viessmann authored
We can now measure the runtime (wall-clock time) of CUDA kernels using sac2c's built-in profiling system. We use the CUDA event system (GPU/device counters) to make the measurements. Performing the measurement itself is fairly cheap and has little effect on the runtime of the main() function. However, we do perform some costly summing up within the libsac/runtime libraries after we've reached the end of the program, which can take several seconds. At the moment, this CUDA timer feature only provides a total time for the entire program run, not a per-function breakdown. NOTE: we store the start/stop values in a linked list. We do this because we don't statically know how many times a kernel function will be called (in a conditional loop, for example). As such, a program launching thousands of kernels could consume a lot of memory. Alternatively, we could make the measurement at the point of the kernel launch and store only the elapsed time. This would require a sync on the device, which could destroy any performance gains from using the asynchronous or managed backend.
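The trade-off in the NOTE above can be sketched as follows. This is an illustration under stated assumptions, not the libsac code: `timing_node`, `record_launch`, and `total_and_free` are hypothetical names, and plain millisecond stamps stand in for the CUDA event handles the real runtime would store. Recording is a constant-time list prepend with no device sync, while the summing up is deferred to the end of the program and memory grows with the number of launches.

```c
#include <stdlib.h>

/* One recorded kernel launch. Plain millisecond stamps stand in for the
   start/stop values (CUDA events in the real runtime). */
typedef struct timing_node {
    float start_ms;
    float stop_ms;
    struct timing_node *next;
} timing_node;

/* Prepend a start/stop pair to the list. Returns the new head, or the
   old head unchanged if the allocation fails. */
static timing_node *record_launch(timing_node *head,
                                  float start_ms, float stop_ms)
{
    timing_node *n = malloc(sizeof *n);
    if (n == NULL) {
        return head;
    }
    n->start_ms = start_ms;
    n->stop_ms = stop_ms;
    n->next = head;
    return n;
}

/* End-of-program step: walk the list, sum the elapsed times, count the
   launches, and free every node. */
static float total_and_free(timing_node *head, unsigned *launches)
{
    float total = 0.0f;
    unsigned count = 0;

    while (head != NULL) {
        timing_node *next = head->next;
        total += head->stop_ms - head->start_ms;
        count++;
        free(head);
        head = next;
    }
    if (launches != NULL) {
        *launches = count;
    }
    return total;
}
```

Storing only an accumulated elapsed time would remove the list, but as the commit message points out, computing that value at launch time would require a device sync.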
-
- 23 Oct, 2020 2 commits
-
-
Hans-Nikolai Viessmann authored
-
Hans-Nikolai Viessmann authored
-