Built on top of !360 (merged)
## Motivation
With-loops are indexed by an (unrolled) index vector `SACl_i, SACl_j, ...` and by a flat index, also called `wlidx` or offset. The current implementation sets the offset once and keeps it up to date through ICMs such as `ADJUST_OFFSET`, `SET_OFFSET`, `ADJUST_SCHEDULER`, etc. This is difficult because an index set `(lb <= iv < ub step s width w)` may not cleanly intersect the part of the index space that is local to a thread `(SAC_MT_WL_SCHEDULE_START <= iv < SAC_MT_WL_SCHEDULE_STOP)`. For example, with `lb = 0`, `ub = 100`, step `5`, width `1` and a thread-local range `[37, 63)`, the first index the thread may legally execute is `40`, not `37`, so the offset has to be readjusted at runtime. These adjustments are omitted when the compiler can prove they are unnecessary. This is already quite complicated, and adding another level to the scheduling hierarchy for distributed memory is infeasible in this way.
## High-level change
This merge request refactors the way we compile with2 nodes. This is also documented in `doc/schedule_design.txt`. The new design is as follows.
### Compilation of with-loop
#### wlseg
For each dimension `i` we have variables `SAC_WL_SCHEDULE_START[i]` and `SAC_WL_SCHEDULE_STOP[i]`, determining which part of the index space the program executor is responsible for. For the sequential backend these are simply `idxinf` and `idxsup`; for the multithreaded backend they delimit the part a thread is responsible for; and for distmem they can delimit the part of a node, or of a thread within a node.
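To make the contract concrete, here is a minimal sketch of a scheduler that could fill these variables, assuming a block distribution over dimension 0. The function and the names `tid` and `num_threads` are hypothetical, not the generated code; the pointers play the role of `SAC_WL_SCHEDULE_START[0]` and `SAC_WL_SCHEDULE_STOP[0]`:

```c
/* Hypothetical block schedule over dimension 0: thread tid out of
 * num_threads receives one contiguous block of [idxinf, idxsup). */
static void block_schedule (int idxinf, int idxsup, int tid, int num_threads,
                            int *start, int *stop)
{
    int len = idxsup - idxinf;

    *start = idxinf + (len * tid) / num_threads;
    *stop = idxinf + (len * (tid + 1)) / num_threads;
}
```

For the sequential backend the same contract is satisfied trivially by writing `idxinf` and `idxsup` into the bounds of every dimension.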
#### wlstride
As a wlstride contains a step, `SAC_WL_SCHEDULE_START[i]` may not be a legal index. We set `SAC_WL_SCHEDULE_STRIDE_START[i]` to the smallest legal index that is at least `SAC_WL_SCHEDULE_START[i]`, and similarly `SAC_WL_SCHEDULE_STRIDE_STOP[i]` from `SAC_WL_SCHEDULE_STOP[i]`.
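A sketch of that rounding, assuming a stride with lower bound `lb` and step `step`, and ignoring the grid width for simplicity (the helper name is mine, not the generated code):

```c
/* Smallest index on the stride grid (lb, lb+step, lb+2*step, ...)
 * that is not smaller than start. */
static int stride_start (int start, int lb, int step)
{
    if (start <= lb) {
        return lb;
    }
    int rem = (start - lb) % step;
    return rem == 0 ? start : start + (step - rem);
}
```

For example, `stride_start (37, 0, 5)` yields `40`, matching the example in the motivation.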
These variables set the index vector, but we also have the flat index, sometimes called offset (`wlidx`). In the innermost loop, this needs to be set to the start of the slice.
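For a row-major rank-2 result this amounts to something like the following (illustrative only; `shape1` stands for the extent of the second dimension):

```c
/* Flat index of the first element of the slice owned by this executor. */
wlidx = SAC_WL_SCHEDULE_STRIDE_START[0] * shape1
        + SAC_WL_SCHEDULE_STRIDE_START[1];
```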
#### wlgrid
This increments `wlidx`, usually by 1, and sometimes by more if we skip over a `NOOP` grid.
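A sketch of how a grid advances the flat index; all names here are mine, not the generated code:

```c
/* Visit one grid [grid_lb, grid_ub) and return the advanced flat index. */
static int visit_grid (int wlidx, int grid_lb, int grid_ub, int noop)
{
    if (noop) {
        /* NOOP grid: no body is emitted, skip the whole width at once. */
        return wlidx + (grid_ub - grid_lb);
    }
    for (int j = grid_lb; j < grid_ub; j++) {
        /* ... code for one element at flat index wlidx ... */
        wlidx += 1;
    }
    return wlidx;
}
```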
## Distributed memory backend
There are several challenges for the distributed memory backend:

- The distributed memory backend divides the work for a with2 node, not for a wlseg node.
- We can only call `ShrayStart` on arrays that are distributed, and this decision is made at runtime.
- For genarray and modarray the start and end are computed with `ShrayStart` and `ShrayEnd`, but for fold they are not.
This means we cannot implement the distribution as an ordinary scheduler. Instead, we use the following recipe (a sketch of the last two steps follows the list):

- Before we start the compilation of wlseg nodes, we compute `SAC_wl_global_ub0`, the global upper bound in the first dimension. For genarray this is `SAC_ND_A_SIZE (var_NT, 0)`, and for a fold-loop it is the maximum of `WLSEG_IDXSUP[0]` over all segments.
- In `COMPwith2` we decide whether or not to distribute, and set `SAC_wl_global_start0` and `SAC_wl_global_stop0`. If the loop is not distributed, these are `0` and `SAC_wl_global_ub0`. If it is a distributed genarray or modarray, they are determined with `Shray(Start|End)`. If it is a distributed fold-loop, we implement a block schedule ourselves.
- In the actual scheduler, we intersect the `[lower, upper)` bounds that are passed in with `[SAC_wl_global_start0, SAC_wl_global_stop0)`.
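A sketch of the last two steps, under my reading of the recipe; `rank` and `num_nodes` are hypothetical stand-ins for the node id and node count:

```c
/* Distributed fold-loop: block-schedule [0, SAC_wl_global_ub0) over the
 * nodes ourselves. */
SAC_wl_global_start0 = (SAC_wl_global_ub0 * rank) / num_nodes;
SAC_wl_global_stop0 = (SAC_wl_global_ub0 * (rank + 1)) / num_nodes;

/* Every scheduler then clamps the [lower, upper) range it is handed
 * against these node-global bounds. */
if (lower < SAC_wl_global_start0) lower = SAC_wl_global_start0;
if (upper > SAC_wl_global_stop0) upper = SAC_wl_global_stop0;
```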
## Upshot of change
Schedulers are now free to set `SAC_WL_SCHEDULE_(START|STOP)` to any partition of the index space, and the rest of the compilation process does not need to be modified.
## Performance
I assume the reason for the old implementation was performance. I cannot imagine a scenario where this change has a measurable effect, as we only add a few integer operations outside of the tight loop. Benchmarks on the code I have lying around (PDE, CFAL) confirm this.