Motivation
If it ain't broke, don't fix it, but the current implementation is difficult to integrate with the distributed memory backend.
Hierarchy of index sets
x = {[i, j] -> 0 | [0, 0] <= [i, j] < [n, m] step [6, 12] width w;
[i, j] -> 1 | [2, 0] <= [i, j] < [n, m] step [6, 12] width w};
This expression has one wlseg, 12 wlstride nodes, and each wlstride has one or two wlgrid children. I suppose wlseg is the [0, 0] <= iv < [n, m], but what is the rest? There are also wlblock and wlublock (unrolled block?).
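To make the index sets in the example concrete, here is a minimal sketch of which indices each generator selects. It assumes the usual step/width reading, where an index is selected when it lies within the bounds and its offset from the lower bound, modulo the step, is below the width in every dimension. The helper names `in_generator` and `label` and the value `w = [2, 12]` are made up for illustration; the sketch says nothing about how wlseg, wlstride and wlgrid nodes are actually built.

```python
def in_generator(iv, lb, ub, step, width):
    """True if iv is selected by `lb <= iv < ub step step width width`.

    Assumed semantics: per dimension, lb <= i < ub and (i - lb) % step < width.
    """
    return all(
        l <= i < u and (i - l) % s < w
        for i, l, u, s, w in zip(iv, lb, ub, step, width)
    )


def label(iv, n, m, w):
    """Value the example expression assigns to iv, or None if no generator covers it."""
    if in_generator(iv, [0, 0], [n, m], [6, 12], w):
        return 0
    if in_generator(iv, [2, 0], [n, m], [6, 12], w):
        return 1
    return None


# Hypothetical sizes: n = m = 12 and w = [2, 12].
# Rows 0-1 and 6-7 get 0, rows 2-3 and 8-9 get 1, the rest are uncovered.
rows = [label([i, 0], 12, 12, [2, 12]) for i in range(12)]
```

With these assumed values, each period of 6 rows contains a band of width 2 for the first generator, a band of width 2 for the second, and a gap of 2 rows that neither covers.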
Main change (TODO in next merge request)
A scheduler partitions [lb, ub) into a disjoint union \coprod_thread [MT_SCHEDULE_START (dim, thread), MT_SCHEDULE_STOP (dim, thread)) = [lb, ub).
This happens on the wlgrid level. A child on wlgrid is some subset I_{step, width, lb, ub} \subseteq [lb, ub). We must compute the first index of the intersection I_{step, width, lb, ub} \cap [MT_SCHEDULE_START (dim, thread), MT_SCHEDULE_STOP (dim, thread)). The compiler used to use the same variable for the start of the thread's range and for this first index. I think we should separate this into MT_SCHEDULE_FIRST_INDEX (or MT_SCHEDULE_FIRST_OFFSET for the N_idx).
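The arithmetic behind this can be sketched in one dimension as follows. This is only an illustration under assumptions: `block_partition` is a hypothetical stand-in for a scheduler (the real one may split the range differently), `start` and `stop` play the role of MT_SCHEDULE_START and MT_SCHEDULE_STOP for one thread, and `first_index` is a made-up name for the value proposed as MT_SCHEDULE_FIRST_INDEX.

```python
def block_partition(lb, ub, num_threads):
    """Split [lb, ub) into num_threads contiguous, disjoint half-open chunks."""
    size = ub - lb
    return [
        (lb + (size * t) // num_threads, lb + (size * (t + 1)) // num_threads)
        for t in range(num_threads)
    ]


def first_index(lb, ub, step, width, start, stop):
    """Smallest i in [start, stop) with lb <= i < ub and (i - lb) % step < width.

    Returns None when the intersection is empty.
    """
    i = max(start, lb)
    r = (i - lb) % step
    if r >= width:
        # i falls into the gap between two strides: jump to the next stride start.
        i += step - r
    return i if i < min(stop, ub) else None


# I_{step=6, width=2, lb=2, ub=24} = {2, 3, 8, 9, 14, 15, 20, 21}
chunks = block_partition(2, 24, 4)
firsts = [first_index(2, 24, 6, 2, start, stop) for start, stop in chunks]
```

The point of the separation is visible here: for every thread except the first, the chunk start (7, 13, 18) differs from the first index the thread actually has to touch (8, 14, 20), so a single variable cannot hold both.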
Small changes
Some simplifications in the arithmetic and macros.