Motivation
If it ain't broke, don't fix it, but the current implementation is difficult to integrate with the distributed memory backend.
Hierarchy of index sets
x = {[i, j] -> 0 | [0, 0] <= [i, j] < [n, m] step [6, 12] width w;
[i, j] -> 1 | [2, 0] <= [i, j] < [n, m] step [6, 12] width w};
The expression above has one wlseg and 12 wlstride nodes, and each wlstride has one or two wlgrid sons. I suppose wlseg is the [0, 0] <= iv < [n, m], but what is the rest?
There are also wlblock and wlublock (unrolled block?).
Main change (TODO in next merge request)
A scheduler partitions [lb, ub) into a disjoint union \coprod_thread [MT_SCHEDULE_START (dim, thread), MT_SCHEDULE_STOP (dim, thread)) = [lb, ub).
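To make the partition concrete, here is a minimal sketch of one possible scheduler, assuming static block scheduling over a single dimension. The names sched_start and sched_stop are hypothetical stand-ins for MT_SCHEDULE_START and MT_SCHEDULE_STOP; the actual schedulers in the compiler may divide the range differently.

```c
#include <assert.h>

/* Hypothetical block scheduler for one dimension (an assumption, not the
 * compiler's code). Thread t of nthreads gets a contiguous chunk of [lb, ub).
 * The first (len % nthreads) threads get one extra element. */
int sched_start(int lb, int ub, int nthreads, int t)
{
    /* t == nthreads is allowed so that sched_stop can reuse this function */
    assert(nthreads > 0 && 0 <= t && t <= nthreads && lb <= ub);
    int len = ub - lb;
    int base = len / nthreads;
    int rem = len % nthreads;
    return lb + t * base + (t < rem ? t : rem);
}

/* The stop of thread t is the start of thread t + 1, so the chunks are
 * disjoint and their union is exactly [lb, ub). */
int sched_stop(int lb, int ub, int nthreads, int t)
{
    return sched_start(lb, ub, nthreads, t + 1);
}
```

Defining the stop of one thread as the start of the next makes the disjoint-union property hold by construction.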
This happens on the wlgrid level. A child on wlgrid is some subset I_{step, width, lb, ub} \subseteq [lb, ub). We must compute the first index of the intersection I_{step, width, lb, ub} \cap [MT_SCHEDULE_START (dim, thread), MT_SCHEDULE_STOP (dim, thread)). The compiler used to use the same variable for this. I think we should separate this into MT_SCHEDULE_FIRST_INDEX (or MT_SCHEDULE_FIRST_OFFSET for the N_idx).
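A sketch of the first-index computation for one dimension, assuming I_{step, width, lb, ub} = { i in [lb, ub) : (i - lb) mod step < width } and lb <= start <= stop <= ub (so ub itself is not needed). The function name first_index is hypothetical; it only illustrates what a separate MT_SCHEDULE_FIRST_INDEX would hold.

```c
#include <assert.h>

/* Assumed definition: I = { i in [lb, ub) : (i - lb) % step < width }.
 * Returns the smallest i in I with start <= i < stop,
 * or stop if the intersection is empty. */
int first_index(int lb, int step, int width, int start, int stop)
{
    assert(step > 0 && 0 < width && width <= step && lb <= start);
    int r = (start - lb) % step;  /* position of start within its period */
    /* Inside the width window: start itself is the first index.
     * Otherwise: skip ahead to the beginning of the next period. */
    int first = (r < width) ? start : start + (step - r);
    return first < stop ? first : stop;
}
```

For example, with lb = 2, step = 6, width = 2 the set is {2, 3, 8, 9, 14, 15, ...}; a thread scheduled on [4, 10) starts at index 8, while its MT_SCHEDULE_START stays 4, which is why the two values want separate variables.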
Small changes
Some simplifications in the arithmetic and macros.