Built on top of !360 (merged)
## Motivation
With-loops are indexed by an (unrolled) index vector `SACl_i, SACl_j, ...` and by a flat index, also called `wlidx` or offset. The current implementation sets the offset once and keeps it up to date through ICMs such as `ADJUST_OFFSET`, `SET_OFFSET`, `ADJUST_SCHEDULER`, etc. This is difficult because an index set `(lb <= iv < ub step s width w)` may not cleanly intersect the part of the index space that is local to a thread `(SAC_MT_WL_SCHEDULE_START <= iv < SAC_MT_WL_SCHEDULE_STOP)`. For example, with `lb = 0`, `ub = 100`, step `5`, width `1` and a thread-local range `[37, 63)`, the first index the thread may legally execute is `40`, not `37`, so the offset has to be readjusted at runtime. These adjustments are omitted when the compiler can prove they are unnecessary. This is already quite complicated, and adding another level to the scheduling hierarchy for distributed memory is infeasible in this way.
## High-level change
This merge request refactors the way we compile with2 nodes. This is also documented in `doc/schedule_design.txt`. The new design is as follows.
### Compilation of with-loop
#### wlseg
For each dimension `i` we have variables `SAC_WL_SCHEDULE_START[i]` and `SAC_WL_SCHEDULE_STOP[i]`, determining which part of the index space the program executor is responsible for. For the sequential backend these are simply `idxinf` and `idxsup`; for the multithreaded backend they delimit the part a thread is responsible for; and for distmem they can delimit the part of a node, or of a thread within a node.
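To make the contract concrete, here is a minimal sketch of a scheduler that could fill these variables, assuming a block distribution over dimension 0. The function and the names `tid` and `num_threads` are hypothetical, not the generated code; the pointers play the role of `SAC_WL_SCHEDULE_START[0]` and `SAC_WL_SCHEDULE_STOP[0]`:

```c
/* Hypothetical block schedule over dimension 0: thread tid out of
 * num_threads receives one contiguous block of [idxinf, idxsup). */
static void block_schedule (int idxinf, int idxsup, int tid, int num_threads,
                            int *start, int *stop)
{
    int len = idxsup - idxinf;

    *start = idxinf + (len * tid) / num_threads;
    *stop = idxinf + (len * (tid + 1)) / num_threads;
}
```

For the sequential backend the same contract is satisfied trivially by writing `idxinf` and `idxsup` into the bounds of every dimension.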
#### wlstride
As a wlstride contains a step, `SAC_WL_SCHEDULE_START[i]` may not be a legal index. We set `SAC_WL_SCHEDULE_STRIDE_START[i]` to the smallest legal index that is at least `SAC_WL_SCHEDULE_START[i]`, and similarly `SAC_WL_SCHEDULE_STRIDE_STOP[i]` from `SAC_WL_SCHEDULE_STOP[i]`.
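A sketch of that rounding, assuming a stride with lower bound `lb` and step `step`, and ignoring the grid width for simplicity (the helper name is mine, not the generated code):

```c
/* Smallest index on the stride grid (lb, lb+step, lb+2*step, ...)
 * that is not smaller than start. */
static int stride_start (int start, int lb, int step)
{
    if (start <= lb) {
        return lb;
    }
    int rem = (start - lb) % step;
    return rem == 0 ? start : start + (step - rem);
}
```

For example, `stride_start (37, 0, 5)` yields `40`, matching the example in the motivation.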
These variables set the index vector, but we also have the flat index, sometimes called offset (`wlidx`). In the innermost loop, this needs to be set to the start of the slice.
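For a row-major rank-2 result this amounts to something like the following (illustrative only; `shape1` stands for the extent of the second dimension):

```c
/* Flat index of the first element of the slice owned by this executor. */
wlidx = SAC_WL_SCHEDULE_STRIDE_START[0] * shape1
        + SAC_WL_SCHEDULE_STRIDE_START[1];
```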
#### wlgrid
This increments `wlidx`, usually by 1, and sometimes by more if we skip over a `NOOP` grid.
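A sketch of how a grid advances the flat index; all names here are mine, not the generated code:

```c
/* Visit one grid [grid_lb, grid_ub) and return the advanced flat index. */
static int visit_grid (int wlidx, int grid_lb, int grid_ub, int noop)
{
    if (noop) {
        /* NOOP grid: no body is emitted, skip the whole width at once. */
        return wlidx + (grid_ub - grid_lb);
    }
    for (int j = grid_lb; j < grid_ub; j++) {
        /* ... code for one element at flat index wlidx ... */
        wlidx += 1;
    }
    return wlidx;
}
```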
## Distributed memory backend
There are several challenges for the distributed memory backend:

- The distributed memory backend divides the work for a with2 node, not for a wlseg node.
- We can only call `ShrayStart` on arrays that are distributed, and this decision is made at runtime.
- For genarray and modarray the start and end are computed with `ShrayStart` and `ShrayEnd`, but for fold they are not.
This means we cannot implement the distribution as an ordinary scheduler. Instead, we use the following recipe (a sketch of the last two steps follows the list):

- Before we start the compilation of wlseg nodes, we compute `SAC_wl_global_ub0`, the global upper bound in the first dimension. For genarray this is `SAC_ND_A_SIZE (var_NT, 0)`, and for a fold-loop it is the maximum of `WLSEG_IDXSUP[0]` over all segments.
- In `COMPwith2` we decide whether or not to distribute, and set `SAC_wl_global_start0` and `SAC_wl_global_stop0`. If the loop is not distributed, these are `0` and `SAC_wl_global_ub0`. If it is a distributed genarray or modarray, they are determined with `Shray(Start|End)`. If it is a distributed fold-loop, we implement a block schedule ourselves.
- In the actual scheduler, we intersect the `[lower, upper)` bounds that are passed in with `[SAC_wl_global_start0, SAC_wl_global_stop0)`.
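A sketch of the last two steps, under my reading of the recipe; `rank` and `num_nodes` are hypothetical stand-ins for the node id and node count:

```c
/* Distributed fold-loop: block-schedule [0, SAC_wl_global_ub0) over the
 * nodes ourselves. */
SAC_wl_global_start0 = (SAC_wl_global_ub0 * rank) / num_nodes;
SAC_wl_global_stop0 = (SAC_wl_global_ub0 * (rank + 1)) / num_nodes;

/* Every scheduler then clamps the [lower, upper) range it is handed
 * against these node-global bounds. */
if (lower < SAC_wl_global_start0) lower = SAC_wl_global_start0;
if (upper > SAC_wl_global_stop0) upper = SAC_wl_global_stop0;
```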
## Upshot of change
Schedulers are now free to set `SAC_WL_SCHEDULE_(START|STOP)` to any partition of the index space, and the rest of the compilation process does not need to be modified.
## Performance
I assume the reason for the old implementation was performance. I cannot imagine a scenario where this change has a measurable effect, as we only add a few integer operations outside of the tight loop. Benchmarks on the code I have lying around (PDE, CFAL) confirm this.