Motivation
The first axis may be too small to use all threads for high-ranked arrays. This merge request supports a 2-dimensional thread space. For example, instead of thread layout
T(0)
--
T(1)
--
T(2)
--
T(3)
we may want a layout
T(0, 0) | T(0, 1)
-----------------
T(1, 0) | T(1, 1)
Problems
The downside of this, is that each thread may have more than one cacheline that is shared with a different thread, leading to false sharing. So this should only be done if necessary.
Implementation
Let X
be an T[m, n, d:shp]
array and suppose we have p
processors. We want a minimal number of rows to minimize false sharing. For this we use the function
/**
* Returns the largest q | p such that p1 <= m
**/
int largest_div(int m, int p)
{
int q = 1;
int candidate = 1;
while (candidate <= m && candidate <= p) {
if (p % candidate == 0) {
q = candidate;
}
candidate++;
}
return q;
}
to factorize p
. We then do a block distribution on the first axis with q
and the second axis with p / q
.
This has as effect that for p <= m
, the grid will be p x 1
, or the same as before this merge request. The only overhead is calculating the processor grid.
Future work
We (meaning not me) could extend this to arbitrarily many dimensions.