First axis reduction of tensor runs 10X slower than reduction of entire tensor


Bugzilla Link	754
Created on	Sep 30, 2010 19:46
Version	svn
OS	Linux
Architecture	PC
Attachments	crud.sac

Extended Description

Created an attachment (id=758)
source code to reproduce fault
The attached code performs a sum reduction over the first axis of
a rank-3 array. If compiled with:
  sac2c -O3 crud.sac -DSLOW
the resulting code executes in roughly 7 seconds on a 3GHz Opteron.
If I have the Ubuntu system monitor running, I see that memory usage
creeps up DURING the execution of the reduction. This is surprising,
as I would naively expect that all allocations would be done before
we enter the loop.
If compiled with:
  sac2c -O3 crud.sac
the resulting code performs a sum() over the entire tensor, and executes in
about 0.85 seconds.
The offending function is likely this one:
inline int[+] plussl1XBIFOLD(bool[+] y)
{ /* first-axis reduce rank-3 or greater matrix */
  yt = transpose(y);
  zrho = drop([-1], shape(yt));
  z = with {
        (. <= iv <= .) : sum(toi((yt[iv])));
      } : genarray(zrho, 0);
  return(z);
}
Perhaps there is a better way to express such a reduction?
The idea here is that an argument of shape [10,20,30] will
give a result of shape [20,30].
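For reference, the intended computation can be sketched in plain C. This is
only an illustration of the first-axis reduction described above, not the code
that sac2c generates. The function name sum_first_axis and the row-major
layout are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of crud.sac): sums a row-major bool array
 * of shape [n0, n1, n2] over its first axis.  The caller supplies z, which
 * must have room for n1 * n2 ints, so nothing is allocated in the loops. */
void sum_first_axis(const bool *y, size_t n0, size_t n1, size_t n2, int *z)
{
    assert(y != NULL && z != NULL);

    size_t plane = n1 * n2;          /* number of elements in the result */

    for (size_t j = 0; j < plane; j++)
        z[j] = 0;

    /* Add one [n1, n2] slice at a time; the inner loop walks both the
     * slice and the result contiguously. */
    for (size_t i = 0; i < n0; i++) {
        const bool *slice = y + i * plane;
        for (size_t j = 0; j < plane; j++)
            z[j] += slice[j];
    }
}
```

For an argument of shape [10,20,30] one would call
sum_first_axis(y, 10, 20, 30, z) with z holding 20*30 ints. The sketch
touches a single preallocated result buffer, which is roughly the memory
behaviour I would expect from the compiled reduction.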
Part of the problem is that the reduction array is AKD (known
dimensionality, unknown shape), which is causing some with-loop
folding (WLF) opportunities to be missed.
However, declaring the reduce argument with a static shape:
  bool[3000,15000,3] A_23;
still leaves the -DSLOW code running about 6X slower than
the other code.
This is on: product rev 17069:MODIFIED linux-gnu_x86_64
