Dismal performance of indexed reference in LACFUNs, e.g. Livermore Loop loop15


Bugzilla Link	1066
Created on	Apr 19, 2013 21:28
Version	svn
OS	Linux
Architecture	PC
Attachments	loop15.sac

Extended Description

Created an attachment (id=968)
source code to reproduce fault
I have been looking at the performance, or lack thereof,
of Livermore Loop loop15. It currently takes about 2 minutes to run,
vs. 6 seconds for the C code.
It contains code like this:
ret2 = with {
    ([0,0] <= iv < [5,99]) {
      if( VF[iv+1] >= VF[iv+[1,0]]) {
        if( VH[iv+[2,1]] > VH[iv+1]) {
          val = sqrt( VGs[iv+1] + sq( max( VH[iv+1], VH[iv+[2,1]])))
                * 0.053d / VF[iv+1];
        } else {
          val = sqrt( VGs[iv+1] + sq( max( VH[iv+1], VH[iv+[2,1]])))
                * 0.073d / VF[iv+1];
        }
      } else {
        if( VH[iv+[2,1]] > VH[iv+1]) {
          val = sqrt( VGs[iv+1] + sq( max( VH[iv+[1,0]], VH[iv+[2,0]])))
                * 0.053d / VF[iv+1];
        } else {
          val = sqrt( VGs[iv+1] + sq( max( VH[iv+[1,0]], VH[iv+[2,0]])))
                * 0.073d / VF[iv+1];
        }
      }
    } : val;
...
You get the idea...
I think what happens is that NONE of the code in the CONDFUNs is WL-folded.
Furthermore, there is no chance to use WLIDX in the LACFUNs.
The immediate fix for this SAC code is to rewrite it by hand. Consider
the last IF() code block: it can be written so that the LACFUN
does no indexing, and all of the indexing remains in the WL's basic
block:
   numer = ( VH[iv+[2,1]] > VH[iv+1]) ? 0.053d : 0.073d;
   val = sqrt( VGs[iv+1] + sq( max( VH[iv+[1,0]], VH[iv+[2,0]])))
          * numer / VF[iv+1];
This is not, however, a panacea, because other applications are
not so amenable to this sort of refactoring. Consider, e.g.,
binary search, heapsort, and the like.
Some redesigns we might consider, aside from scrapping the whole
LACFUN idea, include:
   - pushing wlidx into LACFUNs. (Perhaps this is already done, but
     I did not see evidence of it.)
   - making LIR fancier for CONDFUNs. E.g., in the final IF() above,
     the val= blocks are nearly identical in both legs, so the identical
     parts could be moved out of the LACFUN.
I think the latter offers the biggest immediate advantages. 
This bug also explains a lot about why many real-world SAC applications
don't perform nearly as well as we expect: our (my) naive expectation
is that scalar-oriented SAC code should perform as well as the
equivalent C code.
