sac2c issues
https://gitlab.sac-home.org/sac-group/sac2c/-/issues
2024-03-13T15:11:10Z
https://gitlab.sac-home.org/sac-group/sac2c/-/issues/2395
Loop lifting fails
2024-03-13T15:11:10Z
Thomas Koopman
Commit c0ba0c984b00f4109a3edb95597a517613c9f5c1
Consider the following program. In each iteration of the `j` loop, `matmulT` only has to be computed `N / Br` times.
```
use Array: all;
#define d 128
#define N 1024
#define Br 128
#define Bc 256
noinline float[n:shp] id(float[n:shp] x) { return x; }

noinline
float[m, n] matmulT(float[m, k] A, float[n, k] B)
{
    return {iv -> tof(0) | iv < [m, n]};
}

inline
float FlashAttention(float[N, d] Q, float[N, d] K, float[N, d] V)
{
    Qb = reshape([N / Br, Br, d], Q);
    Kb = reshape([N / Bc, Bc, d], K);
    O = tof(0);
    for (j = 0; j < N / Bc; j++) {
        Pj = {[i, a] -> matmulT(Qb[i], Kb[j])[a]
                      | [i, a] < [N / Br, Br]};
        O += sum(Pj);
    }
    return O;
}

int main()
{
    Q = id({iv -> tof(1) | iv < [N, d]});
    K = id({iv -> tof(1) | iv < [N, d]});
    V = id({iv -> tof(1) | iv < [N, d]});
    O = FlashAttention(Q, K, V);
    return _toi_S_(O);
}
```
However, the optimised code gives
```
/* Partn */
([ 0, 0 ] <= _flat_82=[i, a] (IDXS:_wlidx_920_Pj) < [ 8, 128 ] genwidth [ 8, 128 ])
{
_ivesli_930 = _idxs2offset_( [ 8, 128, 128 ], i, _iveras_1039, _iveras_1040);
_flat_84 = with /** FOLDABLE (all gen's const) **/
/** REFERENCED: 1 (total num refs) **/
{
/* Partn */
([ 0, 0 ] <= _pinl_540_iv=[_pinl_543__eat_146, _pinl_542__eat_145] (IDXS:_wlidx_921__flat_84) < [ 128, 128 ] genwidth [ 128, 128 ])
{
_ivesli_932 = _idxs2offset_( [ 8, 128, 128 ], _iveras_1041, _pinl_543__eat_146, _pinl_542__eat_145);
_ivesli_933 = _add_SxS_( _ivesli_930, _ivesli_932);
_pinl_538__flat_396 = _idx_sel_( _ivesli_933, Qb);
} : _pinl_538__flat_396 ;
} :
genarray( [ 128, 128 ], _pinl_450__flat_393, IDX(_wlidx_921__flat_84));
_flat_83 = _MAIN::matmulT( _flat_84, _flat_56) ;
```
computing it `N / Br * Br` times, once for every `[i, a]` index. The `[i, a]` loop should have been split into an `[i]` loop and an `[a]` loop, with the `matmulT` call lifted out of the `[a]` loop; inside the `[a]` loop a suballoc can then be done.
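The difference in call counts can be illustrated outside the compiler. The following is a minimal Python model, not SaC and not what sac2c generates: `matmul_t`, `pj_fused`, `pj_lifted`, `count_calls` and the small sizes are hypothetical stand-ins chosen for illustration. It only counts how often the expensive call runs when the `[i, a]` loop is kept fused versus split with the call hoisted.

```python
# Model of the loop-lifting opportunity: count how often the expensive call runs.
calls = 0

def matmul_t(a, b):
    """Stand-in for matmulT: returns a len(a) x len(b) block of zeros."""
    global calls
    calls += 1
    return [[0.0] * len(b) for _ in a]

def pj_fused(qb, kj):
    """Single [i, a] loop: the call sits in the innermost body, as in the report."""
    return [[matmul_t(qb[i], kj)[a] for a in range(len(qb[i]))]
            for i in range(len(qb))]

def pj_lifted(qb, kj):
    """Split into [i] and [a] loops, with the call hoisted out of the [a] loop."""
    out = []
    for i in range(len(qb)):
        block = matmul_t(qb[i], kj)  # computed once per i
        out.append([block[a] for a in range(len(qb[i]))])
    return out

def count_calls(fn, qb, kj):
    """Run fn and report how many times matmul_t was invoked."""
    global calls
    calls = 0
    result = fn(qb, kj)
    return result, calls

# Small hypothetical sizes (the report uses N=1024, Br=128, Bc=256, d=128).
N, BR, BC, D = 8, 2, 4, 3
qb = [[[1.0] * D for _ in range(BR)] for _ in range(N // BR)]
kj = [[1.0] * D for _ in range(BC)]

fused, fused_calls = count_calls(pj_fused, qb, kj)
lifted, lifted_calls = count_calls(pj_lifted, qb, kj)
```

Both variants build the same array, but in this model the fused loop calls `matmul_t` `N / Br * Br` times while the lifted loop calls it `N / Br` times.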
https://gitlab.sac-home.org/sac-group/sac2c/-/issues/2382
Constant folding does not optimise sel(jv, {iv -> expr(iv) | iv < ub})
2024-02-15T10:19:35Z
Thomas Koopman
Consider the following program.
```
inline
int[d:n1] slide(int[d] i, int[d:m] x, int[d] n1) | all(n1 + i <= m)
{
    return {iv -> _sel_VxA_(_add_VxV_(iv, i), x)
               | iv < n1};
}

int +(int a, int b)
{
    return _add_SxS_(a, b);
}

noinline
int[*] id(int[*] x)
{
    return x;
}

int main()
{
    inp = id(with {}: genarray([28, 28], 2));
#if 1
    bla = {x22 -> with {
                      ([0, 0] <= x21 < [5, 5]): _sel_VxA_(x22, slide(x21, inp, [23, 23]));
                  }: fold(+, 0)
               | x22 < [24, 24]};
#else
    bla = {x22 -> with {
                      ([0, 0] <= x21 < [5, 5]): _sel_VxA_(_add_VxV_(x22, x21), inp);
                  }: fold(+, 0)
               | x22 < [24, 24]};
#endif
    res = _sel_VxA_([0, 0], bla);
    return res;
}
```
Compiling this example with `sac2c_d -bopt -printfun main` gives
```
/****************************************************************************
* _MAIN::main(...) [ body ]
****************************************************************************/
int _MAIN::main()
/*
* main :: ---
*/
{
int _ivesli_2836 { , NN } ;
int _ivesli_2835 { , NN } ;
int _ivesli_2834 { , NN } ;
int _ivesli_2832 { , NN } ;
int _wlidx_2816__flat_91 { , NN } ;
int _wlidx_2815_bla { , NN } ;
int _wlidx_2814__flat_33 { , NN } ;
int _pinl_405__eat_107 { , NN } ;
int _pinl_404__eat_106 { , NN } ;
int _pinl_403__mose_8__SSA0_1 { , NN } ;
int[2] _pinl_402_iv { , NN } ;
int _ea_327__flat_90 { , NN } ;
int _ea_326__mose_9__SSA0_1 { , NN } ;
int _eat_105 { , NN } ;
int _eat_104 { , NN } ;
int _eat_103 { , NN } ;
int _eat_102 { , NN } ;
int _eat_99 { , NN } ;
int _eat_98 { , NN } ;
int _mose_9__SSA0_1 { , NN } ;
int[2] x21__SSA0_1 { , NN } ;
int res { , NN } ;
int[24,24] bla { , NN } ;
int[2] x22 { , NN } ;
int[28,28] inp { , NN } ;
int[2] _hzgwl_12 { , NN } ;
int[23,23] _flat_91 { , NN } ;
int _flat_90 { , NN } ;
int{0} _flat_39 { , NN } ;
int{2} _flat_37 { , NN } ;
int[28,28] _flat_33 { , NN } ;
_flat_39 = 0;
_flat_37 = 2;
_flat_33 = with /** FOLDABLE (all gen's const) **/
/** REFERENCED: 1 (total num refs) **/
{
/* Partn */
([ 0, 0 ] <= _hzgwl_12=[_eat_99, _eat_98] (IDXS:_wlidx_2814__flat_33) < [ 28, 28 ] genwidth [ 28, 28 ])
{
} : _flat_37 ;
} :
genarray( [ 28, 28 ], _flat_37, IDX(_wlidx_2814__flat_33));
inp = _MAIN::id( _flat_33) ;
bla = with /** FOLDABLE (all gen's const) **/
/** REFERENCED: 1 (total num refs) **/
{
/* Partn */
([ 0, 0 ] <= x22=[_eat_103, _eat_102] (IDXS:_wlidx_2815_bla) < [ 24, 24 ] genwidth [ 24, 24 ])
{
_ivesli_2836 = _idxs2offset_( [ 23, 23 ], _eat_103, _eat_102);
_mose_9__SSA0_1 = with /** FOLDABLE (all gen's const) **/
/** REFERENCED: 1 (total num refs) **/
{
/* Partn */
([ 0, 0 ] <= x21__SSA0_1=[_eat_105, _eat_104] < [ 5, 5 ] genwidth [ 5, 5 ])
{
_ea_326__mose_9__SSA0_1 = _accu_( x21__SSA0_1, _flat_39);
_ivesli_2832 = _idxs2offset_( [ 28, 28 ], _eat_105, _eat_104);
_flat_91 = with /** FOLDABLE (all gen's const) **/
/** REFERENCED: 1 (total num refs) **/
{
/* Partn */
([ 0, 0 ] <= _pinl_402_iv=[_pinl_405__eat_107, _pinl_404__eat_106] (IDXS:_wlidx_2816__flat_91) < [ 23, 23 ] genwidth [ 23, 23 ])
{
_ivesli_2834 = _idxs2offset_( [ 28, 28 ], _pinl_405__eat_107, _pinl_404__eat_106);
_ivesli_2835 = _add_SxS_( _ivesli_2832, _ivesli_2834);
_pinl_403__mose_8__SSA0_1 = _idx_sel_( _ivesli_2835, inp);
} : _pinl_403__mose_8__SSA0_1 ;
} :
genarray( [ 23, 23 ], _flat_39, IDX(_wlidx_2816__flat_91));
_flat_90 = _idx_sel_( _ivesli_2836, _flat_91);
_ea_327__flat_90 = _add_SxS_( _ea_326__mose_9__SSA0_1, _flat_90);
} : _ea_327__flat_90 ;
} :
fold( _MAIN::+(), _flat_39);
} : _mose_9__SSA0_1 ;
} :
genarray( [ 24, 24 ], _flat_39, IDX(_wlidx_2815_bla));
res = _idx_sel_( _flat_39, bla);
return( res);
}
/*-----------------------------------------------*/
```
so `slide` is recomputed in every iteration of the fold loop, whereas I would have expected a simple selection `inp[x21 + x22]`, as in the disabled `#else` version.
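The rewrite being asked for is `sel(jv, {iv -> expr(iv) | iv < ub})` becoming `expr(jv)` whenever `jv < ub`. The following is a minimal Python model of that equivalence, not SaC and not a description of how sac2c implements constant folding. The helper names `with_slide` and `with_direct_sel` and the sizes `IN`, `K` and `WIN` are hypothetical, and the sizes are smaller than in the report and chosen so that every index stays in bounds.

```python
# Model of the rewrite  sel(jv, {iv -> expr(iv) | iv < ub})  ==>  expr(jv),
# which is valid whenever jv < ub.
IN, K = 8, 3        # hypothetical sizes: 8x8 input, 3x3 fold range
WIN = IN - K + 1    # 6x6 window, so every index below stays in bounds

def slide(i, x, n1):
    """Materialise the n1-shaped window of x that starts at offset i."""
    return [[x[r + i[0]][c + i[1]] for c in range(n1[1])]
            for r in range(n1[0])]

def with_slide(inp):
    """Builds a whole window per fold step, then selects a single element."""
    return [[sum(slide((dr, dc), inp, (WIN, WIN))[r][c]
                 for dr in range(K) for dc in range(K))
             for c in range(WIN)] for r in range(WIN)]

def with_direct_sel(inp):
    """The folded form: select inp[x22 + x21] directly, no intermediate array."""
    return [[sum(inp[r + dr][c + dc]
                 for dr in range(K) for dc in range(K))
             for c in range(WIN)] for r in range(WIN)]

# Distinct element values, so an indexing mistake would change the result.
inp = [[r * IN + c for c in range(IN)] for r in range(IN)]
slow = with_slide(inp)
fast = with_direct_sel(inp)
```

In this model `slow` and `fast` are equal, but `with_slide` allocates a `WIN x WIN` window for every element of the fold, which corresponds to the extra with-loop visible in the compiler output above.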
https://gitlab.sac-home.org/sac-group/sac2c/-/issues/1182
MT performance lacking
2017-11-19T20:28:45Z
Sven-Bodo Scholz
| | |
| --- | --- |
| Bugzilla Link | [1166](http://bugs.sac-home.org/show_bug.cgi?id=1166) |
| Created on | Oct 07, 2015 15:23 |
| Version | svn |
| OS | All |
| Architecture | All |
| Attachments | [tutu.sac](/uploads/0b405ae329ee26a341cb6646b8185c66/tutu.sac) |
## Extended Description
<pre>Created an attachment (id=1043)
source code used
When compiling the attached code with sac2c 1.2.beta-BlackForest-41-7dc65 (sac2c-follow branch)
I find two problems:
1) sequential execution is twice as fast as mt execution with one thread
2) scaling is virtually nonexistent
Here is the exact data on a 24-core Intel(R) Xeon(R) CPU X5650 @ 2.67GHz:
Sequential time:
-bash-4.1$ sac2c tutu3.sac
-bash-4.1$ /usr/bin/time ./a.out
1.04user 0.00system 0:01.07elapsed 96%CPU (0avgtext+0avgdata 1504maxresident)k
1504inputs+0outputs (0major+422minor)pagefaults 0swaps
=> 1.07 sec
Parallel times:
-bash-4.1$ sac2c -tmt_pth tutu3.sac
-bash-4.1$ /usr/bin/time ./a.out -mt 1
2.83user 0.00system 0:02.93elapsed 96%CPU (0avgtext+0avgdata 3296maxresident)k
3344inputs+0outputs (3major+363minor)pagefaults 0swaps
=> 2.93 secs
-bash-4.1$ /usr/bin/time ./a.out -mt 2
4.92user 0.19system 0:03.37elapsed 151%CPU (0avgtext+0avgdata 4184maxresident)k
0inputs+0outputs (0major+600minor)pagefaults 0swaps
=> 3.37 secs
-bash-4.1$ /usr/bin/time ./a.out -mt 4
4.16user 0.65system 0:01.52elapsed 316%CPU (0avgtext+0avgdata 5512maxresident)k
0inputs+0outputs (0major+510minor)pagefaults 0swaps
=> 1.52 secs</pre>
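The two observations can be put into numbers. The sketch below is plain Python arithmetic on the elapsed times quoted above. The names `speedup`, `mt_elapsed` and `rows` are illustrative, and "efficiency" here means speedup relative to `-mt 1` divided by the thread count, which is an assumption about how one would want to measure scaling.

```python
# Elapsed wall-clock times in seconds, copied from the report above.
sequential = 1.07
mt_elapsed = {1: 2.93, 2: 3.37, 4: 1.52}

def speedup(baseline, elapsed):
    """How many times faster `elapsed` is than `baseline` (< 1 means slower)."""
    return baseline / elapsed

rows = []
for threads, elapsed in sorted(mt_elapsed.items()):
    vs_seq = speedup(sequential, elapsed)     # compared with the sequential binary
    vs_mt1 = speedup(mt_elapsed[1], elapsed)  # compared with the MT binary on one thread
    efficiency = vs_mt1 / threads             # 1.0 would be perfect scaling
    rows.append((threads, vs_seq, vs_mt1, efficiency))
    print(f"-mt {threads}: {vs_seq:.2f}x vs sequential, "
          f"{vs_mt1:.2f}x vs -mt 1, efficiency {efficiency:.2f}")
```

On these figures every MT run is slower than the sequential binary, and going from one to two threads increases the elapsed time.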
BugZilla
https://gitlab.sac-home.org/sac-group/sac2c/-/issues/1112
masterrun does not terminate due to ArrayFormatUT running forever
2017-11-19T20:22:52Z
Sven-Bodo Scholz
| | |
| --- | --- |
| Bugzilla Link | [949](http://bugs.sac-home.org/show_bug.cgi?id=949) |
| Created on | May 01, 2012 19:13 |
| Version | svn |
| OS | All |
| Architecture | PC |
## Extended Description
<pre>using sac2c rev.17794 and
stdlib rev. 1624 and
sac rev 1664
the file testsuite/stdlib/modules/structures/ArrayFormatUT.sac compiles fine with -mt, but
the generated ArrayFormatUT_mt runs forever.</pre>
BugZilla