sac2c issues
https://gitlab.sac-home.org/sac-group/sac2c/-/issues
2024-03-13T15:11:10Z
https://gitlab.sac-home.org/sac-group/sac2c/-/issues/2395
Loop lifting fails
2024-03-13T15:11:10Z
Thomas Koopman
Loop lifting fails
Commit c0ba0c984b00f4109a3edb95597a517613c9f5c1
Consider the following program. `matmulT` only has to be computed `N / Br` times.
```
use Array: all;
#define d 128
#define N 1024
#define Br 128
#define Bc 256
noinline float[n:shp] i...
Commit c0ba0c984b00f4109a3edb95597a517613c9f5c1
Consider the following program. `matmulT` only has to be computed `N / Br` times.
```
use Array: all;
#define d 128
#define N 1024
#define Br 128
#define Bc 256
noinline float[n:shp] id(float[n:shp] x) { return x; }
noinline
float[m, n] matmulT(float[m, k] A, float[n, k] B)
{
return {iv -> tof(0) | iv < [m, n]};
}
inline
float FlashAttention(float[N, d] Q, float[N, d] K, float[N, d] V)
{
Qb = reshape([N / Br, Br, d], Q);
Kb = reshape([N / Bc, Bc, d], K);
O = tof(0);
for (j = 0; j < N / Bc; j++) {
Pj = {[i, a] -> matmulT(Qb[i], Kb[j])[a]
| [i, a] < [N / Br, Br]};
O += sum(Pj);
}
return O;
}
int main()
{
Q = id({iv -> tof(1) | iv < [N, d]});
K = id({iv -> tof(1) | iv < [N, d]});
V = id({iv -> tof(1) | iv < [N, d]});
O = FlashAttention(Q, K, V);
return _toi_S_(O);
}
```
However, the optimised code gives
```
/* Partn */
([ 0, 0 ] <= _flat_82=[i, a] (IDXS:_wlidx_920_Pj) < [ 8, 128 ] genwidth [ 8, 128 ])
{
_ivesli_930 = _idxs2offset_( [ 8, 128, 128 ], i, _iveras_1039, _iveras_1040);
_flat_84 = with /** FOLDABLE (all gen's const) **/
/** REFERENCED: 1 (total num refs) **/
{
/* Partn */
([ 0, 0 ] <= _pinl_540_iv=[_pinl_543__eat_146, _pinl_542__eat_145] (IDXS:_wlidx_921__flat_84) < [ 128, 128 ] genwidth [ 128, 128 ])
{
_ivesli_932 = _idxs2offset_( [ 8, 128, 128 ], _iveras_1041, _pinl_543__eat_146, _pinl_542__eat_145);
_ivesli_933 = _add_SxS_( _ivesli_930, _ivesli_932);
_pinl_538__flat_396 = _idx_sel_( _ivesli_933, Qb);
} : _pinl_538__flat_396 ;
} :
genarray( [ 128, 128 ], _pinl_450__flat_393, IDX(_wlidx_921__flat_84));
_flat_83 = _MAIN::matmulT( _flat_84, _flat_56) ;
```
computing it `N / Br * Br` times. The `[i, a]` loop should have been split up in an `[i]` and `[a]` loop, the `matmulT` lifted out of the `[a]` loop, and then inside of the `[a]` loop a suballoc can be done.
https://gitlab.sac-home.org/sac-group/sac2c/-/issues/2382
Constant folding does not optimise sel(jv, {iv -> expr(iv) | iv < ub})
2024-02-15T10:19:35Z
Thomas Koopman
Constant folding does not optimise sel(jv, {iv -> expr(iv) | iv < ub})
Consider the following program.
```
inline
int[d:n1] slide(int[d] i, int[d:m] x, int[d] n1) | all(n1 + i <= m)
{
return {iv -> _sel_VxA_(_add_VxV_(iv, i), x)
| iv < n1};
}
int +(int a, int b)
{
return _add_SxS_(a, b)...
Consider the following program.
```
inline
int[d:n1] slide(int[d] i, int[d:m] x, int[d] n1) | all(n1 + i <= m)
{
return {iv -> _sel_VxA_(_add_VxV_(iv, i), x)
| iv < n1};
}
int +(int a, int b)
{
return _add_SxS_(a, b);
}
noinline
int[*] id(int[*] x)
{
return x;
}
int main()
{
inp = id(with {}: genarray([28, 28], 2));
#if 1
bla = {x22 -> with {
([0, 0] <= x21 < [5, 5]): _sel_VxA_(x22, slide(x21, inp, [23, 23]));
}: fold(+, 0)
| x22 < [24, 24]};
#else
bla = {x22 -> with {
([0, 0] <= x21 < [5, 5]): _sel_VxA_(_add_VxV_(x22, x21), inp);
}: fold(+, 0)
| x22 < [24, 24]};
#endif
res = _sel_VxA_([0, 0], bla);
return res;
}
```
Compiling the following example with `sac2c_d -bopt -printfun main` gives
```
/****************************************************************************
* _MAIN::main(...) [ body ]
****************************************************************************/
int _MAIN::main()
/*
* main :: ---
*/
{
int _ivesli_2836 { , NN } ;
int _ivesli_2835 { , NN } ;
int _ivesli_2834 { , NN } ;
int _ivesli_2832 { , NN } ;
int _wlidx_2816__flat_91 { , NN } ;
int _wlidx_2815_bla { , NN } ;
int _wlidx_2814__flat_33 { , NN } ;
int _pinl_405__eat_107 { , NN } ;
int _pinl_404__eat_106 { , NN } ;
int _pinl_403__mose_8__SSA0_1 { , NN } ;
int[2] _pinl_402_iv { , NN } ;
int _ea_327__flat_90 { , NN } ;
int _ea_326__mose_9__SSA0_1 { , NN } ;
int _eat_105 { , NN } ;
int _eat_104 { , NN } ;
int _eat_103 { , NN } ;
int _eat_102 { , NN } ;
int _eat_99 { , NN } ;
int _eat_98 { , NN } ;
int _mose_9__SSA0_1 { , NN } ;
int[2] x21__SSA0_1 { , NN } ;
int res { , NN } ;
int[24,24] bla { , NN } ;
int[2] x22 { , NN } ;
int[28,28] inp { , NN } ;
int[2] _hzgwl_12 { , NN } ;
int[23,23] _flat_91 { , NN } ;
int _flat_90 { , NN } ;
int{0} _flat_39 { , NN } ;
int{2} _flat_37 { , NN } ;
int[28,28] _flat_33 { , NN } ;
_flat_39 = 0;
_flat_37 = 2;
_flat_33 = with /** FOLDABLE (all gen's const) **/
/** REFERENCED: 1 (total num refs) **/
{
/* Partn */
([ 0, 0 ] <= _hzgwl_12=[_eat_99, _eat_98] (IDXS:_wlidx_2814__flat_33) < [ 28, 28 ] genwidth [ 28, 28 ])
{
} : _flat_37 ;
} :
genarray( [ 28, 28 ], _flat_37, IDX(_wlidx_2814__flat_33));
inp = _MAIN::id( _flat_33) ;
bla = with /** FOLDABLE (all gen's const) **/
/** REFERENCED: 1 (total num refs) **/
{
/* Partn */
([ 0, 0 ] <= x22=[_eat_103, _eat_102] (IDXS:_wlidx_2815_bla) < [ 24, 24 ] genwidth [ 24, 24 ])
{
_ivesli_2836 = _idxs2offset_( [ 23, 23 ], _eat_103, _eat_102);
_mose_9__SSA0_1 = with /** FOLDABLE (all gen's const) **/
/** REFERENCED: 1 (total num refs) **/
{
/* Partn */
([ 0, 0 ] <= x21__SSA0_1=[_eat_105, _eat_104] < [ 5, 5 ] genwidth [ 5, 5 ])
{
_ea_326__mose_9__SSA0_1 = _accu_( x21__SSA0_1, _flat_39);
_ivesli_2832 = _idxs2offset_( [ 28, 28 ], _eat_105, _eat_104);
_flat_91 = with /** FOLDABLE (all gen's const) **/
/** REFERENCED: 1 (total num refs) **/
{
/* Partn */
([ 0, 0 ] <= _pinl_402_iv=[_pinl_405__eat_107, _pinl_404__eat_106] (IDXS:_wlidx_2816__flat_91) < [ 23, 23 ] genwidth [ 23, 23 ])
{
_ivesli_2834 = _idxs2offset_( [ 28, 28 ], _pinl_405__eat_107, _pinl_404__eat_106);
_ivesli_2835 = _add_SxS_( _ivesli_2832, _ivesli_2834);
_pinl_403__mose_8__SSA0_1 = _idx_sel_( _ivesli_2835, inp);
} : _pinl_403__mose_8__SSA0_1 ;
} :
genarray( [ 23, 23 ], _flat_39, IDX(_wlidx_2816__flat_91));
_flat_90 = _idx_sel_( _ivesli_2836, _flat_91);
_ea_327__flat_90 = _add_SxS_( _ea_326__mose_9__SSA0_1, _flat_90);
} : _ea_327__flat_90 ;
} :
fold( _MAIN::+(), _flat_39);
} : _mose_9__SSA0_1 ;
} :
genarray( [ 24, 24 ], _flat_39, IDX(_wlidx_2815_bla));
res = _idx_sel_( _flat_39, bla);
return( res);
}
/*-----------------------------------------------*/
```
so `slide` is computed every iteration of the fold-loop, whereas I would have expected a simple selection `inp[x21 + x22]`, as in the commented out version.
https://gitlab.sac-home.org/sac-group/sac2c/-/issues/2377
Complexity problem in IVD
2024-02-11T10:13:20Z
Thomas Koopman
Complexity problem in IVD
IVD checks for each `N_spids` whether a variable declaration exists, inserting one if necessary. This search is a linear traversal of the list of variable declarations. If we have `n` variables without declaration, that means we check `s...
IVD checks for each `N_spids` whether a variable declaration exists, inserting one if necessary. This search is a linear traversal of the list of variable declarations. If we have `n` variables without declaration, that means we check `sum_{i = 1}^n i = n(n + 1)/2 \in O(n^2)` times. For the program below, the phase FLAT generates about 100K variables without declaration, leading to this phase taking about 20 minutes.
```
use Array: all;
use Math: all;
use StdIO: all;
#define CATEGORIES 10
inline
float[n:shpi] sumOuter(int m, float[m:shpo,n:shpi] x)
{
return {iv -> sum({jv -> x[jv++iv] | jv < shpo}) | iv < shpi};
}
inline
float[d:n1] slide(int[d] i, float[d:m] x, int[d] n) | all(n1 == n + 1)
| all(n + 1 + i <= m)
{
return {iv -> x[iv + i] | iv < n + 1};
}
inline
float[d:mn] backslide(int[d] i, float[d:n1] y, int[d] mn)
| all(i < 1 + mn - n1)
{
return {iv -> 0f | iv < i;
iv -> y[iv - i] | i <= iv < n1 + i;
iv -> 0f | n1 + i <= iv < mn};
}
inline
float[*] logistics(float[*] x)
{
return {iv -> 1f / (1f + exp(-x[iv]))};
}
inline
float[n:oshp,n:ishp] block(float[d:shp] x, int[n] ishp)
| all(shp % ishp == 0)
| all(oshp * ishp == shp)
| (2 * n == d)
{
return {iv -> tile(ishp, iv * ishp, x) | iv < shp / ishp};
}
/* TODO: type patterns */
inline
float[*] unblock(float[*] a, int[.] bshp)
{
shp = drop(-shape(bshp), shape(a)) * bshp;
return {iv -> a[(iv / bshp) ++ (iv % bshp)] | iv < shp};
}
inline
float[n:ishp] selb(float[d:shp] x, int[n] iv, int[n] ishp)
| (2 * n == d)
| all(iv < shp / ishp)
{
return block(x, ishp)[iv];
}
int main()
{
/* Variables */
inp = {[i, j] -> tof(i * j) | [i, j] < [28, 28]};
k1 = genarray( [6,5,5], 1f/25f);
b1 = genarray( [6], 1f/6f);
k2 = genarray( [12,6,5,5], 1f/150f);
b2 = genarray( [12], 1f/12f);
fc = genarray( [CATEGORIES,12,1,4,4], 1f/192f);
b = genarray( [CATEGORIES], 1f/tof(CATEGORIES));
/* Generated code with minor adjustments:
UTF8 -> ASCII
4 -> 4f (as SaC does not coerce types)
one -> 1f
backlide -> backslide
*/
/* Shape [6, 24, 24] */
c11 = { x16 -> (sumOuter(2, { x17 -> (slide(x17, inp, [23, 23])) * ({ x18 -> ((k1)[x16])[x17] | x18 < [24, 24] }) | x17 < [5, 5]})) + ({ x19 -> (b1)[x16] | x19 < [24, 24] }) | x16 < [6] };
printf("c11\n");
print(c11);
/* Shape [6, 24, 24] */
c1 = logistics(c11);
printf("c1\n");
print(c1);
/* Shape [6, 12, 12] */
s1 = { x13 -> { x14 -> (sumOuter(2, { x15 -> (selb((c1)[x13], x14, [2, 2]))[x15] | x15 < [2, 2]})) / 4f | x14 < [12, 12] } | x13 < [6] };
printf("s1\n");
print(s1);
/* Shape [12, 1, 8, 8] */
c21 = { x9 -> (sumOuter(3, { x10 -> (slide(x10, s1, [0, 7, 7])) * ({ x11 -> ((k2)[x9])[x10] | x11 < [1, 8, 8] }) | x10 < [6, 5, 5]})) + ({ x12 -> (b2)[x9] | x12 < [1, 8, 8] }) | x9 < [12] };
printf("c21\n");
print(c21);
/* Shape [12, 1, 8, 8] */
c2 = logistics(c21);
printf("c2\n");
print(c2);
/* Shape [12, 1, 4, 4] */
s2 = { x5 -> { x6 -> { x7 -> (sumOuter(2, { x8 -> (selb(((c2)[x5])[x6], x7, [2, 2]))[x8] | x8 < [2, 2]})) / 4f | x7 < [4, 4] } | x6 < [1] } | x5 < [12] };
printf("s2\n");
print(s2);
/* Shape [10, 1, 1, 1, 1] */
r1 = { x1 -> (sumOuter(4, { x2 -> (slide(x2, s2, [0, 0, 0, 0])) * ({ x3 -> ((fc)[x1])[x2] | x3 < [1, 1, 1, 1] }) | x2 < [12, 1, 4, 4]})) + ({ x4 -> (b)[x1] | x4 < [1, 1, 1, 1] }) | x1 < [10] };
printf("r1\n");
print(r1);
/* Shape [10, 1, 1, 1, 1] */
r = logistics(r1);
printf("r\n");
print(r);
ddr = 1f;
/* Shape [10, 1, 1, 1, 1] */
ddr1 = ddr * (r1 * (1f + -r1));
printf("ddr1\n");
print(ddr1);
/* Shape [12, 1, 4, 4] */
dds2 = sumOuter(1, { x1 -> sumOuter(4, { x2 -> backslide(x2, { x3 -> (((ddr1)[x1])[x3]) * (((fc)[x1])[x2]) | x3 < [1, 1, 1, 1] }, [12, 1, 4, 4]) | x2 < [12, 1, 4, 4]}) | x1 < [10]});
printf("dds2\n");
print(dds2);
/* Shape [12, 1, 8, 8] */
ddc2 = { x1 -> { x2 -> unblock({ x3 -> { x4 -> ((((dds2)[x1])[x2])[x3]) / 4f | x4 < [2, 2] } | x3 < [4, 4] }, [2, 2]) | x2 < [1] } | x1 < [12] };
printf("ddc2\n");
print(ddc2);
/* Shape [12, 1, 8, 8] */
ddc21 = ddc2 * (c21 * (1f + -c21));
printf("ddc21\n");
print(ddc21);
/* Shape [6, 12, 12] */
dds1 = sumOuter(1, { x1 -> sumOuter(3, { x2 -> backslide(x2, { x3 -> (((ddc21)[x1])[x3]) * (((k2)[x1])[x2]) | x3 < [1, 8, 8] }, [6, 12, 12]) | x2 < [6, 5, 5]}) | x1 < [12]});
printf("dds1\n");
print(dds1);
/* Shape [6, 24, 24] */
ddc1 = { x1 -> unblock({ x2 -> { x3 -> (((dds1)[x1])[x2]) / 4f | x3 < [2, 2] } | x2 < [12, 12] }, [2, 2]) | x1 < [6] };
printf("ddc1\n");
print(ddc1);
/* Shape [6, 24, 24] */
ddc11 = (ddc1) * (c11 * (1f + -c11));
printf("ddc11\n");
print(ddc11);
ddb = { x1 -> sumOuter(4, { x2 -> ((ddr1)[x1])[x2] | x2 < [1, 1, 1, 1]}) | x1 < [10] };
printf("ddb\n");
print(ddb);
ddfc = { x1 -> { x2 -> sumOuter(4, { x3 -> (((ddr1)[x1])[x3]) * ((slide(x2, s2, [0, 0, 0, 0]))[x3]) | x3 < [1, 1, 1, 1]}) | x2 < [12, 1, 4, 4] } | x1 < [10] };
printf("ddfc\n");
print(ddfc);
ddb2 = { x1 -> sumOuter(3, { x2 -> ((ddc21)[x1])[x2] | x2 < [1, 8, 8]}) | x1 < [12] };
printf("ddb2\n");
print(ddb2);
ddk2 = { x1 -> { x2 -> sumOuter(3, { x3 -> (((ddc21)[x1])[x3]) * ((slide(x2, s1, [0, 7, 7]))[x3]) | x3 < [1, 8, 8]}) | x2 < [6, 5, 5] } | x1 < [12] };
printf("ddk2\n");
print(ddk2);
ddb1 = { x1 -> sumOuter(2, { x2 -> ((ddc11)[x1])[x2] | x2 < [24, 24]}) | x1 < [6] };
printf("ddb1\n");
print(ddb1);
ddk1 = { x1 -> { x2 -> sumOuter(2, { x3 -> (((ddc11)[x1])[x3]) * ((slide(x2, inp, [23, 23]))[x3]) | x3 < [24, 24]}) | x2 < [5, 5] } | x1 < [6] };
printf("ddk1\n");
print(ddk1);
ddinp = sumOuter(1, { x1 -> sumOuter(2, { x2 -> backslide(x2, { x3 -> (((ddc11)[x1])[x3]) * (((k1)[x1])[x2]) | x3 < [24, 24] }, [28, 28]) | x2 < [5, 5]}) | x1 < [6]});
printf("ddinp\n");
print(ddinp);
return 0;
}
```
https://gitlab.sac-home.org/sac-group/sac2c/-/issues/1182
MT performance lacking
2017-11-19T20:28:45Z
Sven-Bodo Scholz
MT performance lacking
| | |
| --- | --- |
| Bugzilla Link | [1166](http://bugs.sac-home.org/show_bug.cgi?id=1166) |
| Created on | Oct 07, 2015 15:23 |
| Version | svn |
| OS | All |
| Architecture | All |
| Attachments | [tutu.sac](/uploads/0b405ae329ee26a...
| | |
| --- | --- |
| Bugzilla Link | [1166](http://bugs.sac-home.org/show_bug.cgi?id=1166) |
| Created on | Oct 07, 2015 15:23 |
| Version | svn |
| OS | All |
| Architecture | All |
| Attachments | [tutu.sac](/uploads/0b405ae329ee26a341cb6646b8185c66/tutu.sac) |
## Extended Description
<pre>Created an attachment (id=1043)
source code used
When compiling the attached code with sac2c 1.2.beta-BlackForest-41-7dc65 (sac2c-follow branch)
I find two problems:
1) sequential execution is twice as fast as mt execution with one thread
2) scaling is virtually non existant
Here the exact data on a 24 core Intel Intel(R) Xeon(R) CPU X5650 @ 2.67GHz:
Sequential time:
-bash-4.1$ sac2c tutu3.sac
-bash-4.1$ /usr/bin/time ./a.out
1.04user 0.00system 0:01.07elapsed 96%CPU (0avgtext+0avgdata 1504maxresident)k
1504inputs+0outputs (0major+422minor)pagefaults 0swaps
=> 1.07 sec
Parallel times:
-bash-4.1$ sac2c -tmt_pth tutu3.sac
-bash-4.1$ /usr/bin/time ./a.out -mt 1
2.83user 0.00system 0:02.93elapsed 96%CPU (0avgtext+0avgdata 3296maxresident)k
3344inputs+0outputs (3major+363minor)pagefaults 0swaps
=> 2.93 secs
-bash-4.1$ /usr/bin/time ./a.out -mt 2
4.92user 0.19system 0:03.37elapsed 151%CPU (0avgtext+0avgdata 4184maxresident)k
0inputs+0outputs (0major+600minor)pagefaults 0swaps
=> 3.37 secs
-bash-4.1$ /usr/bin/time ./a.out -mt 4
4.16user 0.65system 0:01.52elapsed 316%CPU (0avgtext+0avgdata 5512maxresident)k
0inputs+0outputs (0major+510minor)pagefaults 0swaps
=> 1.52 secs</pre>
BugZilla
BugZilla
https://gitlab.sac-home.org/sac-group/sac2c/-/issues/1112
masterrun does not terminate due to ArrayFormatUT running forever
2017-11-19T20:22:52Z
Sven-Bodo Scholz
masterrun does not terminate due to ArrayFormatUT running forever
| | |
| --- | --- |
| Bugzilla Link | [949](http://bugs.sac-home.org/show_bug.cgi?id=949) |
| Created on | May 01, 2012 19:13 |
| Version | svn |
| OS | All |
| Architecture | PC |
## Extended Description
<pre>using sac2c rev.17794 a...
| | |
| --- | --- |
| Bugzilla Link | [949](http://bugs.sac-home.org/show_bug.cgi?id=949) |
| Created on | May 01, 2012 19:13 |
| Version | svn |
| OS | All |
| Architecture | PC |
## Extended Description
<pre>using sac2c rev.17794 and
stdlib rev. 1624 and
sac rev 1664
the file testsuite/stdlib/modules/structures/ArrayFormatUT.sac compiles -mt fine but
the generated ArrayFormatUT_mt runs forever......</pre>
BugZilla
BugZilla