CUDA backend undefined behaviour
The following simplification of FlashAttention exhibits undefined behaviour with the CUDA backend: some outputs are -nan, 3.458913, or 11082841.000000. The sequential and multithreaded backends do give the correct answer. Built with version 1.3.3-MijasCosta-1079-g648dba.
use Array: all;
use StdIO: all;

#define N 256
#define d 64

inline
float[., .] matmul(float[., .] A, float[., .] B)
{
    return {[i, j] -> sum({[p] -> A[i, p] * B[p, j]})};
}

noinline
float[., ., .] FlashAttention(float[., ., .] Q, float[d, N] K, float[N, d] V)
{
    return {[i] -> matmul(matmul(Q[i], K), V)};
}

float L2(float[*] x)
{
    return Math::sqrt(sum(x * x));
}

int main()
{
    /* Q[i]K is a d x N matrix of ds, so multiplying this with V
       gives a d x d matrix of N * ds. Taking the L2 norm gives
       sqrt((N * d)^2 * N * d) = N * d * sqrt(N * d) */
    Q = {[i, j, k] -> tof(1) | [i, j, k] < [N / d, d, d]};
    K = {[i, j] -> tof(1) | [i, j] < [d, N]};
    V = {[i, j] -> tof(1) | [i, j] < [N, d]};
    O = FlashAttention(Q, K, V);
    printf("L2 norm of output is %lf, should be %lf\n",
           L2(O), tof(d * N) * Math::sqrt(tof(d * N)));
    return 0;
}
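For reference, the value the program should print can be cross-checked with a small NumPy sketch of the same computation (this is my transcription, not SaC; the names mirror the program above):

```python
import numpy as np

N, d = 256, 64

# All-ones inputs, matching the tensor comprehensions in main().
Q = np.ones((N // d, d, d), dtype=np.float32)
K = np.ones((d, N), dtype=np.float32)
V = np.ones((N, d), dtype=np.float32)

# The FlashAttention simplification: per slice i, (Q[i] @ K) @ V.
O = np.stack([(Q[i] @ K) @ V for i in range(Q.shape[0])])

# L2 norm over all N/d * d * d = N * d entries, each equal to N * d,
# so the norm is sqrt(N * d * (N * d)^2) = N * d * sqrt(N * d).
l2 = np.sqrt(np.sum(O * O))
expected = (N * d) * np.sqrt(np.float32(N * d))
print(l2, expected)  # both print 2097152.0
```

All intermediate values here are powers of two, so the float32 result is exact: N * d = 16384 and 16384 * sqrt(16384) = 2097152, the value the sequential and multithreaded backends should agree on.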