CUDA backend undefined behaviour
The following simplification of FlashAttention exhibits undefined behaviour with the CUDA backend: some outputs are -nan, 3.458913, or 11082841.000000. The sequential and multithreaded backends do give the correct answer. Built with version 1.3.3-MijasCosta-1079-g648dba.
use Array: all;
use StdIO: all;

#define N 256
#define d 64

inline
float[., .] matmul(float[., .] A, float[., .] B)
{
    return {[i, j] -> sum({[p] -> A[i, p] * B[p, j]})};
}

noinline
float[., ., .] FlashAttention(float[., ., .] Q, float[d, N] K, float[N, d] V)
{
    return {[i] -> matmul(matmul(Q[i], K), V)};
}

float L2(float[*] x)
{
    return Math::sqrt(sum(x * x));
}

int main()
{
    /* Q[i]K is a d x N matrix of ds, so multiplying this with V
       gives a d x d matrix of N * ds. Taking the L2 norm gives
       sqrt((N * d)^2 * N * d) = N * d * sqrt(N * d) */
    Q = {[i, j, k] -> tof(1) | [i, j, k] < [N / d, d, d]};
    K = {[i, j] -> tof(1) | [i, j] < [d, N]};
    V = {[i, j] -> tof(1) | [i, j] < [N, d]};
    O = FlashAttention(Q, K, V);
    printf("L2 norm of output is %lf, should be %lf\n",
           L2(O), tof(d * N) * Math::sqrt(tof(d * N)));
    return 0;
}
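For reference, the value the program should print can be cross-checked with a small NumPy sketch of the same computation (this is my transcription, not SaC; the names mirror the program above):

```python
import numpy as np

N, d = 256, 64

# All-ones inputs, matching the tensor comprehensions in main().
Q = np.ones((N // d, d, d), dtype=np.float32)
K = np.ones((d, N), dtype=np.float32)
V = np.ones((N, d), dtype=np.float32)

# The FlashAttention simplification: per slice i, (Q[i] @ K) @ V.
O = np.stack([(Q[i] @ K) @ V for i in range(Q.shape[0])])

# L2 norm over all N/d * d * d = N * d entries, each equal to N * d,
# so the norm is sqrt(N * d * (N * d)^2) = N * d * sqrt(N * d).
l2 = np.sqrt(np.sum(O * O))
expected = (N * d) * np.sqrt(np.float32(N * d))
print(l2, expected)  # both print 2097152.0
```

All intermediate values here are powers of two, so the float32 result is exact: N * d = 16384 and 16384 * sqrt(16384) = 2097152, the value the sequential and multithreaded backends should agree on.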