Much work has gone into optimising matrix multiplication. The straightforward algorithm
for multiplying two $n \times n$ matrices, which is the one used by the libraries we compared against, has a complexity of $O(n^3)$, but algorithms with lower complexity
have been found. Strassen's algorithm~\cite{Strassen}
has a complexity of $O(n^{\log_2 7}) \approx O(n^{2.807})$. This algorithm is recursive and offers weaker numerical-stability guarantees than the classical algorithm,
but \cite{StrassenReloaded} shows that a few recursive steps can be applied safely, leading to real-world performance gains.
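For concreteness, one level of Strassen's recursion partitions $A$, $B$, and $C$ into four $\frac{n}{2} \times \frac{n}{2}$ blocks and replaces the eight block products of the classical algorithm with seven (the standard formulation):
\begin{align*}
M_1 &= (A_{11} + A_{22})(B_{11} + B_{22}), & M_2 &= (A_{21} + A_{22})\,B_{11},\\
M_3 &= A_{11}(B_{12} - B_{22}), & M_4 &= A_{22}(B_{21} - B_{11}),\\
M_5 &= (A_{11} + A_{12})\,B_{22}, & M_6 &= (A_{21} - A_{11})(B_{11} + B_{12}),\\
M_7 &= (A_{12} - A_{22})(B_{21} + B_{22}),
\end{align*}
from which the result blocks are assembled using only additions:
\begin{align*}
C_{11} &= M_1 + M_4 - M_5 + M_7, & C_{12} &= M_3 + M_5,\\
C_{21} &= M_2 + M_4, & C_{22} &= M_1 - M_2 + M_3 + M_6.
\end{align*}
Since each level performs seven half-size multiplications instead of eight, the cost obeys the recurrence $T(n) = 7\,T(n/2) + O(n^2)$, which solves to $O(n^{\log_2 7})$.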
To our knowledge, the peer-reviewed algorithm with the lowest known asymptotic complexity~\cite{laserMatrix}
runs in $O(n^{2.37286})$. However, no practical implementation exists, as the constant factors hidden in the asymptotic bound make it competitive only for infeasibly large matrices.
The idea of using blocking to improve the performance of numerical
algorithms dates back to the 1960s~\cite{10.1145/362875.362879}, roughly coinciding