General Matrix Multiply - Optimization

Optimization

Not only is GEMM an important building block of other numeric software, it is often an important building block for calls to GEMM for larger matrices. By decomposing one or both of the input matrices into block matrices, GEMM can be used repeatedly on the smaller blocks to build up a result for the full matrix. This is one of the motivations for including the parameter, so the results of previous blocks can be accumulated. Note that this decomposition requires the special case which many implementations optimize for, thereby eliminating one multiplication for each value of C.

This decomposition allows for better locality of reference both in space and time of the data used in the product. This, in turn, takes advantage of the cache on the system (Golub & Van Loan 1996, Ch. 1). For systems with more than one level of cache, the blocking can be applied a second time to the order in which the blocks are used in the computation. Both of these levels of optimization are used in implementations such as ATLAS. More recently, implementations by Kazushige Goto have shown that blocking only for the L2 cache, combined with careful amortizing of copying to contiguous memory to reduce TLB misses, is superior to ATLAS. A highly tuned implementation based on these ideas are part of the GotoBLAS.

Read more about this topic:  General Matrix Multiply