Sum Addressed Decoder - Overview

Overview

The L1 data cache should usually be in the most critical recurrence on the CPU, for three reasons: few things improve instructions per cycle (IPC) as directly as a larger data cache; a larger data cache takes longer to access; and pipelining the data cache access lowers IPC. This article explains one way to reduce the latency of an L1 data cache access: fusing the address generation sum operation with the decode operation in the cache SRAM.

The address generation sum operation still must be performed, because other units in the memory pipe will use the resulting virtual address. That sum will be performed in parallel with the fused add/decode described here.

The most profitable recurrence to accelerate is a load, followed by a use of that load in a chain of integer operations leading to another load. If load results are bypassed with the same priority as integer results, this recurrence can be summarized as a load followed by another load, as if the program were following a linked list.

The rest of this page assumes an instruction set architecture (ISA) with a single addressing mode (register+offset), a virtually indexed data cache, and sign-extending loads that may be variable-width. Most RISC ISAs fit this description. In ISAs such as the Intel x86, three or four inputs are summed to generate the virtual address. Multiple-input additions can be reduced to a two-input addition with carry-save adders, and the remaining problem is as described below. The critical recurrence, then, consists of an adder, a decoder, the SRAM word line, the SRAM bit line(s), the sense amp(s), the byte steering muxes, and the bypass muxes.
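The carry-save reduction mentioned above can be illustrated in software. The following is a minimal sketch, not a hardware description: it assumes 64-bit addresses and a hypothetical x86-style address of the form base + (index << scale) + displacement. The function names and the `MASK` constant are illustrative and do not come from the article.

```python
MASK = (1 << 64) - 1  # assumption: 64-bit virtual addresses


def carry_save(a, b, c):
    """3:2 compressor: reduce three addends to a (sum, carry) pair.

    Each bit position is handled independently, so no carry ripples
    across the word. The pair satisfies sum + carry == a + b + c (mod 2**64).
    """
    partial_sum = a ^ b ^ c                       # per-bit sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1    # per-bit majority, shifted up
    return partial_sum & MASK, carry & MASK


def effective_address(base, index, scale, disp):
    """Three-input address sum, reduced to a single two-input addition."""
    s, c = carry_save(base & MASK, (index << scale) & MASK, disp & MASK)
    return (s + c) & MASK  # the only carry-propagating addition


print(hex(effective_address(0x1000, 3, 3, 0x20)))  # 0x1038
print(hex(effective_address(0x1000, 0, 0, -8)))    # 0xff8
```

After the carry-save stage only two operands remain, so the final step is the same two-input problem that the rest of the page addresses.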

For this example, a direct-mapped 16 KB data cache which returns doubleword (8-byte) aligned values is assumed. Each line of the SRAM is 8 bytes, and there are 2048 lines, addressed by Addr[13:3] (11 index bits above the 3 byte-offset bits). The sum addressed SRAM idea applies equally well to set associative caches.
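As a rough software model of this example cache, the sketch below computes which SRAM line a register+offset access selects by performing the full add and then extracting the index bits. This is the conventional add-then-decode path, not the fused sum-addressed decoder itself. The constant and function names are illustrative assumptions, as is the 64-bit address width.

```python
MASK = (1 << 64) - 1      # assumption: 64-bit virtual addresses
OFFSET_BITS = 3           # 8-byte lines -> 3 byte-offset bits
NUM_LINES = 2048          # 16 KB / 8 bytes per line -> 11 index bits


def cache_index(base, offset):
    """SRAM line selected by base + offset: bits 13:3 of the sum."""
    addr = (base + offset) & MASK
    return (addr >> OFFSET_BITS) & (NUM_LINES - 1)


def naive_index(base, offset):
    """Adds only the index fields, ignoring the carry out of bits 2:0."""
    return ((base >> OFFSET_BITS) + (offset >> OFFSET_BITS)) & (NUM_LINES - 1)


print(cache_index(0x1004, 0x4))   # 513
print(naive_index(0x1004, 0x4))   # 512 (wrong: carry from the low bits is lost)
```

The mismatch in the second call shows why the index cannot be formed from the upper bits of the two operands alone: the carry out of the low three bits has to be accounted for.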
