Software Pipelining - Difficulties of Implementation

Difficulties of Implementation

The requirement of a prologue and epilogue is one of the major difficulties of implementing software pipelining. Note that the prologue in this example is 18 instructions, 3 times as large as the loop itself. The epilogue would also be 18 instructions. In other words, the prologue and epilogue together are 6 times as large as the loop itself. While still better than attempting loop unrolling for this example, software pipelining requires a trade-off between speed and memory usage. Keep in mind, also, that if the code bloat is too large, it will affect speed anyway via a decrease in cache performance.

A further difficulty is that on many architectures, most instructions use a register as an argument, and that the specific register to use must be hard-coded into the instruction. In other words, on many architectures, it is impossible to code such an instruction as "multiply the contents of register X and register Y and put the result in register Z", where X, Y, and Z are numbers taken from other registers or memory. This has often been cited as a reason that software pipelining cannot be effectively implemented on conventional architectures.

In fact, Monica Lam presents an elegant solution to this problem in her thesis, A Systolic Array Optimizing Compiler (1989) (ISBN 0-89838-300-5). She calls it Modulo Renaming. The trick is to replicate the body of the loop after it has been scheduled, allowing different registers to be used for different values of the same variable when they have to be live at the same time. For the simplest possible example, let's suppose that A(i) and B(i) can be issued in parallel and that the latency of the latter is 2 cycles. The pipelined body could then be:

A(i+2); B(i)

Register allocation of this loop body runs into the problem that the result of A(i+2) must stay live for two iterations. Using the same register for the result of A(i+2) and the input of B(i) will result in incorrect results.

However, if we replicate the scheduled loop body, the problem is solved:

A(i+2); B(i) A(i+3); B(i+1)

Now a separate register can be allocated to the results of A(i+2) and A(i+3). To be more concrete:

r1 = A(i+2); B(i) = r1 r2 = A(i+3); B(i+1) = r2 i = i + 2 // Just to be clear

On the assumption that each instruction bundle reads its input registers before writing its output registers, this code is correct. At the start of the replicated loop body, r1 holds the value of A(i+2) from the previous replicated loop iteration. Since i has been incremented by 2 in the meantime, this is actually the value of A(i) in this replicated loop iteration.

Of course, code replication increases code size and cache pressure just as the prologue and epilogue do. Nevertheless, for loops with large trip counts on architectures with enough instruction level parallelism, the technique easily performs well enough to be worth any increase in code size.

Read more about this topic: Software Pipelining

Famous quotes containing the words difficulties of and/or difficulties:

“Only a great actor finds the difficulties of the actor’s art infinite.”
—Ellen Terry (1847–1928)

“Only a great actor finds the difficulties of the actor’s art infinite.”
—Ellen Terry (1847–1928)