Software Pipelining - Implementation

Implementation

Software pipelining is often used in combination with loop unrolling, and this combination of techniques is often a far better optimization than loop unrolling alone. In the example above, we could write the code as follows (assume for the moment that bignumber is divisible by 3):

for i = 1 to (bignumber - 2) step 3 A(i) A(i+1) A(i+2) B(i) B(i+1) B(i+2) C(i) C(i+1) C(i+2) end

Of course, matters are complicated if (as is usually the case) we can't guarantee that the number of iterations will be divisible by the number of iterations we unroll. See the article on loop unrolling for more on solutions to this problem. It should be noted, however, that software pipelining prevents the use of Duff's device, a widely known and efficient solution to this problem.

In the general case, loop unrolling may not be the best way to implement software pipelining. Consider a loop containing instructions with a high latency. For example, the following code:

for i = 1 to bignumber A(i) ; 3 cycle latency B(i) ; 3 C(i) ; 12(perhaps a floating point operation) D(i) ; 3 E(i) ; 3 F(i) ; 3 end

would require 12 iterations of the loop to be unrolled to avoid the bottleneck of instruction C. This means that the code of the loop would increase by a factor of 12 (which not only affects memory usage, but can also affect cache performance, see code bloat). Even worse, the prologue (code before the loop for handling the case of bignumber not divisible by 12) will likely be even larger than the code for the loop, and very probably inefficient because software pipelining cannot be used in this code (at least not without a significant amount of further code bloat). Furthermore, if bignumber is expected to be moderate in size compared to the number of iterations unrolled (say 10-20), then the execution will spend most of its time in this inefficient prologue code, rendering the software pipelining optimization ineffectual.

By contrast, here is the software pipelining for our example (the prologue and epilogue will be explained later):

prologue for i = 1 to (bignumber - 6) A(i+6) B(i+5) C(i+4) D(i+2) ; note that we skip i+3 E(i+1) F(i) end epilogue

Before getting to the prologue and epilogue, which handle iterations at the beginning and end of the loop, let's verify that this code does the same thing as the original for iterations in the middle of the loop. Specifically, consider iteration 7 in the original loop. The first iteration of the pipelined loop will be the first iteration that includes an instruction from iteration 7 of the original loop. The sequence of instructions is:

Iteration 1: A(7) B(6) C(5) D(3) E(2) F(1)

Iteration 2: A(8) B(7) C(6) D(4) E(3) F(2)

Iteration 3: A(9) B(8) C(7) D(5) E(4) F(3)

Iteration 4: A(10) B(9) C(8) D(6) E(5) F(4)

Iteration 5: A(11) B(10) C(9) D(7) E(6) F(5)

Iteration 6: A(12) B(11) C(10) D(8) E(7) F(6)

Iteration 7: A(13) B(12) C(11) D(9) E(8) F(7)

However, unlike the original loop, the pipelined version avoid the bottleneck at instruction C. Note that there are 12 instructions between C(7) and the dependent instruction D(7), which means that the latency cycles of instruction C(7) are used for other instructions instead of being wasted.

The prologue and epilogue handle iterations at the beginning and end of the loop. Here is a possible prologue for our example above:

; loop prologue (arranged on lines for clarity) A(1) A(2), B(1) A(3), B(2), C(1) A(4), B(3), C(2) A(5), B(4), C(3), D(1) A(6), B(5), C(4), D(2), E(1)

Each line above corresponds to an iteration of the pipelined loop, with instructions for iterations that have not yet begun removed. The epilogue would look similar.

Read more about this topic: Software Pipelining