POWER3 - Description

Description

The POWER3 was based on the PowerPC 620, an earlier 64-bit PowerPC implementation that was late, under-performing and commercially unsuccessful. Like the PowerPC 620, the POWER3 has three fixed-point units, but the single floating-point unit (FPU) was replaced with two floating-point fused multiply–add units, and an extra load-store unit was added (for a total of two) to improve floating-point performance. The POWER3 is a superscalar design that executed instructions out of order. It has a seven-stage integer pipeline, a minimal eight-stage load/store pipeline and a ten-stage floating-point pipeline.

The front end consists of two stages: fetch and decode. During the first stage, eight instructions were fetched from a 32 KB instruction cache and placed in a 12-entry instruction buffer. During the second stage, four instructions were taken from the instruction buffer, decoded, and issued to instruction queues. Restrictions on instruction issue are few: of the two integer instruction queues, only one can accept one instruction, the other can accept up to four, as does the floating-point instruction queue. If the queues do not have enough unused entries, instructions cannot be issued. The front end has a short pipeline, resulting in a small three-cycle branch misprediction penalty.

In stage three, instructions in the instruction queues that are ready for execution have their operands read from the register files. The general-purpose register file contains 48 registers, of which 32 are general-purpose registers and 16 are rename registers for register renaming. To reduce the number of ports required to provide data and receive results, the general purpose register file is duplicated so that there are two copies, the first supporting three integer execution units and the second supporting the two load/store units. This scheme was similar to a contemporary microprocessor, the DEC Alpha 21264, but was simpler as it did not require an extra clock cycle to synchronize the two copies due to the POWER3's higher cycle times. The floating-point register file contains 56 registers, of which 32 are floating-point registers and 24 rename registers. Compared to the PowerPC 620, there were more rename registers, which allowed more instructions to be executed out of order, improving performance.

Execution begins in stage four. The instruction queues dispatch up to eight instructions to the execution units. Integer instructions are executed in three integer execution units (termed "fixed-point units" by IBM). Two of the units are identical and execute all integer instructions except for multiply and divide. All instructions executed by them have a one-cycle latency. The third unit executes multiply and divide instructions. These instructions are not pipelined and have multi-cycle latencies. 64-bit multiply has a nine-cycle latency and 64-bit divide has a 37-cycle latency.

Floating-point instructions are executed in two floating-point units (FPUs). The FPUs are capable of fused multiply–add, where multiplication and addition is performed simultaneously. Such instructions, along with individual add and multiply, have a four-cycle latency. Divide and square-root instructions are executed in the same FPUs, but are assisted by specialized hardware. Single-precision (32-bit) divide and square-root instructions have a 14-cycle latency, whereas double-precision (64-bit) divide and square-root instructions have an 18-cycle and a 22-cycle latency, respectively.

After execution is completed, the instructions are held in buffers before being committed and made visible to software. Execution finishes in stage five for integer instructions and stage eight for floating-point. Committing occurs during stage six for integers, stage nine for floating-point. Writeback occurs in the stage after commit. The POWER3 can retire up to four instructions per cycle.

The PowerPC 620 data cache was optimized for technical and scientific applications. Its capacity was doubled to 64 KB, to improve the cache-hit rate; the cache was dual-ported, implemented by interleaving eight banks, to enable two loads or two stores to be performed in one cycle in certain cases; and the line-size was increased to 128-bytes. The L2 cache bus was doubled in width to 256 bits to compensate for the larger cache line size and to retain a four-cycle latency for cache refills.

The POWER3 contained 15 million transistors on a 270 mm2 die. It was fabricated in IBM's CMOS-6S2 process, a complementary metal–oxide–semiconductor process that is a hybrid of 0.25 µm feature sizes and 0.35 µm metal layers. The process features five layers of aluminium. It was packaged in the same 1,088-column ceramic column grid array as the P2SC, but with a different pin out.

Read more about this topic:  POWER3

Famous quotes containing the word description:

    God damnit, why must all those journalists be such sticklers for detail? Why, they’d hold you to an accurate description of the first time you ever made love, expecting you to remember the color of the room and the shape of the windows.
    Lyndon Baines Johnson (1908–1973)

    Everything to which we concede existence is a posit from the standpoint of a description of the theory-building process, and simultaneously real from the standpoint of the theory that is being built. Nor let us look down on the standpoint of the theory as make-believe; for we can never do better than occupy the standpoint of some theory or other, the best we can muster at the time.
    Willard Van Orman Quine (b. 1908)

    It [Egypt] has more wonders in it than any other country in the world and provides more works that defy description than any other place.
    Herodotus (c. 484–424 B.C.)