Description
In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be—in theory at least—encoded directly into the instruction. However, in efficient implementation things are rarely that simple. The data is rarely sent in raw form, and is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time. As CPU speeds have increased, this memory latency has historically become a large impediment to performance; see Memory wall.
In order to reduce the amount of time this takes, most modern CPUs use a technique known as instruction pipelining in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself. With pipelining the "trick" is to start decoding the next instruction even before the first has left the CPU, in the fashion of an assembly line, so the address decoder is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the latency, but the CPU can process an entire batch of operations much faster than if it did so one at a time.
Vector processors take this concept one step further. Instead of pipelining just the instructions, they also pipeline the data itself. There are fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there". Instead of constantly having to decode instructions and then fetch the data needed to complete them, it reads a single instruction from memory, and "knows" that the next address will be one larger than the last. This allows for significant savings in decoding time.
To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language you would write a "loop" that picked up each of the pairs of numbers in turn, and then added them. To the CPU, this would look something like this:
execute this loop 10 times read the next instruction and decode it fetch this number fetch that number add them put the result here end loopBut to a vector processor, this task looks considerably different:
read instruction and decode it fetch these 10 numbers fetch those 10 numbers add them put the results hereThere are several savings inherent in this approach. For one, only two address translations are needed. Depending on the architecture, this can represent a significant savings by itself. Another saving is fetching and decoding the instruction itself, which has to be done only one time instead of ten. The code itself is also smaller, which can lead to more efficient memory use.
But more than that, a vector processor may have multiple functional units adding those numbers in parallel. The checking of dependencies between those numbers is not required as a vector instruction specifies multiple independent operations. This simplifies the control logic required, and can improve performance by avoiding stalls.
As mentioned earlier, the Cray implementations took this a step further, allowing several different types of operations to be carried out at the same time. Consider code that adds two numbers and then multiplies by a third; in the Cray, these would all be fetched at once, and both added and multiplied in a single operation. Using the pseudocode above, the Cray did:
read instruction and decode it fetch these 10 numbers fetch those 10 numbers fetch another 10 numbers add and multiply them put the results hereThe math operations thus completed far faster overall, the limiting factor being the time required to fetch the data from memory.
Not all problems can be attacked with this sort of solution. Adding these sorts of instructions necessarily adds complexity to the core CPU. That complexity typically makes other instructions run slower—i.e., whenever it is not adding up many numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down the decoding of the more common instructions such as normal adding.
In fact, vector processors work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in supercomputers, as the supercomputers themselves were, in general, found in places such as weather prediction centres and physics labs, where huge amounts of data are "crunched".
Read more about this topic: Vector Processor
Famous quotes containing the word description:
“The Sage of Toronto ... spent several decades marveling at the numerous freedoms created by a global village instantly and effortlessly accessible to all. Villages, unlike towns, have always been ruled by conformism, isolation, petty surveillance, boredom and repetitive malicious gossip about the same families. Which is a precise enough description of the global spectacles present vulgarity.”
—Guy Debord (b. 1931)
“As they are not seen on their way down the streams, it is thought by fishermen that they never return, but waste away and die, clinging to rocks and stumps of trees for an indefinite period; a tragic feature in the scenery of the river bottoms worthy to be remembered with Shakespeares description of the sea-floor.”
—Henry David Thoreau (18171862)
“Whose are the truly labored sentences? From the weak and flimsy periods of the politician and literary man, we are glad to turn even to the description of work, the simple record of the months labor in the farmers almanac, to restore our tone and spirits.”
—Henry David Thoreau (18171862)