Despite numerous improvements in the design of chips, microprocessors are still bound by their internal clocks. These devices tick off "heartbeats" that synchronize a chip's operations. Even in a new Pentium processor, the clock meters operations according to the traditional von Neumann architecture: data is processed sequentially in the order set by the compiler.
Engineers have tried different approaches to maximize performance around this bottleneck. CISC fans argue that by packing complex tasks into a single instruction, a program can run more efficiently with fewer instructions. RISC, on the other hand, focuses on reduced instruction sets that can be processed in shorter cycle times. The latest processors are actually neither CISC nor RISC, but incorporate aspects of both.
Yet here again, the clock forces its will upon the microprocessor, limiting when and how instructions can be executed. Think of a processor as an airport. A traditional machine would have one runway and a set order for takeoff. Even though flight D is ready, it cannot shoot down the runway until flights A, B, and C have taken off, and C is still waiting for a passenger.
If you eliminated sequential execution, instructions could be carried out immediately when their inputs were ready. This technique, called dataflow, offers significant improvements over the traditional architecture. A dataflow airport would have no runway waiting list: an airplane can take off as soon as it is ready.
Some call this design clockless: Theseus Logic's Null Convention Logic circuit and Sharp's New Media Processor both claim to function with no central clock. The defining element of a dataflow processor, however, is not the presence or absence of a clock. It is that instructions execute automatically when the input data becomes available. This allows things to happen in parallel, without the software-coding nightmare of having to explicitly recognize when an operation will receive its required input and be ready to execute.
Data dependencies
A dataflow program is a sort of flow chart, in which individual instructions are represented by nodes. In a simple program such as [(a + b) x (c + d)] x [(e + f) x (g + h)], the (a + b) calculation represents one node, (c + d) another, and so on. Calculations execute automatically when the input data or operand pair (say, a and b) is available. Data flows from node to node according to the arrows of the graph, so that (a + b) becomes one of the operands for [(a + b) x (c + d)]. When the operand (c + d) is calculated, that information is passed on as well.
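To make the firing rule concrete, here is a minimal sketch in Python, purely illustrative and not taken from any actual dataflow machine: each node holds two operand slots, fires the moment both are filled, and pushes its result into a slot of the next node.

class Node:
    def __init__(self, op, name):
        self.op = op              # the operation this node performs, e.g. addition
        self.name = name
        self.slots = [None, None] # the two operand slots
        self.successors = []      # (node, slot) pairs that consume this node's result

    def deliver(self, slot, value):
        self.slots[slot] = value
        if None not in self.slots:            # both operands ready: fire
            result = self.op(*self.slots)
            print(self.name, "fires ->", result)
            for node, idx in self.successors:
                node.deliver(idx, result)

add = lambda x, y: x + y
mul = lambda x, y: x * y

# Build the graph for (a + b) x (c + d).
ab, cd = Node(add, "(a+b)"), Node(add, "(c+d)")
prod = Node(mul, "(a+b)x(c+d)")
ab.successors.append((prod, 0))
cd.successors.append((prod, 1))

# Operands can arrive in any order; each node fires the instant it is ready.
cd.deliver(0, 3); cd.deliver(1, 4)   # (c+d) happens to finish first
ab.deliver(0, 1); ab.deliver(1, 2)   # (a+b) fires, which in turn fires the product

Nothing in the sketch decides the order of execution: the product node simply cannot fire until both sums have delivered their results, and that is the whole of the scheduling logic.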
Yet if dataflow technology is so great, why didn't it catch on decades ago? After all, the idea sprang up in academic circles in the '60s, and in 1965 Robert Tomasulo applied a limited dataflow design to the floating-point unit for the IBM 360 Model 91. "Dataflow is a good idea, which is why it has kept resurfacing," says Yale Patt, a computer scientist at the University of Michigan. "But it also has several problems."
First of all, tracking the information flow and data dependencies of thousands of nodes proved to be insanely difficult. Take the issue of context switching. A computer usually handles multiple tasks in parallel, cycling through its to-do list and spending a fraction of a second at a time on each task. Given a sequential instruction stream, context switching (recording the current status of one job and moving on to the next) is a simple matter of leaving a marker at the stopping point and storing a few computed values. But, Patt explains, "the amount of state information in a dataflow graph (this operation has fired, this node is waiting for the second operand, et cetera) makes context switching incredibly difficult." The state information for a machine with a full dataflow architecture could run into the tens of thousands of bytes.
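To see why the numbers balloon, here is a rough Python illustration of the bookkeeping a full dataflow machine would have to save for every live node on a context switch; the field names are assumptions made for the sketch, not drawn from any real design.

from dataclasses import dataclass, field

@dataclass
class NodeState:
    opcode: str                                   # which operation this node performs
    fired: bool = False                           # has the node already executed?
    operands_present: list = field(default_factory=lambda: [False, False])
    operand_values: list = field(default_factory=lambda: [None, None])
    waiting_successors: list = field(default_factory=list)  # nodes still owed this result

# A sequential machine saves a program counter and a handful of registers.
# A dataflow machine must snapshot a record like this for every live node;
# with thousands of nodes in flight, the saved state quickly reaches
# tens of thousands of bytes.
snapshot = [NodeState("add") for _ in range(2000)]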
Debugging presents another problem. The straightforward way to test a processor is to let it execute to a certain point and then analyze what you get. "But a dataflow program does not just follow a sequential path and stop," says Patt. "So where was the bug?"
To market, to market
In 1985, Patt's research team built on Tomasulo's limited dataflow technique and proposed applying it to all chip operations through a microarchitecture called HPS, for high-performance substrate. And by the early '90s, most of the industry began listening. What is the Pro in the Pentium Pro? An on-the-fly dataflow architecture called dynamic scheduling, or out-of-order execution. Dynamic scheduling applies the benefits of dataflow processing to conventional programs. And software coding is not an obstacle, because the dataflow elements are built into the chip. Instructions flow in and out of the processor in sequential program order, but internally they are converted into a dataflow graph and executed according to the availability of data.
This graph, called the active window, is created at run time, so at any moment only a portion of the program is represented. When a node is ready (that is, its input data is available), the operands are sent to the functional unit to be computed, and that node of the graph disappears. Every cycle, new nodes are created and others are retired.
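As an illustration of the idea (not of the Pentium Pro's actual circuitry), here is a hedged Python sketch of dynamic scheduling: instructions enter a small active window in sequential program order, but each one executes the cycle its operands become available, even if that means jumping ahead of an older, stalled instruction.

def simulate(program, arrivals, window_size=4):
    regs = {"c": 3, "d": 4}                    # operand values already on hand
    window, pc, cycle = [], 0, 0
    while pc < len(program) or window:
        cycle += 1
        regs.update(arrivals.get(cycle, {}))   # slow results (say, loads) land this cycle
        # Fetch in sequential program order until the active window is full.
        while pc < len(program) and len(window) < window_size:
            window.append(program[pc]); pc += 1
        # Execute every windowed instruction whose operands are now available.
        for ins in list(window):
            dest, op, srcs = ins
            if all(s in regs for s in srcs):
                regs[dest] = op(*(regs[s] for s in srcs))
                window.remove(ins)
                print(f"cycle {cycle}: {dest} executes")
    return regs

add = lambda x, y: x + y
mul = lambda x, y: x * y
program = [
    ("t1", add, ("a", "b")),    # first in program order, but a and b are not here yet
    ("t2", add, ("c", "d")),    # ready immediately, so it executes ahead of t1
    ("t3", mul, ("t1", "t2")),  # waits for both sums
]
simulate(program, arrivals={3: {"a": 1, "b": 2}})  # a and b arrive at cycle 3

The window in the simulation is a stand-in for the active window described above: only the handful of instructions currently inside it are candidates for execution, and each one leaves the window as soon as it has run.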
The HPS research group also improved on Tomasulo's 360/91 concept in other ways: by fetching multiple instructions per cycle; by incorporating a very aggressive dynamic branch predictor that anticipates future instructions, allowing the chip to get a head start; and, most important, by adding a mechanism to recover the precise state of the machine as it would have been if instructions had executed in sequential order. Essentially, this last addition enables the chip to correct itself if any execution does not complete correctly.
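The article does not spell out how that recovery works; one standard mechanism for the job is a reorder buffer, sketched here in Python under that assumption. Instructions may finish in any order, but their results become visible only in program order, so the machine can always roll back to a state identical to sequential execution.

class ReorderBuffer:
    def __init__(self):
        self.entries = []          # one slot per in-flight instruction, in program order

    def dispatch(self, name):
        entry = {"name": name, "done": False, "ok": True, "value": None}
        self.entries.append(entry)
        return entry

    def retire(self, regs):
        # Results reach architectural state only in program order, so the
        # registers always look as though execution had been purely sequential.
        while self.entries and self.entries[0]["done"]:
            head = self.entries.pop(0)
            if head["ok"]:
                regs[head["name"]] = head["value"]
            else:
                # A fault or mispredicted branch: squash every younger result.
                self.entries.clear()
                break

rob, regs = ReorderBuffer(), {}
a, b, c = rob.dispatch("a"), rob.dispatch("b"), rob.dispatch("c")
b.update(done=True, value=2)   # b finishes first, out of order...
rob.retire(regs)               # ...but cannot retire ahead of a
a.update(done=True, value=1)
c.update(done=True, ok=False)  # c turns out to lie down a mispredicted path
rob.retire(regs)               # a and b retire in order; c is squashed
print(regs)                    # {'a': 1, 'b': 2}

The precise state is whatever the registers hold at the moment of the squash, which by construction matches what a strictly sequential machine would have produced up to that point.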
Most of the industry has now adopted dynamic scheduling, something that has helped Intel's Pentium Pro achieve 30 percent higher performance than the Pentium. And as the number of possible nodes rises, processing performance will increase even more. Ahead of the curve, Patt's team is thinking about dynamic scheduling with thousands of nodes. Though the Pentium Pro can now handle only 20 nodes at a time, and the HP PA-8000 chip only 56, it's clear that the industry has decided to go with the dataflow.
This article originally appeared in the August issue of Wired magazine.