Geek Page - VLIW: Beyond RISC

The Next Revolution in Microprocessors

While the big microprocessor developers don't want to tip their hands, there have been hints that the next generation of processors will be radically different from RISC processors. Most likely, they will be based on another four-letter acronym: VLIW (Very Long Instruction Word). VLIW has been around since at least the early 1980s and is perhaps most notable for the number of companies it has left bankrupt. Yet Hewlett-Packard, working in cooperation with Intel, has hired Joseph Fisher and Bob Rau, the two stars of VLIW research, and, according to industry reports, IBM and Digital also appear to be racing toward this new technology.

Current performance rests on the fundamental equation of computer architecture: the time required for a program to run is equal to the number of instructions in the program, times the average number of cycles required for an instruction, times the clock cycle period. (A clock cycle is the heartbeat of a microprocessor: each pulse triggers one step of computation.) All improvements in performance come from reducing one or more of these factors. The idea behind CISC was to reduce the first factor - the number of instructions - by making a single instruction do complex tasks. The problem was that the other two factors of the equation shot up. This led to the RISC approach of short, simple instructions and a clock cycle fast enough to compensate for the increase in the number of instructions. This approach works well, as far as it goes. But raw chip speed never increases by more than 25 percent a year, so additional performance must come from the growing number of transistors that fit on a single chip.
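The equation can be made concrete with a small sketch (the instruction counts, cycle counts, and clock periods below are illustrative assumptions, not measurements of any real chip):

```python
def run_time(instructions, cycles_per_instruction, cycle_period_ns):
    """Fundamental equation: time = instructions x CPI x cycle period."""
    return instructions * cycles_per_instruction * cycle_period_ns

# Hypothetical CISC program: fewer, more complex instructions,
# but more cycles per instruction and a slower clock.
cisc = run_time(instructions=1_000_000, cycles_per_instruction=4.0,
                cycle_period_ns=20.0)

# Hypothetical RISC version of the same program: 50% more instructions,
# but each is simple (about one cycle) and the clock is twice as fast.
risc = run_time(instructions=1_500_000, cycles_per_instruction=1.2,
                cycle_period_ns=10.0)

print(cisc, risc)  # 80000000.0 18000000.0 - RISC wins on two of three factors
```

The point of the sketch is that the three factors multiply: a win in one factor is worthless if the other two shoot up by more.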

The first answer, emerging at the end of the 1980s, was superscalar RISC. A microprocessor is like a factory, made up of a dozen or so stations, each of which handles a single, simple task (for example, one station handles all additions). If a processor has two copies of all these stations, then it can work on any two instructions simultaneously. This solution had seductive appeal: simply double the number of stations and you get twice the performance.

It wasn't that easy. An instruction that adds two numbers may be followed by an instruction that uses the result. The second instruction cannot be executed until the first has been completed. Superscalar processors must detect these dependencies and ensure that only independent instructions are executed simultaneously. This results in the key problem facing designers: the more instructions you try to execute simultaneously, the more dependencies you have to check for at every cycle, and therefore the longer your cycle time. So designers are faced with a trade-off between the Digital approach, whose Alpha AXP 21064A processor runs at 275 MHz but can execute a maximum of only two simultaneous instructions, and the IBM approach, whose POWER2 plods along at 71.5 MHz but can execute up to six simultaneous instructions.
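The check the hardware must perform every cycle can be sketched in a few lines of Python. This is a toy three-operand model, not any real processor's issue logic; the register names are made up:

```python
def dependent(first, second):
    """True if `second` cannot safely issue alongside `first`.

    Instructions are modeled as (op, dest, (sources...)) tuples.
    Three hazards are checked: `second` reads `first`'s result (RAW),
    both write the same register (WAW), or `second` overwrites a
    register `first` still needs to read (WAR).
    """
    _, dest1, srcs1 = first
    _, dest2, srcs2 = second
    return dest1 in srcs2 or dest1 == dest2 or dest2 in srcs1

add   = ("add", "r3", ("r1", "r2"))   # r3 = r1 + r2
use   = ("mul", "r4", ("r3", "r5"))   # reads r3: must wait for the add
other = ("sub", "r6", ("r7", "r8"))   # touches neither r3 nor its sources

print(dependent(add, use))    # True  - cannot issue together
print(dependent(add, other))  # False - safe to execute simultaneously
```

A superscalar chip wires up this comparison for every pair of instructions it might issue together, which is why the checking circuitry balloons as the issue width grows.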

VLIW avoids this crippling trade-off - it can execute many instructions simultaneously at a high clock speed. That's why so much money is now being thrown into its development. VLIW follows a maxim popular in the hardware community: move the difficult stuff out of the hardware and into the software. Rather than having the microprocessor figure out on the fly which instructions are independent, have the compiler - the tool that translates a high-level program into machine instructions - figure it out beforehand.

So a compiler for a hypothetical 16-way VLIW processor first converts a program from a language like C or Fortran into standard RISC machine instructions. It then goes through the RISC instructions and sticks 16 of them together to produce one very long instruction. The compiler is able to determine when dependent instructions can be executed because it knows exactly how many cycles an instruction takes to produce its result. When the compiled program is run, the processor grabs each long instruction in a single gulp and, without wasting any time worrying about dependencies, immediately routes the 16 independent pieces to the appropriate stations. The VLIW processor itself is almost identical to a superscalar processor, but with the dependency-checking circuitry eliminated and the instruction word widened.
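What the compiler does ahead of time can be sketched as a greedy bundler. This is a deliberately simplified model (the `conflicts` helper and the instruction tuples are illustrative assumptions; a real VLIW compiler also reorders instructions and accounts for latencies):

```python
def conflicts(a, b):
    """True if b reads or writes a register that a writes, or writes one a reads."""
    _, dest_a, srcs_a = a
    _, dest_b, srcs_b = b
    return dest_a in srcs_b or dest_a == dest_b or dest_b in srcs_a

def pack_bundles(instrs, width=16):
    """Walk the RISC instruction stream in order, packing each instruction
    into the current long word unless it conflicts with something already
    in it (or the word is full)."""
    bundles = [[]]
    for ins in instrs:
        current = bundles[-1]
        if len(current) < width and not any(conflicts(p, ins) for p in current):
            current.append(ins)
        else:
            bundles.append([ins])
    return bundles

program = [
    ("add", "r1", ("r2", "r3")),
    ("sub", "r4", ("r5", "r6")),    # independent: shares the first word
    ("mul", "r7", ("r1", "r4")),    # reads r1 and r4: starts a new word
    ("add", "r8", ("r9", "r10")),   # independent: joins the mul's word
]
print([len(b) for b in pack_bundles(program)])  # [2, 2]
```

At run time the processor just splits each long word at fixed positions and routes the pieces to the stations - no checking circuitry required, because the compiler has guaranteed the pieces are independent.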

But, along with some significant design challenges, two major stumbling blocks remain. The first is that current compilers can rarely find more than four independent instructions per cycle. Even Joseph Fisher, who coined the term VLIW, has said that VLIW really only achieves significant speed increases for more predictable scientific programs.

The second stumbling block is more subtle, and is as much a marketing issue as a design problem. We are accustomed to processors being downward compatible: if a program runs on a 25 MHz 386, it will also run on a 50 MHz 486. This is not true for VLIW processors. Remember that a VLIW compiler schedules instructions based on the precise number of cycles each instruction takes to execute. Run a program on a processor with a faster multiplier than the one it was compiled for, and results arrive in a different order than the compiler planned for, causing nasty and unpredictable behavior. If you believe the hints coming out of Hewlett-Packard, it may be possible to preserve downward compatibility with hardware emulation, albeit with some sacrifice in performance.
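The hazard can be illustrated with a toy machine whose latencies are exposed to the compiler (the latencies, register names, and values below are all hypothetical):

```python
def simulate(mul_latency):
    """Toy exposed-latency machine: a multiply issued at cycle t writes its
    destination at cycle t + mul_latency; reads see whatever the register
    holds at that moment. The 'compiled' schedule below assumed a 3-cycle
    multiplier, so it reads the OLD value of r3 at cycle 2."""
    regs = {"r1": 2, "r2": 3, "r3": 10}
    pending = []  # (write_cycle, register, value)
    observed = None
    for cycle in range(5):
        for entry in list(pending):          # retire results that are due
            when, reg, val = entry
            if when == cycle:
                regs[reg] = val
                pending.remove(entry)
        if cycle == 0:                       # mul r3 = r1 * r2
            pending.append((cycle + mul_latency, "r3", regs["r1"] * regs["r2"]))
        if cycle == 2:                       # scheduled read of the old r3
            observed = regs["r3"]
    return observed

print(simulate(mul_latency=3))  # 10: the old value, as the compiler intended
print(simulate(mul_latency=2))  # 6:  the faster multiplier clobbered r3 early
```

The same binary produces different answers on the two chips - which is exactly why a VLIW program compiled for one implementation cannot, in general, run unmodified on the next.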

It's going to take a little luck and a few hundred million dollars in R&D, but VLIW should allow computer performance to continue on its exponential growth curve for the next 5 to 10 years. It will achieve this by taking full advantage of chip speed with simple, streamlined hardware, and of chip transistor densities with many duplicate processing stations. But the most important result of VLIW may be the closer ties it is fostering between hardware and compiler researchers. It is a relationship that has never been as tight as it should be, and could lead to startling new developments.