Sunday, June 23, 2019

Future engineering techniques in electronics: analog logic, digital logic, combined analog and digital conditional logic (fuzzy logic), and the move toward instinct-logic technology in robotic automation systems and a world energy research network based on ADA (Address - Data - Accumulation). AMNIMARJESLOW, Gen. Mac Tech Zone.






  
An ADA (Address - Data - Accumulation) network is composed of a series of ADA mapping logic blocks built on analog and digital electronic engineering techniques. These 21st-century electronic technologies are leading toward instinctive electronics and learning technologies, and the network will be compiled from the engineering techniques we develop across microcomputer networks all over the world to encourage appropriate network research.


                                                                                LOVE & e- WET In Hello EINSTEIN
                                                                            ( Energy Input Saucer Tech Energy Intern )



                                                                               Gen . Mac Tech Zone Love in 2 e- YES              




                                   XI PIC  Trends in Microcomputers
_______________________________________________________


Fueling the microcomputer product and market expansion is a rapid technological evolution. Today, the microcomputer market is fundamentally technology-driven, and it is expected to remain in this condition for at least 10 more years. To characterize market trends, it is, therefore, essential to examine first the LSI technology trends and then assess the potential market impact. The following projections will be limited to the MOS technology, since it represents the fastest-moving and most promising technology for high-performance and large-complexity VLSI circuits. Each technology is characterized by figures of merit that relate to performance and cost. The most common figures of merit are:
  • Propagation delay, i.e., the time delay of a signal through a logic gate driving 10 identical gates. Propagation delay is usually measured in nanoseconds.
  • Speed-power product, i.e., the product of the propagation delay of a gate and its power dissipation, usually measured in picojoules.
  • Gate density and bit density measured in gates per square millimeter and bits per square millimeter.
  • Cost per bit and cost per gate, measured in cents per bit and cents per gate for a product that has reached high-volume production levels.
Figure 1 shows past and expected future trends of speed-power product and propagation delay (tpd) for the major generations of state-of-the-art noncomplementary MOS technologies used for LSI production.
Figure 2 shows past and expected future trends of bit density for major generations of dynamic RAMs. The figure also shows expected chip size and the expected first year of production for each major new RAM generation. Figure 3 shows trends of random-logic gate density and how this translates into practical gate complexity and circuit size for major generations of random-logic circuits. Underscoring these trends are the following considerations and developments. Optical photolithography limits will be reached by the late seventies, and further progress will be made possible by the application to large-scale production of electron beam lithography, now under development. Electron beam lithography will make possible the scaling down of structures to micron and submicron sizes with a consequent increase in density. The actual physical limitations to a continuing increase in complexity and performance are not expected to result from line-width limitations but rather from breakdown phenomena in semiconductors and from total power dissipation. Breakdown phenomena are usually proportional to electric field strengths; therefore, as the geometry is scaled down, the supply voltage must be reduced. Ultimately, thermal phenomena will limit this voltage to a multiple of kT/q. A gross estimate of a practical limit for MOS technology is a circuit using complementary MOS technology, operating at a supply voltage of 400 mV, having a minimum line width of 1/4 μm, dissipating 1 W at 100 MHz of operating frequency, having a size of about 5 cm by 5 cm, and having the complexity of about 100 million gates! This shows that the trends shown in Fig. 1, Fig. 2, and Fig. 3 are still very far from a practical limit and that technological acceleration will continue well beyond the next decade.
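The kT/q figure quoted above is easy to sanity-check. Here is a minimal sketch in C; the physical constants and the 300 K temperature are standard values supplied here, not numbers from the article:

    #include <stdio.h>

    int main(void) {
        const double k = 1.380649e-23;    /* Boltzmann constant, J/K */
        const double q = 1.602176634e-19; /* electron charge, C      */
        const double T = 300.0;           /* room temperature, K     */

        double vt = k * T / q;            /* thermal voltage, about 0.026 V */
        printf("kT/q at %.0f K = %.1f mV\n", T, vt * 1e3);
        printf("a 400 mV supply is about %.0f times kT/q\n", 0.400 / vt);
        return 0;
    }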

      


A caveat that applies to the previous data is that the data is valid for state-of-the-art, high-volume products or technologies and not for R and D projects. Finally, Fig. 4 shows the cost-equivalent die size as a function of time for state-of-the-art, high-volume-production products. The increase in die size for a given cost is made possible by the use in production of larger-diameter wafers, as shown, and the continuing improvement and control of yield-limiting factors, such as mask quality, fabrication-equipment sophistication, and clean-room facilities. I should stress that only mature products follow the curve of Fig. 4, i.e., products in high-volume production with similar production-volume history. For a product of a given chip size, the cost (not the price) is found to follow a 70 percent learning curve; i.e., the cost becomes 70 percent of the original every time the cumulative volume produced doubles.
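The 70 percent learning curve is straightforward to model. A minimal sketch in C follows; the starting cost and volume figures are hypothetical, not data from the text:

    #include <math.h>
    #include <stdio.h>

    /* Unit cost falls to 70 percent of its previous value every time the
       cumulative volume produced doubles. */
    double learning_curve_cost(double initial_cost, double initial_volume,
                               double cumulative_volume) {
        double doublings = log2(cumulative_volume / initial_volume);
        return initial_cost * pow(0.70, doublings);
    }

    int main(void) {
        double c0 = 10.0;                         /* hypothetical cost at 10,000 units */
        for (double v = 10e3; v <= 640e3; v *= 2)
            printf("%8.0f units: $%.2f\n", v, learning_curve_cost(c0, 10e3, v));
        return 0;
    }

After six doublings the cumulative volume has grown 64-fold and the unit cost has fallen to roughly 12 percent of the original, which is the kind of decline the curve of Fig. 4 presupposes.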
Microcomputer Trends

The data given only shows the inherent capabilities of technology. The products suggested in the curves are only indicative of the increased complexity possible in relationship to and in conformity with today's products. However, the real impact of such technology potential is in creating the breeding ground for a new revolutionary development of which the microcomputer is the forerunner. To better clarify this concept, let's examine the influence semiconductor technology has had on the evolution of the basic constituents of a computer:
  • Memory. This function was the first to be integrated, and over a period of 6 years, semiconductor memories have practically replaced the magnetic core memory. Much of the technological development motivation in the seventies was due to the existence and the demands of the memory market.
  • CPU. As soon as memory technology reached a sufficient level of maturity, the function of a simple CPU could be integrated-the microprocessor was born. Microprocessors still use memory technology for their implementation and have borrowed architectural concepts from the well-developed area of computer architecture. I need to stress here that since computer architecture has evolved under the economic and technological reality of small-scale and medium-scale integration, it is predictable that LSI and VLSI will have a profound influence on computer and system architecture in general.
  • Input/output. This function, because of the multiplicity of requirements, was the last to be integrated, and this process is still in its infancy. To solve the I/O problem, our industry has introduced a novel idea, i.e., input/output devices whose hardware configuration and timing requirements are software-programmable. This way, the same circuit can be adapted to a variety of different uses within the same class: parallel interface, serial interface, or specific peripheral controllers.
  • Software. So far, software technology has only been marginally affected by the existence of microcomputers. Areas of influence are, for example, in diagnostic tools, such as software-development systems and specialized logic analyzers and hardware emulation tools. Under the pressure of an expanding market, however, microcomputer software is rapidly maturing to the level of sophistication found in minicomputer and megacomputer software. High-level languages specifically designed for microcomputers are now being developed, and the trend will continue by incorporating features into the microcomputer architecture that will make high-level programming very efficient.
                      
                                             Intel Microprocessors: 8008 to 8086
________________________________________________________________________


I. Introduction
"In the beginning Intel created the 4004 and the 8008."

A. The Prophecy
  
Intel introduced the microprocessor in November 1971 with the advertisement, "Announcing a New Era in Integrated Electronics." The fulfillment of this prophecy has already occurred with the delivery of the 8008 in 1972, the 8080 in 1974, the 8085 in 1976, and the 8086 in 1978. During this time, throughput has improved 100-fold, the price of a CPU chip has declined from $300 to $3, and microcomputers have revolutionized design concepts in countless applications. They are now entering our homes and cars.
Each successive product implementation depended on semiconductor process innovation, improved architecture, better circuit design, and more sophisticated software, yet upward compatibility not envisioned by the first designers was maintained. This paper provides an insight into the evolutionary process that transformed the 8008 into the 8086, and gives descriptions of the various processors, with emphasis on the 8086.

B. Historical Setting
  
In the late 1960s it became clear that the practical use of large-scale integrated circuits (LSI) depended on defining chips having

  • High gate-to-pin ratio


  • Regular cell structure


  • Large standard-part markets
In 1968, Intel Corporation was founded to exploit the semiconductor memory market, which uniquely fulfilled these criteria. Early semiconductor RAMs, ROMs, and shift registers were welcomed wherever small memories were needed, especially in calculators and CRT terminals. In 1969, Intel engineers began to study ways of integrating and partitioning the control logic functions of these systems into LSI chips.
At this time other companies (notably Texas Instruments) were exploring ways to reduce the design time to develop custom integrated circuits usable in a customer's application. Computer-aided design of custom ICs was a hot issue then. Custom ICs are making a comeback today, this time in high-volume applications which typify the low end of the microprocessor market.
An alternate approach was to think of a customer's application as a computer system requiring a control program, I/O monitoring, and arithmetic routines, rather than as a collection of special-purpose logic chips. Focusing on its strength in memory, Intel partitioned systems into RAM, ROM, and a single controller chip, the central processor unit (CPU).
Intel embarked on the design of two customer-sponsored microprocessors, the 4004 for a calculator and the 8008 for a CRT terminal. The 4004, in particular, replaced what would otherwise have been six customized chips, usable by only one customer. Because the first microcomputer applications were known, tangible, and easy to understand, instruction sets and architectures were defined in a matter of weeks. Since they were programmable computers, their uses could be extended indefinitely.
Both of these first microprocessors were complete CPUs-on-a-chip and had similar characteristics. But because the 4004 was designed for serial BCD arithmetic while the 8008 was made for 8-bit character handling, their instruction sets were quite different.
The succeeding years saw the evolutionary process that eventually led to the 8086. Table 1 summarizes the progression of features that took place during these years.

II. 8008 Objectives and Constraints

Late in 1969 Intel Corporation was contracted by Computer Terminal Corporation (today called Datapoint) to do a pushdown stack chip for a processor to be used in a CRT terminal. Datapoint had intended to build a bit-serial processor in TTL logic using shift-register memory. Intel counterproposed to implement the entire processor on one chip, which was to become the 8008. This processor, along with the 4004, was to be fabricated using the then-current memory fabrication technology, p-MOS. Due to the long lead time required by Intel, Computer Terminal proceeded to market the serial processor and thus compatibility constraints were imposed on the 8008.
Most of the instruction-set and register organization was specified by Computer Terminal. Intel modified the instruction set so the processor would fit on one chip and added instructions to make it more general-purpose. For although Intel was developing the 8008 for one particular customer, it wanted to have the option of selling it to others. Intel was using only 16- and 18-pin packages in those days, and rather than require a new package for what was believed to be a low-volume chip, they chose to use 18 pins for the 8008.



Table 1  Feature Comparison

                          8008             8080               8085               8086
Number of instructions    66               111                113                133
Number of flags           4                5                  5                  9
Maximum memory size       16K bytes        64K bytes          64K bytes          1M bytes
I/O ports                 8 input,         256 input,         256 input,         64K input,
                          24 output        256 output         256 output         64K output
Number of pins            18               40                 40                 40
Address bus width         8†               16                 16                 20†
Data bus width            8†               8                  8                  16†
Data types                8-bit unsigned   8-bit unsigned     8-bit unsigned     8-bit unsigned
                                           16-bit unsigned    16-bit unsigned    8-bit signed
                                             (limited)          (limited)        16-bit unsigned
                                           Packed BCD         Packed BCD         16-bit signed
                                             (limited)          (limited)        Packed BCD
                                                                                 Unpacked BCD
Addressing modes          Register‡        Register‡          Register‡          Register
                          Immediate        Immediate          Immediate          Immediate
                                           Memory direct      Memory direct      Memory direct
                                             (limited)          (limited)        Memory indirect
                                           Memory indirect    Memory indirect    Indexing
                                             (limited)          (limited)
Introduction date         1972             1974               1976               1978

† Address and data bus multiplexed.
‡ Memory can be addressed as a special case by using register M.

III. 8008 Instruction-Set Processor
 
The 8008 processor architecture is quite simple compared to modern-day microprocessors. The data-handling facilities provide for byte data only. The memory space is limited to 16K bytes, and the stack is on the chip and limited to a depth of 8. The instruction set is small but symmetrical, with only a few operand-addressing modes available. An interrupt mechanism is provided, but there is no way to disable interrupts.

A. Memory and I/O Structure
  
The 8008 addressable memory space consists of 16K bytes. That seemed like a lot back in 1970, when memories were expensive and LSI devices were slow. It was inconceivable in those days that anybody would want to put more than 16K of this precious resource on anything as slow as a microprocessor.
The memory size limitation was imposed by the lack of available pins. Addresses are sent out in two consecutive clock cycles over an 8-bit address bus. Two control signals, which would have been on dedicated pins if these had been available, are sent out together with every address, thereby limiting addresses to 14 bits.
The 8008 provides eight 8-bit input ports and twenty-four 8-bit output ports. Each of these ports is directly addressable by the instruction set. It was felt that output ports were more important than input ports because input ports can always be multiplexed by external hardware under control of additional output ports.
One of the interesting things about that era was that, for the first time, the users were given access to the memory bus and could define their own memory structure; they were not confined to what the vendors offered, as they had been in the minicomputer era. As an example, the user had the option of putting I/O ports inside the memory address space instead of in a separate I/O space.

Intel Microprocessors: 8008 to 8086 617


B. Register Structure
  
The 8008 processor contains two register files and four 1-bit flags. The register files are referred to as the scratchpad and the address stack.

1. Scratchpad. The scratchpad file contains an 8-bit accumulator called A and six additional 8-bit registers called B, C, D, E, H, and L. All arithmetic operations use the accumulator as one of the operands and store the result back in the accumulator. All seven registers can be used interchangeably for on-chip temporary storage.
There is one pseudo-register, M, which can be used interchangeably with the scratchpad registers. M is, in effect, that particular byte in memory whose address is currently contained in H and L (L contains the eight low-order bits of the address and H contains the six high-order bits). Thus M is a byte in memory and not a register; although instructions address M as if it were a register, accesses to M actually involve memory references. The M register is the only mechanism by which data in memory can be accessed.
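A minimal sketch in C of the address formed by H and L for the pseudo-register M; the helper name and the example values are mine, not Intel's:

    #include <stdint.h>
    #include <stdio.h>

    static uint8_t memory[1 << 14];                /* the 16K-byte 8008 address space */

    /* L supplies the eight low-order address bits, H the six high-order bits. */
    static uint16_t m_address(uint8_t h, uint8_t l) {
        return (uint16_t)(((h & 0x3F) << 8) | l);
    }

    int main(void) {
        uint8_t H = 0x2A, L = 0x80;                /* arbitrary example contents */
        memory[m_address(H, L)] = 0x55;            /* an access "through M"      */
        printf("M currently names memory address 0x%04X\n", m_address(H, L));
        return 0;
    }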

2. Address Stack. The address stack contains a 3-bit stack pointer and eight 14-bit address registers providing storage for eight addresses. These registers are not directly accessible by the programmer; rather they are manipulated with control-transfer instructions.
Any one of the eight address registers in the address stack can serve as the program counter; the current program counter is specified by the stack pointer. The other seven address registers permit storage for nesting of subroutines up to seven levels deep. The execution of a call instruction causes the next address register in turn to become the current program counter, and the return instruction causes the address register that last served as the program counter to again become the program counter. The stack will wrap around if subroutines are nested more than seven levels deep.
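A small model in C of the eight-entry address stack and its wrap-around; this is an illustrative sketch of the behavior described above, not the actual chip logic:

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint16_t slot[8];   /* eight 14-bit address registers             */
        uint8_t  sp;        /* 3-bit stack pointer selects the current PC */
    } addr_stack;

    static void call(addr_stack *s, uint16_t target) {
        s->sp = (s->sp + 1) & 7;          /* next register becomes the program counter */
        s->slot[s->sp] = target & 0x3FFF;
    }

    static void ret(addr_stack *s) {
        s->sp = (s->sp - 1) & 7;          /* previous register becomes the PC again */
    }

    int main(void) {
        addr_stack s = { {0}, 0 };
        for (int depth = 1; depth <= 9; depth++)   /* nesting deeper than seven levels */
            call(&s, (uint16_t)(0x100 * depth));   /* silently wraps around            */
        ret(&s);
        printf("stack pointer after 9 calls and 1 return: %u\n", s.sp);
        return 0;
    }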

3. Flags. The four flags in the 8008 are CARRY, ZERO, SIGN, and PARITY. They are used to reflect the status of the latest arithmetic or logical operation. Any of the flags can be used to alter program flow through the use of the conditional jump, call, or return instructions. There is no direct mechanism for saving or restoring flags, which places a severe burden on interrupt processing (see Appendix 1 for details).
The CARRY flag indicates if a carry-out or borrow-in was generated, thereby providing the ability to perform multiple-precision binary arithmetic.
The ZERO flag indicates whether or not the result is zero. This provides the ability to compare the two values for equality.
The SIGN flag reflects the setting of the leftmost bit of the result. The presence of this flag creates the illusion that the 8008 is able to handle signed numbers. However, there is no facility for detecting signed overflow on additions and subtractions. Furthermore, comparing signed numbers by subtracting them and then testing the SIGN flag will not give the correct result if the subtraction resulted in signed overflow. This oversight was not corrected until the 8086.
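A short C illustration of the pitfall just described, using made-up 8-bit values:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int8_t a = -100, b = 100;

        uint8_t diff = (uint8_t)((uint8_t)a - (uint8_t)b);  /* 8-bit subtract, as the ALU sees it */
        int sign_flag = (diff & 0x80) != 0;                 /* leftmost bit of the result         */

        printf("true comparison  : a < b is %s\n", a < b ? "true" : "false");
        printf("SIGN-flag verdict: a < b is %s\n", sign_flag ? "true" : "false");
        /* -100 - 100 = -200 overflows eight bits and wraps to +56, so the sign
           bit is clear and the flag-based test wrongly concludes a >= b. */
        return 0;
    }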
The PARITY flag indicates if the result is even or odd parity. This permits testing for transmission errors, an obviously useful function for a CRT terminal.

C. Instruction Set
  
The 8008 instructions are designed for moving or modifying 8-bit operands. Operands are either contained in the instruction itself (immediate operand), contained in a scratchpad register (register operand), or contained in the M register (memory operand). Since the M register can be used interchangeably with the scratchpad registers, there are only two distinct operand-addressing modes: immediate and register. Typical instruction formats for these modes are shown in Fig. 1. A summary of the 8008 instructions appears in Fig. 2.
The instruction set consists of scratchpad-register instructions, accumulator-specific instructions, transfer-of-control instructions, input/output instructions, and processor-control instructions.
The scratchpad-register instructions modify the contents of the M register or any scratchpad register. This can consist of moving data between any two registers, moving immediate data into a register, or incrementing or decrementing the contents of a register. The incrementing and decrementing instructions were not in Computer Terminal's specified instruction set; they were added by Intel to provide for loop control, thereby making the processor more general-purpose.
Most of the accumulator-specific instructions perform operations between the accumulator and a specified operand. The operand can be any one of the scratchpad registers, including M, or it can be immediate data. The operations are add, add-with-carry, subtract, subtract-with-borrow, logical AND, logical OR, logical exclusive-OR, and compare. Furthermore, there are four unit-rotate instructions that operate on the accumulator. These instructions perform either an 8- or 9-bit rotate (the CARRY flag acts as a ninth bit) in either the left or right direction.
Transfer-of-control instructions consist of jumps, calls, and returns. Any of the transfers can be unconditional, or can be conditional based on the setting of any one of the four flags. Making calls and returns conditional was done to preserve the symmetry with jumps and for no other reason. A short one-byte form of call is also provided, which will be discussed later under interrupts.
Each of the jump and call instructions (with the exception of the one-byte call) specifies an absolute code address in the second and
third byte of the instruction. The second byte contains the eight low-order bits of the address, and the third byte contains the six high-order bits. This inverted storage, which was to haunt all processors evolved from the 8008, was a result of compatibility with the Datapoint bit-serial processor, which processes addresses from low bit to high bit. This inverted storage did have a virtue in those early days when 256 by 8 memory chips were popular: it allowed all memory chips to select a byte and latch it for output while waiting for the six high-order bits which selected the chip. This speeded up memory access.

There are eight input instructions and 24 output instructions, which altogether use up 32 opcodes. Each of these instructions transfers a byte of data between the accumulator and a designated I/O port.

The processor-control instructions are halt and no-op. Halt puts the processor into a waiting state. The processor will remain in that state until an interrupt occurs. No-op is actually one of the move instructions; specifically, it moves the contents of the accumulator into the accumulator, thereby having no net effect (move instructions do not alter flag settings).

D. Interrupts

Interrupt processing was not a requirement of the 8008. Hence only the most primitive mechanism conceivable-not incrementing the program counter-was provided. Such a mechanism permits an interrupting device to jam an instruction into the processor's instruction stream. This is accomplished by having the interrupting device, instead of memory, respond to the instruction fetch; since the program counter isn't incremented, the instruction in memory that doesn't get fetched won't be skipped. The instruction typically supplied by the interrupting device is a call, so that an interrupt service routine can be entered and then the main program can be resumed after interrupt processing is complete (a jump instruction would result in the loss of the main program return address). To simplify the interrupting device's task of generating an instruction, the 8008 instruction set provides eight one-byte subroutine calls, each to a fixed location in memory.

There are no instructions provided for disabling the interrupt mechanism, and so this function must be realized with external hardware. More important, there are no instructions for conveniently saving the registers and flags when an interrupt occurs.
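The jam-an-instruction mechanism lends itself to a small conceptual model. The sketch below is mine; the 8-byte spacing of the fixed call targets follows the usual RST convention and should be treated as an assumption, not a quotation from the text:

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint16_t pc;          /* current program counter                */
        uint16_t stack[8];    /* the on-chip address stack              */
        uint8_t  sp;
    } cpu8008;

    /* The interrupting device answers the fetch with a one-byte call; the PC is
       not incremented, so the bypassed instruction will still run on return. */
    static void jam_one_byte_call(cpu8008 *cpu, unsigned vector) {
        cpu->sp = (cpu->sp + 1) & 7;
        cpu->stack[cpu->sp] = cpu->pc;            /* save the interrupted address */
        cpu->pc = (uint16_t)(vector * 8);         /* assumed fixed entry point    */
    }

    int main(void) {
        cpu8008 cpu = { 0x0345, {0}, 0 };
        jam_one_byte_call(&cpu, 5);               /* device requests vector 5 */
        printf("service routine at 0x%04X, will resume at 0x%04X\n",
               cpu.pc, cpu.stack[cpu.sp]);
        return 0;
    }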
IV. Objectives and Constraints of the 8080

By 1973 the technology had advanced from p-MOS to n-MOS for memory fabrication. As an engineering exercise it was decided to use the 8008 layout masks with the n-MOS process to obtain a faster 8008. After a short study, it was determined that a new layout was required, so it was decided to enhance the processor at the same time, and to utilize the new 40-pin package made practical by high-volume calculator chips. The result was the 8080 processor.

The 8080 was the first processor designed specifically for the microprocessor market. It was constrained to include all the 8008 instructions but not necessarily with the same encodings. This meant that users' software would be portable but the actual ROM chips containing the programs would have to be replaced. The main objective of the 8080 was to obtain a 10:1 improvement in throughput, eliminate many of the 8008 shortcomings that had by then become apparent, and provide new processing capabilities not found in the 8008. These included a commitment to 16-bit data types mainly for address computations, BCD arithmetic, enhanced operand-addressing modes, and improved interrupt capabilities. Now that memory costs had come down and processing speed was approaching TTL, larger memory spaces were appearing more practical. Hence another goal was to be able to address directly more than 16K bytes. Symmetry was not a goal, because the benefits to be gained from making the extensions symmetric would not justify the resulting increase in chip size and opcode space.
V. The 8080 Instruction-Set Processor

The 8080 architecture is an unsymmetrical extension of the 8008. The byte-handling facilities have been augmented with a limited number of 16-bit facilities. The memory space grew to 64K bytes and the stack was made virtually unlimited.

Various alternatives for the 8080 were considered. The simplest involved merely adding a memory stack and stack instructions to the 8008. An intermediate position was to augment the above with 16-bit arithmetic facilities that can be used for explicit address manipulations as well as 16-bit data manipulations. The most difficult alternative was a symmetric extension which replaced the one-byte M-register instructions with three-byte generalized memory-access instructions. The last two bytes of these instructions contained two address-mode bits specifying indirect addressing and indexing (using HL as an index register) and a 14-bit displacement. Although this would have been a more versatile addressing mechanism, it would have resulted in significant code expansion on existing 8008 programs. Furthermore, the logic necessary to implement this solution would have precluded the ability to implement 16-bit arithmetic; such arithmetic would not be needed for address manipulations under this enhanced addressing facility but would still be desirable for data manipulations. For these reasons, the intermediate position was finally taken.
 
A. Memory and I/O Structure

The 8080 can address up to 64K bytes of memory, a fourfold increase over the 8008 (the 14-bit address stack of the 8008 was eliminated). The address bus of the 8080 is 16 bits wide, in contrast to eight bits for the 8008, so an entire address can be sent down the bus in one memory cycle. Although the data-handling facilities of the 8080 are primarily byte-oriented (the 8008 was exclusively byte-oriented), certain operations permit two consecutive bytes of memory to be treated as a single data item. The two bytes are called a word. The data bus of the 8080 is only eight bits wide, and hence word accesses require an extra memory cycle. The most significant eight bits of a word are located at the higher memory address. This results in the same kind of inverted storage already noted in transfer instructions of the 8008.

The 8080 extends the 32-port capacity of the 8008 to 256 input ports and 256 output ports. In this instance, the 8080 is actually more symmetrical than the 8008. Like the 8008, all of the ports are directly addressable by the instruction set.

B. Register Structure

The 8080 processor contains a file of seven 8-bit general registers, a 16-bit program counter (PC) and stack pointer (SP), and five 1-bit flags. A comparison between the 8008 and 8080 register sets is shown in Fig. 3.
 
 
 
 

1. General Registers. The 8080 registers are the same seven 8-bit registers that were in the 8008 scratchpad-namely A, B, C, D, E, H, and L. In order to incorporate 16-bit data facilities in the 8080, certain instructions operate on the register pairs BC, DE, and HL.
The seven registers can be used interchangeably for on-chip temporary storage. The three register pairs are used for address manipulations, but their roles are not interchangeable; there is an 8080 instruction that allows operations on DE and not BC, and there are address modes that access memory indirectly through BC or DE but not HL.
As in the 8008, the A register has a unique role in arithmetic and logical operations: it serves as one of the operands and is the receptacle for the result. The HL register again has its special role of pointing to the pseudo-register M.

2. Stack Pointer and Program Counter. The 8080 has a single program counter instead of the floating program counter of the 8008. The program counter is 16 bits (two bits more than the 8008's program counter), thereby permitting an address space of 64K.
The stack is contained in memory instead of on the chip, which removes the restriction of only seven levels of nested subroutines. The entries on the stack are 16 bits wide. The 16-bit stack pointer is used to locate the stack in memory. The execution of a call instruction causes the contents of the program counter to be pushed onto the stack, and the return instruction causes the last stack entry to be popped into the program counter. The stack pointer was chosen to run "downhill" (with the stack advancing toward lower memory) to simplify indexing into the stack from the user's program (positive indexing) and to simplify displaying the contents of the stack from a front panel.
Unlike the 8008, the stack pointer is directly accessible to the programmer. Furthermore, the stack itself is directly accessible, and instructions are provided that permit the programmer to push and pop his own 16-bit items onto the stack.
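A minimal C sketch of the downhill memory stack described above; the byte ordering and starting SP value here are illustrative assumptions:

    #include <stdint.h>
    #include <stdio.h>

    static uint8_t mem[1 << 16];                  /* 64K-byte memory space */

    static void push16(uint16_t *sp, uint16_t value) {
        *sp -= 2;                                 /* the stack advances toward lower memory */
        mem[*sp]     = (uint8_t)(value & 0xFF);   /* low byte at the lower address          */
        mem[*sp + 1] = (uint8_t)(value >> 8);     /* high byte at the higher address        */
    }

    static uint16_t pop16(uint16_t *sp) {
        uint16_t value = (uint16_t)(mem[*sp] | (mem[*sp + 1] << 8));
        *sp += 2;
        return value;
    }

    int main(void) {
        uint16_t sp = 0xFFFE;                     /* hypothetical initial stack pointer */
        push16(&sp, 0x1234);                      /* e.g. a return address              */
        printf("after push, SP = 0x%04X\n", sp);
        uint16_t v = pop16(&sp);
        printf("popped 0x%04X, SP back to 0x%04X\n", v, sp);
        return 0;
    }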

3. Flags. A fifth flag, AUXILIARY CARRY, augments the 8008 flag set to form the flag set of the 8080. The AUXILIARY CARRY flag indicates if a carry was generated out of the four low-order bits. This flag, in conjunction with a decimal-adjust instruction, provides the ability to perform packed BCD addition (see Appendix 2 for details). This facility can be traced back to the 4004 processor. The AUXILIARY CARRY flag has no purpose other than for BCD arithmetic, and hence the conditional transfer instructions were not expanded to include tests on the AUXILIARY CARRY flag.
It was proposed too late in the design that the PARITY flag should double as an OVERFLOW flag. Although this feature didn't make it into the 8080, it did show up two years later in Zilog's Z-80.

C. Instruction Set
The 8080 includes the entire 8008 instruction set as a subset. The added instructions provide some new operand-addressing modes and some facilities for manipulating 16-bit data. These extensions have introduced a good deal of asymmetry. Typical instruction formats are shown in Fig. 1. A summary of the 8080 instructions appears in Fig. 4.
The only means that the 8008 had for accessing operands in memory was via the M register. The 8080 has certain instructions that access memory by specifying the memory address (direct addressing) and also certain instructions that access memory by specifying a pair of general registers in which the memory address is contained (indirect addressing). In addition, the 8080 includes the register and immediate operand-addressing modes of the 8008. A 16-bit immediate mode is also included.
The added instructions can be classified as load/store instructions, register-pair instructions, HL-specific instructions, accumulator-adjust instructions, carry instructions, expanded I/O instructions, and interrupt instructions.
The load/store instructions load and store the accumulator register and the HL register pair using the direct and indirect addressing mode. Both modes can be used for the accumulator, but due to chip size constraints, only the direct mode was implemented for HL.
The register-pair instructions provide for the manipulation of 16-bit data items. Specifically, register pairs can be loaded with
16-bit immediate data, incremented, decremented, added to HL, pushed on the stack, or popped off the stack. Furthermore, the flag settings themselves can be pushed and popped, thereby simplifying saving the environment when interrupts occur (this was not possible in the 8008).
The HL-specific instructions include facilities for transferring HL to the program counter or to the stack pointer, and exchanging HL with DE or with the top entry on the stack. The last of these instructions was included to provide a mechanism for (1) removing a subroutine return address from the stack so that passed parameters can be discarded or (2) burying a result-to-be-returned under the return address. This became the longest instruction in the 8080 (5 memory cycles); its implementation precluded the inclusion of several other instructions that were already proposed for the processor.
Two accumulator-adjust instructions are provided. One complements each bit in the accumulator and the other modifies the accumulator so that it contains the correct decimal result after a packed BCD addition is performed.
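The decimal-adjust step mentioned above can be sketched roughly as follows, in the spirit of the 8080's adjust-after-addition instruction; this is a simplified C model that ignores how the real flags are produced:

    #include <stdint.h>
    #include <stdio.h>

    /* Add two packed BCD bytes with a binary add, then fix up any nibble that
       produced a digit above 9 or a carry out of the nibble. */
    static uint8_t bcd_add(uint8_t a, uint8_t b) {
        unsigned sum = a + b;
        unsigned aux = ((a & 0x0F) + (b & 0x0F)) > 0x0F;    /* auxiliary carry */

        if ((sum & 0x0F) > 9 || aux) sum += 0x06;           /* correct low digit  */
        if ((sum & 0xF0) > 0x90 || sum > 0xFF) sum += 0x60; /* correct high digit */
        return (uint8_t)sum;                                /* carry out not kept */
    }

    int main(void) {
        printf("38 + 45 = %02X (packed BCD)\n", bcd_add(0x38, 0x45));        /* 83 */
        printf("99 + 01 = %02X (carry out dropped)\n", bcd_add(0x99, 0x01)); /* 00 */
        return 0;
    }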
The carry instructions provide for setting or complementing the CARRY flag. No instruction is provided for clearing the CARRY flag. Because of the way the CARRY flag semantics are defined, the CARRY flag can be cleared simply by ORing or ANDing the accumulator with itself.
The expanded I/O instructions permit transferring the contents of any one of 256 8-bit ports either to or from the accumulator. The port number is explicitly contained in the instruction; hence, the instruction is two bytes long. The equivalent 8008 instruction is only one byte long. This is the only instance in which an 8080 instruction requires a different number of bytes than its 8008 counterpart. The motivation for doing this was more to free up 32 opcodes than to increase the number of I/O ports.

The 8080 has the identical interrupt mechanism the 8008 has, but in addition, it has instructions for enabling or disabling the interrupt mechanism. This feature, along with the ability to push and pop the processor flags, made the interrupt mechanism practical.
 
  VI. 8085 Objectives and Constraints
In 1976, technology advances allowed Intel to consider enhancing its 8080. The objective was to come out with a processor set utilizing a single power supply and requiring fewer chips (the 8080 required a separate oscillator chip and system controller chip to make it usable). The new processor, called the 8085, was constrained to be compatible with the 8080 at the machine-code level. This meant that the only extension to the instruction set could be in the twelve unused opcodes of the 8080.
The 8085 turned out to be architecturally not much more than a repackaging of the 8080. The major differences were in such areas as an on-chip oscillator, power-on reset, vectored interrupts, decoded control lines, a serial I/O port, and a single power supply. Two new instructions were added to handle the serial port and interrupt mask. These instructions (RIM and SIM) appear in Fig. 4. Several other instructions that had been contemplated were not made available because of the software ramifications and the compatibility constraints they would place on the forthcoming 8086.

VII. Objectives and Constraints of 8086

The new Intel 8086 microprocessor was designed to provide an order of magnitude increase in processing throughput over the older 8080. The processor was to be assembly-language-level-compatible with the 8080 so that existing 8080 software could be reassembled and correctly executed on the 8086. To allow for this, the 8080 register set and instruction set appear as logical subsets of the 8086 registers and instructions. By utilizing a general-register structure architecture, Intel could capitalize on its experience with the 8080 to obtain a processor with a higher degree of sophistication. Strict 8080 compatibility, however, was not attempted, especially in areas where it would compromise the final design.
The goals of the 8086 architectural design were to provide symmetric extensions of existing 8080 features, and to add processing capabilities not found in the 8080. These features included 16-bit arithmetic, signed 8- and 16-bit arithmetic (including multiply and divide), efficient interruptible byte-string operations, improved bit-manipulation facilities, and mechanisms to provide for re-entrant code, position-independent code, and dynamically relocatable programs.
By now memory had become very inexpensive and microprocessors were being used in applications that required large amounts of code and/or data. Thus another design goal was to be able to address directly more than 64K bytes and support multiprocessor configurations.

VIII. The 8086 Instruction-Set Processor

The 8086 processor architecture is described in terms of its memory structure, register structure, instruction set, and external interface. The 8086 memory structure includes up to one megabyte of memory space and up to 64K input/output ports. The register structure includes three files of registers. Four 16-bit general registers can participate interchangeably in arithmetic and logic operations, two 16-bit pointer and two 16-bit index registers are used for address calculations, and four 16-bit segment registers allow extended addressing capabilities. Nine flags record the processor state and control its operation.
The instruction set supports a wide range of addressing modes and provides operations for data transfer, signed and unsigned 8- and 16-bit arithmetic, logicals, string manipulations, control transfer, and processor control. The external interface includes a reset sequence, interrupts, and a multiprocessor-synchronization and resource-sharing facility.

A. Memory and I/O Structure
The 8086 memory structure consists of two components-the memory space and the input/output space. All instruction code and operands reside in the memory space. Peripheral and I/O devices ordinarily reside in the I/O space, except in the case of memory-mapped devices.

1. Memory Space. The 8086 memory is a sequence of up to 1 million 8-bit bytes, a considerable increase over the 64K bytes in the 8080. Any two consecutive bytes may be paired together to form a 16-bit word. Such words may be located at odd or even byte addresses. The data bus of the 8086 is 16 bits wide, so, unlike the 8080, a word can be accessed in one memory cycle (however, words located at odd byte addresses still require two memory cycles). As in the 8080, the most significant 8 bits of a word are located in the byte with the higher memory address.
Since the 8086 processor performs 16-bit arithmetic, the address objects it manipulates are 16 bits in length. Since a 16-bit quantity can address only 64K bytes, additional mechanisms are required to build addresses in a megabyte memory space. The 8086 memory may be conceived of as an arbitrary number of segments, each at most 64K bytes in size. Each segment begins at an address which is evenly divisible by 16 (i.e., the low-order 4 bits of a segment's address are zero). At any given moment the contents of four of these segments are immediately addressable. These four segments, called the current code segment, the current data segment, the current stack segment, and the current extra segment, need not be unique and may overlap. The high-order 16 bits of the address of each current segment are held in a dedicated 16-bit segment register. In the degenerate case where all four segments start at the same address, namely address 0, we have an 8080 memory structure.
Bytes or words within a segment are addressed by using 16-bit offset addresses within the 64K byte segment. A 20-bit physical address is constructed by adding the 16-bit offset address to the contents of a 16-bit segment register with 4 low-order zero bits appended, as illustrated in Fig. 5.
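The construction just described amounts to a four-bit shift and an add. A minimal C sketch with arbitrary segment and offset values:

    #include <stdint.h>
    #include <stdio.h>

    /* Physical address = (segment * 16) + offset, truncated to 20 bits. */
    static uint32_t physical_address(uint16_t segment, uint16_t offset) {
        return (((uint32_t)segment << 4) + offset) & 0xFFFFF;
    }

    int main(void) {
        printf("1234:0022 -> %05X\n", physical_address(0x1234, 0x0022));
        /* Different segment:offset pairs can name the same physical byte: */
        printf("1236:0002 -> %05X\n", physical_address(0x1236, 0x0002));
        return 0;
    }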
Various alternatives for extending the 8080 address space were considered. One such alternative consisted of appending 8 rather than 4 low-order zero bits to the contents of a segment register, thereby providing a 24-bit physical address capable of addressing up to 16 megabytes of memory. This was rejected for the following reasons:
  • Segments would be forced to start on 256-byte boundaries, resulting in excessive memory fragmentation.
  • The 4 additional pins that would be required on the chip were not available.
  • It was felt that a 1-megabyte address space was sufficient.

2. Input/Output Space. In contrast to the 256 I/O ports in the 8080, the 8086 provides 64K addressable input or output ports. Unlike the memory, the I/O space is addressed as if it were a single segment, without the use of segment registers. Input/output physical addresses are in fact 20 bits in length, but the high-order 4 bits are always zero. The first 256 ports are directly addressable (address in the instruction), whereas all 64K ports are indirectly addressable (address in register). Such indirect addressing was provided to permit consecutive ports to be accessed in a program loop. Ports may be 8 or 16 bits in size, and 16-bit ports may be located at odd or even addresses.

B. Register Structure

The 8086 processor contains three files of four 16-bit registers and a file of nine 1-bit flags. The three files of registers are the general-register file, the pointer- and index-register file, and the segment-register file. There is a 16-bit instruction pointer (called the program counter in the earlier processors) which is not directly accessible to the programmer; rather, it is manipulated with control transfer instructions. The 8086 register set is a superset of the 8080 registers, as shown in Figs. 6 and 7. Corresponding registers in the 8080 and 8086 do not necessarily have the same names, thereby permitting the 8086 to use a more meaningful set of names.
1. General-Register File. The AX-BX-CX-DX register set is called the general-register file, or HL group (for reasons that will be apparent below). The general registers can participate interchangeably in the arithmetic and logical operations of the 8086. Some of the other 8086 operations (such as the string operations) dedicate certain of the general registers to specific uses. These uses are indicated by the mnemonic phrases "accumulator," "base," "count," and "data" in Fig. 7. The general registers have a property that distinguishes them from the other registers-their upper and lower halves are separately addressable. Thus, the general registers can be thought of as two files of four 8-bit registers-the H file and the L file.

2. Pointer- and Index-Register File. The SP-BP-SI-DI register set is called the pointer- and index-register file, or the P and I groups. The registers in this file generally contain offset addresses used for addressing within a segment. Like the general registers, the pointer and index registers can participate interchangeably in the 16-bit arithmetic and logical operations of the 8086, thereby providing a means to perform address computations. There is one main difference between the registers in this file, which results in dividing the file into two subfiles, the P or pointer group (SP, BP) and the I or index group (SI, DI). The difference is that the pointers are by default assumed to contain offset addresses within the current stack segment, and the indexes are by default generally assumed to contain offset addresses within the current data segment. The mnemonic phrases "stack pointer," "base pointer," "source index," and "destination index" are associated with these registers' names, as shown in Fig. 7.

3. Segment-Register File. The CS-DS-SS-ES register set is called the segment-register file, or S group. The segment registers play an important role in the memory addressing mechanism of the processor. These registers are similar in that they are used in all memory address computations (see Sec. VIII.A of this chapter). The segment registers' names have the associated mnemonic phrases "code," "data," "stack," and "extra," as shown in Fig. 7.

The contents of the CS register define the current code segment. All instruction fetches are taken to be relative to CS, using the instruction pointer (IP) as an offset. The contents of the DS register define the current data segment. Generally, all data references except those involving BP or SP are taken by default to be relative to DS. The contents of the SS register define the current stack segment. All data references which explicitly or implicitly involve SP or BP are taken by default to be relative to SS. This includes all push and pop operations, interrupts, and return operations. The contents of the ES register define the current extra segment. The extra segment has no specific use, although it is usually treated as an additional data segment which can be specified in an instruction by using a special default-segment-override prefix.

In general, the default segment register for the two types of data references (DS and SS) can be overridden. By preceding the instruction with a special one-byte prefix, the reference can be forced to be relative to one of the other three segment registers. This prefix, as well as other prefixes described later, has a unique encoding that permits it to be distinguished from the opcodes.
Programs which do not load or manipulate the segment registers are said to be dynamically relocatable. Such a program may be interrupted, moved in memory to a new location, and restarted with new segment-register values. At first a set of eight segment registers was proposed along with a field in a program-status word specifying which segment register was currently CS, which was currently DS, and which was currently SS. The other five all served as extra segment registers.
Such a scheme would have resulted in virtually no thrashing of segment register contents; start addresses of all needed segments would be loaded initially into one of the eight segment registers, and the roles of the various segment registers would vary dynamically during program execution. Concern over the size of the resulting processor chip forced the number of segment registers to be reduced to the minimum number necessary, namely four. With this minimum number, each segment register could be dedicated to a particular type of segment (code, data, stack, extra), and the specifying field in the program-status word was no longer needed.

4. Flag-Register File. The AF-CF-DF-IF-OF-PF-SF-TF-ZF register set is called the flag-register file or F group. The flags in this group are all one bit in size and are used to record processor status information and to control processor operation. The flag registers' names have the following associated mnemonic phrases:
AF   Auxiliary carry
CF   Carry
DF   Direction
IF   Interrupt enable
OF   Overflow
PF   Parity
SF   Sign
TF   Trap
ZF   Zero
The AF, CF, PF, SF, and ZF flags retain their familiar 8080 semantics, generally reflecting the status of the latest arithmetic or logical operation. The OF flag joins this group, reflecting the signed arithmetic overflow condition. The DF, IF, and TF flags are used to control certain aspects of the processor. The DF flag controls the direction of the string manipulations (auto-incrementing or auto-decrementing). The IF flag enables or disables external interrupts. The TF flag puts the processor into a single-step mode for program debugging. More detail is given on each of these three flags later in the chapter.

C. Instruction Set
  
The 8086 instruction set-while including most of the 8080 set as a subset-has more ways to address operands and more power in every area. It is designed to implement block-structured languages efficiently. Nearly all instructions operate on either 8- or 16-bit operands. There are four classes of data transfer. All four arithmetic operations are available. An additional logic instruction, test, is included. Also new are byte- and word-string manipulations and intersegment transfers. A summary of the 8086 instructions appears in Fig. 8.

1. Operand Addressing. The 8086 instruction set provides many more ways to address operands than were provided by the 8080. Two-operand operations generally allow either a register or memory to serve as one operand (called the first operand), and either a register or a constant within the instruction to serve as the other (called the second operand). Typical formats for two-operand operations are shown in Fig. 9 (second operand is a register) and Fig. 10 (second operand is a constant). The result of a two-operand operation may be directed to either of the source operands, with the exception, of course, of in-line immediate constants. Single-operand operations generally allow either a register or memory to serve as the operand. A typical one-operand format is shown in Fig. 11. Virtually all 8086 operators may specify 8- or 16-bit operands.

Memory operands. An instruction may address an operand residing in memory in one of four ways, as determined by the mod and r/m fields in the instruction (see Table 2):

Direct 16-bit offset address
Indirect through a base register (BP or BX), optionally with an 8- or 16-bit displacement
Indirect through an index register (SI or DI), optionally with an 8- or 16-bit displacement
Indirect through the sum of a base register and an index register, optionally with an 8- or 16-bit displacement

The general register, BX, and the pointer register, BP, may serve as base registers. When the base register BX is used without an index register, the operand by default resides in the current data segment. When the base register BP is used without an index register, the operand by default resides in the current stack segment. When both base and index registers are used, the operand by default resides in the segment determined by the base register. When an index register alone is used, the operand by default resides in the current data segment.
Auto-incrementing and auto-decrementing address modes were not included in general, since it was felt that their use is mainly oriented towards string processing. These modes were included on the string primitive instructions.

Register operands. The four 16-bit general registers and the four 16-bit pointer and index registers may serve interchangeably as operands in 16-bit operations. Three exceptions to note are multiply, divide, and the string operations, all of which use the AX register implicitly. The eight 8-bit registers of the HL group may serve interchangeably in 8-bit operations. Again, multiply, divide, and the string operations use AL implicitly.
 
Table 3 shows the register selection as determined by the r/m field (first operand) or reg field (second operand) in the instruction.

Immediate operands. All two-operand operations except multiply, divide, and the string operations allow one source operand to appear within the instruction as immediate data represented in 2's complement form. Sixteen-bit immediate operands having a high-order byte which is the sign extension of the low-order byte may be abbreviated to 8 bits.

 
 
Addressing mode usage. The addressing modes permit registers BX and BP to serve as base registers and registers SI and DI as index registers. Possible use of this for language implementation is discussed below.
Simple variables and arrays: A simple variable is accessed with the direct address mode. An array element is accessed with the indirect address mode utilizing the sum of the register SI (where SI contains the index into the array) and a displacement (where the displacement is the offset of the array in its segment).

Based variables: A based variable is located at a memory address pointed at by some other variable. If the contents of the pointer variable were placed in BX, the indirect addressing mode utilizing BX would access the based variable. If the based variable were an array and the index into the array were placed in SI, the indirect addressing mode utilizing the sum of the register BX and register SI would access elements of the array.

Stack marker: Marking a stack permits efficient implementation of block-structured languages and provides an efficient address mechanism for reentrant procedures. Register BP can be used as a stack marker, pointing to the beginning of an activation record in the stack. The indirect address mode utilizing the sum of the base register BP and a displacement (where the displacement is the offset of a local variable in the activation record) will access the variable declared in the currently active block. The indirect address mode utilizing the sum of the base register BP, index register SI (where SI contains the index into an array), and a displacement (where the displacement is the offset of the array in the activation record) will access an element of the array. Register DI can be used in the same manner as SI so that two array elements can be accessed concurrently.

Table 2  Determining 8086 Offset Address of a Memory Operand
(Use This Table When mod ≠ 11; Otherwise Use Table 3.)

This table applies to the first operand only; the second operand can never be a memory operand. mod specifies how disp-lo and disp-hi are used to define a displacement as follows:

mod = 00: DISP = 0 (disp-lo and disp-hi are absent)
mod = 01: DISP = disp-lo, sign-extended (disp-hi is absent)
mod = 10: DISP = disp-hi, disp-lo

r/m specifies which base and index register contents are to be added to the displacement to form the operand offset address.
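A minimal C sketch of the mod and r/m decoding just described; the base/index mapping shown is the standard 8086 encoding and is supplied here for illustration, not quoted from Table 2:

    #include <stdint.h>
    #include <stdio.h>

    /* Effective-address expressions selected by r/m when mod is not 11. */
    static const char *rm_expr[8] = {
        "BX+SI", "BX+DI", "BP+SI", "BP+DI", "SI", "DI", "BP", "BX"
    };

    static void describe_operand(uint8_t modrm, uint16_t disp) {
        unsigned mod = (modrm >> 6) & 3;
        unsigned reg = (modrm >> 3) & 7;
        unsigned rm  =  modrm       & 7;

        if (mod == 3)                                   /* register operand: see Table 3 */
            printf("register operand, reg=%u r/m=%u\n", reg, rm);
        else if (mod == 0 && rm == 6)                   /* special case: direct address  */
            printf("direct 16-bit address 0x%04X\n", disp);
        else
            printf("memory operand at [%s%s]\n", rm_expr[rm],
                   mod == 0 ? "" : mod == 1 ? " + disp8" : " + disp16");
    }

    int main(void) {
        describe_operand(0x47, 0);       /* mod=01 reg=000 r/m=111 -> [BX + disp8]  */
        describe_operand(0x06, 0x1234);  /* mod=00 r/m=110         -> direct 0x1234 */
        return 0;
    }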
Example: An example of a procedure-calling sequence on the 8086 illustrates the interaction of the addressing modes and activation records.
Table 3  Determining 8086 Register Operand
(Use This Table When mod = 11; Otherwise Use Table 2.)

         First operand              Second operand
r/m      8-bit     16-bit    reg    8-bit     16-bit
000      AL        AX        000    AL        AX
001      CL        CX        001    CL        CX
010      DL        DX        010    DL        DX
011      BL        BX        011    BL        BX
100      AH        SP        100    AH        SP
101      CH        BP        101    CH        BP
110      DH        SI        110    DH        SI
111      BH        DI        111    BH        DI

 
 
; CALL MYPROC (ALPHA, BETA)

        PUSH  ALPHA         ; pass parameters by...
        PUSH  BETA          ; ...pushing them on the stack
        CALL  MYPROC        ; call the procedure

; PROCEDURE MYPROC (A, B)

MYPROC:                     ; entry point
        PUSH  BP            ; save previous BP value
        MOV   BP,SP         ; make BP point at new record
        SUB   SP,LOCALS     ; allocate local storage on stack
                            ; ...for reentrant procedures (stack
                            ; advances towards lower memory)

        ...                 ; body of procedure

        MOV   SP,BP         ; deallocate local storage
        POP   BP            ; restore previous BP
        RET   4             ; return and discard 4 bytes of parameters
Upon entry to the procedure MYPROC its parameters are addressable with positive offsets from BP (the stack grows towards lower memory addresses). Since usually less than 128 bytes of parameters are passed, only an 8-bit signed displacement from BP is needed. Similarly, local variables to MYPROC are addressable with negative offsets from BP. Again, economy of instruction size is realized by using 8-bit signed displacements. A special return instruction discards the parameters pushed on the stack.
  2. Data Transfers. Four classes of data transfer operations may be distinguished: general-purpose, accumulator-specific, address-object transfers, and flag transfers. The general-purpose data transfer operations are move, push, pop, and exchange. Generally, these operations are available for all types of operands. The accumulator-specific transfers include input and output and the translate operations. The first 256 ports can be addressed directly, just as they were addressed in the 8080. However, the 8086 also permits ports to be addressed indirectly through a register (DX). This latter facility allows 64K ports to be addressed. Furthermore, the 8086 ports may be 8 or 16 bits wide, whereas the 8080 only permitted 8-bit-wide ports. The translate operation
performs a table-lookup byte translation. We will see the usefulness of this operation below, when it is combined with string operations.
The address-object transfers, load effective address and load pointer, are an 8086 facility not present in the 8080. A pointer is a pair of 16-bit values specifying a segment start address and an offset address; it is used to gain access to the full megabyte of memory. The load pointer operations provide a means of loading a segment start address into a segment register and an offset address into a general or pointer register in a single operation. The load effective address operation provides access to the offset address of an operand, as opposed to the value of the operand itself.
The flag transfers provide access to the collection of flags for such operations as push, pop, load, and store. A similar facility for pushing and popping flags was provided in the 8080; the load and store flags facility is new in the 8086.
It should be noted that the load and store operations involve only those flags that existed in the 8080. This is part of the concessions made for 8080 compatibility (without these operations it would take nine 8086 bytes to perform exactly an 8080 PUSH PSW or POP PSW).

3. Arithmetics. Whereas the 8080 provided for only 8-bit addition and subtraction of unsigned numbers, the 8086 provides all four basic mathematical functions on 8- and 16-bit signed and unsigned numbers. Standard 2's complement representation of signed values is used. Sufficient conditional transfers are provided to allow both signed and unsigned comparisons. The OF flag allows detection of the signed overflow condition.
Consideration was given to providing separate operations for signed addition and subtraction which would automatically trap on signed overflow (signed overflow is an exception condition, whereas unsigned overflow is not). However, lack of room in the opcode space prohibited this. As a compromise, a one-byte trap-on-overflow instruction was included to make testing for signed overflow less painful.
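For illustration, the signed-overflow condition that the trap-on-overflow instruction is meant to catch after a 16-bit addition can be written out in C as follows (a sketch; the function and the sample operands are invented for the example):

#include <stdint.h>
#include <stdio.h>

/* Signed (OF-style) overflow of a 16-bit addition: both operands have
   the same sign and the truncated sum has the opposite sign. */
static int add16_overflows(int16_t a, int16_t b)
{
    int16_t sum = (int16_t)((uint16_t)a + (uint16_t)b);
    return ((a ^ b) >= 0) && ((a ^ sum) < 0);
}

int main(void)
{
    printf("%d\n", add16_overflows(30000, 10000));   /* 1: overflows */
    printf("%d\n", add16_overflows(-5, 7));          /* 0: fits      */
    return 0;
}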
The 8080 provided a correction operation to allow addition to be performed directly on packed binary-coded representations of decimal digits. In the 8086, correction operations are provided to allow arithmetic to be performed directly on unpacked representations of decimal digits (e.g., ASCII) or on packed decimal representations.

Multiply and divide. Both signed and unsigned multiply and divide operations are provided. Multiply produces a double-length product (16 bits for 8-bit multiply, 32 bits for 16-bit multiply), while divide returns a single-length quotient and a single-length remainder from a double-length dividend and single-length divisor. Sign extension operations allow one to construct the double-length dividend needed for signed division. A quotient overflow (e.g., that caused by dividing by zero) will automatically interrupt the processor.
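The double-length conventions can be illustrated with ordinary C arithmetic; the snippet below shows the data sizes involved (it is not a rendering of the MUL and DIV instructions themselves, and the operand values are arbitrary):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t a = 50000, b = 3;
    uint32_t product = (uint32_t)a * b;        /* double-length product, as in DX:AX */

    uint32_t dividend = product;               /* double-length dividend             */
    uint16_t divisor  = 7;
    uint16_t quotient  = (uint16_t)(dividend / divisor);   /* single-length quotient  */
    uint16_t remainder = (uint16_t)(dividend % divisor);   /* single-length remainder */

    printf("%lu %u %u\n", (unsigned long)product,
           (unsigned)quotient, (unsigned)remainder);        /* 150000 21428 4 */
    return 0;
}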

Decimal instructions. Packed BCD operations are provided in the form of accumulator-adjustment instructions. Two such instructions are provided-one for an adjustment following an addition and one following a subtraction. The addition adjustment is identical to the 8080 DAA instruction; the subtraction adjustment is defined similarly. Packed multiply and divide adjustments are not provided, because the cross terms generated make it impossible to recover the decimal result without additional processor facilities (see Appendix 2 for details).
Unpacked BCD operations are also provided in the form of accumulator adjust instructions (ASCII is a special case of unpacked BCD). Four such instructions are provided, one each for adjustments involving addition, subtraction, multiplication, and division. The addition and subtraction adjustments are similar to the corresponding packed BCD adjustments except that the AH register is updated if an adjustment on AL is required. Unlike packed BCD, unpacked BCD byte multiplication does not generate cross terms, so multiplication adjustment consists of converting the binary value in the AL register into BCD digits in AH and AL; the divide adjustment does the reverse. Note that adjustments for addition, subtraction, and multiplication are performed following the arithmetic operation; division adjustment is performed prior to a division operation. See Appendix 2 for more details on unpacked BCD adjustments.

4. Logicals. The standard logical operations AND, OR, XOR, and NOT are carry-overs from the 8080. Additionally, the 8086 provides a logical TEST for specific bits. This consists of a logical AND instruction which sets the flags but does not store the result, thereby not destroying either operand.
The four unit-rotate instructions in the 8080 are augmented with four unit-shift instructions in the 8086. Furthermore, the 8086 provides multi-bit shifts and rotates including an arithmetic right shift.

5. String Manipulation. The 8086 provides a group of 1-byte instructions which perform various primitive operations for the manipulation of byte or word strings (sequences of bytes or words). These primitive operations can be performed repeatedly in hardware by preceding the instruction with a special prefix. The single-operation forms may be combined to form complex string operations in tight software loops with repetition provided by special iteration operations. The 8080 did not provide any string-manipulation facilities.

Hardware operation control. All primitive string operations use the SI register to address the source operands, which are assumed .
to be in the current data segment. The DI register is used to address the destination operands, which reside in the current extra segment. The operand pointers are incremented or decremented (depending on the setting of the DF flag) after each operation, once for byte operations and twice for word operations.
Any of the primitive string operation instructions may be preceded with a 1-byte prefix indicating that the operation is to be repeated until the operation count in CX is satisfied. The test for completion is made prior to each repetition of the operation. Thus, an initial operation count of zero will cause zero executions of the primitive operation.
The repeat prefix byte also designates a value to compare with the ZF flag. If the primitive operation is one which affects the ZF flag and the ZF flag is unequal to the designated value after any execution of the primitive operation, the repetition is terminated. This permits the scan operation to serve as a scan-while or a scan-until.
During the execution of a repeated primitive operation the operand pointer registers (SI and DI) and the operation count register (CX) are updated after each repetition, whereas the instruction pointer will retain the offset address of the repeat prefix byte (assuming it immediately precedes the string operation instruction). Thus, an interrupted repeated operation will be correctly resumed when control returns from the interrupting task.
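As a rough model of these rules (an illustration, not the processor's implementation), the following C function mimics a REPNE-prefixed SCAS over a byte string: the count is tested before every iteration, decremented on each one, and the repetition also ends as soon as the comparison produces the designated ZF value.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

static size_t repne_scasb(const uint8_t *di, uint16_t *cx, uint8_t al)
{
    size_t index = 0;
    while (*cx != 0) {                 /* completion test made prior to each repetition */
        int zf = (al == di[index]);    /* SCAS compares AL with the destination element */
        (*cx)--;                       /* operation count is updated                    */
        index++;                       /* as DI would advance                           */
        if (zf)                        /* REPNE: terminate when ZF becomes set          */
            break;
    }
    return index;                      /* number of elements examined                   */
}

int main(void)
{
    const uint8_t buf[] = { 'a', 'b', 'c', 'd' };
    uint16_t cx = 4;
    printf("%zu\n", repne_scasb(buf, &cx, 'c'));   /* prints 3 */
    return 0;
}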

Primitive string operations. Five primitive string operations are provided:

  • MOVS moves a string element (byte or word) from the source operand to the destination operand. As a repeated operation, this provides for moving a string from one location in memory to another.


  • CMPS subtracts the string element at the destination operand from the string element at the source operand and affects the flags but does not return the result. As a repeated operation this provides for comparing two strings. With the appropriate repeat prefix it is possible to compare two strings and determine after which string element the two strings become unequal, thereby establishing an ordering between the strings.


  • SCAS subtracts the string element at the destination operand from AL (or AX for word strings) and affects the flags but does not return the result. As a repeated operation this provides for scanning for the occurrence of, or departure from, a given value in the string.


  • LODS loads a string element from the source operand into AL (or AX for word strings). This operation ordinarily would not be repeated.


  • STOS stores a string element from AL (or AX for word strings) into the destination operand. As a repeated operation this provides for filling a string with a given value.

Software operation control. The repeat prefix provides for rapid iteration in a hardware-repeated string operation. Iteration-control operations provide this same control for implementing software loops to perform complex string operations. These iteration operations provide the same operation count update, operation completion test, and ZF flag tests that the repeat prefix provides.
The iteration-control transfer operations perform leading- and trailing-decision loop control. The destinations of iteration-control transfers must be within a 256-byte range centered about the instruction.
Four iteration-control transfer operations are provided:

  • LOOP decrements the CX ("count") register by 1 and transfers if CX is not 0.


  • LOOPE decrements the CX register by 1 and transfers if CX is not 0 and the ZF flag is set (loop while equal).


  • LOOPNE decrements the CX register by 1 and transfers if CX is not 0 and the ZF flag is cleared (loop while not equal).


  • JCXZ transfers if the CX register is 0. This is used for skipping over a loop when the initial count is 0.
By combining the primitive string operations and iteration- control operations with other operations, it is possible to build sophisticated yet efficient string manipulation routines. One instruction that is particularly useful in this context is the translate operation; it permits a byte fetched from one string to be translated before being stored in a second string, or before being operated upon in some other fashion. The translation is performed by using the value in the AL register to index into a table pointed at by the BX register. The translated value obtained from the table then replaces the value initially in the AL register.
As an example of use of the primitive string operations and iteration-control operations to implement a complex string operation, consider the following application: An input driver must translate a buffer of EBCDIC characters into ASCII and transfer characters until one of several different EBCDIC control characters is encountered. The transferred ASCII string is to be terminated with an EOT character. To accomplish this, SI is initialized to point to the beginning of the EBCDIC buffer, DI is initialized to point to the beginning of the buffer to receive the ASCII characters, BX is made to point to an EBCDIC-to-ASCII translation table, and CX is initialized to contain the length of the EBCDIC buffer (possibly empty). The translation table contains the ASCII equivalent for each EBCDIC character, perhaps with ASCII nulls for illegal characters. The EOT code is placed into
those entries in the table corresponding to the desired EBCDIC stop characters. The 8086 instruction sequence to implement this example is the following:

        JCXZ   Empty        ;skip the loop if the buffer is empty
Next:   LODS   Ebcbuf       ;fetch next EBCDIC character
        XLAT   Table        ;translate it to ASCII
        CMP    AL,EOT       ;test for the EOT
        STOS   Ascbuf       ;transfer ASCII character
        LOOPNE Next         ;continue if not EOT
        .
        .
        .
Empty:
The body of this loop requires just seven bytes of code.
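For readers who prefer a high-level rendering, the following C sketch models the same translate-and-copy loop. It is illustrative only; the ASCII EOT value and the choice of EBCDIC stop character are assumptions made for the example.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define EOT 0x04   /* assumed ASCII EOT code */

/* Translate count bytes from ebcbuf through a 256-entry table into
   ascbuf, stopping after the EOT that the table substitutes for any
   designated stop character; returns the number of bytes transferred. */
static size_t copy_translate(const uint8_t *ebcbuf, uint8_t *ascbuf,
                             size_t count, const uint8_t table[256])
{
    size_t n = 0;
    while (count > 0) {                   /* JCXZ / LOOPNE count check */
        uint8_t al = table[ebcbuf[n]];    /* LODS followed by XLAT     */
        ascbuf[n++] = al;                 /* STOS                      */
        count--;
        if (al == EOT)                    /* CMP AL,EOT / LOOPNE       */
            break;
    }
    return n;
}

int main(void)
{
    uint8_t table[256];
    for (int i = 0; i < 256; i++)
        table[i] = (uint8_t)i;            /* identity translation for the demo  */
    table[0x37] = EOT;                    /* assumed stop character maps to EOT */

    const uint8_t in[] = { 'A', 'B', 0x37, 'C' };
    uint8_t out[4];
    printf("%zu\n", copy_translate(in, out, 4, table));   /* prints 3 */
    return 0;
}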

6. Transfer of Control. Transfer-of-control instructions (jumps, calls, returns) in the 8086 are of two basic varieties: intrasegment transfers, which transfer control within the current code segment by specifying a new value for IP, and intersegment transfers, which transfer control to an arbitrary code segment by specifying a new value for both CS and IP. Furthermore, both direct and indirect transfers are supported. Direct transfers specify the destination of the transfer (the new value of IP and possibly CS) in the instruction; indirect transfers make use of the standard addressing modes, as described previously, to locate an operand which specifies the destination of the transfer. By contrast, the 8080 provides only direct intrasegment transfers.
Facilities for position-independent code and coding efficiency not found in the 8080 have been introduced in the 8086. Intrasegment direct calls and jumps specify a self-relative direct displacement, thus allowing position-independent code. A shortened jump instruction is available for transfers within a 256-byte range centered about the instruction, thus allowing for code compaction.
Returns may optionally adjust the SP register so as to discard stacked parameters, thereby making parameter passing more efficient. This is a more complete solution to the problem than the 8080 instruction which exchanged the contents of the HL with the top of the stack.
The 8080 provided conditional jumps useful for determining relations between unsigned numbers. The 8086 augments these with conditional jumps for determining relations between signed numbers. Table 4 shows the conditional jumps as a function of flag settings. The seldom-used conditional calls and returns provided by the 8080 have not been incorporated into the 8086.
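Two of the signed conditions in Table 4 below can be checked with a short C model of the flags produced by a 16-bit compare. The model is illustrative; only the SF, OF, and ZF settings needed for these rows are computed.

#include <stdint.h>
#include <stdio.h>

static void flags_after_cmp(int16_t a, int16_t b, int *sf, int *of, int *zf)
{
    int16_t diff = (int16_t)((uint16_t)a - (uint16_t)b);
    *sf = diff < 0;
    *zf = diff == 0;
    *of = ((a ^ b) < 0) && ((a ^ diff) < 0);   /* signed overflow of the subtraction */
}

int main(void)
{
    int sf, of, zf;
    flags_after_cmp(-30000, 30000, &sf, &of, &zf);        /* the subtraction overflows */
    printf("LESS THAN: %d\n", (sf ^ of) == 1);            /* prints 1                  */
    printf("GREATER THAN: %d\n", ((sf ^ of) | zf) == 0);  /* prints 0                  */
    return 0;
}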

7. External Interface. The 8086 processor provides both common and uncommon interfaces to external equipment. The two
Table 4 8086 Conditional Jumps as a Function of Flag Settings

Jump on Flag settings
EQUAL . . . . . . . . . . . . . . . . . . . . . . . . ZF = 1
NOT EQUAL . . . . . . . . . . . . . . . . . . . ZF = 0
LESS THAN . . . . . . . . . . . . . . . . . . . . (SF xor OF) = 1
GREATER THAN . . . . . . . . . . . . . . . . ((SF xor OF) or ZF) = 0
LESS THAN OR EQUAL . . . . . . . . . . ((SF xor OF) or ZF) = 1
GREATER THAN OR EQUAL . . . . . . (SF xor OF) = 0
BELOW . . . . . . . . . . . . . . . . . . . . . . . . CF=1
ABOVE . . . . . . . . . . . . . . . . . . . . . . . . (CF or ZF) = 0
BELOW OR EQUAL . . . . . . . . . . . . . .(CF or ZF) = 1
ABOVE OR EQUAL . . . . . . . . . . . . . . CF = 0
PARITY EVEN . . . . . . . . . . . . . . . . . PF = 1
PARITY ODD . . . . . . . . . . . . . . . . . . . PF = 0
OVERFLOW . . . . . . . . . . . . . . . . . . . OF = 1
NO OVERFLOW . . . . . . . . . . . . . . . . OF = 0
SIGN . . . . . . . . . . . . . . . . . . . . . . . . . .SF=1
NO SIGN . . . . . . . . . . . . . . . . . . . . . .SF=0
varieties of interrupts, maskable and non-maskable, are not uncommon, nor is single-step diagnostic capability. More unusual is the ability to escape to an external processor to perform specialized operations. Also uncommon is the hardware mechanism to control access to shared resources in a multiple-processor configuration.

Interrupts. The 8080 interrupt mechanism was general enough to permit the interrupting device to supply any operation to be executed out of sequence when an interrupt occurs. However, the only operation that had any utility for interrupt processing was the 1-byte subroutine call. This byte consists of 5 bits of opcode and 3 bits identifying one of eight interrupt subroutines residing at eight fixed locations in memory. If the unnecessary generalization was removed, the interrupting device would not have to provide the opcode and all 8 bits could be used to identify the interrupt subroutine. Furthermore, if the 8 bits were used to index a table of subroutine addresses, the actual subroutine could reside anywhere in memory. This is the evolutionary process that led to the design of the 8086 interrupt mechanism.
Interrupts result in a transfer of control to a new location in a new code segment. A 256-element table (interrupt transfer vector) containing pointers to these interrupt service code locations resides at the beginning of memory. Each element is four bytes in size, containing an offset address and the high-order 16-bits of the start address of the service code segment. Each element of this table corresponds to an interrupt type, these types being numbered 0 to 255. All interrupts perform a transfer by pushing the current flag setting onto the stack and then performing an indirect call (of the intersegment variety) through the interrupt transfer vector.
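A C sketch of the table layout (illustrative; the structure and function names are invented):

#include <stdint.h>
#include <stdio.h>

/* One element of the interrupt transfer vector: the offset (new IP)
   and the start value of the service code segment (new CS). */
struct interrupt_vector_entry {
    uint16_t offset;
    uint16_t segment;
};

/* The table begins at physical address 0, so the entry for a given
   interrupt type lies at four times the type number. */
static uint32_t vector_entry_address(uint8_t type)
{
    return (uint32_t)type * 4u;
}

int main(void)
{
    printf("%lu\n", (unsigned long)vector_entry_address(8));   /* prints 32 */
    return 0;
}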
The 8086 processor recognizes two varieties of external interrupt: the non-maskable interrupt and the maskable interrupt. A pin is provided for each variety. Program execution control may be transferred by means of operations similar in effect to that of external interrupts. A generalized 2-byte instruction is provided that generates an interrupt of any type; the type is specified in the second byte. A special 1-byte instruction to generate an interrupt of one particular type is also provided. Such an instruction would be required by a software debugger so that breakpoints can be "planted" on 1-byte instructions without overwriting, even temporarily, the next instruction. And finally, an interrupt return instruction is provided which pops and restores the saved flag settings in addition to performing the normal subroutine return function.

Single step. When the TF flag register is set, the processor generates an interrupt after the execution of each instruction. During interrupt transfer sequences caused by any type of interrupt, the TF flag is cleared after the push-flags step of the interrupt sequence. No instructions are provided for setting or clearing TF directly. Rather, the flag-register file image saved on the stack by a previous interrupt operation must be modified so that the subsequent interrupt return operation restores TF set. This allows a diagnostic task to single-step through a task under test while still executing normally itself.

External-processor synchronization. Instructions are included that permit the 8086 to utilize an external processor to perform any specialized operations (e.g., exponentiation) not implemented on the 8086. Consideration was given to the ability to perform the specialized operations either via the external processor or through software routines, without having to recompile the code. The external processor would have the ability to monitor the 8086 bus and constantly be aware of the current instruction being executed. In particular, the external processor could detect the special instruction ESCAPE and then perform the necessary actions. In order for the external processor to know the 20-bit address of the operand for the instruction, the 8086 will react to the ESCAPE instruction by performing a read (but ignoring the result) from the operand address specified, thereby placing the address on the bus for the external processor to see. Before doing such a dummy read, the 8086 will have to wait for the external processor to be ready. The "test" pin on the 8086 processor is used to provide this synchronization. The 8086 instruction WAIT accomplishes the wait.

If the external processor is not available, the specialized operations could be performed by software subroutines. To invoke the subroutines, an interrupt-generating instruction would be executed. The subroutine needs to be passed the specific specialized-operation opcode and address of the operand. This information would be contained in an in-line data byte (or bytes) following the interrupt-generating instruction. The same number of bytes are required to issue a specialized operation instruction to the external processor or to invoke the software subroutines, as illustrated in Fig. 12. Thus the compiler could generate object code that could be used either way. The actual determination of which way the specialized operations were carried out could be made at load time and the object code modified by the loader accordingly.

Sharing resources with parallel processors.
In multiple-processor systems with shared resources it is necessary to provide mechanisms to enforce controlled access to those resources. Such mechanisms, while generally provided through software operating systems, require hardware assistance. A sufficient mechanism for accomplishing this is a locked exchange (also known as test-and-set-lock). The 8086 provides a special 1-byte prefix which may precede any instruction. This prefix causes the processor to assert its bus-lock signal for the duration of the operation caused by the instruction. It is assumed that external hardware, upon receipt of
that signal, will prohibit bus access for other bus masters during the period of its assertion. The instruction most useful in this context is an exchange register with memory. A simple software lock may be implemented with the following code sequences:
 
Check:  MOV   AL,1            ;set AL to 1 (implies locked)
        LOCK XCHG Sema,AL     ;test and set lock
        TEST  AL,AL           ;set flags based on AL
        JNZ   Check           ;retry if lock already set
        . . .                 ;critical region
        MOV   Sema,0          ;clear the lock when done
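The same idea expressed with present-day C11 atomics, as a sketch rather than a rendering of the 8086 sequence: atomic_exchange plays the role of the LOCK XCHG.

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool sema = false;

static void acquire(void)
{
    while (atomic_exchange(&sema, true))   /* test and set lock         */
        ;                                  /* retry if lock already set */
}

static void release(void)
{
    atomic_store(&sema, false);            /* clear the lock when done  */
}

int main(void)
{
    acquire();                             /* critical region would go here */
    release();
    return 0;
}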
IX. Summary and Conclusions

"The 8008 begat the 8080, and the 8080 begat the 8085, and the 8085 begat the 8086." During the six years in which the 8008 evolved into the 8086, the processor underwent changes in many areas, as depicted by the conceptual diagram of Fig. 13. Figure 14 compares the functional block diagrams of the various processors. Comparisons in performance and technology are shown in Tables 5 and 6. The era of the 8008 through the 8086 is architecturally notable for its role in exploiting technology and capabilities, thereby lowering computing costs by over three orders of magnitude. By removing a dominant hurdle that has inhibited the computer industry, the necessity to conserve expensive processors, the
Table 5 Performance Comparison

                                   8008      8080 (2 MHz)      8086 (8 MHz)
register-register transfer         12.5      2                 0.25
jump                               25        5                 0.875
register-immediate operation       20        3.5               0.5
subroutine call                    28        9                 2.5
increment (16-bit)                 50        2.5               0.25
addition (16-bit)                  75        5                 0.375
transfer (16-bit)                  25        2                 0.25

All times are given in microseconds.
Table 6 Technology Comparison

                                     8008              8080              8085              8086
Silicon gate technology              P-channel         N-channel         N-channel         Scaled N-channel (HMOS)
                                     enhancement       enhancement       depletion         depletion load device
                                     load device       load device       load device
Clock rate                           0.5-0.8 MHz       2-3 MHz           3-5 MHz           5-8 MHz
Min gate delay† (F0 = F1 = 1)        30 ns‡            15 ns‡            5 ns              3 ns
Typical speed-power product          100 pJ            40 pJ             10 pJ             2 pJ
Approximate number of transistors¶   2,000             4,500             6,500             20,000§
Average transistor density
  (mil² per transistor)              8.4               7.5               5.7               2.5

† Fastest inverter function available with worst-case processing.
‡ Linear-mode enhancement load.
§ This is 29,000 transistors if all ROM and PLA available placement sites are counted.
¶ Gate equivalent can be estimated by dividing by 3.



new era has permitted system designers to concentrate on solving the fundamental problems of the applications themselves.
 
 
 
APPENDIX 1 SAVING AND RESTORING FLAGS IN THE 8008

Interrupt routines must leave all processor flags and registers unaltered so as not to contaminate the processing that was interrupted. This is most simply done by having the interrupt routine save all flags and registers on entry and restore them prior to exiting. The 8008, unlike its successors, has no instruction for directly saving or restoring flags. Thus 8008 interrupt routines that alter flags (practically every routine does) must conditionally test each flag to obtain its value and then save that value. Since there are no instructions for directly setting or clearing flags, the flag values must be restored by executing code that will put the flags in the saved state. The 8008 flags can be restored very efficiently if they are saved in the following format in a byte in memory.

bit 7 (most significant) = original value of CARRY
bit 6 = original value of SIGN
bit 5 = original value of SIGN
bit 4 = 0
bit 3 = 0
bit 2 = complement of original value of ZERO
bit 1 = complement of original value of ZERO
bit 0 = complement of original value of PARITY

With the information saved in the above format in a byte called FLAGS, the following two instructions will restore all the saved flag values:

        LDA  FLAGS     ;load saved flags into accumulator
        ADD  A         ;add the accumulator to itself

This instruction sequence loads the saved flags into the accumulator and then doubles the value, thereby moving each bit one position to the left. This causes each flag to be set to its original value, for the following reasons:
The original value of the CARRY flag, being in the leftmost bit, will be moved out of the accumulator and wind up in the CARRY flag.
The original value of the SIGN flag, being in bit 6, will wind up in bit 7 and will become the sign of the result. The new value of the SIGN flag will reflect this sign.
The complement of the original value of the PARITY flag will wind up in bit 1, and it alone will determine the parity of the result (all other bits in the result are paired up and have no net effect on parity). The new setting of the PARITY flag will be the complement of this bit (the flag denotes even parity) and therefore will take on the original value of the PARITY flag.
Whenever the ZERO flag is 1, the SIGN flag must be 0 (zero is a positive two's-complement number) and the PARITY flag must be 1 (zero has even parity). Thus an original ZERO flag value of 1 will cause all bits of FLAGS, with the possible exception of bit 7, to be 0. After the ADD instruction is executed, all bits of the result will be 0 and the new value of the ZERO flag will therefore be 1.
An original ZERO flag value of 0 will cause two bits in FLAGS to be 1 and will wind up in the result as well. The new value of the ZERO flag will therefore be 0.

The above algorithm relies on the fact that flag values are always consistent, i.e., that the SIGN flag cannot be a 1 when the ZERO flag is a 1. This is always true in the 8008, since the flags come up in a consistent state whenever the processor is reset and flags can only be modified by instructions which always leave the flags in a consistent state. The 8080 and its derivatives allow the programmer to modify the flags in an arbitrary manner by popping a value of his choice off the stack and into the flags. Thus the above algorithm will not work on those processors.
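The following C program, a simulation written for this discussion rather than 8008 code, packs a sample set of flag values into the format above and checks that doubling the byte reproduces them:

#include <stdint.h>
#include <stdio.h>

static uint8_t pack_flags(int c, int s, int z, int p)
{
    /* the FLAGS format given earlier: C, S, S, 0, 0, ~Z, ~Z, ~P */
    return (uint8_t)((c << 7) | (s << 6) | (s << 5) |
                     ((!z) << 2) | ((!z) << 1) | (!p));
}

static int parity_even(uint8_t v)
{
    int ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (v >> i) & 1;
    return (ones & 1) == 0;
}

int main(void)
{
    int c = 1, s = 0, z = 0, p = 0;            /* sample state to restore       */
    uint8_t flags = pack_flags(c, s, z, p);
    unsigned sum = (unsigned)flags * 2;        /* what LDA FLAGS / ADD A yields */
    uint8_t result = (uint8_t)sum;
    printf("C=%d S=%d Z=%d P=%d\n",
           sum > 0xFF,                         /* carry out of bit 7            */
           (result & 0x80) != 0,               /* sign of the result            */
           result == 0,                        /* zero result                   */
           parity_even(result));               /* even-parity flag              */
    return 0;
}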

A code sequence for saving the flags in the required format is as follows:


        MVI  A,0       ;move zero into accumulator
        JNC  L1        ;jump if CARRY not set
        ORA  80H       ;OR accumulator with 80 hex (set bit 7)
L1:     JZ   L3        ;jump if ZERO set (and SIGN not set and PARITY set)
        ORA  06H       ;OR accumulator with 06 hex (set bits 1 and 2)
        JM   L2        ;jump if negative (SIGN set)
        ORA  60H       ;OR accumulator with 60 hex (set bits 5 and 6)
L2:     JPE  L3        ;jump if parity even (PARITY set)
        ORA  01H       ;OR accumulator with 01 hex (set bit 0)
L3:     STA  FLAGS     ;store accumulator in FLAGS


APPENDIX 2 DECIMAL ARITHMETIC


A. Packed BCD
 
1. Addition. Numbers can be represented as a sequence of decimal digits by using a 4-bit binary encoding of the digits and packing these encodings two to a byte. Such a representation is called packed BCD (unpacked BCD would contain only one digit per byte). In order to preserve this decimal interpretation in performing binary addition on packed BCD numbers, the value 6 must be added to each digit of the sum whenever (1) the resulting digit is greater than 9 or (2) a carry occurs out of this digit as a result of the addition. This is because the 4-bit encoding contains six more combinations than there are decimal digits. Consider the following examples (numbers are written in hexadecimal instead of binary for convenience).


Example 1: 81 + 52

          d2 d1 d0      names of digit positions
              8  1      packed BCD augend
           +  5  2      packed BCD addend
              D  3      binary sum
           +  6  0      adjustment because d1 > 9
           1  3  3      packed BCD sum
Example 2: 28 + 19

          d2 d1 d0      names of digit positions
              2  8      packed BCD augend
           +  1  9      packed BCD addend
              4  1      carry occurs out of d0
           +     6      adjustment for carry
              4  7      packed BCD sum
In order to be able to make such adjustments, carries out of either digit position must be recorded during the addition operation. The 4004, 8080, 8085, and 8086 use the CARRY and AUXILIARY CARRY flag to record carries out of the leftmost and rightmost digits respectively. All of these processors provide an instruction for performing the adjustments. Furthermore, they all contain an add-with-carry instruction to facilitate the addition of numbers containing more than two digits.
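As an illustration (not the processor's actual adjustment logic), a C routine that performs packed BCD addition with the correction just described reproduces Examples 1 and 2:

#include <stdint.h>
#include <stdio.h>

/* Add two packed BCD bytes: 6 is added to a digit whenever that digit
   exceeds 9 or a carry came out of it during the binary addition. */
static uint8_t bcd_add(uint8_t a, uint8_t b, int *carry_out)
{
    unsigned sum = a + b;
    int aux_carry = ((a & 0x0F) + (b & 0x0F)) > 0x0F;   /* carry out of d0 */
    if ((sum & 0x0F) > 9 || aux_carry)
        sum += 0x06;
    if (((sum >> 4) & 0x0F) > 9 || sum > 0xFF)
        sum += 0x60;
    *carry_out = sum > 0xFF;
    return (uint8_t)sum;
}

int main(void)
{
    int c;
    uint8_t r1 = bcd_add(0x81, 0x52, &c);
    printf("%02X carry %d\n", (unsigned)r1, c);   /* 33 carry 1, i.e. 133 */
    uint8_t r2 = bcd_add(0x28, 0x19, &c);
    printf("%02X carry %d\n", (unsigned)r2, c);   /* 47 carry 0           */
    return 0;
}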
2. Subtraction. Subtraction of packed BCD numbers can be performed in a similar manner. However, none of the Intel processors prior to the 8086 provides an instruction for performing decimal adjustment following a subtraction (Zilog's Z-80, introduced two years before the 8086, does provide such an instruction). On processors without the subtract adjustment instruction, subtraction of packed BCD numbers can be accomplished by generating the ten's complement of the subtrahend and adding.
3. Multiplication. Multiplication of packed BCD numbers could also be adjusted to give the correct decimal result if the out-of-digit carries occurring during the multiplication were recorded. The result of multiplying two one-byte operands is two bytes long (four digits), and out-of-digit carries can occur on any of the three low-order digits, all of which would have to be recorded. Furthermore, the carries out of any digit are no longer restricted to unity, and so counters rather than flags would be required to record the carries. This is illustrated in the following example (numbers are written in hexadecimal instead of binary for convenience).
Example 3: 94 * 63

Multiplying the packed BCD operands 94H and 63H in binary generates a series of out-of-digit carries (out of both d1 and d2) as the partial products are accumulated. An adjustment of 6 is added at the appropriate digit position for each recorded carry, and a further 6 is added at each digit position whose digit remains greater than 9 (d0 and then d1), finally producing the packed BCD product 5922.
The preceding example illustrates two facts. First, packed BCD multiplication adjustments are possible if the necessary out-of-digit carry information is recorded by the multiply instruction. Second, the facilities needed in the processor to record this information and apply the correction are non-trivial.
Another approach to determining the out-of-digit carries is to analyze the multiplication process on a digit-by-digit basis as follows:
Let x1 and x2 be the packed BCD digits in the multiplicand.
Let y1 and y2 be the packed BCD digits in the multiplier.

Binary value of multiplicand = 16*x1 + x2
Binary value of multiplier   = 16*y1 + y2
Binary value of product      = 256*x1*y1 + 16*(x1*y2 + x2*y1) + x2*y2
                             = x1*y1 in the most significant byte, x2*y2 in the least significant byte, and (x1*y2 + x2*y1) straddling both bytes
If there are no cross terms (i.e., either x1 or y2 is zero and either x2 or y1 is zero), the number of out-of-digit carries generated by the x1 * y1 term is simply the most significant digit in the most significant byte of the product; similarly the number of out-of-digit carries generated by the x2 * y2 term is simply the most significant digit in the least significant byte of the product. This is illustrated in the following example (numbers are written in hexadecimal instead of binary for convenience).
Example 4: 90 * 20

          d3 d2 d1 d0      names of digit positions
                9  0       packed BCD multiplicand
             *  2  0       packed BCD multiplier
           1  2  0  0      binary product (9*2 = 12H forms the most
                           significant byte, 0*0 = 00 the least
                           significant byte)

The most significant digit of the most significant byte is 1, indicating that there was one out-of-digit carry from the low-order digit when the 9*2 term was formed. The adjustment is to add 6 to that digit:

           1  2  0  0      binary product
          +    6  0  0     adjustment
           1  8  0  0      packed BCD product
Thus, in the absence of cross terms, the number of out-of-digit carries that occur during a multiplication can be determined by examining the binary product. The cross terms, when present, overshadow the out-of-digit carry information in the product, thereby making the use of some other mechanism to record the carries essential. None of the Intel processors incorporates such a mechanism. (Prior to the 8086, multiplication itself was not even supported.) Once it was decided not to support packed BCD multiplication in the processors, no attempt was made to even analyze packed BCD division.
B. Unpacked BCD
 
 
Unpacked BCD representation of numbers consists of storing the encoded digits in the low-order four bits of consecutive bytes. An ASCII string of digits is a special case of unpacked BCD with the high-order four bits of each byte containing 0011.
Arithmetic operations on numbers represented as unpacked BCD digit strings can be formulated in terms of more primitive BCD operations on single-digit (two digits for dividends and two digits for products) unpacked BCD numbers.
1. Addition and Subtraction. Primitive unpacked additions and subtractions follow the same adjustment procedures as packed additions and subtractions.
2. Multiplication. Primitive unpacked multiplication involves multiplying a one-digit (one-byte) unpacked multiplicand by a one-digit (one-byte) unpacked multiplier to yield a two-digit (two-byte) unpacked product. If the high-order four bits of the multiplicand and multiplier are zeros (instead of don't-cares), each will represent the same value interpreted as a binary number or as a BCD number. A binary multiplication will yield a two-byte product in which the high-order byte is zero. The low-order byte of this product will have the correct value when interpreted as a binary number and can be adjusted to a two-byte BCD number as follows:
High-order byte = (binary product)/10
Low-order byte = binary product modulo 10
This is illustrated in the following example (numbers are written in hexadecimal instead of binary for convenience).
Example 5: 7 * 5

          d1 d0       names of digit positions
           0  7       unpacked BCD multiplicand
        *  0  5       unpacked BCD multiplier
           2  3       binary product

     2 3 / 0 A        binary product divided by 10
                      (adjustment for high-order byte)
        0  3          unpacked BCD product (high-order byte)

     2 3 modulo 0 A   binary product modulo 10
                      (adjustment for low-order byte)
        0  5          unpacked BCD product (low-order byte)

3. Division. Primitive unpacked division involves dividing a two-digit (two-byte) unpacked dividend by a one-digit (one-byte) unpacked divisor to yield a one-digit (one-byte) unpacked quotient and a one-digit (one-byte) unpacked remainder. If the high-order four bits in each byte of the dividend are zeros (instead of don't-cares), the dividend can be adjusted to a one-byte binary number as follows:
Binary dividend = 10 * high-order byte + low-order byte
If the high-order four bits of the divisor are zero, the divisor will represent the same value interpreted as a binary number or as a BCD number. A binary division of the adjusted (binary) dividend and BCD divisor will yield a one-byte quotient and a one-byte remainder, each representing the same value interpreted as a binary number or as a BCD number. This is illustrated in the following example (numbers are written in hexadecimal instead of binary for convenience).
Example 6: 45 / 6

          d1 d0       names of digit positions
           0  4       unpacked BCD dividend (high-order byte)
           0  5       unpacked BCD dividend (low-order byte)
           2  D       adjusted dividend (4 * 10 + 5 = 2DH)
        /  0  6       unpacked BCD divisor
           0  7       unpacked BCD quotient
           0  3       unpacked BCD remainder

4. Adjustment Instructions. The 8086 processor provides four adjustment instructions for use in performing primitive unpacked BCD arithmetic: one for addition, one for subtraction, one for multiplication, and one for division.
The addition and subtraction adjustments are performed on a
binary sum or difference assumed to be left in the one-byte AL register. To facilitate multi-digit arithmetic, whenever AL is altered by the addition or subtraction adjustments, the adjustments will also do the following:

  • set the CARRY flag (this facilitates multi-digit unpacked additions and subtractions)


  • consider the one-byte AH register to contain the next most significant digit and increment or decrement it as appropriate (this permits the addition adjustment to be used in a multi-digit unpacked multiplication)
The multiplication adjustment assumes that AL contains a binary product and places the two-digit unpacked BCD equivalent in AH and AL. The division adjustment assumes that AH and AL contain a two-digit unpacked BCD dividend and places the binary equivalent in AH and AL.
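A small C sketch of the two adjustments just described, with the AH:AL register pair modelled as two bytes (an illustration, not the processor's microcode); it reproduces Examples 5 and 6:

#include <stdint.h>
#include <stdio.h>

/* Multiplication adjustment: AL holds a binary product of two unpacked
   digits; split it into two unpacked BCD digits in AH and AL. */
static void multiply_adjust(uint8_t *ah, uint8_t *al)
{
    uint8_t binary = *al;
    *ah = binary / 10;
    *al = binary % 10;
}

/* Division adjustment: AH and AL hold a two-digit unpacked dividend;
   replace them with the equivalent binary value (done before dividing). */
static void divide_adjust(uint8_t *ah, uint8_t *al)
{
    *al = (uint8_t)(*ah * 10 + *al);
    *ah = 0;
}

int main(void)
{
    uint8_t ah = 0, al = 7 * 5;                     /* Example 5: binary product 35 */
    multiply_adjust(&ah, &al);
    printf("%u %u\n", (unsigned)ah, (unsigned)al);  /* prints 3 5 */

    ah = 4; al = 5;                                 /* Example 6: unpacked 45       */
    divide_adjust(&ah, &al);
    printf("%u rem %u\n", (unsigned)(al / 6), (unsigned)(al % 6));   /* 7 rem 3 */
    return 0;
}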
The following algorithms show how the adjustment instructions can be used to perform multi-digit unpacked arithmetic.
Addition

Let augend = a[N] a[N-1] . . . a[2] a[1]
Let addend = b[N] b[N-1] . . . b[2] b[1]
Let sum    = c[N] c[N-1] . . . c[2] c[1]

0 → (CARRY)
DO i = 1 to N
    (a[i]) → (AL)
    (AL) + (b[i]) → (AL)          where + denotes add-with-carry
    add-adjust (AL) → (AX)
    (AL) → (c[i])
Subtraction

Let minuend    = a[N] a[N-1] . . . a[2] a[1]
Let subtrahend = b[N] b[N-1] . . . b[2] b[1]
Let difference = c[N] c[N-1] . . . c[2] c[1]

0 → (CARRY)
DO i = 1 to N
    (a[i]) → (AL)
    (AL) - (b[i]) → (AL)          where - denotes subtract-with-borrow
    subtract-adjust (AL) → (AX)
    (AL) → (c[i])
Multiplication

Let multiplicand = a[N] a[N-1] . . . a[2] a[1]
Let multiplier   = b
Let product      = c[N+1] c[N] . . . c[2] c[1]

(b) AND 0FH → (b)
0 → (c[1])
DO i = 1 to N
    (a[i]) AND 0FH → (AL)
    (AL) * (b) → (AX)
    multiply-adjust (AL) → (AX)
    (AL) + (c[i]) → (AL)
    add-adjust (AL) → (AX)
    (AL) → (c[i])
    (AH) → (c[i+1])
Division

Let dividend = a[N] a[N-1] . . . a[2] a[1]
Let divisor  = b
Let quotient = c[N] c[N-1] . . . c[2] c[1]

(b) AND 0FH → (b)
0 → (AH)
DO i = N to 1
    (a[i]) AND 0FH → (AL)
    divide-adjust (AX) → (AL)
    (AL) / (b) → (AL)             with remainder going into (AH)
    (AL) → (c[i])
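A C rendering of the first of these algorithms, the multi-digit unpacked addition, is sketched below; it is illustrative only (digits are stored least significant first, one per byte, and the array names are invented).

#include <stdint.h>
#include <stdio.h>

static void unpacked_add(const uint8_t *a, const uint8_t *b, uint8_t *c, int n)
{
    int carry = 0;
    for (int i = 0; i < n; i++) {
        int al = a[i] + b[i] + carry;      /* add-with-carry                    */
        carry = al > 9;                    /* add-adjust sets CARRY when needed */
        c[i] = (uint8_t)(al % 10);         /* and leaves the adjusted digit     */
    }
}

int main(void)
{
    const uint8_t a[3] = { 7, 4, 1 };      /* 147 */
    const uint8_t b[3] = { 5, 8, 0 };      /*  85 */
    uint8_t c[3];
    unpacked_add(a, b, c, 3);
    printf("%u%u%u\n", (unsigned)c[2], (unsigned)c[1], (unsigned)c[0]);   /* 232 */
    return 0;
}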









 

 
Minicomputers

A major attribute of computers below the class of maxi is their use in dedicated applications areas. The minicomputer evolved from a conceptual view of design wherein a programmable controller was perceived to be the cheapest, fastest way to implement a special-purpose function. The minicomputer did not require the generality of larger computers and hence required less software and less overhead. Thus minicomputers were leaner and more responsive than their cousins.

The need for minicomputers evolved from several areas including control, switching, and data processing. IBM's first minicomputer was the 1401 (c. 1962). As initially conceived, the 1401 was a stored-program replacement for the former hardwired controllers used to interconnect card readers, magnetic tape units, and line printers in an offline batch support system. The CDC 160, introduced in 1960 at a price of $60,000, was the first high-performance, low-cost, real time computer. Like the 1401 it was designed as a support computer to a larger machine and as a computer to test peripherals. Although it was not intended to be sold as a programmable computer, it was subsequently applied to scientific and commercial computations.

The DEC PDP-5 was introduced in 1964 for real time data collection and control. The PDP-5 had a single 12-bit accumulator, a 1-bit link for overflow and multiple-precision arithmetic, and a 1-bit interrupt enable. The program counter was held in primary memory, and an analog-to-digital converter was built directly into the accumulator. The immediate successor of the PDP-5, the PDP-8, can be credited with triggering the minicomputer revolution. Its small size (half a cabinet) and cost ($18,000) brought the computer into the region where it was cost-effective in dedicated real time applications, especially since it could be packaged as part of a larger system. By 1980 over 100,000 PDP-8's had been sold since their introduction in 1966.

From these origins, where the minicomputer was considered to be the minimal-complexity computer, the minicomputer has grown in functionality and performance to the point where it rivals the higher-cost, general-purpose computers of a decade ago. This section describes four minicomputers: the PDP-8, the PDP-11, the HP 2116, and the IBM System/38.
The PDP-8

The 12-bit PDP-8 is described in a top-down fashion in Chap. 8. The description is carried from the PMS and ISP levels to register-transfer, gate, and circuit levels, illustrating the hierarchy of design. Since the PDP-8 is conceptually simple, it is possible to provide substantial details of the design in terms of the mid-1960s discrete technology used to implement the original PDP-8. A Kiviat graph for the original PDP-8 is shown in Fig. 1. Chapter 15 illustrates how the PDP-8 might be implemented by using contemporary bit-sliced microprogrammed chip sets. The design illustrates the use of ISP to describe the hardware building blocks (the Am2901 and 2909) and microcode to emulate other ISPs. PDP-8 programs have been successfully executed by using the ISP simulator on this bit-sliced PDP-8. After Chap. 15, machines are discussed only at the register-transfer level or above. However, the reader should have enough working knowledge about technology at this point to use Am2900 chips and/or ISP in design exercises completing the details in lower-level descriptions of other machines in this book. We encourage the reader to try at least a paper exercise of some other machine. Finally, Chap. 46 summarizes the evolution of the PDP-8 family of implementations over a decade of technological change ranging from discrete logic to microcomputer implementations.
The PDP-11

The need was felt to increase the functionality of minimal computers, especially by providing a larger address space. This, coupled with a change from 6-bit (e.g., two characters per PDP-8 12-bit word) to 8-bit character representations, led to a large number of 16-bit machines. The PDP-11, one of the most popular 16-bit minicomputers, is discussed in Chap. 38, which is complete with ISP and PMS descriptions. The performance of one PDP-11 family member, the Model 70, is summarized by the Kiviat graph in Fig. 2. Chapter 39 reviews the major implementation tradeoffs, while Chap. 47 outlines the evolution of the PDP-11 family. A maxicomputer, the VAX-11/780, with strong PDP-11 family ties and PDP-11 compatibility, is described in Chap. 42.

A contemporary of the PDP-11 is the Data General Nova. Whereas the PDP-11 sought to increase the semantic content of instructions, the Nova designers sought an ISP whose implementations would be simple, provide high performance, and be oriented toward MSI technology. The generic Nova implementation consists of a single fast loop from register file to ALU, to shifter, to condition code sensing, and back to the register file (see Fig. 3). Each instruction can cause one of each type of function to execute on a single cycle through the loop. Thus machine-level instructions are microcoded a la the PDP-8 operate group of instructions. The similarity with the PDP-8 is not surprising, since the Nova designers were veteran PDP-8 implementors and users. In order to pay for this rich, easily decoded class of register-to-register instructions, the Nova has a meager (e.g., four-
instruction) memory/register class of instructions. Being simple, the Nova ISP consistently was cheaper and faster on individual instructions than its PDP-11 rival. The Nova ISP was implemented in a single LSI chip in 1976, a feat not yet matched by the PDP-11 as of 1979. The Nova ISP philosophy represents an interesting tradeoff between ISP power and speed. One way to increase system performance is to increase the semantic content of the ISP so that fewer instructions have to be executed to complete a task. Another way to increase performance is to execute a lot of very simple instructions very fast. In the latter case, an optimizing compiler can provide a higher-level abstract machine so that the user never has to bother with assembly language. One cannot tell from the marketplace which approach is best, since the PDP-11 and the Nova have become the second and third minicomputer ISP, respectively, to surpass 50,000 units sold.
The HP 2116

A contemporary of the PDP-11 resembling a 16-bit stretch of the PDP-8 is the HP 2116. The HP 2116 was also influenced by a PDP-8 alumnus, John Kondek. An instruction set similar to that of the HP 2116 is contained in Chap. 31, where a cousin of the HP 2116 was used to implement a desk-top computer. Another variation of the ISP is found in Chap. 49. Strong family ties with the HP 2116 ISP can be found in the more recent HP 2100 ISP.
The IBM System/38

Chapter 32 describes a business-oriented minicomputer that provides an architectural interface above the traditional ISP level.
                           A New Architecture for Mini-Computers: The DEC PDP-11
_________________________________________________________________________

Introduction
The mini-computer2 has a wide variety of uses: communications controller; instrument controller; large-system pre-processor; real-time data acquisition systems . . . ; desk calculator. Historically, Digital Equipment Corporation's PDP-8 Family, with 6,000 installations, has been the archetype of these mini-computers.
In some applications current mini-computers have limitations. These limitations show up when the scope of their initial task is increased (e.g., using a higher level language, or processing more variables). Increasing the scope of the task generally requires the use of more comprehensive executives and system control programs, hence larger memories and more processing. This larger system tends to be at the limit of current mini-computer capability, thus the user receives diminishing returns with respect to memory, speed efficiency and program development time. This limitation is not surprising since the basic architectural concepts for current mini-computers were formed in the early 1960's. First, the design was constrained by cost, resulting in rather simple processor logic and register configurations. Second, application experience was not available. For example, the early constraints often created computing designs with what we now consider weaknesses:
1 Limited addressing capability, particularly of larger core sizes
2 Few registers, general registers, accumulators, index registers, base registers
3 No hardware stack facilities
4 Limited priority interrupt structures, and thus slow context switching among multiple programs (tasks)
5 No byte string handling
6 No read only memory facilities
7 Very elementary I/O processing
8 No larger model computer, once a user outgrows a particular model
9 High programming costs because users program in machine language.
In developing a new computer the architecture should at least solve the above problems. Fortunately, in the late 1960's integrated circuit semiconductor technology became available so that newer computers could be designed which solve these problems at low cost. Also, by 1970 application experience was available to influence the design. The new architecture should thus lower programming cost while maintaining the low hardware cost of mini-computers.
The DEC PDP-11, Model 20 is the first computer of a computer family designed to span a range of functions and performance. The Model 20 is specifically discussed, although design guidelines are presented for other members of the family. The Model 20 would nominally be classified as a third generation (integrated circuits), 16-bit word, one central processor with eight 16-bit general registers, using two's complement arithmetic and addressing up to 2^16 eight-bit bytes of primary memory (core). Though classified as a general register processor, the operand accessing mechanism allows it to perform equally well as a 0-(stack), 1-(general register) and 2-(memory-to-memory) address computer. The computer's components (processor, memories, controls, terminals) are connected via a single switch, called the Unibus.


1AFIPS Proc. SJCC, 1970, pp. 657-675.
2The PDP-11 design is predicated on being a member of one (or more) of the micro, midi, mini, . . . , maxi (computer name) markets. We will define these names as belonging to computers of the third generation (integrated circuit to medium scale integrated circuit technology), having a core memory with cycle time of .5 ~ 2 microseconds, a clock rate of 5 ~ 10Mhz . . . , a single processor with interrupts and usually applied to doing a particular task (e.g., controlling a memory or communications lines, pre-processing for a larger system, process control). The specialized names are defined as follows:

           Maximum addressable       Processor and memory       Word length     Processor        Data types
           primary memory (words)    cost (1970 kilodollars)    (bits)          state (words)
Micro      8 K                       ~5                         8 ~ 12          2                Integers, words, boolean vectors
Mini       32 K                      5 ~ 10                     12 ~ 16         2-4              Vectors (i.e., indexing)
Midi       65 ~ 128 K                10 ~ 20                    16 ~ 24         4-16             Double length floating
The machine is described using the PMS and ISP notation of Bell and Newell [1971] at different levels. The following descriptive sections correspond to the levels: external design constraints level; the PMS level, the way components are interconnected and allow information to flow; the program level or ISP (Instruction Set Processor), the abstract machine which interprets programs; and finally, the logical design level. (We omit a discussion of the circuit level, the PDP-11 being constructed from TTL integrated circuits.)

Design Constraints
The principal design objective is yet to be tested; namely, do users like the machine? This will be tested both in the market place and by the features that are emulated in newer machines; it will indirectly be tested by the life span of the PDP-11 and any offspring.
Word Length
The most critical constraint, word length (defined by IBM) was chosen to be a multiple of 8 bits. The memory word length for the Model 20 is 16 bits, although there are 32- and 48-bit instructions and 8- and 16-bit data. Other members of the family might have up to 80 bit instructions with 8-, 16-, 32- and 48-bit data. The internal, and preferred external character set was chosen to be 8-bit ASCII.
Range and Performance
Performance and function range (extendability) were the main design constraints; in fact, they were the main reasons to build a new computer. DEC already has four computer families that span a range but are incompatible. In addition to the range, the initial machine was constrained to fall within the small-computer product line, which means to have about the same performance as a PDP-8. The initial machine outperforms the PDP-5, LINC, and PDP-4 based families. Performance, of course, is both a function of the instruction set and the technology. Here, we're fundamentally only concerned with the instruction set performance because faster hardware will always increase performance for any family. Unlike the earlier DEC families, the PDP-11 had to be designed so that new models with significantly more performance can be added to the family.
A rather obvious goal is maximum performance for a given model. Designs were programmed using benchmarks, and the results compared with both DEC and potentially competitive machines. Although the selling price was constrained to lie in the $5,000 to $10,000 range, it was realized that the decreasing cost of logic would allow a more complex organization than earlier DEC computers. A design which could take advantage of medium- and eventually large-scale integration was an important consideration. First, it could make the computer perform well; and second, it would extend the computer family's life. For these reasons, a general registers organization was chosen.
Interrupt Response. Since the PDP-11 will be used for real time control applications, it is important that devices can communicate with one another quickly (i.e., the response time of a request should be short). A multiple priority level, nested interrupt mechanism was selected; additional priority levels are provided by the physical position of a device on the Unibus. Software polling is unnecessary because each device interrupt corresponds to a unique address.
Software
The total system including software is of course the main objective of the design. Two techniques were used to aid programmability: first benchmarks gave a continuous indication as to how well the machine interpreted programs; second, systems programmers continually evaluated the design. Their evaluation considered: what code the compiler would produce; how would the loader work; ease of program relocability; the use of a debugging program; how the compiler, assembler and editor would be coded-in effect, other benchmarks; how real time monitors would be written to use the various facilities and present a clean interface to the users; finally the ease of coding a program.

Modularity
Structural flexibility (sometimes called modularity) for a particular model was desired. A flexible and straightforward method for interconnecting components had to be used because of varying user needs (among user classes and over time). Users should have the ability to configure an optimum system based on cost, performance and reliability, both by interconnection and, when necessary, constructing new components. Since users build special hardware, a computer should be easily interfaced. As a by-product of modularity, computer components can be produced and stocked, rather than tailor-made on order. The physical structure is almost identical to the PMS structure discussed in the following section; thus, reasonably large building blocks are available to the user.
Microprogramming
A note on microprogramming is in order because of current interest in the "firmware" concept. We believe microprogramming, as we understand it, can be a worthwhile technique as it applies to processor design. For example, microprogramming can probably be used in larger computers when floating point data operators are needed. The IBM System/360 has made use of the technique for defining processors that interpret both the System/360 instruction set and earlier family instruction sets (e.g., 1401, 1620, 7090). In the PDP-11 the basic instruction set is quite straightforward and does not necessitate microprogrammed interpretation. The processor-memory connection is asynchronous and therefore memory of any speed can be connected. The instruction set encourages the user to write reentrant programs; thus, read-only memory can be used as part of primary memory to gain the permanency and performance normally attributed to microprogramming. In fact, the Model 10 computer, which will not be further discussed, has a 1024-word read-only memory and a 128-word read-write memory.

Understandability

Understandability was perhaps the most fundamental constraint (or goal), although it is now somewhat less important to have a machine that can be quickly understood by a novice computer user than it was a few years ago. DEC's early success has been predicated on selling to an intelligent but inexperienced user. Understandability, though hard to measure, is an important goal because all (potential) users must understand the computer. A straightforward design should simplify the systems programming task; in the case of a compiler, it should make translation (particularly code generation) easier.
PDP-11 Structure at the PMS Level1

Introduction

PDP-11 has the same organizational structure as nearly all present day computers (Fig. 1). The primitive PMS components are: the primary memory (Mp) which holds the programs while the central processor (Pc) interprets them; io controls (Kio) which manage data transfers between terminals (T) or secondary memories (Ms) to primary memory (Mp); the components outside the computer at the periphery (X), either humans (H) or some external process (e.g., another computer); the processor console (T. console) by which humans communicate with the computer and observe its behavior and affect changes in its state; and a switch (S) with its control (K) which allows all the other components to communicate with one another. In the case of PDP-11, the central logical switch structure is implemented using a bus or chained switch (S) called the Unibus, as shown in Fig. 2. Each physical component has a

1A descriptive (block-diagram) level [Bell and Newell, 1971] to describe the relationship of the computer components: processors, memories, switches, controls, links, terminals and data operators.
 

switch for placing messages on the bus or taking messages off the bus. The central control decides the next component to use the bus for a message (call). The S (Unibus) differs from most switches because any component can communicate with any other component. The types of messages in the PDP-11 are along the lines of the hierarchical structure common to present day computers. The single bus makes conventional and other structures possible. The message processes in the structure which utilize S (Unibus) are:
1 The central processor (Pc) requests that data be read or written from or to primary memory (Mp) for instructions and data. The processor calls a particular memory module by concurrently specifying the module's address, and the address within the modules. Depending on whether the processor requests reading or writing, data is transmitted either from the memory to the processor or vice versa.
2 The central processor (Pc) controls the initialization of secondary memory (Ms) and terminal (T) activity. The processor sets status bits in the control associated with a particular Ms or T, and the device proceeds with the specified action (e.g., reading a card, or punching a character into paper tape). Since some devices transfer data vectors directly to primary memory, the vector control information (i.e., the memory location and length) is given as initialization information.
3 Controls request the processor's attention in the form of interrupts. An interrupt request to the processor has the effect of changing the state of the processor; thus the processor begins executing a program associated with the interrupting process. Note, the interrupt process is only a signaling method, and when the processor interruption occurs, the interruptee specifies a unique address value to the processor. The address is a starting address for a program.
4 The central processor can control the transmission of data between a control (for T or Ms) and either the processor or a primary memory for program controlled data transfers. The device signals for attention using the interrupt dialogue and the central processor responds by managing the data transmission in a fashion similar to transmitting initialization information.
5 Some device controls (for T or Ms) transfer data directly to/from primary memory without central processor intervention. In this mode the device behaves similar to a processor; a memory address is specified, and the data is transmitted between the device and primary memory.
6 The transfer of data between two controls, e.g., a secondary memory (disk) and say a terminal/T. display is not precluded, provided the two use compatible message formats.
As we show more detail in the structure there are, of course, more messages (and more simultaneous activity). The above does not describe the shared control and its associated switching which is typical of magnetic tape and magnetic disk secondary memory systems. A control for a DECtape memory (Fig. 3) has an S('DECtape bus) for transmitting data between a single tape unit and the DECtape transport. The existence of this kind of structure is based on the relatively high cost of the control relative to the cost of the tape, and on the value of being able to run concurrently with other tapes. There is also a dialogue at the periphery between X-T and X-Ms which does not use the Unibus. (For example, the removal of a magnetic tape reel from a tape unit or a human user (H) striking a typewriter key are typical of such dialogues.)
All of these dialogues lead to the hierarchy of present computers (Fig. 4). In this hierarchy we can see the paths by which the above messages are passed (Pc-Mp; Pc-K; K-Pc; Kio-T and Kio-Ms; and Kio-Mp; and, at the periphery, T-X and T-Ms; and T.console-H).

Model 20 Implementation

Figure 5 shows the detailed structure of a uni-processor, Model 20 PDP-11 with its various components (options). In Fig. 5 the Unibus characteristics are suppressed. (The detailed properties of the switch are described in the logical design section.)

Extensions to Increase Performance

The reader should note (Fig. 5) that the important limitations of the bus are: a concurrency of one, namely, only one dialogue can occur at a given time; and a maximum transfer rate of one 16-bit word per 0.75 microseconds, giving a transfer rate of 21.3 megabits/second. While the bus is not a limit for a uni-processor structure, it is a limit for multiprocessor structures. The bus also imposes an artificial limit on the system performance when high speed devices (e.g., TV cameras, disks) are transferring data to multiple primary memories. On a larger system with multiple independent memories the supply of memory cycles is 17 megabits/second times the number of modules. Since there is such a large supply of memory cycles/second and since the central processor can only

absorb approximately 16 megabits/second, the simple one-Unibus structure must be modified to make the memory cycles available. Two changes are necessary: first, each of the memory modules has to be changed so that multiple units can access each module on an independent basis; and second, there must be independent control accessing mechanisms. Figure 6 shows how a single memory is modified to have more access ports (i.e., connect to 4 Unibusses). Figure 7 shows a system with 3 independent memory modules which are accessed by 2 independent Unibusses. Note that two of the secondary memories and one of the transducers are connected to both Unibusses. It should be noted that devices which can potentially interfere with Pc-Mp accesses are constructed with two ports; for simple systems, the two ports are both connected to the same bus, but for systems with more busses, the second connection is to an independent bus. Figure 8 shows a multiprocessor system with two central processors and three Unibusses. Two of the Unibus controls are included within the two processors, and the third bus is controlled by an independent control unit. The structure also has a second switch to allow either of the two processors (Unibusses) to access common shared devices. The interrupt mechanism allows either processor to respond to an interrupt, and similarly either processor may issue initialization information on an anonymous basis. A
control unit is needed so that two processors can communicate with one another; shared primary memory is normally used to carry the body of the message. A control connected to two Pc's (see Fig. 8) can be used for reliability; either processor or Unibus could fail, and the shared Ms would still be accessible.

Higher Performance Processors

Increasing the bus width has the greatest effect on performance. A single bus limits data transmission to 21.3 megabits/second, and though Model 20 memories are 16 megabits/second, faster (or wider) data path width modules will be limited by the bus. The Model 20 is not restricted, but for higher performance processors operating on double word (fixed point) or triple word (floating point) data, two or three accesses are required for a single data type. The direct method to improve the performance is to double or triple the primary memory and central processor data path widths. Thus, the bus data rate is automatically doubled or tripled. For 32- or 48-bit memories a coupling control unit is needed so that devices of either width appear isomorphic to one another. The coupler maps a data request of a given width into a higher- or lower-width request for the bus being coupled to, as shown in Fig. 9. (The bus is limited to a fixed number of devices for electrical reasons; thus, to extend the bus a bus repeating unit is needed. The bus repeating control unit is almost identical to the bus coupler.) A computer with a 48-bit primary memory and processor and 16-bit secondary memory and terminals (transducers) is shown in Fig. 9. In summary, the design goal was to have a modular structure providing the final user with freedom and flexibility to match his needs. A secondary goal of the Unibus is open-endedness by
providing multiple busses and defining wider path busses. Finally, and most important, the Unibus is straightforward.
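The width-mapping role of the bus coupler can be sketched in a few lines of code. This is an illustrative model only: the function names, the 2-byte address step, and the low-word-first ordering are assumptions of mine, not part of the PDP-11 design.

```python
# Hedged sketch: a bus coupler mapping a wide (here 48-bit) request onto
# 16-bit Unibus transfers. Names, address step, and word order are assumptions.
WORD_BITS = 16
WORD_MASK = (1 << WORD_BITS) - 1

def wide_write(bus_write, address, value, width_bits=48):
    """Split a width_bits-wide write into consecutive 16-bit bus writes."""
    for i in range(width_bits // WORD_BITS):
        word = (value >> (WORD_BITS * i)) & WORD_MASK   # low-order word first
        bus_write(address + 2 * i, word)                # byte addressing: 2 bytes per word

def wide_read(bus_read, address, width_bits=48):
    """Assemble a width_bits-wide value from consecutive 16-bit bus reads."""
    value = 0
    for i in range(width_bits // WORD_BITS):
        value |= (bus_read(address + 2 * i) & WORD_MASK) << (WORD_BITS * i)
    return value

# A dictionary stands in for the 16-bit-wide memory on the other bus.
memory = {}
wide_write(lambda a, w: memory.__setitem__(a, w), 0o1000, 0x123456789ABC)
assert wide_read(lambda a: memory[a], 0o1000) == 0x123456789ABC
```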
The Instruction Set Processor (ISP) Level-Architecture1

Introduction, Background and Design Constraints

The Instruction Set Processor (ISP) is the machine defined by hardware and/or software which interprets programs. As such, an ISP is independent of technology and specific implementations. The instruction set is one of the least understood aspects of computer design; currently it is an art. There is currently no theory of instruction sets, although there have been attempts to construct one [Maurer, 1966], and there has also been an attempt to have a computer program design an instruction set [Haney, 1968]. We have used the conventional approach in this design: first a basic ISP was adopted and then incremental design modifications were made (based on the results of the benchmarks).2 Although the approach to the design was conventional, the resulting machine is not.

1The word architecture has been operationally defined [Amdahl, Blaauw, and Brooks, 1964] as "the attributes of a system as seen by a programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logical design and the physical implementation."

2A predecessor multiregister computer was proposed which used a similar design process. Benchmark programs were coded on each of 10 "competitive" machines, and the object of the design was to get a machine which gave the best score on the benchmarks. This approach had several fallacies: the machine had no basic character of its own; the machine was difficult to program since the multiple registers were assigned to specific functions and had inherent idiosyncrasies in order to score well on the benchmarks; the machine did not perform well for programs other than those used in the benchmark test; and finally, compilers which took advantage of the machine appeared to be difficult to write. Since all "competitive machines" had been hand-coded from a common flowchart rather than separate flowcharts for each machine, the apparent high performance may have been due to the flowchart organization.

A common classification of processors is as zero-, one-, two-, three-, or three-plus-one-address machines. This scheme has the form:
op l1, l2, l3, l4
where l1 specifies the location (address) in which to store the result of the binary operation (op) of the contents of operand locations l2 and l3, and l4 specifies the location of the next instruction.
The action of the instruction is of the form:
l1 ← l2 op l3; goto l4
The other addressing schemes assume specific values for one or more of these locations. Thus, the one-address von Neumann [Burks, Goldstine, and von Neumann, 1962] machine assumes l1 = l2 = the "accumulator" and l4 is the location following that of the current instruction. The two-address machine assumes l1 = l2; l4 is the next address.
Historically, the trend in machine design has been to move from a 1- or 2-word accumulator structure, as in the von Neumann machine, towards a machine with accumulator and index register(s).1 As the number of registers is increased, the assignment of the registers to specific functions becomes more undesirable and inflexible; thus, the general-register concept has developed. The use of an array of general registers in the processor was apparently first used in the first-generation, vacuum-tube machine PEGASUS [Elliott et al., 1956] and appears to be an outgrowth of both 1- and 2-address structures. (Two alternative structures, the early 2- and 3-address-per-instruction computers, may be disregarded, since they tend to always access primary memory for results as well as temporary storage and thus are wasteful of time and memory cycles, and require a long instruction.) The stack concept (zero-address) provides the most efficient access method for specifying algorithms, since very little space, only the access addresses and the operators, needs to be given. In this scheme the operands of an operator are always assumed to be on the "top of the stack." The stack has the additional advantage that arithmetic expression evaluation and compiler statement parsing have been developed to use a stack effectively. The disadvantage of the stack is due in part to the nature of current memory technology. That is, stack memories have to be simulated with random access memories, multiple stacks are usually required, and even though small stack memories exist, as the stack overflows, the primary memory (core) has to be used.
Even though the trend has been toward the general register concept (which, of course, is similar to a two-address scheme in which one of the addresses is limited to small values), it is important to recognize that any design is a compromise. There are situations for which any of these schemes can be shown to be "best." The IBM System/360 series uses a general register structure, and their designers [Amdahl, Blaauw, and Brooks, 1964] claim the following advantages for the scheme:
1 Registers can be assigned to various functions: base addressing, address calculation, fixed point arithmetic and indexing.
2 Availability of technology makes the general registers structure attractive.
The System/360 designers also claim that a stack organized machine such as the English Electric KDF 9 [Allmark and Lucking, 1962] or the Burroughs B5000 [Lonergan and King, 1961] has the following disadvantages:
1 Performance is derived from fast registers, not the way they are used.
2 Stack organization is too limiting and requires many copy and swap operations.
3 The overall storage of general register and stack machines is the same, considering point 2.
4 The stack has a bottom, and when placed in slower memory there is a performance loss.
5 Subroutine transparency is not easily realized with one stack.
6 Variable length data is awkward with a stack.
We generally concur with points 1, 2, and 4. Point 5 is an erroneous conclusion, and point 6 is irrelevant (that is, general register machines have the same problem). The general-register scheme also allows processor implementations with a high degree of parallelism, since all instructions of a local block can operate on several registers concurrently. A set of truly general purpose registers should also have additional uses. For example, in the DEC PDP-10, general registers are used for address integers, indexing, floating point, boolean vectors (bits), or program flags and stack pointers. The general registers are also addressable as primary memory, and thus, short program loops can reside within them and be interpreted faster. It was observed in operation that PDP-10 stack operations were very powerful and often used (accounting for as many as 20% of the executed instructions in some programs, e.g., the compilers).
The basic design decision which sets the PDP-11 apart was based on the observation that by using truly general registers and by suitable addressing mechanisms it was possible to consider the

machine as a zero-address (stack), one-address (general register), or two-address (memory-to-memory) computer. Thus, it is possible to use whichever addressing scheme, or mixture of schemes, is most appropriate.
Another important design decision for the instruction set was to have only a few data types in the basic machine, and to have a rather complete set of operations for each data type. (Alternative designs might have more data types with few operations, or few data types with few operations.) In part, this was dictated by the machine size. The conversion between data types must be easily accomplished either automatically or with 1 or 2 instructions. The data types should also be sufficiently primitive to allow other data types to be defined by software (and by hardware in more powerful versions of the machine). The basic data type of the machine is the 16 bit integer which uses the two's complement convention for sign. This data type is also identical to an address.
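As a small aside, the 16-bit two's complement convention mentioned above can be illustrated directly; the helper names below are mine, and the snippet is only a sketch of the sign convention and of the fact that the same 16-bit word doubles as an address.

```python
# Hedged sketch: the 16-bit two's complement convention.
def to_signed16(word):
    """Interpret a 16-bit pattern as a two's complement integer."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word

def to_word16(value):
    """Encode a signed integer as a 16-bit two's complement pattern."""
    return value & 0xFFFF

assert to_signed16(0xFFFF) == -1        # all ones is -1
assert to_signed16(0x8000) == -32768    # sign bit set: most negative value
assert to_word16(-2) == 0xFFFE
# The same 16-bit word also serves directly as a byte address (0..65535).
```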

PDP-11 Model 20 Instruction Set (Basic Instruction Set)
A formal description of the basic instruction set is given in Appendix 1 using the ISP notation [Bell and Newell, 1971]. The remainder of this section will discuss the machine in a conventional manner.

Primary Memory. The primary memory (core) is addressed as either 2^16 bytes or 2^15 words using a 16 bit number. The linear address space is also used to access the input-output devices. The device state, data and control registers are read or written like normal memory locations.

General Register. The general registers are named: R[0:7] <15:0>1; that is, there are 8 registers each with 16 bits. The naming is done starting at the left with bit 15 (the sign bit) to the least significant bit 0. There are synonyms for R[6] and R[7]:

Stack Pointer/SP<15:0> : = R[6]<15:0>. Used to access a special stack which is used to store the state of interrupts, traps and subroutine calls
Program Counter/PC<15:0> : = R[7]<15:0>. Points to the current instruction being interpreted. It will be seen that the fact that PC is one of the general registers is crucial to the design.

Any general register, R[0:7], can be used as a stack pointer. The special Stack Pointer (SP) has additional properties that force it to be used for changing processor state: interrupts, traps, and subroutine calls. (It can also be used to control dynamic temporary storage for subroutines.)



1A definition of the ISP notation used here may be found in Chapter 4.
In addition to the above registers there are 8 bits used (from a possible 16) for processor status, called the PS<15:0> register. Four bits are the Condition Codes (CC) associated with arithmetic results; the T-bit controls tracing; and three bits control the priority of running programs, Priority<2:0>. Individual bits are mapped in PS as shown in Appendix 1.
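A short sketch of reading the PS fields follows. Appendix 1 is not reproduced here, so the bit positions (priority in bits 7:5, T in bit 4, and N, Z, V, C in bits 3:0) are taken from the conventional PDP-11 layout and should be treated as an assumption.

```python
# Hedged sketch: extracting the processor status (PS) fields.
# Bit positions assume the conventional PDP-11 layout (priority<7:5>,
# T<4>, N<3>, Z<2>, V<1>, C<0>); Appendix 1 is not reproduced here.
def decode_ps(ps):
    return {
        "priority": (ps >> 5) & 0o7,  # three-bit processor priority
        "T": (ps >> 4) & 1,           # trace-trap enable
        "N": (ps >> 3) & 1,           # negative result
        "Z": (ps >> 2) & 1,           # zero result
        "V": (ps >> 1) & 1,           # overflow
        "C": ps & 1,                  # carry
    }

print(decode_ps(0o377))  # priority 7, T = 1, N = Z = V = C = 1
```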

Data Types and Primitive Operations. There are two data lengths in the basic machine: bytes and words, which are 8 and 16 bits, respectively. The non-trivial data types are word length integers (w.i.); byte length integers (by.i); word length boolean vectors (w.bv), i.e., 16 independent bits (booleans) in a 1 dimensional array; and byte length boolean vectors (by.bv). The operations on byte and word boolean vectors are identical. Since a common use of a byte is to hold several flag bits (booleans), the operations can be combined to form the complete set of 16 operations. The logical operations are: "clear," "complement," "inclusive or," and "implication" (x ⊃ y, i.e., ¬x ∨ y).
There is a complete set of arithmetic operations for the word integers in the basic instruction set. The arithmetic operations are: add, subtract, multiply (optional), divide (optional), compare, add one, subtract one, clear, negate, and multiply and divide by powers of two (shift). Since the address integer size is 16 bits, these data types are most important. Byte length integers are operated on as words by moving them to the general registers where they take on the value of word integers. Word length integer operations are carried out and the results are returned to memory (truncated).
The floating point instructions defined by software (not part of the basic instruction set) require the definition of two additional data types (of length two and three), i.e., double word (d.w.) and triple (t.w.) words. Two additional data types, double integer (d.i.) and triple floating point (t.f. or f) are provided for arithmetic. These data types imply certain additional operations and the conversion to the more primitive data types.

Address (Operand) Calculation. The general methods provided for accessing operands are the most interesting (perhaps unique) part of the machine's structure. By defining several access methods to a set of general registers, to memory, or to a stack (controlled by a general register), the computer is able to be a 0, 1 and 2 address machine. The encoding of the instruction Source (S) fields and Destination (D) fields are given in Fig. 10 together with a list of the various access modes that are possible. (Appendix 1 gives a formal description of the effective address calculation process.)
It should be noted from Fig. 10 that all the common access modes are included (direct, indirect, immediate, relative, indexed, and indexed indirect) plus several relatively uncommon ones. Relative (to PC) access is used to simplify program loading,


while immediate mode speeds up execution. The relatively uncommon access modes, auto-increment and auto-decrement, are used for two purposes: access to stack under control of the registers1 and access to bytes or words organized as strings or vectors. The indirect access mode allows a stack to hold addresses of data (instead of data). This mode is desirable when manipulating longer and variable-length data types (e.g., strings, double fixed and triple floating point). The register auto increment mode may be used to access a byte string; thus, for example, after each access, the register can be made to point to the next data item. This is used for moving data blocks, searching for particular elements of a vector, and byte-string operations (e.g., movement, comparisons, editing). This addressing structure provides flexibility while retaining the same, or better, coding efficiency than classical machines. As
an example of the flexibility possible, consider the variations possible with the most trivial word instruction, MOVE (see Fig. 11). The MOVE instruction is coded as it would appear in conventional 2-address, 1-address (general register) and 0-address (stack) computers. The two-address format is particularly nice for MOVE, because it provides an efficient encoding for the common operation A ← B (note, the stack and general registers are not involved). The vector move A[I] ← B[I] is also efficiently encoded. For the general register (and 1-address format), there are about 13 MOVE operations that are commonly used. Six moves can be encoded for the stack (about the same number found in stack machines).

Instruction Formats. There are several instruction decoding formats depending on whether 0, 1, or 2 operands have to be explicitly referenced. When 2 operands are required, they are identified as Source/S and Destination/D, and the result is placed at Destination/D. For single operand instructions (unary operators) the instruction action is D ← u D; and for two operand instructions (binary operators) the action is D ← D b S (where u and b are unary and binary operators, e.g., ¬, - and +, -, ×, /,
respectively). Instructions are specified by a 16-bit word. The most common binary operator format (that for operations requiring two addresses) is shown below.
15 . . . 12 | 11 . . . 6 | 5 . . . 0
    op      |     S      |     D
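Assuming the field layout just shown (op in bits 15:12, S in bits 11:6, D in bits 5:0, each specifier split into a 3-bit mode and a 3-bit register), the hedged sketch below decodes a two-operand word and shows the condition-code rules described later in this section for a 16-bit add. The helper names are mine; no particular opcode assignment is assumed beyond the MOV example noted in the comment.

```python
# Hedged sketch: decoding a two-operand PDP-11 instruction word and
# computing condition codes for a 16-bit add (D <- D + S).
def decode_double_operand(word):
    return {
        "op":       (word >> 12) & 0o17,  # 4-bit operation code
        "src_mode": (word >> 9) & 0o7,    # source addressing mode
        "src_reg":  (word >> 6) & 0o7,    # source register
        "dst_mode": (word >> 3) & 0o7,    # destination addressing mode
        "dst_reg":   word       & 0o7,    # destination register
    }

def add16(d, s):
    """Return (result, condition codes) for a 16-bit two's complement add."""
    full = (d & 0xFFFF) + (s & 0xFFFF)
    result = full & 0xFFFF
    n = (result >> 15) & 1                                  # negative
    z = int(result == 0)                                    # zero
    c = (full >> 16) & 1                                    # carry out of bit 15
    # overflow: both operands share a sign that differs from the result's sign
    v = int(((d ^ result) & (s ^ result) & 0x8000) != 0)
    return result, {"N": n, "Z": z, "V": v, "C": c}

# MOV R2, R4 is encoded as octal 010204 on the PDP-11; decoding it yields
# op = 1, source mode 0 / register 2, destination mode 0 / register 4.
print(decode_double_operand(0o010204))
print(add16(0x7FFF, 1))   # 0x8000 with N = 1 and V = 1: signed overflow
```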
The other instruction formats are given in Fig. 12. Instruction Interpretation Process. The instruction interpretation process is given in Fig. 13, and follows the common fetch-execute cycle. There are three major states: (1) interrupting: the PC and PS are placed on the stack accessed by the Stack Pointer/SP, and the new state is taken from an address specified by the source requesting the trap or interrupt; (2) trace (controlled by the T-bit): essentially one instruction at a time is executed, as a trace trap occurs after each instruction; and (3) normal instruction interpretation. The five (lower) states in the diagram are concerned with instruction fetching, operand fetching, executing the operation specified by the instruction, and storing the result. The non-trivial details for fetching and storing the operands are not shown in the diagram but can be constructed from the effective address calculation process (Appendix 1). The state diagram, though simplified, is similar to that of 2- and 3-address computers, but is distinctly different from that of a 1-address (1-accumulator) computer.
The ISP description (Appendix 1) gives the operation of each of the instructions, and the more conventional diagram (Fig. 12) shows the decoding of instruction classes. The ISP description is somewhat incomplete; for example, the add instruction is defined as: ADD (:= bop 0010) → (CC, D ← D + S), which means that whenever a binary opcode (bop) of 0010 (base 2) occurs, the ADD instruction is executed with the above effect; the expression does not exactly describe the changes to the Condition Codes/CC. In general, the CC are based on the result; that is, Z is set if the result is zero, N if it is negative, C if a carry occurs, and V if an overflow was detected as a result of the operation. Conditional branch instructions may thus follow the arithmetic instruction to test the results via the CC bits.

Examples of Addressing Schemes

Use as a Stack (Zero Address) Machine. Figure 14 lists typical zero-address machine instructions together with the PDP-11 instructions which perform the same function. It should be noted that translation (compilation) from normal infix expressions to reverse Polish is a comparatively trivial task. Thus, one of the primary reasons for using stacks is for the evaluation of expressions in reverse Polish form.

Consider an assignment statement of the form D ← A + B/C
which has the reverse Polish form
D A B C / + ←
and would normally be encoded on a stack machine as follows
load stack address of D
load stack A
load stack B
load stack C
/
+
store
However, with the PDP-11 there is an address method for improving the program encoding and run time, while not losing the stack concept. An encoding improvement is made by doing an operation to the top of the stack from a direct memory location (while loading). Thus the previous example could be coded as:
load stack B
divide stack by C
add A to stack
store stack D
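For readers who want to see the zero-address idea executed, the following sketch (plain Python, not PDP-11 code; names are mine) evaluates D ← A + B/C from its reverse Polish form with an explicit stack, mirroring the first encoding above.

```python
# Hedged sketch: evaluating D <- A + B / C from reverse Polish form with an
# explicit operand stack, mirroring the zero-address encoding above.
def eval_rpn(tokens, mem):
    stack = []
    for t in tokens:
        if t == "/":
            b = stack.pop(); a = stack.pop(); stack.append(a / b)
        elif t == "+":
            b = stack.pop(); a = stack.pop(); stack.append(a + b)
        elif t == "store":
            value = stack.pop(); dest = stack.pop(); mem[dest] = value
        else:
            # a bare name loads its value; "&name" loads its address
            stack.append(t[1:] if t.startswith("&") else mem[t])
    return mem

mem = {"A": 10, "B": 12, "C": 4}
# load address of D, load A, load B, load C, /, +, store
eval_rpn(["&D", "A", "B", "C", "/", "+", "store"], mem)
print(mem["D"])   # 13.0
```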
Use as a One-Address (General Register) Machine. The PDP-11 is a general register computer and should be judged on that basis. Benchmarks have been coded to compare the PDP-11 with the larger DEC PDP-10. A 16 bit processor performs better than the DEC PDP-10 in terms of bit efficiency, but not with time or memory cycles. A PDP-11 with a 32 bit wide memory would, however, decrease time by nearly a factor of two, making the times essentially comparable.
Use as a Two-Address Machine. Figure 15 lists typical two-address machine instructions together with the equivalent PDP-11 instructions for performing the same operations. The most useful instruction is probably the MOVE instruction because it does not use the stack or general registers. Unary instructions which operate on and test primary memory are also useful and efficient instructions.

Extensions of the Instruction Set for Real (Floating Point) Arithmetic

The most significant factor that affects performance is whether a machine has operators for manipulating data in a particular format. The inherent generality of a stored program computer allows any computer, by subroutine, to simulate another, given enough time and memory. The biggest and perhaps only factor that separates a small computer from a large computer is whether floating point data is understood by the computer. For example, a small computer with a cycle time of 1.0 microseconds and 16 bit memory width might have the following characteristics for a floating point add, excluding data accesses:
Programmed: 250 microseconds
Programmed (with special normalize and differencing-of-exponent instructions): 75 microseconds
Microprogrammed hardware: 25 microseconds
Hardwired: 2 microseconds
It should be noted that the ratios between programmed and hardwired interpretation vary by roughly two orders of magnitude. The basic hardwiring scheme and the programmed scheme should allow binary program compatibility, assuming there is an interpretive program for the various operators in the Model 20. For example, consider one scheme which would add eight 48 bit registers which are addressable in the extended instruction set.

The eight floating registers, F, would be mapped into eight double length (32 bit) registers, D. In order to access the various parts of F or D registers, registers F0 and F1 are mapped onto registers R0 to R2 and R3 to R5. Since the instruction set operation code is almost completely encoded already for byte and word length data, a new encoding scheme is necessary to specify the proposed additional instructions. This scheme adds two instructions: enter floating point mode and execute one floating point instruction. The instructions for floating point and double word data would be:
 
                               floating point /f                  double word /d
binary ops
bop' D ← + - × / compare       FMOVE FADD FSUB FMUL FDIV FCMP     DMOVE DADD DSUB DMUL DDIV DCMP
unary ops
uop' D ← -                     FNEG                               DNEG
Logical Design of S(Unibus) and Pc

The logical design level is concerned with the physical implementation and the constituent combinatorial and sequential logic elements which form the various computer components (e.g., processors, memories, controls). Physically, these components are separate and connected to the Unibus following the lines of the PMS structure.

Unibus Organization

Figure 16 gives a PMS diagram of the Pc and the entering signals from the Unibus. The control unit for the Unibus, housed in Pc for the Model 20, is not shown in the figure.
The PDP-11 Unibus has 56 bi-directional signals conventionally used for program-controlled data transfers (processor to control), direct-memory data transfers (processor or control to memory) and control-to-processor interrupts. The Unibus is interlocked; thus transactions operate independently of the bus length and the response time of the master and slave. Since the bus is bi-directional and is used by all devices, any device can communicate with any other device. The controlling device is the master, and the device to which the master is communicating is the slave. For example, a data transfer from processor (master) to memory (always a slave) uses the Data Out dialogue facility for writing, and a transfer from memory to processor uses the Data In dialogue facility for reading.

Bus Control. Most of the time the processor is bus master, fetching instructions and operands from memory and storing results in memory. Bus mastership is determined by the current processor priority, the priority line upon which a bus request is made, and the physical placement of a requesting device on the linked bus. The assignment of bus mastership is done concurrently with normal communication (dialogues).

Unibus Dialogues

Three types of dialogues use the Unibus. All the dialogues have a common protocol which first consists of obtaining the bus mastership (which is done concurrently with a previous transaction)
followed by a data exchange with the requested device. The dialogues are: Interrupt; Data In and Data In Pause; and Data Out and Data Out Byte.

Interrupt. Interrupt can be initiated by a master immediately after receiving bus mastership. An address is transmitted from the master to the slave on Interrupt. Normally, subordinate control devices use this method to transmit an interrupt signal to the processor.

Data In and Data In Pause. These two bus operations transmit slave's data (whose address is specified by the master) to the master. For the Data In Pause operation data is read into the master and the master responds with data which is to be rewritten in the slave.

Data Out and Data Out Byte. These two operations transfer data from the master to the slave at the address specified by the master. For Data Out a word at the address specified by the address lines is transferred from master to slave. Data Out Byte allows a single data byte to be transmitted.
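A toy model of these dialogues may make the master/slave roles concrete. The code below is illustrative only: the function names are invented, and placing the low-order byte at the even address is an assumption consistent with the memory organization described in the following paper.

```python
# Hedged sketch of the three Unibus dialogue families. A master addresses a
# slave; Interrupt carries an address (the start of a service routine) to the Pc.
def interrupt(processor, vector_address):
    """Interrupt dialogue: a device (master) sends an address to the Pc (slave)."""
    processor["pending_vector"] = vector_address

def data_in(slave, addr):
    """Data In: the slave's word at the master-supplied address goes to the master."""
    return slave[addr]

def data_in_pause(slave, addr, new_word):
    """Data In Pause: read a word, then rewrite it from the master before any other transfer."""
    old = slave[addr]
    slave[addr] = new_word & 0xFFFF
    return old

def data_out(slave, addr, word):
    """Data Out: a full word from master to slave."""
    slave[addr] = word & 0xFFFF

def data_out_byte(slave, addr, byte):
    """Data Out Byte: only the addressed byte of the word is replaced."""
    word = slave[addr & ~1]
    if addr & 1:                       # odd address: high-order byte (assumption)
        word = (word & 0x00FF) | ((byte & 0xFF) << 8)
    else:                              # even address: low-order byte
        word = (word & 0xFF00) | (byte & 0xFF)
    slave[addr & ~1] = word

memory = {0o2000: 0o177777}
data_out_byte(memory, 0o2001, 0)            # clear only the high-order byte
print(oct(data_in(memory, 0o2000)))         # 0o377
print(data_in_pause(memory, 0o2000, 0))     # returns 255, then rewrites the word with 0
pc = {}
interrupt(pc, 0o060)                        # a device supplies its service-routine address
print(pc)
```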

Processor Logical Design
The Pc is designed using TTL logical design components and occupies approximately eight 8" × 12" printed circuit boards. The organization of the logic is shown in Fig. 16. The Pc is physically connected to two other components, the console and the Unibus. The control for the Unibus is housed in the Pc and occupies one of the printed circuit boards. The most regular part of the Pc, the arithmetic and state section, is shown at the top of the figure. The 16-word scratch-pad memory and combinatorial logic data operators, D(shift) and D(adder, logical ops), form the most regular part of the processor's structure. The 16-word memory holds most of the 8-word processor state found in the ISP, and the 8 bits that form the Status word are stored in an 8-bit register. The input to the adder-shift network has two latches which are either memories or gates. The output of the adder-shift network can be read to either the data or address parts of the Unibus, or back to the scratch-pad array.
The instruction decoding and arithmetic control are less regular than the above data and state sections, and these are shown in the lower part of the figure. There are two major sections: the instruction fetching and decoding control, and the instruction set interpreter (which in effect defines the ISP). The latter control section operates on, hence controls, the arithmetic and state parts of the Pc. A final control is concerned with the interface to the Unibus (distinct from the Unibus control that is housed in the Pc).

Conclusions

In this paper we have endeavored to give a complete description of the PDP-11 Model 20 computer at four descriptive levels. These present an unambiguous specification at two levels (the PMS structure and the ISP), and, in addition, specify the constraints for the design at the top level and give the reader some idea of the implementation at the bottom (logical design) level. We have also presented guidelines for forming additional models that would belong to the same family.





 


 


            Implementation and Performance Evaluation of the PDP-11 Family
_____________________________________________________________________________

In order that methodologies useful in the design of complex systems may be developed, existing designs must be studied. The DEC PDP-11 was selected for a case study because there are a number of designs (eight are considered here), because the designs span a wide range in basic performance (7 to 1) and component technology (bipolar SSI to MOS LSI), and because the designs represent relatively complex systems.
The goals of the chapter are twofold: (1) to provide actual data about design tradeoffs and (2) to suggest design methodologies based on these data. An archetypical PDP-11 implementation is described.
Two methodologies are presented. A top-down approach uses micro-cycle and memory-read-pause times to account for 90 percent of the variation in processor performance. This approach can be used in initial system planning. A bottom-up approach uses relative frequency of functions to determine the impact of design tradeoffs on performance. This approach can be used in design-space exploration of a single design. Finally, the general cost/performance design tradeoffs used in the PDP-11 are summarized.

1. Introduction
 
As semiconductor technology has evolved, the digital systems designer has been presented with an ever-increasing set of primitive components from which to construct systems: standard SSI, MSI, and LSI, as well as custom LSI components. This expanding choice makes it more difficult to arrive at a near-optimal cost/performance ratio in a design. In the case of highly complex systems, the situation is even worse, since different primitives may be cost-effective in different subareas of such systems.
Historically, digital system design has been more of an art than a science. Good designs have evolved from a mixture of experience, intuition, and trial and error. Only rarely have design methodologies been developed (among those that have are two-level combinational logic minimization and wire-wrap routing schemes, for example). Effective design methodologies are essential for the cost-effective design of more complex systems. In addition, if the methodologies are sufficiently detailed, they can be applied in high-level design automation systems [Siewiorek and Barbacci, 1976].
Design methodologies may be developed by studying the results of the human design process. There are at least two ways to study this process. The first involves a controlled design experiment where several designers perform the same task. By contrasting the results, the range of design variation and technique can be established [Thomas and Siewiorek, 1977]. However, this approach is limited to fairly small design situations because of the redundant use of the human designers.
The second approach examines a series of existing designs that meet the same functional specification while spanning a wide range of design constraints in terms of cost, performance, etc. This paper considers the second approach and uses the DEC PDP-11 minicomputer line as a basis of study. The PDP-11 was selected on account of the large number of implementations (eight are considered here) with designs spanning a wide range in performance (roughly 7 to 1) and component technology (bipolar SSI, MSI, MOS custom LSI). The designs are relatively complex and seem to embody good design tradeoffs as ultimately reflected by their price/performance and commercial success.
Attention here is focused mainly upon the CPU. Memory performance enhancements such as caching are considered only insofar as they impinge upon CPU performance.
This paper is divided into three major parts. The first part (Sec. 2) provides an overview of the PDP-11 functional specification (its architecture) and serves as background for subsequent discussion of design tradeoffs. The second part (Sec. 3) presents an archetypical implementation. The last part (Secs. 4 and 5) presents methodologies for determining the impact of various design parameters on system performance. The magnitude of the impact is quantified for several parameters, and the use of the results in design situations is discussed.

2. Architectural Overview

The PDP-11 family is a set of small- to medium-scale stored-program central processors with compatible instruction sets [Bell et al., 1970]. The family evolution in terms of increased performance, constant cost, and constant performance successors is traced in Fig. 1.2 Since the 11/45, 11/55, and 11/70 use the same processor, only the 11/45 is treated in this study.
A PDP-11 system consists of three parts: a PDP-11 processor, a collection of memories and peripherals, and a link called the Unibus over which they all communicate (Fig. 2).
A number of features, not otherwise considered here, are available as options on certain processors. These include memory management and floating-point arithmetic. The next three sub-

sections summarize the major architectural features of the PDP-11, including memory organization, processor state, addressing modes, instruction set, and Unibus protocol. The references list a number of processor handbooks and other documents which provide a more precise definition of the PDP-11 architecture than is possible here.

2.1 Memory and Processor State

The central processor contains the control logic and data paths for instruction fetching and execution. Processor instructions act upon operands located either in memory or in one of eight general registers. These operands may be either 8-bit bytes or 16-bit words. Memory is byte- or word-addressable. Word addresses must be even. If N is a word address, then N is the byte address of the low-order byte of the word and N + 1 is the byte address of the high-order byte of the word. The control and data registers of peripheral devices are also accessed through the memory address space, and the top 4 kilowords of the space are reserved for this purpose. The general registers are 16 bits in length and are referred to as R0 through R7. R6 is used as the system stack pointer (SP) to maintain a push-down list in memory upon which subroutine and
interrupt linkages are kept. R7 is the program counter (PC) and always points to the next instruction to be fetched from memory. With minor exceptions (noted below) the SP and PC are accessible in exactly the same manner as any of the other general registers (R0 through R5). Data-manipulation instructions fall into two categories: arithmetic instructions (which interpret their operands as 2's complement integers) and logic instructions (which interpret their operands as bit vectors). A set of condition code flags is maintained by the processor and is updated according to the sign and presence of carry/overflow from the result of any data manipulation instruction. The condition codes, processor interrupt priority, and a flag enabling program execution tracing are contained in a processor status word (PS), which is accessible as a word in the memory addressing space.

2.2 Addressing Modes and Instruction Set

The PDP-11 instruction set allows source and destination operands to be referenced via eight different addressing modes. An operand reference consists of a field specifying which of the eight modes is to be used and a second field specifying which of the eight general registers is to be used. The addressing modes are:
Mode 0 Register. The operand is contained in the specified register.
Mode 1 Register deferred. The contents of the specified register are used to address the memory location containing the operand.
Mode 2 Autoincrement. The contents of the specified register are used to address the memory location containing the operand, and the register is then incremented.
Mode 3 Autoincrement deferred. The contents of the specified register address a word in memory containing the address of the operand in memory. The specified register is incremented after the reference.
Mode 4 Autodecrement. The contents of the specified register are first decremented and then used to address the memory location containing the operand.
Mode 5 Autodecrement deferred. The contents of the specified register are first decremented and then used to address a word in memory containing the address of the operand in memory.
Mode 6 Indexed. The word following the instruction is fetched and added to the contents of the specified general register to form the address of the memory location containing the operand.
Mode 7 Indexed deferred. The word following the instruction is fetched and added to the contents of the specified general register to form the address of a word in memory containing the address of the operand in memory.

The various addressing modes simplify the manipulation of diverse data structures such as stacks and tables. When used with the program counter these modes enable immediate operands and absolute and PC-relative addressing. The deferred modes permit indirect addressing.
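The eight modes can be summarized by a short effective-address sketch. It is illustrative only: word (2-byte) operands are assumed, register-mode operands are reported separately since they have no memory address, and the names are mine.

```python
# Hedged sketch: operand-address calculation for the eight addressing modes
# described above, assuming word (2-byte) operands.
def read_word(mem, addr):
    return mem.get(addr, 0)

def effective_address(mode, reg, R, mem):
    """Return ('register', reg) for mode 0, else ('memory', address).
    R is the register file; R[7] (the PC) points past the instruction word."""
    if mode == 0:                                   # register
        return ("register", reg)
    if mode == 1:                                   # register deferred
        return ("memory", R[reg])
    if mode == 2:                                   # autoincrement
        addr = R[reg]; R[reg] = (R[reg] + 2) & 0xFFFF
        return ("memory", addr)
    if mode == 3:                                   # autoincrement deferred
        ptr = R[reg]; R[reg] = (R[reg] + 2) & 0xFFFF
        return ("memory", read_word(mem, ptr))
    if mode == 4:                                   # autodecrement
        R[reg] = (R[reg] - 2) & 0xFFFF
        return ("memory", R[reg])
    if mode == 5:                                   # autodecrement deferred
        R[reg] = (R[reg] - 2) & 0xFFFF
        return ("memory", read_word(mem, R[reg]))
    if mode == 6:                                   # indexed: word following the instruction
        index = read_word(mem, R[7]); R[7] = (R[7] + 2) & 0xFFFF
        return ("memory", (R[reg] + index) & 0xFFFF)
    if mode == 7:                                   # indexed deferred
        index = read_word(mem, R[7]); R[7] = (R[7] + 2) & 0xFFFF
        return ("memory", read_word(mem, (R[reg] + index) & 0xFFFF))

R = [0] * 8
R[3], R[7] = 0o1000, 0o500                   # R3 points at a table; PC past the opcode
mem = {0o500: 4, 0o1004: 0o2000}             # index word 4; table entry holds an address
print(effective_address(2, 3, R, mem))       # autoincrement: ('memory', 0o1000); R3 becomes 0o1002
R[3] = 0o1000
print(effective_address(7, 3, R, mem))       # indexed deferred: ('memory', 0o2000)
```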
The PDP-11 instruction set is made up of the following types of instructions:

Single-operand instructions. A destination operand is fetched by the CPU, modified in accordance with the instruction, and then restored to the destination.
Double-operand instructions. A source operand is fetched, followed by the destination operand. The appropriate operation is performed on the two operands and the result restored to the destination. In a few double-operand instructions, such as Exclusive OR (XOR), source mode 0 (register addressing) is implicit.
Branch instructions. The condition specified by the instruction is checked, and if it is true, a branch is taken using a field contained in the instruction as a displacement from the current instruction address.
Jumps. Jump instructions allow sequential program flow to be altered either permanently (in a jump) or temporarily (in a jump to subroutine).
Control, trap, and miscellaneous instructions. Various instructions are available for subroutine and interrupt returns, halts, etc.
Floating-point instructions. A floating-point processor is available as an option with several PDP-11 CPUs. Floating-point implementation will not be considered in this paper.

For the purpose of looking at the instruction execution cycle of the various PDP-11 processors, each cycle shall be broken into five distinct phases:1

Fetch. This phase consists of fetching the current instruction from memory and interpreting its opcode.
Source. This phase entails fetching the source operand for double-operand instructions from memory or a general register and loading it into the appropriate register in the data paths in preparation for the execute phase.
Destination. This phase is used to get the destination operand for single- and double-operand instructions into the data paths for manipulation in the execute phase. For JMP and JSR instructions the jump address is calculated.
Execute. During this phase the operation specified by the current instruction is performed and any result rewritten into the destination.
Service. This phase is only entered between execution of the last instruction and fetch of the next to grant a pending bus request, acknowledge an interrupt, or enter console mode after the execution of a HALT instruction or activation of the console halt key.
1N. B.: The instruction phase names are identical to those used by DEC; however, their application here to a state within a given machine may differ from DEC's since the intent here is to make the discussion consistent over all machines.
2.3 The Unibus
 
 
All communication among the components of a PDP-11 system takes place on a set of bidirectional lines referred to collectively as the Unibus. (The LSI-11 is an exception and uses an adaptation of the Unibus.) The Unibus lines carry address, data, and control signals to all memories and peripherals attached to the CPU. Transactions on the Unibus are asynchronous with the processor. At any given time there will be one device in control of the bus, the bus master; the device which it addresses becomes the bus slave. This communication may consist of data transfers or, in the case where the processor is slave, an interrupt request. The data transfers which may be initiated by the master are:
DATO Data out-A word is transferred from master to slave.
DATOB Data out, byte-A byte is transferred from master to slave.
DATI Data in-A word is transferred from slave to master.
DATIP Data in, pause-A word is transferred from slave to master and the slave awaits a transfer from master back to slave to replace the information
that was read. The Unibus control allows no other data transfer to intervene between the read and the write cycles. This makes possible the reading and alteration of a memory location as an indivisible operation. In addition it permits the use of a read/modify/write cycle with core memories in place of the longer sequence of a read cycle followed by a write cycle.
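A minimal sketch of why DATIP matters: the read and the subsequent write form one bus transaction, so a location can be read and altered indivisibly. The class and method names below are illustrative, not DEC code.

```python
# Hedged, illustrative sketch: DATIP as an indivisible read-modify-write,
# contrasted with a separate DATI followed by a DATO.
class Unibus:
    def __init__(self):
        self.mem = {}
        self.locked = False          # stands in for "no other transfer may intervene"
    def dati(self, addr):
        assert not self.locked
        return self.mem.get(addr, 0)
    def dato(self, addr, word):
        assert not self.locked
        self.mem[addr] = word & 0xFFFF
    def datip(self, addr, modify):
        self.locked = True           # bus control admits no intervening transfer
        old = self.mem.get(addr, 0)
        self.mem[addr] = modify(old) & 0xFFFF
        self.locked = False
        return old

bus = Unibus()
bus.dato(0o4000, 7)
bus.datip(0o4000, lambda w: w + 1)   # e.g., an indivisible counter increment
print(bus.dati(0o4000))              # 8
```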
3. PDP-11 Implementation
 
The midrange PDP-11s have comparable implementations, yet their performances vary by a factor of 7. This section discusses the features common to these implementations and the variations found between machines which provide the dimensions along which they may be characterized.

3.1 Common Implementation Features
All PDP-11 implementations can be decomposed into a set of data paths and a control unit. The data paths store and operate upon byte and word data and interface to the Unibus, which permits

them to read from and write to memory and peripheral devices. The control unit provides all the signals necessary to evoke the appropriate operations in the data paths and Unibus interface. All PDP-11's have comparable data-path and control-unit implementations that allow them to be contrasted in a uniform way. In this section a basis for comparing these machines shall be established and used to characterize them.

3.1.1 Data Paths. An archetype may be constructed from which the data paths of all midrange PDP-11's differ but minimally. This archetype is diagrammed in Fig. 3. All major registers and processing elements, as well as the links and switches which interconnect them, are indicated. The data-path illustrations for individual implementations are shown in Figs. 5 through 7. These figures are laid out in a common format to encourage comparison. Note that with very few exceptions all data paths are 16 bits wide (the PDP-11 word size). The heart of the data paths is the arithmetic logic unit or ALU, through which all data circulate and where most of the processing actually takes place. Among the operations performed by the ALU are addition, subtraction, 1's and 2's complementation, and logical ANDing and ORing.
The inputs to the ALU are the A leg and the B leg. The A leg is normally fed from a multiplexer (Aleg MUX), which may select from an operand supplied to it from the scratch-pad memory (SPM) and possibly from a small set of constants and/or the processor status register (PS). The B leg also is typically fed from its own MUX (Bleg MUX), its selections being among the B register and certain constants. In addition, the Bleg MUX may be configured so that byte selection, sign extension, and other functions may be performed on the operand which it supplies to the ALU. Following the ALU is a multiplexer (the AMUX) typically used to select between the output of the ALU, the data lines of the Unibus, and certain constants. The output of the AMUX provides
the only feedback path in all midrange PDP-11 implementations except the 11/60 and acts as an input to all major processor registers. The internal registers lie at the beginning of the data paths. The instruction register (IR) contains the current instruction. The bus address register (BA) holds the address placed on the Unibus by the processor. The program status register (PS) contains the processor priority, memory-management-unit modes, condition code flags, and instruction trace-trap enable bit. The scratch-pad memory (SPM) is an array of 16 individually addressable registers which include the general registers (R0 to R7) plus a number of internal registers not accessible to the programmer. The B register (Breg) is used to hold the B leg operand supplied to the ALU. The variations from this archetype are surprisingly minor. The most frequently used elements (such as the ALU and SPM) are relatively fixed in their position in the data paths from implementation to implementation. Elements which are less frequently used, and hence have less of an impact on performance, can be seen to occupy positions which vary more between implementations. Variations to be encountered include routings for the bus address and processor status register; the point of generation for certain constants; the position of the byte swapper, sign extender, and rotate/shift logic; and the use of certain auxiliary registers present in some designs and not others.

3.1.2 Control Unit. The control unit for all PDP-11 processors (with the exception of the PDP-11/20) is microprogrammed [Wilkes and Stringer, 1953]. The considerations leading to the use of this style of control implementation in the PDP-11 are discussed in O'Loughlin [1975]. The major advantage of microprogramming is flexibility in the derivation of control signals to gate register transfers, to synchronize with Unibus logic, to control microcycle timing, and to evoke changes in control flow. The way in which a microprogrammed control unit accomplishes all of these actions impacts performance. Figure 4 represents the archetypical PDP-11 microprogrammed control unit. The contents of the microaddress register determine the current control-unit state and are used to access the next microinstruction word from the control store. Pulses from the clock generator strobe the microword and microaddress registers, loading them with the next microword and next microaddress, respectively. Repeated clock pulses thus cause the control unit to sequence through a series of states. The period spent by the control unit in one state is called a microcycle (or simply cycle when this does not lead to confusion with memory or instruction cycles), and the duration of the state as determined by the clock is known as the cycle time. The microword register shortens cycle time by allowing the next microword to be fetched from the control store while the current microword is being used.

Most of the fields of the microword supply signals for conditioning and clocking the data paths. Many of the fields act directly or with a small amount of decoding, supplying their signals to multiplexers and registers to select routings for data and to enable registers to shift, increment, or load on the master clock. Other fields are decoded according to the state of the data paths. An instance of this is the use of auxiliary ALU control logic to generate function-select signals for the ALU as a function of the instruction contained in the IR. Performance as determined by microcycle count is in large measure established by the connectivity of the data paths and the degree to which their functionality can be evoked by the data-path control fields of the microprogram word.
The complexity of the clock logic varies with each implementation. Typically the clock is fixed at a single period and duty cycle; however, processors such as the 11/34 and 11/40 can select from two or three different clock periods for a given cycle depending upon a field in the microword register. This can significantly improve performance in machines where the longer cycles are necessary only infrequently.
The clock logic must provide some means for synchronizing processor and Unibus operation, since the two operate asynchronously with respect to one another. Two alternate approaches are employed in midrange implementations. Interlocked operation, the simpler approach, shuts off the processor clock when a Unibus operation is initiated and turns it back on when the operation is complete. This effectively keeps microprogram flow and Unibus operation in lockstep with no overlap. Overlapped operation is a somewhat more involved approach which continues processor clocking after a DATI or DATIP is initiated. The microinstruction requiring the result of the operation has a function bit set which turns off the processor clock until the result is available. This approach makes it possible for the processor to continue running for several microcycles while a data transfer is being performed, improving performance.
The sequence of states through which the control unit passes would be fixed if it were not for the branch-on-microtest (BUT) logic. This logic generates a modifier based upon the current state of the data paths and Unibus interface (contents of the instruction register, current bus requests, etc.) and a BUT field in the microword currently being accessed from the control store, which selects the condition on which the branch is to be based. The modifier (which will be zero in the case that no branch is selected or that the condition is false) is ORed in with the next microinstruction address so that the next control-unit state is not only a function of the current state but also a function of the state of the data paths. Instruction decoding and addressing mode decoding are two prime examples of the application of BUTs. Certain code points in the BUT field do not select branch conditions, but rather provide control signals to the data paths, Unibus interface, or the control unit itself. These are known as active or working BUTs.
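The sequencing just described can be modeled in a few lines. The microaddress width, field names, and control-store contents below are invented for illustration; only the OR-ing of a BUT-derived modifier into the next microaddress (and the all-ones JAM address) reflects the text.

```python
# Hedged toy model of microprogram sequencing with BUT modifiers.
# Widths, field names, and the control-store contents are invented.
UADDR_BITS = 9
JAM_ADDRESS = (1 << UADDR_BITS) - 1          # all 1s: microtrap entry point

def next_microaddress(microword, datapath_state, exception=False):
    """Next-state function: base next address ORed with the BUT modifier."""
    if exception:
        return JAM_ADDRESS                    # JAM logic overrides everything
    base = microword["next"]
    but = microword.get("but")                # which condition (if any) to test
    modifier = but(datapath_state) if but is not None else 0
    return base | modifier

# Example BUT: dispatch on the source-mode field of the instruction register.
def but_src_mode(state):
    return (state["IR"] >> 9) & 0o7

control_store = {
    0o010: {"next": 0o100, "but": but_src_mode},   # end of fetch: IR-decode dispatch
    0o100: {"next": 0o010, "but": None},           # one of the source-mode microroutines
}

state = {"IR": 0o012746}                       # a MOV whose source mode is 2
ua = 0o010
ua = next_microaddress(control_store[ua], state)
print(oct(ua))                                 # 0o102: 0o100 ORed with modifier 2
```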
The JAM logic is a part of the microprogram flow-altering mechanism. This logic forces the microaddress register to a known state in the event of an exceptional condition such as a memory access error (bus timeout, stack overflow, parity error, etc.) or power-up by ORing all 1s into the next microaddress through the BUT logic. A microroutine beginning at the address of all 1s handles these trapped conditions. The old microaddress is not saved (an exception to this occurs in the case of the PDP-11/60); consequently, the interrupted microprogram sequence is lost and the microtrap ends by restarting the instruction interpretation cycle with the fetch phase.
The structure of the microprogram is determined largely by the BUTs available to implement it and by the degree to which special cases in the instruction set are exploited by these BUTs. This may have a measurable influence on performance as in the case of instruction decoding. The fetch phase of the instruction cycle is concluded by a BUT that branches to the appropriate point in the microcode based upon the contents of the instruction register. This branch can be quite complex, since it is based upon source mode for double-operand instructions, destination mode for single-operand instructions, and op code for all other types of instructions. Some processors can perform the execute phase of certain instructions (such as set/clear condition code) during the last cycle of the fetch phase; this means that the fetch or service phase for the next instruction might also be entered from BUT IRDECODE. Complicating the situation is the large number of possibilities for each phase. For instance, there are not only eight different destination addressing modes, but also subcases for each that vary for byte and word and for memory-modifying, memory-nonmodifying, MOV, and JMP/JSR instructions.
Some PDP-11 implementations such as the 11/10 make as much use of common microcode as possible to reduce the number of control states. This allows much of the IR decoding to be deferred until some later time into a microroutine which might handle a number of different cases; for instance, byte- and word-operand addressing is done by the same microroutine in a number of PDP-11s. Since the cost of control states has been dropping with the cost of control-store ROM, there has been a trend toward providing separate microroutines optimized for each special case, as in the 11/60. Thus more special cases must be broken out at the BUT IRDECODE, and so the logic to implement this BUT becomes increasingly involved. There is a payoff, though, because there are a smaller number of control states for IR decoding and fewer BUTs. Performance is boosted as well, since frequently occurring special cases such as MOV register to destination can be optimized.

4. Measuring the Effect of Design Tradeoffs on Performance

There are two alternative approaches to the problem of determining just how the particular binding of different design decisions affects the performance of each machine:

1 Top-down approach. Attempt to isolate the effect of a particular design tradeoff over the entire space of implementations by fitting the individual performance figures for the whole family of machines to a mathematical model which treats the design parameters as independent variables and performance as the dependent variable.
2 Bottom-up approach. Make a detailed sensitivity analysis of a particular tradeoff within a particular machine by comparing the performance of the machine both with and without the design feature while leaving all other design features the same.

Each approach has its assets and liabilities for assessing design tradeoffs. The first method requires no information about the implementation of a machine, but does require a sufficiently large collection of different implementations, a sufficiently small number of independent variables, and an adequate mathematical model in order to explain the variance in the dependent variable to some reasonable level of statistical confidence. The second method, on the other hand, requires a great deal of knowledge about the implementation of the given system and a correspondingly great amount of analysis to isolate the effect of the single design decision on the performance of the complete system. The information that is yielded is quite exact, but applies only to the single point chosen in the design space and may not be generalized to other points in the space unless the assumptions concerning the machine's implementation are similarly generalizable. In the following subsections the first method is used to determine the dominant tradeoffs and the second method is used to estimate the impact of individual implementation tradeoffs.

4.1 Quantifying Performance
Measuring the change in performance of a particular PDP-11 processor model due to design changes presupposes the existence of some performance metric. Average instruction execution time was chosen because of its obvious relationship to instruction-stream throughput. Neglected are such overhead factors as direct memory access, interrupt servicing, and, on the LSI-11, dynamic memory refresh. Average instruction execution times may be obtained by benchmarking or by calculation from instruction frequency and timing data. The latter method was chosen because of its freedom from the extraneous factors noted above and from the normal clock rate variations found from machine to machine of a given model. This method also allows us to calculate the change in average instruction execution time that would result from some change in the implementation. Such frequency-driven design has already been applied in practice to the PDP-11/60 [Mudge, 1977].
The instruction frequencies are tabulated in Appendix 1 and include the frequencies of the various addressing modes. These figures were calculated from measurements made by Strecker on 7.6 million instruction executions traced in 10 different PDP-11 instruction streams encountered in various applications. While there is a reasonable amount of variation of frequencies from one stream to the next, the figures should be representative.
Instruction times were tabulated for each of the eight PDP-11 implementations and reported in Snow and Siewiorek [1978]. These times were calculated from the engineering documents for each machine. The times differ from those published in the PDP-11 processor handbooks for two reasons. First, in the handbooks, times have been redistributed among phases to ease the process of calculating instruction times. In Snow and Siewiorek the attempt has been made to accurately characterize each phase. Second, there are inaccuracies in the handbooks arising from conservative timing estimates and engineering revisions. The figures included here may be considered more accurate.
A performance figure is arrived at for each machine by weighting its instruction times by frequency. The results, given in Table 1, form the basis of the analyses to follow.
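The weighting just described is a simple frequency-weighted sum. The sketch below shows the shape of the calculation; the times and frequencies are placeholders, not the measured PDP-11 data, which come from Appendix 1 and the per-phase times in Snow and Siewiorek [1978].

    # Illustrative only: placeholder instruction times (microseconds) and
    # relative frequencies (summing to 1.0), not the measured PDP-11 data.
    instruction_times = {"MOV": 3.2, "ADD": 3.5, "CMP": 3.0, "BR": 2.4}
    instruction_freqs = {"MOV": 0.35, "ADD": 0.25, "CMP": 0.25, "BR": 0.15}

    average_time = sum(instruction_times[i] * instruction_freqs[i]
                       for i in instruction_times)
    print(f"average instruction execution time = {average_time:.3f} microseconds")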

4.2 Analysis of Variance of PDP-11 Performance: Top-Down Approach
The first method of analysis described above will be employed in an attempt to explain most of the variance in PDP-11 performance in terms of two parameters:

1 Microcycle time. The microcycle time is used as a measure of processor performance which excludes the effect of the memory subsystem.
2 Memory-read-pause time. The memory-read-pause time is defined as the period of time during which the processor clock is suspended during a memory read. For machines with processor/Unibus overlap, the clock is assumed to be turned off by the same microinstruction which initiates the memory access. Memory-read-pause time is used as a measure of the memory subsystem's impact on processor performance. Note that this time is less than the memory access time since all PDP-11 processor clocks will continue to run at least partially concurrently with a memory access.
Table 1 Average PDP-11 Instruction Execution Times in Microseconds

Machine                            Fetch    Source   Destination   Execute   Total    Speed relative to LSI-11
LSI-11                             2.514    0.689    1.360         1.320     5.883    1.000
PDP-11/04                          1.940    0.610    0.811         0.682     4.043    1.455
PDP-11/10                          1.500    0.573    0.929         1.094     4.096    1.436
PDP-11/20                          1.490    0.468    0.802         0.768     3.529    1.667
PDP-11/34                          1.630    0.397    0.538         0.464     3.029    1.942
PDP-11/40                          0.958    0.260    0.294         0.575     2.087    2.819
PDP-11/45 (bipolar memory)         0.363    0.101    0.213         0.185     0.863    6.820
PDP-11/60 (87% cache hit ratio)    0.541    0.185    0.218         0.635     1.578    3.727

The choice of these two factors is motivated by their dominant contribution to, and (approximately) linear relationship with, performance. Keeping the number of independent variables low is also important because of the small number of data points being fitted to the model.
The model itself is of the form:

t_i = k1·c1,i + k2·c2,i

where
  t_i  = the average instruction execution time of machine i (from Table 1)
  c1,i = the microcycle time of machine i (for machines with selectable microcycle times, the predominant time is used)
  c2,i = the memory-read-pause time of machine i

This model is only an approximation, since it assumes k1 and k2 will be constant over all machines. In general this will not be the case. k1 is the number of microcycles expected in a canonical instruction. This number is mainly a function of data-path connectivity, and, strictly speaking, another factor should be included to account for that variability; however, since the data-path organizations of all PDP-11 implementations considered here (except the 11/03, 11/45, and 11/60) are quite comparable, the simplifying assumption of treating them as identical is made, at the price of explaining somewhat less of the variance. k2 is the number of memory accesses expected in a canonical instruction and also exhibits some variability from machine to machine. A small part of this is due to the fact that some PDP-11's actually take more memory cycles to perform a given instruction than do others (this is really only a factor in certain 11/10 instructions, notably JMP and JSR, and the 11/20 MOV instruction). A more important source of variability is the Unibus-processor overlap logic incorporated into some PDP-11 implementations, which effectively reduces the actual contribution of the k2·c2,i term by overlapping more memory access time with processor operation than is excluded from the memory-read-pause time.
Given the model and the dependent and independent data for each machine as given in Table 2, a linear regression was applied to determine the coefficients k1 and k2 and to find out how much of the variance is explained by the model.
If the regression is applied over all eight processors, k1 = 11.580, k2 = 1.162, and R² = 0.904; that is, the model accounts for 90.4 percent of the variance. If the regression is applied to just the six midrange processors, k1 = 10.896, k2 = 1.194, and R² = 0.962. R² increases to 96.2 percent partly because fewer data points are being fitted to the model and partly because the LSI-11 and 11/45 can be expected to have different k coefficients from those of the midrange machines and hence do not fit the model as well. Note that if two midrange machines, the 11/04 and the 11/40, are eliminated instead of the LSI-11 and 11/45, then R² decreases to 89.3 percent rather than increasing. The k coefficients are close to what should be expected for average microcycle and memory-cycle counts.

Table 2 Top-Down Model Parameters in Microseconds

                                   Independent variables                        Dependent variable
Machine                            Microcycle time   Memory-read-pause time     Average instruction execution time
LSI-11                             0.400             0.400                      5.883
PDP-11/04                          0.260             0.940                      4.043
PDP-11/10                          0.300             0.600                      4.096
PDP-11/20                          0.280             0.370                      3.529
PDP-11/34                          0.180             0.940                      3.029
PDP-11/40                          0.140             0.500                      2.087
PDP-11/45 (bipolar memory)         0.150             0.000                      0.863
PDP-11/60 (87% cache hit ratio)    0.170             0.140                      1.578




Since k1 is much larger than k2, average instruction time is more sensitive to microcycle time than to memory-read-pause time, by a factor of k1/k2, or approximately 10. The implication for the designer is that much more performance can be gained or lost by perturbing the microcycle time than the memory-read-pause time.
Although this method lacks statistical rigor, it is reasonably safe to say that memory and microcycle speed do have by far the largest impact on performance and that the dependency is quantifiable to some degree.
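The fit itself can be reproduced, at least approximately, in a few lines of Python. This is a sketch assuming NumPy is available; because the tabulated times are rounded, the fitted coefficients may differ slightly from the 11.580 and 1.162 quoted above, although the explained variance comes out essentially the same.

    # Sketch of the top-down fit t = k1*c1 + k2*c2 (no intercept), using the
    # Table 2 values.  Assumes NumPy; coefficients may differ slightly from
    # the published figures because of rounding in the tabulated data.
    import numpy as np

    c1 = np.array([0.400, 0.260, 0.300, 0.280, 0.180, 0.140, 0.150, 0.170])  # microcycle time
    c2 = np.array([0.400, 0.940, 0.600, 0.370, 0.940, 0.500, 0.000, 0.140])  # memory-read-pause time
    t  = np.array([5.883, 4.043, 4.096, 3.529, 3.029, 2.087, 0.863, 1.578])  # avg instruction time

    X = np.column_stack([c1, c2])
    (k1, k2), *_ = np.linalg.lstsq(X, t, rcond=None)

    residuals = t - X @ np.array([k1, k2])
    r_squared = 1.0 - residuals.dot(residuals) / np.sum((t - t.mean()) ** 2)
    print(k1, k2, r_squared)    # roughly 11.6, 1.1, 0.90 for the eight-machine fit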

4.3 Measuring Second-Order Effects: Bottom-Up Approach
  
It is a great deal harder to measure the effect of other design tradeoffs on performance. The approximate methods employed in the previous section cannot be used, because the effects being measured tend to be swamped out by first-order effects and often either cancel or reinforce one another, making linear models useless. For these reasons such tradeoffs must be evaluated on a design-by-design basis as explained above. This subsection will evaluate several design tradeoffs in this way.

4.3.1 Effect of Adding a Byte Swapper to the 11/10. The PDP-11/10 uses a sequence of eight shifts to swap bytes and access odd bytes. While saving the cost of a byte swapper, this has a negative effect on performance. In this subsection the performance gained by the addition of a byte swapper either before the B register or as part of the B-leg multiplexer is calculated. Adding a byte swapper would change five different parts of the instruction interpretation process: the source and destination phases where an odd-byte operand is read from memory, the execute phase where a swap byte instruction is executed in destination mode 0 and in destination modes 1 through 7, and the execute phase where an odd-byte address is modified. In each of these cases seven fast shift cycles would be eliminated and the remaining normal-speed shift cycle could be replaced by a byte swap cycle, a saving of seven fast shift cycles or 1.050 μs. None of this time would be overlapped with Unibus operations; hence, all would be saved. This saving is only effected, however, when a byte swap or odd-byte access is actually performed. The frequency with which this occurs is just the sum of the frequencies of the individual cases noted above, or 0.0640. Multiplying by the time saved per occurrence gives a saving of 0.0672 μs, or 1.64 percent of the average instruction execution time. The insignificance of this saving can well be used to support the decision to leave the byte swapper out of the PDP-11/10.

4.3.2 Effect of Adding Processor/Unibus Overlap to the 11/04. Processor/Unibus overlap is not a feature of the 11/04 control unit. Adding this feature involves altering the control unit/Unibus synchronization logic so that the processor clock continues to run until a microcycle requiring the Unibus data from a DATI or DATIP is detected. A bus address register must also be added to drive the Unibus lines after the microcycle initiating the DATIP is completed. This alteration allows time to be saved in two ways. First, processor cycles may be overlapped with memory read cycles, as explained in Subsection 3.1.2. Second, since Unibus data are not read into the data paths during the cycle in which the DATIP occurs, the path from the ALU through the AMUX and back to the registers is freed. This permits certain operations to be performed in the same cycle as the DATIP; for example, the microword BA ← PC; DATI; PC ← PC + 2 could be used to start fetching the word pointed to by the PC while simultaneously incrementing the PC to address the next word. The cycle following could then load the Unibus data directly into a scratch-pad register rather than loading the data into the Breg and then into the scratch-pad on the following cycle, as is necessary without overlap logic. A saving of two microcycle times would result.
DATI and DATIP operations are scattered liberally throughout the 11/04 microcode; however, only those cycles in which an overlap would produce a time saving need be considered. An average of 0.730 cycles can be saved or overlapped during each instruction. If all of the overlapped time is actually saved, then 0.190 μs, or 4.70 percent, will be pared from the average instruction execution time. This amounts to a 4.93 percent increase in performance.
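The distinction between time removed from the average instruction and performance gained is easy to blur; the arithmetic below simply re-runs the 11/04 figures quoted above (0.730 overlapped microcycles of 0.260 μs each against the 4.043-μs average instruction time).

    # Worked numbers for the 11/04 overlap estimate quoted in the text.
    microcycle_time = 0.260        # 11/04 microcycle, microseconds (Table 2)
    overlapped_cycles = 0.730      # average cycles saved or overlapped per instruction
    old_avg_time = 4.043           # 11/04 average instruction time, microseconds (Table 1)

    time_saved = overlapped_cycles * microcycle_time       # about 0.190 microseconds
    new_avg_time = old_avg_time - time_saved
    time_reduction = time_saved / old_avg_time              # about 4.70 percent of the old time
    performance_gain = old_avg_time / new_avg_time - 1.0    # about 4.93 percent more throughput
    print(time_saved, time_reduction, performance_gain)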

4.3.3 Effect of Caching on the 11/60. The PDP-11/60 uses a cache to decrease its effective memory-read-pause time. The degree to which this time is reduced depends upon three factors: the cache-read-hit pause time, the cache-read-miss pause time, and the ratio of cache-read hits to total memory read accesses. A write-through cache is assumed; therefore, the timing of memory write accesses is not affected by caching and only read accesses need be considered. The performance of the 11/60 as measured by average instruction execution time is modeled exactly as a function of the above three parameters by the equation

t = k1 + k2·(k3·a + k4·[1 - a])

where
  t  = the average instruction execution time
  a  = the cache hit ratio
  k1 = the average execution time of a PDP-11/60 instruction excluding memory-read-pause time but including memory-write-pause time (1.339 μs)
  k2 = the number of memory reads per average instruction (1.713)
  k3 = the memory-read-pause time for a cache hit (0.000 μs)
  k4 = the memory-read-pause time for a cache miss (1.075 μs)

The above equation can be rearranged to yield:

t = (k1 + k2·k4) - k2·(k4 - k3)·a

The first term and the coefficient of the second term in the equation above are equivalent to 3.181 μs and 1.842 μs, respectively, with the given k parameter values. This reduces the average instruction time to a function of the cache hit ratio, making it possible to compare the effect of various caching schemes on 11/60 performance in terms of this one parameter.
The effect of various cache organizations on the hit ratio is described for the PDP-11 family in general in Strecker [1976b] and for the PDP-11/60 in particular in Mudge [1977]. If no cache is provided, the hit ratio is effectively 0 and the average instruction execution time reduces to the first term in the model, or 3.181 μs. A set-associative cache with a set size of 1 word and a cache size of 1,024 words has been found through simulation to give a 0.87 hit ratio. The resulting average instruction time of 1.578 μs represents a 101.52 percent improvement in performance over that without the cache.
The cache organization described above is that actually employed in the 11/60. It has the virtue of being relatively simple to implement and therefore reasonably inexpensive. Set size or cache size can be increased to attain a higher hit ratio at a correspondingly higher cost. One alternative cache organization is a set size of 2 words and a cache size of 2,048 words. This organization boosts the hit ratio to 0.93, resulting in an instruction time of 1.468 μs, an increase in performance of 7.53 percent. This increased performance must be paid for, however, since twice as many memory chips are needed. Because the performance increment derived from the second cache organization is much smaller than that of the first while the cost increment is approximately the same, the first is more cost-effective.
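The rearranged model is easy to evaluate directly; the short sketch below plugs in the k values given above for the 11/60 and the hit ratios just discussed.

    # Evaluate t(a) = (k1 + k2*k4) - k2*(k4 - k3)*a for the 11/60 cache model.
    k1, k2, k3, k4 = 1.339, 1.713, 0.000, 1.075   # values quoted in the text

    def avg_instruction_time(hit_ratio):
        return (k1 + k2 * k4) - k2 * (k4 - k3) * hit_ratio

    for a in (0.00, 0.87, 0.93):    # no cache, 1-word/1K-word cache, 2-word/2K-word cache
        print(a, round(avg_instruction_time(a), 3))
    # prints approximately 3.180, 1.578, and 1.468 microseconds
    # (the text's 3.181 reflects rounding of the k values)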

4.3.4 Design Tradeoffs Affecting the Fetch Phase. The fetch phase holds much potential for performance improvement, since it consists of a single short sequence of microoperations that, as Table 1 clearly shows, involves a sizable fraction of the average instruction time because of the inevitable memory access and possible service operations. In this subsection two approaches to cutting this time are evaluated for four different processors.
The Unibus interface logic of the PDP-11/04 and that of the 11/34 are very similar. Both insert a delay into the initial microcycle of the fetch phase to allow time for bus-grant arbitration circuitry to settle so that a microbranch can be taken if a serviceable condition exists. If the arbitration logic were redesigned to eliminate this delay, the average instruction execution time would drop by 0.220 μs for the 11/04 and 0.150 μs for the 11/34. The resulting increases in performance would be 5.75 percent and 5.21 percent, respectively.
Another example of a design feature affecting the fetch phase is the operand-instruction fetch overlap mechanism of the 11/40, 11/45, and 11/60. From the normal fetch times in the appendix and the actual average fetch times given in Table 1, the saving in fetch-phase time alone can be calculated to be 0.162 μs for the 11/40, 0.087 μs for the 11/45, and 0.118 μs for the 11/60, or an increase of 7.77 percent, 10.07 percent, and 8.11 percent over what their respective performances would be if fetch-phase time were not overlapped.
These examples demonstrate the practicality of optimizing sequences of control states that have a high frequency of occurrence rather than just those which have long durations. The 11/10 byte-swap logic is quite slow, but it is utilized infrequently, so that its impact upon performance is small; while the bus arbitration logic of the 11/34 exacts only a small time penalty but does so each time an instruction is executed and results in a larger performance impact. The usefulness of frequency data should thus be apparent, since the bottlenecks in a design are often not where intuition says they should be.

5. Summary and Use of the Methodologies

The PDP-11 offers an interesting opportunity to examine an architecture with numerous implementations spanning a wide range of price and performance. The implementations appear to fall into three distinct categories: the midrange machines (PDP-11/04/10/20/34/40/60); an inexpensive, relatively low-performance machine (LSI-11); and a comparatively expensive but high-performance machine (PDP-11/45). The midrange machines are all minor variations on a common theme with each implementation introducing much less variability than might be expected. Their differences reside in the presence or absence of certain embellishments rather than in any major structural differences. This common design scheme is still quite recognizable in the LSI-11 and even in the PDP-11/45. The deviations of the LSI-11 arise from limitations imposed by semiconductor technology rather than directly from cost or performance considerations, although the technology decision derives from cost. In the PDP-11/45, on the other hand, the quantum jump in complexity is purely motivated by the desire to squeeze the maximum performance out of the architecture.
From the overall performance model presented in Sec. 4.2 of this chapter, it is evident that instruction-stream processing can be speeded up by improving either the performance of the memory subsystem or the performance of the processor. Memory subsystem performance depends upon the number of memory accesses in a canonical instruction and the effective memory-read-pause time. There is not much that can be done about the first number, since it is a function of the architecture and thus largely fixed. The second number may be improved, however, by the use
of faster memory components or techniques such as caching.
The performance of the PDP-11 processor itself can be enhanced in two ways: by cutting the number of processor cycles to perform a given function or by cutting the time used per microcycle. Several approaches to decreasing the effective microcycle count have been demonstrated:

1 Structure the data paths for maximum parallelism. The PDP-11/45 can perform much more in a given microcycle than any of the midrange PDP-11's and thus needs fewer microcycles to complete an instruction. To obtain this increased functionality, however, a much more elaborate set of data paths is required in addition to a highly developed control unit to exercise them to maximum potential. Such a change is not an incremental one and involves rethinking the entire implementation.
2 Structure the microcode to take best advantage of instruction features. All processors except the 11/10 handle JMP/JSR addressing modes as a special case in the microcode. Most do the same for the destination modes of the MOV instruction because of its high frequency. Varying degrees of sophistication in instruction dispatching from the BUT IRDECODE at the end of every fetch is evident in different machines and results in various performance improvements.
3 Cut effective microcycle count by overlapping processor and Unibus operation. The PDP-11/10 demonstrates that a large microcycle count can be effectively reduced by placing cycles in parallel with memory access operations whenever possible.

Increasing microcycle speed is perhaps more generally useful, since it can often be applied without making substantial changes to an entire implementation. Several of the midrange PDP-11's achieve most of their performance improvement by increasing microcycle speed in the following ways:

1 Make the data paths faster. The PDP-11/34 demonstrates the improvement in microcycle time that can result from the judicious use of Schottky TTL in such heavily traveled points as the ALU. Replacing the ALU and carry/look-ahead logic alone with Schottky equivalents saves approximately 35 ns in propagation delay. With cycle times running 300 ns and less, this amounts to better than a 10 percent increase in speed.
2 Make each microcycle take only as long as necessary. The 11/34 and 11/40 both use selectable microcycle times to speed up cycles that do not entail long data-path propagation delays.

Circuit technology is perhaps the single most important factor in performance. It is only stating the obvious to say that doubling circuit speed will double total performance. Aside from raw speed, circuit technology dictates what it is economically feasible to build, as witnessed by the SSI PDP-11/20, the MSI PDP-11/40, and the LSI-11. Just the limitations of a particular circuit technology at a given point in time may dictate much about the design tradeoffs that can be made, as in the case of the LSI-11.
Turning to the methodologies, the two presented in Sec. 4 of this chapter can be used at various times during the design cycle. The top-down approach can be used to estimate the performance of a proposed implementation, or to plan a family of implementations, given only the characteristics of the selected technology and a general estimate of data-path and memory-cycle utilization.
The bottom-up approach can be used to perturb an existing or planned design to determine the performance payoff of a particular design tradeoff. The relative frequencies of each function (e.g., addressing modes and instructions), while required for an accurate prediction, may not be available. There are, however, alternative ways to estimate relative frequencies. Consider the three following situations:

1 At least one implementation exists. An analysis of the implementation in typical usage (i.e., benchmark programs for a stored-program computer) can provide the relative frequencies.
2 No implementation exists, but similar systems exist. The frequency data may be extrapolated from measurements made on a machine with a similar architecture.
3 No implementation exists and there are no prior similar systems. From knowledge of the specifications, a set of most-used functions can be estimated (e.g., instruction fetch, register and relative addressing, move and add instructions for a stored-program computer). The design is then optimized for these functions.

Of course, the relative-frequency data should always be updated to take into account new data.
Our purpose in writing this chapter has been twofold: to provide data about design tradeoffs and to suggest design methodologies based on these data. It is hoped that the design data will stimulate the study of other methodologies while the results of the design methodologies presented here have demonstrated their usefulness to designers.
APPENDIX 1 INSTRUCTION TIME COMPONENT FREQUENCIES
This appendix tabulates the frequencies of PDP-11 instructions and addressing modes. These data were derived as explained in Subsection 4.1. Frequencies are given for the occurrence of each phase (e.g., source, which occurs only during double-operand instructions), each subcase of each phase (e.g., jump destination, which occurs only during jump or jump to subroutine instructions), and each instance of each phase, such as a particular addressing mode or instruction. The frequency with which the phase is skipped is listed for source and destination phases. Source and destination odd-byte-addressing frequencies are listed as well because of their effect on instruction timing.

                                         Frequency

Fetch                                    1.0000

Source                                   0.4069
  Mode 0  R                              0.1377
  Mode 1  @R or (R)                      0.0338
  Mode 2  (R)+                           0.1587
  Mode 3  @(R)+                          0.0122
  Mode 4  -(R)                           0.0352
  Mode 5  @-(R)                          0.0000
  Mode 6  X(R)                           0.0271
  Mode 7  @X(R)                          0.0022
No source                                0.5931
NOTE: Frequency of source odd-byte addressing (SM1-7) = 0.0252.

Destination                              0.6872
  Data manipulation                      0.6355
    Mode 0  R                            0.3146
    Mode 1  @R or (R)                    0.0599
    Mode 2  (R)+                         0.0854
    Mode 3  @(R)+                        0.0307
    Mode 4  -(R)                         0.0823
    Mode 5  @-(R)                        0.0000
    Mode 6  X(R)                         0.0547
    Mode 7  @X(R)                        0.0080
  NOTE: Frequency of destination odd-byte addressing (DM1-7) = 0.0213.
  Jump (JMP/JSR)                         0.0517
    Mode 0  R                            0.0000 (ILLEGAL)
    Mode 1  @R or (R)                    0.0000
    Mode 2  (R)+                         0.0000
    Mode 3  @(R)+                        0.0079
    Mode 4  -(R)                         0.0000
    Mode 5  @-(R)                        0.0000
    Mode 6  X(R)                         0.0438
    Mode 7  @X(R)                        0.0000
No destination                           0.3128

Execute                                  1.0000
  Double operand                         0.4069
    ADD 0.0524    SUB 0.0274    BIC 0.0309    BICB 0.0000
    BIS 0.0012    BISB 0.0013   CMP 0.0626    CMPB 0.0212
    BIT 0.0041    BITB 0.0014   MOV 0.1517    MOVB 0.0524
    XOR 0.0000
  Single operand
    CLR, CLRB, COM, COMB, INC, INCB, DEC, DECB, NEG, NEGB, ADC, ADCB,
    SBC, SBCB, ROR, RORB, ROL, ROLB, ASR, ASRB, ASL, ASLB, TST, TSTB,
    SWAB, SXT
  Jump                                   0.0517
    JMP 0.0272    JSR 0.0245
  Control, trap, and miscellaneous       0.0270
    Set/clear condition codes 0.0017    MARK 0.0000    RTS 0.0236
    RTI 0.0000    RTT 0.0000    IOT 0.0000    EMT 0.0017
    TRAP 0.0000    BPT 0.0000

NOTE: Execution frequencies indicated as 0.0000 have an aggregate frequency < 0.0050.
Maxicomputers
Introduction
What distinguishes the maxicomputer class from the classes already presented? As illustrated in Chap. 1, one primary characteristic is price. The maxicomputer tends to be the largest machine that can be built in a given technology at a given time. The typical price for a maxicomputer in 1980 was greater than $1 million. Another characteristic used in Chap. 1 was a large virtual-address space. In 1980 this meant a virtual-address space size in excess of 16 Mbyte.
Maxicomputers usually have a rich set of data-types. Over the years the scientific data-types have progressed from short-word to long-word fixed-point scalars, to floating-point scalars, and finally to vectors and arrays. Commercial data-types have progressed from character-at-a-time to fixed-length instructions using descriptors and on to variable character strings. The PMS structure of maxicomputers has evolved from a single Pc to 1-Pc-n-Pio, then to m-Pc-n-Pio, and on to C-Cio [data-base]-Cio [communication].
Not all maxicomputers satisfy all the characteristics. Several maxicomputers have just basic processing performance as a goal and have only high-performance implementations (as do the TI ASC and the CRAY-1), often with a limited range of peripherals and software. Other maxicomputers have a family of program-compatible implementations spanning a large performance range (as do the System/360-370 Model 91 and Model 195 and the VAX-11). Particular implementations of these families of machines may be high-performance; however, such implementations are constrained by the family ISP, which may not have provision for features related only to high performance. (As an example of such a feature, the TI ASC has a PREPARE TO BRANCH instruction that notifies instruction prefetch logic of an upcoming branch. By prefetching instructions down both possible branch paths this instruction can keep the instruction pipeline filled.)
This section examines five maxicomputers. The System/360 and the VAX-11 represent implementation families, while the CRAY-1 and the TI ASC are explicitly targeted for the very-high-performance market, where the goal is solely performance. The CDC 6600, while designed primarily for the high-performance market, can be assembled into lower-performance models if the high-performance central processor is deleted.

The IBM System/360
 
The IBM System/360 is the name given to a third-generation series of computers. More recent than the System/360 is the IBM System/370, which has been followed by cost-reduced implementations in the Series 3030 and Series 4300, which constitute the current primary IBM product line. Chapters 40 and 41 focus on the ISP of the original System/360. A discussion of the System/370 and the 3030 and 4300 series plus a comparison of the various models in the System/360, System/370, Series 3030, and Series 4300 is covered in Part 4, Sec. 5.
The following discussion covers only the processor. The instruction set consists of two classes, scientific ISP and data processing ISP, which operate on the different data-types. These data-types correspond roughly to the IBM 7090 and IBM 1401 [Bell and Newell, 1971]. For the scientific ISP there are half- and single-word integers; address integers; single, double, and quadruple (in the Model 85) floating point; and logical words (boolean vectors). For the data processing ISP there are address or single-word integers, multiple-byte strings, and multiple-digit decimal strings. These many data-types give the 360 strength in the minds of its various types of users. However, the many data-types, each performing few operations, may be of questionable utility and may constrain the ISP design in a way that a more complete operation set for a few basic data-types does not.
The ISP uses a general-register organization, as is common in virtually all computers in use during the 1970s. The ISP power can be compared with several similar multiple-register ISP structures, such as those of the UNIVAC 1107 and 1108; the CDC 6600 and 7600; the CRAY-1; the DEC PDP-6, PDP-10, PDP-11, and VAX-11; the Intel 8080 and 8086; the SDS Sigma 5 and Sigma 7; and the early general-register-organized machine Pegasus [Elliott et al., 1956]. Of these machines the System/360 scientific ISP appears to be the weakest in terms of instruction effectiveness and the completeness of its instruction set. As part of the Military Computer Family (MCF) project [Computer, 1977; CFA, 1977], a statistically designed experiment was conducted to compare the effectiveness of the Interdata 8/32, PDP-11, and IBM System/360 ISP. Sixteen programmers implemented test programs from a set of 12 benchmark descriptions. In all, 99 programs were written and measured. The results indicated that the System/360 required 21 percent and 46 percent more memory to store programs than the PDP-11 and the Interdata 8/32, respectively. Further, the System/360 required 37 percent and 49 percent more bytes than the PDP-11 and Interdata 8/32, respectively, to be transferred between primary memory and the processor during execution of the test programs.
In the following discussion, it would be instructive to contrast

the System/360 ISP with a more contemporary ISP, such as that of the VAX-11. For example, in the VAX-11/780 (Chap. 42), symmetry is provided in the instruction set. For any binary operation b the following are possible:
GR ← GR b Mp     Memory/register to register
GR ← GR b GR     Register to register
Mp ← GR b Mp     Memory/register to memory
Mp ← Mp b Mp     Memory/memory to memory
The 360 ISP provides only the first two. Additional instructions (or modes) would increase the instruction length.
In the System/360 the only advantage taken of general registers is to make them suitable for use as index registers, base registers, and arithmetic accumulators (for operand storage). Of course, the commitment to extend the general-purpose nature of these general registers would require more operations.
The 360 has a separate set of general registers for floating-point data, whereas the VAX-11/780 uses one register set for all data-types. Data-type-specific register sets provide more processor state and temporary storage but again detract from the general-purpose ability of the existing registers. Special commands are required to manipulate the floating-point registers independently of the other general registers. Unfortunately the floating-point instruction set is not quite complete (e.g., in conversion from fixed to floating point; several instructions are needed to move data between the fixed and floating registers).
When multiple data-types are available, it is desirable to have the ability to convert between them unless the operations are complete in themselves. The VAX-11/780 provides a full set of instructions for converting between data-types. The System/360, on the other hand, might use more data-conversion instructions, for example, between the following:

1 Fixed-precision integers and floating-point data.
2 Address-size integers and any other data.
3 Half-word integers and other data.
4 Decimal and byte string and other data. (Conversion between decimal string and byte string is provided.)

Some of the facilities are redundant and might be handled by better but fewer instructions. For example, decimal strings are not completely variable-length (they are variable up to 31 digits, stored in 16 bytes), and so essentially the same arithmetic results could be obtained by using fixed multiple-length binary integers. This would remove the special decimal arithmetic and still give the same result. If a large quantity of fixed-field decimal or byte data were processed, then the binary-decimal conversion instructions would be useful.
The communication instructions between Pc and Pio are minimal in the System/360. The Pc must set up Pio program data, but there are inadequate facilities in the Pc for quickly forming Pio instructions (which are actually yet another data-type). There are, in effect, a large number of Pio's, as each device is independent of all others. However, signaling of all Pio's is via a single interrupt channel to the Pc. By contrast, the VAX-11 I/O devices are implemented as a set of registers with addresses in the memory address space. Thus the entire instruction set is usable to directly control the I/O activity. There are no specific I/O instructions.
The Pc state consists of 26 words of 32 bits each:

1 Program state words, including the instruction counter (2 words)
2 Sixteen general registers (16 words)
3 Four 2-word floating-point general registers (8 words)

Many instructions must be executed (taking appreciable time) to preserve the Pc state and establish a new one. A single instruction would be preferable; even better would be an instruction to exchange processor states, as in the CDC 6600 (Chap. 43).
As originally designed in the System/360, the methods used to address data in Mp had some disadvantages. It is impossible to fetch an arbitrary word in Mp in a single instruction, because the address space is limited to a direct address of only 2¹² bytes. Any Mp access outside the range requires an offset or base address to be placed in a general register. Accesses to several large arrays may take significant time if a base address has to be loaded each time. The reason for using a small direct address is to save space in the instruction. The VAX-11 provides multiple addressing modes, including direct access to 2³¹ bytes, which give the programmer flexibility in accessing arbitrary operands.
Another difficulty of the 360 addressing is the nonhomogeneity of the address space. Addressing is to the nearest byte, but the system remains organized by words; thus, many addresses are forced to be on word (and even doubleword) boundaries. For example, a double-precision data-type which requires two words of storage must be stored with the first word beginning at a multiple of an 8-byte address. (However, the Model 85, which is a late entry in the series, allows arbitrary alignment of data-types with word boundaries, while the System/370 eliminated this limitation.) When a general register is used as a base or index register, the value in the index register must correspond to the length of the data-type accessed. That is, for the value of a half
integer, single integer, single floating, double floating (long), and quadruple floating (extended), I must be multiplied by 2, 4, 4, 8, and 16, respectively, to access the proper element. The VAX-11 does not require data-types to be aligned on artificial boundaries.
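The base-plus-displacement arithmetic and the index scaling being discussed can be illustrated with a small sketch. The function name, register values, and array index below are hypothetical; only the 12-bit displacement limit, the 24-bit real address, and the per-data-type scale factors come from the discussion above.

    # Illustrative only: System/360-style effective-address formation.  The
    # displacement field is 12 bits, so operands farther than 4,095 bytes from
    # a base register need a new base loaded first, and index values must be
    # pre-scaled by the operand length in bytes.
    def effective_address(base, index, displacement):
        assert 0 <= displacement < 2**12              # 12-bit displacement limit
        return (base + index + displacement) & 0xFFFFFF   # 24-bit real address

    element_sizes = {"halfword": 2, "word": 4, "single float": 4,
                     "double float": 8, "extended float": 16}

    base = 0x020000        # hypothetical base-register contents
    i = 5                  # array element wanted
    addr = effective_address(base, i * element_sizes["double float"], 0x100)
    print(hex(addr))       # the program itself had to form i * 8 before indexing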
A single instruction to load or store any string of bits in Mp (as provided in the IBM Stretch) would provide a great deal of generality. Provided the length were up to 64 bits, such an instruction might eliminate the need for the more specialized data-types.
A basic scheme for dynamic multiprogramming through program swapping was nonexistent in the System/360 because of the inadequate relocation hardware. Only a simple method of Mp protection is provided, using protection keys (see Part 2, Sec. 2). This scheme associates a 4-bit number (key) and a 1-bit write protect with each 2-Kbyte block, and each Pc access must have the correct number. Both protection of Mp and assignment of Mp to a particular task (greater than 2⁴ tasks) are necessary in a dynamic multiprogramming environment. Although the architects of the System/360 advocate its use for multiprogramming, the operating system does not enforce conventions to enable a program to be moved, once its execution is started. Indeed, the nature of the System/360 addressing is based on absolute binary addresses within a program. The later, experimental Model 67 does, however, have a very nice scheme for protection, relocation, and name assignment to program segments [Arden et al., 1966].

VAX

The VAX-11 (Virtual Address Extension) is a 32-bit successor to the PDP-11 minicomputer (Chap. 38). The VAX-11 ISP bears a strong kinship to the PDP-11 ISP, especially with respect to addressing modes.
While the primary reason for creating an ISP based on 32-bit words was for a 32-bit address space, the extra word width allowed for the addition of new data-types (strings, characters, etc.) and a general cleaning up of the instruction format (e.g., from a variety of op code field lengths of 4, 8, 10, and 16 bits in the PDP-11 to multiples of 8-bit fields). Several of the perceived shortcomings of the System/360 instruction set were fixed, including:


1 ISP symmetry for source and destination operands.
2 A complete set of instructions for each data-type and for converting between data-types.
3 General-register architecture where the registers are used for all data-types. There are no special registers dedicated to a subset of the data-types.
4 I/O handling through the address space, as in the PDP-11. The same set of instructions can be used in either data manipulation or I/O.
5 A virtual-memory system that provides both program protection and memory relocation.
6 Rapid context swap through automatic register saving as determined by a settable bit mask (a small sketch of this idea follows the list).
7 Addressability of any location in memory by a single instruction.
8 Stacks and stack operators integral to the design, especially for procedure calls.
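
The register-save mask of item 6 can be sketched conceptually in a few lines. This is purely illustrative: the mask value, register contents, and stack representation are made up, and the sketch does not claim to reproduce the VAX microcode.

    # Conceptual sketch of a register-save mask: each set bit selects one
    # general register to be saved on the stack during a call or context swap.
    registers = {f"R{n}": n * 11 for n in range(16)}   # dummy register contents
    save_mask = 0b0000_1100_0011_0000                  # hypothetical mask: save R4, R5, R10, R11

    stack = []
    for bit in range(16):
        if save_mask & (1 << bit):
            stack.append((f"R{bit}", registers[f"R{bit}"]))
    print(stack)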

The VAX-11 ISP represents what the System/360 ISP could have been given 10 years of experience in instruction sets. The evolution of the VAX-11 ISP from the PDP-11 ISP is an interesting study of concern for user-program compatibility on architectures using different word lengths. This evolution is also interesting to compare with that of the System/360 and System/370 (Chap. 51).
Figures 1 and 2 illustrate the PMS diagram and Kiviat graph for the first VAX implementation, the VAX-11/780. An LSI-11 serves as the console processor. The LSI-11 interprets commands typed on the console for machine control. The console teletype replaces the traditional console light and switch panel in performing functions such as HALT, SINGLE STEP, DEPOSIT, and EXAMINE. The console processor also provides for system initiation (booting), diagnosis (through microdiagnostics and the diagnostic control store), and status monitoring. Conceptually, the console terminal could be replaced by a phone line or serial line to another computer for remote monitoring and control.
A set-associative cache provides performance improvement on operand fetching. Because of the elaborate translation from virtual to real address, a translation buffer (or physical address cache) provides speedup to the address translation process.
Any mix of four Unibus or Massbus adaptors provides for attaching to peripheral buses that are not compatible with the VAX-11/780 processor/memory.

The CDC 6600, 7600, and CYBER Series

The CDC 6000 series development began in 1960, using high-speed transistors and discrete components of the second generation. The first 6600 was announced in July 1963 and first delivery was in September 1964. Subsequent, compatible successors included the 6400, in April 1966, which was implemented as a conventional Pc (a single shared arithmetic function unit instead of the 10 D's); the 6500, in October 1967, which uses two 6400 Pc's; and the 6416, in 1966, which has only peripheral and control processors (PCP).


The first 7600, which is nearly compatible, was delivered in 1969. The dual-processor 6700, consisting of a 6600 and a 6400 Pc, was introduced in October 1969. Subsequent modifications to the series in 1969 included the extension to 20 peripheral and control processors with 24 channels. CDC also marketed a 6400 with a smaller number of peripheral and control processors (the 6415-7, with seven). Reducing the maximum PCP number to seven also reduced the overall purchase cost by approximately $56,000 per processor.

The computer organization, technology, and construction are described in Chap. 43. ISP descriptions for the Pc are given in Appendix 1 of Chap. 43. To obtain the very high logic speeds, the components are placed close together. The logic cards use a cordwood-type construction. The logic is direct-coupled-transistor logic, with 5 ns of propagation time and a clock of 25 ns. The fundamental minor cycle is 100 ns, and the major cycle is 1,000 ns, as is the memory cycle time. Since the component density is high (about 500,000 transistors in the 6600), the logic is cooled by conduction to a plate with Freon circulating through it.

This series is interesting from many aspects. It remained the fastest operational computer for many years, until the advent of the IBM System/360 Model 91 and the follow-on CDC 7600. Its large component count almost implies it cannot exist as an operational entity. Thus it is a tribute to an organization, and the project leader-designer, Seymour Cray, that a large number exist. There are sufficiently high data bandwidths within the system that it remains balanced for most job mixes (an uncommon feature in large C's). It has high-performance Ms.disks and T.displays to avoid bottlenecks. The Pc's ISP is a nice variation of the general-register processor and allows for very efficient encoding of programs. The Pc is multiprogrammed and can be switched from job to job more quickly than any other computer. Ten smaller C's control the main Pc and allow it to spend time on useful (billable) work rather than on its own administration. The independent multiple data operators in the 6600 increase the speed by at least 2½ times over a 6400, which has a shared D. Finally, it realizes the 10 C's in a unique, interesting, and efficient manner. Not many computer systems can claim half as many innovations.
PMS Structure

A simplified PMS structure of the C['6400, '6600] is given in Fig. 3. Here we see the C[io; #1:10], each of which can access the central computer (Cc) primary memory (Mp). Figure 3 shows why we consider the 6600 to be fundamentally a network. Each Cio (actually a general-purpose, 12-bit C) can easily serve the specialized Pio function for Cc. The Mp of Cc is an Ms for a Cio, of course. By having a powerful Cio, more complex input/output tasks can be handled without Cc intervention. These tasks can include data-type conversion and error recovery, among others. Figure 3 has about the same information as Fig. 1 in Chap. 43. A detailed PMS diagram for the C['6400, '6416, '6500, and '6600] is given in Fig. 4, accompanied by a Kiviat graph in Fig. 5 that is representative of the CDC 6600 series.

The interesting structural aspects can be seen from Fig. 4. The four configurations, 6400 through 6600, are included just by considering the pertinent parts of the structure. That is, a 6416 has no large Pc; a 6400 has a single, straightforward Pc; a 6500 has two Pc's; and the 6600 has a single, powerful Pc. The 6600 Pc has ten D's, so that several parts of a single instruction stream can be interpreted in parallel. A 6600 Pc also has considerable M.buffer to hold instructions, so that the Pc need not wait for Mp fetches.

The implementation of the ten Cio's can be seen from the PMS diagram (Fig. 4). Here, only one physical processor is used on a time-shared basis. Every 0.1 μs a new logical P is processed by the physical P. The ten Mp's are phased so that a new access occurs each 0.1 μs; the ten Mp's are always busy. Thus the information rate is 10 × 12 bits/μs, or 120 Mbit/s. This process of shifting a new Pc state into position each 0.1 μs has been likened to a barrel by CDC. A diagram of the process is shown in Fig. 6.

The T's, K's, and M's are not given, although it should be mentioned that the following units are rather unique: a K for the management of 64 telegraph lines to be connected to a Cio; an Ms(disk) with four simultaneous access ports, each at a data-transfer rate of 1.68 megacharacters per second and a capacity of 168 megacharacters; an Ms[magnetic tape] with a K[#1:4] and S to allow simultaneous transfers to four Ms's; the T[display] for monitoring the system's operation; K's to other C's and Ms's; and conventional T[card reader, punch, line printer, etc.].

ISP

The ISP description of the Pc is given in Appendix 1 of Chap. 43. The Pc has a very clean, straightforward, scientific-calculation-oriented ISP. We can consider it a variation on the general-register structure because the Pc state has three sets of general registers. Their use is explained both in Chap. 43 and its Appendix 1. This structure assumes that a program consists of several read accesses to a large array (or arrays) and a large number of operations on these accessed elements, followed by occasional write accesses to store results. We would agree that this is a valid assumption for scientific programs (e.g., look at a FORTRAN arithmetic statement), and it is probably valid for most other programs as well. Cc has provisions for multiprogramming in the form of a protection and relocation address register pair. The mapping is given in the ISP description for both Mp and Ms['Extended Core Storage/ECS]. The 6600 PCP is about the same as the early CDC 160 minicomputer (see Part 3, Sec. 3). The PCP has an 18-bit A register because it has to process addresses for the large Cc.

One interesting aspect of the 6400 which we question is the lack of communication among all components at the ISP (programming) level. When Pc stops, it has no way of explicitly informing any other components. There are no interprocessor interrupts. An I/O device cannot interrupt a Pio, nor can Pio's communicate with one another except by polling. The state switching for Pc is elegant, however, since a Pio can request Pc to stop a job, store its state in Mp, and resume a new task in one instruction. (The t.save + t.restore is about 2 μs.)

The Operating System

The Cio's functions are data transmission between a peripheral device and the large Cc via the Cio's Mp, with some data transformation or conversions; complete task management, including initiation, termination, and error handling; and management of Pc.
The Cio's perform in about the same manner as the C['Attached Support Processor] in the System/360 (see Part 4, Sec. 5). The operating-system software is managed by a single fixed Cio. The remaining nine Cio's are free, and as I/O tasks arise in the system, the Cio's assign themselves to particular tasks, carry out the tasks, and then free themselves to take on other tasks. The operating-system software resides in Mp(Pc) (that is, Cc), is accessible to all Cio's, and includes:


 
2 Programs for the Cio's:
  a Parts of the operating system used by the Cio responsible for the system management
  b I/O management programs (or programs to get the task-management program from Ms), which the Cio's use
In a typical system, one might expect to find the following assignment of PCP's:

1 Operating-system execution, including scheduling and management of Cc and all Cio's

 
2 Display of job-status data on T[display]
3 Ms[disk] transfer management
4 T[printers, card reader, card punch]
5 L[#1:3; to:C.satellite]
6 Ms[magnetic tape]
7 T[64 Teletypes]
8 Free, to be used with Ms[disk] and Ms[magnetic tape]
9 Free
10 Free
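
The peripheral-processor "barrel" described above, in which one physical processor time-shares ten logical processor states and advances every 0.1 μs, can be sketched conceptually. This is a model of the idea only, assuming nothing about the actual CDC hardware beyond what the text states; the state fields and step counts are placeholders.

    # Conceptual sketch of the 6600 PP barrel: one physical processor holds ten
    # logical PP states and advances to the next state every minor cycle.
    MINOR_CYCLE_US = 0.1
    pp_states = [{"id": n, "pc": 0} for n in range(10)]   # ten logical PPs

    def run_barrel(cycles):
        time_us = 0.0
        for cycle in range(cycles):
            pp = pp_states[cycle % 10]    # the PP currently "in the slot"
            pp["pc"] += 1                 # stand-in for executing one PP step
            time_us += MINOR_CYCLE_US
        return time_us

    elapsed = run_barrel(100)             # each PP gets one step every 1.0 microsecond
    print(elapsed, [pp["pc"] for pp in pp_states])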

The CDC 7600 Series
The CDC 7600 system is an upward-compatible member of the CDC 6000 series. Although the main Pc in the 7600 is compatible with the main Pc of the 6600, instructions have been added for controlling the I/O section and for communicating between Large Core Memories (LCM) and Small Core Memory (SCM). It is expected to compute at an average rate 4 to 6 times that of a C['6600]. The PMS structure (Fig. 7) is substantially different from that of the 6600. The C[' 7600] Peripheral Processing Unit (PPU), unlike the Peripheral and Control Processors of the C[' 6600], has a loose coupling with the main C. The PPUs are under control of the main
C when transferring words into SCM via K['Input-Output Section]. The fifteen C['PPU]'s have eight input/output channels. These channels, which can run concurrently, provide the link between C['PPU] and peripheral Ms's and T's. Some of the PPUs are located in the same physical space as the Pc. The 7600 Pc can be interrupted by a clock, the PPUs, and a trap condition within the Pc. A breakpoint address, BPA, can be set up within Pc to initiate a trap when the program reaches BPA. This interruption scheme is in contrast to that of the 6600, which could not be interrupted or trapped. The 7600 interrupt may be a reaction to the lack of intercommunication in the 6600.

The CDC CYBER Series

The CDC CYBER-170's continued the line of computers beginning with the CDC 6600. The CYBER-170 series, manufactured in six models, was announced in 1970. This series extended into timesharing and multimode operations the concept of separate hardware for computation, input/output, and monitoring. The CYBER-170 series, while incorporating refined versions of the architecture and software of its predecessors, offers a broader range of performance levels and applications, as well as cost-effective operation. The CDC CYBER-170 series of machines features six compatible computer systems in the medium- to large-scale range. All of these high-performance machines share the same basic
architecture, which distributes functions among a central processor, for computation, and auxiliary peripheral processors, which perform systems input/output and operating-system functions. (See Fig. 8.) For most of the CYBER-170 models the central processor is field-upgradable, and there is no software conversion necessary throughout the entire line. The six CYBER-170 models (171-176) are built with common components and exhibit a high degree of commonality in their basic configuration, which is composed of the central processor unit, the memory units, and the peripheral processors. All processors in the series are implemented in emitter-coupled logic integrated circuits, and the central memories are implemented in bipolar semiconductor logic. The Kiviat graph (Fig. 9) summarizes the CYBER-170 system performance. The models 171, 172, 173, and 174 feature a high-speed, unified arithmetic Central Processor Unit, which executes 18-bit and 60-bit operations, and a Compare Move Unit (CMU) to enhance the system's performance when it is working with variable-length character strings. The base Pc for the CYBER-170 series is the Model 171. A second processor, to increase system performance, is optional. A CMU is also available as an option. The Model 172 has a performance-enhanced Model 171 Pc. Again, one or two Pc's may be configured. The CMU is a standard feature with the Model 172. The Model 173 further enhances the performance level using the same basic Pc as Models 171 and 172; however, only one Pc
may be configured into the system. The CMU is again a standard feature. The Model 174 employs two Model 173 Pc's in a dual-processor configuration with each processor having a CMU. The Pc's for Models 175 and 176 have nine functional units, which allow concurrent execution of instructions. The Model 175 may have a standard or a performance-enhanced Pc. An instruction stack is also provided to allow fast retrieval of previously executed instructions. The Model 176 is an upgraded version of the Model 175 and in addition has an integrated interrupt system. The range of capabilities and performance between Models 171 and 176 is significant, and there is total compatibility among the six different processors. The lower-performance models are ideally suited as front-end systems for the more powerful Pc's.

The peripheral processor subsystem consists of 10, 14, 17, or 20 functionally independent, programmable computers (peripheral processing units, or PPUs), each with 4,096 twelve-bit words of MOS memory. These act as system-control computers and peripheral processors. All PPUs communicate with central memory, external equipment, and each other through 12 or 24 independent bidirectional input/output channels. These channels transfer data at the rate of two 12-bit words per microsecond. For the Model 176, optional high-speed PPUs are required to drive high-speed mass-storage devices, such as the CDC 7639/819 units, which transfer data at rates of approximately 40 million bits per second. A minimum of 4 high-speed PPUs are necessary, and a maximum of 13 may be connected to the system.

The central memory options for the CYBER-170 series range in size from 64 to 256 kilowords organized into 8 or 16 interleaved banks of 60-bit words. Depending on the model, the minor-cycle transfer rate of the 60-bit words is 50, 27.5, or 25 ns. However, because of interleaving, the memory operates at much higher apparent access rates. The central memory provides orderly data flow between various system elements. The Central Memory may be supplemented with additional extended memory, which is available in increments ranging from 0.5 to 2 megawords. The extended memory may be used for system storage, data collection, job swapping, or user programs.
The CRAY-1

Chapter 44 introduces the CRAY-1, a direct descendant of the CDC 6600 series. The similarities between the architectures are not surprising, owing to the fact that Seymour Cray was also the chief designer for the CRAY-1. Points of similarity with the CDC 6600 can be seen in the multiple functional units (address, scalar, vector, floating-point), the instruction buffer, and the field-length/limit registers for memory protection. The most important ISP improvement over the CDC 6600 is the addition of the vector data-type. A common feature of all the high-performance machines is the
extensive use of buffers to smooth the flow of data and to ensure that the Pc units never have to wait for data. There are buffers to smooth the flow of data to and from memory. There is also an instruction buffer, which provides three functions:

1 The prefetch of instructions in blocks from memory to smooth any mismatch between processor and memory subsystems. The memory boxes are usually n-way interleaved, so that n words can be fetched at once (a small sketch of interleaving follows this list).
2 An instruction look-ahead past branches, which fetches instructions down both branch paths so that no matter what the outcome of the branch, instructions will be available for execution.
3 If the instruction buffer is large enough, an ability to contain and repeatedly execute whole program segments at instruction buffer speed. Thus the instruction buffer can double in function as a cache.
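
The n-way interleaving mentioned in item 1 can be sketched conceptually; the bank count and addresses below are arbitrary, and the mapping is the generic address-modulo-bank scheme rather than any specific machine's.

    # Conceptual sketch of n-way interleaving: consecutive word addresses rotate
    # through n independent banks, so a block of n consecutive words touches
    # every bank once and can be fetched concurrently.
    N_BANKS = 8                      # illustrative value of n

    def bank_of(word_address):
        return word_address % N_BANKS

    block_start = 100
    banks_touched = {bank_of(a) for a in range(block_start, block_start + N_BANKS)}
    print(sorted(banks_touched))     # all 8 banks, so the block can be read in parallel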

The arithmetic instructions in the CRAY-1 only operate on the large array of registers:

1 Eight 64-bit scalar registers
2 Eight sets of vector registers, each set consisting of sixty-four 64-bit registers

These register files are meant to hold intermediate results until computations are completed. They also perform the function of a cache, except that the user or compiler must ensure data locality in the registers.
Figure 10 depicts the PMS structure of the CRAY-1, while Fig. 11 illustrates the internal Pc organization. Each of the 13 functional units is pipelined. Figure 12 shows the mass-storage subsystem, and Fig. 13 summarizes the CRAY-1 performance.
The Pc and memory are implemented in ECL logic. The processor has a 12.5-ns basic clock cycle time, and the memory has an access time of 50 ns. The Pc is capable of accessing a maximum of 1 million 64-bit words. The memory is expandable from 0.25 megaword to a maximum of 1 megaword. There are 12 input channels and 12 output channels in the input/output section. They connect to a Maintenance Control Unit (MCU), a mass-storage subsystem, and a variety of front-end systems or peripheral equipment. The MCU provides for system initialization and for monitoring system performance. The mass-storage subsystem has a maximal configuration that provides storage for 9.7 × 10⁹ eight-bit characters. The CRAY-1 Operating System, COS, is a multiprogramming batch system with up to 63 jobs. As of 1979, two languages were supported: FORTRAN and Assembler. The FORTRAN compiler analyzes the innermost loops of FORTRAN programs to detect vectorizable sequences and then generates code that takes advantage of the processor organization.
In the fall of 1979, Cray Research introduced the 12 models of the S series computers. Ranging from the S/250 through the S/4400, the models differed in amount of main memory (1/4 megaword to 4 megawords) and I/O configuration. Three models (S/250, S/500, S/1000) have 1/4, 1/2, and 1 megaword of memory each with no I/O subsystem. The nine remaining models have either 1, 2, or 4 megawords of memory with 2, 3, or 4 I/O processors. In the maximal I/O subsystem configuration, there are four I/O processors, 1 megaword of I/O Buffer Memory (maximum transfer rate 2,560 Mbit/s), sixteen Block Multiplexer Channels, and forty-eight 606-Mbyte disks (total storage 2.9 × 10⁹ bytes).
The first customer shipment of a CRAY-1 Computer System was in March 1976 to Los Alamos Scientific Laboratories (LASL). Other customer shipments as of 1979 include the National Center for Atmospheric Research, the Department of Defense (two systems), the National Magnetic Fusion Energy Computer Center, the European Centre for Medium Range Weather Forecasting, and an upgraded version to LASL.
The CRAY-1 processor's performance is 5 times that of a CDC 7600's or 15 times that of an IBM System/370 Model 168.

The TI ASC

The Texas Instruments Advanced Scientific Computer was initially planned for high-speed processing of seismic data. Therefore, vector data-types were also important for the ASC. The ASC shows some strong kinship to the CRAY machines, because it was built on the knowledge of the earlier CDC machines. But it also has some significant differences.
The most important problem was perceived as obtaining a high memory-processor bandwidth. Thus a Memory Control Unit (MCU) that could sustain a transfer rate of 640 megawords per second was designed. The MCU is actually a cross-point switch between eight processor ports and nine memory ports.
The ASC is controlled by eight peripheral processors (PP) executing operating-system code, as in the ten CDC 6600 peripheral processors. The PPs are implemented as virtual processors (VP), as in the CDC 6600. Each VP has its own register set (e.g., program counter, arithmetic, index, base, and instruction registers) sharing ROM, ALU, instruction decoder, and central memory buffers. Also, as in the CDC 6600, the PP's ISP is control-oriented and hence lacks the richer instruction set of the Central Processor (CP).
The CP has dedicated function registers: 16 base, 16 arithmetic, 8 index, and 8 for holding parameters for vector instructions. The CP employs multiple functional units, as do the CDC 6600 and the CRAY-1. However, the units are organized in a rigid order of succession called a pipeline. An ASC can support up to four pipelines of eight stations each. The instruction fetch/decode is
 

also pipelined in four stages (fetch, operand decode, effective address calculation, and operand fetch). Thus up to 36 instructions can be in various states of execution at the same time. The pipeline stages are usually smaller functions than the functions performed by the multiple functional units (e.g., exponent extract versus floating-point multiply). The pipeline also suffers if not all of the stages are to be used for a given instruction. An instruction that utilized stages 1, 2, 3, 5, 6, and 8 (floating add; see Fig. 7 in Chap. 45) would hold up an instruction that utilized only stages 1, 4, 5, and 8 (integer multiply) because of resource conflicts. The multiple functional units are only held up by the unavailability of operands.

The pipeline concept, especially as used in instruction fetch/decode, is a very effective method to improve processor performance. Since the hardware cost to implement instruction fetch/decode is small, the method is used almost universally, even appearing on some microprocessors. The design and control of operand/operator parallelism is much more complex. Chapter 19 outlines in detail the design of the IBM System/360 Model 91, which employed multiple functional units. The most effective way to gain high-speed parallelism is to have the operands in registers and use register-to-register instructions. The CRAY-1 does this through having only register-to-register arithmetic instructions and forcing the user or compiler to convert the applications program to register form. Since the System/360 Model 91 had to adhere to the System/360 ISP, hardware was added to dynamically convert System/360 instructions into a pseudoregister instruction set. Chapter 19 also nicely explains the problems faced (and a solution to them) by parallelism in multiple functional units. In particular, the following problems have to be addressed:
1 Condition code dependencies. When a branch instruction is encountered, some strategy must be established on instruction fetching until the instruction setting the condition code is completely executed.
 
 



2 Use of results before they are available. Because of the parallelism, several instructions are in partial states of execution at the same time. An instruction requiring the results of a previous instruction has to be held up until the results are available. If the whole instruction stream is held up until the one instruction completes execution, the parallelism in the multiple functional units is lost, as sketched below.
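A minimal sketch of the second problem, under invented encodings: instructions issue in order, a result becomes available only some cycles after its producer issues, and a consumer must stall until then.

# Toy illustration of "use of results before they are available".
# Results appear only LATENCY cycles after an instruction issues, so a
# dependent instruction must stall; independent ones need not.

LATENCY = 3
instructions = [
    ("MUL", "R1", ("R2", "R3")),   # R1 <- R2 * R3
    ("ADD", "R4", ("R1", "R5")),   # reads R1: must wait for the MUL
    ("SUB", "R6", ("R7", "R8")),   # independent of both
]

available_at = {r: 0 for r in ("R2", "R3", "R5", "R7", "R8")}  # cycle ready
cycle = 0
for op, dest, sources in instructions:
    start = max([cycle] + [available_at.get(s, 10**9) for s in sources])
    print(f"{op} {dest}: issues at cycle {start}")
    available_at[dest] = start + LATENCY
    cycle = start + 1          # strictly in-order issue, one per cycle

# MUL issues at cycle 0, ADD stalls until cycle 3, SUB then follows at 4.
# If SUB could issue out of order it would have gone at cycle 1, which is
# exactly the parallelism the multiple functional units try to preserve.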

Finally, Dean [1973] states that there are several advantages of TI's PP approach over CDC's:

As has been stated before, the PPU is the unit in which the operating system executes. Other manufacturers have attempted to execute the executive functions in peripheral units with less than satisfactory results. There are some significant differences in the ASC architecture which make this possible, however. First, the PPU can execute code directly from central memory. All of central memory can be addressed by each VP (virtual processor). To change the function of a VP requires only a branch to the new code. This feature greatly enhances the dynamic "balancing" of VP power to meet system requirements. When system programs are written in re-entrant form, one copy of the code suffices for several VP's, thereby reducing the memory requirements for system tasks. Second, a special set of 64 32-bit registers are built into the PPU. These registers (called communication registers) are literally the nerve center for the entire ASC system. Each VP has the ability to test and set individual bits in these registers in a single clock time. This allows for interprocessor control and communication on a dynamic basis. These registers also serve as control bits for data channels, context switch status, and act as I/O channels for some low-bandwidth devices (e.g., the operator's console). Finally, the PPU contains a small execute-only memory. Routines stored in this read-only memory are accessible to all VP's. The maximum size of this memory is 4096 words. It is used for three purposes. First, it is non-volatile and thus contains the bootstrap routines needed when power is first turned on. Second, it is fast. Being an integral part of the PPU, instructions may be fetched at the clock speed of the PPU. Finally, the ASC is basically a polling-oriented system. When a VP is in a polling loop, the instructions can be placed in this memory and not interfere with main memory traffic.

Hardware Technology

The preceding discussion has been concerned with some of the special features and architecture of the ASC system. A final word is in order concerning its physical construction. The Central Processor, the Peripheral Processor, and the Memory Control Unit are all fabricated of advanced emitter-coupled logic (ECL) integrated circuits. These circuits are interconnected on 17-layer multilayer circuit boards. Further, the back-panel wiring found on most large machines has been replaced by multilayer "motherboards" into which the circuit boards are plugged. The entire system is cooled by chilled water and forced air within each logic column and appears to be relatively insensitive to the ambient temperature.
Comparison of Maxicomputers
 
 
Bucy and Senne [1978] reported on nonlinear filter design that required the solution of nonlinear partial differential equations. The problem was solved on eight machines, including a general-purpose minicomputer (PDP-11/70); a microprogrammed, special-purpose auxiliary processor (AP-120B); machines with multiple functional units (CDC 6600, CDC 7600, CRAY-1, IBM S/370-168); machines with pipelines (CDC STAR-100, CRAY-1); and an array processor (Illiac IV). The benchmark consisted of the following floating-point computations: 53,341 adds, 28,864 multiplies, one division, and 32 exponentiations. The resultant computation rates and cost per operation are depicted in Table 1. The most cost-effective organization on a cost-per-operation basis is the functionally specialized AP-120B. However, when software development costs are considered, systems such as the CRAY-1 with vectorizing FORTRAN compilers may be the best long-term solution.
Summary

A general introductory description of the logical structure of SYSTEM/360 is given. In addition, the functional units, the principal registers and formats, and the basic addressing and sequencing principles of the system are indicated.
In the SYSTEM/360 logical structure, processing efficiency and versatility are served by multiple accumulators, binary addressing, bit-manipulation operations, automatic indexing, fixed and variable field lengths, decimal and hexadecimal radices, and floating-point as well as fixed-point arithmetic. The provisions for program interruption, storage protection, and flexible CPU states contribute to effective operation. Base-register addressing, the standard interface between channels and input/output control units, and the machine-language compatibility among models contribute to flexible configurations and to orderly system expansion.
SYSTEM/360 is distinguished by a design orientation toward very large memories and a hierarchy of memory speeds, a broad spectrum of manipulative functions, and a uniform treatment of input/output functions that facilitates communication with a diversity of input/output devices. The overall structure lends itself to program-compatible embodiments over a wide range of performance levels.
The system, designed for operation with a supervisory program, has comprehensive facilities for storage protection, program relocation, nonstop operation, and program interruption. Privileged instructions associated with a supervisory operating state are included. The supervisory program schedules and governs the execution of multiple programs, handles exceptional conditions, and coordinates and issues input/output (I/O) instructions. Reliability is heightened by supplementing solid-state components with built-in checking and diagnostic aids. Interconnection facilities permit a wide variety of possibilities for multisystem operation.
The purpose of this discussion is to introduce the functional units of the system, as well as formats, codes, and conventions essential to characterization of the system.
Functional Structure
The SYSTEM/360 structure schematically outlined in Fig. 1 has seven announced embodiments. Six of these, namely, Models 30, 40, 50, 60, 62, and 70, will be treated here. Where requisite I/O devices, optional features, and storage capacity are present, these six models are logically identical for valid programs that contain explicit time dependencies only. Hence, even though the allowable channels or storage capacity may vary from model to model (as discussed in Chap. 41), the logical structure can be discussed without reference to specific models.
Input/Output
Direct communication with a large number of low-speed terminals and other I/O devices is provided through a special multiplexor channel unit. Communication with high-speed I/O devices is accommodated by the selector channel units. Conceptually, the input/output system acts as a set of subchannels that operate concurrently with one another and the processing unit. Each subchannel, instructed by its own control-word sequence, can govern a data transfer operation between storage and a selected I/O device. A multiplexor channel can function either as one or as many subchannels; a selector channel always functions as a single subchannel. The control unit of each I/O device attaches to the channels via a standard mechanical-electrical-programming interface.
Processing
The processing unit has sixteen general purpose 32-bit registers used for addressing, indexing, and accumulating. Four 64-bit floating-point accumulators are optionally available. The inclusion of multiple registers permits effective use to be made of small high-speed memories. Four distinct types of processing are provided: logical manipulation of individual bits, character strings and fixed words; decimal arithmetic on digit strings; fixed-point binary arithmetic; and floating-point arithmetic. The processing unit, together with the central control function, will be referred to as the central processing unit (CPU). The basic registers and data paths of the CPU are shown in Fig. 2.
The CPU's of the various models yield a substantial range in performance. Relative to the smallest model (Model 30), the internal performance of the largest (Model 70) is approximately 50:1 for scientific computation and 15:1 for commercial data processing.
Control

Because of the extensive instruction set, SYSTEM/360 control is more elaborate than in conventional computers. Control functions include internal sequencing of each operation; sequencing from instruction to instruction (with branching and interruption); governing of many I/O transfers; and the monitoring, signaling, timing, and storage protection essential to total system operation. The control equipment is combined with a programmed supervisor, which coordinates and issues all I/O instructions, handles exceptional conditions, loads and relocates programs and data, manages storage, and supervises scheduling and execution of multiple programs. To a problem programmer, the supervisory program and the control equipment are indistinguishable. The functional structure of SYSTEM/360, like that of most computers, is most concisely described by considering the data formats, the types of manipulations performed on them, and the instruction formats by which these manipulations are specified.

Information Formats

The several SYSTEM/360 data formats are shown in Fig. 3. An 8-bit unit of information is fundamental to most of the formats. A consecutive group of n such units constitutes a field of length n. Fixed-length fields of length one, two, four, and eight are termed bytes, halfwords, words, and double words, respectively. In many instructions, the operation code implies one of these four fields as the length of the operands. On the other hand, the length is explicit in an instruction that refers to operands of variable length. The location of a stored field is specified by the address of the leftmost byte of the field. Variable-length fields may start on any byte location, but a fixed-length field of two, four, or eight bytes must have an address that is a multiple of 2, 4, or 8, respectively. Some of the various alignment possibilities are apparent from Fig. 3. Storage addresses are represented by binary integers in the system. Storage capacities are always expressed as numbers of bytes.
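As a small illustration of the alignment rule just stated, the helper below (the names are ours) checks whether a fixed-length field of a given kind may start at a given address.

# Alignment rule from the text: a fixed-length field of two, four, or
# eight bytes must start on an address that is a multiple of its length;
# variable-length fields and single bytes may start anywhere.

ALIGNMENT = {"byte": 1, "halfword": 2, "word": 4, "doubleword": 8}

def is_aligned(address, kind):
    return address % ALIGNMENT[kind] == 0

print(is_aligned(4096, "doubleword"))  # True  (4096 is a multiple of 8)
print(is_aligned(4098, "word"))        # False (not a multiple of 4)
print(is_aligned(4098, "halfword"))    # True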
 
Processing Operations

The SYSTEM/360 operations fall into four classes: fixed-point arithmetic, floating-point arithmetic, logical operations, and decimal arithmetic. These classes differ in the data formats used, the registers involved, the operations provided, and the way the field length is stated.

Fixed-Point Arithmetic

The basic arithmetic operand is the 32-bit fixed-point binary word. Halfword operands may be specified in most operations for the sake of improved speed or storage utilization. Some products and all dividends are 64 bits long, using an even-odd register pair. Because the 32-bit words accommodate the 24-bit address, the entire fixed-point instruction set, including multiplication, division, shifting, and several logical operations, can be used in address computation. A two's complement notation is used for fixed-point operands. Additions, subtractions, multiplications, divisions, and comparisons take one operand from a register and another from either a register or storage. Multiple-precision arithmetic is made convenient by the two's complement notation and by recognition of the carry from one word to another. A pair of conversion instructions, CONVERT TO BINARY and CONVERT TO DECIMAL, provide transition between decimal and binary radices without the use of tables. Multiple-register loading and storing instructions facilitate subroutine switching.

Floating-Point Arithmetic

Floating-point numbers may occur in either of two fixed-length formats: short or long. These formats differ only in the length of the fractions, as indicated in Fig. 3.
The fraction of a floating-point number is expressed in 4-bit hexadecimal (base 16) digits. In the short format, the fraction has six hexadecimal digits; in the long format, the fraction has 14 hexadecimal digits. The short length is equivalent to seven decimal places of precision. The long length gives up to 17 decimal places of precision, thus eliminating most requirements for double-precision arithmetic. The radix point of the fraction is assumed to be immediately to the left of the high-order fraction digit. To provide the proper magnitude for the floating-point number, the fraction is considered to be multiplied by a power of 16. The characteristic portion, bits 1 through 7 of both formats, is used to indicate this power. The characteristic is treated as an excess-64 number with a range from -64 through +63, and permits representation of decimal numbers with magnitudes in the range of 10⁻⁷⁸ to 10⁷⁵. Bit position 0 in either format is the fraction sign, S. The fraction of negative numbers is carried in true form. Floating-point operations are performed with one operand from a register and another from either a register or storage. The result, placed in a register, is generally of the same length as the operands.
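A sketch of the short floating-point format as just described: sign bit, excess-64 characteristic interpreted as a power of 16, and a six-hexadecimal-digit fraction. The decoder below is a simplified illustration, not the machine's arithmetic algorithm.

# Decode a 32-bit short-format floating-point word as described above:
# bit 0 = sign, bits 1-7 = characteristic (excess 64, power of 16),
# bits 8-31 = six-hexadecimal-digit fraction, radix point to its left.

def decode_short_float(word):
    sign = -1.0 if (word >> 31) & 1 else 1.0
    characteristic = (word >> 24) & 0x7F           # excess-64 exponent of 16
    fraction = (word & 0xFFFFFF) / float(1 << 24)  # 0 <= fraction < 1
    return sign * fraction * 16.0 ** (characteristic - 64)

# Example: +1.0 is fraction 0.1 (hexadecimal) = 1/16 with characteristic 65,
# i.e. 1/16 * 16**(65-64) = 1.0.
print(decode_short_float(0x41100000))   # 1.0
print(decode_short_float(0xC1100000))   # -1.0
print(decode_short_float(0x40800000))   # 0.5  (8/16 * 16**0)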
Logical Operations

Operations for comparison, translation, editing, bit testing, and bit setting are provided for processing logical fields of fixed and variable lengths. Fixed-length logical operands, which consist of one, four, or eight bytes, are processed from the general registers. Logical operations can also be performed on fields of up to 256 bytes, in which case the fields are processed from left to right, one byte at a time. Moreover, two powerful scanning instructions permit byte-by-byte translation and testing via tables. An important special case of variable-length logical operations is the one-byte field, whose individual bits can be tested, set, reset, and inverted as specified by an 8-bit mask in the instruction.

Character Codes

Any 8-bit character set can be processed, although certain restrictions are assumed in the decimal arithmetic and editing operations. However, all character-set-sensitive I/O equipment assumes either the Extended Binary-Coded-Decimal Interchange Code (EBCDIC) of Fig. 4 or the code of Fig. 5, which is an eight-bit extension of a seven-bit code proposed by the International Standards Organization.

Decimal Arithmetic

Decimal arithmetic can improve performance for processes requiring few computational steps per datum between the source input and the output. In these cases, where radix conversion from decimal to binary and back to decimal is not justified, the use of registers for intermediate results usually yields no advantage over storage-to-storage processing. Hence, decimal arithmetic is provided in SYSTEM/360 with operands as well as results located in storage, as in the IBM 1400 series. Decimal arithmetic includes addition, subtraction, multiplication, division, and comparison.
The decimal digits 0 through 9 are represented in the 4-bit binary-coded-decimal form by 0000 through 1001, respectively. The patterns 1010 through 1111 are not valid as digits and are interpreted as sign codes: 1011 and 1101 represent a minus, the other four a plus. The sign patterns generated in decimal arithmetic depend upon the character set preferred. For EBCDIC, the patterns are 1100 and 1101; for the code of Fig. 5, they are 1010 and 1011. The choice between the two codes is determined by a mode bit.
Decimal digits, packed two to a byte, appear in fields of variable length (from 1 to 16 bytes) and are accompanied by a sign in the rightmost four bits of the low-order byte. Operand fields can be located on any byte boundary, and can have lengths up to 31 digits and sign. Operands participating in an operation have independent lengths. Negative numbers are carried in true form. Instructions are provided for packing and unpacking decimal numbers. Packing of digits leads to efficient use of storage, increased arithmetic performance, and improved rates of data transmission. For purely decimal fields, for example, a 90,000-byte/second tape drive reads and writes 180,000 digits/second.
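A minimal sketch of the packed layout just described, assuming the EBCDIC-preferred sign codes (plus 1100, minus 1101); the helper names are ours.

# Pack a decimal number into the variable-length packed format described
# above: digits two to a byte, sign in the low four bits of the last byte.
PLUS, MINUS = 0b1100, 0b1101          # EBCDIC-preferred sign codes

def pack_decimal(value):
    digits = [int(d) for d in str(abs(value))]
    if len(digits) % 2 == 0:          # need an odd number of digits so the
        digits.insert(0, 0)           # sign nibble completes the last byte
    nibbles = digits + [MINUS if value < 0 else PLUS]
    return bytes(nibbles[i] << 4 | nibbles[i + 1] for i in range(0, len(nibbles), 2))

def unpack_decimal(packed):
    nibbles = [n for b in packed for n in (b >> 4, b & 0xF)]
    sign = -1 if nibbles[-1] in (0b1101, 0b1011) else 1
    return sign * int("".join(str(d) for d in nibbles[:-1]))

p = pack_decimal(-1234)
print(p.hex())               # '01234d'  -> digits 0,1,2,3,4 and a minus sign
print(unpack_decimal(p))     # -1234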
Instruction Formats
Instruction formats contain one, two, or three halfwords, depending upon the number of storage addresses necessary for the operation. If no storage address is required of an instruction, one halfword suffices. A two-halfword instruction specifies one address; a three-halfword instruction specifies two addresses. All instructions must be aligned on halfword boundaries.
The five basic instruction formats, denoted by the format
mnemonics RR, RX, RS, SI, and SS are shown in Fig. 6. RR denotes a register-to-register operation, RX a register and indexed-storage operation, RS a register and storage operation, SI a storage and immediate-operand operation, and SS a storage-to-storage operation.
In each format, the first instruction halfword consists of two parts. The first byte contains the operation code. The length and format of an instruction are indicated by the first two bits of the operation code.
The second byte is used either as two 4-bit fields or as a single 8-bit field. This byte is specified from among the following:
Four-bit operand register designator (R)
Four-bit index register designator (X)
Four-bit mask (M)
Four-bit field length specification (L)
Eight-bit field length specification
Eight-bit byte of immediate data (I)
The second and third halfwords each specify a 4-bit base register designator (B), followed by a 12-bit displacement (D).
Addressing

An effective storage address E is a 24-bit binary integer given, in the typical case, by E = B + X + D, where B and X are 24-bit integers from general registers identified by fields B and X, respectively, and the displacement D is a 12-bit integer contained in every instruction that references storage.
The base B can be used for static relocation of programs and data. In record processing, the base can identify a record; in array calculations, it can specify the location of an array. The index X can provide the relative address of an element within an array. Together, B and X permit double indexing in array processing.
The displacement provides for relative addressing of up to 4095 bytes beyond the element or base address. In array calculations, the displacement can identify one of many items associated with an element. Thus, multiple arrays whose indices move together are best stored in an interleaved manner. In the processing of records, the displacement can identify items within a record.
In forming an effective address, the base and index are treated as unsigned 24-bit positive binary integers and the displacement as a 12-bit positive binary integer. The three are added as 24-bit binary numbers, ignoring overflow. Since every address is formed with the aid of a base, programs can be readily and generally relocated by changing the contents of base registers.
A zero base or index designator implies that a zero quantity must be used in forming the address, regardless of the contents of general register 0. A displacement of zero has no special significance. Initialization, modification, and testing of bases and indices can be carried out by fixed-point instructions, or by BRANCH AND LINK, BRANCH ON COUNT, or BRANCH ON INDEX instructions. LOAD EFFECTIVE ADDRESS provides not only a convenient housekeeping operation, but also, when the same register is specified for result and operand, an immediate register-incrementing operation.
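A sketch of the effective-address rule described above: base, index, and displacement added as unsigned quantities modulo 2^24, with a designator of zero contributing zero regardless of general register 0. The register-file representation is illustrative.

# Effective address: E = B + X + D, carried out as a 24-bit unsigned add
# (overflow ignored).  A designator of 0 contributes zero, regardless of
# what general register 0 actually contains.

MASK_24 = (1 << 24) - 1

def effective_address(regs, b_field, x_field, displacement):
    base  = regs[b_field] & MASK_24 if b_field != 0 else 0
    index = regs[x_field] & MASK_24 if x_field != 0 else 0
    return (base + index + (displacement & 0xFFF)) & MASK_24

regs = [0] * 16
regs[0]  = 0x123456        # ignored whenever a designator of 0 is written
regs[12] = 0x004000        # base register: start of a record
regs[7]  = 0x000020        # index register: element within an array

print(hex(effective_address(regs, 12, 7, 0x010)))  # 0x4030
print(hex(effective_address(regs, 0, 0, 0x010)))   # 0x10 (register 0 ignored)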

Sequencing

Normally, the CPU takes instructions in sequence. After an instruction is fetched from a location specified by the instruction counter, the instruction counter is increased by the number of bytes in the instruction.
Conceptually, all halfwords of an instruction are fetched from storage after the preceding operation is completed and before execution of the current operation, even though physical storage word size and overlap of instruction execution with storage access may cause the actual instruction fetching to be different. Thus, an instruction can be modified by the instruction that immediately precedes it in the instruction stream, and cannot effectively modify itself during execution.
Branching


Most branching is accomplished by a single BRANCH ON CONDITION operation that inspects a 2-bit condition register. Many of the arithmetic, logical, and I/O operations indicate an outcome by setting the condition register to one of its four possible states. Subsequently a conditional branch can select one of the states as a criterion for branching. For example, the condition code reflects such conditions as non-zero result, first operand high, operands equal, overflow, channel busy, zero, etc. Once set, the condition register remains unchanged until modified by an instruction execution that reflects a different condition code.
The outcome of address arithmetic and counting operations can be tested by a conditional branch to effect loop control. Two instructions, BRANCH ON COUNT and BRANCH ON INDEX, provide for one-instruction execution of the most common arithmetic-test combinations.
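A sketch of the decrement-and-test pattern that BRANCH ON COUNT supports, paraphrased in Python rather than machine code; the register names are invented.

# Loop control in the style of BRANCH ON COUNT: decrement a register,
# branch back while the result is non-zero.  One instruction does both
# the arithmetic and the test.

def branch_on_count(regs, r):
    """Decrement register r; return True if the branch should be taken."""
    regs[r] = (regs[r] - 1) & 0xFFFFFFFF
    return regs[r] != 0

regs = {"R3": 5}
iterations = 0
while True:
    iterations += 1              # ...loop body would go here...
    if not branch_on_count(regs, "R3"):
        break
print(iterations)                # 5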
Program Status Word


A program status word (PSW), a double word having the format shown in Fig. 7, contains information required for proper execution of a given program. A PSW includes an instruction address, condition code, and several mask and mode fields. The active or controlling PSW is called the current PSW. By storing the current PSW during an interruption, the status of the interrupted program is preserved.
Interruption


Five classes of interruption conditions are distinguished: input/output, program, supervisor call, external, and machine check.
For each class, two PSW's, called old and new, are maintained in the main-storage locations shown in Table 1. An interruption in a given class stores the current PSW as an old PSW and then takes the corresponding new PSW as the current PSW. If, at the conclusion of the interruption routine, old and current PSW's are interchanged, the system can be restored to its prior state and the interrupted routine can be continued.
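A sketch of that old/new PSW exchange for the external-interruption class, using the byte locations listed in Table 1 (old PSW at 24, new PSW at 88); the dictionary standing in for main storage is purely illustrative.

# Old/new PSW exchange on an interruption.  On an external interruption
# the current PSW is stored at byte 24 (external old PSW) and the double
# word at byte 88 (external new PSW) becomes the current PSW; see Table 1.

OLD_PSW, NEW_PSW = 24, 88          # external-interruption locations

storage = {NEW_PSW: "PSW of the external-interrupt handler"}
current_psw = "PSW of the interrupted problem program"

def take_external_interrupt():
    global current_psw
    storage[OLD_PSW] = current_psw      # preserve the interrupted program
    current_psw = storage[NEW_PSW]      # enter the handler

def return_from_interrupt():
    global current_psw
    current_psw = storage[OLD_PSW]      # restore the prior state

take_external_interrupt()
print(current_psw)          # handler PSW is now current
return_from_interrupt()
print(current_psw)          # interrupted program resumes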
The system mask, program mask, and machine-check mask bits in the PSW may be used to control certain interruptions. When masked off, some interruptions remain pending while others are merely ignored. The system mask can keep I/O and external interruptions pending, the program mask can cause four of the 15 program interruptions to be ignored, and the machine-check mask can cause machine-check interruptions to be ignored. Other interruptions cannot be masked off.
Appropriate CPU response to a special condition in the channels and I/O units is facilitated by an I/O interruption. The addresses of the channel and I/O unit involved are recorded in the old PSW. Related information is preserved in a channel status word that is stored as a result of the interruption.
Unusual conditions encountered in a program create program interruptions. Eight of the fifteen possible conditions involve overflows, improper divides, lost significance, and exponent underflow. The remaining seven deal with improper addresses,
attempted execution of privileged instructions, and similar conditions.
A supervisor-call interruption results from execution of the instruction SUPERVISOR CALL. Eight bits from the instruction format are placed in the interruption code of the old PSW, permitting a message to be associated with the interruption. SUPERVISOR CALL permits a problem program to switch CPU control back to the supervisor.
Through an external interruption, a CPU can respond to signals from the interruption key on the system control panel, the timer, other CPU's, or special devices. The source of the interruption is identified by an interruption code in bits 24 through 31 of the PSW.

Table 1 Permanent Storage Assignments

Address   Byte length   Purpose
0         8             Initial program loading PSW
8         8             Initial program loading CCW 1
16        8             Initial program loading CCW 2
24        8             External old PSW
32        8             Supervisor call old PSW
40        8             Program old PSW
48        8             Machine check old PSW
56        8             Input/output old PSW
64        8             Channel status word
72        4             Channel address word
76        4             Unused
80        4             Timer
84        4             Unused
88        8             External new PSW
96        8             Supervisor call new PSW
104       8             Program new PSW
112       8             Machine check new PSW
120       8             Input/output new PSW
128                     Diagnostic scan-out area †

† The size of the diagnostic scan-out area is configuration dependent.
The occurrence of a machine check (if not masked off) terminates the current instruction, initiates a diagnostic procedure, and subsequently effects a machine-check interruption. A machine check is occasioned only by a hardware malfunction; it cannot be caused by invalid data or instructions.
Interrupt Priority
Interruption requests are honored between instruction executions. When several requests occur during execution of an instruction, they are honored in the following order: (1) machine check, (2) program or supervisor call, (3) external, and (4) input/output. Because the program and supervisor-call interruptions are mutually exclusive, they cannot occur at the same time.
If a machine-check interruption occurs, no other interruptions can be taken until this interruption is fully processed. Otherwise, the execution of the CPU program is delayed while PSW’s are appropriately stored and fetched for each interruption. When the last interruption request has been honored, instruction execution is resumed with the PSW last fetched. An interruption subroutine is then serviced for each interruption in the order (1) input/output, (2) external, and (3) program or supervisor call.
Program Status
Overall CPU status is determined by four alternatives: (1) stopped versus operating state, (2) running versus waiting state, (3) masked versus interruptable state, and (4) supervisor versus problem state.
In the stopped state, which is entered and left by manual procedure, instructions are not executed, interruptions are not accepted, and the timer is not updated. In the operating state, the CPU is capable of executing instructions and of being interrupted.
In the running state, instruction fetching and execution proceed in the normal manner. The wait state is typically entered by the program to await an interruption, for example, an I/O interruption or operator intervention from the console. In the wait state, no instructions are processed, the timer is updated, and I/O and external interruptions are accepted unless masked. Running versus waiting is determined by the setting of a bit in the current PSW.
The CPU may be interruptable or masked for the system, program, and machine interruptions. When the CPU is interruptable for a class of interruptions, these interruptions are accepted. When the CPU is masked, the system interruptions remain pending, but the program and machine-check interruptions are ignored. The interruptable states of the CPU are changed by altering mask bits in the current PSW.
In the problem state, processing instructions are valid, but all I/O instructions and a group of control instructions are invalid. In the supervisor state, all instructions are valid. The choice of problem or supervisor state is determined by a bit in the PSW.

Supervisory Facilities
Timer
 
 
A timer word in main storage location 80 is counted down at a rate of 50 or 60 cycles per second, depending on power line frequency. The word is treated as a signed integer according to the rules of fixed-point arithmetic. An external interrupt occurs when the value of the timer word goes from positive to negative. The full cycle time of the timer is 15.5 hours.
As an interval timer, the timer may be used to measure elapsed time over relatively short intervals. The timer can be set by a supervisory-mode program to any value at any time.
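A sketch of the behaviour just described: a word decremented at line frequency, with an external interruption signalled when it passes from positive to negative. The tick loop and names are illustrative.

# Interval timer: the word at location 80 is counted down 50 or 60 times
# a second; when it goes from positive to negative an external
# interruption is signalled.

LINE_FREQUENCY = 60            # ticks per second (50 on 50-Hz power)

def run_timer(initial_value, seconds):
    timer = initial_value
    for _ in range(seconds * LINE_FREQUENCY):
        previous, timer = timer, timer - 1
        if previous >= 0 and timer < 0:
            return "external interruption"
    return "still counting"

# Set the timer to expire in roughly half a second.
print(run_timer(initial_value=LINE_FREQUENCY // 2, seconds=1))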
Direct Control
 
 
Two instructions, READ DIRECT and WRITE DIRECT, provide for the transfer of a single byte of information between an external device and the main storage of the system. These instructions are intended for use in synchronizing CPU's and special external devices.
Storage Protection
 
 
For protection purposes, main storage is divided into blocks of 2,048 bytes each. A four-bit storage key is associated with each block. When a store operation is attempted by an instruction, the protection key of the current PSW is compared with the storage key of the affected block. When storing is specified by a channel operation, a protection key supplied by the channel is used as the comparand. The keys are said to match if equal or if either is zero. A storage key is not part of addressable storage, and can be changed only by privileged instructions. The protection key of the CPU program is held in the current PSW. The protection key of a channel is recorded in a status word that is associated with the channel operation.
When a CPU operation causes a protection mismatch, its execution is suppressed or terminated, and the program execution is altered by an interruption. The protected storage location always remains unchanged. Similarly, protection mismatch due to an I/O operation terminates data transmission in such a way that the protected storage location remains unchanged.
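A sketch of the store-protection check described above; the block-key map and function names are ours.

# Store protection: main storage is divided into 2,048-byte blocks, each
# carrying a 4-bit storage key.  A store is permitted when the protection
# key of the requester (CPU program or channel) matches the block key;
# keys match if they are equal or if either is zero.

BLOCK_SIZE = 2048
storage_keys = {}                       # block number -> 4-bit key

def set_storage_key(address, key):
    storage_keys[address // BLOCK_SIZE] = key & 0xF

def store_permitted(address, protection_key):
    block_key = storage_keys.get(address // BLOCK_SIZE, 0)
    return protection_key == block_key or protection_key == 0 or block_key == 0

set_storage_key(0x4000, 5)
print(store_permitted(0x4100, 5))   # True  (same block, matching key)
print(store_permitted(0x4100, 3))   # False (mismatch: suppressed, interruption)
print(store_permitted(0x4100, 0))   # True  (a key of zero always matches)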
Multisystem Operation
 
 
Communication between CPU's is made possible by shared control units, interconnected channels, or shared storage. Multisystem operation is supported by provisions for automatic relocation, indication of malfunctions, and CPU initialization.
Automatic relocation applies to the first 4,096 bytes of storage, an area that contains all permanent storage assignments and usually has special significance for supervisory programs. The relocation is accomplished by inserting a 12-bit prefix in each address whose high-order 12 bits are zero. Two manually set prefixes permit the use of an alternate area when storage malfunction occurs; the choice between prefixes is preserved in a trigger that is set during initial program loading.
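A sketch of the prefixing rule: a 24-bit address whose high-order 12 bits are zero has the 12-bit prefix inserted in their place; the example prefix value is invented.

# Automatic relocation of the first 4,096 bytes: when the high-order
# 12 bits of a 24-bit address are zero, the 12-bit prefix is inserted,
# so the permanently assigned low storage can live in an alternate area.

def apply_prefix(address, prefix):
    if (address >> 12) == 0:                 # high-order 12 bits all zero
        return (prefix & 0xFFF) << 12 | address
    return address

PREFIX = 0x0A3                               # manually set 12-bit prefix (example)
print(hex(apply_prefix(0x000058, PREFIX)))   # 0xa3058 (low-storage reference moved)
print(hex(apply_prefix(0x014020, PREFIX)))   # 0x14020 (unchanged)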
To alert one CPU to the possible malfunction of another, a machine-check signal from a given CPU can serve as an external interruption to another CPU. By another special provision, initial program loading of a given CPU can be initiated by a signal from another CPU.

Input/Output
Devices and Control Units
Input/output devices include card equipment, magnetic tape units, disk storage, drum storage, typewriter-keyboard devices, printers, teleprocessing devices, and process control equipment. The I/O devices are regulated by control units, which provide the electrical, logical, and buffering capabilities necessary for I/O device operation. From the programming point of view, most control-unit and I/O device functions are indistinguishable. Sometimes the control unit is housed with an I/O device, as in the case of the printer.
A control unit functions only with those I/O devices for which it is designed, but all control units respond to a standard set of
signals from the channel. This control-unit-to-channel connection, called the I/O interface, enables the CPU to handle all I/O operations with only four instructions.

I/O Instructions

Input/output instructions can be executed only while the CPU is in the supervisor state. The four I/O instructions are START I/O, HALT I/O, TEST CHANNEL, and TEST I/O. START I/O initiates an I/O operation; its address field specifies a channel and an I/O device. If the channel facilities are free, the instruction is accepted and the CPU continues its program. The channel independently selects the specified I/O device. HALT I/O terminates a channel operation. TEST CHANNEL sets the condition code in the PSW to indicate the state of the channel addressed by the instruction. The code then indicates one of the following conditions: channel available, interruption condition in channel, channel working, or channel not operational. TEST I/O sets the PSW condition code to indicate the state of the addressed channel, subchannel, and I/O device.

Channels

Channels provide the data path and control for I/O devices as they communicate with main storage. In the multiplexor channel, the single data path can be time-shared by several low-speed devices (card readers, punches, printers, terminals, etc.) and the channel has the functional character of many subchannels, each of which services one I/O device at a time. On the other hand, the selector channel, which is designed for high-speed devices, has the functional character of a single subchannel. All subchannels respond to the same I/O instructions. Each can fetch its own control-word sequence, govern the transfer of data and control signals, count record lengths, and interrupt the CPU on exceptions. Two modes of operation, burst and multiplex, are provided for multiplexor channels. In burst mode, the channel facilities are monopolized for the duration of data transfer to or from a particular I/O device. The selector channel functions only in the burst mode. In multiplex mode, the multiplexor channel sustains several simultaneous I/O operations: bytes of data are interleaved and then routed between selected I/O devices and desired locations in main storage. At the conclusion of an operation launched by START I/O or TEST I/O, an I/O interruption occurs. At this time a channel status word (CSW) is stored in location 64. Figure 8 shows the CSW format. The CSW provides information about the termination of the I/O operation. Successful execution of START I/O causes the channel to fetch a channel address word from main-storage location 72. This word specifies the storage-protection key that governs the I/O operation, as well as the location of the first eight bytes of information that the channel fetches from main storage. These 64 bits comprise a channel command word (CCW). Figure 9 shows the CCW format.

Channel Program

One or more CCW's make up the channel program that directs channel operations. Each CCW points to the next one to be fetched, except for the last in the chain, which so identifies itself. Six channel commands are provided: read, write, read backward, sense, transfer in channel, and control. The read command defines an area in main storage and causes a read operation from the selected I/O device. The write command causes data to be written by the selected device. The read-backward command is akin to the read command, but the external medium is moved in the opposite direction and bytes read backward are placed in descending main storage locations.
The control command contains information, called an order, that is used to control the selected I/O device. Orders, peculiar to the particular I/O device in use, can specify such functions as rewinding a tape unit, searching for a particular track in disk
storage, or line skipping on a printer. In a functional sense, the CPU executes I/O instructions, the channels execute commands, and the control units and devices execute orders. The sense command specifies a main storage location and transfers one or more bytes of status information from the selected control unit. It provides details concerning the selected I/O device, such as a stacker-full condition of a card reader or a file-protected condition of a magnetic-tape reel. A channel program normally obtains CCW's from a consecutive string of storage locations. The string can be broken by a transfer-in-channel command that specifies the location of the next CCW to be used by the channel. External documents, such as punched cards or magnetic tape, may carry CCW's that can be used by the channel to govern the reading of the documents. The input/output interruptions caused by termination of an I/O operation, or by operator intervention at the I/O device, enable the CPU to provide appropriate programmed response to conditions as they occur in I/O devices or channels. Conditions responsible for I/O interruption requests are preserved in the I/O devices or channels until recognized by the CPU. During execution of START I/O, a command can be rejected by a busy condition, program check, etc. Rejection is indicated in the condition code of the PSW, and additional detail on the conditions that precluded initiation of the I/O operation is provided in a CSW.

Manual Control

The need for manual control is minimal because of the design of the system and supervisory program. A control panel provides the ability to reset the system; store and display information in main storage, in registers, and in the PSW; and load initial program information. After an input device is selected with the load unit switches, depressing a load key causes a read from the selected input device. The six words of information that are read into main storage provide the PSW and the CCW's required for subsequent operation.

Instruction Set

The SYSTEM/360 instructions, classified by format and function, are displayed in Table 2. Operation codes and mnemonic abbreviations are also shown. With the previously described formats in mind, much of the generality provided by the system is apparent in this listing.
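Returning to the channel program described above, the sketch below walks a chain of channel command words; the named-tuple representation and the sample commands are our illustration of the chaining idea, not the CCW bit layout of Fig. 9.

# Walking a channel program: each CCW names a command, a storage area,
# and a count; the last CCW in the chain identifies itself as such, at
# which point the channel ends the operation and causes an interruption.

from collections import namedtuple

CCW = namedtuple("CCW", "command address count last")

channel_program = [
    CCW("control", 0x0000, 0, False),   # an order, e.g. seek or rewind
    CCW("read", 0x2000, 80, False),     # read 80 bytes into 0x2000
    CCW("read", 0x2050, 80, True),      # read 80 more; last in the chain
]

def run_channel_program(ccws):
    i = 0
    while True:
        ccw = ccws[i]
        print(f"execute {ccw.command:8s} area={ccw.address:#06x} count={ccw.count}")
        if ccw.last:
            break                        # chain ends; CSW stored, I/O interruption
        i += 1                           # otherwise fetch the next CCW

run_channel_program(channel_program)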
 
  Table 2 (opposite) System/360 instructions
 
 
Summary

The performance range desired of SYSTEM/360 is obtained by variations in the storage, processing, control, and channel functions of the several models. The systematic variations in speed, size, and degree of simultaneity that characterize the functional components and elements of each model are discussed.

A primary goal in the SYSTEM/360 design effort was a wide range of processing unit performances coupled with complete program compatibility. In keeping with this goal, the logical structure of the resultant system lends itself to a wide choice of components and techniques in the engineering of models for desired performance levels.
This paper discusses basic choices made in implementing six SYSTEM/360 models spanning a performance range of fifty to one. It should be emphasized that the problems of model implementation were studied throughout the design period, and many of the decisions concerning logical structure were influenced by difficulties anticipated or encountered in implementation.

Performance Adjustment
The choices made in arriving at the desired performances fall into four areas:
Main storage
Central processing unit (CPU) registers and data paths
Sequence control
Input/output (I/O) channels
Each of the adjustable parameters of these areas can be subordinated, for present purposes, to one of three general factors: basic speed, size, and degree of simultaneity.
Main Storage
Storage Speed and Size
The interaction of the general factors is most obvious in the area of main storage. Here the basic speeds vary over a relatively small range: from a 2.5-μsec cycle for the Model 40 to a 1.0-μsec cycle for Models 62 and 70. However, in combination with the other two factors, a 32:1 range in overall storage data rate is obtained, as shown in Table 1.
Most important of the three factors is size. The width of main storage, i.e., the amount of data obtained with one storage access, ranges from one byte for the Model 30, two bytes for the Model 40, and four bytes for the Model 50, to eight bytes for Models 60, 62, and 70.
Another size factor, less direct in its effect, is the total number of bytes in main storage, which can make a large difference in system throughput by reducing the number of references to external storage media. This number ranges from a minimum of 8,192 bytes on Model 30 to a maximum of 524,288 bytes on Models 60, 62, and 70. An option of up to eight million more bytes of slower-speed, large-capacity core storage can further increase the throughput in some applications.
Interleaved Storage
Simultaneity in the core storage of Models 60 and 70 is obtained by overlapping the cycles of two storage units. Addresses are staggered in the two units, and a series of requests for successive words activates the two units alternately, thus doubling the maximum rate. For increased system performance, this technique is less effective than doubling the basic speed of a single unit, since the access time to a single word is not improved, and successive references frequently occur to the same unit. This is illustrated by comparing the performances of Models 60 and 62, whose only difference is the choice between two overlapped 2.0-μsec storage units and a single 1.0-μsec storage unit, respectively. The performance of Model 62 is approximately 1.5 times that of Model 60.
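A crude model of that two-way overlap: even addresses in one unit, odd addresses in the other, each unit busy for a full cycle after an access. The issue rate and timing constants are invented, but the qualitative result (successive words roughly double the rate, while repeated references to one unit do not) matches the description above.

# Two-way interleaved storage, crudely modelled.  Even addresses live in
# unit 0, odd addresses in unit 1; a unit is busy for CYCLE after each
# access, so successive words overlap but repeated use of one unit does not.

CYCLE = 2.0                      # microseconds per storage-unit cycle

def total_time(addresses):
    busy_until = [0.0, 0.0]      # when each of the two units is next free
    t = 0.0
    for a in addresses:
        unit = a % 2
        start = max(t, busy_until[unit])
        busy_until[unit] = start + CYCLE
        t = start + 0.5          # assumed: a new request every 0.5 microseconds
    return max(busy_until)

print(total_time(range(8)))                       # successive words: units alternate
print(total_time([0, 2, 4, 6, 8, 10, 12, 14]))    # all even: one unit, no overlap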

CPU Registers and Data Paths
Circuit Speed
SYSTEM/360 has three families of logic circuits, as shown in Table 2, each using the same solid-logic technology. One family, having a nominal delay of 30 nsec per logical stage or level, is used in the data paths of Models 30, 40, and 50. A second and faster family with a nominal delay of 10 nsec per level is used in Models 60 and 62. The fastest family, with a delay of 6 nsec, is used in Model 70.
The fundamental determinant of CPU speed is the time required to take data from the internal registers, process the data through the adder or other logical unit, and return the result to a register. This cycle time is determined by the delay per logical circuit level and the number of levels in the register-to-adder path, the adder, and the adder-to-register return path.
Table 1 System/360 Main Storage Characteristics
                                    Model 30   Model 40   Model 50   Model 60   Model 62   Model 70
Cycle time (μsec)                   2.0        2.5        2.0        2.0        1.0        1.0
Width (bytes)                       1          2          4          8          8          8
Interleaved access                  no         no         no         yes        no         yes
Maximum data rate (bytes/μsec)      0.5        0.8        2.0        8.0        8.0        16.0
Minimum storage size (bytes)        8,192      16,384     65,536     131,072    262,144    262,144
Maximum storage size (bytes)        65,536     262,144    262,144    524,288    524,288    524,288
Large-capacity storage attachable   no         no         yes        yes        yes        yes
The number of levels varies because of the trade-off that can usually be made between the number of circuit modules and the number of logical levels. Thus, the cycle time of the system varies from 1.0 μsec for Model 30 (with 30-nsec circuits, a relatively small number of modules, and more logic levels) through 0.5 μsec for Model 50 (also with 30-nsec circuits, but with more modules and fewer levels) to 0.2 μsec for Model 70 (with 6-nsec circuits).
Local Storage
The speed of the CPU depends also on the speed of the general and floating-point registers. In Model 30, these registers are located in an extension to the main core storage and have a read-write time of 2.0 μsec. In Model 40, the registers are located in a small core-storage unit, called local storage, with a read-write time of 1.25 μsec. Here, the operation of the local storage may be overlapped with main storage. In Model 50, the registers are in a local storage with a read-write time of only 0.5 μsec. In Model 60/62, the local storage has the logical characteristics of a core storage with nondestructive read-out; however, it is actually constructed as an array of registers using the 30-nsec family of logic circuits, and has a read-write time of 0.25 μsec. In Model 70, the general and floating-point registers are implemented with 6-nsec logic circuits and communicate directly with the adder and other data paths.
The two principal measures of size in the CPU are the width of the data paths and the number of bytes of high-speed working registers.
Data Path Organization
Model 30 has an 8-bit wide (plus parity) adder path, through which all data transfers are made, and approximately 12 bytes of working registers.
Model 40 also has an 8-bit wide adder path, but has an additional 16-bit wide data transfer path. Approximately 15 bytes
Table 2 System/360 CPU Characteristics
                                                          Model 30            Model 40            Model 50            Model 60/62               Model 70
Circuit family: nominal delay per logic level (nsec)      30                  30                  30                  10                        6
Cycle time (μsec)                                         1.0                 0.625               0.5                 0.25                      0.2
Location of general and floating registers                main core storage   local core storage  local core storage  local transistor storage  transistor registers
Width of general and floating register storage (bytes)    1                   2                   4                   4                         4 or 8
Speed of general and floating register storage (μsec)     2.0                 1.25                0.5                 0.25
Width of main adder path (bits)                           8                   8                   32                  56                        64
Width of auxiliary transfer path (bits)                                       16                  8
Widths of auxiliary adder paths (bits)                                                                                8                         8, 8, and 24
Approximate number of bytes of register storage           12                  15                  30                  50                        100
Approximate number of bytes of working locations
  in local storage                                        45 (main storage)   48                  60                  4
Relative computing speed                                  1                   3.5                 10                  21/30                     50


of working registers are used, plus about 48 bytes of working locations in the local storage, exclusive of the general and floating-point registers.
Model 50 has a 32-bit wide adder path, an 8-bit wide data path used for handling individual bytes, approximately 30 bytes of working registers, plus about 60 bytes of working locations in the local storage.
Model 60/62 has a 56-bit wide main adder path, an 8-bit wide serial adder path, and approximately 50 bytes of working registers.
Model 70 has a 64-bit wide main adder, an 8-bit wide exponent adder, an 8-bit wide decimal adder, a 24-bit wide addressing adder, and several other data transfer paths, some of which have incrementing ability. The model has about 100 bytes of working registers plus the 96 bytes of floating point and general registers which, in Model 70, are directly associated with the data paths.
The models of SYSTEM/360 differ considerably in the number of relatively independent operations that can occur simultaneously in the CPU. Model 30, for example, operates serially: virtually all data transfers must pass through the adder, one byte at a time. Model 70, however, can have many operations taking place at the same time. The CPU of this model is divided into three units that operate somewhat independently. The instruction preparation unit fetches instructions from storage, prepares them by computing their effective addresses, and initiates the fetching of the required data. The execution unit performs the execution of the instruction prepared by the instruction unit. The third unit is a storage bus control which coordinates the various requests by the other units and by the channels for core-storage cycles. All three units normally operate simultaneously, and together provide a large degree of instruction overlap. Since each of the units contains a number of different data paths, several data transfers may be occurring on the same cycle in a single unit.
The operations of other SYSTEM/360 models fall between those mentioned. Model 50, for example, can have simultaneous data transfers through the main adder, through an auxiliary byte transfer path, and to or from local storage.
Sequence Control

Complex Instruction Sequences
Since the SYSTEM/360 has an extensive instruction set, the CPU's must be capable of executing a large number of different sequences of basic operations. Furthermore, many instructions require sequences that are dependent on the data or addresses used. As shown in Table 3, these sequences of operations can be controlled by two methods: either by a conventional sequential logic circuit that uses the same types of circuit modules as used in the data paths, or by a read-only storage device that contains a microprogram specifying the sequences to be performed for the different instructions.
Model 70 makes use of conventional sequential logic control mainly because of the high degree of simultaneity required. Also, a sufficiently fast read-only storage unit was not available at the time of development. The sequences to be performed in each of the Model 70 data paths have a considerable degree of independence. The read-only storage method of control does not easily lend itself to controlling these independent sequences, but is well adapted where the actions in each of the data paths are highly coordinated.
Read-Only Storage Control


The read-only storage method of control is described elsewhere [Peacock, n.d.]. This microprogram control, used in all but the fastest model of SYSTEM/360, is the only method known by which an extensive instruction set may be economically realized in a small system. This was demonstrated during the design of Model 60/62. Conventional logic control was originally planned for this model, but it became evident during the design period that too many circuit modules were required to implement the instruction set, even for this rather large system. Because a sufficiently fast read-only storage became available, it was adopted for sequence control at a substantial cost reduction.
The three factors of speed, size, and simultaneity are applicable
Table 3 System/360 Sequence Control Characteristics

                                                          Model 30            Model 40            Model 50            Model 60/62         Model 70
Type                                                      read-only storage   read-only storage   read-only storage   read-only storage   sequential logic
Cycle time (μsec)                                         1.0                 0.625               0.5                 0.25                0.2
Width of read-only storage word (available bits)          60                  60                  90                  100
Number of read-only storage words available               4096                4096                2816                2816
Number of gate-control fields in read-only storage word   9                   10                  15                  16
to the read-only storage controls of the various SYSTEM/360 models. The speed of the read-only storage units corresponds to the cycle time of the CPU, and hence varies from 1.0 μsec per access for Model 30 down to 0.25 μsec for Models 60 and 62.
The size of read-only storage can vary in two ways—in width (number of bits per word) and in number of words. Since the bits of a word are used to control gates in the data paths, the width of storage is indirectly related to the complexity of the data paths. The widths of the read-only storages in SYSTEM/360 range from 60 bits for Models 30 and 40 to 100 bits for Models 60 and 62. The number of words is affected by several factors. First, of course, is the number and complexity of the control sequences to be executed. This is the same for all models except that Model 60/62 read-only storage contains no sequences for channel functions. The number of words tends to be greater for the smaller models, since these models require more cycles to accomplish the same function. Partially offsetting this is the fact that the greater degree of simultaneity in the larger systems often prevents the sharing of microprogram sequences between similar functions.
SYSTEM/360 employs no read-only storage simultaneity in the sense that more than one access is in progress at a given time. However, a single read-only storage word simultaneously controls several independent actions. The number of different gate control fields in a word provides some measure of this simultaneity. Model 30 has 9 such fields. Model 60/62 has 16.
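As a rough sketch of how a single read-only-storage word controls several independent gates at once, the word below is divided into a handful of control fields and decoded; the field layout is hypothetical, not the actual Model 30 or 60/62 microword.

# A read-only-storage (microprogram) word controls several data-path
# gates simultaneously: each field of the word drives one set of gates.
# The field layout below is invented purely for illustration.

FIELDS = [                        # (name, width in bits), most significant first
    ("adder_input_a", 3),
    ("adder_input_b", 3),
    ("adder_function", 2),
    ("destination", 3),
    ("next_address", 8),
]

def decode_microword(word):
    controls, shift = {}, sum(width for _, width in FIELDS)
    for name, width in FIELDS:
        shift -= width
        controls[name] = (word >> shift) & ((1 << width) - 1)
    return controls

# One 19-bit word simultaneously selects both adder inputs, the adder
# function, a destination register, and the next microword address.
print(decode_microword(0b101_010_01_110_00010100))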
Input/Output Channels
Channel Design
The SYSTEM/360 input/output channels may be considered from two viewpoints: the design of a channel itself, or the relationship of a channel to the whole system.
From the viewpoint of channel design, the raw speed of the components does not vary, since all channels use the 30-nsec family of circuits. However, the different channels do have access to different speeds of main storage and, in the three smaller models, different speeds of local storage.
The channels differ markedly in the amount of hardware devoted exclusively to channel use, as shown in Table 4. In the Model 30 multiplexor channel, this hardware amounts only to three 1-byte wide data paths, 11 latch bits for control, and a simple interface polling circuit. The channel used in Models 60, 62, and 70 contains about 300 bits of register storage, a 24-bit wide adder, and a complete set of sequential control circuits. The
Table 4 System/360 Channel Characteristics
                                                       Model 30   Model 40         Model 50                     Model 60/62   Model 70
Selector channels
  Maximum number attachable                            2          2                3                            6             6
  Approximate maximum data rate on one channel
    in Kbyps †                                         250        400              800 (1250 on high speed)     1250          1250
  Uses CPU data paths for:
    initiation and termination                         yes        yes              yes                          yes           yes
    byte transfers                                     no         no               no                           no            no
    storage word transfers                             no         low speed only   yes                          no            no
    chaining                                           yes        yes              yes                          no            no
  CPU and I/O overlap possible                         yes        yes              regular: yes; high speed: no yes           yes
Multiplexor channels
  Maximum number attachable                            1          1                1                            0             0
  Minimum number of subchannels                        32         16               64
  Maximum number of subchannels                        96         128              256
  Maximum data rate in byte-interleaved mode (Kbyps)   16         30               40
  Maximum data rate in burst mode (Kbyps)              200        200              200
  Uses CPU data paths for all functions                yes        yes              yes
  CPU and I/O overlap possible in byte mode            yes        yes              yes
  CPU and I/O overlap possible in burst mode           no         no               yes
† Thousand bytes per second.

The amount of hardware provided for other channels is somewhere in between these extremes.
The disparity in the amount of channel hardware reflects the extent to which the channels share CPU hardware in accomplishing their functions. Such sharing is done at the expense of increased interference with the CPU, of course. This interference ranges from complete lock-out of CPU operations at high data rates on some of the smaller models, to interference only in essential references to main storage by the channel in the large models.
Channel/System Relationship
 
 
When the channels are viewed in their relationship to the whole system, the three factors of speed, size, and simultaneity take on a different aspect. The channel is viewed as a system component, and its effect on system throughput and other system capabilities is of concern. The speeds of the channels vary from a maximum rate of about 16 thousand bytes per second (byte interleaved mode) on the multiplexor channel of Model 30 to a maximum rate of about 1250 thousand bytes per second on the channels of Models 60, 62, and 70. The size of each of the channels is the same, in the sense that each handles an 8-bit byte at a time and each can connect to eight different control units. A slight size difference exists among multiplexor channels in terms of the maximum number of subchannels.
The degree of channel simultaneity differs considerably among the various models of SYSTEM/360. For example, operation of the Model 30 or 40 multiplexor channels in burst mode inhibits all other activity on the system, as does operation of the special high-speed channel on Model 50. At the other extreme, as many as six selector channels can be operating concurrently with the CPU on Models 60, 62, or 70. A second type of simultaneity is present in the multiplexor channels available on Models 30, 40, and 50. When operating in byte interleaved mode, one of these channels can control a number of concurrently operating input/output devices, and the CPU can also continue operation.
Differences in Application Emphasis
 
The models of SYSTEM/360 differ not only in throughput but also in the relative speeds of the various operations. Some of these relative differences are simply a result of the design choices described in this paper, made to achieve the desired overall performance. The more basic differences in relative performance of the various operations, however, were intentional. These differences in emphasis suit each model to those applications expected to comprise its largest usage.
Thus the smallest system is particularly aimed at traditional commercial data processing applications. These are characterized by extensive input/output operations in relation to the internal processing, and by more character handling than arithmetic. The fast selector channels and character-oriented data paths of Model 30 result from this emphasis. But despite this emphasis, the general-purpose instruction set of SYSTEM/360 results in much better scientific application performance for Model 30 than for its comparable predecessors.
On the other hand, the large systems are expected to find particularly heavy use in scientific computation, where the emphasis is on rapid floating-point arithmetic. Thus Models 60, 62, and 70 contain registers and adders that can handle the full length of a long format floating-point operand, yet do character operations one byte at a time.
No particular emphasis on either commercial or scientific applications characterizes the intermediate models. However, Models 40 and 50 are intended to be particularly suitable for communication-oriented and real-time applications. For example, Model 50 includes a multiplexor channel, storage protection, and a timer as standard features, and also provides the ability to share main storages between two CPU's in a multiprocessing arrangement.
VAX-11/780: A Virtual Address Extension to the DEC PDP-11 Family

Introduction
Large Virtual Address Space Minicomputers
Perhaps the most useful definition of a minicomputer system is based on price: depending on one's perspective such systems are typically found in the $20K to $200K range. The twin forces of market pull (as customers build increasingly complex systems on minicomputers) and technology push (as the semiconductor industry provides increasingly lower cost logic and memory elements) have induced minicomputer manufacturers to produce systems of considerable performance and memory capacity. Such systems are typified by the DEC PDP-11/70. From an architectural point of view, the characteristic which most distinguishes many of these systems from larger mainframe computers is the size of the virtual address space: the immediately available address space seen by an individual process. For many purposes the 65K byte virtual address space typically provided on minicomputers (such as the PDP-11) has not been and probably will not continue to be a severe limitation. However, there are some applications whose programming is impractical in a 65K byte virtual address space, and perhaps most importantly, others whose programming is appreciably simplified by having a large virtual address space. Given the relative trends in hardware and software costs, the latter point alone will insure that large virtual address space minicomputers play an increasingly important role in minicomputer product offerings.
In principle, there is no great challenge in designing a large virtual address minicomputer system. For example, many of the large mainframe computers could serve as architectural models for such a system. The real challenge lies in two areas: compatibility, which is very tangible and important; and simplicity, which is intangible but nonetheless important.
The first area is preserving the customer's and the computer manufacturer's investment in existing systems. This investment exists at many levels: basic hardware (principally busses and peripherals); systems and applications software; files and data bases; and personnel familiar with the programming, use, and operation of the systems. For example, just recently a major computer manufacturer abandoned a major effort for new computer architectures in favor of evolving its current architectures [McLean, 1977].
The second intangible area is the preservation of those attributes (other than price) which make minicomputer systems attractive. These include approachability, understandability, and ease of use. Preservation of these attributes suggests that simply modelling an extended virtual address minicomputer after a large mainframe computer is not wholly appropriate. It also suggests that during architectural design, tradeoffs must be made between more than just performance, functionality, and cost. Performance or functionality features which are so complex that they appreciably compromise understanding or ease of use must be rejected as inappropriate for minicomputer systems.
VAX-11 Overview
VAX-11 is the Virtual Address eXtension of the PDP-11 architecture [Bell et al., 1970; Bell and Strecker, 1976]. The most distinctive feature of VAX-11 is the extension of the virtual address from 16 bits as provided on the PDP-11 to 32 bits. With the 8-bit byte the basic addressable unit, the extension provides a virtual address space of about 4.3 gigabytes which, even given rapid improvement in memory technology, should be adequate far into the future.
Since maximal PDP-11 compatibility was a strong goal, early VAX-11 design efforts focused on literally extending the PDP-11: preserving the existing instruction formats and instruction set and fitting the virtual address extension around them. The objective here was to permit, to the extent possible, the running of existing programs in the extended virtual address environment. While realizing this objective was possible (there were three distinct designs), it was felt that the extended architecture designs were overly compromised in the areas of efficiency, functionality, and programming ease.
Consequently, it was decided to drop the constraint of the PDP-11 instruction format in designing the extended virtual address space or native mode of the VAX-11 architecture. However, in order to run existing PDP-11 programs, VAX-11 includes a PDP-11 compatibility mode. Compatibility mode provides the basic PDP-11 instruction set less only privileged instructions (such as HALT) and floating point instructions (which are optional on most PDP-11 processors and not required by most PDP-11 software).
In addition to compatibility mode, a number of other features to preserve PDP-11 investment have been provided in the VAX-11 architecture, the VAX-11 operating system VAX/VMS, and the VAX-11/780 implementation of the VAX-11 architecture. These features include:
1 The equivalent native mode data types and formats are identical to those on the PDP-11. Also, while extended, the VAX-11 native mode instruction set and addressing modes are very close to those on the PDP-11. As a consequence VAX-11 native mode assembly language programming is quite similar to PDP-11 assembly language programming.
2 The VAX-11/780 uses the same peripheral busses (Unibus and Massbus) as the PDP-11 and uses the same peripherals.
3 The VAX/VMS operating system is an evolution of the PDP-11 RSX-11M and IAS operating systems, offers a similar although extended set of system services, and uses the same command languages. Additionally, VAX/VMS supports most of the RSX-11M/IAS system service requests issued by programs executing in compatibility mode.
4 The VAX/VMS file system is the same as used on the RSX-11M/IAS operating systems permitting interchange of files and volumes. The file access methods as implemented by the RMS record manager are also the same.
5 VAX-11 high level language compilers accept the same source languages as the equivalent PDP-11 compilers and execution of compiled programs gives the same results.

The coverage of all these aspects of VAX-11 is well beyond the scope of any single paper. The remainder of this paper discusses the design of the VAX-11 native mode architecture and gives an overview of the VAX-11/780 system.


VAX-11 Native Architecture

Processor State
Like the PDP-11, VAX-11 is organized around a general register processor state. This organization was favored because access to operands stored in general registers is fast (since the registers are internal to the processor and register accesses do not need to pass through a memory management mechanism) and because only a small number of bits in an instruction are needed to designate a register. Perhaps most importantly, the registers are used (as on the PDP-11) in conjunction with a large set of addressing modes which permit unusually flexible operand addressing methods.
Some consideration was given to a pure stack based architecture. However it was rejected because real program data suggests the superiority of two or three operand instruction formats [Myers, 1977b]. Actually VAX-11 is quite stack oriented, and although it is not optimally encoded for the purpose, can easily be used as a pure stack architecture if desired.
VAX-11 has 16 32-bit general registers (denoted R0-R15) which are used for both fixed and floating point operands. This is in contrast to the PDP-11, which has eight 16-bit general registers and six 64-bit floating point registers. The merged set of fixed and floating registers was preferred because it simplifies programming and permits a more effective allocation of the registers.
Four of the registers are assigned special meaning in the VAX-11 architecture:

1 R15 is the program counter (PC) which contains the address of the next byte to be interpreted in the instruction stream.
2 R14 is the stack pointer (SP) which contains the address of the top of the processor defined stack used for procedure and interrupt linkage.
3 R13 is the frame pointer (FP). The VAX-11 procedure calling convention builds a data structure on the stack called a stack frame. FP contains the address of this structure.
4 R12 is the argument pointer (AP). The VAX-11 procedure calling convention uses a data structure called an argument list. AP contains the address of this structure.

The remaining element of the user visible processor state (additional processor state seen mainly by privileged procedures is discussed later) is the 16-bit processor status word (PSW). The PSW contains the N, Z, V, and C condition codes which indicate respectively whether a previous instruction had a negative result, a zero result, a result which overflowed, or a result which produced a carry (or borrow). Also in the PSW are the IV, DV, and FU bits which enable processor trapping on integer overflow, decimal overflow, and floating underflow conditions respectively. (The trapping on conditions of floating overflow and divide by zero for any data type are always enabled.)
Finally, the PSW contains the T bit which when set forces a trap at the end of each instruction. This trap is useful for program debugging and analysis purposes.

Data Types and Formats

The VAX-11 data types are a superset of the PDP-11 data types. Where the PDP-11 and VAX-11 have equivalent data types the formats (representation in memory) are identical. Data type and data format identity is one of the most compelling forms of compatibility. It permits free interchange of binary data between PDP-11 and VAX-11 programs. It facilitates source level compatibility between equivalent PDP-11 and VAX-11 languages. It also greatly facilitates hardware implementation of and software support of the PDP-11 compatibility mode in the VAX-11 architecture.
The VAX-11 data types divide into five classes:

1 Integer data types are the 8-bit byte, the 16-bit word, the 32-bit longword, and the 64-bit quadword. Usually these data types are considered signed with negative values represented in two's complement form. However, for most purposes they can be interpreted as unsigned and the VAX-11 instruction set provides support for this interpretation.
2 Floating data types are the 32-bit floating and the 64-bit double floating. These data types are binary normalized, have an 8-bit signed exponent, and have a 25- or 57-bit signed fraction with the redundant most significant fraction bit not represented.
3 The variable bit field data type is 0 to 32 bits located arbitrarily with respect to addressable byte boundaries. A bit field is specified by three operands: the address of a byte, the starting bit position P with respect to bit 0 of that byte, and the size S of the field. The VAX-11 instruction set provides for interpreting the field as signed or unsigned.
4 The character string data type is 0 to 65535 contiguous bytes. It is specified by two operands: the length and starting address of the string. Although the data type is named "character string," no special interpretation is placed on the values of the bytes in the character string.
5 The decimal string data types are 0 to 31 digits. They are specified by two operands: a length (in digits) and a starting address. The primary data type is packed decimal with two digits stored in each byte, except that the byte containing the least significant digit contains a single digit and the sign. Two ASCII character decimal types are supported: leading separate sign and trailing embedded sign. The leading separate type is a "+", "-", or "<blank>" (equivalent to "+") ASCII character followed by 0 to 31 ASCII decimal digit characters. A trailing embedded sign decimal string is 0 to 31 bytes which are ASCII decimal digit characters, except for the character containing the least significant digit, which is an arbitrary encoding of the digit and sign.
All of the data types except field may be stored on arbitrary byte boundaries—there are no alignment constraints. The field data type, of course, can start on an arbitrary bit boundary.
Attributes of and symbolic representations for most of the data types are given in Table 1 and Fig. 1.
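As a concrete illustration of the variable bit field data type described above, the following C sketch extracts a field of S bits starting P bits past bit 0 of a base byte address, with either an unsigned or a signed (sign-extended) interpretation. It models only the semantics given in the text; the little-endian gathering loop and function names are our own simplifications, not DEC code.

#include <stdint.h>
#include <stdio.h>

/* Extract an unsigned field of 'size' bits (0..32) starting 'pos' bits past
 * bit 0 of the byte at 'base'.  Bytes are gathered little-endian. */
static uint32_t extract_field_unsigned(const uint8_t *base, long pos, unsigned size)
{
    uint64_t window = 0;
    const uint8_t *p = base + (pos >> 3);      /* byte containing the first bit */
    unsigned shift = (unsigned)(pos & 7);      /* bit offset within that byte   */

    for (int i = 0; i < 8; i++)                /* gather enough bytes to cover  */
        window |= (uint64_t)p[i] << (8 * i);   /* any 32-bit field              */

    if (size == 0) return 0;
    return (uint32_t)((window >> shift) &
                      ((size < 32) ? ((1ull << size) - 1) : 0xFFFFFFFFull));
}

/* Signed interpretation: sign-extend the extracted field to 32 bits. */
static int32_t extract_field_signed(const uint8_t *base, long pos, unsigned size)
{
    uint32_t v = extract_field_unsigned(base, pos, size);
    if (size != 0 && size < 32 && (v & (1u << (size - 1))))
        v |= ~((1u << size) - 1);              /* replicate the sign bit */
    return (int32_t)v;
}

int main(void)
{
    uint8_t mem[16] = { 0xB4, 0x03 };          /* bits 2..9 hold the value 0xED */
    printf("unsigned: %u\n", extract_field_unsigned(mem, 2, 8));  /* 237 */
    printf("signed:   %d\n", extract_field_signed(mem, 2, 8));    /* -19 */
    return 0;
}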
Instruction Format and Address Modes


Most architectures provide a small number of relatively fixed instruction formats. Two problems often result. First, not all operands of an instruction have the same specification generality. For example, one operand must come from memory and another from a register; or one must come from the stack and another from memory. Second, only a limited number of operands can be accommodated: typically one or two. For instructions which inherently require more operands (such as field or string instructions), the additional operands are specified in ad hoc ways: small literal fields in instructions, specific registers or stack positions, or packed in fields of a single operand. Both these problems lead to increased programming complexity: they require superfluous move type instructions to get operands to places where they can be used and increase competition for potentially scarce resources such as registers.
To avoid these problems two criteria were used in the design of the VAX-11 instruction format: (1) all instructions should have the "natural" number of operands and (2) all operands should have the same generality in specification. These criteria led to a highly variable instruction format.
Table 1  Data Types

Data type                   Size                        Range (decimal)
Integer                                                 Signed                    Unsigned
  Byte                      8 bits                      -128 to +127              0 to 255
  Word                      16 bits                     -32768 to +32767          0 to 65535
  Longword                  32 bits                     -2^31 to +2^31 - 1        0 to 2^32 - 1
  Quadword                  64 bits                     -2^63 to +2^63 - 1        0 to 2^64 - 1
Floating point                                          ±2.9 × 10^-39 to 1.7 × 10^38
  Floating                  32 bits                     approximately seven decimal digits precision
  Double floating           64 bits                     approximately sixteen decimal digits precision
Packed decimal string       0 to 16 bytes (31 digits)   numeric, two digits per byte, sign in low half of last byte
Character string            0 to 65535 bytes            one character per byte
Variable-length bit field   0 to 32 bits                dependent on interpretation
An instruction consists of a one or two byte opcode1 followed by the specifications for n operands (n ≥ 0), where n is an implicit property of the opcode. An operand specification is one to 10 bytes in length and consists of a one or two byte operand specifier followed by (as required) zero to eight bytes of specifier extension. The operand specifier includes the address mode and designation of any registers needed to locate the operand. A specifier extension consists of a displacement, an address, or immediate data.
The VAX-11 address modes are, with one exception, a superset of the PDP-11 address modes. The PDP-11 address mode autodecrement deferred was omitted from VAX-11 because it was rarely used. Most operand specifiers are one byte long and contain two 4-bit fields: the high order field (bits 7:4) contains the address mode designator and the lower field (bits 3:0) contains a general register designator. The address modes include:

1 Register mode, in which the designated register contains the operand.

2 Register deferred mode, in which the designated register contains the address of the operand.

3 Autodecrement mode, in which the contents of the designated register are first decremented by the size (in bytes) of the operand and then used as the address of the operand.

4 Autoincrement mode, in which the contents of the designated register are first used as the address of the operand and are then incremented by the size of the operand. Note that if the designated register is PC, the operand is located in the instruction stream. This use of autoincrement mode is called immediate mode. In immediate mode the one to eight bytes of data are the specifier extension. Autoincrement mode can be used sequentially to process a vector in one direction and autodecrement mode used to process a vector in the opposite direction. Autoincrement, register deferred, and autodecrement modes can be applied to a single register to implement a stack data structure: autodecrement to "push," autoincrement to "pop," and register deferred to access the top of the stack.

5 Autoincrement deferred mode, in which the contents of the designated register are used as the address of a longword in memory which contains the address of the operand. After this use, the contents of the register are incremented by four (the size in bytes of the longword address). Note that if PC is the designated register, the absolute address of the operand is located in the instruction stream. This use of autoincrement deferred mode is termed absolute mode. In absolute mode the 4-byte address is the specifier extension.

6 Displacement mode, in which a displacement is added to the contents of the designated register to form the operand address. There are three displacement modes depending on whether a signed byte, word, or longword displacement is the specifier extension. These modes are termed byte, word, and longword displacement respectively. Note that if PC is the designated register, the operand is located relative to PC. For this use the modes are termed byte, word, and longword relative mode respectively.

7 Displacement deferred mode, in which a displacement is added to the designated register to form the address of a longword containing the address of the operand. There are three displacement deferred modes depending on whether a signed byte, word, or longword displacement is the specifier extension. These modes are termed byte, word, and longword displacement deferred respectively. Note that if PC is the designated register, the operand address is located relative to PC. For this use the modes are termed byte, word, and longword relative deferred mode respectively.

8 Literal mode, in which the operand specifier itself contains a 6-bit literal which is the operand. For integer data types the literal encodes the values 0-63; for floating data types the literal includes three exponent and three fraction bits to give 64 common values.

9 Index mode, which is not really a mode but rather a one byte prefix operator for any other mode which evaluates to a memory address (i.e., all modes except register and literal). The index mode prefix is cascaded with the operand specifier for that mode (called the base operand specifier) to form an aggregate two byte operand specifier. The base

operand specifier is used in the normal way to evaluate a base address. A copy of the contents of the register designated in the index prefix is multiplied by the size (in bytes) of the operand and added to the base address. The sum is the final operand address. There are three advantages to the VAX-11 form of indexing: (a) the index is scaled by the data size and thus the index register maintains a logical rather than a byte offset into an indexed data structure, (b) indexing can be applied to any of the address modes which generate memory addresses and this results in a comprehensive set of indexed addressing methods, and (c) the space required to specify indexing and the index register is paid only when indexing is used.

1No currently defined instructions use two byte opcodes.

The VAX-11 assembler syntax for the address modes is given in Fig. 2. The bracketed ({}) notation is optional and the programmer rarely needs to be concerned with displacement sizes or whether to choose literal or immediate mode. The programmer writes the simple form and the assembler chooses the address mode which produces the shortest instruction length.
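The C sketch below illustrates, under simplifying assumptions, how a one-byte operand specifier of the kind just described splits into a 4-bit address-mode designator (bits 7:4) and a 4-bit register designator (bits 3:0), and how a register-plus-byte-displacement operand address would then be formed, similar to the Fig. 3 example discussed below. The mode value and helper names are illustrative only, not taken from the VAX-11 specification.

#include <stdint.h>
#include <stdio.h>

/* Split a one-byte operand specifier into its two 4-bit fields. */
static unsigned specifier_mode(uint8_t spec)     { return spec >> 4; }   /* bits 7:4 */
static unsigned specifier_register(uint8_t spec) { return spec & 0x0F; } /* bits 3:0 */

int main(void)
{
    uint32_t reg[16] = { 0 };       /* model of the 16 general registers     */
    reg[5] = 0x1000;                /* assume R5 holds a base address        */

    uint8_t spec = 0xA5;            /* hypothetical specifier: mode 10, R5   */
    int8_t  disp = 56;              /* byte displacement from the extension  */

    /* Byte displacement mode: operand address = displacement + register. */
    uint32_t operand_address = reg[specifier_register(spec)] + (int32_t)disp;

    printf("mode=%u register=R%u operand address=0x%X\n",
           specifier_mode(spec), specifier_register(spec), operand_address);
    return 0;
}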
In order to give a better feeling for the instruction format and assembler notation, several examples are given in Figs. 3 to 5. In Fig. 3 is an instruction which moves a word from an address which is 56 plus the contents of R5 to an address which is 270 plus the contents of R6. Note that the displacement 56 is representable in a byte while the displacement 270 requires a word. The instruction occupies 6 bytes. In Fig. 4 is an instruction which adds 1 to a longword in R0 and stores the result at a memory address which is the sum of A and 4 times the contents of R. This instruction occupies 9 bytes. Finally, in Fig. 5 is a return from subroutine instruction. It has no explicit operands and occupies a single byte. The only significant instance where there is non-general specification of operands is in the specification of targets for branch instructions.


Since invariably the target of a branch instruction is a small displacement from the current PC, most branch instructions simply take a one byte PC relative displacement. This is exactly as if byte displacement mode were used with the PC used as the register, except that the operand specifier byte is not needed. Because of the pervasiveness of branch instructions in code, this one byte saving results in a non-trivial reduction in code size. An example of the branch instruction branch on equal is given in Fig. 6.

Instruction Set

A major goal of the VAX-11 instruction set design was to provide for effective compiler generated code. Four decisions helped to realize this goal:
1 A very regular and consistent treatment of operators. Thus, for example, since there is a divide longword instruction there are also divide word and divide byte instructions.

2 An avoidance of instructions unlikely to be generated by a compiler.

3 Inclusion of several forms of common operators. For example the integer add instructions are included in three forms: (a) one operand, where the value one is added to an operand; (b) two operands, where one operand is added to a second; and (c) three operands, where one operand is added to a second and the result stored in a third. Since the VAX-11 instruction format allows fully general specifications of the operands, VAX-11 programs often have the structure (though not the encoding) of the canonic program form proposed in Flynn [1977].

4 Replacement of common instruction sequences with single instructions. Examples of this include procedure calling, multiway branching, loop control, and array subscript calculation.

The effect of these decisions is reflected in several observations. First, despite the larger virtual address and instruction set support for more data types, compiler (and hand) generated code for VAX-11 is typically smaller than the equivalent PDP-11 code for algorithms operating on data types supported by the PDP-11. Second, of the 243 instructions in the instruction set about 75
percent are generated by the VAX-11 FORTRAN compiler. Of the instructions not generated, most operate on data types not part of the FORTRAN language. A complete list of the VAX-11 instructions is given in Appendix 1. The following gives an overview of the instruction set.
1 Integer logic and arithmetic. Byte, word, and longword are the primary data types. A fairly conventional group of arithmetic and logical instructions is provided. The result generating dyadic arithmetic and logical instructions are provided in two and three operand forms. A number of optimizations are included: clear, which is a move of zero; test, which is a compare against zero; and increment and decrement, which are optimizations of add one and subtract one respectively. A complete set of converts is provided which covers both the integer and the floating data types. In contrast to other architectures only a few shift type instructions are provided: it was felt that shifts are mostly used for field isolation, which is much more conveniently done with the field instructions described later. In order to support greater than longword precision integer operations, a few special instructions are provided: extended multiply and divide, and add with carry and subtract with carry.

2 Floating point instructions. Again a conventional group of instructions is included with result producing dyadic operators in two and three operand forms. Several specialized floating point instructions are included. The extended modulus instruction multiplies two floating operands together and stores the integer and fraction parts of the product in separate result operands. The polynomial instruction computes a polynomial from a table of coefficients in memory. Both these instructions employ greater than normal precision and are useful in high accuracy mathematical routines. A convert rounded instruction is provided which implements the ALGOL rather than FORTRAN conventions for converting from floating point to integer.

3 Address instructions. The move address instructions store in the result operand the effective address of the source operand. The push address optimizations push on the stack (defined by SP) the effective address of the source operand. The latter are used extensively in subroutine calling.

4 Field instructions. The extract field instructions extract a 0 to 32-bit field, sign- or zero-extend it if it is less than 32 bits, and store it in a longword operand. The compare field instructions compare a (sign- or zero-extended if necessary) field against a longword operand. The find first instructions find the first occurrence of a set or clear bit in a field.

5 Control instructions. There is a complete set of conditional branches supporting both a signed and, where appropriate, an unsigned interpretation of the various data types. These branches test the condition codes and take a one byte
 

PC relative branch displacement. There are three unconditional branch instructions: the first taking a one byte PC relative displacement, the second taking a word PC relative displacement, and the third, called jump, taking a general operand specification. Paralleling these three instructions are three branch to subroutine instructions. These push the current PC on the stack before transferring control. The single byte return from subroutine instruction returns from subroutines called by these instructions. There is a set of branch on bit instructions which branch on the state of a single bit and, depending on the instruction, set, clear, or leave unchanged that bit.
The add compare and branch instructions are used for loop control. A step operand is added to the loop control operand and the sum compared against a limit operand. The result of the comparison determines whether the branch is taken. The sense of the comparison is based on the sign of the step operand. Optimizations of loop control include the add one and branch instructions which assume a step of one and the subtract one and branch instructions which assume a step of minus one and a limit of zero.
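As a rough illustration of the loop-control semantics just described, the following C sketch models an add-compare-and-branch step: the step is added to the loop-control operand, the sum is compared against the limit, and the sense of the comparison depends on the sign of the step. This is a behavioral model only, not the VAX encoding; the function name is ours.

#include <stdio.h>

/* Model of an add, compare and branch step on longword operands.
 * Returns nonzero if the branch would be taken (loop continues). */
static int acb_step(long *index, long step, long limit)
{
    *index += step;
    return (step >= 0) ? (*index <= limit) : (*index >= limit);
}

int main(void)
{
    long i = 0;
    while (acb_step(&i, 2, 10))          /* prints i = 2, 4, 6, 8, 10 */
        printf("i = %ld\n", i);
    return 0;
}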
The case instructions implement the computed go to in FORTRAN and case statements in other languages. A selector operand is checked to see that it lies in range and is then used to select one of table of PC relative branch displacements following the instruction.
6 Queue instructions. The queue representation is a doubly linked circular list. Instructions are provided to insert an item into a queue or to remove an item from a queue.
7 Character string instructions. The general move character instruction takes five operands specifying the lengths and starting addresses of the source and destination strings and a fill character to be used if the source string is shorter than the destination string. The instruction functions correctly regardless of string overlap. An optimized move character instruction assumes the string lengths are equal and takes three operands. Paralleling the move instructions are two compare character instructions. The move translated characters instruction is similar to the general move character instruction except that the source string bytes are translated by a translation table specified by the instruction before being moved to the destination string. The move translated until escape instruction stops if the result of a translation matches the escape character specified by one of its operands. The locate and skip character instructions find respectively the first occurrence or non-occurrence of a character in a string. The scan and span instructions find respectively the first occurrence or non-occurrence of a character within a specified character set in a string. The match characters instruction finds the first occurrence of a substring within a string which matches a specified pattern string.
8 Packed decimal instructions. A conventional set of arithmetic instructions is provided. The arithmetic shift and round instruction provides decimal point scaling and rounding. Converts are provided to and from longword integers, leading separate decimal strings, and trailing embedded decimal strings. A comprehensive edit instruction is included.
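A small C sketch of the general move-character behavior described in item 7 above: five operands (two lengths, two addresses, and a fill character), with the destination padded when the source is shorter and truncated when it is longer. This models only the data movement, not overlap handling or condition codes; the function name is ours.

#include <stdio.h>
#include <string.h>

/* Move 'srclen' bytes from 'src' to a 'dstlen'-byte destination, padding
 * with 'fill' if the source is shorter and truncating if it is longer. */
static void move_character(size_t srclen, const char *src,
                           char fill,
                           size_t dstlen, char *dst)
{
    size_t n = (srclen < dstlen) ? srclen : dstlen;
    memcpy(dst, src, n);
    memset(dst + n, fill, dstlen - n);
}

int main(void)
{
    char dst[8];
    move_character(3, "ABC", ' ', sizeof dst, dst);
    printf("[%.8s]\n", dst);    /* prints [ABC     ] */
    return 0;
}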
VAX-11 Procedure Instructions
 
 

A major goal of the VAX-11 design was to have a single system wide procedure calling convention which would apply to all inter-module calls in the various languages, calls for operating system services, and calls to the common run time system. Three VAX-11 instructions support this convention: two call instructions which are indistinguishable as far as the called procedure is concerned and a return instruction.
The call instructions assume that the first word of a procedure is an entry mask which specifies which registers are to be used by the procedure and thus need to be saved. (Actually only R0-R11 are controlled by the entry mask and bits 15:12 of the mask are reserved for other purposes.) After pushing the registers to be saved on the stack, the call instruction pushes AP, FP, PC, a longword containing the PSW and the entry mask, and a zero valued longword which is the initial value of a condition handler address. The call instruction then loads FP with the contents of SP and AP with the argument list address. The appearance of the stack frame after the call is shown in the upper part of Fig. 7.
The form of the argument list is shown in the lower part of Fig. 7. It consists of an argument count and list of longword arguments which are typically addresses. The CALLG instruction takes two operands: one specifying the procedure address and the other specifying the address of the argument list assumed arbitrarily located in memory. The CALLS instruction also takes two operands: one the procedure address and the other an argument count. CALLS assumes that the arguments have been pushed on the stack and pushes the argument count immediately prior to saving the registers controlled by the entry mask. It also sets bit 13 of the saved entry mask to indicate a CALLS instruction was used to make the call.
The return instruction uses FP to locate the stack frame. It loads SP with the contents of FP and restores PSW through PC by popping the stack. The saved entry mask controls the popping and restoring of R11 through R0. Finally, if the bit indicating CALLS was set, the argument list is removed from the stack.
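The following C sketch models the stack frame and argument list described above: registers selected by the entry mask, then AP, FP, PC, a longword holding the saved PSW and entry mask, and an initially zero condition-handler longword, with FP left pointing at the frame. The field names and the in-memory ordering shown are a simplified reading of the text, assuming a downward-growing stack, not a definitive frame definition.

#include <stdint.h>
#include <stdio.h>

/* Simplified frame, fields in ascending address order: the condition handler
 * longword (pushed last) sits at the lowest address, where FP points. */
struct call_frame {
    uint32_t condition_handler;   /* initially zero                          */
    uint32_t psw_and_mask;        /* saved PSW plus the entry mask           */
    uint32_t saved_pc;            /* return address                          */
    uint32_t saved_fp;            /* caller's frame pointer                  */
    uint32_t saved_ap;            /* caller's argument pointer               */
    /* registers selected by the entry mask would be saved above this point  */
};

/* The argument list: a count followed by longword arguments. */
struct arg_list {
    uint32_t count;
    uint32_t arg[3];              /* example with three arguments */
};

int main(void)
{
    struct arg_list args = { 3, { 10, 20, 30 } };
    struct call_frame frame = { 0, 0, 0x2000, 0x7FFF0000, 0x7FFF0100 };

    printf("args=%u, return PC=0x%X, handler=%u\n",
           args.count, frame.saved_pc, frame.condition_handler);
    return 0;
}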

Memory Management Design Alternatives
  
Memory management comprises the mechanisms used (1) to map the virtual addresses generated by processes to physical memory addresses, (2) to control access to memory (i.e., to control whether a process has read, write, or no access to various areas of memory), and (3) to allow a process to execute even if all of its virtual address space is not simultaneously mapped to physical memory (i.e., to
provide so called virtual memory facilities). The memory management proved to be the most difficult part of the architecture to design. Three alternatives were pursued and full designs were completed for the first two alternatives and nearly completed for the third. The three alternatives1 were:
1 A paged form of memory management with access control at the page level and a small number (4) of hierarchical access modes whose use would be dedicated to specific purposes. This represented an evolution of the PDP-11/70 memory management.

2 A paged and segmented form with access control at the segment level and a larger number (8) of hierarchical access modes which would be used quite generally. Although it differed in a number of ways, the design was motivated by the Multics [Organick, 1972; Schroeder and Saltzer, 1971] architecture and the Honeywell 6180 implementation.

3 A capabilities [Needham, 1972; Needham and Walker, 1977] form with access control provided by the capabilities and the ability to page larger objects described by the capabilities.
The first alternative was finally selected. The second alternative was rejected because it was felt that the real increase in functionality provided inadequately offset the increased architectural complexity. The third alternative appeared to offer functionality advantages that could be useful over the longer term. However, it was unlikely that these advantages could be exploited in the near term. Further it appeared that the complexity of the capabilities design was inappropriate for a minicomputer system.
1It should not be construed that memory management is independent of the rest of the architecture. The various memory management alternatives required different definitions of the addressing modes and different instruction level support for addressing.

Memory Mapping

The 4.3 gigabyte virtual address space is divided into four regions as shown in Fig. 8. The first two regions, the program and control regions, comprise the per process virtual address space which is uniquely mapped for each process. The second two regions, the system region and a region reserved for future use, comprise the system virtual address space which is singly mapped for all processes. Each of the regions serves different purposes. The program region contains user programs and data, and the top of the region is a dynamic memory allocation point. The control region contains operating system data structures specific to the process and the user stack. The system region contains procedures which are common to all processes (such as those that comprise the operating system and RMS) and (as will be seen later) page tables.

A virtual address has the structure shown in the upper part of Fig. 9. Bits 8:0 specify a byte within a 512 byte page, which is the
basic unit of mapping. Bits 29:9 specify a virtual page number (VPN). Bits 31:30 select the virtual address region. The mechanism of mapping consists of using the region select bits to select a page table which consists of page table entries (PTEs). After a check that it is not too large, the VPN is used to index into the page table to select a PTE. The PTE contains either (1) a 21-bit physical page frame number which is concatenated with the nine low order byte-in-page bits to form the 30-bit physical address shown in the lower part of Fig. 9, or (2) an indication that the virtual page accessed is not in physical memory. The latter case is called a page fault. Instruction execution in the current procedure is suspended and control is transferred to an operating system procedure which will cause the missing virtual page to be brought into physical memory. At this point instruction execution in the suspended procedure can resume transparently.

The page table for the system region is defined by the system base register, which contains the physical address of the start of the system region page table, and the system length register, which contains the length of the table. Thus the system page table is contiguous in physical memory. The per process space page tables are defined similarly by the program and control region base registers and length registers. However, the base registers do not contain physical addresses: rather, they contain system region virtual addresses. Thus the per process page tables are contiguous in the system region virtual address space and are not necessarily contiguous in physical memory. This placement of the per process page tables permits them to be paged and avoids what would otherwise be a serious physical memory allocation problem.

Access Control

At a given point in time a process executes in one of four access modes. The modes from most privileged to least are called kernel, executive, supervisor, and user. The use of these modes by VAX/VMS is as follows:
1 Kernel: interrupt and exception handling, scheduling, paging, physical I/O, etc.
2 Executive: logical I/O as provided by RMS.
3 Supervisor: the command interpreter.
4 User: user procedures and data.
The accessibility of each page (read, write, or no access) from each access mode is specified in the PTE for that page. Any attempt to improperly access a page is suppressed and control is transferred to an operating system procedure. The accessibility is assumed hierarchically ordered: if a page is writable from any given mode, it is also readable; and if a page is accessible from a less privileged mode, it is accessible from a more privileged mode. Thus, for example, a page can be readable and writable from kernel mode, only readable from executive mode, and inaccessible from supervisor and user modes.
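The C sketch below ties together the mapping and access control just described: a virtual address is split into byte-in-page, virtual page number, and region fields; the PTE for the page is consulted for both protection and the physical page frame number; and the 30-bit physical address is formed. The PTE layout and protection encoding shown are simplified assumptions for illustration, not the actual VAX-11 formats.

#include <stdint.h>
#include <stdio.h>

/* Simplified PTE: valid bit, read/write permission per access mode,
 * and a 21-bit page frame number. */
struct pte {
    unsigned valid;        /* page is in physical memory            */
    unsigned writable[4];  /* write permission per mode 0..3        */
    unsigned readable[4];  /* read permission per mode 0..3         */
    uint32_t pfn;          /* physical page frame number (21 bits)  */
};

enum { KERNEL = 0, EXECUTIVE = 1, SUPERVISOR = 2, USER = 3 };

/* Translate a 32-bit virtual address; returns 0 on success. */
static int translate(uint32_t va, int mode, int is_write,
                     const struct pte *page_table, uint32_t table_len,
                     uint32_t *pa)
{
    uint32_t byte_in_page = va & 0x1FF;           /* bits 8:0   */
    uint32_t vpn          = (va >> 9) & 0x1FFFFF; /* bits 29:9  */
    uint32_t region       = va >> 30;             /* bits 31:30 */

    (void)region;                                 /* one region modeled here */
    if (vpn >= table_len)           return -1;    /* length violation        */
    const struct pte *p = &page_table[vpn];
    if (is_write ? !p->writable[mode] : !p->readable[mode])
        return -2;                                /* access violation        */
    if (!p->valid)                  return -3;    /* page fault              */
    *pa = (p->pfn << 9) | byte_in_page;           /* 30-bit physical address */
    return 0;
}

int main(void)
{
    struct pte table[4] = {0};
    table[2] = (struct pte){ 1, {1,1,0,0}, {1,1,1,1}, 0x00123 };

    uint32_t pa;
    int rc = translate((2u << 9) | 0x44, USER, 0, table, 4, &pa);
    printf("rc=%d pa=0x%X\n", rc, pa);   /* rc=0 pa=0x24644 */
    return 0;
}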
A procedure executing in a less privileged mode often needs to call a procedure which executes in a more privileged mode: e.g., a user program needs an operating system service performed. The access mode is changed to a more privileged mode by executing a change mode instruction which transfers control to a routine executing at the new access mode. A return is made to the original access mode by executing a return from exception or interrupt instruction (REI). The current access mode is stored in the processor status longword (PSL), whose low order 16 bits comprise the PSW. Also stored in the PSL is the previous access mode; i.e., the access mode from which the current access mode was called. The previous mode information is used by the special probe instructions which validate arguments passed in cross access mode calls.

Procedures running at each of the access modes require a separate stack with appropriate accessibility. To facilitate this, each process has four copies of SP which are selected according to the current access mode field in the PSL. A procedure always accesses the correct stack by using R14.

In an earlier section, it was stated that the VAX-11 standard CALL instruction is used for all calls, including those for operating system services. Indeed procedures do call the operating system using the CALL instruction. The target of the CALL instruction is the minimal procedure consisting of an entry mask, a change mode instruction, and a return instruction. This access mode changing is transparent to the calling procedure.

Interrupts and Exceptions

Interrupts and exceptions are forced changes in control flow other than that explicitly indicated by the executing program. The


distinction between them is that interrupts are normally unrelated to the currently executing program while exceptions are a direct consequence of program execution. Examples of interrupt conditions are status changes in I/O devices while examples of exception conditions are arithmetic overflow or a memory management access control violation.
VAX-11 provides a 31 priority level interrupt system. Sixteen levels (16-31) are provided for hardware while 15 levels (1-15) are provided for software. (Level 0 is used for normal program execution.) The current interrupt priority level (IPL) is stored in a field in the PSL. When an interrupt request is made at a level higher than the IPL, the current PC and PSL are pushed on the stack and a new PC is obtained from a vector selected by the interrupt requester (a new PSL is generated by the CPU). Interrupts are serviced by routines executing with kernel mode access control. Since interrupts are appropriately serviced in a system wide rather than a specific process context, the stack used for interrupts is defined by another stack pointer called the interrupt stack pointer. (Just as for the multiple stack pointers used in process context, an interrupt routine accesses the interrupt stack using R14.) An interrupt service is terminated by execution of an REI instruction which loads PC and PSL from the top two longwords on the stack.
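A minimal C sketch of the interrupt priority behavior described above: a request is accepted only if its level exceeds the current IPL, and acceptance raises the IPL and transfers control through a vector. The data structures are illustrative only; the real machine also pushes PC and PSL on the stack at this point.

#include <stdio.h>

/* Illustrative model of IPL-based interrupt acceptance. */
struct cpu_state {
    unsigned pc;
    unsigned ipl;     /* current interrupt priority level, 0..31 */
};

/* Returns 1 if the request is accepted. */
static int request_interrupt(struct cpu_state *cpu, unsigned request_ipl,
                             unsigned vector_pc)
{
    if (request_ipl <= cpu->ipl)
        return 0;                 /* deferred until the IPL drops      */
    /* In the real machine PC and PSL are pushed on the stack here.    */
    cpu->ipl = request_ipl;       /* service at the requester's level  */
    cpu->pc  = vector_pc;
    return 1;
}

int main(void)
{
    struct cpu_state cpu = { 0x1000, 0 };    /* normal program execution */
    printf("device at IPL 20: %s\n",
           request_interrupt(&cpu, 20, 0x8000) ? "accepted" : "deferred");
    printf("device at IPL 5:  %s\n",
           request_interrupt(&cpu, 5, 0x8100) ? "accepted" : "deferred");
    return 0;
}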
Exceptions are handled like interrupts except for the following: (1) since exceptions arise in a specific process context, the kernel mode stack for that process is used to store PC and PSL and (2) additional parameters (such as the virtual address causing a page fault) may be pushed on the stack.

Process Context Switching
From the standpoint of the VAX-11 architecture, the process state or context consists of:

1 The 15 general registers R0-R13 and R15.
2 Four copies of R14 (SP): one for each of kernel, executive, supervisor, and user access modes.
3 The PSL.
4 Two base and two limit registers for the program and control region page tables.

This context is gathered together in a data structure called a process control block (PCB) which normally resides in memory. While a process is executing, the process context can be considered to reside in processor registers. To switch from one process to another it is required that the process context from the previously executing process be saved in its PCB in memory and the process context for the process about to be executed to be loaded from its PCB in memory. Two VAX-11 instructions support context switching. The save process context instruction saves the complete process context in memory while the load process context instruction loads the complete process context from memory.
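The following C sketch gathers the process context enumerated above into a process-control-block structure and shows save and load operations in the spirit of the save/load process context instructions. Field names and the register-file model are our own simplifications.

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Simplified process control block holding the per-process context. */
struct pcb {
    uint32_t gpr[14];      /* R0-R13                                 */
    uint32_t pc;           /* R15                                    */
    uint32_t sp[4];        /* kernel, executive, supervisor, user SP */
    uint32_t psl;
    uint32_t base[2];      /* program and control region base regs   */
    uint32_t length[2];    /* program and control region length regs */
};

/* Model of the processor registers that hold the running context. */
static struct pcb processor;

static void save_process_context(struct pcb *dst)       { *dst = processor; }
static void load_process_context(const struct pcb *src) { processor = *src; }

int main(void)
{
    struct pcb a, b;
    memset(&b, 0, sizeof b);
    b.pc = 0x2000;                    /* process B resumes here      */

    processor.pc = 0x1000;            /* process A currently running */
    save_process_context(&a);         /* switch A out ...            */
    load_process_context(&b);         /* ... and B in                */

    printf("running PC = 0x%X, saved A PC = 0x%X\n", processor.pc, a.pc);
    return 0;
}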

I/O
Much like the PDP-11, VAX-11 has no specific I/O instructions. Rather, I/O devices and device controllers are implemented with a set of registers which have addresses in the physical memory address space. The CPU controls I/O devices by writing these registers; the devices return status by writing these registers, which the CPU subsequently reads. The normal memory management mechanism controls access to I/O device registers, and a process having a particular device's registers mapped into its address space can control that device using the regular instruction set.

Compatibility Mode
As mentioned in the VAX-11 overview, compatibility mode in the VAX-11 architecture provides the basic PDP-11 instruction set less privileged and floating point instructions. Compatibility mode is intended to support a user as opposed to an operating system environment. Normally a compatibility mode program is combined with a set of native mode procedures whose purpose is to map service requests from some particular PDP-11 operating system environment into VAX/VMS services.
In compatibility mode the 16-bit PDP-11 addresses are zero extended to 32-bits where standard native mode mapping and access control apply. The eight 16-bit PDP-11 general registers overmap the native mode general registers R0-R6 and R15 and thus the PDP-11 processor state is contained wholly within the native mode processor state.
Compatibility mode is entered by setting the compatibility mode bit in the PSL. Compatibility mode is left by executing a PDP-11 trap instruction (such as those used to make operating system service requests), and on interrupts and exceptions.

VAX-11/780 Implementation

VAX-11/780
The VAX-11/780 computer system is the first implementation of the VAX-11 architecture. For instructions executed in compatibility mode, the VAX-11/780 has a performance comparable to the PDP-11/70. For instructions executed in native mode, the -11/780 has a performance in excess of the -11/70 and thus represents the new high end of the -11 (LSI-11, PDP-11, VAX-11) family.
A block diagram of the -11/780 system is given in Fig. 10. The system consists of a central processing unit (CPU), the console subsystem, the memory subsystem, and the I/O subsystem. The
CPU and the memory and I/O subsystems are joined by a high speed synchronous bus called the Synchronous Backplane Interconnect (SBI).

CPU

The CPU is a microprogrammed processor which implements the native and compatibility mode instruction sets, the memory management, and the interrupt and exception mechanisms. The CPU has 32-bit main data paths and is built almost entirely of conventional Schottky TTL components.
To reduce effective memory access time the CPU includes an 8K byte write-through cache or buffer memory. The cache organization is 2-way associative with an 8-byte block size. To reduce delays due to writes, the CPU includes a write buffer. The CPU issues the write to the buffer and the actual memory write takes place in parallel with other CPU activity.
The CPU contains a 128 entry address translation buffer which is a cache of recent virtual to physical translations. The buffer is divided into two 64 entry sections: one for the per process regions and one for the system region. This division permits the system region translations to remain unaffected by a process context switch.
A fourth buffer in the CPU is the 8-byte instruction buffer. It serves two purposes. First, it decomposes the highly variable instruction format into its basic components and, second, it constantly fetches ahead to reduce delays in obtaining the instruction components.
The CPU includes two standard clocks. The programmable real-time clock is used by the operating system for local timing purposes. The time-of-year clock, with its own battery backup, is the long term time reference for the operating system. It is automatically read on system startup to eliminate the need for manual entry of date and time.
The CPU includes 12K bytes of writable diagnostic control store (WDCS) which is used for diagnostic purposes, implementation of certain instructions, and future microcode changes. As an option for very sophisticated users, another 12K bytes of writable control store is available. A second option is the floating point accelerator (FPA). Although the basic CPU implements the full floating point instruction set, the FPA provides high speed floating point hardware. It is logically invisible to programs and only affects their running time.

Console Subsystem

The console subsystem is centered around an LSI-11 computer with 16K bytes of RAM and 8K bytes of ROM (used to store the LSI-11 bootstrap, LSI-11 diagnostics, and console routines). Also included are a floppy disk, an interface to the console terminal, and a port for remote diagnostic purposes. The floppy disk in the console subsystem serves multiple purposes. It stores the main system bootstrap and diagnostics and serves as a medium for distribution of software updates.

SBI

The SBI is the primary control and data transfer path in the -11/780 system. Because the cache and write buffer largely decouple the CPU performance from the memory access time, the SBI design was optimized for bandwidth and reliability rather than the lowest possible access time.
The SBI is a synchronous bus with a cycle time of 200 nsec. The data path width of the SBI is 32 bits. During each 200 nsec cycle either 32 bits of data or a 30-bit physical address can be transferred. Since each 32-bit read or write requires transmission of both address and data, two SBI cycles are used for a complete transaction. The SBI protocol permits 64-bit reads or writes using one address cycle and two data transfer cycles: the CPU and I/O subsystem use this mode whenever possible.
For read transactions the bus is reacquired by the memory in order to send the data: thus the bus is not held during the memory access time. Arbitration of the SBI is distributed: each interface to the SBI has a specific priority and its own bus request line. When an interface wishes to use the bus, it asserts its bus request line. If at the end of a 200 nsec cycle there are no interfaces of higher priority requesting the bus, the interface takes control of the bus. Extensive checking is done on the SBI. Each transfer is parity checked and confirmed by the receiver. The arbitration process and general observance of the SBI protocol are checked by each SBI interface during each SBI cycle. The processor maintains a
running 16-cycle history of the SBI: any SBI error condition causes this history to be locked and preserved for diagnostic purposes.

Memory Subsystem
The memory subsystem consists of one or two memory controllers with up to 1M bytes of memory on each. The memory is organized in 64-bit quadwords with an 8-bit ECC which provides single bit error correction and double bit error detection. The memory is built of 4K MOS RAM components.
The memory controllers have buffers which hold up to four memory requests. These buffers substantially increase the utilization of the SBI and memory by permitting the pipelining of multiple memory requests. If desired, quadword physical addresses can be interleaved across the memory controllers.
As an option, battery backup is available which preserves the contents of memory across short term power failures.

I/O Subsystem
The I/O subsystem consists of buffered interfaces or adapters between the SBI and the two types of peripheral busses used on PDP-11 systems: the Unibus and the Massbus. One Unibus adapter and up to four Massbus adapters can be configured on a VAX-11/780 system.
The Unibus is a medium speed multiplexor bus which is used as a primary memory as well as peripheral bus in many PDP-11 systems. It has an 18-bit physical address space and supports byte and word transfers. In addition to implementing the Unibus protocol and transmitting interrupts to the CPU, the Unibus adapter provides two other functions. The first is mapping 18-bit Unibus addresses to 30-bit SBI physical addresses. This is accomplished in a manner substantially identical to the virtual to physical mapping implemented by the CPU. The Unibus address space is divided into 512 512-byte pages. Each Unibus page has a page table entry (residing in the Unibus adapter) which maps addresses in that page to physical memory addresses. In addition to providing address translation, the mapping permits contiguous transfers on the Unibus which cross page boundaries to be mapped to discontiguous physical memory page frames.
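A short C sketch of the Unibus address mapping just described: an 18-bit Unibus address is split into a page number and a byte offset within a 512-byte page, the adapter's page table supplies a physical page frame, and a 30-bit SBI address is formed. The map-entry format is an illustrative simplification of the adapter's page table entries.

#include <stdint.h>
#include <stdio.h>

#define UNIBUS_PAGES 512          /* 512 pages of 512 bytes = 18-bit space */

/* Illustrative map: one physical page frame number per Unibus page. */
static uint32_t unibus_map[UNIBUS_PAGES];

/* Map an 18-bit Unibus address to a 30-bit SBI physical address. */
static uint32_t unibus_to_sbi(uint32_t unibus_addr)
{
    uint32_t page   = (unibus_addr >> 9) & 0x1FF;   /* bits 17:9 */
    uint32_t offset = unibus_addr & 0x1FF;          /* bits 8:0  */
    return (unibus_map[page] << 9) | offset;
}

int main(void)
{
    /* Two adjacent Unibus pages mapped to discontiguous physical frames,
     * so a transfer crossing the page boundary still works. */
    unibus_map[4] = 0x01000;
    unibus_map[5] = 0x00237;

    printf("0x%05X -> 0x%08X\n", 4u << 9 | 0x1F0, unibus_to_sbi(4u << 9 | 0x1F0));
    printf("0x%05X -> 0x%08X\n", 5u << 9 | 0x010, unibus_to_sbi(5u << 9 | 0x010));
    return 0;
}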
The second function performed by the Unibus adapter is assembling 16-bit Unibus transfers (both reads and writes) into 64-bit SBI transfers. This operation (which is applicable only to block transfers such as from disks) appreciably reduces SBI traffic due to Unibus operations. There are 15 8-byte buffers in the Unibus adapter permitting 15 simultaneous buffered transactions. Additionally there is an unbuffered path through the Unibus adapter permitting an arbitrary number of simultaneous unbuffered transfers.
The Massbus is a high speed block bus used primarily for disks and tapes. The Massbus adapter provides much the same functionality as the Unibus adapter. The physical addresses into which transfers are made are defined by a page table: again this permits contiguous device transfers into discontiguous physical memory.
Buffering is provided in the Massbus adapter which minimizes the probability of device overruns and assembles data into 64-bit units for transfer over the SBI.

References
Bell and Strecker [1976]; Bell et al. [1970]; Flynn [1977]; Levy and Eckhouse [1980]; McLean [1977]; Myers [1977b]; Needham [1972]; Needham and Walker [1977]; Organick [1972]; Schroeder and Saltzer [1971].
APPENDIX 1 VAX-11 INSTRUCTION SET
Integer and Floating Point Logical Instructions

MOV-      Move (B, W, L, F, D, Q)†
MNEG-     Move Negated (B, W, L, F, D)
MCOM-     Move Complemented (B, W, L)
MOVZ-     Move Zero-Extended (BW, BL, WL)
CLR-      Clear (B, W, L=F, Q=D)
CVT-      Convert (B, W, L, F, D)(B, W, L, F, D)
CVTR-L    Convert Rounded (F, D) to Longword
CMP-      Compare (B, W, L, F, D)
TST-      Test (B, W, L, F, D)
BIS-2     Bit Set (B, W, L) 2-Operand
BIS-3     Bit Set (B, W, L) 3-Operand
BIC-2     Bit Clear (B, W, L) 2-Operand
BIC-3     Bit Clear (B, W, L) 3-Operand
BIT-      Bit Test (B, W, L)
XOR-2     Exclusive OR (B, W, L) 2-Operand
XOR-3     Exclusive OR (B, W, L) 3-Operand
ROTL      Rotate Longword
PUSHL     Push Longword
Integer and Floating Point Arithmetic Instructions

INC-      Increment (B, W, L)
DEC-      Decrement (B, W, L)
ASH-      Arithmetic Shift (L, Q)
ADD-2     Add (B, W, L, F, D) 2-Operand
ADD-3     Add (B, W, L, F, D) 3-Operand
ADWC      Add with Carry
ADAWI     Add Aligned Word Interlocked

†B = byte, W = word, L = longword, F = floating, D = double floating, Q = quadword, S = set, C = clear.

SUB-2     Subtract (B, W, L, F, D) 2-Operand
SUB-3     Subtract (B, W, L, F, D) 3-Operand
SBWC      Subtract with Carry
MUL-2     Multiply (B, W, L, F, D) 2-Operand
MUL-3     Multiply (B, W, L, F, D) 3-Operand
EMUL      Extended Multiply
DIV-2     Divide (B, W, L, F, D) 2-Operand
DIV-3     Divide (B, W, L, F, D) 3-Operand
EDIV      Extended Divide
EMOD-     Extended Modulus (F, D)
POLY-     Polynomial Evaluation (F, D)
Index Instruction
 
 
INDEX Compute Index
Packed Decimal Instructions
 
 

MOVP Move Packed
CMPP3 Compare Packed 3-Operand
CMPP4 Compare Packed 4-Operand
ASHP Arithmetic Shift and Round Packed
ADDP4 Add Packed 4-Operand
ADDP6 Add Packed 6-Operand
SUBP4 Subtract Packed 4-Operand
SUBP6 Subtract Packed 6-Operand
MULP Multiply Packed
DIVP Divide Packed
CVTLP Convert Long to Packed
CVTPL Convert Packed to Long
CVTPT Convert Packed to Trailing
CVTTP Convert Trailing to Packed
CVTPS Convert Packed to Separate
CVTSP Convert Separate to Packed
EDITPC Edit Packed to Character String
Character String Instructions
 
 

MOVC3 Move Character 3-Operand
MOVC5 Move Character 5-Operand
MOVTC Move Translated Characters
MOVTUC Move Translated Until Character
CMPC3 Compare Characters 3-Operand
CMPC5 Compare Characters 5-Operand
LOCC Locate Character
SKPC Skip Character
SCANC Scan Characters
SPANC Span Characters
MATCHC Match Characters
Variable-Length Bit Field Instructions
 
 

EXTV Extract Field
EXTZV Extract Zero-Extended Field
INSV Insert Field
CMPV Compare Field
CMPZV Compare Zero-Extended Field
FFS Find First Set
FFC Find First Clear
Branch on Bit Instructions
 
 

BLB- Branch on Low Bit (S,C)
BB- Branch on Bit (S,C)
BBS- Branch on Bit Set and (Set,Clear) Bit
BBC- Branch on Bit Clear and (Set,Clear) Bit
BBSSI Branch on Bit Set and Set Bit Interlocked
BBCCI Branch on Bit Clear and Clear Bit Interlocked
Queue Instructions
 
 

INSQUE Insert Entry in Queue
REMQUE Remove Entry from Queue
Address Manipulation Instructions
 
 

MOVA- Move Address (B,W,L=F,Q=D)
PUSHA- Push Address (B,W,L=F,Q=D) on Stack
Processor State Instructions
 
 

PUSHR Push Registers on Stack
POPR Pop Registers from Stack
MOVPSL Move from Processor Status Longword
BISPSW Bit Set Processor Status Word
BICPSW Bit Clear Processor Status Word
Unconditional Branch and Jump Instructions
 
 

BR- Branch with (B,W) Displacement
JMP Jump
Branch on Condition Code
 
 

BLSS Less Than
BLSSU Less Than Unsigned
(BCS) (Carry Set)
BLEQ Less Than or Equal
BLEQU Less Than or Equal Unsigned
BEQL Equal
(BEQLU) (Equal Unsigned)
BNEQ Not Equal
(BNEQU) (Not Equal Unsigned)
BGTR Greater Than
BGTRU Greater Than Unsigned
BGEQ Greater Than or Equal
BGEQU Greater Than or Equal Unsigned
(BCC) (Carry Clear)
BVS Overflow Set
BVC Overflow Clear

Loop and Case Branch

ACB- Add, Compare and Branch (B,W,L,F,D)
AOBLEQ Add One and Branch Less Than or Equal
AOBLSS Add One and Branch Less Than
SOBGEQ Subtract One and Branch Greater Than or Equal
SOBGTR Subtract One and Branch Greater Than
CASE- Case on (B,W,L)
Subroutine Call and Return Instructions

BSB Branch to Subroutine with (B,W) Displacement
JSB Jump to Subroutine
RSB Return from Subroutine
Procedure Call and Return Instructions

CALLG Call Procedure with General Argument List
CALLS Call Procedure with Stack Argument List
RET Return from Procedure
Access Mode Instructions

CHM- Change Mode to (Kernel, Executive, Supervisor, User)
REI Return from Exception or Interrupt
PROBER Probe Read
PROBEW Probe Write
Privileged Processor Register Control Instructions

SVPCTX Save Process Context
LDPCTX Load Process Context
MTPR Move to Processor Register
MFPR Move from Processor Register
Special Function Instructions

CRC Cyclic Redundancy Check
BPT Breakpoint Fault
XFC Extended Function Call
NOP No Operation
HALT Halt


In the summer of 1960, Control Data began a project which culminated in October 1964 in the delivery of the first 6600 Computer. In 1960 it was apparent that brute force circuit performance and parallel operation were the two main approaches to any advanced computer.
This paper presents some of the considerations having to do with the parallel operations in the 6600. A most important and fortunate event coincided with the beginning of the 6600 project. This was the appearance of the high-speed silicon transistor, which survived early difficulties to become the basis for a nice jump in circuit performance.

System Organization

The computing system envisioned in that project, and now called the 6600, paid special attention to two kinds of use, the very large scientific problem and the time sharing of smaller problems. For the large problem, a high-speed floating point central processor with access to a large central memory was obvious. Not so obvious, but important to the 6600 system idea, was the isolation of this central arithmetic from any peripheral activity.
It was from this general line of reasoning that the idea of a multiplicity of peripheral processors was formed (Fig. 1). Ten such peripheral processors have access to the central memory on one side and the peripheral channels on the other. The executive control of the system is always in one of these peripheral processors, with the others operating on assigned peripheral or control tasks. All ten processors have access to twelve input-output channels and may "change hands," monitor channel activity, and perform other related jobs. These processors have access to central memory, and may pursue independent transfers to and from this memory.
Each of the ten peripheral processors contains its own memory for program and buffer areas, thereby isolating and protecting the more critical system control operations in the separate processors. The central processor operates from the central memory with relocating register and file protection for each program in central memory.

Peripheral and Control Processors
 
The peripheral and control processors are housed in one chassis of the main frame. Each processor contains 4096 memory words of 12 bits length. There are 12- and 24-bit instruction formats to provide for direct, indirect, and relative addressing. Instructions provide logical, addition, subtraction, and conditional branching. Instructions also provide single word or block transfers to and from any of twelve peripheral channels, and single word or block transfers to and from central memory. Central memory words of 60 bits length are assembled from five consecutive peripheral words. Each processor has instructions to interrupt the central processor and to monitor the central program address.
To get this much processing power with reasonable economy and space, a time-sharing design was adopted (Fig. 2). This design contains a register "barrel" around which is moving the dynamic information for all ten processors. Such things as program address, accumulator contents, and other pieces of information totalling 52 bits are shifted around the barrel. Each complete trip around requires one major cycle or one thousand nanoseconds. A "slot" in the barrel contains adders, assembly networks, distribution network, and interconnections to perform one step of any peripheral instruction. The time to perform this step or, in other words, the time through the slot, is one minor cycle or one hundred nanoseconds. Each of the ten processors, therefore, is allowed one minor cycle of every ten to perform one of its steps. A peripheral instruction may require one or more of these steps, depending on the kind of instruction.
In effect, the single arithmetic and the single distribution and assembly network are made to appear as ten. Only the memories are kept truly independent. Incidentally, the memory read-write cycle time is equal to one complete trip around the barrel, or one thousand nanoseconds.
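One way to picture the barrel is as strict round-robin time-slicing of a single execution "slot" among ten processor states, one minor cycle apiece. The toy model below only illustrates that scheduling arithmetic (a 100-nanosecond minor cycle, a 1000-nanosecond trip around the barrel); the state it carries and the "step" it performs are placeholders, not 6600 logic.

    #include <stdio.h>

    #define NPROC    10     /* peripheral processors sharing the barrel */
    #define MINOR_NS 100    /* one pass through the slot                */

    /* Placeholder for the 52 bits of dynamic state (program address,
       accumulator, etc.) carried around the barrel for each processor. */
    struct pp_state {
        unsigned program_address;
        unsigned accumulator;
    };

    int main(void)
    {
        struct pp_state barrel[NPROC] = { { 0, 0 } };
        unsigned long t_ns = 0;

        /* Three full trips around the barrel: the slot serves each
           processor exactly once every NPROC minor cycles.             */
        for (int step = 0; step < 3 * NPROC; step++) {
            int pp = step % NPROC;            /* whose turn in the slot */
            barrel[pp].program_address += 1;  /* stand-in for one step  */
            t_ns += MINOR_NS;
        }

        printf("elapsed %lu ns; each processor advanced %u steps\n",
               t_ns, barrel[0].program_address);
        return 0;
    }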
Input-output channels are bi-directional, 12-bit paths. One 12-bit word may move in one direction every major cycle, or 1000 nanoseconds, on each channel. Therefore, a maximum burst rate of 120 million bits per second is possible using all ten peripheral processors. A sustained rate of about 50 million bits per second can be maintained in a practical operating system. Each channel may service several peripheral devices and may interface to other systems, such as satellite computers.
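The quoted burst rate is simple arithmetic: one 12-bit word per channel every 1000-nanosecond major cycle is 12 million bits per second per channel, and ten processors driving channels at once gives 120 million bits per second. A small check, purely for illustration:

    #include <stdio.h>

    int main(void)
    {
        const double word_bits   = 12.0;      /* bits per channel word       */
        const double major_cycle = 1000e-9;   /* seconds per major cycle     */
        const double active      = 10.0;      /* processors driving channels */

        double per_channel = word_bits / major_cycle;   /* 12 Mbit/s  */
        double burst       = per_channel * active;      /* 120 Mbit/s */

        printf("per channel %.0f bit/s, burst %.0f bit/s\n", per_channel, burst);
        return 0;
    }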
Peripheral and control processors access central memory through an assembly network and a dis-assembly network.


Since five peripheral memory references are required to make up one central memory word, a natural assembly network of five levels is used. This allows five references to be "nested" in each network during any major cycle. The central memory is organized in independent banks with the ability to transfer central words every minor cycle. The peripheral processors, therefore, introduce at most about 2% interference at the central memory address control. A single real time clock, continuously running, is available to all peripheral processors.
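The five-to-one assembly can be pictured as packing five consecutive 12-bit peripheral words into one 60-bit central word. The function below is only an illustrative model of that packing (the ordering is chosen arbitrarily for the sketch), not the actual five-level assembly network.

    #include <stdint.h>
    #include <stdio.h>

    /* Pack five 12-bit peripheral words into one 60-bit central word,
       held here in a 64-bit integer with the first word in the most
       significant 12 bits.                                             */
    uint64_t assemble_central_word(const uint16_t pw[5])
    {
        uint64_t central = 0;
        for (int i = 0; i < 5; i++)
            central = (central << 12) | (pw[i] & 0xFFF);
        return central;
    }

    int main(void)
    {
        uint16_t pw[5] = { 01, 02, 03, 04, 05 };   /* octal sample words */
        printf("central word = %020llo (octal)\n",
               (unsigned long long)assemble_central_word(pw));
        return 0;
    }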
 
Central Processor

The 6600 central processor may be considered the high-speed arithmetic unit of the system (Fig. 3). Its program, operands, and results are held in the central memory. It has no connection to the peripheral processors except through memory and except for two single controls. These are the exchange jump, which starts or interrupts the central processor from a peripheral processor, and the central program address, which can be monitored by a peripheral processor.

A key description of the 6600 central processor, as you will see in later discussion, is "parallel by function." This means that a number of arithmetic functions may be performed concurrently. To this end, there are ten functional units within the central processor. These are the two increment units, floating add unit, fixed add unit, shift unit, two multiply units, divide unit, boolean unit, and branch unit. In a general way, each of these units is a three-address unit. As an example, the floating add unit obtains two 60-bit operands from the central registers and produces a 60-bit result which is returned to a register.

Information to and from these units is held in the central registers, of which there are twenty-four. Eight of these are considered index registers, are of 18 bits length, and one of them always contains zero. Eight are considered address registers, are of 18 bits length, and serve to address the five read central memory trunks and the two store central memory trunks. Eight are considered floating point registers, are of 60 bits length, and are the only central registers to access central memory during a central program.
In a sense, just as the whole central processor is hidden behind central memory from the peripheral processors, so, too, the ten functional units are hidden behind the central registers from central memory. As a consequence, a considerable instruction efficiency is obtained and an interesting form of concurrency is feasible and practical. The fact that a small number of bits can give meaningful definition to any function makes it possible to develop forms of operand and unit reservations needed for a general scheme of concurrent arithmetic.

Instructions are organized in two formats, a 15-bit format and a 30-bit format, and may be mixed in an instruction word (Fig. 4). As an example, a 15-bit instruction may call for an ADD, designated by the f and m octal digits, from registers designated by the j and k octal digits, the result going to the register designated by the i octal digit. In this example, the addresses of the three-address floating add unit are only three bits in length, each address referring to one of the eight floating point registers. The 30-bit format follows this same form but substitutes for the k octal digit an 18-bit constant K which serves as one of the input operands. These two formats provide a highly efficient control of concurrent operations.

As a background, consider the essential difference between a general purpose device and a special device in which high speeds are required. The designer of the special device can generally improve on the traditional general purpose device by introducing some form of concurrency. For example, some activities of a
housekeeping nature may be performed separate from the main sequence of operations in separate hardware. The total time to complete a job is then optimized to the main sequence and excludes the housekeeping. The two categories operate concurrently. It would be, of course, most attractive to provide in a general purpose device some generalized scheme to do the same kind of thing. The organization of the 6600 central processor provides just this kind of scheme.

With a multiplicity of functional units, and of operand registers, and with a simple and highly efficient addressing system, a generalized queue and reservation scheme is practical. This is called the scoreboard. The scoreboard maintains a running file of each central register, of each functional unit, and of each of the three operand trunks to and from each unit. Typically, the scoreboard file is made up of two-, three-, and four-bit quantities identifying the nature of register and unit usage. As each new instruction is brought up, the conditions at the instant of issuance are set into the scoreboard. A snapshot is taken, so to speak, of the pertinent conditions. If no waiting is required, the execution of the instruction is begun immediately under control of the unit itself. If waiting is required (for example, an input operand may not yet be available in the central registers), the scoreboard controls the delay, and when released, allows the unit to begin its execution. Most important, this activity is accomplished in the scoreboard and the functional unit, and does not necessarily limit later instructions from being brought up and issued. In this manner, it is possible to issue a series of instructions, some related, some not, until no functional units are left free or until a specific register is to be assigned more than one result. With just those two restrictions on issuing (unit free and no double result), several independent chains of instructions may proceed concurrently. Instructions may issue every minor cycle in the absence of the two restraints. The instruction executions, in comparison, range from three minor cycles for fixed add, 10 minor cycles for floating multiply, to 29 minor cycles for floating divide.

To provide a relatively continuous source of instructions, one buffer register of 60 bits is located at the bottom of an instruction stack capable of holding 32 instructions (Fig. 5). Instruction words from memory enter the bottom register of the stack, pushing up the old instruction words. In straight line programs, only the bottom two registers are in use, the bottom being refilled as quickly as memory conflicts allow. In programs which branch back to an instruction in the upper stack registers, no refills are allowed after the branch, thereby holding the program loop completely in the stack. As a result, memory access or memory conflicts are no longer involved, and a considerable speed increase can be had.
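The two issue restraints just described (a free functional unit, and no register already reserved as the destination of a pending result) can be sketched as a small reservation check. The structures, field names, and encoding below are assumptions made for this sketch; they are not the actual scoreboard, which also tracks operand trunks and source-operand availability.

    #include <stdbool.h>

    #define NUNITS 10   /* functional units   */
    #define NREGS  24   /* central registers  */

    static bool busy[NUNITS];          /* is the unit executing?            */
    static int  result_owner[NREGS];   /* unit with a pending result, or -1 */

    /* Decoded instruction: the unit it needs and its result register.
       Waiting for source operands is left to the unit and omitted here. */
    struct instr { int unit; int dest; };

    void scoreboard_init(void)
    {
        for (int r = 0; r < NREGS; r++)
            result_owner[r] = -1;
    }

    /* The two restraints: unit free, and no double result on a register. */
    bool can_issue(const struct instr *i)
    {
        return !busy[i->unit] && result_owner[i->dest] < 0;
    }

    void issue(const struct instr *i)          /* snapshot the reservation */
    {
        busy[i->unit] = true;
        result_owner[i->dest] = i->unit;
    }

    void complete(const struct instr *i)       /* release the reservation  */
    {
        busy[i->unit] = false;
        result_owner[i->dest] = -1;
    }

    int main(void)
    {
        scoreboard_init();
        struct instr mul = { 4, 7 };      /* hypothetical: unit 4, result to reg 7 */
        if (can_issue(&mul)) issue(&mul);
        struct instr add = { 2, 7 };      /* same result register                  */
        return can_issue(&add) ? 1 : 0;   /* 0: the add must wait, as expected     */
    }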



floating point register). Any instruction calling for an address register result implicitly initiates a memory reference on that trunk. These instructions are handled through the scoreboard and therefore tend to overlap memory access with arithmetic. For example, a new memory word to be loaded in a floating point register can be brought in from memory but may not enter the register until all previous uses of that register are completed. The central registers, therefore, provide all of the data to the ten functional units, and receive all of the unit results. No storage is maintained in any unit.

Central memory is organized in 32 banks of 4096 words. Consecutive addresses call for a different bank; therefore, adjacent addresses in one bank are in reality separated by 32 (a short sketch of this interleaving follows the list below). Addresses may be issued every 100 nanoseconds. A typical central memory information transfer rate is about 250 million bits per second.

As mentioned before, the functional units are hidden behind the registers. Although the units might appear to increase hardware duplication, a pleasant fact emerges from this design. Each unit may be trimmed to perform its function without regard to others. Speed increases are had from this simplified design. As an example of special functional unit design, the floating multiply accomplishes the coefficient multiplication in nine minor cycles plus one minor cycle to put away the result, for a total of 10 minor cycles, or 1000 nanoseconds. The multiply uses layers of carry save adders grouped in two halves. Each half concurrently forms a partial product, and the two partial products finally merge while the long carries propagate. Although this is a fairly large complex of circuits, the resulting device was sufficiently smaller than originally planned to allow two multiply units to be included in the final design.

To sum up the characteristics of the central processor, remember that the broad-brush description is "concurrent operation." In other words, any program operating within the central processor utilizes some of the available concurrency. The program need not be written in a particular way, although certainly some optimization can be done. The specific method of accomplishing this concurrency involves issuing as many instructions as possible while handling most of the conflicts during execution. Some of the essential requirements for such a scheme include:
1 Many functional units
2 Units with three address properties
3 Many transient registers with many trunks to and from the units
4 A simple and efficient instruction set
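Returning to the central memory organization described above: with 32 independent banks and consecutive addresses assigned to consecutive banks, two addresses share a bank only when they differ by a multiple of 32. The lines below are only an illustrative model of that modulo arrangement, not the 6600 address control.

    #include <stdio.h>

    #define NBANKS 32    /* each bank holds 4096 words */

    int main(void)
    {
        /* Consecutive central memory addresses fall in consecutive banks,
           so adjacent words in the same bank are 32 addresses apart.      */
        for (unsigned addr = 0; addr < 5; addr++) {
            unsigned bank   = addr % NBANKS;
            unsigned offset = addr / NBANKS;   /* word within the bank */
            printf("address %2u -> bank %2u, word %u\n", addr, bank, offset);
        }
        return 0;
    }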

Construction

Circuits in the 6600 computing system use all-transistor logic (Fig. 7). The silicon transistor operates in saturation when switched "on" and averages about five nanoseconds of stage delay. Logic circuits are constructed in a cordwood plug-in module of about 2½ inches by 2½ inches by 0.8 inch. An average of about 50 transistors are contained in these modules.

Memory circuits are constructed in a plug-in module of about six inches by six inches by 2½ inches (Fig. 8). Each memory module contains a coincident current memory of 4096 12-bit words. All read-write drive circuits and bit drive circuits plus address translation are contained in the module. One such module is used for each peripheral processor, and five modules make up one bank of central memory.

Logic modules and memory modules are held in upright hinged chassis in an X-shaped cabinet (Fig. 9). Interconnections between modules on the chassis are made with twisted pair transmission lines. Interconnections between chassis are made with coaxial cables.

Both maintenance and operation are accomplished at a programmed display console (Fig. 10). More than one of these consoles may be included in a system if desired. Dead start facilities bring the ten peripheral processors to a condition which allows information to enter from any chosen peripheral device. Such loads normally bring in an operating system which provides a highly sophisticated capability for multiple users, maintenance, and so on.

The 6600 Computer has taken advantage of certain technology advances, but more particularly, logic organization advances which now appear to be quite successful. Control Data is exploring advances in technology upward within the same compatible structure, and identical technology downward, also within the same compatible structure.










The CRAY-1 Computer System

 
This paper describes the CRAY-1, discusses the evolution of its architecture, and gives an account of some of the problems that were overcome during its manufacture.
The CRAY-1 is the only computer to have been built to date that satisfies ERDA's Class VI requirement (a computer capable of processing from 20 to 60 million floating point operations per second) [Keller 1976].
The CRAY-1's Fortran compiler (CFT) is designed to give the scientific user immediate access to the benefits of the CRAY-1's vector processing architecture. An optimizing compiler, CFT "vectorizes" innermost DO loops. Compatible with the ANSI 1966 Fortran Standard and with many commonly supported Fortran extensions, CFT does not require any source program modifications or the use of additional nonstandard Fortran statements to achieve vectorization. Thus the user's investment of hundreds of man-months of effort to develop Fortran programs for other contemporary computers is protected.

Introduction

Vector processors are not yet commonplace machines in the larger-scale computer market. At the time of this writing we know of only 12 non-CRAY-1 vector processor installations worldwide. Of these 12, the most powerful processor is the ILLIAC IV (1 installation), the most populous is the Texas Instruments Advanced Scientific Computer (7 installations) and the most publicized is Control Data's STAR 100 (4 installations). In its report on the CRAY-1, Auerbach Computer Technology Reports published a comparison of the CRAY-1, the ASC, and the STAR 100 [Auerbach, n.d.]. The CRAY-1 is shown to be a more powerful computer than any of its main competitors and is estimated to be the equivalent of five IBM 370/195s.
Independent benchmark studies have shown the CRAY-1 fully capable of supporting computational rates of 138 million floating- point operations per second (MFLOPS) for sustained periods and even higher rates of 250 MFLOPS in short bursts [Calahan, Joy, and Orbits, n.d.; Reeves 1975]. Such comparatively high performance results from the CRAY-1 internal architecture, which is designed to accommodate the computational needs of carrying out many calculations in discrete steps, with each step producing interim results used in subsequent steps. Through a technique called "chaining," the CRAY-1 vector functional units, in combination with scalar and vector registers, generate interim results and use them again immediately without additional memory references, which slow down the computational process in other contemporary computer systems.
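Chaining can be pictured in ordinary scalar code as feeding each freshly produced result straight into the next operation instead of storing a whole intermediate vector and reading it back. The C sketch below is only an analogy for that data flow; it is not CRAY hardware behavior or CFT output, and the 64-element length simply mirrors the size of a vector register.

    #include <stdio.h>

    #define N 64   /* one vector register holds 64 elements */

    /* Unchained picture: the multiply runs over the whole vector and its
       intermediate result t[] is written out before the add can start.   */
    void unchained(const double a[N], const double b[N], const double c[N],
                   double y[N])
    {
        double t[N];
        for (int i = 0; i < N; i++) t[i] = a[i] * b[i];
        for (int i = 0; i < N; i++) y[i] = t[i] + c[i];
    }

    /* Chained picture: each multiply result is consumed by the add as
       soon as it is produced, with no intermediate vector stored.        */
    void chained(const double a[N], const double b[N], const double c[N],
                 double y[N])
    {
        for (int i = 0; i < N; i++)
            y[i] = a[i] * b[i] + c[i];
    }

    int main(void)
    {
        double a[N], b[N], c[N], y1[N], y2[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0; c[i] = 1.0; }
        unchained(a, b, c, y1);
        chained(a, b, c, y2);
        printf("y[63] = %.1f (both forms agree: %.1f)\n", y2[63], y1[63]);
        return 0;
    }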
Other features enhancing the CRAY-1's computational capabilities are: its small size, which reduces distances electrical signals must travel within the computer's framework and allows a 12.5 nanosecond clock period (the CRAY-1 is the world's fastest scalar processor); a one million word semiconductor memory equipped with error detection and correction logic (SECDED); its 64-bit word size; and its optimizing Fortran compiler.

Architecture

The CRAY-1 has been called "the world's most expensive love-seat" [Computer World, 1976]. Certainly, most people's first reaction to the CRAY-1 is that it is so small. But in computer design it is a truism that smaller means faster. The greater the separation of components, the longer the time taken for a signal to pass between them. A cylindrical shape was chosen for the CRAY-1 in order to keep wiring distances small.
Figure 1 shows the physical dimensions of the machine. The mainframe is composed of 12 wedgelike columns arranged in a 270° arc. This leaves room for a reasonably trim individual to gain access to the interior of the machine. Note that the love-seat disguises the power supplies and some plumbing for the Freon cooling system. The photographs (Figs. 2 and 3) show the interior of a working CRAY-1 and an interior view of a column with one module in place. Figure 4 is a photograph of a single module.

An Analysis of the Architecture

Table 1 details important characteristics of the CRAY-1 Computer System. The CRAY-1 is equipped with 12 i/o channels, 16 memory banks, 12 functional units, and more than 4K bytes of register storage. Access to memory is shared by the i/o channels and high-speed registers. The most striking features of the CRAY-1 are: only four chip types, main memory speed, cooling system, and computation section.

Four Chip Types

Only four chip types are used to build the CRAY-1. These are 16 × 4 bit bipolar register chips (6 nanosecond cycle time), 1024 × 1 bit bipolar memory chips (50 nanosecond cycle time), and bipolar logic chips with subnanosecond propagation times. The logic chips are all simple low- or high-speed gates with both a 5 wide and a 4 wide gate (5/4 NAND). Emitter-coupled logic circuit (ECL) technology is used throughout the CRAY-1.

The printed circuit board used in the CRAY-1 is a 5-layer board with the two outer surfaces used for signal runs and the three inner layers for -5.2V, -2.0V, and ground power supplies. The boards are six inches wide, 8 inches long, and fit into the chassis as shown in Fig. 3. All integrated circuit devices used in the CRAY-1 are packaged in 16-pin hermetically sealed flat packs supplied by both Fairchild and Motorola. This type of package was chosen for its reliability and compactness. Compactness is of special importance; as many as 288 packages may be added to a board to fabricate a module (there are 113 module types), and as many as 72 modules may be inserted into a 28-inch-high chassis. Such component densities inevitably lead to a mammoth cooling problem (to be described).
Table 1 CRAY-1 CPU Characteristics Summary

Computation Section
  • Scalar and vector processing modes
  • 12.5 nanosecond clock period operation
  • 64-bit word size
  • Integer and floating-point arithmetic
  • Twelve fully segmented functional units
  • Eight 24-bit address (A) registers
  • Sixty-four 24-bit intermediate address (B) registers
  • Eight 64-bit scalar (S) registers
  • Sixty-four 64-bit intermediate scalar (T) registers
  • Eight 64-element vector (V) registers (64 bits per element)
  • Vector length and vector mask registers
  • One 64-bit real time clock (RT) register
  • Four instruction buffers of sixty-four 16-bit parcels each
  • 128 basic instructions
  • Prioritized interrupt control

Memory Section
  • 1,048,576 64-bit words (plus 8 check bits per word)
  • 16 independent banks of 65,536 words each
  • 4 clock period bank cycle time
  • 1 word per clock period transfer rate for B, T, and V registers
  • 1 word per 2 clock periods transfer rate for A and S registers
  • 4 words per clock period transfer rate to instruction buffers (up to 16 instructions per clock period)

i/o Section
  • 24 i/o channels organized into four 6-channel groups
  • Each channel group contains either 6 input or 6 output channels
  • Each channel group served by memory every 4 clock periods
  • Channel priority within each channel group
  • 16 data bits, 3 control bits per channel, and 4 parity bits
  • Maximum channel rate of one 64-bit word every 100 nanoseconds
  • Maximum data streaming rate of 500,000 64-bit words/second
  • Channel error detection

Main Memory Speed

CRAY-1 memory is organized in 16 banks, 72 modules per bank. Each module contributes 1 bit to a 64-bit word. The other 8 bits are used to store an 8-bit check byte required for single-bit error correction, double-bit error detection (SECDED). Data words are stored in 1-bank increments throughout memory. This organization allows 16-way interleaving of memory accesses and prevents bank conflicts, except in the case of memory accesses that step through memory with either an 8 or 16-word increment.
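The bank-conflict remark can be checked with a few lines of code. Assuming, for illustration only, that the word address modulo 16 selects the bank, a stride-1 stream touches all 16 banks, while strides of 8 and 16 touch only 2 banks and 1 bank respectively, which is why those strides can run up against the 4-clock-period bank cycle time.

    #include <stdio.h>
    #include <string.h>

    #define NBANKS 16

    /* Count how many distinct banks a run of 16 accesses with the given
       word stride touches (bank = word address mod 16, an assumption).  */
    static int banks_touched(int stride)
    {
        int seen[NBANKS], count = 0;
        memset(seen, 0, sizeof seen);
        for (int i = 0; i < 16; i++) {
            int bank = (i * stride) % NBANKS;
            if (!seen[bank]) { seen[bank] = 1; count++; }
        }
        return count;
    }

    int main(void)
    {
        const int strides[] = { 1, 2, 8, 16 };
        for (int s = 0; s < 4; s++)
            printf("stride %2d touches %2d of %d banks\n",
                   strides[s], banks_touched(strides[s]), NBANKS);
        return 0;
    }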
Cooling System

The CRAY-1 generates about four times as much heat per cubic inch as the 7600. To cool the CRAY-1 a new cooling technology was developed, also based on Freon, but employing available metal conductors in a new way. Within each chassis, vertical aluminum/stainless steel cooling bars line each column wall. The Freon refrigerant is passed through a stainless steel tube within the aluminum casing. When modules are in place, heat is dissipated through the inner copper heat transfer plate in the module to the column walls and thence into the cooling bars. The modules are mated with the cold bar by using stainless steel pins to pinch the copper plate against the aluminum outer casing of the bar. To assure component reliability, the cooling system was designed to provide a maximum case temperature of 130° F (54° C). To meet this goal, the following temperature differentials are observed:
  • Temperature at center of module: 130° F (54° C)
  • Temperature at edge of module: 118° F (48° C)
  • Cold plate temperature at wedge: 78° F (25° C)
  • Cold bar temperature: 70° F (21° C)
  • Refrigerant tube temperature: 70° F (21° C)
Functional Units

There are 12 functional units, organized in four groups: address, scalar, vector, and floating point. Each functional unit is pipelined into single clock segments. Functional unit time is shown in Table 2.

Table 2 CRAY-1 Functional Units (times in clock periods)

Address functional units (A registers)
  • address add unit: 2
  • address multiply unit: 6

Scalar functional units (S registers)
  • scalar add unit: 3
  • scalar shift unit: 2, or 3 if double-word shift
  • scalar logical unit: 1
  • population/leading zero count unit: 3

Vector functional units (V registers)
  • vector add unit: 3
  • vector shift unit: 4
  • vector logical unit: 2

Floating-point functional units (S and V registers)
  • floating-point add unit: 6
  • floating-point multiply unit: 7
  • reciprocal approximation unit: 14

Note that all of the functional units can operate concurrently, so that in addition to the benefits of pipelining (each functional unit can be driven at a result rate of 1 per clock period) there is also parallelism across the units. Note the absence of a divide unit in the CRAY-1. In order to have a completely segmented divide operation, the CRAY-1 performs floating-point division by the method of reciprocal approximation. This technique has been used before (e.g., in the IBM System/360 Model 91).
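Division by reciprocal approximation is commonly refined with a Newton-Raphson style iteration and finished with a multiply, which is what makes it easy to segment. The sketch below shows only that general idea; the seed value and iteration count are illustrative assumptions and do not describe the CRAY-1's actual reciprocal approximation unit.

    #include <stdio.h>
    #include <math.h>

    /* Newton-Raphson refinement of r toward 1/b:  r' = r * (2 - b*r).
       Each iteration roughly doubles the number of correct bits.       */
    static double reciprocal(double b)
    {
        double r = 1.0 / (1.0 + fabs(b));   /* crude illustrative seed */
        if (b < 0.0) r = -r;
        for (int i = 0; i < 6; i++)
            r = r * (2.0 - b * r);
        return r;
    }

    int main(void)
    {
        double a = 355.0, b = 113.0;
        double q = a * reciprocal(b);       /* "divide" by multiplying */
        printf("a*recip(b) = %.12f, a/b = %.12f\n", q, a / b);
        return 0;
    }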
Registers

Figure 5 shows the CRAY-1 registers in relationship to the functional units, instruction buffers, i/o channel control registers, and memory. The basic set of programmable registers is as follows:
  • 8 24-bit address (A) registers
  • 64 24-bit address-save (B) registers
  • 8 64-bit scalar (S) registers
  • 64 64-bit scalar-save (T) registers
  • 8 64-word (4096-bit) vector (V) registers
Expressed in 8-bit bytes rather than 64-bit words, that is a total of 4,888 bytes of high-speed (6 ns) register storage.
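The 4,888-byte figure is easy to verify: a 24-bit register is 3 bytes, a 64-bit register is 8 bytes, and a 64-element vector register of 64-bit elements is 512 bytes. A quick check, for illustration:

    #include <stdio.h>

    int main(void)
    {
        int bytes = 8  * 3        /* A registers: 8 x 24 bits            */
                  + 64 * 3        /* B registers: 64 x 24 bits           */
                  + 8  * 8        /* S registers: 8 x 64 bits            */
                  + 64 * 8        /* T registers: 64 x 64 bits           */
                  + 8  * 64 * 8;  /* V registers: 8 x 64 words x 64 bits */

        printf("register storage = %d bytes\n", bytes);   /* prints 4888 */
        return 0;
    }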
The functional units take input operands from and store result operands only to A, S, and V registers. Thus the large amount of register storage is a crucial factor in the CRAY-1's architecture. Chaining could not take place if vector register space were not available for the storage of final or intermediate results. The B and T registers greatly assist scalar performance. Temporary scalar values can be stored from and reloaded to the A and S registers in two clock periods. Figure 5 shows the CRAY-1's register paths in detail. The speed of the CFT Fortran IV compiler would be seriously impaired if it were unable to keep the many Pass 1 and Pass 2 tables it needs in register space. Without the register storage provided by the B, T, and V registers, the CRAY-1's bandwidth of only 80 million words/second would be a serious impediment to performance.
 

                                       XI PIC AN CELL  Microcomputer-based robot control
__________________________________________________________________________________




                    Classical hierarchical control structure of a robot microcomputer controller.


                                                        Hierarchical control system


The miniaturization of electronics, computers and sensors has created new opportunities for remote sensing applications.

