A Karnaugh map is a method for simplifying logic in digital combinational and sequential circuits. When digital concepts are applied in electronics, the circuit as a whole needs to be simplified, especially with the following in mind:
1. Efficient use of components
2. Ease of soldering components
3. Fewer components are used while the output of the circuit stays the same
4. More flexible tool performance
5. Easier troubleshooting of the circuit
6. A more easily damped oscillation system
7. The relationship between output and input can be defined more clearly
8. Allows controlling K W H meter with karnaugh ( Car Now Ugh ) map concept
( Allows Difference Application To Re Steady State )
Einstein synchronisation
Einstein ( E = Intern / Saucer Technic Energy Input )
According to Albert Einstein's prescription from 1905, a light signal is sent at time $\tau_1$ from clock 1 to clock 2 and immediately back, e.g. by means of a mirror. Its arrival time back at clock 1 is $\tau_2$. This synchronisation convention sets clock 2 so that the time $\tau_3$ of signal reflection is defined to be $\tau_3 = \tau_1 + \tfrac{1}{2}(\tau_2 - \tau_1) = \tfrac{1}{2}(\tau_1 + \tau_2)$.
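As a small numerical illustration of this convention, the sketch below computes the time that clock 2 is set to show at the moment of reflection, taken as the midpoint between emission and return at clock 1. The function name `reflection_time` and its parameters are illustrative choices, not part of the original text.

```python
def reflection_time(tau1: float, tau2: float) -> float:
    """Time assigned to the reflection event at clock 2.

    tau1: emission time read on clock 1
    tau2: return time read on clock 1
    The convention places the reflection halfway between the two readings.
    """
    return tau1 + (tau2 - tau1) / 2


if __name__ == "__main__":
    # A signal leaves clock 1 at t = 0.0 and returns at t = 2.0 (arbitrary units).
    print(reflection_time(0.0, 2.0))
```

Only readings of clock 1 enter the calculation, which is why the procedure can be used to set a distant clock in the first place.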
The same synchronisation is achieved by "slowly" transporting a third clock from clock 1 to clock 2, in the limit of vanishing transport velocity. The literature discusses many other thought experiments for clock synchronisation giving the same result.
The problem is whether this synchronisation does really succeed in assigning a time label to any event in a consistent way. To that end one should find conditions under which
(a) clocks once synchronised remain synchronised,
(b1) the synchronisation is reflexive, that is any clock is synchronised with itself (automatically satisfied),
(b2) the synchronisation is symmetric, that is if clock A is synchronised with clock B then clock B is synchronised with clock A,
(b3) the synchronisation is transitive, that is if clock A is synchronised with clock B and clock B is synchronised with clock C then clock A is synchronised with clock C.
If point (a) holds then it makes sense to say that clocks are synchronised. Given (a), if (b1)–(b3) hold then the synchronisation allows us to build a global time function t. The slices t=const. are called "simultaneity slices".
Einstein (1905) did not recognize the possibility of reducing (a) and (b1)–(b3) to easily verifiable physical properties of light propagation (see below). Instead he just wrote "We assume that this definition of synchronism is free from contradictions, and possible for any number of points; and that the following (that is b2–b3) relations are universally valid."
Max von Laue [2] was the first to study the problem of the consistency of Einstein's synchronisation (for an account of the early history see [3]). L. Silberstein [4] presented a similar study, although he left most of his claims as an exercise for the readers of his textbook on relativity. Max von Laue's arguments were taken up again by H. Reichenbach, and found a final shape in a work by A. Macdonald. The solution is that the Einstein synchronisation satisfies the previous requirements if and only if the following two conditions hold:
(i) No redshift: If from point A two flashes are emitted separated by a time interval Δt as recorded by a clock at A, then they reach B separated by the same time interval Δt as recorded by a clock at B.
(ii) Reichenbach's round-trip condition: If a light beam is sent over the triangle ABC, starting from A (and through reflection with mirrors at B and C) then the event of return at A is independent of the direction followed (ABCA or ACBA).
Once clocks are synchronised one can measure the one-way light speed. However, the previous conditions that guarantee the applicability of Einstein's synchronisation do not imply that the one-way light speed turns out to be the same all over the frame. Consider
(iii) Von Laue and Weyl's round-trip condition: The time needed by a light beam to traverse a closed path of length L is L/c, where L is the length of the path and c is a constant independent of the path.
A theorem[7] (whose origin can be traced back to von Laue and Weyl)[8] states that the Laue-Weyl round-trip condition holds if and only if the Einstein synchronisation can be applied consistently (i.e. (a) and (b1)–(b3) hold) and the one-way speed of light with respect to the clocks so synchronised is constant all over the frame. The importance of the Laue-Weyl condition rests on the fact that the time it refers to can be measured with only one clock; thus the condition does not rely on synchronisation conventions and can be checked experimentally. Indeed, it is experimentally verified that the Laue-Weyl round-trip condition holds throughout an inertial frame.
Since it is meaningless to measure a one-way velocity prior to the synchronisation of distant clocks, experiments claiming to measure the one-way speed of light can often be reinterpreted as verifying the Laue-Weyl round-trip condition.
The Einstein synchronisation looks this natural only in inertial frames. One can easily forget that it is only a convention. In rotating frames, even in special relativity, the non-transitivity of Einstein synchronisation diminishes its usefulness. If clock 1 and clock 2 are not synchronised directly, but by using a chain of intermediate clocks, the synchronisation depends on the path chosen. Synchronisation around the circumference of a rotating disk gives a non-vanishing time difference that depends on the direction used. This is important in the Sagnac effect and the Ehrenfest paradox. The Global Positioning System accounts for this effect.
A substantive discussion of Einstein synchronisation's conventionalism is due to Reichenbach. Most attempts to negate the conventionality of this synchronisation are considered refuted, with the notable exception of Malament's argument, that it can be derived from demanding a symmetrical relation of causal connectibility. Whether this settles the issue is disputed.
Poincaré
Some features of the conventionality of synchronization were discussed by Henri Poincaré. In 1898 (in a philosophical paper) he argued that the postulate of light speed constancy in all directions is useful to formulate physical laws in a simple way. He also showed that the definition of simultaneity of events at different places is only a convention.[11] Based on those conventions, but within the framework of the now superseded aether theory, Poincaré in 1900 proposed the following convention for defining clock synchronisation: two observers A and B, which are moving in the aether, synchronise their clocks by means of optical signals. Because of the relativity principle they believe themselves to be at rest in the aether and assume that the speed of light is constant in all directions. Therefore, they have to consider only the transmission time of the signals and then compare their observations to examine whether their clocks are synchronous.
In 1904 Poincaré illustrated the same procedure in the following way:
Relativity of simultaneity
In physics, the relativity of simultaneity is the concept that distant simultaneity – whether two spatially separated events occur at the same time – is not absolute, but depends on the observer's reference frame.
Explanation
According to the special theory of relativity, it is impossible to say in an absolute sense that two distinct events occur at the same time if those events are separated in space. For example, a car crash in London and another in New York, which appear to happen at the same time to an observer on Earth, will appear to have occurred at slightly different times to an observer on an airplane flying between London and New York. The question of whether the events are simultaneous is relative: in the stationary Earth reference frame the two collisions may happen at the same time, but in other frames (in a different state of motion relative to the events) they may occur at different times.
Roundtrip radar-time isocontours.
Einstein's experiment was similar, but included two lightning bolts striking both ends of the train simultaneously, in the stationary observer's inertial frame. In this experiment, the moving observer would conclude that the two lightning events were not simultaneous.
Event B is simultaneous with A in the green reference frame, but it occurred before in the blue frame, and will occur later in the red frame.
XXX . XXX Time to Clock so do Register Karnaugh Map
In digital logic, a hazard in a system is an undesirable effect caused by either a deficiency in the system or external influences. Logic hazards are manifestations of a problem in which changes in the input variables do not change the output correctly due to some form of delay caused by logic elements (NOT, AND, OR gates, etc.). This results in the logic not performing its function properly. The three most common kinds of hazards are usually referred to as static, dynamic and function hazards.
Hazards are a temporary problem, as the logic circuit will eventually settle to the desired function. Therefore, in synchronous designs, it is standard practice to register the output of a circuit before it is used in a different clock domain or routed out of the system, so that hazards do not cause any problems. If that is not the case, however, it is imperative that hazards be eliminated, as they can have an effect on other connected systems.
A static hazard is the situation where, when one input variable changes, the output changes momentarily before stabilizing to the correct value. There are two types of static hazards:
- Static-1 Hazard: the output is currently 1 and, after the inputs change, it momentarily changes to 0 before settling back on 1
- Static-0 Hazard: the output is currently 0 and, after the inputs change, it momentarily changes to 1 before settling back on 0
In properly formed two-level AND-OR logic based on a Sum Of Products expression, there will be no static-0 hazards. Conversely, there will be no static-1 hazards in an OR-AND implementation of a Product Of Sums expression.
The most commonly used method to eliminate static hazards is to add redundant logic (consensus terms in the logic expression).
Example of a static hazard
Let us consider an imperfect circuit that suffers from a delay in the physical logic elements, i.e. AND gates etc. The simple circuit performs the function:
f = X1 * X2 + X1' * X3
If we first look at the starting diagram, it is clear that if no delays were to occur, then the circuit would function normally. However, no two gates are ever manufactured exactly the same. Due to this imperfection, the delay for the first AND gate will be slightly different from that of its counterpart. Thus an error occurs when the input changes from 111 to 011, i.e. when X1 changes state.
Now that we know roughly how the hazard occurs, we can turn to the Karnaugh map for a clearer picture and for a way to solve the problem. The two gates are shown by solid rings, and the hazard can be seen under the dashed ring. A theorem proved by Huffman[1] tells us that adding the redundant loop 'X2X3' will eliminate the hazard.
So our original function is now: f = X1 * X2 + X1' * X3 + X2 * X3
Now we can see that even with imperfect logic elements, our example will not show signs of hazards when X1 changes state. This theory can be applied to any logic system. Computer programs deal with most of this work now, but for simple examples it is quicker to do the debugging by hand. When there are many input variables (say 6 or more) it will become quite difficult to 'see' the errors on a Karnaugh map.
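The glitch and its fix can be reproduced with a small unit-delay simulation. This is a minimal sketch under stated assumptions: every gate (the inverter, the AND gates and the OR gate) is given a delay of exactly one time step, which is a simplification and not a model of any real device, and the function name `simulate` and its parameters are illustrative.

```python
def simulate(x_before, x_after, use_consensus, steps=4):
    """Unit-delay trace of f = X1*X2 + X1'*X3 (+ X2*X3 when use_consensus).

    Returns the value of f at the steady state before the input change,
    followed by its value after each of `steps` gate delays.
    """
    def or_gate(a1, a2, a3):
        # The consensus AND gate only feeds the OR gate when enabled.
        return a1 | a2 | (a3 if use_consensus else 0)

    # Steady state for the old inputs (all gates have settled).
    x1, x2, x3 = x_before
    n = 1 - x1                       # inverter output X1'
    state = {"n": n, "a1": x1 & x2, "a2": n & x3, "a3": x2 & x3}
    state["f"] = or_gate(state["a1"], state["a2"], state["a3"])
    trace = [state["f"]]

    # Apply the new inputs; each gate sees its inputs from the previous step.
    x1, x2, x3 = x_after
    for _ in range(steps):
        prev = state
        state = {
            "n": 1 - x1,
            "a1": x1 & x2,
            "a2": prev["n"] & x3,    # waits one step for the inverter
            "a3": x2 & x3,
            "f": or_gate(prev["a1"], prev["a2"], prev["a3"]),
        }
        trace.append(state["f"])
    return trace


if __name__ == "__main__":
    print(simulate((1, 1, 1), (0, 1, 1), use_consensus=False))  # [1, 1, 0, 1, 1]
    print(simulate((1, 1, 1), (0, 1, 1), use_consensus=True))   # [1, 1, 1, 1, 1]
```

Without the redundant term the output dips to 0 for one step, because the X1*X2 gate turns off before the X1'*X3 gate has seen the inverted input. With X2*X3 present, that gate holds the output at 1 throughout the transition.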
Dynamic hazards
A dynamic hazard is the possibility of an output changing more than once as a result of a single input change. Dynamic hazards often occur in larger logic circuits where there are different routes to the output (from the input). If each route has a different delay, then it quickly becomes clear that there is the potential for changing output values that differ from the required / expected output. For example, a logic circuit is meant to change output state from 1 to 0, but instead changes from 1 to 0, then to 1, and finally rests at the correct value 0. This is a dynamic hazard.
As a rule, dynamic hazards are more complex to resolve, but note that if all static hazards have been eliminated from a circuit, then dynamic hazards cannot occur.
computer architecture
In the domain of central processing unit (CPU) design, hazards are problems with the instruction pipeline in CPU microarchitectures when the next instruction cannot execute in the following clock cycle,[1] and can potentially lead to incorrect computation results. Three common types of hazards are data hazards, structural hazards, and control flow hazards (branching hazards).
There are several methods used to deal with hazards, including pipeline stalls/pipeline bubbling, operand forwarding, and in the case of out-of-order execution, the scoreboarding method and the Tomasulo algorithm.
Instructions in a pipelined processor are performed in several stages, so that at any given time several instructions are being processed in the various stages of the pipeline, such as fetch and execute. There are many different instruction pipeline microarchitectures, and instructions may be executed out-of-order. A hazard occurs when two or more of these simultaneous (possibly out of order) instructions conflict.
Types
Data hazards
Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Ignoring potential data hazards can result in race conditions (also termed race hazards). There are three situations in which a data hazard can occur:
- read after write (RAW), a true dependency
- write after read (WAR), an anti-dependency
- write after write (WAW), an output dependency
Consider two instructions i1 and i2, with i1 occurring before i2 in program order.
Read after write (RAW)
(i2 tries to read a source before i1 writes to it) A read after write (RAW) data hazard refers to a situation where an instruction refers to a result that has not yet been calculated or retrieved. This can occur because even though an instruction is executed after a prior instruction, the prior instruction has been processed only partly through the pipeline.
For example:
i1. R2 <- R1 + R3
i2. R4 <- R2 + R3
The first instruction is calculating a value to be saved in register R2, and the second is going to use this value to compute a result for register R4. However, in a pipeline, when operands are fetched for the 2nd operation, the results from the first will not yet have been saved, and hence a data dependency occurs.
A data dependency occurs with instruction i2, as it is dependent on the completion of instruction i1.
Write after read (WAR)
(i2 tries to write a destination before it is read by i1) A write after read (WAR) data hazard represents a problem with concurrent execution.
For example:
i1. R4 <- R1 + R5
i2. R5 <- R1 + R2
In any situation with a chance that i2 may finish before i1 (i.e., with concurrent execution), it must be ensured that the result of register R5 is not stored before i1 has had a chance to fetch the operands.
Write after write (WAW)
(i2 tries to write an operand before it is written by i1) A write after write (WAW) data hazard may occur in a concurrent execution environment.
For example:
i1. R2 <- R4 + R7
i2. R2 <- R1 + R3
The write back (WB) of i2 must be delayed until i1 finishes executing.
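The three cases above can be checked mechanically by comparing the destination and source registers of the two instructions. The sketch below is only an illustration: the `Instr` record and the `data_hazards` function are hypothetical names introduced here, and an instruction is reduced to one destination register and a tuple of source registers.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str      # register written by the instruction
    srcs: tuple    # registers read by the instruction


def data_hazards(i1: Instr, i2: Instr) -> set:
    """Return the data hazards between i1 and a later instruction i2."""
    hazards = set()
    if i1.dest in i2.srcs:
        hazards.add("RAW")   # i2 reads what i1 writes
    if i2.dest in i1.srcs:
        hazards.add("WAR")   # i2 overwrites what i1 still has to read
    if i1.dest == i2.dest:
        hazards.add("WAW")   # both write the same register
    return hazards


if __name__ == "__main__":
    # The three example pairs from the text.
    print(data_hazards(Instr("R2", ("R1", "R3")), Instr("R4", ("R2", "R3"))))
    print(data_hazards(Instr("R4", ("R1", "R5")), Instr("R5", ("R1", "R2"))))
    print(data_hazards(Instr("R2", ("R4", "R7")), Instr("R2", ("R1", "R3"))))
```

Applied to the three example pairs, the function reports RAW, WAR and WAW respectively. A pair of instructions can also exhibit more than one hazard at once, which is why a set is returned.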
Structural hazards
A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A canonical example is a single memory unit that is accessed both in the fetch stage where an instruction is retrieved from memory, and the memory stage where data is written and/or read from memory.[3] They can often be resolved by separating the component into orthogonal units (such as separate caches) or bubbling the pipeline.
Control hazards (branch hazards)
Branching hazards (also termed control hazards) occur with branches. On many instruction pipeline microarchitectures, the processor will not know the outcome of the branch when it needs to insert a new instruction into the pipeline (normally the fetch stage).
Eliminating hazards
Generic
Pipeline bubbling
Bubbling the pipeline, also termed a pipeline break or pipeline stall, is a method to preclude data, structural, and branch hazards. As instructions are fetched, control logic determines whether a hazard could/will occur. If this is true, then the control logic inserts no operations (NOPs) into the pipeline. Thus, before the next instruction (which would cause the hazard) executes, the prior one will have had sufficient time to finish and prevent the hazard. If the number of NOPs equals the number of stages in the pipeline, the processor has been cleared of all instructions and can proceed free from hazards. All forms of stalling introduce a delay before the processor can resume execution.
Flushing the pipeline occurs when a branch instruction jumps to a new memory location, invalidating all prior stages in the pipeline. These prior stages are cleared, allowing the pipeline to continue at the new instruction indicated by the branch.
Data hazards
There are several main solutions and algorithms used to resolve data hazards:
- insert a pipeline bubble whenever a read after write (RAW) dependency is encountered, guaranteed to increase latency, or
- use out-of-order execution to potentially prevent the need for pipeline bubbles
- use operand forwarding to use data from later stages in the pipeline
In the case of out-of-order execution, the algorithm used can be:
- scoreboarding, in which case a pipeline bubble is needed only when there is no functional unit available
- the Tomasulo algorithm, which uses register renaming, allowing continual issuing of instructions
The task of removing data dependencies can be delegated to the compiler, which can fill in an appropriate number of NOP instructions between dependent instructions to ensure correct operation, or re-order instructions where possible.
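A compiler pass that pads dependent instructions with NOPs might look roughly like the following sketch. It assumes a hypothetical pipeline in which a result can be read only `gap` instruction slots after the instruction that produces it; `gap`, `insert_nops` and the `(dest, srcs)` tuple encoding are all illustrative assumptions, and only RAW dependencies are handled.

```python
def insert_nops(program, gap=2):
    """Pad a program with NOPs so that RAW dependencies are respected.

    program: list of (dest, srcs) tuples in program order.
    gap: assumed number of slots that must separate a producer from a
         consumer of its result (a made-up pipeline parameter).
    """
    out = []
    last_write = {}                       # register -> slot of its latest writer
    for dest, srcs in program:
        earliest = len(out)               # slot the instruction would get unpadded
        for s in srcs:
            if s in last_write:
                earliest = max(earliest, last_write[s] + gap + 1)
        out.extend(["NOP"] * (earliest - len(out)))
        last_write[dest] = len(out)
        out.append((dest, srcs))
    return out


if __name__ == "__main__":
    dependent = [("R2", ("R1", "R3")), ("R4", ("R2", "R3"))]
    for slot in insert_nops(dependent):
        print(slot)
```

For the dependent pair shown, two NOPs are placed between the instructions, while independent instructions are emitted back to back. A real compiler would first try to move useful independent instructions into those slots, as the text notes.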
Operand forwarding
Examples
For example, to write the value 3 to register 1, (which already contains a 6), and then add 7 to register 1 and store the result in register 2, i.e.:
- Instruction 0: Register 1 = 6
- Instruction 1: Register 1 = 3
- Instruction 2: Register 2 = Register 1 + 7 = 10
Following execution, register 2 should contain the value 10. However, if Instruction 1 (write 3 to register 1) does not fully exit the pipeline before Instruction 2 starts executing, it means that Register 1 does not contain the value 3 when Instruction 2 performs its addition. In such an event, Instruction 2 adds 7 to the old value of register 1 (6), and so register 2 contains 13 instead, i.e.:
- Instruction 0: Register 1 = 6
- Instruction 2: Register 2 = Register 1 + 7 = 13
- Instruction 1: Register 1 = 3
This error occurs because Instruction 2 reads Register 1 before Instruction 1 has committed/stored the result of its write operation to Register 1. So when Instruction 2 is reading the contents of Register 1, register 1 still contains 6, not 3.
Forwarding (described below) helps correct such errors by depending on the fact that the output of Instruction 1 (which is 3) can be used by subsequent instructions before the value 3 is committed to/stored in Register 1.
Forwarding applied to the example means that there is no wait to commit/store the output of Instruction 1 in Register 1 (in this example, the output is 3) before making that output available to the subsequent instruction (in this case, Instruction 2). The effect is that Instruction 2 uses the correct (the more recent) value of Register 1: the commit/store was made immediately and not pipelined.
With forwarding enabled, the Instruction Decode/Execution (ID/EX) stage of the pipeline now has two inputs: the value read from the register specified (in this example, the value 6 from Register 1), and the new value of Register 1 (in this example, this value is 3) which is sent from the next stage Instruction Execute/Memory Access (EX/MEM). Added control logic is used to determine which input to use.
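The added control logic amounts to a two-input selection, which can be sketched as follows. The function name `select_operand` and the way the EX/MEM stage is represented (a destination register number plus a value, or `None` when nothing is in flight) are assumptions made for illustration only.

```python
def select_operand(src_reg, regfile, ex_mem_dest, ex_mem_value):
    """Choose the operand value for an instruction entering execution.

    src_reg:      register number the instruction wants to read
    regfile:      dict of committed register values
    ex_mem_dest:  register the instruction in EX/MEM will write, or None
    ex_mem_value: result computed by that instruction, not yet committed
    """
    if ex_mem_dest is not None and ex_mem_dest == src_reg:
        return ex_mem_value          # forward the newer, uncommitted value
    return regfile[src_reg]          # otherwise use the register file


if __name__ == "__main__":
    regfile = {1: 6, 2: 0}           # Register 1 still holds the old value 6
    # Instruction 1 (Register 1 = 3) sits in EX/MEM while Instruction 2 reads.
    with_forwarding = select_operand(1, regfile, ex_mem_dest=1, ex_mem_value=3) + 7
    without_forwarding = select_operand(1, regfile, None, None) + 7
    print(with_forwarding, without_forwarding)
```

With forwarding, Instruction 2 computes 10; without it, the stale value 6 is read and the result is 13, matching the two outcomes described above.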
Control hazards (branch hazards)
To avoid control hazards, microarchitectures can:
- insert a pipeline bubble (discussed above), guaranteed to increase latency, or
- use branch prediction and essentially make educated guesses about which instructions to insert, in which case a pipeline bubble will only be needed in the case of an incorrect prediction
In the event that a branch causes a pipeline bubble after incorrect instructions have entered the pipeline, care must be taken to prevent any of the wrongly-loaded instructions from having any effect on the processor state excluding energy wasted processing them before they were discovered to be loaded incorrectly.
Other techniques
Memory latency is another factor that designers must attend to, because the delay could reduce performance. Different types of memory have different access times. Thus, by choosing a suitable type of memory, designers can improve the performance of the pipelined data path.
In computer architecture, register renaming is a technique that eliminates the false data dependencies arising from the reuse of architectural registers by successive instructions that do not have any real data dependencies between them. The elimination of these false data dependencies reveals more instruction-level parallelism in an instruction stream, which can be exploited by various and complementary techniques such as superscalar and out-of-order execution for better performance.
In a register machine, programs are composed of instructions which operate on values. The instructions must name these values in order to distinguish them from one another. A typical instruction might say, add X and Y and put the result in Z. In this instruction, X, Y, and Z are the names of storage locations.
In order to have a compact instruction encoding, most processor instruction sets have a small set of special locations which can be directly named. For example, the x86 instruction set architecture has 8 integer registers, x86-64 has 16, many RISCs have 32, and IA-64 has 128. In smaller processors, the names of these locations correspond directly to elements of a register file.
Different instructions may take different amounts of time; for example, a processor may be able to execute hundreds of instructions while a single load from the main memory is in progress. Shorter instructions executed while the load is outstanding will finish first, thus the instructions are finishing out of the original program order. Out-of-order execution has been used in most recent high-performance CPUs to achieve some of their speed gains.
Consider this piece of code running on an out-of-order CPU:
# | Instruction |
---|---|
1 | R1 = M[1024] |
2 | R1 = R1 + 2 |
3 | M[1032] = R1 |
4 | R1 = M[2048] |
5 | R1 = R1 + 4 |
6 | M[2056] = R1 |
Instructions 4, 5, and 6 are independent of instructions 1, 2, and 3, but the processor cannot finish 4 until 3 is done, otherwise instruction 3 would write the wrong value. This restriction can be eliminated by changing the names of some of the registers:
# | Instruction | # | Instruction |
---|---|---|---|
1 | R1 = M[1024] | 4 | R2 = M[2048] |
2 | R1 = R1 + 2 | 5 | R2 = R2 + 4 |
3 | M[1032] = R1 | 6 | M[2056] = R2 |
Now instructions 4, 5, and 6 can be executed in parallel with instructions 1, 2, and 3, so that the program can be executed faster.
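The renaming in the table was done by hand; a mechanical version can be sketched by giving every register write a fresh name and letting reads use the most recent name. This is an illustrative simplification, not a description of any particular CPU: the `rename` function, the `P0, P1, ...` names and the `(dest, srcs)` encoding (with `None` as the destination of a store) are assumptions, and the supply of fresh names is treated as unlimited.

```python
from itertools import count


def rename(program):
    """Give each register write a fresh name; reads follow the latest mapping."""
    fresh = (f"P{i}" for i in count())   # unlimited supply of new names (assumed)
    mapping = {}                         # architectural name -> current new name
    renamed = []
    for dest, srcs in program:
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            mapping[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed


if __name__ == "__main__":
    program = [
        ("R1", ()),        # 1: R1 = M[1024]
        ("R1", ("R1",)),   # 2: R1 = R1 + 2
        (None, ("R1",)),   # 3: M[1032] = R1
        ("R1", ()),        # 4: R1 = M[2048]
        ("R1", ("R1",)),   # 5: R1 = R1 + 4
        (None, ("R1",)),   # 6: M[2056] = R1
    ]
    for line in rename(program):
        print(line)
```

After renaming, instructions 1 to 3 touch only P0 and P1 while instructions 4 to 6 touch only P2 and P3, so the two groups no longer share a register. The true dependencies inside each group are preserved.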
When possible, the compiler would detect the distinct instructions and try to assign them to a different register. However, there is a finite number of register names that can be used in the assembly code. Many high performance CPUs have more physical registers than may be named directly in the instruction set, so they rename registers in hardware to achieve additional parallelism.
When more than one instruction references a particular location for an operand, either by reading it (as an input) or by writing to it (as an output), executing those instructions in an order different from the original program order can lead to three kinds of data hazards:
- Read-after-write (RAW) – a read from a register or memory location must return the value placed there by the last write in program order, not some other write. This is referred to as a true dependency or flow dependency, and requires the instructions to execute in program order.
- Write-after-write (WAW) – successive writes to a particular register or memory location must leave that location containing the result of the second write. This can be resolved by squashing (synonyms: cancelling, annulling, mooting) the first write if necessary. WAW dependencies are also known as output dependencies.
- Write-after-read (WAR) – a read from a register or memory location must return the last prior value written to that location, and not one written programmatically after the read. This is a sort of false dependency that can be resolved by renaming. WAR dependencies are also known as anti-dependencies.
Instead of delaying the write until all reads are completed, two copies of the location can be maintained, the old value and the new value. Reads that precede, in program order, the write of the new value can be provided with the old value, even while other reads that follow the write are provided with the new value. The false dependency is broken and additional opportunities for out-of-order execution are created. When all reads that need the old value have been satisfied, it can be discarded. This is the essential concept behind register renaming.
Anything that is read and written can be renamed. While the general-purpose and floating-point registers are discussed the most, flag and status registers or even individual status bits are commonly renamed as well.
Memory locations can also be renamed, although it is not commonly done to the extent practiced in register renaming. The Transmeta Crusoe processor's gated store buffer is a form of memory renaming.
If programs refrained from reusing registers immediately, there would be no need for register renaming. Some instruction sets (e.g., IA-64) specify very large numbers of registers for specifically this reason. However, there are limitations to this approach:
- It is very difficult for the compiler to avoid reusing registers without large code size increases. In loops, for instance, successive iterations would have to use different registers, which requires replicating the code in a process called loop unrolling.
- Large numbers of registers require more bits for specifying a register as an operand in an instruction, resulting in increased code size.
- Many instruction sets historically specified smaller numbers of registers and cannot be changed now.
Code size increases are important because when the program code is larger, the instruction cache misses more often and the processor stalls waiting for new instructions.
Machine language programs specify reads and writes to a limited set of registers specified by the instruction set architecture (ISA). For instance, the Alpha ISA specifies 32 integer registers, each 64 bits wide, and 32 floating-point registers, each 64 bits wide. These are the architectural registers. Programs written for processors running the Alpha instruction set will specify operations reading and writing those 64 registers. If a programmer stops the program in a debugger, they can observe the contents of these 64 registers (and a few status registers) to determine the progress of the machine.
One particular processor which implements this ISA, the Alpha 21264, has 80 integer and 72 floating-point physical registers. There are, on an Alpha 21264 chip, 80 physically separate locations which can store the results of integer operations, and 72 locations which can store the results of floating point operations. (In fact, there are even more locations than that, but those extra locations are not germane to the register renaming operation.)
The following text describes two styles of register renaming, which are distinguished by the circuit which holds the data ready for an execution unit.
In all renaming schemes, the machine converts the architectural registers referenced in the instruction stream into tags. Where the architectural registers might be specified by 3 to 5 bits, the tags are usually a 6 to 8 bit number. The rename file must have a read port for every input of every instruction renamed every cycle, and a write port for every output of every instruction renamed every cycle. Because the size of a register file generally grows as the square of the number of ports, the rename file is usually physically large and consumes significant power.
In the tag-indexed register file style, there is one large register file for data values, containing one register for every tag. For example, if the machine has 80 physical registers, then it would use 7 bit tags. 48 of the possible tag values in this case are unused.
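The tag-width arithmetic can be checked with a couple of lines. The helper name `tag_bits` is illustrative; it returns the smallest number of bits able to distinguish the given number of physical registers, together with the count of tag values left unused.

```python
def tag_bits(num_physical: int):
    """Return (bits needed for a tag, number of unused tag values)."""
    bits = (num_physical - 1).bit_length()   # smallest b with 2**b >= num_physical
    return bits, 2 ** bits - num_physical


if __name__ == "__main__":
    print(tag_bits(80))   # 80 physical registers, as in the example above
```

For 80 registers this gives 7 bits and 128 - 80 = 48 unused tag values.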
In this style, when an instruction is issued to an execution unit, the tags of the source registers are sent to the physical register file, where the values corresponding to those tags are read and sent to the execution unit.
In the reservation station style, there are many small associative register files, usually one at the inputs to each execution unit. Each operand of each instruction in an issue queue has a place for a value in one of these register files.
In this style, when an instruction is issued to an execution unit, the register file entries corresponding to the issue queue entry are read and forwarded to the execution unit.
- Architectural Register File or Retirement Register File (RRF)
- The committed register state of the machine. RAM indexed by logical register number. Typically written into as results are retired or committed out of a reorder buffer.
- Future File
- The most speculative register state of the machine. RAM indexed by logical register number.
- Active Register File
- The Intel P6 group's term for Future File.
- History Buffer
- Typically used in combination with a future file. Contains the "old" values of registers that have been overwritten. If the producer is still in flight it may be RAM indexed by history buffer number. After a branch misprediction must use results from the history buffer—either they are copied, or the future file lookup is disabled and the history buffer is CAM indexed by logical register number.
- Reorder Buffer (ROB)
- A structure that is sequentially (circularly) indexed on a per-operation basis, for instructions in flight. It differs from a history buffer because the reorder buffer typically comes after the future file (if it exists) and before the architectural register file.
Reorder buffers can be data-less or data-ful.
In Willamette's ROB, the ROB entries point to registers in the physical register file (PRF), and also contain other bookkeeping. This was also the first out-of-order design done by Andy Glew, at Illinois with HaRRM.
In P6's ROB, the ROB entries contain data; there is no separate PRF. Data values are copied from the ROB to the RRF at retirement.
One small detail: if there is temporal locality in ROB entries (i.e., if instructions close together in the von Neumann instruction sequence write back close together in time), it may be possible to perform write combining on ROB entries and so have fewer ports than a separate ROB/PRF would. It is not clear if it makes a difference, since a PRF should be banked.
ROBs usually don't have associative logic, and certainly none of the ROBs designed by Andy Glew have CAMs. Keith Diefendorff insisted that ROBs have complex associative logic for many years. The first ROB proposal may have had CAMs.
The tag-indexed register file is the renaming style used in the MIPS R10000, the Alpha 21264, and in the FP section of the AMD Athlon.
In the renaming stage, every architectural register referenced (for read or write) is looked up in an architecturally-indexed remap file. This file returns a tag and a ready bit. The tag is non-ready if there is a queued instruction which will write to it that has not yet executed. For read operands, this tag takes the place of the architectural register in the instruction. For every register write, a new tag is pulled from a free tag FIFO, and a new mapping is written into the remap file, so that future instructions reading the architectural register will refer to this new tag. The tag is marked as unready, because the instruction has not yet executed. The previous physical register allocated for that architectural register is saved with the instruction in the reorder buffer, which is a FIFO that holds the instructions in program order between the decode and graduation stages.
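The renaming step described above can be sketched as a small software model. This is a minimal illustration, not the logic of any real processor: the class and field names (`Renamer`, `RenamedInstr`, `remap`, `free`, `rob`) are hypothetical, and the initial mapping of architectural register i to tag i is an assumption made for simplicity.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RenamedInstr:
    dest_tag: Optional[int]            # new tag for the written register, if any
    prev_tag: Optional[int]            # tag previously mapped to that register
    src_tags: List[Tuple[int, bool]]   # (tag, ready bit) for each read operand


class Renamer:
    def __init__(self, num_arch: int, num_phys: int):
        # Assumption: architectural register i starts out mapped to tag i.
        self.remap = list(range(num_arch))
        self.ready = [True] * num_phys
        self.free = deque(range(num_arch, num_phys))  # free tag FIFO
        self.rob = deque()                            # program-order FIFO

    def rename(self, dest: Optional[int], srcs: List[int]) -> RenamedInstr:
        # Read operands: look up the current tag and its ready bit.
        src_tags = [(self.remap[s], self.ready[self.remap[s]]) for s in srcs]
        dest_tag = prev_tag = None
        if dest is not None:
            if not self.free:
                raise RuntimeError("rename stalls: no free tags")
            dest_tag = self.free.popleft()
            prev_tag = self.remap[dest]   # saved with the instruction in the ROB
            self.remap[dest] = dest_tag   # later readers see the new tag
            self.ready[dest_tag] = False  # producer has not executed yet
        instr = RenamedInstr(dest_tag, prev_tag, src_tags)
        self.rob.append(instr)
        return instr
```

Reading the sources before updating the mapping matters: an instruction such as `r1 = r1 + r2` must read the old tag for `r1`, not the one it is about to allocate.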
The instructions are then placed in various issue queues. As instructions are executed, the tags for their results are broadcast, and the issue queues match these tags against the tags of their non-ready source operands. A match means that the operand is ready. The remap file also matches these tags, so that it can mark the corresponding physical registers as ready. When all the operands of an instruction in an issue queue are ready, that instruction is ready to issue. The issue queues pick ready instructions to send to the various functional units each cycle. Non-ready instructions stay in the issue queues. This unordered removal of instructions from the issue queues can make them large and power-consuming.
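The wakeup-and-select behaviour of an issue queue can be sketched as follows. The `IssueQueue` class, its method names, and the `width` parameter are assumptions for illustration; real hardware performs the tag match with parallel comparators rather than a loop.

```python
class IssueQueue:
    def __init__(self):
        # Entries are kept in insertion (program) order.
        self.entries = []

    def insert(self, name, non_ready_src_tags):
        self.entries.append({"name": name, "waiting": set(non_ready_src_tags)})

    def broadcast(self, tag):
        # Wakeup: a matching tag means that source operand is now ready.
        for entry in self.entries:
            entry["waiting"].discard(tag)

    def select(self, width):
        # Pick up to `width` ready instructions; the rest stay queued.
        picked = [e for e in self.entries if not e["waiting"]][:width]
        picked_ids = {id(e) for e in picked}
        self.entries = [e for e in self.entries if id(e) not in picked_ids]
        return [e["name"] for e in picked]


q = IssueQueue()
q.insert("add", [4])      # waits on tag 4
q.insert("mul", [])       # all operands ready at insertion
q.insert("sub", [4, 5])   # waits on tags 4 and 5
first = q.select(width=2)
q.broadcast(4)
second = q.select(width=2)
```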
Issued instructions read from a tag-indexed physical register file (bypassing just-broadcast operands) and then execute. Execution results are written to the tag-indexed physical register file, as well as broadcast to the bypass network preceding each functional unit. Graduation puts the previous tag for the written architectural register into the free queue so that it can be reused for a newly decoded instruction.
An exception or branch misprediction causes the remap file to back up to the remap state at the last valid instruction via a combination of state snapshots and cycling through the previous tags in the in-order pre-graduation queue. Since this mechanism is required, and since it can recover any remap state (not just the state before the instruction currently being graduated), branch mispredictions can be handled before the branch reaches graduation, potentially hiding the branch misprediction latency.
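Graduation and recovery can be sketched together. This model shows only the walk back through previous tags; state snapshots are omitted. The names `RenameState`, `rename_write`, `graduate`, and `squash_younger_than` are hypothetical.

```python
from collections import deque


class RenameState:
    def __init__(self, num_arch, num_phys):
        self.remap = list(range(num_arch))            # assumed initial mapping
        self.free = deque(range(num_arch, num_phys))  # free tag queue
        self.rob = deque()  # (arch_reg, new_tag, prev_tag), oldest first

    def rename_write(self, arch_reg):
        new_tag = self.free.popleft()
        self.rob.append((arch_reg, new_tag, self.remap[arch_reg]))
        self.remap[arch_reg] = new_tag
        return new_tag

    def graduate(self):
        # The oldest instruction commits; the tag it replaced can be reused.
        _arch_reg, _new_tag, prev_tag = self.rob.popleft()
        self.free.append(prev_tag)

    def squash_younger_than(self, keep):
        # Undo renames youngest-first until `keep` instructions remain.
        while len(self.rob) > keep:
            arch_reg, new_tag, prev_tag = self.rob.pop()
            self.remap[arch_reg] = prev_tag   # restore the older mapping
            self.free.appendleft(new_tag)     # squashed tag is free again


s = RenameState(num_arch=4, num_phys=8)
s.rename_write(1)   # r1 -> tag 4
s.rename_write(1)   # r1 -> tag 5
s.rename_write(2)   # r2 -> tag 6
s.squash_younger_than(keep=1)  # e.g. a misprediction after the first write
```

Because the walk can stop at any instruction in the queue, the remap state for a mispredicted branch can be restored without waiting for that branch to graduate.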
The reservation station style is used in the integer section of the AMD K7 and K8 designs.
In the renaming stage, every architectural register referenced for reads is looked up in both the architecturally-indexed future file and the rename file. The future file read gives the value of that register, if there is no outstanding instruction yet to write to it (i.e., it's ready). When the instruction is placed in an issue queue, the values read from the future file are written into the corresponding entries in the reservation stations. Register writes in the instruction cause a new, non-ready tag to be written into the rename file. The tag number is usually serially allocated in instruction order—no free tag FIFO is necessary.
Just as with the tag-indexed scheme, the issue queues wait for non-ready operands to see matching tag broadcasts. Unlike the tag-indexed scheme, matching tags cause the corresponding broadcast value to be written into the issue queue entry's reservation station.
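The difference from the tag-indexed scheme is that a matching broadcast delivers the value itself. A minimal sketch, assuming a hypothetical `ReservationStation` class whose operands are either `("value", v)` or `("tag", t)`:

```python
class ReservationStation:
    def __init__(self, operands):
        # Each operand is ("value", v) if already known, or ("tag", t) if pending.
        self.operands = list(operands)

    def capture(self, tag, value):
        # A matching tag broadcast writes the value into the entry.
        self.operands = [
            ("value", value) if op == ("tag", tag) else op
            for op in self.operands
        ]

    def ready(self):
        return all(kind == "value" for kind, _ in self.operands)

    def values(self):
        return [payload for _, payload in self.operands]


rs = ReservationStation([("value", 3), ("tag", 7)])
rs.capture(5, 99)   # non-matching tag: ignored
rs.capture(7, 4)    # matching tag: value is stored in the entry
```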
Issued instructions read their arguments from the reservation station, bypass just-broadcast operands, and then execute. As mentioned earlier, the reservation station register files are usually small, with perhaps eight entries.
Execution results are written to the reorder buffer, to the reservation stations (if the issue queue entry has a matching tag), and to the future file if this is the last instruction to target that architectural register (in which case the register is marked ready).
Graduation copies the value from the reorder buffer into the architectural register file. The sole use of the architectural register file is to recover from exceptions and branch mispredictions.
Exceptions and branch mispredictions, recognised at graduation, cause the architectural file to be copied to the future file and all registers to be marked as ready in the rename file. There is usually no way to reconstruct the state of the future file for some instruction intermediate between decode and graduation, so there is usually no way to do early recovery from branch mispredictions.
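The interplay of the future file, rename file, reorder buffer, and architectural file can be sketched as below. This is a simplified model under stated assumptions: the class `FutureFileMachine` and its fields are hypothetical, reservation stations are left out, and tags are allocated serially as the text describes.

```python
class FutureFileMachine:
    def __init__(self, num_arch):
        self.future = [0] * num_arch       # most speculative values
        self.pending = [None] * num_arch   # rename file: tag of last writer, None if ready
        self.arch = [0] * num_arch         # committed state
        self.rob = {}                      # tag -> [arch_reg, value]
        self.next_tag = 0                  # serially allocated, no free FIFO
        self.head = 0                      # oldest tag not yet graduated

    def rename(self, dest, srcs):
        operands = []
        for s in srcs:
            if self.pending[s] is None:
                operands.append(("value", self.future[s]))  # ready: read value
            else:
                operands.append(("tag", self.pending[s]))   # wait for broadcast
        tag = self.next_tag
        self.next_tag += 1
        self.pending[dest] = tag
        self.rob[tag] = [dest, None]
        return tag, operands

    def writeback(self, tag, value):
        dest = self.rob[tag][0]
        self.rob[tag][1] = value
        if self.pending[dest] == tag:      # last instruction targeting dest
            self.future[dest] = value
            self.pending[dest] = None

    def graduate(self):
        dest, value = self.rob.pop(self.head)
        assert value is not None, "oldest instruction has not executed"
        self.arch[dest] = value
        self.head += 1

    def recover(self):
        # Exception or misprediction recognised at graduation.
        self.future = list(self.arch)
        self.pending = [None] * len(self.arch)
        self.rob.clear()
        self.head = self.next_tag
```

Note that `recover` can only restore the committed state, which mirrors the point above that early recovery is usually not possible in this scheme.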
In both schemes, instructions are inserted in-order into the issue queues, but are removed out-of-order. If the queues do not collapse empty slots, then they will either have many unused entries, or require some sort of variable priority encoding for when multiple instructions are simultaneously ready to go. Queues that collapse holes have simpler priority encoding, but require simple but large circuitry to advance instructions through the queue.
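The trade-off can be illustrated with a non-collapsing queue in which a freed slot is refilled by a younger instruction. The helpers `insert` and `issue` and the `seq` field (program-order sequence number) are assumptions for this sketch.

```python
def insert(slots, seq, ready):
    """Place an instruction in the first free slot (no collapsing)."""
    i = slots.index(None)  # raises ValueError if the queue is full
    slots[i] = {"seq": seq, "ready": ready}


def issue(slots, oldest_first):
    """Remove and return the sequence number of one ready instruction."""
    ready = [i for i, e in enumerate(slots) if e is not None and e["ready"]]
    if not ready:
        return None
    if oldest_first:
        pick = min(ready, key=lambda i: slots[i]["seq"])  # age-aware priority
    else:
        pick = ready[0]                                   # fixed slot priority
    seq = slots[pick]["seq"]
    slots[pick] = None
    return seq


slots = [None] * 3
insert(slots, 0, ready=False)
insert(slots, 1, ready=True)
insert(slots, 2, ready=True)
issue(slots, oldest_first=False)   # issues seq 1 and frees slot 1
insert(slots, 3, ready=True)       # the younger instruction takes slot 1
```

At this point a fixed slot priority would issue instruction 3 ahead of the older instruction 2, while an age-aware selection would pick 2. A collapsing queue avoids the problem by keeping slot order equal to age order, at the cost of the shifting circuitry.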
Reservation stations have better latency from rename to execute, because the rename stage finds the register values directly, rather than finding the physical register number, and then using that to find the value. This latency shows up as a component of the branch misprediction latency.
Reservation stations also have better latency from instruction issue to execution, because each local register file is smaller than the large central file of the tag-indexed scheme. Tag generation and exception processing are also simpler in the reservation station scheme, as discussed below.
The physical register files used by reservation stations usually collapse unused entries in parallel with the issue queue they serve, which makes these register files larger in aggregate, more power-hungry, and more complicated than the simpler register files used in a tag-indexed scheme. Worse yet, every entry in each reservation station can be written by every result bus, so that a reservation-station machine with, e.g., 8 issue queue entries per functional unit will typically have 9 times as many bypass networks as an equivalent tag-indexed machine. Consequently, result forwarding consumes much more power and area than in a tag-indexed design.
Furthermore, the reservation station scheme has four places (Future File, Reservation Station, Reorder Buffer and Architectural File) where a result value can be stored, whereas the tag-indexed scheme has just one (the physical register file). Because the results from the functional units, broadcast to all these storage locations, must reach a much larger number of locations in the machine than in the tag-indexed scheme, this function consumes more power, area, and time. Still, in machines that have very accurate branch prediction and where execute latencies are a major concern, reservation stations can work remarkably well.
The IBM System/360 Model 91 was an early machine that supported out-of-order execution of instructions; it used the Tomasulo algorithm, which uses register renaming.
The POWER1, introduced in 1990, was the first microprocessor to use register renaming and out-of-order execution.
The original R10000 design had neither collapsing issue queues nor variable priority encoding, and suffered starvation problems as a result—the oldest instruction in the queue would sometimes not be issued until both instruction decode stopped completely for lack of rename registers, and every other instruction had been issued. Later revisions of the design starting with the R12000 used a partially variable priority encoder to mitigate this problem.
Early out-of-order machines did not separate the renaming and ROB/PRF storage functions. For that matter, some of the earliest, such as Sohi's RUU or the Metaflow DCAF, combined scheduling, renaming, and storage all in the same structure.
Most modern machines do renaming by RAM indexing a map table with the logical register number. E.g., P6 did this; future files do this, and have data storage in the same structure.
However, earlier machines used content-addressable memory (a type of hardware which provides the functionality of an associative array) in the renamer. E.g., the HPSM RAT, or Register Alias Table, essentially used a CAM on the logical register number in combination with different versions of the register.
In many ways, the story of out-of-order microarchitecture has been how these CAMs have been progressively eliminated. Small CAMs are useful; large CAMs are impractical.
The P6 microarchitecture was the first microarchitecture by Intel to implement both out-of-order execution and register renaming. The P6 microarchitecture was used in Pentium Pro, Pentium II, Pentium III, Pentium M, Core, and Core 2 microprocessors. The Cyrix M1, released on October 2, 1995,[1] was the first x86 processor to use register renaming and out-of-order execution. Other x86 processors (such as NexGen Nx686 and AMD K5) released in 1996 also featured register renaming and out-of-order execution of RISC μ-operations (rather than native x86 instructions).