CO-SIMULATION OF A PROCESSOR DESIGN
The present invention relates to processor simulation, and more particularly, but not exclusively, relates to co-simulation of a processor with different models. Simulation techniques have been embraced to debug and otherwise evaluate the performance of electronic circuitry. The simulation of complex programmable processor logic is often performed with several different simulation models. Unfortunately, these models sometimes fail to provide the desired trade-off between accuracy and speed of execution. Thus, there is an ongoing need for further contributions in this area of technology.
One embodiment of the present application is a unique processor simulation technique. Other embodiments include unique methods, systems, devices, and apparatus to simulate a processor.
A further embodiment of the present application includes providing an instruction set architecture simulation and a processor microarchitecture simulation that can be interlaced together. In one particular form, the instruction set architecture simulator stores respective instruction execution information in a first-in, first-out queue, and the processor microarchitecture simulation accesses the queue in accordance with a sequence of instructions being simulated. Another embodiment of the present application includes: providing a processor design including an instruction set architecture and a processor microarchitecture to implement the instruction set architecture, simulating the processor design by operating a first simulator to simulate execution of a sequence of processor instructions in accordance with the instruction set architecture and a second simulator to simulate performance of each of the instructions in accordance with the microarchitecture. In one particular form, this simulation further includes storing respective instruction execution information determined with the first simulator in a queue and accessing the queue with the second simulator to evaluate execution timing of the instructions.
Still another embodiment includes a device with computer-executable logic operable to perform a simulation of a processor design that includes an instruction set architecture and a processor microarchitecture to implement the instruction set architecture. The simulation includes a first simulator to simulate execution of a sequence of instructions in accordance with the instruction set architecture and a second simulator to simulate performance of each of the instructions in accordance with the microarchitecture. The simulation stores respective instruction execution information determined with the first simulator in a queue for each of the instructions and accesses this queue to evaluate execution timing behavior of the sequence of instructions with the second simulator.
Yet another embodiment is directed to a system that includes an operator input device, a computer processor, and an output device. The processor is responsive to the input device to execute simulation logic to evaluate a processor design. The simulation logic defines a first simulator to simulate execution of a sequence of processor instructions in accordance with an instruction set architecture model of the processor design and a second simulator to simulate performance of each of the instructions with a microarchitecture model of the processor design, where the microarchitecture model is effective to implement the instruction set architecture model in the processor design. The simulation logic defines a queue, stores respective instruction execution information in the queue for each of the instructions, and accesses the queue to evaluate execution timing behavior of the instructions as a function of the respective instruction execution information. The output device provides an output representative of execution timing behavior of the instructions.
In yet a further embodiment, an apparatus includes means for performing a simulation of a processor executing a sequence of instructions based on an instruction set architecture model of the processor, means for storing respective instruction information in a queue for each one of the instructions simulated with the instruction set architecture model during the simulation, and means for determining execution timing behavior of the sequence of instructions by simulating instruction execution in accordance with a microarchitecture model of the processor as a function of the respective instruction information read from the queue for each one of the instructions.
One object of the present invention is to provide a unique processor simulation technique.
Other objects include providing a unique method, system, device, or apparatus to simulate a processor. Further objects, embodiments, forms, aspects, benefits, advantages, and features of the present application and its inventions will become apparent from the figures and description provided herewith.
Fig. 1 is a schematic view of a computer system.
Fig. 2 is a diagrammatic view of a processor co-simulation model that is executed with the computer system of Fig. 1.
Figs. 3 and 4 depict processor simulation flowcharts corresponding to the processor model of Fig. 2.
For the purpose of promoting an understanding of the principles of the invention, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended. Any alterations and further modifications in the described embodiments, and any further applications of the principles of the invention as described herein are contemplated as would normally occur to one skilled in the art to which the invention relates.
One embodiment of the present application is a microprocessor simulation model that combines an instruction set architecture simulator with a microarchitecture simulator for fast and accurate hardware and software co-simulation. In one form, a queue is utilized to pass trace records generated with the instruction set architecture simulator to the microarchitecture simulator.
Fig. 1 diagrammatically depicts computer system 20 of a further embodiment of the present application. System 20 includes computer 21 with processor 22. Processor 22 performs operations in accordance with programming instructions and/or another form of operating logic, and more particularly is of a type suitable to perform simulation techniques described hereinafter. In one form, processor 22 is integrated circuit based, including one or more digital, solid-state central processing units each in the form of a microprocessor. It should be understood that while a single processor 22 is depicted, it is representative of multiprocessor arrangements as well as single processor arrangements. Further, processor 22 can be of a reduced instruction set (RISC) type, a complex instruction set (CISC) type, or a combination of both. For multiple processor forms, parallel and/or pipeline processing can be utilized as appropriate. Alternatively or additionally, processor 22 can be provided in the form of one or more components in a single unit or as multiple units. In one embodiment, processor 22 is in the form of one or more highly integrated, digital semiconductor devices.
System 20 also includes operator input devices 24 and operator output devices 26 operatively coupled to processor 22. Input devices 24 include a conventional mouse 24a and keyboard 24b, and alternatively or additionally can include a trackball, light pen, voice recognition subsystem, and/or different input device type as would occur to those skilled in the art. Output devices 26 include a conventional graphic display 26a, such as a color or noncolor plasma, Cathode Ray Tube (CRT), or Liquid Crystal Display (LCD) type, and printer 26b. Alternatively or additionally, output devices 26 can include an aural output system and/or different output device type as would occur to those skilled in the art.
Further, in other embodiments, more or fewer operator input devices 24 or operator output devices 26 may be utilized.
System 20 also includes memory 28 operatively coupled to processor 22. Memory 28 can be of one or more types, such as solid-state electronic memory, magnetic memory, optical memory, or a combination of these. As illustrated in Fig. 1, memory 28 includes a removable/portable memory device 28a that can be an optical disk (such as a CD ROM or DVD); a magnetically encoded hard disk, floppy disk, tape, or cartridge; and/or a different form as would occur to those skilled in the art. In one embodiment, at least a portion of memory 28 is operable to store operating logic for processor 22 in the form of programming instructions. Alternatively or additionally, memory 28 can be arranged to store data other than programming instructions for processor 22. In still other embodiments, memory 28 and/or portable memory device 28a may not be present.
System 20 also includes computer network 30, which can be a Local Area Network (LAN); Metropolitan Area Network (MAN); Wide Area Network (WAN), such as the Internet; another type as would occur to those skilled in the art; or a combination of these. Network 30 couples computer 31 to computer 21, where computer 31 is remotely located relative to computer 21. Computer 31 can include a processor, input devices, output devices, and/or memory as described in connection with computer 21; however, these features of computer 31 are not shown to preserve clarity. Computer 31 and computer 21 can be arranged as client and server, respectively, in relation to some or all of the processing performed with system 20. For this arrangement, it should be understood that many other remote computers 31 could be included as clients of computer 21, but are not shown to preserve clarity. In another embodiment, computer 21 and computer 31 can both be participating members of a distributed processing arrangement with one or more processing units at a different site relative to others. The distributed processors of such an arrangement can be used collectively to execute routines according to the present invention. In still other embodiments, remote computer 31 may be absent.
Operating logic for processor 22 is arranged to facilitate performance of various routines, subroutines, computer models, simulations, procedures, stages, operations, and/or conditionals described hereinafter. This operating logic can be of a dedicated, hardwired variety and/or in the form of programming instructions as is appropriate for the particular processor arrangement. Such logic can be at least partially encoded on device 28a for storage and/or transport to another computer. Alternatively or additionally, the operating logic of computer 21 can be in the form of one or more signals carried by a transmission medium, such as network 30.
Hardware/software co-simulation is used to evaluate digital processor-based systems in which the components exist at various levels of modeling abstraction. In a hardware/software co-simulation environment, embedded software can be simulated together with components modeled in a Hardware Description Language (HDL) (such as Verilog or the like) or components modeled in a high-level programming language, (such as SystemC, C++, or the like). Processor simulation with components modeled at a high abstraction level is typically performed when a lower-level model (such as a detailed HDL model) is not yet available, or when it is desired to execute the simulation faster than would be possible with a standard lower-level model. On the other hand, low level HDL modelling often reveals design shortcomings and provides greater accuracy that a high level model does not. Accordingly, the ability to combine elements of both approaches is sometimes desirable.
One such combination is described in connection with Fig. 2. Fig. 2 depicts processor model 40 in diagrammatic form. Model 40 is developed in accordance with a processor design to be simulated and is encoded as corresponding operating logic executable by computer system 20. Typically, the processor design is directed to execution of programming instructions over one or more execution cycles determined relative to a system clock. For example, the design can be of a RISC variety. In one particular form, a pipelined, scalar RISC processor design is simulated that is of an "unblocked" type. For this type as well as some others, under certain circumstances multiple instructions may be executed at the same time for one or more execution cycles or otherwise executed out of order relative to the programmed sequence. Simulation is commonly performed to evaluate a processor or processor-based system design under development and make appropriate changes in response to simulation results. This evaluation often includes benchmark or other performance testing.
Processor model 40 includes logic execution model 50. Model 50 includes HDL memory model 52 and HDL peripheral hardware model 54. Generally model 50 is representative of logic and hardware that is independent of the processor design being simulated and may be optional. In other embodiments, model 50 may be provided in the form of actual operational hardware that is interfaced with computer system 20 in such a manner that the processor design is effectively emulated in accordance with model 40.
Model 40 further includes processor software execution model 60. Model 60 is co-simulated with instruction set architecture simulator (ISS) 62 and microarchitecture simulator (MAS) 70. Simulator 62 is directed to the simulation of program instructions according to the instruction set architecture (ISA) specified for the processor design under evaluation. The operation of simulator 62 generally mimics the operation visible to and expected by a programmer. This ISA simulation is defined at a higher level of abstraction and typically is coded in a high level programming language. ISS 62 is operatively connected to model 50 by a bus interface 64 with an HDL-modeled system bus. Interface 64 provides for information transmission between simulator 62 and model 50. In a typical simulation, programming instructions and data are stored in memory model 52 and input/output information is provided via peripheral model 54.
MAS simulator 70 is operatively linked to simulator 62 by an instruction queue 66. Instruction queue 66 is arranged as a first-in, first-out buffer between simulator 62 and simulator 70. MAS simulator 70 is modeled to simulate selected microarchitectural features of the processor design undergoing evaluation at a higher level of abstraction relative to typical low level HDL processor models. For this approach, typical features selected for MAS simulation include caches, register dependency checking, branch prediction, and/or other features that tend to significantly impact execution performance. Unlike the operation of simulator 62, the operation of simulator 70 is generally not of the type visible to a programmer making use of a device with the processor design. Further, simulator 70 synchronously operates in a sequential state machine fashion relative to system simulation clock 80. Clock 80 is also coupled to model 50 to synchronize its operation. While simulator 62 interfaces with model 50 subject to the timing imposed by clock 80, simulator 62 asynchronously performs instruction execution simulation internal to the new design with respect to clock 80. Queue 66 is utilized to buffer timing differences between simulator 62 and simulator 70, and to loosely synchronize simulator 62 to clock 80, as is more fully described in connection with the flowcharts of Figs. 3 and 4 as follows.
Referring to Fig. 3, simulation 120 describes one mode of operating simulator 62. Simulation 120 begins with initiation of simulator 62 in operation 122. This initiation includes identification of the sequence of instructions to simulate, such as a benchmark program. In operation 124, simulator 62 fetches and simulates execution of the next instruction, following the order of the instruction sequence. To the extent operation 124 includes access to model 50, simulator 62 is subject to any timing constraints imposed by the synchronous HDL modeling. For example, various instructions in the simulated ISA may require one or more clocked processor interface (PI) cycles to load, store, send, and/or receive information with respect to model 50; where PI cycles are timed relative to clock 80. Aside from such external accesses, simulation performed within simulator 62 is asynchronous relative to clock 80.
Simulation 120 proceeds in operation 125 with generation of an instruction trace record for each instruction simulated with simulator 62. A trace record is generated each time a given instruction is actually executed. Accordingly, a separate trace record results for each time an instruction is executed in a repeated loop, and does not include a record for any instructions present in the program that are not actually executed, such as may occur with conditional branching, jump instructions, or the like. Table I below depicts a few representative instructions and the corresponding trace record information:
Table I
Table I also depicts the instruction sequence order in the first column, assembly-level coding in the second column, corresponding instruction description in the third column, Processor Interface (PI) cycle count in the fourth column, and internal processor execution cycle count in the fifth column for each of four different instructions in the sequence. In one nonlimiting example, the Table I information corresponds to the pipelined, scalar, nonblocked RISC kind of design previously considered.
Next, conditional 130 tests if queue 66 is full. If this test is true (yes), simulation 120 returns via loop 132 to repeat conditional 130. If the test is false (no), simulation 120 proceeds to store the instruction trace record in queue 66 on a first-in, first-out basis. Conditional 140 then tests whether to continue simulation 120. If the test is true (yes), simulation 120 returns via loop 142 to operation 124 to simulate the next instruction. If the test of conditional 140 is false (no), then simulation 120 halts.
It should be appreciated that asynchronous simulation performed with simulator 62 may temporarily halt when simulating access to synchronous models, such as model 50, and/or when queue 66 is full per conditional 130. Otherwise, simulator 62 performs the instructions in sequence independent of clock 80.
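By way of nonlimiting illustration, the producer side of simulation 120 can be sketched in Python as follows. This is a minimal single-threaded sketch under stated assumptions, not the implementation of simulator 62: the function name `iss_step`, the `state` dictionary, the queue capacity of two, the instruction mnemonics, and the placeholder trace record fields are all hypothetical. It shows only how a full queue holds back an already simulated instruction until space becomes available.

```python
from collections import deque


def iss_step(state, queue, capacity):
    """One pass through the ISS loop; returns False when the simulation halts."""
    if state["pending"] is None:
        if state["pc"] >= len(state["program"]):
            return False  # conditional 140: no instructions left, halt
        # Operations 124 and 125: simulate the next instruction in program
        # order and build its trace record (the fields are placeholders).
        state["pending"] = {"seq": state["pc"] + 1,
                            "instr": state["program"][state["pc"]]}
        state["pc"] += 1
    if len(queue) < capacity:
        queue.append(state["pending"])  # store on a first-in, first-out basis
        state["pending"] = None
    else:
        state["stalls"] += 1  # conditional 130: queue full, retry on next pass
    return True


# Hypothetical four-instruction program and a queue of depth two.
state = {"program": ["load", "add", "mul", "add"], "pc": 0,
         "pending": None, "stalls": 0}
queue = deque()
for _ in range(3):
    iss_step(state, queue, capacity=2)
# The third pass found the queue full, so record 3 is still pending.
first_out = queue.popleft()          # a consumer frees one slot
iss_step(state, queue, capacity=2)   # the pending record now fits
```

In this sketch the stall is modeled by keeping the record in `state["pending"]` and counting the retry, which corresponds to repeating the queue-full test via loop 132.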
However, this high-level simulation may sacrifice a degree of cycle accuracy depending on the specifics of the instructions simulated and/or the processor design being evaluated. In one example, the subject processor design is of the pipelined, scalar, nonblocked RISC architecture type. The corresponding ISA includes a "load" instruction that requires multicycle, external memory access (memory model 52) and simultaneously executes subsequent instructions in a processor instruction cache as long as these subsequent instructions do not need the result of this load instruction. In other words, the architecture can fully utilize any available Instruction Level Parallelism (ILP) in the software application. For such an example, simulator 62 does not model the out-of-order execution behavior of the design, instead delaying the execution of the subsequent instructions for the number of cycles it takes the load instruction to finish. In this instance, simulator 62 provides a relatively pessimistic view of design performance behavior for instructions that have to access the Processor Interface (PI), as is the case, for example, for load and store instructions.
On the other hand, because of the high abstraction level at which the ISS is modeled, the ISS can execute complex arithmetic instructions much faster (in terms of required clock cycles) than a processor modeled at a lower abstraction level. For example, the multiply (Mul) instruction from Table I requires 6 cycles to execute on a processor modeled in HDL. If simulator 62 can fetch one instruction per cycle from model 50, it is able to execute this multiply instruction in the same cycle, resulting in just one cycle of execution time. In this way, the ISS provides a relatively optimistic view of design performance behavior for arithmetic, multi-cycle instructions.
To improve cycle accuracy, at least a partial cycle-based model of the processor design microarchitecture is implemented in simulator 70. One mode of operating simulator 70 is described in flowchart form as simulation 220. Simulation 220 starts with operation 222 that initializes operation. Simulation 220 continues with conditional 230 that tests if queue 66 is empty. If the test of conditional 230 is true (yes), simulation 220 repeats conditional 230 via loop 232 until the test is negative (no). From conditional 230, simulation 220 proceeds to operation 240. In operation 240, simulator 70 reads the next corresponding instruction trace record in queue 66 on a first-in, first-out basis. In operation 242, simulator 70 evaluates the trace record information to determine to what extent (if any) two or more instructions are executed simultaneously during one or more execution cycles for the processor design. This evaluation includes determining if a current instruction depends on completing execution of one or more prior instructions (i.e. instruction dependencies) and/or if there are other constraints on concurrent execution, such as limited processing hardware, etc. In operation 246, simulator 70 updates and provides execution timing behavior of the processor design based on this modeling. This timing behavior may be provided to an operator with one or more of devices 26, along with any other simulation information of interest. Further, based on the cumulative timing behavior evaluation of all simulated instructions, the total execution time of the instruction sequence based on the MAS model is determined. Simulation 220 proceeds from operation 246 to conditional 250. Conditional 250 tests if simulation 220 should continue. If the test of conditional 250 is true (yes), simulation 220 loops back to conditional 230 via loop 252. If the test of conditional 250 is false (no), then simulation 220 halts.
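By way of nonlimiting illustration, the consumer side of simulation 220 can be sketched in Python as a state machine advanced once per clock cycle. This is a sketch under stated assumptions rather than the implementation of simulator 70: the class name `MasSketch`, the record fields `src` and `dst`, the register names, and the cycle counts in `CYCLE_COUNT` are hypothetical, and register dependency checking is the only microarchitectural feature modeled.

```python
from collections import deque

# Assumed per-type execution cycle counts (illustrative values only).
CYCLE_COUNT = {"alu": 1, "mad": 6}


class MasSketch:
    """Clocked consumer of trace records with register dependency checking."""

    def __init__(self, queue):
        self.queue = queue
        self.cycle = 0        # elapsed simulation clock cycles
        self.current = None   # record read from the queue, not yet processed
        self.ready_at = {}    # register -> first cycle its value can be used
        self.done = []        # (sequence number, cycle in which processed)

    def tick(self):
        """Advance one clock cycle, processing at most one trace record."""
        self.cycle += 1
        if self.current is None:
            if not self.queue:
                return  # conditional 230: queue empty, test again next cycle
            self.current = self.queue.popleft()  # operation 240: FIFO read
        rec = self.current
        # Operation 242: stall while any source register is still pending.
        if any(self.ready_at.get(reg, 0) > self.cycle for reg in rec["src"]):
            return
        # Operation 246: update the timing state for the processed record.
        for reg in rec["dst"]:
            self.ready_at[reg] = self.cycle + CYCLE_COUNT[rec["kind"]]
        self.done.append((rec["seq"], self.cycle))
        self.current = None


# Two hypothetical records: the second needs the result (r3) of the first.
queue = deque([
    {"seq": 1, "kind": "mad", "src": ["r1", "r2"], "dst": ["r3"]},
    {"seq": 2, "kind": "alu", "src": ["r3"], "dst": ["r4"]},
])
mas = MasSketch(queue)
while len(mas.done) < 2:
    mas.tick()
```

In this sketch the dependent record is held in `current` for six cycles after the multiply-type record is processed, so the total elapsed time reflects the assumed six-cycle count rather than one cycle per instruction.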
By coupling simulator 62 and simulator 70 with queue 66, instruction trace information is passed to the modeled microarchitectural features. This co-simulation approach provides the ability to flexibly blend high-level ISA simulation (simulator 62) with cycle-specific microarchitectural simulation (simulator 70). By focusing MAS at a higher abstraction level than standard HDL modeling, a better cycle accuracy than ISA modeling is possible without the complex intricacies of a complete low level HDL processor model. Correspondingly, simulation performance time is commensurately less. Moreover, adjustment to the processor design simulation can be readily translated into simulator 70 as the processor design is being developed, and/or before a complete HDL processor model is available. Indeed, one embodiment of co-simulation model 40 includes iterative performance of a design, test, and development sequence until the design achieves desired objectives. Nonetheless, in other embodiments, such features may be absent or differently realized.
Relative to the representative instructions of Table I, one nonlimiting example is next described in detail for a pipelined, scalar, nonblocked RISC processor design type; however, such instructions could be applicable to other designs, as well. The last column of Table I represents four trace record entries in queue 66 from first-in (no. 1 in the first column) to last-in (no. 4 in the first column). In this example, each trace record includes three types of data: (a) processor register updates resulting from the simulated execution of the instruction with simulator 62; (b) the instruction type/category executed, such as a load (1st entry), arithmetic-logic unit (ALU) (2nd and 4th entries), or multiplication dedicated operation (MAD) (3rd entry); and (c) the number of execution cycles the instruction was delayed (execution cycle count) because of processor interfacing access (e.g. a load instruction that requires external HDL memory model access to obtain data).
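By way of nonlimiting illustration, a trace record carrying these three types of data could be represented as a small data structure. The Python sketch below is an assumption-laden example: the names `TraceRecord`, `InstrKind`, `reg_updates`, `kind`, and `pi_delay_cycles`, as well as the register name and value, are hypothetical and are not taken from Table I; only the four-cycle delay of the load follows the example described herein.

```python
from dataclasses import dataclass, field
from enum import Enum


class InstrKind(Enum):
    """(b) Instruction type/category recorded by the ISS."""
    LOAD = "load"
    ALU = "alu"
    MAD = "mad"


@dataclass
class TraceRecord:
    kind: InstrKind                                   # (b) type/category
    reg_updates: dict = field(default_factory=dict)   # (a) register -> new value
    pi_delay_cycles: int = 0                          # (c) delay from PI access


# Hypothetical first-in entry: a load delayed four cycles by its PI access.
record = TraceRecord(kind=InstrKind.LOAD,
                     reg_updates={"r2": 0x1234},
                     pi_delay_cycles=4)
```

Keeping the register updates in the record lets a consumer such as simulator 70 check dependencies without re-executing the instruction.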
Execution of the four instructions in the second column of Table I is simulated with simulator 62 and the corresponding trace records shown in the last column are stored in queue 66 as space becomes available. For this example, the instructions are all executed out of a dedicated instruction cache of the processor design, and the only microarchitectural features simulated by simulator 70 are for register dependency checking. In other embodiments, different MAS features, dependencies, and/or constraints can be modeled as desired.
At the first available execution cycle, MAS simulator 70 processes the first trace record from queue 66. Simulator 70 is defined with data about the cycle count for each instruction type; in the instant case, simulator 70 accesses cycle count data for a LOAD type instruction and determines it does not need to wait before processing the next record. Because a nonblocking architecture is being modeled, simulator 70 need not yet take into account the four PI cycles it took the LOAD instruction to obtain its data from memory model 52 in this example. In the second execution cycle, simulator 70 then processes trace record 2, accessing cycle count data for the corresponding ALU instruction type to determine that an execution cycle count of one applies. Simulator 70 also tracks the PI cycle quantity for prior instructions, such as the LOAD instruction for the first trace record, to evaluate timing. Because there are no register dependencies between the ALU and the LOAD instructions, simulator 70 determines there is no need to wait before processing the next record.
In the third cycle, simulator 70 processes trace record 3, and again there are no pending dependencies, so it proceeds immediately with trace record 4. The cycle count for a MAD-type instruction is 6, as determined by accessing the cycle count data with simulator 70. Simulator 70 detects a register dependency on register r5, and although the ALU type instruction of trace record 4 takes only one cycle, simulator 70 waits six cycles before processing the next record to assure use of the proper value for register r5 in the fourth instruction. After waiting six cycles, the processing of trace records 1-4 results in a cumulative cycle count of nine. In the ninth cycle, simulator 70 flags the MAD-type instruction as being complete, and proceeds with processing the next trace record from ISS simulator 62. Accordingly, for this example, based on simulator 70 the total execution time equates to nine cycles for the four instructions tabulated. In contrast, a processor model 40 based only on ISS 62 results in a more optimistic execution time of seven cycles. It should be noted that the application of MAS 70 and trace queue 66 improves the cycle accuracy of processor model 40. It does so by inserting penalty cycles for instructions that were executed by ISS 62 with an optimistic view of the design behavior, as is the case for trace record 3, and by ignoring penalty cycles for instructions that were executed by ISS 62 with a pessimistic view of the design behavior, as is the case for trace record 1. It should be appreciated that in other embodiments, different processor design types, ISAs, cycle counts, different MAS features, dependencies, and/or constraints are simulated as desired with corresponding adaptations to simulator 62 and/or simulator 70.
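By way of nonlimiting illustration, the nine-cycle and seven-cycle totals of this example can be reproduced with the short Python tally below. It is a sketch under stated assumptions: only the four PI cycles of the LOAD, the cycle counts of one for ALU and six for MAD, and the dependency of the fourth instruction on register r5 come from the example; the remaining register names, the function names, and the rule that the ISS-only view charges one cycle per instruction unless a PI access takes longer are hypothetical simplifications.

```python
# (kind, source registers, destination register, PI delay cycles).
# Register names other than r5 are placeholders, not Table I content.
RECORDS = [
    ("load", ["r1"], "r2", 4),
    ("alu",  ["r3"], "r4", 0),
    ("mad",  ["r3"], "r5", 0),
    ("alu",  ["r5"], "r6", 0),
]
MAS_CYCLES = {"load": 1, "alu": 1, "mad": 6}


def iss_only_cycles(records):
    """Blocking view: each instruction costs one cycle, or its PI cycles if larger."""
    return sum(max(1, pi) for _kind, _srcs, _dst, pi in records)


def mas_cycles(records):
    """Nonblocking view: return the cycle in which the last record is processed."""
    ready_at = {}   # register -> first cycle its value can be used
    cycle = 0
    for kind, srcs, dst, pi in records:
        # Process at most one record per cycle, and no earlier than the
        # cycle in which every source register becomes usable.
        cycle = max([cycle + 1] + [ready_at.get(reg, 0) for reg in srcs])
        # The result is usable after the type's cycle count, or after the
        # PI delay if that is longer (tracked, but not blocking).
        ready_at[dst] = cycle + max(MAS_CYCLES[kind], pi)
    return cycle


iss_total = iss_only_cycles(RECORDS)   # 4 + 1 + 1 + 1
mas_total = mas_cycles(RECORDS)        # records at cycles 1, 2, 3, then 3 + 6
```

Under these assumptions the MAS tally removes the four-cycle blocking penalty of the load and adds the six-cycle wait on register r5, which accounts for the difference between the two totals.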
While the invention has been illustrated and described in detail in the drawings and foregoing description, the same is to be considered as illustrative and not restrictive in character, it being understood that only selected embodiments have been shown and described and that all changes, modifications and equivalents that come within the spirit of the inventions described heretofore and/or defined by the following claims are desired to be protected.