EP1159693A2 - Automated processor generation system & method for designing a configurable processor

Info

Publication number
EP1159693A2
EP1159693A2 EP00913380A EP00913380A EP1159693A2 EP 1159693 A2 EP1159693 A2 EP 1159693A2 EP 00913380 A EP00913380 A EP 00913380A EP 00913380 A EP00913380 A EP 00913380A EP 1159693 A2 EP1159693 A2 EP 1159693A2
Authority
EP
European Patent Office
Prior art keywords
user
processor
instruction
instructions
module
Prior art date
Legal status
Ceased
Application number
EP00913380A
Other languages
German (de)
English (en)
French (fr)
Inventor
Earl A. Killian
Ricardo E. Gonzalez
Ashish B. Dixit
Monica Lam
Walter D. Lichtenstein
Christopher Rowen
John Ruttenberg
Robert P. Wilson
Albert Ren-Rui Wang
Dror Eliezer Mayden
Weng Kiang Tjiang
Richard Rudell
Current Assignee
Tensilica Inc
Original Assignee
Tensilica Inc
Priority date
Filing date
Publication date
Priority claimed from US09/246,047 (US6477683B1)
Priority claimed from US09/323,161 (US6701515B1)
Priority claimed from US09/322,735 (US6477697B1)
Application filed by Tensilica Inc
Publication of EP1159693A2

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/30 Circuit design
    • G06F30/32 Circuit design at the digital level
    • G06F30/33 Design verification, e.g. functional simulation or model checking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/28 Error detection; Error correction; Monitoring by checking the correct order of processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/362 Software debugging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/10 Program control for peripheral devices
    • G06F13/12 Program control for peripheral devices using hardware independent of the central processor, e.g. channel or peripheral processor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/30 Circuit design
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/30 Circuit design
    • G06F30/32 Circuit design at the digital level
    • G06F30/33 Design verification, e.g. functional simulation or model checking
    • G06F30/3308 Design verification, e.g. functional simulation or model checking using simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/20 Software design
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/37 Compiler construction; Parser generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines

Definitions

  • the present invention is directed to microprocessor systems; more particularly, the invention is directed to the design of an application solution containing one or more processors where the processors in the system are configured and enhanced at the time of their design to improve their suitability to a particular application.
  • the invention is additionally directed to a system in which application developers can rapidly develop instruction extensions, such as new instructions, to an existing instruction set architecture, including new instructions which manipulate user-defined processor state, and immediately measure the impact of the extension on the application run time and on the processor cycle time.
  • processors have traditionally been difficult to design and to modify. For this reason, most systems that contain processors use ones that were designed and verified once for general-purpose use, and then used by multiple applications over time. As such, their suitability for a particular application is not always ideal. It would often be appropriate to modify the processor to execute a particular application's code better (e.g., to run faster, consume less power, or cost less). However, the difficulty, and therefore the time, cost, and risk, of even modifying an existing processor design is high, and this is not typically done.
  • the instruction set architecture (ISA) is developed. This is a step which is essentially done once and used for decades by many systems.
  • the Intel Pentium® processor can trace the legacy of its instruction set back to the 8008 and 8080 microprocessors introduced in the mid-1970s.
  • the ISA instructions, syntax, etc. are developed, and software development tools for that ISA such as assemblers, debuggers, compilers and the like are developed.
  • a simulator for that particular ISA is developed and various benchmarks are run to evaluate the effectiveness of the ISA, and the ISA is revised according to the results of the evaluation.
  • the ISA will be considered satisfactory, and the ISA process will end with a fully developed ISA specification, an ISA simulator, an ISA verification suite and a development suite including, e.g., an assembler, debugger, compiler, etc.
  • processor design commences. Since processors can have useful lives of a number of years, this process is also done fairly infrequently — typically, a processor will be designed once and used for many years by several systems. Given the ISA, its verification suite and simulator and various processor development goals, the microarchitecture of the processor is designed, simulated and revised. Once the microarchitecture is finalized, it is implemented in a hardware description language (HDL) and a microarchitecture verification suite is developed and used to verify the HDL implementation (more on this later). Then, in contrast to the manual processes described to this point, automated design tools may synthesize a circuit based on the HDL description and place and route its components. The layout may then be revised to optimize chip area usage and timing.
  • HDL hardware description language
  • Another difficulty with prior art processor design stems from the fact that it is not appropriate to simply design traditional processors with more features to cover all applications, because any given application only requires a particular set of features, and a processor with features not required by the application is overly costly, consumes more power and is more difficult to fabricate. In addition, it is not possible to know all of the application targets when a processor is initially designed. If the processor modification process could be automated and made reliable, then the ability of a system designer to create application solutions would be significantly enhanced.
  • consider a device designed to transmit and receive data over a channel using a complex protocol. Because the protocol is complex, the processing cannot be reasonably accomplished entirely in hard-wired, e.g., combinatorial, logic, and instead a programmable processor is introduced into the system for protocol processing. Programmability also allows bug fixes and later upgrades to protocols to be done by loading the instruction memories with new software.
  • the traditional processor was probably not designed for this particular application (the application may not have even existed when the processor was designed), and there may be operations that it needs to perform that require many instructions to accomplish which could be done with one or a few instructions with additional processor logic. Because the processor cannot easily be enhanced, many system designers do not attempt to do so, and instead choose to execute an inefficient pure-software solution on an available general-purpose processor.
  • the inefficiency results in a solution that may be slower, or require more power, or be costlier (e.g., it may require a larger, more powerful processor to execute the program at sufficient speed).
  • Other designers choose to provide some of the processing requirements in special-purpose hardware that they design for the application, such as a coprocessor, and then have the programmer code up access to the special-purpose hardware at various points in the program.
  • the time to transfer data between the processor and such special-purpose hardware limits the utility of this approach to system optimization because only fairly large units of work can be sped up enough so that the time saved by using the special-purpose hardware is greater than the additional time required to transfer data to and from the specialized hardware.
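The break-even condition in the preceding item can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `offload_pays_off` and all cycle counts are assumptions chosen for the example, not figures from the source.

```python
def offload_pays_off(sw_cycles: int, hw_cycles: int, transfer_cycles: int) -> bool:
    """Return True when special-purpose hardware is a net win.

    sw_cycles:       cycles to do the work in software on the processor
    hw_cycles:       cycles the special-purpose hardware needs for the same work
    transfer_cycles: cycles spent moving data to and from that hardware
    All three values are hypothetical inputs supplied by the caller.
    """
    time_saved = sw_cycles - hw_cycles
    return time_saved > transfer_cycles


# A small unit of work: the transfer cost swamps the saving.
small = offload_pays_off(sw_cycles=40, hw_cycles=4, transfer_cycles=100)
# A large unit of work: the saving exceeds the same transfer cost.
large = offload_pays_off(sw_cycles=4000, hw_cycles=400, transfer_cycles=100)
print(small, large)
```

With these made-up numbers only the larger unit of work justifies the transfer, which is the limitation the text describes.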
  • the protocol might require encryption, error correction, or compression/decompression processing. Such processing often operates on individual bits rather than a processor's larger words.
  • the circuitry for a computation may be rather modest, but the need for the processor to extract each bit, sequentially process it and then repack the bits adds considerable overhead.
  • TABLE I length must be computed, so that length bits can be shifted off to find the start of the next element to be decoded in the stream.
  • a possible solution to the problem of accommodating specific application requirements in processors is to use configurable processors having instruction sets and architectures which can be easily modified and extended to enhance the functionality of the processor and customize that functionality. Configurability allows the designer to specify whether or how much additional functionality is required for her product.
  • the simplest sort of configurability is a binary choice: either a feature is present or absent. For example, a processor might be offered either with or without floating-point hardware.
  • Flexibility may be improved by configuration choices with finer gradation.
  • the processor might, for example, allow the system designer to specify the number of registers in the register file, memory width, the cache size, cache associativity, etc. However, these options still do not reach the level of customizability desired by system designers.
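As a rough illustration of the two kinds of configurability described above (a binary feature choice and finer-grained parameters), a configuration record might look like the following sketch. The class name, field names, default values and validation rules are all assumptions made for this example; the source does not specify them.

```python
from dataclasses import dataclass
from typing import List


def _is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


@dataclass(frozen=True)
class ProcessorConfig:
    # Binary choice: the feature is either present or absent.
    has_floating_point: bool = False
    # Finer-grained choices (hypothetical defaults).
    num_registers: int = 32
    memory_width_bits: int = 32
    cache_size_bytes: int = 8192
    cache_associativity: int = 2

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the config is accepted.

        The rules below are illustrative assumptions, not rules from the source.
        """
        errors = []
        if not _is_power_of_two(self.num_registers):
            errors.append("num_registers must be a power of two")
        if self.memory_width_bits not in (32, 64, 128):
            errors.append("memory_width_bits must be 32, 64 or 128")
        if not _is_power_of_two(self.cache_size_bytes):
            errors.append("cache_size_bytes must be a power of two")
        if not _is_power_of_two(self.cache_associativity):
            errors.append("cache_associativity must be a power of two")
        return errors


print(ProcessorConfig().validate())
print(ProcessorConfig(num_registers=48).validate())
```

Options of this kind select among variations the processor designer anticipated; they do not let a system designer add behavior of her own, which is the gap the text goes on to describe.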
  • the system designer might like to include a specific instruction to perform the decode, e.g., huff8 t1, t0 where the most significant eight bits in the result are the decoded value and the least significant eight bits are the length.
  • a direct hardware implementation of the Huffman decode is quite simple: the logic to decode the instruction represents roughly thirty gates for just the combinatorial logic function exclusive of instruction decode, etc., or less than 0.1% of a typical processor's gate count, and can be computed by a special-purpose processor instruction in a single cycle, thus representing an improvement factor of 4-20 over using general-purpose instructions only.
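To see why a software decode costs many general-purpose operations, the following sketch decodes one symbol by testing bits one at a time and packs the result the way the text describes: decoded value in the most significant eight bits, length in the least significant eight bits. The prefix code in `CODE_TABLE` is a made-up stand-in, since Table I is not reproduced here, and the 8-bit, most-significant-bit-first input window is likewise an assumption.

```python
from typing import List

# Hypothetical prefix code, NOT the patent's Table I.
CODE_TABLE = {"0": 0, "10": 1, "110": 2, "111": 3}


def huff_decode(window: int, width: int = 8) -> int:
    """Decode one symbol from the top bits of `window`.

    Returns (value << 8) | length, mirroring the result layout in the text.
    """
    prefix = ""
    for i in range(width):
        bit = (window >> (width - 1 - i)) & 1  # extract one bit, MSB first
        prefix += str(bit)
        if prefix in CODE_TABLE:
            return (CODE_TABLE[prefix] << 8) | len(prefix)
    raise ValueError("no code word matches")


def decode_stream(bits: str) -> List[int]:
    """Decode a string of '0'/'1' characters into symbol values."""
    values = []
    pos = 0
    while pos < len(bits):
        chunk = bits[pos:pos + 8].ljust(8, "0")  # pad the tail of the stream
        result = huff_decode(int(chunk, 2))
        values.append(result >> 8)  # most significant eight bits: decoded value
        pos += result & 0xFF        # least significant eight bits: bits to shift off
    return values


print(decode_stream("0101101110"))
```

Each call to `huff_decode` loops over individual bits, which is the extract, process and repack overhead the text attributes to general-purpose instructions; a dedicated instruction would produce the same packed result in one step.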
  • the Synopsys DW8051 includes a binary-compatible implementation of an existing processor architecture; and a small number of synthesis parameters, e.g., 128 or 256 bytes of internal RAM, a ROM address range determined by a parameter rom_addr_size, an optional interval timer, a variable number (0-2) of serial ports, and an interrupt unit which supports either six or thirteen sources.
  • the ARM/Synopsys ARM7-S processor includes a binary-compatible implementation of existing architecture and microarchitecture. It has two configurable parameters: the selection of a high-performance or low-performance multiplier, and inclusion of debug and in-circuit emulation logic.
  • the Lexra LX-4080 processor has a configurable variant of the standard MIPS architecture and has no software support for instruction set extensions. Its options include a custom engine interface which allows extension of MIPS ALU opcodes with application-specific operations; an internal hardware interface which includes a register source and a register or 16 bit-wide immediate source, and destination and stall signals; a simple memory management unit option; three MIPS coprocessor interfaces; a flexible local memory interface to cache, scratchpad RAM or ROM; a bus controller to connect peripheral functions and memories to the processor's own local bus; and a write buffer of configurable depth.
  • the ARC configurable RISC core has a user interface with on-the-fly gate count estimation based on target technology and clock speed, instruction cache configuration, instruction set extensions, a timer option, a scratch-pad memory option, and memory controller options; an instruction set with selectable options such as local scratchpad RAM with block move to memory, special registers, up to sixteen extra condition code choices, a 32 x 32 bit scoreboarded multiply block, a single cycle 32 bit barrel-shifter/rotate block, a normalize (find first bit) instruction, writing results directly to a command buffer (not to the register file), a 16 bit MUL/MAC block and 36 bit accumulator, and sliding pointer access to local SRAM using linear arithmetic; and user instructions defined by manual editing of VHDL source code.
  • the ARC design has no facility for implementing an instruction set description language, nor does it generate software tools specific to the configured processor.
  • the Synopsys configurable PCI interface includes a GUI or command line interface to installation, configuration and synthesis activities; checking that prerequisite user actions are taken at each step; installation of selected design files based on configuration (e.g., Verilog vs. VHDL); selective configuration such as parameter setting and prompting of users for configuration values with checking of combination validity, and HDL generation with user updating of HDL source code and no editing of HDL source files; and synthesis functions such as a user interface which analyzes a technology library to select I/O pads, technology-independent constraints and synthesis script, pad insertion and prompts for technology-specific pads, and translation of technology-independent formulae into technology-dependent scripts.
  • the configurable PCI bus interface is notable because it implements consistency checking of parameters, configuration-based installation, and automatic modification of HDL files.
  • the second category of prior art work in the area of configurable processor generation, i.e., automatic retargeting of compilers and assemblers, encompasses a rich area of academic research; see, e.g., Hanono et al., "Instruction Selection, Resource Allocation and Scheduling in the AVIV Retargetable Code Generator" (representation of machine instructions used for automatic creation of code generators); Fauth et al., "Describing Instruction Set Processors Using nML"; Ramsey et al., "Machine Descriptions to Build Tools for Embedded Systems"; Aho et al., "Code Generation Using Tree Matching and Dynamic Programming" (algorithms to match up transformations associated with each machine instruction, e.g., add, load, store, branch, etc., with a sequence of program operations represented by some machine-independent intermediate form using methods such as pattern matching); and Cattell, "Formal
  • processors generally execute instructions from a stored program using a pipeline with each stage suited to one phase of the instruction execution. Therefore, changing or adding an instruction or changing the configuration may require widespread changes in the processor's logic so each of the multiple pipeline stages can perform the appropriate action on each such instruction.
  • Configuration of a processor requires that it be re-verified, and that this verification adapt to the changes and additions. This is not a simple task.
  • Processors are complex logic devices with extensive internal data and control state, and the combinatorics of control and data and program make processor verification a demanding art. Adding to the difficulty of processor verification is the difficulty in developing appropriate verification tools. Since verification is not automated in prior art techniques, its flexibility, speed and reliability are less than optimal.
  • processors are generally programmed with the aid of extensive software tools, including compilers, assemblers, linkers, debuggers, simulators and profilers.
  • the software tools must change as well. It does no good to add an instruction if that instruction cannot be compiled, assembled, simulated or debugged.
  • the cost of software changes associated with processor modifications and enhancements has been a major impediment to flexible processor design in the prior art.
  • prior art processor design is difficult enough that processors are not typically designed or modified for a specific application. Also, it can be seen that considerable improvements in system efficiency are possible if processors could be configured or extended for specific applications. Further, the efficiency and effectiveness of the design process could be enhanced if it were able to use feedback on implementation characteristics such as power consumption, speed, etc. in refining a processor design. Moreover, in the prior art once a processor is modified, a great deal of effort is required to verify the correct operation of the processor after modification. Finally, although prior art techniques provide for limited processor configurability, they fail to provide for the generation of software development tools tailored for use with the configured processor.
  • the user often keeps multiple versions of the compiled application: one original version and another version containing the potential improvement.
  • potential improvements might interact, and the user might keep more than two copies of the application, each using a different subset of the potential improvements.
  • the user can easily test the different versions repeatedly under different circumstances.
  • the software development system can be very large. Keeping many versions can become unmanageable.
  • the software development system is configured for the entire processor. That makes it difficult to separate the development process among different engineers.
  • One developer might be responsible for deciding on cache characteristics of the processor and another responsible for adding customized instructions. While the work of the two developers is related, each piece is sufficiently separable so that each developer can work on her task in isolation.
  • the cache developer might initially propose a particular configuration. The other developer starts with that configuration and tries out several instructions, building a software development system for each potential instruction. Now, the cache developer modifies the proposed cache configuration. The other developer must now rebuild every one of her configurations, since each of her configurations assumed the original cache configuration. With many developers working on a project, organizing the different configurations can quickly become unmanageable.
  • the present invention overcomes these problems of the prior art and has an object of providing a system which can automatically configure a processor by generating both a description of a hardware implementation of the processor and a set of software development tools for programming the processor from the same configuration specification.
  • the above objects are achieved by providing an automated processor generation system which uses a description of customized processor instruction set options and extensions in a standardized language to develop a configured definition of a target instruction set, a Hardware Description Language description of circuitry necessary to implement the instruction set, and development tools such as a compiler, assembler, debugger and simulator which can be used to generate software for the processor and to verify the processor.
  • Implementation of the processor circuitry can be optimized for various criteria such as area, power consumption and speed. Once a processor configuration is developed, it can be tested and inputs to the system modified to iteratively optimize the processor implementation.
  • an instruction set architecture description language is defined and configurable processor/system configuration tools and development tools such as assemblers, linkers, compilers and debuggers are developed. This is part of the development process because although large portions of the tools are standard, they must be made to be automatically configured from the ISA description. This part of the design process is typically done by the designer or manufacturer of the automated processor design tool itself.
  • An automated processor generation system operates as follows.
  • a user, e.g., a system designer, develops a configured instruction set architecture. That is, using the ISA definition and tools previously developed, a configurable instruction set architecture following certain ISA design goals is developed. Then, the development tools and simulator are configured for this instruction set architecture. Using the configured simulator, benchmarks are run to evaluate the effectiveness of the configurable instruction set architecture, and the core is revised based on the evaluation results. Once the configurable instruction set architecture is in a satisfactory state, a verification suite is developed for it. Along with these software aspects of the process, the system attends to hardware aspects by developing a configurable processor.
  • the system designs an overall system architecture which takes configurable ISA options, extensions and processor feature selection into account.
  • the processor ISA, HDL implementation, software and simulator are configured by the system and system HDL is designed for system-on-a-chip designs.
  • a chip foundry is chosen based on an evaluation of foundry capabilities with respect to the system HDL (not related to processor selection as in the prior art).
  • the configuration system synthesizes circuitry, places and routes it, and provides the ability to re-optimize the layout and timing. Then, circuit board layouts are designed if the design is not of the single-chip type, chips are fabricated, and the boards are assembled.
  • the first technique used to address these issues is to design and implement specific mechanisms that are not as flexible as an arbitrary modification or extension, but which nonetheless allow significant functionality improvements. By constraining the arbitrariness of the change, the problems associated with it are constrained.
  • the second technique is to provide a single description of the changes and automatically generate the modifications or extensions to all affected components. Processors designed with prior art techniques have not done this because it is often cheaper to do something once manually than to write a tool to do it automatically and use the tool once. The advantage of automation applies when the task is repeated many times.
  • a third technique employed is to build a database to assist in estimation and automatic configuration for subsequent user evaluation.
  • a fourth technique is to provide hardware and software in a form that lends itself to configuration.
  • some of the hardware and software are not written directly in standard hardware and software languages, but in languages enhanced by the addition of a preprocessor that allows queries of the configuration database and the generation of standard hardware and software language code with substitutions, conditionals, replication, and other modifications.
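The preprocessing idea above (query a configuration database, then emit standard-language code using substitution, conditionals and replication) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `CONFIG` keys, the module and signal names, and the use of Python as the generator are all hypothetical and are not the source's actual preprocessor.

```python
from typing import Dict, Union

# Hypothetical configuration database.
CONFIG: Dict[str, Union[int, bool]] = {
    "data_width": 32,
    "has_multiplier": True,
    "num_interrupts": 3,
}


def generate_hdl(config: Dict[str, Union[int, bool]]) -> str:
    """Emit Verilog-style text specialized by the configuration."""
    lines = ["module core_top;"]
    # Substitution: a configured value is spliced into the generated code.
    lines.append(f"  wire [{config['data_width'] - 1}:0] data_bus;")
    # Conditional: a block is emitted only when the option is selected.
    if config["has_multiplier"]:
        lines.append("  multiplier mul0 ();")
    # Replication: one declaration per configured interrupt.
    for i in range(config["num_interrupts"]):
        lines.append(f"  wire irq{i};")
    lines.append("endmodule")
    return "\n".join(lines)


print(generate_hdl(CONFIG))
```

The generated text is ordinary hardware-description code, so downstream synthesis and simulation tools need no knowledge of the configuration mechanism.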
  • the core processor design is then done with hooks that allow the enhancements to be linked in.
  • To illustrate these techniques, consider the addition of application-specific instructions. By constraining the method to instructions that have register and constant operands and which produce a register result, the operation of the instructions can be specified with only combinatorial (stateless, feedback-free) logic. This input specifies the opcode assignments, instruction name, assembler syntax and the combinatorial logic for the instructions, from which tools generate:
    • instruction decode logic for the processor to recognize the new opcodes;
    • simulator modifications to accept the new opcodes and to perform the specified logic function; and
    • diagnostic generators which generate both direct and random code sequences that contain and check the results of the added instructions.
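The idea of deriving several artifacts from one instruction description can be sketched as follows. A single list of specifications drives a decode table, a simulator step and a random diagnostic generator. Everything named here (`InstructionSpec`, the two example instructions, their opcodes and semantics) is hypothetical, and a Python function stands in for the stateless combinatorial logic.

```python
import random
from typing import Callable, Dict, List, NamedTuple, Tuple

MASK32 = 0xFFFFFFFF


class InstructionSpec(NamedTuple):
    name: str
    opcode: int
    semantics: Callable[[int, int], int]  # stateless function of two operands


# Hypothetical user-defined instructions.
SPECS = [
    InstructionSpec("add4", 0x01, lambda a, b: (a + b + 4) & MASK32),
    InstructionSpec("xormask", 0x02, lambda a, b: (a ^ b) & MASK32),
]


def build_decoder(specs: List[InstructionSpec]) -> Dict[int, InstructionSpec]:
    """Generated 'decode logic': map each opcode to its instruction."""
    decoder: Dict[int, InstructionSpec] = {}
    for spec in specs:
        if spec.opcode in decoder:
            raise ValueError(f"opcode {spec.opcode:#x} assigned twice")
        decoder[spec.opcode] = spec
    return decoder


def simulate(decoder: Dict[int, InstructionSpec], opcode: int,
             regs: List[int], dst: int, src1: int, src2: int) -> None:
    """Generated 'simulator modification': execute one decoded instruction."""
    regs[dst] = decoder[opcode].semantics(regs[src1], regs[src2])


def random_diagnostics(spec: InstructionSpec, count: int,
                       seed: int = 0) -> List[Tuple[int, int, int]]:
    """Generated diagnostics: random operands plus the expected result."""
    rng = random.Random(seed)
    cases = []
    for _ in range(count):
        a, b = rng.getrandbits(32), rng.getrandbits(32)
        cases.append((a, b, spec.semantics(a, b)))
    return cases


def run_diagnostics(decoder: Dict[int, InstructionSpec],
                    spec: InstructionSpec, count: int = 100) -> bool:
    """Feed the diagnostic cases through the simulator and check each result."""
    for a, b, expected in random_diagnostics(spec, count):
        regs = [0, a, b]
        simulate(decoder, spec.opcode, regs, dst=0, src1=1, src2=2)
        if regs[0] != expected:
            return False
    return True


decoder = build_decoder(SPECS)
print(all(run_diagnostics(decoder, spec) for spec in SPECS))
```

In this toy version the simulator and the diagnostics share the same semantic function, so the check is trivially consistent. In a real flow the same generated diagnostics would presumably also be run against the hardware implementation, which is where a single source description pays off.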
  • ISA encoding, software tools and high-level simulation may be included in a configurable package, and flow may be designed for iteration to find an optimal combination of configuration values. Further, while previous methods focused only on hardware configuration or software configuration alone without a single user interface for control, or a measurement system for user-directed redefinition, the present invention contributes a complete flow for configuration of processor hardware and software, including feedback from hardware design results and software performance to aid selection of an optimal configuration.
  • an automated processor design tool which uses a description of customized processor instruction set extensions in a standardized language to develop a configurable definition of a target instruction set, a Hardware Description Language description of circuitry necessary to implement the instruction set, and development tools such as a compiler, assembler, debugger and simulator which can be used to develop applications for the processor and to verify it.
  • the standardized language is capable of handling instruction set extensions which modify processor state or use configurable processors.
  • the user selects and builds a base processor configuration using the methods described herein.
  • the user creates a new set of user-defined processor enhancements and places them in a file directory.
  • the user then invokes a tool that processes the user enhancements and transforms them into a form usable by the base software development tools. This transformation is very quick since it involves only the user-defined enhancements and does not build an entire software system.
  • the user then invokes the base software development tools, telling the tools to dynamically use the processor enhancements created in the new directory.
  • the location of the directory is given to the tools either via a command line option or via an environment variable.
  • the user can use standard software makefiles. These enable the user to modify their processor instructions and then via a single make command, process the enhancements and use the base software development system to rebuild and evaluate their application in the context of the new processor enhancements.
  • the invention overcomes the three limitations of the prior art approach. Given a new set of potential enhancements, the user can evaluate the new enhancements in a matter of minutes. The user can keep many versions of potential enhancements by creating new directories for each set. Since the directory only contains descriptions of the new enhancements and not the entire software system, the storage space required is minimal. Finally, the new enhancements are decoupled from the rest of the configuration. Once the user has created a directory with a potential set of new enhancements, she can use that directory with any base configuration.
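The directory lookup described above, where the tools learn the location of the enhancement directory from a command line option or an environment variable, might be sketched like this. The option name `--enhancement-dir`, the variable name `PROC_ENHANCEMENT_DIR`, and the rule that the command line takes precedence are all assumptions made for illustration; the source names none of them.

```python
import argparse
import os
from typing import List, Mapping, Optional

ENV_VAR = "PROC_ENHANCEMENT_DIR"  # hypothetical environment variable name


def resolve_enhancement_dir(argv: List[str],
                            environ: Optional[Mapping[str, str]] = None
                            ) -> Optional[str]:
    """Return the enhancement directory, or None to use the base configuration.

    Assumed precedence: command-line option first, then environment variable.
    """
    env = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--enhancement-dir", default=None)  # hypothetical option
    # Ignore arguments that belong to the rest of the (hypothetical) tool.
    args, _unknown = parser.parse_known_args(argv)
    if args.enhancement_dir is not None:
        return args.enhancement_dir
    return env.get(ENV_VAR)


print(resolve_enhancement_dir(["--enhancement-dir", "ext_a", "app.c"], {}))
print(resolve_enhancement_dir(["app.c"], {ENV_VAR: "ext_b"}))
```

Because only a directory path changes between runs, switching between sets of enhancements requires no rebuild of the base tools, which matches the decoupling the text claims.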
  • FIGURE 1 is a block diagram of a processor implementing an instruction set according to a preferred embodiment of the present invention
  • FIGURE 2 is a block diagram of a pipeline used in the processor according to the embodiment
  • FIGURE 3 shows a configuration manager in a GUI according to the embodiment
  • FIGURE 4 shows a configuration editor in the GUI according to the embodiment
  • FIGURE 5 shows different types of configurability according to the embodiment
  • FIGURE 6 is a block diagram showing the flow of processor configuration in the embodiment
  • FIGURE 7 is a block diagram of an instruction set simulator according to the embodiment.
  • FIGURE 8 is a block diagram of an emulation board for use with a processor configured according to the present invention.
  • FIGURE 9 is a block diagram showing the logical architecture of a configurable processor according to the embodiment.
  • FIGURE 10 is a block diagram showing the addition of a multiplier to the architecture of FIG. 9
  • FIGURE 11 is a block diagram showing the addition of a multiply-accumulate unit to the architecture of FIG. 9;
  • FIGURES 12 and 13 are diagrams showing the configuration of a memory in the embodiment.
  • FIGURES 14 and 15 are diagrams showing the addition of user-defined functional units in the architecture of FIG. 8.
  • FIGURE 16 is a block diagram showing the flow of information between system components in another preferred embodiment
  • FIGURE 17 is a block diagram showing how custom code is generated for the software development tools in the embodiment.
  • FIGURE 18 is a block diagram showing the generation of various software modules used in another preferred embodiment of the present invention.
  • FIGURE 19 is a block diagram of a pipeline structure in a configurable processor according to the embodiment.
  • FIGURE 20 is a state register implementation according to the embodiment.
  • FIGURE 21 is a diagram of additional logic needed to implement the state register implementation in the embodiment.
  • FIGURE 22 is a diagram showing the combination of the next-state output of a state from several semantic blocks and the selection of one as input to a state register according to the embodiment;
  • FIGURE 23 shows logic corresponding to semantic logic according to the embodiment;
  • FIGURE 24 shows the logic for a bit of state when it is mapped to a bit of a user register in the embodiment.
  • the automated processor generation process begins with a configurable processor definition and user-specified modifications thereto, as well as a user-specified application to which the processor is to be configured. This information is used to generate a configured processor taking the user modifications into account and to generate software development tools, e.g., compiler, simulator, assembler and disassembler, etc., for it. Also, the application is recompiled using the new software development tools. The recompiled application is simulated using the simulator to generate a software profile describing the configured processor's performance running the application, and the configured processor is evaluated with respect to silicon chip area usage, power consumption, speed, etc. to generate a hardware profile characterizing the processor circuit implementation. The software and hardware profile are fed back and provided to the user to enable further iterative configuration so that the processor can be optimized for that particular application.
  • An automated processor generation system 10 has four major components as shown in FIG. 1: a user configuration interface 20 through which a user wishing to design a processor enters her configurability and extensibility options and other design constraints; a suite of software development tools 30 which can be customized for a processor designed to the criteria chosen by the user; a parameterized, extensible description of a hardware implementation of the processor 40; and a build system 50 receiving input data from the user interface, generating a customized, synthesizable hardware description of the requested processor, and modifying the software development tools to accommodate the chosen design.
  • the build system 50 additionally generates diagnostic tools to verify the hardware and software designs and an estimator to estimate hardware and software characteristics.
  • Hardware implementation description means one or more descriptions which describe aspects of the physical implementation of a processor design and, alone or in conjunction with one or more other descriptions, facilitate production of chips according to that design.
  • components of the hardware implementation description may be at varying levels of abstraction, from relatively high levels such as hardware description languages through netlists and microcoding to mask descriptions. In this embodiment, however, the primary components of the hardware implementation description are written in an HDL, netlists and scripts.
  • HDL as used herein and in the appended claims is intended to refer to the general class of hardware description languages which are used to describe microarchitectures and the like, and it is not intended to refer to any particular example of such languages.
  • the basis for processor configuration is the architecture 60 shown in FIG. 2.
  • a number of elements of the architecture are basic features which cannot be directly modified by the user. These include the processor controls section 62, the align and decode section 64 (although parts of this section are based on the user-specified configuration), the ALU and address generation section 66, the branch logic and instruction fetch, 68 and the processor interface 70. Other units are part of the basic processor but are user-configurable. These include the interrupt control section 72, the data and instruction address watch sections 74 and 76, the window register file 78, the data and instruction cache and tags sections 80, the write buffers 82 and the timers 84. The remaining sections shown in FIG. 2 are optionally included by the user.
  • a central component of the processor configuration system 10 is the user configuration interface
  • GUI graphical user interface
  • ISS disassembler and instruction set simulator
  • the GUI also accesses a configuration database to get default values and do error checking on user input.
  • a user inputs design parameters into the user configuration interface 20.
  • the automated processor generation system 10 may be a stand-alone system running on a computer system under the control of the user; however, it preferably runs primarily on a system under the control of the manufacturer of the automated processor generation system 10.
  • User access may then be provided over a communication network.
  • the GUI may be provided using a web browser with data input screens written in HTML and Java. This has several advantages, such as maintaining confidentiality of any proprietary back-end software, simplifying maintenance and updating of the back end software, and the like.
  • the user may first log on to the system 10 to prove his identity. Once the user has access, the system displays a configuration manager screen 86 as shown in
  • the configuration manager 86 is a directory listing all of the configurations accessible by the user.
  • the configuration manager 86 in FIG. 3 shows that the user has two configurations, "just intr" and "high prio", the first having already been built, i.e., finalized for production, and the second yet to be built. From this screen 86 the user may build a selected configuration, delete it, edit it, generate a report specifying which configuration and extension options have been chosen for that configuration, or create a new configuration. For those configurations which have been built, such as "just intr", a suite of software development tools 30 customized for it can be downloaded.
  • the configuration editor 88 has an "Options" section menu on the left showing the various general aspects of the processor 60 which can be configured and extended.
  • When an option section is selected, a screen with the configuration options for that section appears on the right, and these options can be set with pull-down menus, memo boxes, check boxes, radio buttons and the like as is known in the art.
  • Although the user can select options and enter data in any order, data is preferably entered into each section sequentially, since there are logical dependencies between the sections; for example, to properly display options in the "Interrupts" section, the number of interrupts must have been chosen in the "ISA Options" section.
  • the following configuration options are available for each section:
  • Target ASIC technology: .18, .25, .35 micron
  • Target operating condition: typical, worst-case
  • Implementation Goals
  • Target speed: arbitrary
  • Gate count: arbitrary
  • Target power: arbitrary
  • Goal prioritization: speed, area, power; speed, power, area
  • ISA Options
  • Processor interface read width (bits): 32, 64, 128
  • Write-buffer entries (address/value pairs): 4, 8, 16, 32
  • Processor Cache
  • Instruction/Data cache size (kB): 1, 2, 4, 8, 16
  • Instruction/Data cache line size (kB): 16, 32, 64
  • Peripheral Components
  • Timers
  • Timer interrupt numbers
  • User exception vector: arbitrary
  • Kernel Exception vector: arbitrary
  • Register window over/underflow vector base: arbitrary
  • Reset vector: arbitrary
  • XTOS start address: arbitrary
  • Design Compiler™: yes, no
  • Place & Route Apollo™: yes, no
  • system 10 may provide options for adding other functional units such as a 32-bit integer multiply/divide unit or a floating point arithmetic unit; a memory management unit; on-chip RAM and ROM options; cache associativity; enhanced DSP and coprocessor instruction set; a write-back cache; multiprocessor synchronization; compiler-directed speculation; and support for additional CAD packages.
  • a definition file such as the one shown in Appendix A which the system 10 uses for syntax checking and the like once the user has selected appropriate options.
  • the automated processor configuration system 10 provides two broad types of configurability 300 to the user as shown in FIG. 5: extensibility 302, which permits the user to define arbitrary functions and structures from scratch, and modifiability 304, which permits the user to select from a predetermined, constrained set of options.
  • Under modifiability, the system permits binary selection 306 of certain features (e.g., whether a MAC16 or a DSP should be added to the processor 60) and parametric specification 308 of other processor features, e.g., number of interrupts and cache size.
  • the RAM and ROM options allow the designer to include scratch pad or firmware on the processor 10 itself.
  • the processor 10 can fetch instructions or read and write data from these memories.
  • the size and placement of the memories is configurable.
  • each of these memories is accessed as an additional set in a set-associative cache. A hit in the memory can be detected by comparison with a single tag entry.
  • the system 10 provides separate configuration options for the interrupt (implementing level 1 interrupts) and the high-priority interrupt option (implementing level 2-15 interrupts and non-maskable interrupts) because each high-priority interrupt level requires three special registers, and these are thus more expensive.
  • the MAC 16 with 40-bit accumulator option adds a 16-bit multiplier/add function with a 40-bit accumulator, eight 16-bit operand registers and a set of compound instructions that combine multiply, accumulate, operand load and address update instructions.
  • the operand registers can be loaded with pairs of 16-bit values from memory in parallel with multiply/accumulate operations.
  • This unit can sustain algorithms with two loads and a multiply/accumulate per cycle.
  • the on-chip debug module (shown at 92 in FIG. 2) is used to access the internal, software-visible state of the processor 60 through the JTAG port 94.
  • the module 92 provides support for exception generation to put the processor 60 in the debug mode; access to all program-visible registers or memory locations; execution of any instruction that the processor 60 is configured to execute; modification of the PC to jump to a desired location in the code; and a utility to allow return to a normal operation mode, triggered from outside the processor 60 via the JTAG port 94.
  • Once the processor 10 enters debug mode, it waits for an indication from the outside world that a valid instruction has been scanned in via the JTAG port 94. The processor then executes this instruction and waits for the next valid instruction.
  • this module 92 can be used to debug the system. Execution of the processor 10 can be controlled via a debugger running on a remote host. The debugger interfaces with the processor via the JTAG port 94 and uses the capability of the on-chip debug module 92 to determine and control the state of the processor 10 as well as to control execution of the instructions.
  • Up to three 32-bit counter/timers 84 may be configured. This entails the use of a 32-bit register which increments each clock cycle, as well as (for each configured timer) a compare register and a comparator which compares the compare register contents with the current clocked register count, for use with interrupts and similar features.
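  • The counter/timer arrangement can be illustrated with a small behavioral model. This is only a sketch in Python, not the hardware described here: the class name CounterTimers and its methods are hypothetical, and the model assumes a match simply reports which timers would raise an interrupt on that cycle.

```python
MASK32 = 0xFFFFFFFF  # the counter and compare registers are 32 bits wide


class CounterTimers:
    """Hypothetical model: one free-running counter shared by up to three timers."""

    def __init__(self, compare_values):
        if len(compare_values) > 3:
            raise ValueError("at most three timers can be configured")
        self.count = 0
        # One compare register per configured timer.
        self.compare = [v & MASK32 for v in compare_values]

    def tick(self):
        """Advance one clock cycle and return the indices of matching timers."""
        self.count = (self.count + 1) & MASK32  # wraps around at 2**32
        return [i for i, c in enumerate(self.compare) if c == self.count]


timers = CounterTimers([3, 5])
fired = [(cycle, timers.tick()) for cycle in range(1, 6)]
```

  • In this sketch, timer 0 matches on cycle 3 and timer 1 on cycle 5; a real implementation would route each match to the configured interrupt.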
  • the counter/timers can be configured as edge-triggered and can generate normal or high-priority internal interrupts.
  • the speculation option provides greater compiler scheduling flexibility by allowing loads to be speculatively moved to control flows where they would not always be executed. Because loads may cause exceptions, such load movement could introduce exceptions into a valid program that would not have occurred in the original. Speculative loads prevent these exceptions from occurring when the load is executed, but provide an exception when the data is required. Instead of causing an exception for a load error, speculative loads reset the valid bit of the destination register (new processor state associated with this option).
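  • The deferred-exception behavior of speculative loads can be sketched as follows. This is an illustrative Python model under stated assumptions, not the processor's implementation: the names SpeculativeRegisters and DeferredLoadError are hypothetical, memory is modeled as a dictionary of valid addresses, and a register that was never loaded is treated as invalid.

```python
class DeferredLoadError(Exception):
    """Raised when a register invalidated by a failed speculative load is read."""


class SpeculativeRegisters:
    def __init__(self, memory):
        self.memory = memory  # maps valid addresses to values
        self.value = {}       # register contents
        self.valid = {}       # per-register valid bit (the new processor state)

    def speculative_load(self, reg, addr):
        """Load without raising; a load error only clears the valid bit."""
        if addr in self.memory:
            self.value[reg] = self.memory[addr]
            self.valid[reg] = True
        else:
            self.valid[reg] = False  # exception is deferred until the data is used

    def read(self, reg):
        """Use the loaded data; this is where a deferred exception surfaces."""
        if not self.valid.get(reg, False):
            raise DeferredLoadError(reg)
        return self.value[reg]


regs = SpeculativeRegisters({0x100: 42})
regs.speculative_load("a2", 0x100)
regs.speculative_load("a3", 0xDEAD)  # bad address, but no exception yet
a2 = regs.read("a2")
```

  • If the program never reads a3, the faulting load has no visible effect, which is what lets the compiler hoist it above a branch.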
  • Although the core processor 60 preferably has some basic pipeline synchronization capability, when multiple processors are used in a system, some sort of communication and synchronization between processors is required. In some cases self-synchronizing communication techniques such as input and output queues are used. In other cases, a shared memory model is used for communication and it is necessary to provide instruction set support for synchronization because shared memory does not provide the required semantics. For example, additional load and store instructions with acquire and release semantics can be added. These are useful for controlling the ordering of memory references in multiprocessor systems where different memory locations may be used for synchronization and data so that precise ordering between synchronization references must be maintained. Other instructions may be used to create semaphore systems known in the art. This instruction set support is provided by the multiprocessor synchronization option.
  • a TIE description uses a number of building blocks to delineate the attributes of new instructions as follows:
  • Instruction field statements (field) are used to improve the readability of the TIE code. Fields are subsets or concatenations of other fields that are grouped together and referenced by a name. The complete set of bits in an instruction is the highest-level superset field inst, and this field can be divided into smaller fields. For example, field x inst[11:8] field y inst[15:12] field xy {x, y} defines two 4-bit fields, x and y, as sub-fields (bits 8-11 and 12-15, respectively) of a highest-level field inst and an 8-bit field xy as the concatenation of the x and y fields.
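  • The effect of these three field statements can be mimicked in a few lines of Python. This is a sketch only: the helper names bits and decode_fields are hypothetical, and the concatenation {x, y} is assumed to follow the Verilog convention in which the first element occupies the most significant bits.

```python
def bits(word, hi, lo):
    """Return word[hi:lo] as an unsigned integer (inclusive bit range)."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_fields(inst):
    """Extract the x, y and xy fields from an instruction word."""
    x = bits(inst, 11, 8)    # field x inst[11:8]
    y = bits(inst, 15, 12)   # field y inst[15:12]
    xy = (x << 4) | y        # field xy {x, y}, x assumed in the upper four bits
    return {"x": x, "y": y, "xy": xy}


fields = decode_fields(0x00A5B3)
```

  • For the example word 0x00A5B3, bits 11:8 hold 0x5 and bits 15:12 hold 0xA, so the sketch yields x = 5, y = 10 and xy = 0x5A.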
  • the statements opcode define opcodes for encoding specific fields. Instruction fields that are intended to specify operands, e.g., registers or immediate constants, to be used by the thus-defined opcodes, must first be defined with field statements and then defined with operand statements.
  • opcode acs op2=4'b0000 CUST0
  • the value of the constant can be generated from the operand, or it can be taken from a previously defined constant table defined as described below.
  • the wire statement defines a set of logical wires named t thirty-two bits wide.
  • the first assign statement after the wire statement specifies that the logical signals driving the logical wires are the offsets4 constant shifted to the right, and the second assign statement specifies that the lower eighteen bits of t are put into the offset field.
  • the very first assign statement directly specifies the value of the offsets4 operand as a concatenation of offset and fourteen replications of its sign bit (bit 17) followed by a shift-left of two bits.
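  • The decode and encode directions described for the offsets4 operand can be checked with a short Python sketch. The function names are hypothetical, values are modeled as 32-bit unsigned integers, and the right shift in the encode direction is assumed to be by two bits so that it undoes the shift-left of two in the decode direction.

```python
MASK32 = 0xFFFFFFFF
MASK18 = 0x3FFFF


def decode_offsets4(offset):
    """Operand value: 18-bit offset sign-extended to 32 bits, then shifted left by 2."""
    offset &= MASK18
    sign = (offset >> 17) & 1
    # Fourteen copies of the sign bit (bit 17) are placed above the 18-bit field.
    extended = ((0x3FFF << 18) | offset) if sign else offset
    return (extended << 2) & MASK32


def encode_offsets4(value):
    """Field value: shift the operand right, then keep the lower eighteen bits."""
    t = (value & MASK32) >> 2  # assumed shift amount: two bits
    return t & MASK18


forward = decode_offsets4(0x3FFFF)   # offset field holding -1
backward = encode_offsets4(forward)  # recovers the original field
```

  • Because the low two bits of the operand are always zero, encoding after decoding returns the original 18-bit field for every field value.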
  • makes use of the table statement to define an array prime of constants (the number following the table name being the number of elements in the table) and uses the operand s as an index into the table prime to encode a value for the operand prime_s (note the use of Verilog™ statements in defining the indexing).
  • the instruction class statement iclass associates opcodes with operands in a common format. All instructions defined in an iclass statement have the same format and operand usage. Before defining an instruction class, its components must be defined, first as fields and then as opcodes and operands.
  • the iclass statement iclass viterbi {adsel, acs} {out arr, in art, in ars} specifies that the opcodes adsel and acs belong to a common class of instructions viterbi which take two register operands art and ars as input and write output to a register operand arr.
  • the instruction semantic statement semantic describes the behavior of one or more instructions using the same subset of Verilog™ used for coding operands. By defining multiple instructions in a single semantic statement, some common expressions can be shared and the hardware implementation can be made more efficient.
  • the variables allowed in semantic statements are operands for opcodes defined in the statement's opcode list, and a single-bit variable for each opcode specified in the opcode list. This variable has the same name as the opcode and evaluates to 1 when the opcode is detected. It is used in the computation section (the Verilog™ subset section) to indicate the presence of the corresponding instruction.
  • TIE code defining a new instruction ADD8_4 which performs additions of four 8-bit operands in a 32-bit word with respective 8-bit operands in another 32-bit word and a new instruction
  • op2, CUST0, arr, art and ars are predefined operands as noted above, and the opcode and iclass statements function as described above.
  • the semantic statement specifies the computations performed by the new instructions.
  • the second line within the semantic statement specifies the computations performed by the new ADD8_4 instruction
  • the third and fourth lines therein specify the computations performed by the new MIN16_2 instruction
  • the last line within the section specifies the result written to the arr register.
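  • The behavior described for ADD8_4 can be modeled in software as follows. This is a sketch under an assumption the text does not state explicitly: each 8-bit lane is assumed to wrap around on overflow with no carry into the neighboring lane. The function name add8_4 is chosen here for illustration.

```python
def add8_4(art, ars):
    """Add four 8-bit lanes of art to the corresponding lanes of ars."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        a = (art >> shift) & 0xFF
        b = (ars >> shift) & 0xFF
        # Keep only 8 bits per lane so a carry never leaks into the next lane.
        result |= ((a + b) & 0xFF) << shift
    return result


arr = add8_4(0x01FF7F10, 0x01010101)
```

  • With these inputs the lanes are 0x01+0x01, 0xFF+0x01, 0x7F+0x01 and 0x10+0x01, giving 0x02008011; the second lane wraps to 0x00 without disturbing the top lane.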
  • the build system 50 takes over. As shown in FIG. 5, the build system 50 receives a configuration specification constituted by the parameters set by the user and extensible features designed by the user, and combines them with additional parameters defining the core processor architecture, e.g., features not modifiable by the user, to create a single configuration specification 100 describing the entire processor. For example, in addition to the configuration settings 102 chosen by the user, the build system 50 might add parameters specifying the number of physical address bits for the processor's physical address space, the location of the first instruction to be executed by the processor 60 after reset, and the like.
  • ISA Instruction Set Architecture
  • the configuration specification 100 also includes an ISA package containing TIE language statements specifying the base ISA, any additional packages which might have been selected by the user such as a coprocessor package 98 (see FIG. 2) or a DSP package, and any TIE extensions supplied by the user. Additionally, the configuration specification 100 may have a number of statements setting flags indicative of whether certain structural features are to be included in the processor 60. For example,
  • IsaUseInterrupt 1 IsaUseHighPriorityInterrupt 0
  • IsaUseException 1 indicates that the processor will include the on-chip debugging module 92, interrupt facilities 72 and exception handling, but not high-priority interrupt facilities. Using the configuration specification 100, the following can be automatically generated as will be shown below:
  • the build system 50 determines the configuration automatically.
  • the designer can specify goals for the processor 60. For example, clock rate, area, cost, typical power consumption, and maximum power consumption might be goals. Since some of the goals conflict (e.g., often performance can be increased only by increasing area or power consumption or both), the build system 50 also takes a priority ordering for the goals.
  • the build system 50 then consults a search engine 106 to determine the set of configuration options available and determines how to set each option from an algorithm that attempts to simultaneously achieve the input goals.
  • the search engine 106 includes a database that has entries that describe the effect on the various metrics. Entries can specify that a particular configuration setting has an additive, multiplicative, or limiting effect on a metric. Entries can also be marked as requiring other configuration options as prerequisites, or as being incompatible with other options.
  • the simple branch prediction option can specify a multiplicative or additive effect on Cycles Per Instruction (CPI, a determinant of performance), a limit on clock rate, an additive effect on area, and an additive effect on power. It can be marked as incompatible with a fancier branch predictor, and dependent on setting the instruction fetch queue size to at least two entries. The value of these effects may be a function of a parameter, such as branch prediction table size.
  • the database entries are represented by functions that can be evaluated.
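  • One way to picture such evaluable database entries is sketched below in Python. Everything here is hypothetical: the Effect record, the metric names, the branch_prediction function and all numeric values are invented for illustration, and a "limit" effect is assumed to act as an upper bound on the metric.

```python
from dataclasses import dataclass


@dataclass
class Effect:
    metric: str    # e.g. "cpi", "clock_mhz", "area", "power"
    kind: str      # "add", "mul" or "limit"
    amount: float


def branch_prediction(table_size):
    """Hypothetical entry: effects are a function of a parameter (table size)."""
    return [
        Effect("cpi", "mul", 0.9),
        Effect("clock_mhz", "limit", 220.0),
        Effect("area", "add", table_size * 0.01),
        Effect("power", "add", 2.0),
    ]


def apply_effects(baseline, effects):
    """Return a copy of the baseline metrics with each effect applied in order."""
    metrics = dict(baseline)
    for e in effects:
        if e.kind == "add":
            metrics[e.metric] += e.amount
        elif e.kind == "mul":
            metrics[e.metric] *= e.amount
        elif e.kind == "limit":
            metrics[e.metric] = min(metrics[e.metric], e.amount)
        else:
            raise ValueError(f"unknown effect kind: {e.kind}")
    return metrics


baseline = {"cpi": 1.5, "clock_mhz": 250.0, "area": 100.0, "power": 50.0}
estimate = apply_effects(baseline, branch_prediction(table_size=256))
```

  • Prerequisite and incompatibility markings are left out of this sketch; they could be modeled as sets of option names checked before an entry is applied.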
  • a simple knapsack packing algorithm considers each option in sorted order of value divided by cost and accepts any option specification that increases value while keeping cost below a specified limit. So, for example, to maximize performance while keeping power below a specified value, the options would be sorted by performance divided by power and each option that increases performance that can be configured without exceeding the power limit is accepted. More sophisticated knapsack algorithms provide some amount of backtracking.
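  • The simple greedy variant described above can be sketched in Python as follows. The function name greedy_select, the option names and all value and cost figures are hypothetical; the sketch assumes every option has a positive cost and does no backtracking.

```python
def greedy_select(options, cost_limit):
    """Pick options in order of value/cost while the total cost stays within the limit.

    options is a list of (name, value, cost) tuples with cost > 0.
    """
    chosen, total_value, total_cost = [], 0.0, 0.0
    ranked = sorted(options, key=lambda o: o[1] / o[2], reverse=True)
    for name, value, cost in ranked:
        # Accept only options that increase value and still fit the budget.
        if value > 0 and total_cost + cost <= cost_limit:
            chosen.append(name)
            total_value += value
            total_cost += cost
    return chosen, total_value, total_cost


# Value is a performance gain, cost is added power (made-up numbers).
options = [
    ("multiplier", 30, 20),
    ("bigger_cache", 40, 50),
    ("branch_predictor", 15, 5),
    ("mac16", 25, 30),
]
chosen, value, cost = greedy_select(options, cost_limit=60)
```

  • Here the ratio ordering is branch_predictor, multiplier, mac16, bigger_cache; the first three fit within the limit of 60 and the cache is rejected. A backtracking variant might instead drop mac16 to test whether another combination scores higher.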
  • a very different sort of algorithm for determining the configuration from goals and the design database is based on simulated annealing. A random initial set of parameters is used as the starting point, and then changes of individual parameters are accepted or rejected by evaluating a global utility function. Improvements in the utility function are always accepted while negative changes are accepted probabilistically based on a threshold that declines as the optimization proceeds. In this system the utility function is constructed from the input goals. For example, given the goals Performance > 200,
  • the examples we have given have used hardware goals that are general and not dependent on the particular algorithm being run on the processor 60.
  • the algorithms described can also be used to select configurations well suited for specific user programs.
  • the user program can be run with a cache accurate simulator to measure the number of cache misses for different types of caches with different characteristics such as different sizes, different line sizes and different set associativities.
  • the results of these simulations can be added to the database used by the search algorithms 106 described to help select the hardware implementation description 40.
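  • A minimal version of such a cache simulation is sketched below in Python. It is not the simulator of the embodiment: the function count_misses is hypothetical, the replacement policy is assumed to be least-recently-used, and the address trace is synthetic rather than taken from a real user program.

```python
from collections import OrderedDict


def count_misses(addresses, cache_bytes, line_bytes, ways):
    """Count misses for a set-associative cache with LRU replacement (assumed)."""
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        lines_in_set = sets[line % num_sets]
        tag = line // num_sets
        if tag in lines_in_set:
            lines_in_set.move_to_end(tag)          # mark as most recently used
        else:
            misses += 1
            if len(lines_in_set) >= ways:
                lines_in_set.popitem(last=False)   # evict least recently used
            lines_in_set[tag] = True
    return misses


# Synthetic trace: sweep a 1 KB array of 4-byte words twice.
trace = [i * 4 for i in range(256)] * 2
results = {
    (1024, 16, 1): count_misses(trace, 1024, 16, 1),
    (512, 16, 1): count_misses(trace, 512, 16, 1),
    (1024, 32, 2): count_misses(trace, 1024, 32, 2),
}
```

  • Each (size, line size, associativity) entry in results is the kind of measurement that could be recorded in the database for the search to weigh against area and power.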
  • the user algorithm can be profiled for the presence of certain instructions that can be optionally implemented in hardware. For example, if the user algorithm spends a significant time doing multiplications, the search engine 106 might automatically suggest including a hardware multiplier
  • Such algorithms need not be limited to considering one user algorithm
  • the user can feed a set of algorithms into the system, and the search engine 106 can select a configuration that is useful on average to the set of user programs.
  • the search algorithms can also be used to automatically select or suggest to the users possible TIE extensions. Given the input goals and given examples of user programs written perhaps in the C programming language, these algorithms would suggest potential TIE extensions.
  • the pattern matcher would discover that the user in two different locations adds two numbers and shifts the result two bits to the left.
  • the system would add to a database the possibility of generating a TIE instruction that adds two numbers and shifts the result two bits to the left.
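  • The add-then-shift pattern search can be illustrated with a toy matcher. This sketch is only an analogy: it scans Python source with the standard ast module as a stand-in for the C programs mentioned above, and the function name count_add_shift is hypothetical.

```python
import ast


def count_add_shift(source, shift=2):
    """Count expressions of the form (a + b) << shift in the given source text."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.BinOp)
            and isinstance(node.op, ast.LShift)
            and isinstance(node.right, ast.Constant)
            and node.right.value == shift
            and isinstance(node.left, ast.BinOp)
            and isinstance(node.left.op, ast.Add)
        ):
            count += 1
    return count


program = "p = (a + b) << 2\nq = (c + d) << 2\nr = (a + b) << 3\n"
occurrences = count_add_shift(program)
```

  • The two matches would justify recording a candidate instruction that adds two numbers and shifts the result two bits to the left; the shift by three on the last line is a different pattern and is not counted.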
  • the build system 50 keeps track of many possible TIE instructions along with a count of how many times they appear. Using a profiling tool, the system 50 also keeps track of how often each instruction is executed during the total execution of the algorithm. Using a hardware estimator, the system 50 keeps track of how expensive in hardware it would be to implement each potential TIE instruction. These numbers are fed into the search heuristic algorithm to select a set of potential TIE instructions that maximize the input goals; goals such as performance, code size, hardware complexity and the like.
  • Similar but more powerful algorithms are used to discover potential TIE instructions with state.
  • Several different algorithms are used to detect different types of opportunities.
  • One algorithm uses a compiler-like tool to scan the user program and detect if the user program requires more registers than are available on the hardware. As known to practitioners in the art, this can be detected by counting the number of register spills and restores in the compiled version of the user code.
  • the compiler-like tool suggests to the search engine a coprocessor with additional hardware registers 98 but supporting only the operations used in the portions of the user's code that have many spills and restores.
  • the tool is responsible for informing the database used by the search engine 106 of an estimate of the hardware cost of the coprocessor as well as an estimate of how the user's algorithm performance is improved.
  • the search engine 106 makes a global decision of whether or not the suggested coprocessor 98 leads to a better configuration.
  • a compiler-like tool checks if the user program uses bit-mask operations to ensure that certain variables are never larger than certain limits. In this situation, the tool suggests to the search engine 106 a co-processor 98 using data types conforming to the user limits (for example, 12 bit or 20 bit or any other size integers).
  • a compiler-like tool discovers that much time is spent operating on user defined abstract data types. If all the operations on the data type are suitable for TIE, the algorithm proposes to the search engine 106 implementing all the operations on the data type with a
  • one signal is generated for each opcode defined in the configuration specification.
  • the generation of register interlock and pipeline stall signals has also been automated. This logic is also generated based on the information in the configuration specification. Based on register usage information contained in the iclass statement and the latency of the instruction, the generated logic inserts a stall (or bubble) when the source operand of the current instruction depends on the destination operand of a previous instruction which has not completed. The stall mechanism itself is implemented as part of the core hardware.
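  • The dependency check behind such an interlock can be sketched in a few lines of Python. This is a behavioral illustration only: the function needs_stall and the representation of in-flight instructions as (destination register, cycles remaining) pairs are assumptions made for the example.

```python
def needs_stall(current_sources, in_flight):
    """Return True if any source register is the destination of an unfinished instruction.

    in_flight is a list of (dest_reg, cycles_remaining) pairs for earlier instructions.
    """
    busy = {dest for dest, remaining in in_flight if remaining > 0}
    return any(src in busy for src in current_sources)


# a4 is still being produced (2 cycles left); a5 has already completed.
in_flight = [("a4", 2), ("a5", 0)]
stall_on_a4 = needs_stall(["a4", "a6"], in_flight)
stall_on_a5 = needs_stall(["a5"], in_flight)
```

  • In the generated hardware the source and destination sets would come from the in and out operands of the iclass statement, and the remaining-cycle count from the instruction latency.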
  • the instruction decode signals and the illegal instruction signal are available as outputs of the decode module and as inputs to the hand-written processor logic.
  • this embodiment uses a Verilog™ description of the configurable processor 60 enhanced with a Perl-based preprocessor language.
  • Perl is a full-featured language including complex control structures, subroutines, and I/O facilities.
  • the preprocessor which in an embodiment of the present invention is called TPP (as shown in the source listing in Appendix B, TPP is itself a Perl program), scans its input, identifies certain lines as preprocessor code (those prefixed by a semicolon for TPP) written in the preprocessor language (Perl for TPP), and constructs a program consisting of the extracted lines and statements to generate the text of the other lines.
  • the non-preprocessor lines may have embedded expressions in whose place expressions generated as a result of the TPP processing are substituted.
  • the resultant program is then executed to produce the source code, i.e., Verilog™ code for describing the detailed processor logic 40 (as will be seen below, TPP is also used to configure the software development tools 30).
  • TPP is a powerful preprocessing language because it permits the inclusion of constructs such as configuration specification queries, conditional expressions and iterative structures in the Verilog™ code, as well as implementing embedded expressions dependent on the configuration specification 100 in the Verilog™ code as noted above.
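  • The two-phase idea (build a program from the template, then run it to emit text) can be sketched with a toy preprocessor. This is not TPP: it is written in Python rather than Perl, the function name tpp_like is hypothetical, the "; end" marker used to close a block is an invention of this sketch, and embedded expressions are assumed to be delimited by backquotes.

```python
import re


def tpp_like(template, env):
    """Toy preprocessor: ';' lines are code, other lines are emitted as text."""
    program, depth = [], 0
    for line in template.splitlines():
        if line.startswith(";"):
            stmt = line[1:].strip()
            if stmt == "end":                      # sketch-only block terminator
                depth -= 1
                continue
            if stmt.startswith(("else", "elif")):
                depth -= 1
            program.append("    " * depth + stmt)
            if stmt.endswith(":"):
                depth += 1
        else:
            program.append("    " * depth + f"_emit({line!r})")

    out, scope = [], dict(env)

    def emit(text):
        # Replace each `expr` with its value in the current scope.
        out.append(re.sub(r"`([^`]*)`", lambda m: str(eval(m.group(1), scope)), text))

    scope["_emit"] = emit
    exec("\n".join(program), scope)                # run the constructed program
    return "\n".join(out)


template = """; n = config["ninterrupts"]
wire [`n-1`:0] srInterruptEn;
; if config["IsaMemoryOrder"] == "LittleEndian":
// little endian
; else:
// big endian
; end"""
config = {"ninterrupts": 4, "IsaMemoryOrder": "LittleEndian"}
generated = tpp_like(template, {"config": config})
```

  • With this hypothetical configuration the sketch emits a 4-bit wire declaration followed by the little-endian comment line; changing either configuration value changes the generated text without editing the template.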
  • IsaMemoryOrder is a flag set in the configuration specification 100, and $endian is a TPP variable to be used later in generating the Verilog™ code.
  • a TPP conditional expression might be ; if (config_get_value("IsaMemoryOrder") eq "LittleEndian")
  • TPP constructs such as
  • TPP code can be embedded into Verilog™ expressions such as wire [`$ninterrupts-1`:0] srInterruptEn; xtscenflop #(`$ninterrupts`) srintrenreg (srInterruptEn, srDataIn_W[`$ninterrupts-1`:0], srIntrEnWEn, !cReset, CLK); where:
  • $ninterrupts defines the number of interrupts and determines the width (in terms of bits) of the xtscenflop module (a flip-flop primitive module); srInterruptEn is the output of the flip-flop, defined to be a wire of appropriate number of bits; srDataIn_W is the input to the flip-flop, but only relevant bits are input based on number of interrupts; srIntrEnWEn is the write enable of the flip-flop; cReset is the clear input to the flip-flop; and CLK is the input clock to the flip-flop. For example, given the following input to TPP:
  • TPP generates wire [31:0] srCCount; wire ccountWEn; till CCOUNT Register
  • the HDL description 114 thus generated is used to synthesize hardware for processor implementation using, e.g., the Design Compiler™ manufactured by Synopsys Corporation in block 122.
  • the result is then placed and routed using, e.g., Silicon Ensemble™ by Cadence Corporation or Apollo™ by Avant! Corporation in block 128. Once the components have been routed, the result can be used for wire back-annotation and timing verification in block 132 using, e.g., PrimeTime™ by Synopsys.
  • the product of this process is a hardware profile 134 which can be used by the user to provide further input to the configuration capture routine 20 for further configuration iterations.
  • one of the outcomes of configuring the processor 60 is a set of customized HDL files from which specific gate-level implementation can be obtained by using any of a number of commercial synthesis tools.
  • One such tool is Design Compiler™ from Synopsys.
  • This embodiment provides the scripts necessary to automate the synthesis process in the customer environment. The challenge in providing such scripts is to support a wide variety of synthesis methodologies and the different implementation objectives of users. To address the first challenge, this embodiment breaks the scripts into smaller, functionally complete scripts.
  • One such example is to provide a read script that can read all HDL files relevant to the particular processor configuration 60, a timing constraint script to set the unique timing requirements of the processor 60, and a script to write out synthesis results in a form that can be used for the placement and routing of the gate-level netlist.
  • To address the second challenge, this embodiment provides a script for each implementation objective.
  • One such example is to provide a script for achieving the fastest cycle time, a script for achieving minimum silicon area, and a script for achieving minimum power consumption.
  • Scripts are used in other phases of processor configuration as well.
  • A simulator can be used to verify the correct operation of the processor 60 as described above in connection with block 132. This is often accomplished by running many test programs, or diagnostics, on the simulated processor 60.
  • Running a test program on the simulated processor 60 can require many steps, such as generating an executable image of the test program, generating a representation of this executable image which can be read by the simulator 112, creating a temporary place where the results of the simulation can be gathered for future analysis, analyzing the results of the simulation, and so on. In the prior art this was done with a number of throwaway scripts.
  • These scripts had some built-in knowledge of the simulation environment, such as which HDL files should be included, where those files could be found in the directory structure, which files are required for the test bench, and so on.
  • The preferred mechanism is to write a script template which is configured by parameter substitution.
  • The configuration mechanism also uses TPP to generate a list of the files that are required for simulation.
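The template-and-substitution idea can be sketched as follows. This is only an illustration using Python's `string.Template`, not the TPP tool itself; the placeholder names (`config_name`, `hdl_files`, `results_dir`) and the `run_simulator` command are hypothetical.

```python
from string import Template

# Hypothetical simulation-script template; $-placeholders are filled in
# separately for each processor configuration.
SIM_SCRIPT_TEMPLATE = Template(
    "#!/bin/sh\n"
    "# Run diagnostics for configuration $config_name\n"
    "mkdir -p $results_dir\n"
    "run_simulator $hdl_files > $results_dir/sim.log\n"
)

def render_sim_script(config_name, hdl_files, results_dir):
    """Return a script specialized for one processor configuration."""
    return SIM_SCRIPT_TEMPLATE.substitute(
        config_name=config_name,
        hdl_files=" ".join(hdl_files),   # generated file list goes here
        results_dir=results_dir,
    )

script = render_sim_script("cfg_a", ["core.v", "cache.v"], "/tmp/cfg_a")
```

Because the file list and directories are parameters rather than built-in knowledge, the same template can serve every configuration.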
  • The final step in the process of converting an RTL description to a hardware implementation is to use place and route (P&R) software to convert the abstract netlist into a geometrical representation.
  • The P&R software analyzes the connectivity of the netlist and decides upon the placement of the cells. It then tries to draw the connections between all the cells.
  • The clock net usually deserves special attention and is routed as a last step. This process can be helped by providing the tools with some information, such as which cells are expected to be close together (known as soft grouping), the relative placement of cells, which nets are expected to have small propagation delays, and so on.
  • The configuration mechanism produces a set of scripts or input files for the P&R software.
  • These scripts contain information as described above, such as relative placements for cells.
  • The scripts also contain information such as how many supply and ground connections are required, how these should be distributed along the boundary, etc.
  • The scripts are generated by querying a database that contains information on how many soft groups to create and what cells should be contained in them, which nets are timing critical, etc. These parameters change based on which options have been selected.
  • These scripts must be configurable depending on the tools to be used to do the place and route.
  • The configuration mechanism can request more information from the user and pass it to the P&R scripts.
  • The interface can ask the user for the desired aspect ratio of the final layout, how many levels of buffering should be inserted in the clock tree, which side the input and output pins should be located on, the relative or absolute placement of these pins, the width and location of the power and ground straps, and so on. These parameters would then be passed on to the P&R scripts to generate the desired layout.
  • This script reads gated clock information produced by the configuration agent based on which options are selected.
  • The Perl script is run once the design has been placed and routed but before final clock tree synthesis is done. Further improvement can be made to the profile process described above. Specifically, we will describe a process by which the user can obtain similar hardware profile information almost instantaneously without spending hours running those CAD tools. This process has several steps.
  • The first step in this process is to partition the set of all configuration options into groups of orthogonal options such that the effect of an option in a group on the hardware profile is independent of the options in any other group. For example, the impact of the MAC 16 unit on the hardware profile is independent of any other options. So, an option group with only the MAC 16 option is formed. A more complicated example is an option group containing interrupt options, high-level interrupt options and timer options, since the impact on the hardware profile is determined by the particular combination of these options.
  • The second step is to characterize the hardware profile impact of each option group. The characterization is done by obtaining the hardware profile impact for various combinations of options in the group. For each combination, the profile is obtained using the previously described process in which an actual implementation is derived and its hardware profile is measured.
  • Such information is stored in an estimation database.
  • The last step is to derive specific formulae for computing the hardware profile impact of particular combinations of options in the option groups using curve fitting and interpolation techniques. Depending on the nature of the options, different formulae are used. For example, since each additional interrupt vector adds about the same logic to the hardware, a linear function is used to model its hardware impact. In another example, having a timer unit requires the high-priority interrupt option, so the formula for the hardware impact of the timer option is a conditional formula involving several options.
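As a rough illustration of the curve-fitting step, the sketch below fits a linear model to characterization samples for the interrupt option using ordinary least squares. The sample values and the function names are invented for the example; the text does not specify the fitting method beyond "curve fitting and interpolation".

```python
def fit_linear(points):
    """Least-squares fit of y = a + b*x to a list of (x, y) samples."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# Hypothetical entries from the estimation database:
# (number of interrupts, measured gate count).
samples = [(1, 1200), (2, 1400), (4, 1800), (8, 2600)]
base, per_interrupt = fit_linear(samples)

def estimate_interrupt_area(n_interrupts):
    """Estimated gate count for a configuration with n_interrupts."""
    return base + per_interrupt * n_interrupts
```

Once the coefficients are stored, an estimate for an uncharacterized interrupt count is a single multiply-add rather than a full synthesis run.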
  • The quick evaluation system can be easily extended to provide the user with suggestions on how to modify a configuration to further optimize the processor.
  • One such example is to associate each configuration option with a set of numbers representing the incremental impact of the option on various cost metrics such as area, delay and power.
  • Computing the incremental cost impact for a given option is made easy with the quick evaluation system. It simply involves two calls to the evaluation system, with and without the option. The difference in the costs for the two evaluations represents the incremental impact of the option.
  • For example, the incremental area impact of the MAC 16 option is computed by evaluating the area cost of two configurations, with and without the MAC 16 option. The difference is then displayed with the MAC 16 option in the interactive configuration system.
  • Such a system can guide the user toward an optimal solution through a series of single-step improvements.
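The two-call computation of incremental impact might look like the following sketch. The cost table, the base area, and the purely additive `estimate_area` function are hypothetical stand-ins for the quick evaluation system; a real estimator would use per-group formulae rather than a flat sum.

```python
# Hypothetical per-option area costs (gates), standing in for the
# estimation database described above.
AREA_COST = {"mac16": 5000, "timer": 800, "interrupts": 1500}
BASE_AREA = 25000

def estimate_area(options):
    """Quick estimate: base area plus the cost of each enabled option."""
    return BASE_AREA + sum(AREA_COST[o] for o in options)

def incremental_impact(options, option):
    """Difference between two evaluations, with and without `option`."""
    with_opt = set(options) | {option}
    without_opt = set(options) - {option}
    return estimate_area(with_opt) - estimate_area(without_opt)
```

For instance, `incremental_impact({"timer"}, "mac16")` evaluates the configuration twice and reports the area attributable to the MAC 16 option alone.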
  • This embodiment of this invention configures software development tools 30 so that they are specific to the processor.
  • The configuration process begins with software tools 30 that can be ported to a variety of different systems and instruction set architectures. Such retargetable tools have been widely studied and are well-known in the art.
  • This embodiment uses the GNU family of tools, which is free software, including for example, the GNU C compiler, GNU assembler, GNU debugger, GNU linker, GNU profiler, and various utility programs. These tools 30 are then automatically configured by generating portions of the software directly from the ISA description and by using TPP to modify portions of the software that are written by hand.
  • The GNU C compiler is configured in several different ways.
  • The ISA description defines the sets of constant values that can be used in immediate fields of various instructions. For each immediate field, a predicate function is generated to test if a particular constant value can be encoded in the field. The compiler uses these predicate functions when generating code for the processor 60. Automating this aspect of the compiler configuration eliminates an opportunity for inconsistency between the ISA description and the compiler, and it enables changing the constants in the ISA with minimal effort.
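A minimal sketch of generating one predicate per immediate field is shown below. The field names and legal-value sets are hypothetical, and the predicates are built as Python closures for brevity, whereas the embodiment generates them as part of the compiler.

```python
# Hypothetical immediate-field descriptions: field name -> legal constants.
IMMEDIATE_FIELDS = {
    "simm8": range(-128, 128),        # signed 8-bit immediate
    "uimm4x4": range(0, 64, 4),       # 4-bit field scaled by 4
    "tableconst": (1, 2, 4, 8, 16),   # values looked up from a small table
}

def make_predicate(legal_values):
    """Build a function that tests whether a constant fits the field."""
    legal = frozenset(legal_values)

    def can_encode(value):
        return value in legal

    return can_encode

# One predicate per field, derived mechanically from the description.
PREDICATES = {name: make_predicate(vals)
              for name, vals in IMMEDIATE_FIELDS.items()}
```

Changing a constant set in the description changes the predicate automatically, which is the consistency property the paragraph above is after.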
  • Several aspects of the compiler are configured via preprocessing with TPP.
  • Corresponding parameters in the compiler are set via TPP.
  • The compiler has a flag variable to indicate whether the target processor 60 uses big endian or little endian byte ordering, and this variable is set automatically using a TPP command that reads the endianness parameter from the configuration specification 100.
  • TPP is also used to conditionally enable or disable hand-coded portions of the compiler which generate code for optional ISA packages, based on whether the corresponding packages are enabled in the configuration specification 100. For example, the code to generate multiply/accumulate instructions is only included in the compiler if the configuration specification includes the MAC 16 option 90.
  • The compiler is also configured to support designer-defined instructions specified via the TIE language. There are two levels of this support. At the lowest level, the designer-defined instructions are available as macros, intrinsic functions, or inline (extrinsic) functions in the code being compiled.
  • This embodiment of this invention generates a C header file defining inline functions as "inline assembly" code (a standard feature of the GNU C compiler). Given the TIE specification of the designer-defined opcodes and their corresponding operands, generating this header file is a straightforward process of translating to the GNU C compiler's inline assembly syntax.
  • An alternative implementation creates a header file containing C preprocessor macros that specify the inline assembly instructions.
  • Yet another alternative uses TPP to add intrinsic functions directly into the compiler.
  • The second level of support for designer-defined instructions is provided by having the compiler automatically recognize opportunities for using the instructions.
  • These TIE instructions could be directly defined by the user or created automatically during the configuration process.
  • Prior to compiling the user application, the TIE code is automatically examined and converted into C equivalent functions. This is the same step used to allow fast simulation of TIE instructions.
  • The C equivalent functions are partially compiled into a tree-based intermediate representation used by the compiler.
  • The representation for each TIE instruction is stored in a database.
  • Part of the compilation process is a pattern matcher.
  • The user application is compiled into the tree-based intermediate representation.
  • The pattern matcher walks bottom-up every tree in the user program.
  • The pattern matcher checks if the intermediate representation rooted at the current point matches any of the TIE instructions in the database. If there is a match, the match is noted. After the walk of each tree is finished, the set of maximally sized matches is selected. Each maximal match in the tree is replaced with the equivalent TIE instruction.
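The matching idea can be sketched on a toy tree representation. Note the assumptions: trees are nested tuples `(operator, operand, ...)`, strings starting with `?` are pattern wildcards, the two database entries are invented, and the selection is a simplified top-down greedy variant (largest pattern at each node) rather than the bottom-up walk with a separate maximal-match selection described above.

```python
def match(pattern, tree, bindings):
    """Match `tree` against `pattern`, recording wildcard bindings."""
    if isinstance(pattern, str):
        if pattern.startswith("?"):
            if pattern in bindings:
                return bindings[pattern] == tree
            bindings[pattern] = tree
            return True
        return pattern == tree
    return (isinstance(tree, tuple) and len(tree) == len(pattern)
            and all(match(p, t, bindings) for p, t in zip(pattern, tree)))

def size(pattern):
    """Number of operator nodes covered by a pattern."""
    if isinstance(pattern, str):
        return 0
    return 1 + sum(size(p) for p in pattern[1:])

# Hypothetical database: instruction name -> tree pattern.
TIE_DB = {
    "MULADD": ("add", ("mul", "?a", "?b"), "?c"),
    "SQUARE": ("mul", "?a", "?a"),
}

def rewrite(tree):
    """Replace the largest match at each node, then recurse into operands."""
    if isinstance(tree, str):
        return tree
    best = None
    for name, pat in TIE_DB.items():
        found = {}
        if match(pat, tree, found) and (best is None or size(pat) > size(best[1])):
            best = (name, pat, found)
    if best is not None:
        name, _, found = best
        return (name,) + tuple(rewrite(found[k]) for k in sorted(found))
    return (tree[0],) + tuple(rewrite(child) for child in tree[1:])
```

Rewriting `("add", ("mul", "x", "y"), ("mul", "z", "z"))` with this sketch yields a `MULADD` node whose third operand has itself been replaced by `SQUARE`.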
  • The algorithm described above will automatically recognize opportunities to use stateless TIE instructions. Additional approaches can also be used to automatically recognize opportunities to use TIE instructions with state. A previous section described algorithms for automatically selecting potential TIE instructions with state. The same algorithms are used to automatically use the TIE instructions in C or C++ applications.
  • When a TIE coprocessor has been defined to have more registers but a limited set of operations, regions of code are scanned to see if they suffer from register spilling and if those regions only use the set of available operations. If such regions are found, the code in those regions is automatically changed to use the coprocessor instructions and registers 98. Conversion operations are generated at the boundaries of the region to move the data in and out of the coprocessor 98. Similarly, if a TIE coprocessor has been defined to work on different size integers, regions of the code are examined to see if all data in the region is accessed as if it were the different size. For matching regions, the code is changed and glue code is added at the boundaries. Similarly, if a TIE coprocessor 98 has been defined to implement a C++ abstract data type, all the operations in that data type are replaced with the TIE coprocessor instructions.
  • Suggesting TIE instructions automatically and utilizing TIE instructions automatically are both useful independently.
  • Suggested TIE instructions can also be used manually by the user via the intrinsic mechanism, and the utilizing algorithms can be applied to TIE instructions or coprocessors 98 designed manually.
  • The compiler needs to know the potential side effects of the designer-defined instructions so that it can optimize and schedule these instructions.
  • Traditional compilers optimize user code in order to maximize desired characteristics such as run-time performance, code size or power consumption.
  • Optimizations include things such as rearranging instructions or replacing certain instructions with other, semantically equivalent instructions.
  • In order to perform optimizations well, the compiler must know how every instruction affects different portions of the machine. Two instructions that read and write different portions of the machine state can be freely reordered. Two instructions that access the same portion of the machine state cannot always be reordered.
  • TIE instructions are conservatively assumed to read and write all the state of the processor 60. This allows the compiler to generate correct code but limits the ability of the compiler to optimize code in the presence of TIE instructions.
  • A tool automatically reads the TIE definition and for each TIE instruction discovers which state is read or written by said instruction. This tool then modifies the tables used by the compiler's optimizer to accurately model the effect of each TIE instruction.
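The effect of such per-instruction state tables on reordering can be sketched as below. The instruction names, state names, and table layout are hypothetical; the point is only that an instruction missing from the table falls back to the conservative "reads and writes everything" assumption.

```python
# Hypothetical state usage per instruction, as a tool might extract it
# from the TIE definitions.
STATE_USE = {
    "ACC_ADD": {"reads": {"acc"}, "writes": {"acc"}},
    "LUT_LOAD": {"reads": {"lut"}, "writes": set()},
    "ACC_CLEAR": {"reads": set(), "writes": {"acc"}},
}
ALL_STATE = {"acc", "lut", "regfile", "memory"}

def state_use(op):
    """Unknown instructions conservatively read and write all state."""
    return STATE_USE.get(op, {"reads": ALL_STATE, "writes": ALL_STATE})

def can_reorder(op_a, op_b):
    """True if neither instruction writes state the other reads or writes."""
    a, b = state_use(op_a), state_use(op_b)
    conflict = (a["writes"] & (b["reads"] | b["writes"])) or (b["writes"] & a["reads"])
    return not conflict
```

With the table in place, `ACC_ADD` and `LUT_LOAD` may be swapped freely, while anything paired with an instruction absent from the table stays in program order.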
  • The machine-dependent portions of the assembler 110 include both automatically generated parts and hand-coded parts configured with TPP. Some of the features common to all configurations are supported with code written by hand. However, the primary task of the assembler 110 is to encode machine instructions, and instruction encoding and decoding software can be generated automatically from the ISA description.
  • This embodiment of this invention groups the software to perform those tasks into a separate software library.
  • This library is generated automatically using the information in the ISA description.
  • The library defines an enumeration of the opcodes, a function to efficiently map strings for opcode mnemonics onto members of the enumeration (stringToOpcode), and tables that for each opcode specify the instruction length (instructionLength), number of operands (numberOfOperands), operand fields, operand types (i.e., register or immediate) (operandType), binary encoding (encodeOpcode), and mnemonic string (opcodeName).
  • For each operand field, the library provides accessor functions to encode (fieldSetFunction) and decode (fieldGetFunction) the corresponding bits in the instruction word. All of this information is readily available in the ISA description; generating the library software is merely a matter of translating the information into executable C code. For example, the instruction encodings are recorded in a C array variable where each entry is the encoding for a particular instruction, produced by setting each opcode field to the value specified for that instruction in the ISA description; the encodeOpcode function simply returns the array value for a given opcode.
  • The library also provides a function to decode the opcode in a binary instruction (decodeInstruction).
  • This function is generated as a sequence of nested switch statements, where the outermost switch tests the subopcode field at the top of the opcode hierarchy, and the nested switch statements test the subopcode fields progressively lower in the opcode hierarchy.
  • The generated code for this function thus has the same structure as the opcode hierarchy itself.
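The shape of such a library can be illustrated with a toy 16-bit instruction format. Everything specific here is an assumption made for the example: the field layout, the three opcodes, and the Python function names, which only loosely mirror the C functions named above. The embodiment generates C, and its decode function uses nested switch statements where this sketch uses nested `if` tests.

```python
# Hypothetical instruction fields: name -> (low bit, width).
FIELDS = {"op0": (0, 4), "op1": (4, 4), "r": (8, 4), "s": (12, 4)}

def field_get(word, name):
    """Decode one field from an instruction word."""
    lo, width = FIELDS[name]
    return (word >> lo) & ((1 << width) - 1)

def field_set(word, name, value):
    """Return `word` with one field replaced by `value`."""
    lo, width = FIELDS[name]
    mask = ((1 << width) - 1) << lo
    return (word & ~mask) | ((value << lo) & mask)

# Hypothetical opcode hierarchy: sub-opcode field values selecting each opcode.
OPCODES = {
    "ADD": {"op0": 0, "op1": 0},
    "SUB": {"op0": 0, "op1": 1},
    "LOAD": {"op0": 1},
}

# Encoding table: one word per opcode, built by setting its opcode fields.
ENCODING = {}
for _name, _fields in OPCODES.items():
    _word = 0
    for _f, _v in _fields.items():
        _word = field_set(_word, _f, _v)
    ENCODING[_name] = _word

def encode_opcode(name):
    """Look up the pre-computed encoding for an opcode."""
    return ENCODING[name]

def decode_instruction(word):
    """Nested tests mirror the opcode hierarchy, outermost field first."""
    op0 = field_get(word, "op0")
    if op0 == 0:
        op1 = field_get(word, "op1")
        if op1 == 0:
            return "ADD"
        if op1 == 1:
            return "SUB"
    elif op0 == 1:
        return "LOAD"
    return None  # not a valid opcode in this toy ISA
```

An assembler built on this would call `encode_opcode` and then `field_set` for each operand; a disassembler would call `decode_instruction` and then `field_get`.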
  • The assembler 110 is easily implemented.
  • The instruction encoding logic in the assembler is quite simple:
  • This disassembler algorithm is used in a standalone disassembler tool and also in the debugger 130 to support debugging of machine code.
  • The linker is less sensitive to the configuration than the compiler and assembler 110. Much of the linker is standard and even the machine-dependent portions depend primarily on the core ISA description and can be hand-coded for a particular core ISA. Parameters such as endianness are set from the configuration specification 100 using TPP.
  • The memory map of the target processor 60 is one other aspect of the configuration that is needed by the linker. As before, the parameters that specify the memory map are inserted into the linker using TPP.
  • The GNU linker is driven by a set of linker scripts, and it is these linker scripts that contain the memory map information.
  • This embodiment includes a tool to configure new linker scripts with different memory map parameters.
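A tool of this kind might, in outline, emit the `MEMORY` block of a GNU linker script from memory map parameters. The region names, addresses, and sizes below are invented, and the sketch covers only the `MEMORY` command, not the `SECTIONS` layout a complete script would need.

```python
# Hypothetical memory map parameters: (region name, origin, length, attributes).
MEMORY_MAP = [
    ("irom", 0x40000000, 0x20000, "rx"),
    ("dram", 0x60000000, 0x40000, "rw"),
]

def make_linker_script(memory_map):
    """Render a GNU ld MEMORY block from memory map parameters."""
    lines = ["MEMORY", "{"]
    for name, origin, length, attrs in memory_map:
        lines.append(
            f"  {name} ({attrs}) : ORIGIN = 0x{origin:08X}, LENGTH = 0x{length:X}"
        )
    lines.append("}")
    return "\n".join(lines)

linker_script = make_linker_script(MEMORY_MAP)
```

Regenerating the script with a different `MEMORY_MAP` retargets the link without touching the linker itself.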
  • The debugger 130 provides mechanisms to observe the state of a program as it runs, to single-step the execution one instruction at a time, to introduce breakpoints, and to perform other standard debugging tasks.
  • The program being debugged can be run either on a hardware implementation of the configured processor or on the ISS 126.
  • The debugger presents the same interface to the user in either case.
  • In the hardware case, a small monitor program is included on the target system to control the execution of the user's program and to communicate with the debugger via a serial port.
  • In the ISS case, the simulator 126 itself performs those functions.
  • The debugger 130 depends on the configuration in several ways.
  • The part of the debugger 130 that displays the processor's register state, and the parts of the debug monitor program and ISS 126 that provide that information to the debugger 130, are generated by scanning the ISA description to find which registers exist in the processor 60.
  • The configuration specification is also used to configure a simulator called the ISS 126 shown in FIG. 13.
  • The ISS 126 is a software application that models the functional behavior of the configurable processor instruction set. Unlike its counterpart processor hardware model simulators such as Synopsys VCS and Cadence Verilog XL and NC simulators, the ISS HDL model is an abstraction of the CPU during its instruction execution. The ISS 126 can run much faster than a hardware simulation because it does not need to model every signal transition for every gate and register in the complete processor design.
  • The ISS 126 allows programs generated for the configured processor 60 to be executed on a host computer. It accurately reproduces the processor's reset and interrupt behavior, allowing low-level programs such as device drivers and initialization code to be developed. This is particularly useful when porting native code to an embedded application.
  • The ISS 126 can be used to identify potential problems such as architectural assumptions, memory ordering considerations and the like without needing to download the code to the actual embedded target.
  • ISS semantics are expressed textually using a C-like language to build C operator building blocks that turn instructions into functions.
  • The rudimentary functionality of an interrupt, e.g., interrupt register, bit setting, interrupt level, vectors, etc., is modeled using this language.
  • The configurable ISS 126 is used for the following four purposes or goals as part of the system design and verification process:
  • debugging software applications before hardware becomes available; debugging system software (e.g., compilers and operating system components);
  • the ISS serves as a reference implementation of the ISA: the ISS and processor HDL are both run for diagnostics and applications during processor design verification, and traces from the two are compared;
  • All the goals require that the ISS 126 be able to load and decode programs produced with the configurable assembler 110 and linker. They also require that ISS execution of instructions be semantically equivalent to the corresponding hardware execution and to the compiler's expectations. For these reasons, the ISS 126 derives its decode and execution behavior from the same ISA files used to define the hardware and system software.
  • The ISS 126 therefore permits dynamic control of the level of detail of the simulation. For example, cache details are not modeled unless requested, and cache modeling can be turned off and on dynamically.
  • parts of the ISS 126 e.g., cache and pipeline models
  • The ISS 126 makes very few configuration-dependent choices of behavior at runtime. In this way, all ISS configurable behavior is derived from well-defined sources related to other parts of the system.
  • For the first and third goals listed above, it is important for the ISS 126 to provide operating system services to applications when these services are not yet available from the OS for the system under design (the target). It is also important for these services to be provided by the target OS when that is a relevant part of the debugging process. In this way the system provides a design for flexibly moving these services between ISS host and simulation target.
  • The current design relies on a combination of ISS dynamic control (trapping SYSCALL instructions may be turned on and off) and the use of a special SIMCALL instruction to request host OS services.
  • The last goal requires the ISS 126 to model some aspects of processor and system behavior that are below the level specified by the ISA.
  • The ISS cache models are constructed by generating C code for the models from Perl scripts which extract parameters from the configuration database 100.
  • details of the pipeline behavior of instructions e.g., interlocks based on register use and functional-unit availability requirements
  • A special pipeline description file specifies this information in a lisp-like syntax.
  • The third goal requires precise control of interrupt behavior. For this purpose, a special non-architectural register in the ISS 126 is used to suppress interrupt enables.
  • The ISS 126 provides several interfaces to support the different goals for its use:
  • a batch or command line mode (generally used in connection with the first and last goals);
  • a command loop mode, which provides non-symbolic debug capabilities, e.g. breakpoints, watchpoints, step, etc. — frequently used for all four goals;
  • a socket interface which allows the ISS 126 to be used by a software debugger as an execution backend (this must be configured to read and write the register state for the particular configuration selected).
  • This interface may be used to compare application behavior on different configurations. For example, at any breakpoint the state from a run on one configuration may be compared with or transferred to the state from a run on another configuration.
  • The simulator 126 also has both hand-coded and automatically generated portions.
  • The hand-coded portions are conventional, except for the instruction decode and execution, which are created from tables generated from the ISA description language.
  • The tables decode the instruction by starting from the primary opcode found in the instruction word to be executed, indexing into a table with the value of that field, and continuing until a leaf opcode, i.e., an opcode which is not defined in terms of other opcodes, is found.
  • The tables then give a pointer to the code translated from the TIE code specified in the semantics declaration for the instruction. This code is executed to simulate the instruction.
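The table-driven decode and dispatch can be sketched as follows. The 16-bit field layout, the two opcodes, and the semantic functions are hypothetical; in the embodiment the leaf entries point to code translated from TIE semantics, whereas here they are hand-written Python functions.

```python
MASK32 = 0xFFFFFFFF

def field(word, lo, width):
    """Extract `width` bits of `word` starting at bit `lo`."""
    return (word >> lo) & ((1 << width) - 1)

# Hypothetical semantic functions standing in for translated TIE code.
def sem_add(state, word):
    r, s = field(word, 8, 4), field(word, 12, 4)
    state["ar"][r] = (state["ar"][r] + state["ar"][s]) & MASK32

def sem_sub(state, word):
    r, s = field(word, 8, 4), field(word, 12, 4)
    state["ar"][r] = (state["ar"][r] - state["ar"][s]) & MASK32

# Each table names the field to index with; entries are either another
# table or a leaf semantic function.
SUB_TABLE = {"field": (4, 4), "entries": {0: sem_add, 1: sem_sub}}
PRIMARY_TABLE = {"field": (0, 4), "entries": {0: SUB_TABLE}}

def simulate_instruction(state, word):
    """Index through the tables until a leaf opcode is found, then run it."""
    node = PRIMARY_TABLE
    while isinstance(node, dict):
        lo, width = node["field"]
        node = node["entries"][field(word, lo, width)]
    node(state, word)
```

In this sketch an undefined opcode raises `KeyError`; a real simulator would raise an illegal-instruction exception in the simulated processor instead.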
  • The ISS 126 can optionally profile the execution of the program being simulated. This profiling uses a program counter sampling technique known in the art. At regular intervals, the simulator 126 samples the PC (program counter) of the processor being simulated. It builds a histogram with the number of samples in each region of code. The simulator 126 also counts the number of times each edge in the call graph is executed by incrementing a counter whenever a call instruction is simulated.
  • When the simulation is complete, the simulator 126 writes an output file containing both the histogram and call graph edge counts in a format that can be read by a standard profile viewer. Because the program 118 being simulated need not be modified with instrumentation code (as in standard profiling techniques), the profiling overhead does not affect the simulation results and the profiling is totally non-invasive.
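The sampling and edge-counting bookkeeping might be organized as in the sketch below. The class, its hook method, and the choice to sample every N simulated instructions are assumptions for illustration; the text says only that samples are taken at regular intervals, and it does not describe the output file format, which is omitted here.

```python
from collections import Counter

class Profiler:
    """PC-sampling histogram plus call-graph edge counts."""

    def __init__(self, sample_interval, bucket_size):
        self.sample_interval = sample_interval  # instructions between samples
        self.bucket_size = bucket_size          # bytes of code per histogram bin
        self.histogram = Counter()              # code region -> sample count
        self.call_edges = Counter()             # (caller PC, callee PC) -> count
        self.instructions = 0

    def on_instruction(self, pc, call_target=None):
        """Hook called by a (hypothetical) simulator loop per instruction."""
        self.instructions += 1
        if self.instructions % self.sample_interval == 0:
            self.histogram[pc // self.bucket_size] += 1
        if call_target is not None:
            self.call_edges[(pc, call_target)] += 1
```

Since all counters live in the simulator, the simulated program's instruction stream and timing are unchanged by profiling.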
  • The system makes available hardware processor emulation as well as software processor emulation.
  • This embodiment provides an emulation board.
  • The emulation board 200 uses a complex programmable logic device 202 such as the Altera Flex 10K200E to emulate, in hardware, a processor configuration 60.
  • The CPLD device 202 is functionally equivalent to the final ASIC product. It provides the advantage that a physical implementation of the processor 60 is available that can run much faster than other simulation methods (like the ISS 126 or HDL) and is cycle accurate. However, it cannot reach the high frequency targets that the final ASIC device can reach.
  • This board enables the designer to evaluate various processor configuration options and start software development and debugging early in the design cycle. It can also be used for the functional verification of the processor configuration.
  • The emulation board 200 has several resources available on it to allow for easy software development, debugging and verification. These include the CPLD device 202 itself, EPROM 204, SRAM 206, synchronous SRAM 208, flash memory 210 and two RS232 serial channels 212.
  • The serial channels 212 provide a communication link to UNIX or PC hosts for downloading and debugging user programs.
  • The configuration of a processor 60, in terms of the CPLD netlist, is downloaded into the CPLD 202 through a dedicated serial link to the device's configuration port 214 or through dedicated configuration ROMs 216.
  • The resources available on the board 200 are configurable to a degree as well.
  • The memory map of the various memory elements on the board can be easily changed, because the mapping is done through a Programmable Logic Device (PLD) 217 which can be easily changed.
  • The caches 218 and 228 that the processor core uses are expandable by using larger memory devices and appropriately sizing the tag busses 222 and 224 that connect to the caches 218 and 228.
  • Using the board to emulate a particular processor configuration involves several steps.
  • The first step is to obtain a set of RTL files which describe the particular configuration of the processor.
  • The next step is to synthesize a gate-level netlist from the RTL description using any of a number of commercial synthesis tools.
  • One such example is FPGA Express from Synopsys.
  • The gate-level netlist can then be used to obtain a CPLD implementation using tools typically provided by vendors.
  • One such tool is
  • The final step is to download the implementation onto the CPLD chip on the emulation board using programmers, again provided by the CPLD vendors.
  • The files delivered to users are customized by grouping all relevant files into a single directory. Then, a fully customized synthesis script is provided to synthesize the particular processor configuration to the particular FPGA device selected by the customer. A fully customized implementation script to be used by the vendor tools is also generated. Such synthesis and implementation scripts guarantee a functionally correct implementation with optimal performance.
  • The functional correctness is achieved by including appropriate commands in the script to read in all RTL files relevant to the specific processor configuration, by including appropriate commands to assign chip-pin locations based on I/O signals in the processor configuration, and by including commands to obtain specific logic implementations for certain critical portions of the processor logic such as gated clocks.
  • The script also improves the performance of the implementation by assigning detailed timing constraints to all processor I/O signals and by special processing of certain critical signals.
  • One such example for timing constraints is assigning a specific input delay to a signal by taking into account the delay of that signal on the board.
  • An example of critical signal treatment is to assign the clock signal to a dedicated global wire in order to achieve low clock skews on the CPLD chip.
  • The system also configures a verification suite for the configured processor 60.
  • Most verification of complex designs like microprocessors consists of a flow as follows:
  • The present invention uses a flow that is somewhat similar, but all components of the flow are modified to account for the configurability of the design.
  • This methodology consists of the following steps:
  • build a testbench for a particular configuration. Configuration of the testbench uses an approach similar to that described for the HDL and supports all options and extensions supported therein, i.e., cache sizes, bus interface, clocking, interrupt generation, etc.;
  • run self-checking diagnostics on a particular configuration of the HDL. Diagnostics themselves are configurable to tailor them for a particular piece of hardware. The selection of which diagnostics to run is also dependent on the configuration;
  • All of the verification components are configurable.
  • The configurability is implemented using TPP.
  • The test bench is a Verilog™ model of a system in which the configured processor 60 is placed.
  • These test benches include:
  • The test bench itself needs to support configurability. So, for example, the cache size and width and number of external interrupts are automatically adjusted based on configuration.
  • The testbench provides stimulus to the device under test, the processor 60. It does this by providing assembly level instructions (from diagnostics) that are preloaded into memory. It also generates signals that control the behavior of the processor 60, for example, interrupts. Also, the frequency and timing of these external signals is controllable and is automatically generated by the testbench.
  • Diagnostics use TPP to determine what to test. For example, a diagnostic has been written to test software interrupts. This diagnostic will need to know how many software interrupts there are in order to generate the right assembly code.
  • The processor configuration system 10 must decide which diagnostics are suitable for this configuration. For example, a diagnostic written to test the MAC unit is not applicable to a processor 60 which does not include this unit. In this embodiment this is accomplished through the use of a database containing information about each diagnostic.
  • The database may contain for each diagnostic the following information:
  • Test generation tools are tools that create a series of processor instructions in an intelligent fashion. They are sequences of pseudo-random test generators. This embodiment uses two types internally: a specially-developed one called RTPG and another which is based on an external tool called VERA (VSG). Both have configurability built around them. Based on valid instructions for a configuration, they will generate a series of instructions. These tools will also be able to deal with newly defined instructions from TIE, so that these newly defined instructions are randomly generated for testing.
  • This embodiment includes monitors and checkers that measure the coverage of the design verification.
  • Monitors and coverage tools are tools that are run alongside a regression run. Coverage tools monitor what the diagnostic is doing and the functions and logic of the HDL that it is exercising. All this information is collected throughout the regression run and is later analyzed to get some hints of what parts of the logic need further testing.
  • This embodiment uses several functional coverage tools that are configurable. For example, depending on the configuration, not all states of a particular finite state machine are included. For that configuration, the functional coverage tool must not try to check for the missing states or transitions. This is accomplished by making the tool configurable through TPP. Similarly, there are monitors that check for illegal conditions occurring within the HDL simulation. These illegal conditions could indicate bugs. For example, on a three-state bus, two drivers should not be on simultaneously. These monitors are configurable, adding or removing checks based on whether particular logic is included for that configuration.
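  • The three-state bus check can be sketched as a small trace monitor. This is an illustration only: the trace format, the driver names and the `present` argument (a stand-in for TPP-based configurability) are assumptions, not part of the described tools.

```python
def bus_conflicts(enable_trace, present=None):
    """Return the cycle numbers in which more than one checked driver is enabled.

    enable_trace: one dict per cycle mapping driver name -> 0/1 enable.
    present: names of the drivers that exist in this configuration
             (hypothetical); None means every driver is checked.
    """
    bad = []
    for cycle, enables in enumerate(enable_trace):
        active = [d for d, on in enables.items()
                  if on and (present is None or d in present)]
        if len(active) > 1:
            bad.append(cycle)
    return bad
```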
  • the cosimulation mechanism connects the HDL to the ISS 126. It is used to check that the state of the processor at the end of the instruction is identical in the HDL and the ISS 126. It too is configurable to the extent that it knows what features are included for each configuration and what state needs to be compared. So, for example, the data breakpoint feature adds a special register. This mechanism needs to know to compare this new special register.
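  • A minimal sketch of the configuration-dependent comparison, assuming the HDL and ISS states are available as dictionaries after each instruction. The register names and the `data_breakpoint` option key are hypothetical.

```python
BASE_STATE = ["pc", "a0", "a1"]                     # always compared (illustrative)
OPTIONAL_STATE = {"data_breakpoint": ["dbreak"]}    # hypothetical option -> extra registers


def compared_registers(config):
    """List the state elements that must match for this configuration."""
    regs = list(BASE_STATE)
    for feature, extra in OPTIONAL_STATE.items():
        if feature in config:
            regs.extend(extra)
    return regs


def mismatches(hdl_state, iss_state, config):
    """Return the registers whose HDL and ISS values differ at instruction end."""
    return [r for r in compared_registers(config)
            if hdl_state.get(r) != iss_state.get(r)]
```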
  • Instruction semantics specified via TIE can be translated to functionally equivalent C functions for use in the ISS 126 and for system designers to use for testing and verification.
  • the semantics of an instruction in the configuration database 106 are translated to a C function by tools that build a parse tree using standard parser tools, and then code that walks the tree and outputs the corresponding expressions in the C language.
  • the translation requires a prepass to assign bit widths to all expressions and to rewrite the parse tree to simplify some translations.
  • These translators are relatively simple compared to other translators, such as HDL to C or C to assembly language compilers, and can be written by one skilled in the art starting from the TIE and C language specification.
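  • The two passes described above (bit-width assignment, then a tree walk that emits C) can be sketched as follows. The sketch starts from an already-built tree rather than from TIE source, supports only a few operators, and assumes widths of at most 32 bits; the node shapes and function names are hypothetical.

```python
# Node forms (hypothetical): ("var", name, width), ("const", value, width),
# ("add", lhs, rhs), ("and", lhs, rhs), ("cat", hi, lo)

def width(node):
    """Prepass: compute the bit width of an expression."""
    kind = node[0]
    if kind in ("var", "const"):
        return node[2]
    if kind == "cat":
        return width(node[1]) + width(node[2])
    # add / and: as wide as the wider operand (carry-out is dropped)
    return max(width(node[1]), width(node[2]))


def mask(w):
    return (1 << w) - 1


def to_c(node):
    """Walk the tree and emit the corresponding C expression as a string."""
    kind = node[0]
    if kind == "var":
        return node[1]
    if kind == "const":
        return f"0x{node[1]:x}u"
    a, b = to_c(node[1]), to_c(node[2])
    if kind == "add":
        return f"(({a} + {b}) & 0x{mask(width(node)):x}u)"
    if kind == "and":
        return f"({a} & {b})"
    if kind == "cat":
        return f"(({a} << {width(node[2])}) | {b})"
    raise ValueError(kind)
```

  • The explicit mask on additions is why the width prepass is needed: C arithmetic does not truncate to an arbitrary HDL width on its own.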
  • benchmark application source code 118 is compiled and assembled and, using a sample data set 124, simulated to obtain a software profile 130 which also is provided to the user configuration capture routine for feedback to the user.
  • Having the ability to obtain both the hardware and software cost/benefit characterizations for any configuration parameter selections opens up new opportunities for further optimization of the system by the designers. Specifically, this will enable designers to select the optimal configuration parameters which optimize the overall systems according to some figure of merit.
  • One possible process is based on a greedy strategy, by repeatedly selecting or de-selecting a configuration parameter. At each step, the parameter that has the best impact on the overall system performance and cost is selected. This step is repeated until no single parameter can be changed to improve the system performance and cost.
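  • The greedy strategy can be sketched as below, assuming each configuration parameter is a simple on/off option and that a caller-supplied `score` function returns the figure of merit. The option names and the gain and cost numbers are invented for illustration.

```python
def greedy_configure(options, score, start=frozenset()):
    """Toggle one option per step, keeping the best single change, until none helps.

    options must be a non-empty sequence of option names.
    """
    current = frozenset(start)
    best = score(current)
    while True:
        candidates = [current ^ {opt} for opt in options]  # select or de-select
        top = max(candidates, key=score)
        if score(top) <= best:
            return current, best
        current, best = top, score(top)


# Illustrative figure of merit: performance gain minus hardware cost.
GAIN = {"multiplier": 5.0, "mac_unit": 8.0, "fpu": 1.0}
COST = {"multiplier": 2.0, "mac_unit": 3.0, "fpu": 4.0}


def merit(cfg):
    return sum(GAIN[o] - COST[o] for o in cfg)
```

  • As with any greedy search, the result is a local optimum: the loop stops as soon as no single parameter change improves the score.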
  • this process can also be used to construct optimal processor extensions. Because of the large number of possibilities in the processor extensions, it is important to restrict the number of extension candidates.
  • One technique is to analyze the application software and only look at the instruction extensions that can improve the system performance or cost.
  • Motion estimation is an important component of many image compression algorithms, including MPEG video and H.263 conference applications.
  • Video image compression attempts to use the similarities from one frame to the next to reduce the amount of storage required for each frame.
  • each block of an image to be compressed can be compared to the corresponding block (the same X,Y location) of the reference image (one that closely precedes or follows the image being compressed).
  • the compression of the image differences between frames is generally more bit-efficient than compression of the individual images.
  • the distinctive image features often move from frame to frame, so the closest correspondence between blocks in different frames is often not at exactly the same X,Y location, but at some offset. If significant parts of the image are moving between frames, it may be necessary to identify and compensate for the movement, before computing the difference.
  • The densest representation can be achieved by encoding the difference between successive images, including, for distinctive features, an X, Y offset in the sub-images used in the computed difference.
  • the offset in the location used for computing the image difference is called the motion vector.
  • the most computationally intensive task in this kind of image compression is the determination of the most appropriate motion vector for each block.
  • the common metric for selecting the motion vector is to find the vector with the lowest average pixel-by-pixel difference between each block of the image being compressed and a set of candidate blocks of the previous image.
  • the candidate blocks are the set of all the blocks in a neighborhood around the location of the block being compressed.
  • the size of the image, the size of the block and size of the neighborhood all affect the running time of the motion estimation algorithm.
  • Simple block-based motion estimation compares each sub-image of the image to be compressed against a reference image.
  • the reference image may precede or follow the subject image in the video sequence. In every case, the reference image is known to be available to the decompression system before the subject image is decompressed. The comparison of one block of an image under compression with candidate blocks of a reference image is illustrated below.
  • For each block in the subject image, a search is performed around the corresponding location in the reference image. Normally each color component (e.g., YUV) of the images is analyzed separately.
  • The average pixel-by-pixel difference is computed between that subject block and every possible block in the search zone of the reference image.
  • the difference is the absolute value of the difference in magnitude of the pixel values.
  • The average is proportional to the sum over the N² pixels in the pair of blocks (where N is the dimension of the block).
  • the block of the reference image that produces the smallest average pixel difference defines the motion vector for that block of the subject image.
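  • The block search described above can be sketched as an exhaustive comparison over a square search zone. This is a plain illustration for a single color component held in nested lists; the function names are hypothetical, and the sum is used directly because the average is proportional to it.

```python
def block_sad(new, old, bx, by, dx, dy, n):
    """Sum of absolute pixel differences between the n-by-n block of `new`
    at (bx, by) and the block of `old` displaced by (dx, dy)."""
    return sum(
        abs(new[by + y][bx + x] - old[by + dy + y][bx + dx + x])
        for y in range(n) for x in range(n)
    )


def best_vector(new, old, bx, by, n, search):
    """Try every offset in [-search, search] that keeps the candidate block
    inside `old`; return (sad, dx, dy) for the smallest difference."""
    height, width = len(old), len(old[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= bx + dx and bx + dx + n <= width and
                      0 <= by + dy and by + dy + n <= height)
            if not inside:
                continue
            sad = block_sad(new, old, bx, by, dx, dy, n)
            if best is None or sad < best[0]:
                best = (sad, dx, dy)
    return best
```

  • The running time grows with the number of blocks, the block area and the number of candidate offsets, which matches the dependence on image, block and neighborhood size noted earlier.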
  • the following example shows a simple form of a motion estimation algorithm, then optimizes the algorithm using TIE for a small application-specific functional unit. This optimization yields a speed-up of more than a factor of 10, making processor-based compression feasible for many video applications. It illustrates the power of a configurable processor that combines the ease of programming in a high-level language with the efficiency of special-purpose hardware.
  • This example uses two matrices, OldB and NewB, to respectively represent the old and new images.
  • the size of the image is determined by NX and NY.
  • the block size is determined by BLOCKX and BLOCKY. Therefore, the image is composed of NX/ BLOCKX by NY/BLOCKY blocks.
  • The search region around a block is determined by SEARCHX and SEARCHY.
  • the best motion vectors and values are stored in VectX, VectY, and VectB.
  • the best motion vectors and values computed by the base (reference) implementation are stored in BaseX, BaseY, and BaseB. These values are used to check against the vectors computed by the implementation using instruction extensions.
  • the motion estimation algorithm is comprised of three nested loops:
  • the instruction set architecture includes powerful funnel shifting primitives to permit rapid extraction of unaligned fields in memory. This allows the inner loop of the pixel comparison to fetch groups of adjacent pixels from memory efficiently. The loop can then be rewritten to operate on four pixels (bytes) simultaneously.
  • N (unsigned *) & (NewB [bx*BLOCKX+x]
  • This implementation uses the following SAD function to emulate the eventual new instruction:
  • TIE permits rapid specification of new instructions.
  • The configurable processor generator can fully implement these instructions in both the hardware implementation and the software development tools.
  • Hardware synthesis creates an optimal integration of the new function into the hardware datapath.
  • the configurable processor software environment fully supports the new instructions in the C and C++ compilers, the assembler, the symbolic debugger, the profiler and the cycle-accurate instruction set simulator.
  • the rapid regeneration of hardware and software makes application-specific instructions a quick and reliable tool for application acceleration.
  • This example uses TIE to implement a simple instruction to perform pixel differencing, absolute value and accumulation on four pixels in parallel.
  • This single instruction does eleven basic operations (which in a conventional process might require separate instructions) as an atomic operation. The following is the complete description:
  • The new opcode SAD is defined as a sub-opcode of CUST0.
  • QRST is the top-level opcode.
  • CUST0 is a sub-opcode of QRST, and SAD in turn is a sub-opcode of CUST0.
  • This hierarchical organization of opcodes allows logical grouping and management of the opcode spaces.
  • CUST0 (and CUST1) are defined as reserved opcode space for users to add new instructions. It is preferred that users stay within this allocated opcode space to ensure future re-usability of TIE descriptions.
  • The second step in this TIE description is to define a new instruction class containing the new instruction SAD. This is where the operands of the SAD instruction are defined.
  • SAD consists of three register operands, destination register arr and source registers ars and art.
  • arr is defined as the register indexed by the r field of the instruction
  • ars and art are defined as registers indexed by the s and t fields of the instruction.
  • The description uses a subset of the Verilog HDL language for describing combinational logic. It is this block that defines precisely how the ISS will simulate the SAD instruction and how additional circuitry is synthesized and added to the configurable processor hardware to support the new instruction.
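  • A minimal software emulation of a four-pixel sum of absolute differences, assuming each 32-bit source register holds four unsigned 8-bit pixels and the result is the plain sum of the four absolute differences. The function name `sad4` and the lane layout are assumptions of this sketch, not the original SAD function.

```python
def sad4(a, b):
    """Emulate a four-pixel sum of absolute differences on two 32-bit words,
    each holding four unsigned 8-bit pixels."""
    total = 0
    for shift in (0, 8, 16, 24):
        pa = (a >> shift) & 0xFF
        pb = (b >> shift) & 0xFF
        total += abs(pa - pb)   # one subtract, one absolute value, one accumulate
    return total
```

  • One plausible accounting for the eleven basic operations mentioned earlier is four subtractions, four absolute values and three additions.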
  • the TIE description is debugged and verified using the tools previously described.
  • The next step is to estimate the impact of the new instruction on the hardware size and performance. As noted above, this can be done using, e.g., Design Compiler™. When Design Compiler finishes, the user can look at the output for detailed area and speed reports.
  • the motion estimation code is compiled into code for the configurable processor which uses the instruction set simulator to verify the correctness of the program and more importantly to measure the performance. This is done in three steps: run the test program using the simulator; run just the base implementation to get the instruction count; and run just the new implementation to get the instruction count. The following is the simulation output of the second step:
  • the next step is to run the test program using a Verilog simulator as described above.
  • Those skilled in the art can glean the details of this process from the makefile of Appendix C (associated files also are shown in Appendix C).
  • The purpose of this simulation is to further verify the correctness of the new implementation and, more importantly, to make this test program part of the regression test for this configured processor.
  • Processor logic can be synthesized using, e.g., Design Compiler™ and placed and routed using, e.g., Apollo™.
  • MPEG 2 typically does motion estimation and compensation with sub-pixel resolution.
  • Two adjacent rows or columns of pixels can be averaged to create a set of pixels interpolated to an imaginary position halfway between the two rows or columns.
  • The configurable processor's user-defined instructions are again useful here, since a parallel pixel averaging instruction is easily implemented in just three or four lines of TIE code. Averaging between pixels in a row again uses the efficient alignment operations of the processor's standard instruction set.
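  • What a parallel pixel averaging operation computes can be sketched in software as follows. The four-lane packing and the truncating rounding are assumptions of this sketch; the text does not specify either.

```python
def avg4(a, b):
    """Average four packed 8-bit pixels lane by lane (truncating),
    so that no carry crosses from one lane into the next."""
    out = 0
    for shift in (0, 8, 16, 24):
        pa = (a >> shift) & 0xFF
        pb = (b >> shift) & 0xFF
        out |= ((pa + pb) >> 1) << shift
    return out
```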
  • the incorporation of a simple sum-of-absolute-differences instruction adds just a few hundred gates, yet improves motion estimation performance by more than a factor of ten.
  • This acceleration represents significant improvements in cost and power efficiency of the final system.
  • The seamless extension of the software development tools to include the new motion-estimation instruction allows for rapid prototyping, performance analysis and release of the complete software application solution.
  • the solution of the present invention makes application-specific processor configuration simple, reliable and complete, and offers dramatic enhancement of the cost, performance, functionality and power-efficiency of the final system product.
  • FIG. 6 which includes the processor control function, program counter (PC), branch selection, instruction memory or cache and instruction decoder, and the basic integer datapath including the main register file, bypassing multiplexers, pipeline registers, ALU, address generator and data memory for the cache.
  • the HDL is written with the presence of the multiplier logic being conditional upon the "multiplier" parameter being set, and a multiplier unit is added as a new pipeline stage as shown in FIG. 7 (changes to exception handling may be required if precise exceptions are to be supported).
  • instructions for making use of the multiplier are preferably added concomitantly with the new unit.
  • A full coprocessor may be added to the base configuration as shown in FIG. 8 for a digital signal processor such as a multiply/accumulate unit.
  • This entails changes in processor control such as adding decoding control signals for multiply-accumulate operations, including decoding of register sources and destinations from extended instructions; adding appropriate pipeline delays for control signals; extending register destination logic; adding control for a register bypass multiplexer for moves from accumulate registers, and the inclusion of a multiply-accumulate unit as a possible source for an instruction result. Additionally, it requires addition of a multiply-accumulate unit which entails additional accumulator registers, a multiply-accumulate array and source select multiplexers for main register sources.
  • addition of the coprocessor entails extension of the register bypass multiplexer from the accumulate registers to take a source from the accumulate registers, and extension of the load/alignment multiplexer to take a source from the multiplier result.
  • the system preferably adds instructions for using the new functional unit along with the actual hardware.
  • Another option that is particularly useful in connection with digital signal processors is a floating point unit.
  • Such a functional unit implementing, e.g., the IEEE 754 single-precision floating point operation standard may be added along with instructions for accessing it.
  • the floating point unit may be used, e.g., in digital signal processing applications such as audio compression and decompression.
  • As yet another example of the system's flexibility consider the 4 kB memory interface shown in
  • coprocessor registers and datapaths may be wider or narrower than the main integer register files and datapaths, and the local memory width may be varied so that the memory width is equal to the widest processor or coprocessor width (addressing of memory on reads and writes being adjusted accordingly).
  • FIG. 10 shows a local memory system for a processor that supports loads and stores of 32 bits to a processor/coprocessor combination addressing the same array, but where the coprocessor supports loads and stores of 128 bits.
  • Datal $Widel * ?DI1 : ⁇ s $Banks " ⁇ DI1 ⁇ ⁇ ; wire r$Max * *8-l:0]
  • Data2 $Widel ?DI2 : ⁇ $Banks ' ⁇ DI2 ⁇ ⁇ ) ; wire r$Max ⁇ *8-l:0]
  • $Bytes is the total memory size, accessed either as width B1 bytes at byte address A1 with data bus D1 under control of write signal W1, or using the corresponding parameters B2, A2, D2 and W2. Only one set of signals, defined by Select, is active in a given cycle.
  • the TPP code implements the memory as a collection of memory banks. The width of each bank is given by the minimum access width and the number of banks by the ratio of the maximum and minimum access widths.
  • a for loop is used to instantiate each memory bank and its associated write signals, i.e., write enable and write data.
  • a second for loop is used to gather the data read from all the banks into a single bus.
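  • A behavioral sketch of the banked organization, assuming two access widths given in bytes and aligned accesses. It models only the bank arithmetic (bank width equal to the narrower access, bank count equal to the ratio of the widths), not the generated HDL; the class and method names are hypothetical.

```python
class BankedMemory:
    """Memory of `size` bytes built from banks as wide as the narrower access."""

    def __init__(self, size, wide1, wide2):
        self.bank_width = min(wide1, wide2)                    # bytes per bank word
        self.num_banks = max(wide1, wide2) // self.bank_width  # ratio of the widths
        rows = size // (self.bank_width * self.num_banks)
        # First loop in the text: one storage array per bank.
        self.banks = [[bytes(self.bank_width)] * rows for _ in range(self.num_banks)]

    def _locate(self, addr):
        word = addr // self.bank_width
        return word % self.num_banks, word // self.num_banks

    def write(self, addr, data):
        """Write len(data) bytes (a multiple of the bank width) at an aligned address."""
        for i in range(0, len(data), self.bank_width):
            bank, row = self._locate(addr + i)
            self.banks[bank][row] = bytes(data[i:i + self.bank_width])

    def read(self, addr, nbytes):
        """Second loop in the text: gather the bank outputs into a single value."""
        out = b""
        for i in range(0, nbytes, self.bank_width):
            bank, row = self._locate(addr + i)
            out += self.banks[bank][row]
        return out
```

  • With 4-byte and 16-byte accesses this gives four banks, so a narrow write touches one bank while a wide read gathers from all four.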
  • FIG. 11 shows an example of the inclusion of user-defined instructions in the base configuration.
  • Simple instructions may be added to the processor pipeline with timing and interface similar to that of the ALU. Instructions added in this way must generate no stalls or exceptions, contain no state, use only the two normal source register values and the instruction word as inputs, and generate a single output value. If, however, the TIE language has provisions for specifying processor state, such constraints are not necessary.
  • FIG. 12 shows another example of implementation of a user-defined unit under this system.
  • The functional unit shown in the figure, an 8/16 parallel data unit extension of the ALU, is generated from the following ISA code:
  • TIE-defined instructions including those modifying processor state
  • a number of building blocks have been added to the language to make it possible to declare additional processor states which can be read and written by the new instructions.
  • These "state" statements are used to declare the additional processor states.
  • the declaration begins with the keyword state.
  • the next section of the state statements describes the size, number of bits, of the state and how the bits of the states are indexed.
  • The last section of the "state" statement is a list of attributes associated with the state.
  • State DATA is 64-bits wide and the bits are indexed from 63 to 0.
  • KEYC and KEYD are both 28-bit states.
  • DATA has a coprocessor-number attribute cpn indicating to which coprocessor data DATA belongs.
  • The attribute "autopack" indicates that the state DATA will be automatically mapped to some registers in the user-register file so that the value of DATA can be read and written by software tools.
  • the user_register section is defined to indicate the mapping of states to registers in the user register file.
  • A user_register section starts with the keyword user_register, followed by a number indicating the register number, and ends with an expression indicating the state bits to be mapped onto the register. For example:
      user_register 0 DATA[31:0]
      user_register 1 DATA[63:32]
      user_register 2 KEYC
      user_register 3 KEYD
      user_register 4 {X, Y, Z}
  • such an assignment of state bits to user register file entries is derived automatically using bin-packing algorithms.
  • a combination of manual and automatic assignments can be used, for example, to ensure upward compatibility.
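  • An automatic assignment could, for instance, use first-fit-decreasing bin packing. The text does not say which bin-packing algorithm is used, so this is only one possibility; the 32-bit register width, the slicing of wide states and the 1-bit widths assumed for X, Y and Z in the usage below are all assumptions.

```python
def pack_states(states, reg_bits=32):
    """First-fit-decreasing packing of (name, width) states into registers.

    States wider than a register are first split into register-sized slices.
    Returns one list of slice labels per allocated register.
    """
    pieces = []
    for name, width in states:
        for lo in range(0, width, reg_bits):
            hi = min(lo + reg_bits, width) - 1
            pieces.append((f"{name}[{hi}:{lo}]", hi - lo + 1))
    pieces.sort(key=lambda p: -p[1])      # widest first; ties keep input order
    registers = []                        # each entry: [free_bits, [labels]]
    for label, bits in pieces:
        for reg in registers:
            if reg[0] >= bits:            # first register with enough room
                reg[0] -= bits
                reg[1].append(label)
                break
        else:
            registers.append([reg_bits - bits, [label]])
    return [labels for _, labels in registers]
```

  • For the states used above this packing needs four registers rather than the five of the manual mapping, because the three 1-bit states fit in the spare bits beside KEYC. A manual assignment may still be preferred where register numbers must stay stable between versions.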
  • Instruction field statements (field) are used to improve the readability of the TIE code. Fields are subsets or concatenations of other fields that are grouped together and referenced by a name. The complete set of bits in an instruction is the highest-level superset field inst, and this field can be divided into smaller fields. For example,
      field x inst[11:8]
      field y inst[15:12]
      field xy {x, y}
    defines two 4-bit fields, x and y, as sub-fields (bits 8-11 and 12-15, respectively) of the highest-level field inst, and an 8-bit field xy as the concatenation of the x and y fields.
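  • The field example can be mirrored in software to show which bits each name selects. The helper names are hypothetical; the concatenation follows the Verilog convention that the first element supplies the most significant bits.

```python
def bits(value, hi, lo):
    """Extract bits hi..lo (inclusive) of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_fields(inst):
    x = bits(inst, 11, 8)
    y = bits(inst, 15, 12)
    xy = (x << 4) | y        # {x, y}: x supplies the upper four bits
    return {"x": x, "y": y, "xy": xy}
```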
  • defines an 18-bit field named offset which holds a signed number and an operand offsets4 which is four times the number stored in the offset field.
  • The last part of the operand statement actually describes the circuitry used to perform the computations in a subset of the Verilog™ HDL for describing combinatorial circuits, as will be apparent to those skilled in the art.
  • the wire statement defines a set of logical wires named t thirty-two bits wide.
  • The first assign statement after the wire statement specifies that the logical signals driving the logical wires are the offsets4 constant shifted to the right, and the second assign statement specifies that the lower eighteen bits of t are put into the offset field.
  • The very first assign statement directly specifies the value of the offsets4 operand as a concatenation of offset and fourteen replications of its sign bit (bit 17) followed by a shift-left of two bits.
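  • The decode and encode computations described for offsets4 can be checked with a short model. It assumes a 32-bit operand value and a right shift of two bits on the encode side, consistent with the operand being four times the stored number; the function names are hypothetical.

```python
def decode_offsets4(offset):
    """Sign-extend the 18-bit field to 32 bits, then shift left by two."""
    offset &= 0x3FFFF
    if offset & 0x20000:          # bit 17 is the sign bit
        offset |= 0xFFFC0000      # fourteen copies of the sign bit
    return (offset << 2) & 0xFFFFFFFF


def encode_offsets4(value):
    """Shift the operand right by two and keep the low eighteen bits."""
    t = (value & 0xFFFFFFFF) >> 2
    return t & 0x3FFFF


def as_signed32(word):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    return word - (1 << 32) if word & 0x80000000 else word
```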
  • makes use of the table statement to define an array prime of constants (the number following the table name being the number of elements in the table) and uses the operand s as an index into the table prime to encode a value for the operand prime_s (note the use of Verilog™ statements in defining the indexing).
  • the instruction class statement iclass associates opcodes with operands in a common format.
  • operands adsel and acs belong to a common class of instructions viterbi which take two register operands art and ars as input and write output to a register operand arr.
  • The instruction class statement "iclass" is modified to allow the specification of state-access information of instructions. It starts with the keyword "iclass", is followed by the name of the instruction class, the list of opcodes belonging to the instruction class and a list of operand access information, and ends with a newly-defined list for state access information.
  • iclass lddata {LDDATA} {out arr, in imm4} {in DATA}
    iclass stdata {STDATA} {in ars, in art} {out DATA}
    iclass stkey {STKEY} {in ars, in art} {out KEYC, out KEYD}
    iclass des {DES} {out arr, in imm4} {inout KEYC, inout DATA, inout KEYD}
  • The instruction semantic statement semantic describes the behavior of one or more instructions using the same subset of Verilog™ used for coding operands. By defining multiple instructions in a single semantic statement, some common expressions can be shared and the hardware implementation can be made more efficient.
  • The variables allowed in semantic statements are operands for opcodes defined in the statement's opcode list, and a single-bit variable for each opcode specified in the opcode list. This variable has the same name as the opcode and evaluates to 1 when the opcode is detected. It is used in the computation section (the Verilog™ subset section) to indicate the presence of the corresponding instruction.
  • the first section of the above code defines an opcode for the new instruction, called BYTESWAP.
  • The new opcode BYTESWAP is defined as a sub-opcode of CUST0. From the Xtensa™
  • Opcodes are typically organized in a hierarchical fashion.
  • QRST is the top-level opcode
  • CUST0 is a sub-opcode of QRST
  • BYTESWAP is in turn a sub-opcode of CUST0. This hierarchical organization of opcodes allows logical grouping and management of the opcode spaces.
  • The second declaration declares additional processor states needed by the BYTESWAP instruction:
      // declare state SWAP and COUNT
      state COUNT 32
      state SWAP 1
  • COUNT is declared as a 32-bit state and SWAP as a 1-bit state.
  • the TIE language specifies that the bits in COUNT are indexed from 31 to 0 with bit 0 being least significant.
  • The Xtensa™ ISA provides two instructions, RSR and WSR, for saving and restoring special system registers. Similarly, it provides two other instructions, RUR and WUR (described in greater detail below), for saving and restoring states which are declared in TIE.
  • The next section in the TIE description is the definition of a new instruction class containing the new instruction BYTESWAP:
  • SWAP} where iclass is the keyword and bs is the name of the iclass.
  • the next clause lists the instruction in this instruction class (BYTESWAP).
  • The clause after that specifies the operands used by the instructions in this class (in this case an input operand ars and an output operand arr).
  • the last clause in the iclass definition specifies the states which are accessed by the instruction in this class (in this case the instruction will read state SWAP and read and write state COUNT).
  • The description uses a subset of Verilog HDL for describing combinational logic. It is this block that defines precisely how the instruction set simulator will simulate the BYTESWAP instruction and how the additional circuitry is synthesized and added to the Xtensa™ processor hardware to support the new instruction.
  • the declared states can be used just like any other variables for accessing information stored in the states.
  • a state identifier appearing on the right hand side of an expression indicates the read from the state.
  • Writing to a state is done by assigning the state identifier with a value or an expression.
  • TIE Instruction Extension Language
  • new hardware implementing the instructions can be generated using, e.g., a program similar to the one shown in Appendix D.
  • Appendix E shows the code for header files needed to support new instructions as intrinsic functions.
  • FIG. 16 is a diagram of how the ISA-specific portions of these software tools are generated.
  • a TIE parser program 410 From a user-created TIE description file 400, a TIE parser program 410 generates C code for several programs, each of which produces a file accessed by one or more of the software development tools for information about the user-defined instructions and state.
  • The program tie2gcc 420 generates a C header file 470 called xtensa-tie.h which contains intrinsic function definitions for new instructions.
  • the program tie2isa 430 generates a dynamic linked library (DLL) 480 which contains information on user-defined instruction format (in the Wilson et al.
  • the program tie2iss 440 generates performance modeling routines and produces a DLL 490 containing instruction semantics which, as discussed in the Wilson et al. application, is used by a host compiler to produce a simulator DLL used by the simulator.
  • the program tie2ver 450 produces necessary descriptions 500 for user-defined instructions in an appropriate hardware description language.
  • the program tie2xtos 460 produces save and restore code 510 for use by RUR and WUR instructions.
  • a state register is typically duplicated several times, each instantiation representing the value of the state at a particular pipeline stage.
  • a state is translated into multiple copies of registers consistent with the underlying core processor implementation. Additional bypass and forward logic are also generated, again in a manner consistent with the underlying core processor implementation. For example, to target a core processor implementation that consists of three execution stages, this embodiment would translate a state into three registers connected as shown in FIG. 18.
  • Each register 610-630 represents the value of the state at one of the three pipeline stages; ctrl-1, ctrl-2, and ctrl-3 are control signals used to enable the data latching in the corresponding flip-flops 610-630.
  • the execution unit consists of multiple pipeline stages.
  • The computation of an instruction is carried out in multiple stages in this pipeline. Instruction streams flow through the pipeline in sequence as directed by the control logic. At any given time, there can be up to n instructions being executed in the pipeline, where n is the number of stages. In a superscalar processor, also implementable using the present invention, the number of instructions in the pipeline can be n×w, where w is the issue width of the processor.
  • The role of the control logic is to make sure the dependencies between the instructions are obeyed and any interference between instructions is resolved. If an instruction uses data computed by an earlier instruction, special hardware is needed to forward the data to the later instruction without stalling the pipeline. If an interrupt occurs, all instructions in the pipeline need to be killed and later re-executed.
  • a value computed in a stage will be forwarded to the next instructions immediately without waiting for the value to reach the end of the pipeline in order to reduce the number of pipeline stalls introduced by the data dependencies. This is accomplished by sending the output of the first flip-flop 610 directly to the semantic block such that it can be used immediately by the next instruction.
  • The implementation requires the following control signals: Kill_1, Kill_all and Valid_3.
  • Signal "Kill_1" indicates that the instruction currently in the first pipeline stage 610 must be killed for reasons such as not having the data it needs to proceed. Once the instruction is killed, it will be retried in the next cycle. Signal "Kill_all" indicates that all the instructions currently in the pipeline must be killed for reasons such as an instruction ahead of them having generated an exception or an interrupt having occurred. Signal "Valid_3" indicates whether the instruction currently in the last stage 630 is valid or not. An invalid instruction is often the result of killing an instruction in the first pipeline stage 610 and causing a bubble (invalid instruction) in the pipeline. "Valid_3" simply indicates whether the instruction in the third pipeline stage is valid or a bubble. Clearly, only valid instructions should be latched.
  • FIG. 20 shows the additional logic and connections needed to implement the state register. It also shows how to construct the control logic to drive the signals "ctrl-1", "ctrl-2", and "ctrl-3" such that this state-register implementation meets the above requirements.
  • the following is sample HDL code automatically generated to implement the state register as shown in FIG. 19.
  • tie_enflop #(size) state_WX (.tie_out(sx), .tie_in(sw), .en(ew), .clk(clk)); endmodule
  • the present state value of the state is passed to the semantic block as an input variable if the semantic block specifies the state as its input. If the semantic block has the logic to generate the new value for a state, an output signal is created. This output signal is used as the next-state input to the pipelined state register.
  • This embodiment allows multiple semantic description blocks, each of which describes the behavior of multiple instructions. Under this unrestricted description style, it is possible that only a subset of the semantic blocks produce a next-state output for a given state. Furthermore, it is also possible that a given semantic block produces the next-state output conditionally, depending on what instruction it is executing at a given time. Consequently, additional hardware logic is needed to combine the next-state outputs from all semantic blocks to form the input to the pipelined state register. In this embodiment of the invention, a signal is automatically derived for each semantic block indicating whether this block has produced a new value for the state. In another embodiment, such a signal can be left to the designer to specify. FIG. 20 shows how to combine the next-state output of a state from several semantic blocks s 1
  • op1_1 and op1_2 are opcode signals for the first semantic block.
  • op2_1 and op2_2 are opcode signals for the second semantic block.
  • The next-state output of semantic block i is si (there are multiple next-state outputs for the block if there are multiple state registers).
  • The signal indicating that semantic block i has produced a new value for the state is si_we.
  • Signal s_we indicates whether any of the semantic blocks produces a new value for the state, and is used as an input to the pipelined state register as the write-enable signal.
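  • The combining logic described above can be sketched as a small behavioral model in C. This is only an illustration, not the generated HDL: the type sem_out_t and the function combine_next_state are hypothetical names, and the sketch assumes that at most one semantic block asserts its si_we in a given cycle, as would be expected when a single instruction executes per cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Next-state output of one semantic block for a given state (hypothetical type). */
typedef struct {
    uint32_t s;     /* si: proposed next-state value */
    int      s_we;  /* si_we: nonzero if block i produced a new value */
} sem_out_t;

/*
 * Combine the per-block outputs into the single next-state input and
 * write enable of the pipelined state register.
 * Returns s_we (the OR of all si_we); *next receives the selected value.
 */
int combine_next_state(const sem_out_t *blocks, size_t n, uint32_t *next)
{
    uint32_t value = 0;
    int we = 0;

    assert(blocks != NULL && next != NULL);
    for (size_t i = 0; i < n; i++) {
        if (blocks[i].s_we) {
            value |= blocks[i].s;  /* AND-OR mux: only enabled blocks contribute */
            we = 1;
        }
    }
    *next = value;
    return we;
}
```

  • When no block asserts its write enable, the function returns 0 and the state register would simply hold its current value.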
  • Although the expressive power of multiple semantic blocks is no greater than that of a single one, they do provide a way to implement more structured descriptions, typically by grouping related instructions into a single block. Multiple semantic blocks can also lead to simpler analysis of instruction effects because of the more restricted scope in which the instructions are implemented. On the other hand, there are often reasons for a single semantic block to describe the behavior of multiple instructions. Most often, it is because the hardware implementations of these instructions share common logic. Describing multiple instructions in a single semantic block usually leads to a more efficient hardware design.
  • The logic for the restore and load instructions is automatically generated as two semantic blocks, which can then be recursively translated into actual hardware just like any other blocks.
  • FIG. 21 shows the block diagram of the logic corresponding to this kind of semantic logic.
  • The input signal "st" is compared with various constants to form various selection signals, which are used to select certain bits from the state registers in a way consistent with the user_register specification.
  • Bit 32 of DATA maps to bit 0 of the second user register. Therefore, the second input of the MUX in this diagram should be connected to the 32nd bit of the DATA state.
  • FIG. 22 shows the logic for the jth bit of state S when it is mapped to the kth bit of the ith user register. If the user_register number "st" in a WUR instruction is "i", the kth bit of "ars" is loaded into the S[j] register; otherwise, the original value of S[j] is recirculated. In addition, if any bit of the state S is reloaded, the signal S_we is enabled.
  • The TIE user_register declaration specifies a mapping from additional processor state defined by state declarations to an identifier used by the RUR and WUR instructions to read and write this state independently of the TIE instructions.
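  • The per-bit WUR behavior described for FIG. 22 can be modeled in C as follows. This is a hedged sketch under stated assumptions: the mapping table type ur_bit_map_t and the function wur_update are hypothetical, the state S is assumed to fit in 64 bits, and user registers are assumed to be 32 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One mapped bit: bit j of state S lives in bit k of user register i. */
typedef struct {
    unsigned j;  /* bit index within state S (0..63 in this sketch) */
    unsigned i;  /* user_register number */
    unsigned k;  /* bit index within the 32-bit user register */
} ur_bit_map_t;

/*
 * Behavioral model of WUR for one state: for every mapped bit whose user
 * register number equals st, load ars[k] into S[j]; all other bits of S
 * recirculate. Returns S_we, nonzero if any bit of S was reloaded.
 */
int wur_update(uint64_t *S, unsigned st, uint32_t ars,
               const ur_bit_map_t *map, size_t n)
{
    int s_we = 0;

    assert(S != NULL && map != NULL);
    for (size_t m = 0; m < n; m++) {
        assert(map[m].j < 64 && map[m].k < 32);
        if (map[m].i == st) {
            uint64_t bit = (ars >> map[m].k) & 1u;
            *S = (*S & ~(UINT64_C(1) << map[m].j)) | (bit << map[m].j);
            s_we = 1;
        }
    }
    return s_we;
}
```

  • With a table entry mapping bit 32 of a state to bit 0 of user register 1, a WUR to register 1 with ars equal to 1 sets bit 32 of the state, while a WUR to any unmapped register leaves the state unchanged and returns 0.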
  • Appendix F shows the code for generating RUR and WUR instructions.
  • The primary purpose of RUR and WUR is task switching.
  • Multiple software tasks share the processor, running according to some scheduling algorithm.
  • The task's state resides in the processor registers.
  • When the scheduling algorithm decides to switch to another task, the state held in the processor registers is saved to memory, and another task's state is loaded from memory into the processor registers.
  • The Xtensa™ Instruction Set Architecture includes the RSR and WSR instructions to read and write the state defined by the ISA.
  • The following code is part of the task "save to memory":
      // save special registers
      rsr a0, SAR
      rsr a1, LCOUNT
      s32i a0, a3, UEXCSAVE + 0
      s32i a1, a3, UEXCSAVE + 4
      rsr a0, LBEG
      rsr a1, LEND
      s32i a0, a3, UEXCSAVE + 8
      s32i a1, a3, UEXCSAVE + 12
      if (config_get_value("IsaUseMAC16"))
      rsr a0, ACCLO
      rsr a1, ACCHI
      s32i a0, a3, UEXCSAVE + 16
      s32i a1, a3, UEXCSAVE + 20
      rsr a0, MR_0
      rsr a1, MR_1
      s32i a0, a3, UEXCSAVE + 24
      s32i a1, a3, UEX
  • The task state area in memory must have additional space allocated for the user register storage, and the offset of this space from the base of the task save pointer is defined as the assembler constant UEXCUREG. This save area was previously defined by the following code
  • UEXCREGSIZE+UEXCPARMSIZE+UEXCSAVESIZE+UEXCMISCSIZE which is changed to #define UEXCREGSIZE (16*4)
  • This code depends on there being a tpp variable @user_registers with a list of the user register numbers. This is simply a list created from the first argument of every user_register statement.
  • A state can be computed in different pipeline stages. Handling this requires several extensions (albeit simple ones) to the process described here.
  • First, the specification language needs to be extended so that a semantic block can be associated with a pipeline stage. This can be accomplished in one of several ways.
  • The associated pipeline stage can be specified explicitly with each semantic block.
  • A range of pipeline stages can be specified for each semantic block.
  • The pipeline stage for a given semantic block can be automatically derived depending on the required computational delay.
  • The second task in supporting state generation at different pipeline stages is to handle interrupts, exceptions, and stalls. This usually involves adding appropriate bypass and forward logic under the control of pipeline control signals.
  • A generate-usage diagram can be generated to indicate the relationship between when the state is generated and when it is used. Based on application analysis, appropriate forward logic can be implemented to handle the common situations, and interlock logic can be generated to stall the pipeline for the cases not handled by the forwarding logic.
  • The method for modifying the instruction issue logic of the base processor is dependent on the algorithms employed by the base processor. However, generally speaking, instruction issue logic for most processors, whether single-issue or superscalar, whether for single-cycle or multi-cycle instructions, depends only on the following for the instruction being tested for issue: 1. signals that indicate for each processor state element whether the instruction uses the state as a source;
  • TIE contains all the necessary information to augment the signals and their equations for the new instructions.
  • Each TIE state declaration causes a new signal to be created for the instruction issue logic.
  • Each in or inout operand or state listed in the third or fourth argument to the iclass declaration adds the instruction decode signal for the instructions listed in the second argument to the first set of equations for the specified processor state element.
  • Each out or inout operand or state listed in the third or fourth argument to the iclass declaration adds the instruction decode signal for the instructions listed in the second argument to the second set of equations for the specified processor state element.
  • The logic created from each TIE semantic block represents a new functional unit, so new unit signals are created, and the decode signals for the TIE instructions specified for the semantic block are OR'd together to form the third set of equations.
  • The pipeline must provide the following status back to the issue logic:
  • The embodiment described herein is a single-issue processor where the designer-defined instructions are limited to a single cycle of logic computation. In this case the above simplifies considerably. There is no need for the functional unit checks or cross-issue checks, and no single-cycle instruction can cause a processor state element to be not pipeready for the next instruction.
  • The TIE specification would be augmented with a latency specification for each instruction giving the number of cycles over which to pipeline the computation.
  • The fourth set of signals would be generated in each semantic block pipe stage by OR'ing together the instruction decode signals for each instruction that completes in that stage according to the specification.
  • The generated logic will be fully pipelined, and so the TIE generated functional units will always be ready one cycle after accepting an instruction.
  • The fifth set of signals for TIE semantic blocks is always asserted.
  • The fifth set of signals would be generated in each semantic block pipe stage by OR'ing together the instruction decode signals for each instruction that finishes with the specified cycle count in that stage.
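  • The way iclass declarations feed the first and second sets of equations can be illustrated with bit masks in C. This is a minimal sketch, not the TIE compiler's output: the types state_ref_t and issue_eq_t and both functions are hypothetical, and each instruction decode signal is assumed to be one bit of a 32-bit one-hot vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { DIR_IN, DIR_OUT, DIR_INOUT } dir_t;

/* One (iclass, state) pair: which instructions touch the state, and how. */
typedef struct {
    uint32_t insn_mask;  /* one decode bit per instruction in the iclass */
    dir_t    dir;        /* direction of the state in the iclass declaration */
} state_ref_t;

typedef struct {
    uint32_t src_mask;   /* first set: instructions that use the state as a source */
    uint32_t dst_mask;   /* second set: instructions that write the state */
} issue_eq_t;

/* OR the decode signals into the source and destination equations of one state. */
issue_eq_t build_issue_equations(const state_ref_t *refs, size_t n)
{
    issue_eq_t eq = { 0, 0 };

    assert(refs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (refs[i].dir == DIR_IN || refs[i].dir == DIR_INOUT)
            eq.src_mask |= refs[i].insn_mask;
        if (refs[i].dir == DIR_OUT || refs[i].dir == DIR_INOUT)
            eq.dst_mask |= refs[i].insn_mask;
    }
    return eq;
}

/* Evaluate one equation for the instruction whose one-hot decode is given. */
int signal_asserted(uint32_t mask, uint32_t decode_onehot)
{
    return (mask & decode_onehot) != 0;
}
```

  • An inout reference contributes to both masks, which mirrors the text: inout operands and states appear in both the first and the second set of equations.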
  • Appendix G is an example of the implementation of an instruction using the TIE language.
  • Appendix H shows what the TIE compiler generates for the compiler using such code.
  • Appendix I shows what the TIE compiler generates for the simulator.
  • Appendix J shows what the TIE compiler generates for macro expanding the TIE instructions in a user application.
  • Appendix K shows what the TIE compiler generates to simulate TIE instructions in native mode.
  • Appendix L shows what the TIE compiler generates as a Verilog HDL description for the additional hardware.
  • Appendix M shows what the TIE compiler generates as a Design Compiler script to optimize the Verilog HDL description above to estimate the area and speed impact of the TIE instruction on the total CPU size and performance.
  • A user begins by selecting a base processor configuration via the GUI described above.
  • A software development system 30 is built and delivered to the user as shown in FIG. 1.
  • The software development system 30 contains four key components relevant to another aspect of the present invention, shown in greater detail in FIG. 6: a compiler 108, an assembler 110, an instruction set simulator 112 and a debugger 130.
  • A compiler converts user applications written in high level programming languages such as C or C++ into processor-specific assembly language.
  • High level programming languages such as C or C++ are designed to allow application writers to describe their applications precisely in a form that is easy for them to write. These are not languages understood by processors. The application writer need not necessarily worry about all the specific characteristics of the processor that will be used.
  • The same C or C++ program can typically be used with little or no modification on many different types of processors.
  • The compiler translates the C or C++ program into assembly language.
  • Assembly language is much closer to machine language, the language directly supported by the processor. Different types of processors will have their own assembly language. Each assembly instruction often directly represents one machine instruction, but the two are not necessarily identical. Assembly instructions are designed to be human readable strings. Each instruction and operand is given a meaningful name or mnemonic, allowing humans to read assembly instructions and easily understand what operations will be performed by the machine. Assemblers convert from assembly language into machine language. Each assembly instruction string is efficiently encoded by the assembler into one or more machine instructions that can be directly and efficiently executed by the processor.
  • Machine code can be directly run on the processor, but physical processors are not always immediately available. Building physical processors is a time-consuming and expensive process.
  • A user cannot build a physical processor for each potential choice. Instead, the user is provided with a software program called a simulator.
  • The simulator, a program running on a generic computer, is able to simulate the effects of running the user application on the user-configured processor.
  • The simulator is able to mimic the semantics of the simulated processor and is able to tell the user how quickly the real processor will be able to run the user's application.
  • A debugger is a tool that allows users to interactively find problems with their software.
  • The debugger allows users to interactively run their programs.
  • The user can stop the program's execution at any time and look at her C source code or at the resultant assembly or machine code.
  • The user can also examine or modify the values of any or all of her variables or the hardware registers at a break point.
  • The user can then continue execution, perhaps one statement at a time, perhaps one machine instruction at a time, perhaps to a new user-selected break point. All four components 108, 110, 112 and 130 need to be aware of user-defined instructions 750.
  • The system allows the user to access user-defined instructions 750 via intrinsics added to user C and C++ applications.
  • The compiler 108 must translate the intrinsic calls into the assembly language instructions 738 for the user-defined instructions 750.
  • The assembler 110 must take the new assembly language instructions 738, whether written directly by the user or translated by the compiler 108, and encode them into the machine instructions 740 corresponding to the user-defined instructions 750.
  • The simulator 112 must decode the user-defined machine instructions 740. It must model the semantics of the instructions, and it must model the performance of the instructions on the configured processor. The simulator 112 must also model the values and performance implications of user-defined state.
  • The debugger 130 must allow the user to print the assembly language instructions 738 including user-defined instructions 750. It must allow the user to examine and modify the value of user-defined state.
  • The user invokes a tool, the TIE compiler 702, to process the current potential user-defined enhancements 736.
  • The TIE compiler 702 is different from the compiler 708 that translates the user application into assembly language 738.
  • The TIE compiler 702 builds components which enable the already-built base software system 30 (compiler 708, assembler
  • Each element of the software system 30 uses a somewhat different set of components.
  • FIG. 24 is a diagram of how the TIE-specific portions of these software tools are generated.
  • From the user-defined extension file 736, the TIE compiler 702 generates C code for several programs, each of which produces a file accessed by one or more of the software development tools for information about the user-defined instructions and state.
  • The program tie2gcc 800 generates a C header file 842 called xtensa-tie.h (described in greater detail below) which contains intrinsic function definitions for new instructions.
  • The program tie2isa 810 generates a dynamic linked library (DLL) 844/848 which contains information on user-defined instruction format (a combination of encode DLL 844 and decode DLL 848 described in greater detail below).
  • The program tie2iss 840 generates C code 870 for performance modeling and instruction semantics which, as discussed below, is used by a host compiler 846 to produce a simulator DLL 849 used by the simulator 712 as described in greater detail below.
  • The program tie2ver 850 produces necessary descriptions 850 for user-defined instructions in an appropriate hardware description language.
  • The program tie2xtos 860 produces save and restore code 810 to save and restore the user-defined state for context switching. Additional information on the implementation of user-defined state can be found in the aforementioned Wang et al. application.
  • Compiler 708
  • The compiler 708 translates intrinsic calls in the user's application into assembly language instructions 738 for the user-defined enhancements 736.
  • The compiler 708 implements this mechanism on top of the macro and inline assembly mechanisms found in standard compilers such as the GNU compilers. For more information on these mechanisms, see, e.g., GNU C and C++ Compiler User's Guide, EGCS Version 1.0.3.
  • When the user invokes the compiler 708 on her application, she tells the compiler 708, either via a command line option or an environment variable, the name of the directory with the user-defined enhancements 736. That directory also contains the xtensa-tie.h file 742.
  • The compiler 708 automatically includes the file xtensa-tie.h into the user C or C++ application program being compiled, as if the user had written the definition of foo herself.
  • The user has included intrinsic calls to the instruction foo in her application. Because of the included definition, the compiler 708 treats those intrinsic calls as calls to the included definition.
  • The compiler 708 treats the call to the macro foo as if the user had directly written the assembly language statement 738 rather than the macro call. That is, based on the standard inline assembly mechanism, the compiler 708 translates the call into the single assembly instruction foo. For example, the user might have a function that contains a call to the intrinsic foo:
  • The TIE compiler 702 merely creates the file xtensa-tie.h 742, which is automatically included by the prebuilt compiler 708 into the user's application.
  • The assembler 710 uses an encode library 744 to encode assembly instructions 750.
  • The interface to this library 744 includes functions to:
  • provide the bit patterns to be generated for each opcode for the opcode fields in a machine instruction 740; and
  • encode the operand value for each instruction operand and insert the encoded operand bit patterns into the operand fields of a machine instruction 740.
  • The assembler might take the "foo a2, a2, a3" instruction and convert it into the machine instruction represented by the hexadecimal number 0x62230, where the high order 6 and the low order 0 together represent the opcode for foo, and the 2, 2 and 3 represent the three registers a2, a2 and a3 respectively.
  • One table maps opcode mnemonic strings to the internal opcode representation. For efficiency, this table may be sorted or it may be a hash table or some other data structure allowing efficient searching.
  • Another table maps each opcode to a template of a machine instruction with the opcode fields initialized to the appropriate bit patterns for that opcode. Opcodes with the same operand fields and operand encodings are grouped together.
  • The library contains a function to encode the operand value into a bit pattern and another function to insert those bits into the appropriate fields in a machine instruction.
  • A separate internal table maps each instruction operand to these functions.
  • The encode library 744 is implemented as a dynamically linked library (DLL).
  • DLLs are a standard way to allow a program to extend its functionality dynamically. The details of handling DLLs vary across different host operating systems, but the basic concept is the same.
  • The DLL is dynamically loaded into a running program as an extension of the program's code.
  • A run-time linker resolves symbolic references between the DLL and the main program and between the DLL and other DLLs already loaded.
  • A small portion of the code is statically linked into the assembler 710. This code is responsible for loading the DLL, combining the information in the DLL with the existing encode information for the pre-built instruction set 746 (which may have been loaded from a separate DLL), and making that information accessible via the interface functions described above.
  • When the user creates new enhancements 736, she invokes the TIE compiler 702 on a description of the enhancements 736.
  • The TIE compiler 702 generates C code defining the internal tables and functions which implement the encode DLL 744.
  • The TIE compiler 702 then invokes the host system's native compiler 746 (which compiles code to run on the host rather than on the processor being configured) to create the encode DLL 744 for the user-defined instructions 750.
  • The user invokes the pre-built assembler 710 on her application with a flag or environment variable pointing to the directory containing the user-defined enhancements 736.
  • The prebuilt assembler 710 dynamically opens the DLL 744 in the directory. For each assembly instruction, the prebuilt assembler 710 uses the encode DLL 744 to look up the opcode mnemonic, find the bit patterns for the opcode fields in the machine instruction, and encode each of the instruction operands.
  • The assembler 710 sees the TIE instruction "foo a2, a2, a3".
  • The assembler 710 sees from a table that the "foo" opcode translates into the number 6 in bit positions 16 to 23. From a table, it finds the encoding functions for each of the registers. The functions encode a2 into the number 2, the other a2 into the number 2 and a3 into the number 3. From a table, it finds the appropriate set functions. Set_r_field puts the result value 2 into bit positions 12..15 of the instruction. Similar set functions appropriately place the other 2 and the 3.
  • The simulator 712 interacts with user-defined enhancements 736 in several ways. Given a machine instruction 740, the simulator 712 must decode the instruction; i.e., break up the instruction into the component opcode and operands. Decoding of user-defined enhancements 736 is done via a function in a decode DLL 748 (it is possible that the encode DLL 744 and the decode DLL 748 are actually a single DLL). For example, consider a case where the user defines three opcodes, foo1, foo2 and foo3, with encodings 0x6, 0x16 and 0x26 respectively in bits 16 to 23 of the instruction and with 0 in bits 0 to 3.
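  • The encoding walk-through above can be reproduced with a small table-driven sketch in C. This is not the generated encode DLL: the table layout, encode_rrr and the s/t field setters are hypothetical, and placing the second and third operands in bits 8..11 and 4..7 is an assumption chosen to be consistent with the example word 0x62230.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t insn_t;

/* Hypothetical opcode table entry: mnemonic plus instruction template. */
typedef struct {
    const char *mnemonic;
    insn_t      tmpl;   /* opcode fields set, operand fields zero */
} opcode_entry_t;

static const opcode_entry_t opcode_table[] = {
    { "foo", (insn_t)0x06u << 16 },  /* 6 in bits 16..23, 0 in bits 0..3 */
};

/* Insert a 4-bit operand value at the given bit position. */
static void set_field(insn_t *insn, unsigned shift, uint32_t value)
{
    *insn = (*insn & ~((insn_t)0xFu << shift)) | ((value & 0xFu) << shift);
}

void set_r_field(insn_t *insn, uint32_t v) { set_field(insn, 12, v); } /* bits 12..15 */
void set_s_field(insn_t *insn, uint32_t v) { set_field(insn, 8, v); }  /* bits 8..11, assumed */
void set_t_field(insn_t *insn, uint32_t v) { set_field(insn, 4, v); }  /* bits 4..7, assumed */

/* Encode "mnemonic aR, aS, aT"; returns 0 on success, -1 for an unknown mnemonic. */
int encode_rrr(const char *mnemonic, uint32_t r, uint32_t s, uint32_t t, insn_t *out)
{
    assert(mnemonic != NULL && out != NULL);
    for (size_t i = 0; i < sizeof opcode_table / sizeof opcode_table[0]; i++) {
        if (strcmp(opcode_table[i].mnemonic, mnemonic) == 0) {
            *out = opcode_table[i].tmpl;  /* start from the opcode template */
            set_r_field(out, r);
            set_s_field(out, s);
            set_t_field(out, t);
            return 0;
        }
    }
    return -1;
}
```

  • Calling encode_rrr("foo", 2, 2, 3, &word) yields 0x62230, matching the example in the text. A linear search is used only for brevity; the text notes that a sorted table or hash table may be used for efficiency.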
  • Comparing an opcode against all possible user-defined instructions 750 can be expensive, so the TIE compiler can instead use a hierarchical set of switch statements:
      switch (get_op0_field(insn)) {
      case 0x0:
        switch (get_op1_field(insn)) {
        case 0x6:
          switch (get_op2_field(insn)) {
          case 0x0: return xtensa_foo1_op;
          case 0x1: return xtensa_foo2_op;
          case 0x2: return xtensa_foo3_op;
          default: return XTENSA_UNDEFINED;
          }
        default: return XTENSA_UNDEFINED;
        }
      default: return XTENSA_UNDEFINED;
      }
  • The decode DLL 748 includes functions for decoding instruction operands. This is done in the same manner as for encoding operands in the encode DLL 744. First, the decode DLL 748 provides functions to extract the operand fields from machine instructions. Continuing the previous examples, the TIE compiler 702 generates the following function to extract a value from bits 12 to 15 of an instruction:
  • The TIE description of an operand includes specifications of both encoding and decoding, so whereas the encode DLL 744 uses the operand encode specification, the decode DLL 748 uses the operand decode specification.
  • When the user invokes the simulator 712, she tells the simulator 712 the directory containing the decode DLL 748 for the user-defined enhancements 736. The simulator 712 opens the appropriate DLL. Whenever the simulator 712 decodes an instruction, if that instruction is not successfully decoded by the decode function for the pre-built instruction set, the simulator 712 invokes the decode function in the
  • Given a decoded instruction 750, the simulator 712 must interpret and model the semantics of the instruction 750. This is done functionally. Every instruction 750 has a corresponding function that allows the simulator 712 to model the semantics of that instruction 750.
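  • The generated function itself is not reproduced in this text. As a hedged illustration only, field extractors of this kind might look like the following C sketch; the function bodies are assumptions, with bit positions taken from the example encodings (opcode sub-fields in bits 0..3, 16..19 and 20..23, and an operand field in bits 12..15).

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t insn_t;

/* Extract the 4-bit operand field held in bits 12..15 of an instruction. */
uint32_t get_r_field(insn_t insn)
{
    return (insn >> 12) & 0xFu;
}

/* Opcode sub-fields used by a hierarchical decode (positions assumed from the example). */
uint32_t get_op0_field(insn_t insn) { return insn & 0xFu; }
uint32_t get_op1_field(insn_t insn) { return (insn >> 16) & 0xFu; }
uint32_t get_op2_field(insn_t insn) { return (insn >> 20) & 0xFu; }

/* Self-check against the example word 0x62230 ("foo a2, a2, a3"). */
void check_example(void)
{
    assert(get_op0_field(0x62230u) == 0x0u);
    assert(get_op1_field(0x62230u) == 0x6u);
    assert(get_op2_field(0x62230u) == 0x0u);
    assert(get_r_field(0x62230u) == 0x2u);
}
```

  • Under this split, the encodings 0x6, 0x16 and 0x26 in bits 16 to 23 all share 0x6 in bits 16..19 and differ only in bits 20..23 (0, 1 and 2), which is what allows a nested switch to decode them.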
  • The simulator 712 internally keeps track of all states of the simulated processor.
  • The simulator 712 has a fixed interface to update or query the processor's state.
  • User-defined enhancements 736 are written in the TIE hardware description language, which is a subset of Verilog.
  • The TIE compiler 702 converts the hardware description into a C function used by the simulator 712 to model the new enhancements 736. Operators in the hardware description language are translated directly into the corresponding C operators. Operations that read state or write state are translated into the simulator's interface to update or query the processor's state.
  • The hardware operator "+" is translated directly into the C operator "+".
  • The reads of the hardware registers ars and art are translated into calls to the simulator 712 function "ar".
  • The write of the hardware register arr is translated into a call to the simulator 712 function "set_ar". Since every instruction implicitly increments the program counter, pc, by the size of the instruction, the TIE compiler 702 also generates a call to the simulator 712 function that increments the simulated pc by 3, the size of the add instruction.
  • When the TIE compiler 702 is invoked, it creates semantic functions as described above for every user-defined instruction. It also creates a table that maps all the opcode names to the associated semantic functions. The table and functions are compiled using the standard compiler 746 into the simulator DLL 749.
  • The simulator 712 opens the appropriate DLL. Whenever the simulator 712 is invoked, it decodes all the instructions in the program and creates a table that maps instructions to the associated semantic functions. When creating the mapping, the simulator 712 opens the DLL and searches for the appropriate semantic functions.
  • When simulating the semantics of a user-defined instruction 736, the simulator 712 directly invokes the function in the DLL.
  • The simulator 712 needs to simulate the performance effects of an instruction 750.
  • The simulator 712 uses a pipeline model for this purpose. Every instruction executes over several cycles. In each cycle, an instruction uses different resources of the machine. The simulator 712 begins trying to execute all the instructions in parallel. If multiple instructions try to use the same resource in the same cycle, the later instruction is stalled waiting for the resource to free. If a later instruction reads some state that is written by an earlier instruction but in a later cycle, the later instruction is stalled waiting for the value to be written.
  • The simulator 712 uses a functional interface to model the performance of each instruction. A function is created for every type of instruction. That function contains calls to the simulator's interface that models the performance of the processor.
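  • The translation just described can be illustrated with a self-contained C sketch of a semantic function for an add instruction. The functions ar and set_ar are named in the text, but their signatures here, the function pc_incr, the register file size and the name add_func are assumptions made only so that the example runs on its own.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;

/* Minimal stand-in for the simulator's fixed state interface (assumed). */
u32 sim_ar[16];  /* simulated address registers */
u32 sim_pc;      /* simulated program counter */

u32  ar(u32 n)            { assert(n < 16); return sim_ar[n]; }
void set_ar(u32 n, u32 v) { assert(n < 16); sim_ar[n] = v; }
void pc_incr(u32 bytes)   { sim_pc += bytes; }

/* Semantic function for an add whose hardware description is arr = ars + art.
 * The arguments are register numbers, not register values. */
void add_func(u32 arr, u32 ars, u32 art)
{
    set_ar(arr, ar(ars) + ar(art));  /* hardware "+" maps directly to C "+" */
    pc_incr(3);                      /* the add instruction is 3 bytes long */
}
```

  • Because the semantic function touches processor state only through the interface functions, the same generated code can be compiled into a DLL and loaded by a prebuilt simulator.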
  • void foo_sched(u32 op0, u32 op1, u32 op2, u32 op3) {
        pipe_use_ifetch(3);
        pipe_use(REGF32_AR, op1, 1);
        pipe_use(REGF32_AR, op2, 1);
        pipe_def(REGF32_AR, op0, 2);
        pipe_def_ifetch(-1);
      }
  • The call to pipe_use_ifetch tells the simulator 712 that the instruction will require 3 bytes to be fetched.
  • The two calls to pipe_use tell the simulator 712 that the two input registers will be read in cycle 1.
  • The call to pipe_def tells the simulator 712 that the output register will be written in cycle 2.
  • The call to pipe_def_ifetch tells the simulator 712 that this instruction is not a branch, hence the next instruction can be fetched in the next cycle.
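  • How use and def cycles turn into stalls can be shown with a small scoreboard model in C. This is a simplified sketch, not the simulator's pipeline model: pipe_model_t, pipe_init and pipe_schedule are hypothetical, only register dependencies are modeled (no fetch or resource conflicts), and it assumes a value written in cycle c can first be read in cycle c + 1.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_AR 16

typedef struct {
    long ready[NUM_AR];  /* first absolute cycle in which each register can be read */
    long next_issue;     /* earliest cycle in which the next instruction may issue */
} pipe_model_t;

void pipe_init(pipe_model_t *p)
{
    assert(p != 0);
    for (int i = 0; i < NUM_AR; i++)
        p->ready[i] = 0;
    p->next_issue = 0;
}

/*
 * Schedule one instruction that reads src1 and src2 in stage use_stage and
 * writes dst in stage def_stage. Returns the cycle in which it issues.
 */
long pipe_schedule(pipe_model_t *p, unsigned src1, unsigned src2, unsigned dst,
                   int use_stage, int def_stage)
{
    long issue;

    assert(p != 0 && src1 < NUM_AR && src2 < NUM_AR && dst < NUM_AR);
    issue = p->next_issue;
    /* Stall until both sources are readable when the use stage is reached. */
    if (p->ready[src1] - use_stage > issue)
        issue = p->ready[src1] - use_stage;
    if (p->ready[src2] - use_stage > issue)
        issue = p->ready[src2] - use_stage;
    p->ready[dst] = issue + def_stage + 1;  /* readable the cycle after the write */
    p->next_issue = issue + 1;              /* single issue: one instruction per cycle */
    return issue;
}
```

  • With the foo timing above (sources used in cycle 1, result defined in cycle 2), a foo that consumes the result of the immediately preceding foo issues one cycle late under these assumptions, while an independent instruction issues without a stall.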
  • Pointers to these functions are placed in the same table as the semantic functions.
  • The functions themselves are compiled into the same DLL 749 as the semantic functions.
  • When the simulator 712 is invoked, it creates a mapping between instructions and performance functions. When creating the mapping, the simulator 712 opens the DLL 749 and searches for the appropriate performance functions.
  • When simulating the performance of a user-defined instruction 736, the simulator 712 directly invokes the function in the DLL 749.
  • The debugger interacts with user-defined enhancements 750 in two ways.
  • The user has the ability to print the assembly instructions 738 for user-defined instructions 736.
  • The debugger 730 must decode machine instructions 740 into assembly instructions 738. This is the same mechanism used by the simulator 712 to decode instructions, and the debugger 730 preferably uses the same DLL used by the simulator 712 to do the decoding.
  • In addition to decoding the instructions, the debugger must convert the decoded instruction into strings.
  • The decode DLL 748 includes a function to map each internal opcode representation to the corresponding mnemonic string. This can be implemented with a simple table.
  • The user can invoke the prebuilt debugger with a flag or environment variable pointing to the directory containing the user-defined enhancements 750.
  • The prebuilt debugger dynamically opens the appropriate DLL 748.
  • The debugger 730 also interacts with user-defined state 752.
  • The debugger 730 must be able to read and modify that state 752. In order to do so, the debugger 730 communicates with the simulator 712. It asks the simulator 712 how large the state is and what the names of the state variables are. Whenever the debugger 730 is asked to print the value of some user state, it asks the simulator 712 for the value in the same way that it asks for predefined state. Similarly, to modify user state, the debugger 730 tells the simulator 712 to set the state to a given value.
  • Core software development tools may be specific to particular core instruction sets and processor states, and a single set of plug-in modules for user-defined enhancements may be evaluated in connection with multiple sets of core software development tools resident on the system.
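  • The "simple table" mentioned above might look like the following C sketch. The enum values, the table and the function opcode_mnemonic are hypothetical; the opcode names are borrowed from the foo1, foo2 and foo3 decode example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical internal opcode representation for three user-defined opcodes. */
typedef enum {
    XTENSA_UNDEFINED = -1,
    xtensa_foo1_op = 0,
    xtensa_foo2_op,
    xtensa_foo3_op,
    XTENSA_NUM_USER_OPS
} xtensa_opcode_t;

/* Table indexed by internal opcode, giving the mnemonic string. */
static const char *const mnemonic_table[] = { "foo1", "foo2", "foo3" };

/* Map an internal opcode to its mnemonic; returns NULL for undefined opcodes. */
const char *opcode_mnemonic(int op)
{
    assert(sizeof mnemonic_table / sizeof mnemonic_table[0]
           == (size_t)XTENSA_NUM_USER_OPS);
    if (op < 0 || op >= XTENSA_NUM_USER_OPS)
        return NULL;
    return mnemonic_table[op];
}
```

  • A debugger could combine this lookup with the operand decode functions to print a decoded instruction as an assembly string.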
  • # Data One of the following:
    # DataCache Data Cache parameters
    # DataRAM Data RAM parameters
    # DataROM Data ROM parameters
    # Debug Debug option parameters
    # Impl Implementation goals
    # Inst One of the following:
    # InstCache Instruction Cache parameters
  • # column 2 default value of the parameter
    # column 3 perl expression used to check the validity of the values
    #
  • IsaUseDensityInstruction IsaUse32bitMulDiv 0 0
  • Perl is embedded in the source text by one of two means. Whole lines of perl can be embedded by preceding them with a semicolon (you would typically do this for looping statements or subroutine calls). Alternatively, perl expressions can be embedded into the middle of other text by escaping them with backticks.
  • -debug Print perl code to STDERR, so you can figure out why your embedded perl statements are looping forever.
  • -I dir Search for include files in directory dir.
  • -o output_file Redirect the output to a file rather than to stdout.
  • -c config_file Read the specified config file.
  • -e eval Eval eval before running the program.
  • NOTE: lines containing only ";" and "; //" will go unaltered.
  • $lasttype = "perl"; } if (( /^\s*;\s*\/\// ) ||
  • ( /^\s*;\s*$/ )) { $buf .= "print STDOUT \"$_\";\n";
  • GCC = /usr/xtensa/stools/bin/gcc
  • MFILE = $(XTENSA)/Hardware/diag/Makefile.common
    all: run-base run-tie-cstub run-iss run-iss-old run-iss-new run-ver
  • me run-base me-base me-base; exit 0 run-tie-cstub: me-tie-cstub me-tie-cstub; exit 0 run-iss: me-xt
  • testdir $(XTGO) -vcs -testdir `pwd`/testdir -test me-xt > run-ver.out 2>&1
    grep Status run-ver.out
    testdir:
    mkdir -p testdir/me-xt
    @echo 'all: me-xt.dat me-xt.bfd' > testdir/me-xt/Makefile
    @echo "include $(MFILE)" >> testdir/me-xt/Makefile
    clean:
    rm -rf me-* *.out testdir results
  • TEST PROGRAM
    #include <stdio.h>
    #include <stdlib.h>
    #include <limits.h>
  • #define ABS(x) (((x) < 0) ? (-(x)) : (x))
    #define MIN(x,y) (((x) < (y)) ? (x) : (y))
    #define MAX(x,y) (((x) > (y)) ? (x) : (y))
    #define ABSD(x,y) (((x) > (y)) ? ((x) - (y)) : ((y) - (x)))
  • A O[0] ;
  • the purpose of motion estimation is to find the unaligned 8x8 block of an existing (old) image that most closely resembles an aligned 8x8 block.
  • the search here is at any byte offset in +/- 16 bytes in x and +/- 16 bytes in y.
  • the search is a set of six nested loops.
  • OldB is pointer to a byte array of old block
  • NewB is pointer to a byte array of base block */ #define NY 480 #define NX 640
  • NewB[x][y] = x+2*y+2;
  • OldW is pointer to a word array of old block NewW is pointer to a word array of base block */ #define NY 480 #define NX 640 #define BLOCKX 16 #define BLOCKY 16 #define SEARCHX 16 #define SEARCHY 16 #define MIN(x,y) ((x<y)?x:y) #define MAX(x,y) ((x>y)?x:y) unsigned long OldW[NY][NX/sizeof(long)]; unsigned long NewW[NY][NX/sizeof(long)]; unsigned short VectX[NY/BLOCKY][NX/BLOCKX]; unsigned short VectY[NY/BLOCKY][NX/BLOCKX]; void init()
  • OldW[y][x] = ((x << 2) ⁇ y) << 24
  • NewW[y][x] = ((x << 2) + 2*y + 2) << 24
  • tie_program_foreach_instruction(_prog, _inst) { \ tie_t *_iclass; \ tie_program_foreach_iclass(_prog, _iclass) { \ if (tie_get_predefined(_iclass)) continue; \ tie_iclass_foreach_instruction(_iclass, _inst) { #define end_tie2ver_program_foreach_instruction \
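
The test-program fragments above describe motion estimation as a search built from six nested loops: two over the aligned blocks, two over the candidate offsets of +/- 16 bytes in x and y, and two over the pixels of a block. The following is only a minimal sketch of that loop structure, not the patent's listing. It assumes the match metric is a sum of absolute differences (suggested by the ABSD macro, but not confirmed by the surviving text), uses small illustrative image sizes instead of 640x480, and stores offsets in signed int arrays. The function name `motion_estimate` and the names `old_img`, `new_img`, `vect_x` and `vect_y` are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative sizes only (assumption); the fragments above use NX 640, NY 480. */
#define NX 64
#define NY 48
#define BLOCK 8    /* 8x8 blocks, as in the description */
#define SEARCH 16  /* search +/- 16 bytes in x and in y */

#define ABSD(x, y) (((x) > (y)) ? ((x) - (y)) : ((y) - (x)))

/* Hypothetical helper: for every aligned block of new_img, find the offset
 * (dx, dy) within +/- SEARCH whose (possibly unaligned) block in old_img has
 * the smallest sum of absolute differences. Candidate blocks that would fall
 * outside the image are skipped. Ties keep the first offset found. */
void motion_estimate(unsigned char old_img[NY][NX],
                     unsigned char new_img[NY][NX],
                     int vect_x[NY / BLOCK][NX / BLOCK],
                     int vect_y[NY / BLOCK][NX / BLOCK])
{
    /* Assumption of this sketch: image dimensions are multiples of BLOCK. */
    assert(NX % BLOCK == 0 && NY % BLOCK == 0);

    for (int by = 0; by < NY / BLOCK; by++) {            /* loop 1: block row */
        for (int bx = 0; bx < NX / BLOCK; bx++) {        /* loop 2: block column */
            unsigned best = UINT_MAX;
            int best_dx = 0, best_dy = 0;

            for (int dy = -SEARCH; dy <= SEARCH; dy++) {       /* loop 3: y offset */
                for (int dx = -SEARCH; dx <= SEARCH; dx++) {   /* loop 4: x offset */
                    int oy = by * BLOCK + dy;  /* top-left of candidate old block */
                    int ox = bx * BLOCK + dx;
                    if (oy < 0 || ox < 0 || oy + BLOCK > NY || ox + BLOCK > NX)
                        continue;

                    unsigned sad = 0;
                    for (int y = 0; y < BLOCK; y++)            /* loop 5: pixel row */
                        for (int x = 0; x < BLOCK; x++)        /* loop 6: pixel column */
                            sad += ABSD(old_img[oy + y][ox + x],
                                        new_img[by * BLOCK + y][bx * BLOCK + x]);

                    if (sad < best) {
                        best = sad;
                        best_dx = dx;
                        best_dy = dy;
                    }
                }
            }
            vect_x[by][bx] = best_dx;
            vect_y[by][bx] = best_dy;
        }
    }
}
```

A caller would fill `old_img` and `new_img`, call `motion_estimate(old_img, new_img, vect_x, vect_y)`, and read one offset pair per block from the two vector arrays. The innermost two loops are where the absolute-difference work accumulates, so they are the natural target for the kind of instruction extension the test program is meant to exercise.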

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Geometry (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Devices For Executing Special Programs (AREA)
  • Debugging And Monitoring (AREA)
  • Stored Programmes (AREA)
  • Executing Machine-Instructions (AREA)
EP00913380A 1999-02-05 2000-02-04 Automated processor generation system & method for designing a configurable processor Ceased EP1159693A2 (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US246047 1999-02-05
US09/246,047 US6477683B1 (en) 1999-02-05 1999-02-05 Automated processor generation system for designing a configurable processor and method for the same
US09/323,161 US6701515B1 (en) 1999-05-27 1999-05-27 System and method for dynamically designing and evaluating configurable processor instructions
US323161 1999-05-27
US322735 1999-05-28
US09/322,735 US6477697B1 (en) 1999-02-05 1999-05-28 Adding complex instruction extensions defined in a standardized language to a microprocessor design to produce a configurable definition of a target instruction set, and hdl description of circuitry necessary to implement the instruction set, and development and verification tools for the instruction set
PCT/US2000/003091 WO2000046704A2 (en) 1999-02-05 2000-02-04 Automated processor generation system and method for designing a configurable processor

Publications (1)

Publication Number Publication Date
EP1159693A2 (en) 2001-12-05

Family

ID=27399897

Family Applications (1)

Application Number Title Priority Date Filing Date
EP00913380A Ceased EP1159693A2 (en) 1999-02-05 2000-02-04 Automated processor generation system & method for designing a configurable processor

Country Status (6)

Country Link
EP (1) EP1159693A2 (ja)
JP (2) JP2003518280A (ja)
KR (2) KR100775547B1 (ja)
AU (1) AU3484100A (ja)
TW (1) TW539965B (ja)
WO (1) WO2000046704A2 (ja)

Families Citing this family (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0028079D0 (en) * 2000-11-17 2001-01-03 Imperial College System and method
JP2002230065A (ja) 2001-02-02 2002-08-16 Toshiba Corp システムlsi開発装置およびシステムlsi開発方法
WO2002084538A1 (en) 2001-04-11 2002-10-24 Mentor Graphics Corporation Hdl preprocessor
DE10128339A1 (de) * 2001-06-12 2003-01-02 Systemonic Ag Verfahren zur Validierung eines Modells für eine datenverarbeitende Schaltungsanordung
US6941548B2 (en) * 2001-10-16 2005-09-06 Tensilica, Inc. Automatic instruction set architecture generation
DE10205523A1 (de) * 2002-02-08 2003-08-28 Systemonic Ag Verfahren zum Bereitstellen einer Entwurfs-, Test- und Entwicklungsumgebung sowie ein System zur Ausführung des Verfahrens
US7200735B2 (en) 2002-04-10 2007-04-03 Tensilica, Inc. High-performance hybrid processor with configurable execution units
JP2003316838A (ja) * 2002-04-19 2003-11-07 Nec Electronics Corp システムlsiの設計方法及びこれを記憶した記録媒体
JP4202673B2 (ja) * 2002-04-26 2008-12-24 株式会社東芝 システムlsi開発環境生成方法及びそのプログラム
US7346881B2 (en) 2002-05-13 2008-03-18 Tensilica, Inc. Method and apparatus for adding advanced instructions in an extensible processor architecture
US7937559B1 (en) 2002-05-13 2011-05-03 Tensilica, Inc. System and method for generating a configurable processor supporting a user-defined plurality of instruction sizes
US7376812B1 (en) 2002-05-13 2008-05-20 Tensilica, Inc. Vector co-processor for configurable and extensible processor architecture
US7784024B2 (en) 2003-08-20 2010-08-24 Japan Tobacco Inc. Program creating system, program creating program, and program creating module
US7278122B2 (en) * 2004-06-24 2007-10-02 Ftl Systems, Inc. Hardware/software design tool and language specification mechanism enabling efficient technology retargeting and optimization
KR100722428B1 (ko) * 2005-02-07 2007-05-29 재단법인서울대학교산학협력재단 리소스 공유 및 파이프 라이닝 구성을 갖는 재구성가능배열구조
US7757224B2 (en) * 2006-02-02 2010-07-13 Microsoft Corporation Software support for dynamically extensible processors
KR100793210B1 (ko) * 2006-06-01 2008-01-10 조용범 Arm 프로세서에서의 메모리 접근 횟수를 줄인 디코더구현방법
KR100813662B1 (ko) 2006-11-17 2008-03-14 삼성전자주식회사 프로세서 구조 및 응용의 최적화를 위한 프로파일러
WO2008062768A1 (fr) 2006-11-21 2008-05-29 Nec Corporation Système de génération de code d'opération de commande
JP5217431B2 (ja) 2007-12-28 2013-06-19 富士通株式会社 演算処理装置及び演算処理装置の制御方法
WO2009084570A1 (ja) * 2007-12-28 2009-07-09 Nec Corporation コンパイラ組み込み関数追加装置
JP2010181942A (ja) * 2009-02-03 2010-08-19 Renesas Electronics Corp Pld/cpldからマイコンへの置換え見積の情報提供システム及び方法
TWI416302B (zh) * 2009-11-20 2013-11-21 Ind Tech Res Inst 具電源模式感知之時脈樹及其合成方法
KR101635397B1 (ko) * 2010-03-03 2016-07-04 삼성전자주식회사 재구성 가능한 프로세서 코어를 사용하는 멀티코어 시스템의 시뮬레이터 및 시뮬레이션 방법
WO2012108411A1 (ja) 2011-02-10 2012-08-16 日本電気株式会社 符号化/復号化処理プロセッサ、および無線通信装置
US8880851B2 (en) * 2011-04-07 2014-11-04 Via Technologies, Inc. Microprocessor that performs X86 ISA and arm ISA machine language program instructions by hardware translation into microinstructions executed by common execution pipeline
KR20130088285A (ko) * 2012-01-31 2013-08-08 삼성전자주식회사 데이터 처리 시스템 및 그 시스템에서 데이터 시뮬레이션 방법
KR102025694B1 (ko) * 2012-09-07 2019-09-27 삼성전자 주식회사 재구성 가능한 프로세서의 검증 방법
KR102122455B1 (ko) * 2013-10-08 2020-06-12 삼성전자주식회사 프로세서의 디코더 검증을 위한 테스트 벤치 생성 방법 및 이를 위한 장치
RU2631989C1 (ru) * 2016-09-22 2017-09-29 ФЕДЕРАЛЬНОЕ ГОСУДАРСТВЕННОЕ КАЗЕННОЕ ВОЕННОЕ ОБРАЗОВАТЕЛЬНОЕ УЧРЕЖДЕНИЕ ВЫСШЕГО ОБРАЗОВАНИЯ "Военная академия Ракетных войск стратегического назначения имени Петра Великого" МИНИСТЕРСТВА ОБОРОНЫ РОССИЙСКОЙ ФЕДЕРАЦИИ Устройство для диагностического контроля выполнения проверок
US10426424B2 (en) 2017-11-21 2019-10-01 General Electric Company System and method for generating and performing imaging protocol simulations
KR102104198B1 (ko) * 2019-01-10 2020-05-29 한국과학기술원 느긋한 심볼화를 활용한 바이너리 재조립 기술의 정확도 향상 기술 및 도구
CN110096257B (zh) * 2019-04-10 2023-04-07 沈阳哲航信息科技有限公司 一种基于智能识别的设计图形自动化评判系统及方法
CN111832739B (zh) * 2019-04-18 2024-01-09 中科寒武纪科技股份有限公司 一种数据处理方法及相关产品
CN111400986B (zh) * 2020-02-19 2024-03-19 西安智多晶微电子有限公司 一种集成电路计算设备及计算处理系统
JP7461181B2 (ja) * 2020-03-16 2024-04-03 本田技研工業株式会社 制御装置、システム、プログラム、及び制御方法
CN114721982A (zh) * 2022-03-22 2022-07-08 潍柴动力股份有限公司 一种可配置存储数据类型的读写处理方法及系统
CN114492264B (zh) * 2022-03-31 2022-06-24 南昌大学 门级电路的转译方法、系统、存储介质及设备

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5896521A (en) * 1996-03-15 1999-04-20 Mitsubishi Denki Kabushiki Kaisha Processor synthesis system and processor synthesis method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SE505783C3 (sv) * 1995-10-03 1997-10-06 Ericsson Telefon Ab L M Foerfarande foer att tillverka en digital signalprocessor
GB2308470B (en) * 1995-12-22 2000-02-16 Nokia Mobile Phones Ltd Program memory scheme for processors

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
HADJIYIANNIS ET AL: "ISDL: An Instruction Set Description Language for Retargetability", 6 June 1997 (1997-06-06), DAC'97, pages 299 - 302, XP000731853 *
HARTOOG ET AL: "Generation of Software Tools from Processor Descriptions for Hardware/Software Codesign", 1997, DAC'97, pages 303 - 306, XP010227598 *
JIN-HYUK YANG ET AL: "MetaCore: an application specific DSP development system", DESIGN AUTOMATION CONFERENCE, 1998. PROCEEDINGS SAN FRANCISCO, CA, USA 15-19 JUNE 1998, NEW YORK, NY, USA,IEEE, US, 15 June 1998 (1998-06-15), pages 800 - 803, XP010309324, ISBN: 978-0-89791-964-7, DOI: 10.1109/DAC.1998.724580 *
NURMI J ET AL: "A new generation of parameterized and extensible DSP cores", SIGNAL PROCESSING SYSTEMS, 1997. SIPS 97 - DESIGN AND IMPLEMENTATION., 1997 IEEE WORKSHOP ON LEICESTER, UK 3-5 NOV. 1997, NEW YORK, NY, USA,IEEE, US, 3 November 1997 (1997-11-03), pages 320 - 329, XP010249792, ISBN: 978-0-7803-3806-7, DOI: 10.1109/SIPS.1997.626254 *
NURPRASETYO E F ET AL: "SOFT-CORE PROCESSOR ARCHITECTURE FOR EMBEDDED SYSTEM DESIGN", IEICE TRANSACTIONS ON ELECTRONICS, ELECTRONICS SOCIETY, TOKYO, JP, vol. E81-C, no. 9, 1 September 1998 (1998-09-01), pages 1416 - 1423, XP000851263, ISSN: 0916-8524 *
SATO J ET AL: "PEAS-I: A HARDWARE/SOFTWARE CODESIGN SYSTEM FOR ASIP DEVELOPMENT", IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS,COMMUNICATIONS AND COMPUTER SCIENCES, ENGINEERING SCIENCES SOCIETY, TOKYO, JP, vol. E77-A, no. 3, 1 March 1994 (1994-03-01), pages 483 - 491, XP000450885, ISSN: 0916-8508 *
See also references of WO0046704A2 *
SHACKLEFORD B ET AL: "SATSUKI: AN INTEGRATED PROCESSOR SYNTHESIS AND COMPILER GENERATION SYSTEM", IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, INFORMATION & SYSTEMS SOCIETY, TOKYO, JP, vol. E79-D, no. 10, 1 October 1996 (1996-10-01), pages 1373 - 1381, XP000635525, ISSN: 0916-8532 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8775125B1 (en) 2009-09-10 2014-07-08 Jpmorgan Chase Bank, N.A. System and method for improved processing performance
US10365986B2 (en) 2009-09-10 2019-07-30 Jpmorgan Chase Bank, N.A. System and method for improved processing performance
US11036609B2 (en) 2009-09-10 2021-06-15 Jpmorgan Chase Bank, N.A. System and method for improved processing performance
US10558437B1 (en) * 2013-01-22 2020-02-11 Altera Corporation Method and apparatus for performing profile guided optimization for high-level synthesis
US10084456B2 (en) 2016-06-18 2018-09-25 Mohsen Tanzify Foomany Plurality voter circuit

Also Published As

Publication number Publication date
JP2003518280A (ja) 2003-06-03
JP2007250010A (ja) 2007-09-27
KR20020021081A (ko) 2002-03-18
CN1382280A (zh) 2002-11-27
KR100874738B1 (ko) 2008-12-22
WO2000046704A3 (en) 2000-12-14
KR100775547B1 (ko) 2007-11-09
AU3484100A (en) 2000-08-25
TW539965B (en) 2003-07-01
WO2000046704A2 (en) 2000-08-10
KR20070088818A (ko) 2007-08-29

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20010823

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

17Q First examination report despatched

Effective date: 20080208

REG Reference to a national code

Ref country code: DE

Ref legal event code: R003

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20160401