WO2015078380A1 - Instruction set conversion system and method - Google Patents

Instruction set conversion system and method

Info

Publication number
WO2015078380A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
address
internal
block
external
Prior art date
Application number
PCT/CN2014/092313
Inventor
林正浩
Original Assignee
上海芯豪微电子有限公司
Priority date
Filing date
Publication date
Application filed by 上海芯豪微电子有限公司
Priority to JP2016534248A (JP6591978B2, ja)
Priority to US15/100,250 (US10387157B2, en)
Priority to KR1020167017252A (KR20160130741A, ko)
Priority to EP14865998.0A (EP3076288A4, en)
Publication of WO2015078380A1 (zh)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802 Instruction prefetching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0875 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with dedicated cache, e.g. instruction or stack
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/3004 Arrangements for executing specific machine instructions to perform operations on memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/3017 Runtime instruction translation, e.g. macros
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/45 Caching of specific data in cache memory
    • G06F2212/452 Instruction code

Definitions

  • The invention relates to the fields of computers, communications, and integrated circuits.
  • The function of a virtual machine is to translate or interpret a program written in an instruction set that the processor core does not support (the external instruction set) and to generate corresponding instructions in the instruction set that the processor core does support (the internal instruction set) for execution.
  • With interpretation, the virtual machine extracts the fields of each external instruction, including the opcode and the operands, one after another in real time while the program runs, and then uses a stack structure implemented in memory to operate on the operands according to the opcode. Many internal instructions must therefore be executed to implement the function of a single external instruction, which is inefficient.
  • With translation, a process similar to software compilation runs before the program is executed and converts the program into a form composed entirely of internal instructions. This is more efficient at execution time, but the software compilation itself still incurs considerable overhead.
  • The second solution is to include in the processor core an instruction decoder for each instruction set, so that instructions of different instruction sets are decoded by the corresponding decoder before the subsequent pipeline operations are performed.
  • This method loses almost no execution efficiency, but the additional instruction decoders increase hardware overhead and the cost of the processor chip.
  • The third solution is to add a conversion module outside the processor core that converts the external instruction set into the internal instruction set for execution by the processor core.
  • Such a conversion module can be implemented in software. Software interpretation is easy to extend, but its efficiency is too low.
  • The conversion module can also be implemented in hardware, but it is then difficult to extend and cannot fully exploit a cache to store the internal instructions obtained by conversion.
  • If the conversion module is located between the cache and the processor core, the cache stores external instructions, which must be converted before the processor core can execute them. The conversion step is then performed whether or not the cache hits, and the same external instruction is converted again and again. This not only increases power consumption but also deepens the processor core pipeline, which increases hardware overhead and the performance loss on a branch misprediction.
  • Alternatively, if the cache stores converted internal instructions, the cache is addressed by internal instruction address, whereas the branch target instruction address that the processor core calculates when it executes a branch instruction is an external instruction address. Because internal and external instructions do not correspond one to one (for example, one external instruction can correspond to multiple internal instructions), the correspondence between internal and external instruction addresses must be recorded so that, on a branch, the external instruction address of the branch target can be translated into an internal instruction address and the correct instruction found in the cache.
  • The difficulty in recording the correspondence between internal and external instruction addresses is how to store it efficiently and how to convert addresses efficiently.
  • The instruction can only be read from lower-level memory outside the conversion module according to the external instruction address, then converted by the conversion module and stored in the cache for execution by the processor core, which still seriously affects execution efficiency.
  • One way to solve this problem is to use a trace cache, which is based on program execution paths, instead of a traditional cache based on address matching.
  • However, a trace cache stores many instructions that have the same address but lie on different paths, which wastes a large amount of capacity and results in low trace cache performance.
  • The method and system proposed by the present invention can directly address one or more of the above or other difficulties.
  • The invention provides an instruction set conversion method comprising: converting external instructions into internal instructions and establishing a mapping relationship between external instruction addresses and internal instruction addresses; storing the internal instructions in a cache that the processor core can access directly; and either reading the corresponding internal instruction directly from the cache according to an internal instruction address for execution by the processor core, or converting an external instruction address output by the processor core into an internal instruction address according to the mapping relationship and then addressing the cache to read the corresponding internal instruction for execution by the processor core.
  • The processor core is provided with subsequent instructions according to the program execution flow and feedback from the processor core's instruction execution; this feedback may be a signal, generated when the processor core executes a branch instruction, indicating whether the branch was taken.
  • For an external instruction that needs to be converted, each instruction field of the external instruction, including the instruction type, is extracted; the instruction type of the corresponding internal instruction and the instruction conversion control information are then looked up according to the extracted instruction type.
  • One external instruction may be converted into one internal instruction, in which case the instruction address of the external instruction corresponds to the instruction address of the internal instruction; or one external instruction may be converted into multiple internal instructions, in which case the instruction address of the external instruction corresponds to the instruction address of the first of those internal instructions.
  • Multiple external instructions may also be converted into one internal instruction, in which case the instruction address of the first of those external instructions corresponds to the instruction address of the internal instruction.
  • A mapping relationship between external instruction addresses and internal instruction addresses is established.
  • The mapping relationship between external and internal instruction addresses includes a mapping between external instruction addresses and internal instruction block addresses, and a mapping between addresses within an external instruction block and addresses within an internal instruction block.
  • The mapping between external instruction addresses and internal instruction block addresses may be represented by a data structure that stores internal instruction block addresses, sorted by external instruction block address and by address within the external instruction block at the same time.
  • The corresponding location in the data structure can be found according to the external instruction block address and the address within the external instruction block contained in an external instruction address, and the internal instruction block address stored there can be read.
  • An insertion position may likewise be found according to the external instruction block address and the address within the external instruction block contained in an external instruction address, and the internal instruction block address corresponding to that external instruction address is stored at that position.
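The sorted lookup and insertion described above can be sketched in software as follows. This is only a minimal illustration under stated assumptions, not the patent's hardware structure: the class name `BlockAddressMap`, its methods, and the example addresses are hypothetical, and a sorted Python list searched with `bisect` stands in for whatever ordered storage a real design would use.

```python
import bisect


class BlockAddressMap:
    """Hypothetical sorted table: (external block, external offset) -> internal block."""

    def __init__(self):
        self._keys = []    # sorted list of (ext_block, ext_offset) tuples
        self._values = []  # internal block addresses, parallel to _keys

    def insert(self, ext_block, ext_offset, int_block):
        """Find the insertion position by key and store the internal block address."""
        key = (ext_block, ext_offset)
        pos = bisect.bisect_left(self._keys, key)
        if pos < len(self._keys) and self._keys[pos] == key:
            self._values[pos] = int_block  # key already present: overwrite
        else:
            self._keys.insert(pos, key)
            self._values.insert(pos, int_block)

    def lookup(self, ext_block, ext_offset):
        """Return the stored internal block address, or None if there is no entry."""
        key = (ext_block, ext_offset)
        pos = bisect.bisect_left(self._keys, key)
        if pos < len(self._keys) and self._keys[pos] == key:
            return self._values[pos]
        return None
```

Sorting on the pair keeps entries of the same external block adjacent, so one ordered search serves both the block address and the intra-block address.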
  • An external instruction address may be converted to obtain the corresponding internal instruction block address.
  • An address within an external instruction block can be converted to obtain the corresponding address within the internal instruction block.
  • For any external instruction address, forward shift logic starts from an initial value and counts the external instructions between the start address of the external instruction block containing the address and the external instruction address itself: the value is shifted forward by one bit for each external instruction passed, finally yielding a shift result. Reverse shift logic then starts from the start address of the internal instruction block corresponding to the external instruction block and counts the first internal instruction corresponding to each external instruction: the shift result is shifted backward by one bit for each such internal instruction passed, until it returns to the initial value. The intra-block address of the internal instruction reached at that point corresponds to the intra-block address of the external instruction.
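The forward and reverse shifting described above can be sketched as follows. This is a minimal software model under stated assumptions: the function name `map_offset` and the two flag lists are hypothetical, and integer shifts stand in for the hardware shift logic. `ext_starts[i]` marks whether external offset `i` begins an external instruction, and `int_firsts[j]` marks whether internal instruction `j` is the first internal instruction generated for some external instruction.

```python
def map_offset(ext_starts, int_firsts, ext_offset):
    """Map an external intra-block offset to an internal intra-block offset."""
    initial = 1
    shift = initial

    # Forward shift: one bit for each external instruction that begins
    # between the block start and the target offset.
    for i in range(ext_offset):
        if ext_starts[i]:
            shift <<= 1

    # Reverse shift: one bit for each "first" internal instruction passed,
    # until the value is back at the initial value.
    for j, is_first in enumerate(int_firsts):
        if is_first:
            if shift == initial:
                return j
            shift >>= 1

    raise ValueError("no internal instruction corresponds to this offset")


# External instructions begin at offsets 0, 2 and 5; the second one was
# converted into two internal instructions (internal offsets 1 and 2).
ext_starts = [True, False, True, False, False, True]
int_firsts = [True, True, False, True]
print(map_offset(ext_starts, int_firsts, 5))  # prints 3
```

Only the flags for instruction starts need to be stored per block, so the table stays small even when external instructions have variable length.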
  • A stack register operation is converted, by address calculation, into an operation on the register file, so that the register file inside the processor core can be used as a stack register.
  • The conversion can convert instructions of one or more instruction sets into instructions of a single instruction set.
  • The invention also proposes an instruction set conversion system comprising: a processor core for executing internal instructions; a converter for converting external instructions into internal instructions and establishing a mapping relationship between external and internal instruction addresses; an address mapping module for storing that mapping relationship and converting external instruction addresses into internal instruction addresses; and a cache for storing the converted internal instructions and outputting the corresponding internal instruction to the processor core for execution according to the internal instruction address.
  • The converter further includes: a memory for storing the correspondence between external and internal instruction types and the correspondence between the instruction fields of corresponding external and internal instructions; an aligner for shift-aligning external instructions and, when an external instruction crosses an instruction block boundary, shifting it into a single instruction block and aligning it; an extractor for extracting each instruction field of the external instruction, where the extracted instruction type is used to address the memory and read the instruction conversion control information and the internal instruction type corresponding to the external instruction, and the extracted instruction fields are shifted according to that control information; and an instruction splicer for splicing the internal instruction type and the shifted instruction fields to form the internal instruction.
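The extract, look up, shift, and splice steps can be sketched as follows. Everything concrete here is an assumption made for illustration: the 16-bit external format, the 32-bit internal format, the opcode values, and the names `CONVERSION_TABLE` and `convert` are hypothetical and do not come from the patent.

```python
# Hypothetical conversion memory, addressed by the external instruction type.
# Each entry holds the internal instruction type plus control information:
# (mask, source shift, destination shift) for every other instruction field.
CONVERSION_TABLE = {
    0x1: {
        "internal_type": 0xA0,
        "fields": [
            (0x0F00, 8, 16),  # destination register
            (0x00F0, 4, 8),   # source register
            (0x000F, 0, 0),   # immediate
        ],
    },
}


def convert(external):
    """Convert one 16-bit external instruction into a 32-bit internal one."""
    ext_type = (external >> 12) & 0xF        # extract the instruction type
    entry = CONVERSION_TABLE[ext_type]       # read type and control information
    internal = entry["internal_type"] << 24  # internal type goes in the top byte
    for mask, src_shift, dst_shift in entry["fields"]:
        field = (external & mask) >> src_shift  # extract the field
        internal |= field << dst_shift          # shift into place and splice
    return internal


print(hex(convert(0x1234)))  # prints 0xa0020304
```

Because the behaviour lives in the table rather than in fixed logic, supporting another external instruction set in this model only means loading different table contents.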
  • The address mapping module further includes: a block address mapping module for storing the mapping relationship between external instruction block addresses and internal instruction block addresses and converting an external instruction block address into an internal instruction block address; and an offset address mapping module for storing the mapping relationship between addresses within an external instruction block and addresses within an internal instruction block and converting an address within the external instruction block into an address within the internal instruction block.
  • The system further includes a tracking system, which addresses the cache according to the program execution flow stored in it and the feedback from the processor core's instruction execution, reads subsequent instructions from the cache, and sends them to the processor core for execution; the feedback may be a signal indicating whether the branch was taken when the processor core executes a branch instruction.
  • The address mapping module further includes forward shift logic and reverse shift logic. For any external instruction address, the forward shift logic starts from an initial value and counts the external instructions between the start address of the external instruction block containing the address and the external instruction address, shifting forward by one bit for each external instruction passed and finally yielding a shift result. The reverse shift logic starts from the start address of the internal instruction block corresponding to the external instruction block and counts the first internal instruction corresponding to each external instruction, shifting backward by one bit for each such internal instruction passed until the shift result returns to the initial value. The intra-block address of the internal instruction reached at that point corresponds to the intra-block address of the external instruction.
  • A register file in the processor core can be used as a stack register. The system further includes: a top-of-stack pointer register for storing the current top-of-stack pointer, which points to a register in the register file; an adder for calculating the top-of-stack pointer plus one, corresponding to the position of the register above the current top of the stack; a subtractor for calculating the top-of-stack pointer minus one, corresponding to the position of the register below the current top of the stack; and a stack-bottom control module for detecting whether the stack register is about to become empty or full. When the stack register is about to become full, the module sends the value of at least one register at the bottom of the stack to memory and adjusts the stack-bottom pointer accordingly so that the stack register does not overflow; when the stack register is about to become empty, it adjusts the stack-bottom pointer accordingly and restores to the bottom of the stack the value of at least one register previously sent to memory, so that the stack register can continue to supply operands for execution by the processor core.
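The pointer arithmetic and stack-bottom control can be modelled roughly as follows. This is a simplified sketch, not the patented circuit: the class `RegisterStack`, the four-entry register file, and the list used as backing memory are hypothetical, and the model spills or refills only at the moment a push or pop would otherwise fail rather than ahead of time.

```python
class RegisterStack:
    """Hypothetical model of a register file used as a circular stack."""

    def __init__(self, num_regs=4):
        self.regs = [0] * num_regs
        self.n = num_regs
        self.top = 0      # index of the register just above the top of stack
        self.bottom = 0   # index of the register at the bottom of the stack
        self.count = 0    # number of stack entries currently held in registers
        self.memory = []  # values spilled from the bottom of the stack

    def push(self, value):
        if self.count == self.n:
            # Full: send the bottom register to memory, move the bottom pointer.
            self.memory.append(self.regs[self.bottom])
            self.bottom = (self.bottom + 1) % self.n
            self.count -= 1
        self.regs[self.top] = value
        self.top = (self.top + 1) % self.n  # adder: top-of-stack pointer plus one
        self.count += 1

    def pop(self):
        if self.count == 0:
            if not self.memory:
                raise IndexError("pop from empty stack")
            # Empty: move the bottom pointer back and restore a spilled value.
            self.bottom = (self.bottom - 1) % self.n
            self.regs[self.bottom] = self.memory.pop()
            self.count += 1
        self.top = (self.top - 1) % self.n  # subtractor: pointer minus one
        self.count -= 1
        return self.regs[self.top]
```

Pushing six values into the four-register model spills the two oldest to memory, and popping six times returns all of them in last-in, first-out order.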
  • The instructions filled into the level 1 cache are examined and the corresponding instruction information is extracted; the first read pointer determines how to update according to this instruction information rather than the function of the instruction itself.
  • According to the result of the processor core's execution of the conditional branch instruction: if the branch is taken, the first read pointer is updated to the branch target addressing address of the conditional branch instruction; if the branch is not taken, the first read pointer is updated to the branch target addressing address of the unconditional branch instruction. The processor core therefore does not need a separate clock cycle to execute the unconditional branch instruction.
  • When the processor core executes a branch instruction, one of the sequential next instruction and the branch target instruction is selected according to branch prediction and executed as the subsequent instruction, and the addressing address of the other is saved. If the branch result agrees with the branch prediction, execution of the subsequent instructions continues; if it does not, the pipeline is flushed and execution restarts from the instruction at the saved addressing address.
  • The first read pointer determines how to update according to the instruction information rather than the function of the instruction itself.
  • The instruction information stored in the track point pointed to by the first read pointer and in the track points that follow it is read from the track table at the same time.
  • According to the result of the processor core's execution of the conditional branch instruction: if the branch is taken, the first read pointer is updated to the branch target addressing address of the conditional branch instruction; if the branch is not taken, the first read pointer is updated to the branch target addressing address of the unconditional branch instruction. The processor core therefore does not need a separate clock cycle to execute the unconditional branch instruction.
  • The tracking system further includes a register for storing the addressing address of one of the sequential next instruction and the branch target instruction. When the processor core executes a branch instruction, one of the two is selected according to branch prediction and executed as the subsequent instruction, and the addressing address of the other is stored in the register. If the branch result agrees with the branch prediction, execution of the subsequent instructions continues; if it does not, the pipeline is flushed and execution restarts from the instruction at the addressing address held in the register.
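The predict, save, and recover behaviour can be sketched with two small functions. The names `predict` and `resolve` and the integer addresses are hypothetical, and the sketch models only the address selection, not the pipeline itself.

```python
def predict(next_addr, target_addr, predict_taken):
    """Choose the address to follow speculatively and the one to save.

    Returns (speculative_addr, saved_addr).
    """
    if predict_taken:
        return target_addr, next_addr
    return next_addr, target_addr


def resolve(predict_taken, actually_taken, saved_addr):
    """Compare the branch result with the prediction.

    Returns (flush_pipeline, restart_addr); restart_addr is None when the
    prediction was correct and execution simply continues.
    """
    if predict_taken == actually_taken:
        return False, None
    return True, saved_addr


spec, saved = predict(next_addr=0x104, target_addr=0x200, predict_taken=True)
print(hex(spec), hex(saved))                    # prints 0x200 0x104
print(resolve(True, False, saved) == (True, 0x104))  # prints True
```

Saving the alternative address at prediction time means a misprediction can restart fetching without recomputing either address.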
  • An end track point is added after the last track point of each track in the track table. The instruction type of the end track point is unconditional branch, and its branch target addressing address is the addressing address of the first track point of the next track in sequential execution order. When the first read pointer points to the end track point, the level 1 cache outputs a null instruction.
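A track table with end track points, and a read pointer that advances using only the stored instruction information, might be modelled as follows. This is a rough sketch under stated assumptions: the names `TrackPoint`, `build_track`, and `next_pointer`, the three instruction kinds, and the (track, offset) address form are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Addr = Tuple[int, int]  # hypothetical addressing address: (track number, offset)


@dataclass
class TrackPoint:
    kind: str                     # "seq", "cond" or "uncond"
    target: Optional[Addr] = None # branch target addressing address, if any


def build_track(points, next_track):
    """Append an end track point: an unconditional branch to the next track."""
    return list(points) + [TrackPoint("uncond", (next_track, 0))]


def next_pointer(table, ptr, taken=False):
    """Update the read pointer from the track point's stored information."""
    track, offset = ptr
    point = table[track][offset]
    if point.kind == "uncond" or (point.kind == "cond" and taken):
        return point.target
    return (track, offset + 1)


table = {
    0: build_track([TrackPoint("seq"), TrackPoint("cond", (2, 1)), TrackPoint("seq")], next_track=1),
    1: build_track([TrackPoint("seq")], next_track=2),
    2: build_track([TrackPoint("seq"), TrackPoint("seq")], next_track=0),
}
print(next_pointer(table, (0, 3)))  # end track point of track 0: prints (1, 0)
```

Treating the end of a track as an unconditional branch lets one update rule handle both real branches and the step from one track to the next.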
  • An end track point is added after the last track point of each track in the track table. The instruction type of the end track point is unconditional branch, and its branch target addressing address is the addressing address of the first track point of the next track in sequential execution order. When the track point immediately before the end track point is not a branch point, the instruction type and branch target addressing address of the end track point may be used as the instruction type and branch target addressing address of that track point.
  • The present invention also provides a processor system capable of executing one or more instruction sets, comprising: a first memory for storing a plurality of computer instructions belonging to a first instruction set; an instruction converter for converting the plurality of computer instructions belonging to the first instruction set into a plurality of internal instructions belonging to a second instruction set; a second memory for storing the plurality of internal instructions converted by the instruction converter; and a processor core, coupled to the second memory, for reading the plurality of internal instructions from the second memory and executing them without accessing the plurality of computer instructions and without the participation of the instruction converter.
  • The instruction converter includes a memory that can be configured to store the mapping relationship between the first instruction set and the second instruction set; the instruction converter converts the plurality of computer instructions belonging to the first instruction set into the plurality of internal instructions belonging to the second instruction set according to that mapping relationship.
  • The system further includes an address converter, connecting the instruction converter and the processor core, for translating a target computer instruction address among the plurality of computer instructions into the internal address of the target instruction among the plurality of internal instructions.
  • When converting an address, the address converter: maps the target computer instruction address to an internal instruction block address; maps the target computer instruction address to an intra-block offset address within the internal instruction block corresponding to that block address; and merges the block address and the intra-block offset address to form the internal address.
  • The block address is generated according to a block address mapping relationship between computer instruction block addresses and internal instruction block addresses.
  • The block address mapping relationship is stored in the address converter, and the intra-block offset address is generated by hardware logic according to a mapping relationship table.
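The three steps (map the block address, map the intra-block offset, merge) can be sketched as follows. The 4-bit offset width, the dictionary-based tables, and the name `translate` are assumptions made for illustration; in the described system the offset would come from hardware logic rather than a dictionary.

```python
OFFSET_BITS = 4  # hypothetical width of the intra-block offset


def translate(ext_addr, block_map, offset_maps):
    """Translate a computer (external) instruction address to an internal address."""
    ext_block = ext_addr >> OFFSET_BITS
    ext_offset = ext_addr & ((1 << OFFSET_BITS) - 1)

    int_block = block_map[ext_block]               # block address mapping
    int_offset = offset_maps[ext_block][ext_offset]  # intra-block offset mapping

    return (int_block << OFFSET_BITS) | int_offset  # merge into one address


block_map = {0x12: 0x3}
offset_maps = {0x12: {0: 0, 2: 1, 5: 3}}
print(hex(translate(0x125, block_map, offset_maps)))  # prints 0x33
```

Splitting the translation this way keeps the block table small, since it needs one entry per block rather than one per instruction.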
  • The system further includes an end flag memory for storing the internal instruction address of the end instruction of an internal instruction block; the end instruction is the last internal instruction before transfer to the next internal instruction block at the sequential address.
  • The system further includes: a next-block address memory for storing the block address of the next internal instruction block at the sequential address; and a branch target buffer for storing the internal instruction address of the branch target.
  • The first memory stores a plurality of computer instructions belonging to a third instruction set; the instruction converter stores, as configured, a mapping relationship between the third instruction set and the second instruction set in its memory, and converts the plurality of computer instructions belonging to the third instruction set into the plurality of internal instructions belonging to the second instruction set according to that stored mapping relationship.
  • A first thread instruction sequence and a second thread instruction sequence run on the system, where the first thread instruction sequence consists of a plurality of computer instructions of the first instruction set and the second thread instruction sequence consists of a plurality of computer instructions of the third instruction set.
  • The instruction converter stores, as configured, both the mapping relationship between the first instruction set and the second instruction set and the mapping relationship between the third instruction set and the second instruction set in its memory at the same time. According to the thread number, the instruction converter selects one of the two mapping relationships and converts the plurality of computer instructions of that thread into the plurality of internal instructions belonging to the second instruction set.
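Selecting a mapping by thread number can be sketched as follows. The table contents, the thread-to-instruction-set assignment, and the name `convert_type` are hypothetical; only the instruction type is mapped here to keep the example short.

```python
# Both mapping relationships are held at the same time (hypothetical contents).
MAPPINGS = {
    "first_to_second": {0x1: 0xA0, 0x2: 0xA4},
    "third_to_second": {0x1: 0xB7, 0x2: 0xA0},
}

# Hypothetical configuration: which mapping each thread number uses.
THREAD_MAPPING = {0: "first_to_second", 1: "third_to_second"}


def convert_type(thread_id, external_type):
    """Pick the mapping for this thread, then map the instruction type."""
    mapping = MAPPINGS[THREAD_MAPPING[thread_id]]
    return mapping[external_type]


print(hex(convert_type(0, 0x1)), hex(convert_type(1, 0x1)))  # prints 0xa0 0xb7
```

The same external type value yields different internal types depending on the thread, which is what allows threads written for different instruction sets to share one processor core.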
  • Each of the plurality of computer instructions includes at least one instruction field whose content is an instruction type, and each of the plurality of internal instructions includes at least one instruction field whose content is an instruction type. The plurality of computer instructions and the plurality of internal instructions correspond one to one. The mapping relationship includes a mapping between the instruction type of each computer instruction and the instruction type of each internal instruction, and a mapping between the instruction fields other than the instruction type in each computer instruction and the instruction fields other than the instruction type in each internal instruction.
  • Each of the plurality of computer instructions includes at least one instruction field whose content is an instruction type, and each of the plurality of internal instructions includes at least one instruction field whose content is an instruction type. The mapping relationship includes shift logic, and at least one instruction field of the plurality of internal instructions is generated by shifting the corresponding instruction field of the corresponding computer instruction.
  • The instruction fields of a computer instruction include at least an instruction type, and the instruction converter uses at least the instruction type as a memory address to read the corresponding mapping relationship from the memory in the instruction converter.
  • The present invention also provides a method for a processor system executing one or more instruction sets, the method comprising: storing a plurality of computer instructions belonging to a first instruction set in a first memory; converting, by an instruction converter, the plurality of computer instructions into a plurality of internal instructions belonging to a second instruction set; storing the plurality of internal instructions converted by the instruction converter in a second memory; and reading and executing, by a processor core coupled to the second memory, the plurality of internal instructions from the second memory without accessing the plurality of computer instructions and without the participation of the instruction converter.
  • The instruction converter is configured by storing the mapping relationship between the first instruction set and the second instruction set in a memory of the instruction converter; the instruction converter converts the plurality of computer instructions belonging to the first instruction set into the plurality of internal instructions belonging to the second instruction set according to the mapping relationship stored in it.
  • A target computer instruction address among the plurality of computer instructions is converted into the internal address of the target instruction among the plurality of internal instructions by an address converter connecting the instruction converter and the processor core.
  • When converting an address, the address converter: maps the target computer instruction address to an internal instruction block address; maps the target computer instruction address to an intra-block offset address within the internal instruction block corresponding to that block address; and merges the block address and the intra-block offset address to form the internal address.
  • The block address is generated according to a block address mapping relationship between computer instruction block addresses and internal instruction block addresses.
  • The block address mapping relationship is stored in the address converter, and the intra-block offset address is generated by hardware logic according to a mapping relationship table.
  • The method further comprises: storing, by an end flag memory, the internal instruction address of the end instruction of an internal instruction block; the end instruction is the last internal instruction before transfer to the next internal instruction block at the sequential address.
  • The method further includes: storing, by a next-block address memory, the block address of the next internal instruction block at the sequential address; and storing, by a branch target buffer, the internal instruction address of the branch target.
  • The method further comprises: storing a plurality of computer instructions belonging to a third instruction set in the first memory; storing, by the instruction converter, a mapping relationship between the third instruction set and the second instruction set; and converting, by the instruction converter, the plurality of computer instructions belonging to the third instruction set into the plurality of internal instructions belonging to the second instruction set according to the stored mapping relationship between the third instruction set and the second instruction set.
  • The method further comprises: running a first thread instruction sequence and a second thread instruction sequence, where the first thread instruction sequence consists of a plurality of computer instructions of the first instruction set and the second thread instruction sequence consists of a plurality of computer instructions of the third instruction set; storing in the memory, by the instruction converter as configured, both the mapping relationship between the first instruction set and the second instruction set and the mapping relationship between the third instruction set and the second instruction set at the same time; and selecting one of the two mapping relationships according to the thread number and converting the plurality of computer instructions of that thread into the plurality of internal instructions belonging to the second instruction set.
  • each of the plurality of computer instructions includes at least one instruction field whose content is an instruction type; each of the plurality of internal instructions includes at least one instruction field whose content is an instruction type; the plurality of computer instructions and the plurality of internal instructions are in one-to-one correspondence; the mapping relationship includes a mapping relationship between the instruction type of each computer instruction and the instruction type of each internal instruction, and a mapping relationship between the instruction fields other than the instruction type in each computer instruction and the instruction fields other than the instruction type in each internal instruction.
  • each of the plurality of computer instructions includes at least one instruction field whose content is an instruction type; each of the plurality of internal instructions includes at least one instruction field whose content is an instruction type.
  • an instruction field of at least one of the plurality of internal instructions is generated by shifting a corresponding instruction field of the corresponding computer instruction.
  • the instruction fields of the computer instruction include at least one instruction type; and the instruction converter uses at least the instruction type as a memory address to read the corresponding mapping relationship stored in the instruction converter.
  • in the processor system of the present invention, the cache closest to the processor core (i.e., the higher-level cache) stores instructions of the internal instruction set supported by the processor core itself, while the main memory or the lower-level cache stores instructions of the external instruction set. By configuring the converter, the corresponding external instruction set can be converted into the internal instruction set for execution by the processor core, which makes it convenient to extend the instruction sets supported by the processor system.
  • the invention provides internal instructions to the processor core directly from the higher-level cache according to the program execution flow and the feedback from the processor core as it executes instructions, which reduces the pipeline depth and improves pipeline efficiency. In particular, when a branch is mispredicted, the number of wasted pipeline cycles can be reduced.
  • FIG. 1 is a schematic diagram of the processor system of the present invention.
  • FIG. 2 is an embodiment of the converter of the present invention.
  • FIG. 3A is an embodiment of the aligner of the present invention.
  • FIG. 3B is an embodiment of the operation of the aligner of the present invention.
  • FIG. 4A is an embodiment of the extractor of the present invention.
  • FIG. 4B is an embodiment of the operation of the extractor of the present invention.
  • FIG. 5A is a schematic diagram of mapping information according to the present invention.
  • FIG. 5B is another schematic diagram of mapping information according to the present invention.
  • FIG. 5C is an embodiment of the operation of the mapping information memory of the present invention.
  • FIG. 5D is another embodiment of the operation of the mapping information memory of the present invention.
  • FIG. 5E is another embodiment of the operation of the mapping information memory of the present invention.
  • FIG. 5F is an embodiment of the instruction splicer of the present invention.
  • FIG. 6 is an embodiment of a processor system including a multi-layer cache of the present invention.
  • FIG. 7A is an embodiment of a track-table-based cache structure of the present invention.
  • FIG. 7B is an embodiment of the scan converter of the present invention.
  • FIG. 8A is a schematic diagram of the correspondence between an external instruction block and an internal instruction block according to the present invention.
  • FIG. 8B is an embodiment of the storage form of the offset address mapping relationship of the present invention.
  • FIG. 8C is an embodiment of the offset address converter of the present invention.
  • FIG. 8D is an embodiment of the block address mapping module of the present invention.
  • FIGS. 9A-9F are schematic diagrams of the operation of the processor system including the multi-layer cache according to the present invention.
  • FIG. 10A is an embodiment of the operand stack of the present invention.
  • FIG. 10B is an embodiment of updating the stack according to the present invention.
  • FIG. 10C is another embodiment of updating the stack according to the present invention.
  • FIG. 11A is another embodiment of a track-table-based cache structure of the present invention.
  • FIG. 11B is an embodiment of the present invention supporting speculative execution.
  • FIG. 12 is an embodiment of a processor system including a configurable converter of the present invention.
  • FIG. 13A is a block diagram embodiment of the configurable converter of the present invention.
  • FIG. 13B is an embodiment of a memory in the configurable converter of the present invention.
  • FIG. 13C is another embodiment of a memory in the configurable converter of the present invention.
  • FIG. 14 is an embodiment of a processor system including a configurable converter and an address mapping module of the present invention.
  • FIG. 15 is another embodiment of a processor system including a configurable converter and an address mapping module of the present invention.
  • FIG. 16 is an embodiment of a processor system including a branch target table of the present invention.
  • FIG. 17 is another embodiment of a processor system including a branch target table and a tracker of the present invention.
  • FIG. 18A is an embodiment of the lower block address memory format of the present invention.
  • FIG. 18B is another embodiment of the lower block address memory format of the present invention.
  • FIG. 18C is a schematic diagram of the external instruction address format in a processor system with two storage levels.
  • FIG. 19 is an embodiment of a processor system including a two-layer instruction memory of the present invention.
  • FIG. 20 is a schematic diagram of the tag memory structure in a processor system with two storage levels of the present invention.
  • FIG. 21 is an embodiment of an instruction memory storing internal instructions in the case where external instruction boundaries are not aligned, according to the present invention.
  • FIG. 23 is an embodiment of a processor system including a track table of the present invention.
  • FIG. 24 is an embodiment of a processing system that implements a stack operation function using a register file in accordance with the present invention.
  • Figure 14 shows a preferred embodiment of the invention.
  • Instruction Address refers to the memory address of the instruction in the main memory, that is, the instruction can be found in the main memory according to the address.
  • for simplicity, the virtual address is assumed to be equal to the physical address; the method of the present invention is also applicable where address mapping is required.
  • the current instruction may refer to an instruction currently being executed or fetched by the processor core; the current instruction block may refer to an instruction block containing an instruction currently being executed by the processor.
  • the term 'external instruction set' (Guest Instruction Set) represents the instruction set of the program executed by the processor system of the present invention; an instruction contained in the 'external instruction set' is an 'external instruction';
  • the term 'internal instruction set' (Host Instruction Set) represents the instruction set supported by the processor core itself in the processor system of the present invention; an instruction contained in the 'internal instruction set' is an 'internal instruction';
  • the term 'instruction block' represents a group of consecutive instructions whose instruction addresses share the same high-order bits.
  • the term 'instruction field' refers to a contiguous region (field) of the instruction word that represents a single item of content, such as the first opcode (op-code) field, the second opcode field, the first source register field, the second source register field, the destination register (target register) field, the immediate field, and so on.
  • the internal instruction set is a fixed-length instruction set, that is, the word length of each internal instruction is fixed (e.g., 32 bits);
  • the external instruction set can be a fixed-length instruction set or a variable-length instruction set.
  • if the external instruction set is variable length and the high-order address bits of the bytes occupied by a variable-length external instruction are not all the same, that is, the instruction spans two instruction blocks, then the external instruction is treated as the last instruction of the former instruction block, and the instruction following it is the first instruction of the latter instruction block.
  • a branch instruction or a branch point refers to any appropriate form of instruction that can cause the processor core to change the execution flow (e.g., to execute instructions or micro-operations out of sequential order).
  • the branch instruction address refers to the instruction address of the branch instruction itself, and the address is composed of the instruction block address and the instruction offset address.
  • the branch target instruction refers to the instruction to which execution is transferred when the branch of a branch instruction is taken, and the branch target instruction address refers to the instruction address of the branch target instruction.
  • each external instruction is first converted into one or more internal instructions, or a plurality of external instructions are converted into one or more internal instructions, which are then executed by the processor core, thereby achieving the same function as executing the external instructions directly.
  • FIG. 1 is a schematic diagram of a processor system of the present invention.
  • the memory 103 stores the executable code of the program to be executed, and the executable code is composed of instructions of the external instruction set; each external instruction is first sent to the converter 200, converted into the corresponding one or more internal instructions, and then sent to the processor core 101 for execution.
  • the converter 200 can have a fixed structure, that is, it supports converting only one specific external instruction set into the internal instruction set; it can also be configurable, that is, one or more external instruction sets can be converted into the internal instruction set according to the configuration.
  • a fixed-structure converter is a special case of a configurable converter; therefore, only the configurable converter is described in this specification.
  • the converter 200 is composed of the memory 201, the aligner 203, the extraction array 205, the instruction splicer 207, and the opcode splicer 209.
  • the aligner 203 shifts and aligns external instructions; if an external instruction crosses an instruction block boundary, the aligner shifts it into a single instruction block and aligns it.
  • FIG. 3A is an embodiment of the aligner of the present invention.
  • the aligner 203 is composed of the controller 301, the buffers 303 and 305, and the cyclic shifter 307.
  • this embodiment uses two buffers to store two consecutive instruction blocks. In this way, an external instruction being processed either lies entirely within the instruction block in buffer 303 or crosses the instruction block boundary (i.e., the head of the instruction is at the end of the instruction block in buffer 303 and the remainder is at the beginning of the instruction block in buffer 305).
  • selectors 312, 314, 316, 318, and 320 each correspond to one byte position, in order from left to right; under the control of decoder 327, each selects the content of its corresponding byte from buffer 303 or buffer 305 and sends it to the input of the cyclic shifter 307.
  • the controller 301 has a register 321 and an adder 323, each m bits wide, where 2^m equals the width in bytes of the buffers 303 and 305.
  • the register 321 stores the start offset address (SA, Start Address) of the external instruction currently being converted.
  • SA is decoded by the decoder 327 into selection signals that control the output selectors 312, 314, 316, 318, and 320 of buffers 303 and 305: bytes whose offset address is greater than or equal to SA are selected from buffer 303, bytes whose offset address is smaller than SA are selected from buffer 305, and the selected bytes are sent to the cyclic shifter 307.
  • SA is also sent over bus 313 to the cyclic shifter 307 as the shift amount.
  • in the input of the cyclic shifter, the portion whose offset address is greater than or equal to SA is the head 353 of the external instruction, and the portion whose offset address is smaller than SA is the tail 355 of the external instruction, possibly followed by part of the subsequent external instruction. Therefore, the cyclic shifter 307 rotates its input left by the shift amount received from bus 313 (i.e., SA), which moves the head 353 of the external instruction to the start of the instruction block and places the tail 355 immediately to its right in the same instruction block, and then outputs the resulting instruction block.
  • FIG. 3B is an embodiment of the operation process of the aligner of the present invention.
  • external instruction 351 crosses an instruction block boundary: its head 353 is located in instruction block 357 and its tail 355 is located in instruction block 359.
  • instruction blocks 357 and 359 are stored in buffers 303 and 305, respectively; the selectors pick bytes from them and splice the bytes into instruction block 361, which is the input to the cyclic shifter 307.
  • instruction block 361 consists of three parts, from left to right: the tail 355 of external instruction 351, a part 363 of the instruction following external instruction 351, and the head 353 of external instruction 351.
  • the shifter 307 rotates the block left by a shift amount equal to the offset address, within the instruction block, of the start byte of the head 353, so that the start of external instruction 351 is aligned with the start of the instruction block.
  • the cyclically shifted instruction block 365 contains the complete external instruction 351.
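The aligner operation of FIG. 3B can be modeled in a few lines. The sketch below is only a behavioral illustration under stated assumptions: instruction blocks are modeled as equal-length `bytes` objects, and the function name `align` and the sample byte values are hypothetical.

```python
def align(block_a: bytes, block_b: bytes, sa: int) -> bytes:
    """Behavioral model of the aligner.

    block_a models buffer 303, block_b models buffer 305, and sa is the
    start offset (SA) of the external instruction inside block_a.
    """
    if len(block_a) != len(block_b):
        raise ValueError("both buffers must hold blocks of the same size")
    n = len(block_a)
    # Per-byte selectors: offsets >= SA come from buffer 303 (the head),
    # offsets < SA come from buffer 305 (the tail and what follows it).
    spliced = bytes(block_a[i] if i >= sa else block_b[i] for i in range(n))
    # Cyclic left shift by SA bytes moves the head to the block start.
    return spliced[sa:] + spliced[:sa]


# Hypothetical 8-byte blocks: the instruction head "XYZ" sits at offsets 5..7
# of the first block, and its tail "UV" at offsets 0..1 of the second block.
aligned = align(b"aaaaaXYZ", b"UVnnnnnn", 5)
```

After the call, the instruction bytes `XYZUV` are contiguous at the start of the result, followed by bytes of the next instruction, which matches the layout of instruction block 365.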
  • the external instruction shifted by the aligner 203 is sent to the extraction array 205, where each instruction field is extracted according to the instruction type.
  • the extraction array 205 consists of several extractors of identical structure. The number of extractors is greater than or equal to the maximum number of instruction fields contained in any instruction of the external instruction set. If, across all external instruction sets supported by the processor system of the present invention, an instruction contains at most n instruction fields, the extraction array 205 is composed of n extractors; each extractor receives the same external instruction as input and outputs the extracted information according to the control signals sent from the memory 201.
  • FIG. 4A is an embodiment of the extractor of the present invention.
  • the extractor is composed of the cyclic shifter 401 and the masker 403.
  • the cyclic shifter 401 cyclically shifts the input external instruction word by the received shift amount, thereby moving a specific instruction field of the instruction to the corresponding position.
  • the masker 403 then performs a bitwise AND operation on the shifted instruction and the mask word, so that the output of the extractor is '0' everywhere except in the specific instruction field. In this way, an instruction field of the external instruction can be moved to the position of the corresponding instruction field of the internal instruction.
  • FIG. 4B is an embodiment of the operation process of the extractor of the present invention.
  • this embodiment explains the shifting and masking of the instruction field 453 in the external instruction 451.
  • the shift amount of the cyclic shifter 401 is equal to the difference between the positions of the instruction field in the external instruction and in the internal instruction.
  • for example, instruction field 453 occupies bits 10, 11, and 12 of the external instruction 451, while in the corresponding internal instruction the field should occupy bits 6, 7, and 8; the corresponding shift is therefore a left shift by 4 bits (i.e., '10' minus '6').
  • the external instruction 451 is shifted by the cyclic shifter 401 to obtain the post-shift instruction 455 shown in FIG. 4B.
  • the instruction field is now located at bits 6, 7, and 8, as in the internal instruction. Therefore, bits 6, 7, and 8 of the mask word 457 are all '1', and the other bits are '0'.
  • the post-shift instruction 455 is bitwise ANDed with the mask word 457 in the masker 403 to give the output of the extractor, in the form of the output 459 shown in FIG. 4B.
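The worked example of FIG. 4B (field at bits 10 to 12 moved to bits 6 to 8) can be reproduced numerically. The sketch below assumes a 32-bit instruction word with bit 0 as the leftmost (most significant) bit, which is consistent with a left shift moving bit 10 to bit 6 but is not stated explicitly in the text; the names `rotl`, `extract`, and the sample instruction word are hypothetical.

```python
WORD_BITS = 32                      # assumed instruction word width
WORD_MASK = (1 << WORD_BITS) - 1


def rotl(word: int, amount: int) -> int:
    """Cyclic left shift of a WORD_BITS-wide word (models shifter 401)."""
    amount %= WORD_BITS
    return ((word << amount) | (word >> (WORD_BITS - amount))) & WORD_MASK


def extract(word: int, shift: int, mask_word: int) -> int:
    """Shift, then bitwise AND with the mask word (models masker 403)."""
    return rotl(word, shift) & mask_word


# Bits are numbered from the left, so a 3-bit field starting at bit p
# sits at integer bit position WORD_BITS - p - 3.
field_value = 0b101
external = 0xF0000000 | (field_value << (WORD_BITS - 10 - 3)) | 0xFF
mask_word = 0b111 << (WORD_BITS - 6 - 3)   # '1' only at bits 6, 7, 8
output = extract(external, 10 - 6, mask_word)
```

All bits of `output` outside bits 6 to 8 are zero, so the outputs of several extractors can later be combined without interfering with each other.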
  • some of the extractors in the extraction array 205 are used to extract the opcode fields of the external instruction, and the others are used to extract its other instruction fields. For example, assuming that the instructions of the external instruction set contain at most three opcode fields, extractors 211, 213, and 215 in the extraction array 205 are used to extract the opcode fields (and are called opcode extractors), and the remaining extractors (such as extractors 221, 223, 225, and 227) are used to extract the other instruction fields (and are called other-field extractors).
  • the opcode fields extracted by the extractors 211, 213, and 215 are shifted to different, non-overlapping positions and sent to the opcode splicer 209, which performs a bitwise OR operation to obtain the complete opcode.
  • the complete opcode is sent to the memory 201 as the addressing address.
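Because the extracted opcode fields occupy disjoint bit positions, the opcode splicer reduces to a bitwise OR. The following sketch illustrates this; the overlap check and the name `splice_opcode` are additions for the example and are not part of the described hardware.

```python
def splice_opcode(parts):
    """OR together opcode fragments that occupy disjoint bit positions."""
    opcode = 0
    for part in parts:
        # The text requires the fragments not to overlap; in this model an
        # overlap is treated as a configuration error.
        if opcode & part:
            raise ValueError("opcode fields overlap")
        opcode |= part
    return opcode


# Three hypothetical fragments, as produced by extractors 211, 213 and 215.
full_opcode = splice_opcode([0b1100_0000, 0b0011_0000, 0b0000_0101])
```

The resulting value is what would be presented to memory 201 as the addressing address.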
  • the control signals of the extractors come from the corresponding registers.
  • the control signal in register 212 is selected by selector 222 to control extractor 211; the control signal in register 214 is selected by selector 224 to control extractor 213; and the control signal in register 216 is selected by selector 226 to control extractor 215.
  • the memory 201 consists of several rows of mapping information, divided into a direct access area and an indirect access area. Each row of mapping information corresponds to one addressing address. Since each addressing address corresponds to one complete internal instruction opcode, one or more rows of mapping information correspond to one or more external instructions of the external instruction set, and the rows store the corresponding extraction information.
  • the extraction information includes the opcode of the internal instruction corresponding to the external instruction, the start position and width of each instruction field other than the opcode field, the positional relationship between each such instruction field and the corresponding instruction field of the internal instruction, and the like.
  • the direct access area of the memory 201 can be addressed directly by the opcode of the external instruction to find the corresponding row of mapping information. Specifically, the complete opcode output by the opcode splicer 209 is used as the addressing address into the direct access area, and the mapping information in the corresponding row is read out.
  • the indirect access area of the memory 201 must be accessed using the index value (i.e., row address information) contained in another row of mapping information. For example, when an external instruction corresponds to a plurality of internal instructions, the mapping information for the first of these internal instructions can be read from the direct access area using the complete opcode of the external instruction as the addressing address, and the first internal instruction is converted accordingly.
  • that mapping information contains the index value, in the indirect access area, of the mapping information for the second of the plurality of internal instructions. According to the index value, the mapping information for the second internal instruction can therefore be found in the indirect access area, and the second internal instruction is converted. This is repeated until the last of the plurality of internal instructions has been converted.
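The lookup sequence just described (one direct access by opcode, then zero or more indirect accesses by index) behaves like walking a linked list. The sketch below models it with two Python dictionaries; the row layout (`internal_opcode`, `next_index`) and the sample opcodes are hypothetical and stand in for the richer mapping information held in memory 201.

```python
def mapping_rows(direct_area, indirect_area, external_opcode):
    """Collect every mapping row that belongs to one external instruction."""
    # First row: addressed directly by the complete external opcode.
    row = direct_area[external_opcode]
    rows = [row]
    # Later rows: addressed by the index stored in the previous row.
    while row["next_index"] is not None:
        row = indirect_area[row["next_index"]]
        rows.append(row)
    return rows


# Hypothetical tables: external opcode 0x2A expands to three internal
# instructions, external opcode 0x10 maps to a single one.
direct_area = {
    0x2A: {"internal_opcode": "LOAD", "next_index": 0},
    0x10: {"internal_opcode": "NOP", "next_index": None},
}
indirect_area = {
    0: {"internal_opcode": "ADD", "next_index": 1},
    1: {"internal_opcode": "STORE", "next_index": None},
}
sequence = [r["internal_opcode"]
            for r in mapping_rows(direct_area, indirect_area, 0x2A)]
```

Here `None` plays the role of the end marker; in the described hardware that role is taken by the end flag of each row.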
  • FIG. 5A is a schematic diagram of mapping information according to the present invention.
  • the single row of mapping information shown corresponds to one external instruction; that is, the external instruction corresponds to one internal instruction.
  • mapping information 501 is composed of an internal instruction opcode 503, an external instruction length 505, several items of extractor configuration information (such as extractor configuration information 507, 509, 511, and 513), and an end flag 515.
  • the internal instruction opcode 503 is the opcode of the internal instruction corresponding to the external instruction.
  • the external instruction length 505 is the instruction word length of the external instruction itself; it is sent to the aligner 203 as the external instruction length value 325, where it is added to the start point of the current instruction to calculate the start point of the next external instruction.
  • the end flag 515 stores all '0's to indicate that this row is the last row of internal instruction mapping information corresponding to the external instruction.
  • the number of items of extractor configuration information is the same as the number of extractors, in one-to-one correspondence.
  • each item of extractor configuration information consists of three parts: the shift amount (R), the start position of the '1' bits in the mask value (B), and the number of '1' bits in the mask value (W).
  • the shift amount R is sent to the corresponding extractor to control the shift of the cyclic shifter 401; the start position B and the number W determine the positions of the '1' bits in the mask value, that is, the W consecutive mask bits starting from B are '1' and the remaining mask bits are '0'.
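A single row of mapping information and its use can be modeled as a small data structure. The sketch below is an assumption-laden illustration: it uses a 32-bit word with bit 0 as the leftmost bit, assumes the stored internal opcode is already in its final bit position, and assumes the extractor outputs are merged with the opcode by bitwise OR. The names `MappingRow`, `mask_from`, and `convert` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

WORD_BITS = 32                      # assumed instruction word width
WORD_MASK = (1 << WORD_BITS) - 1


@dataclass
class MappingRow:
    internal_opcode: int                     # field 503, already positioned
    external_length: int                     # field 505, in bytes
    extractors: List[Tuple[int, int, int]]   # one (R, B, W) per extractor
    end_flag: int = 0                        # field 515


def rotl(word: int, amount: int) -> int:
    """Cyclic left shift of a WORD_BITS-wide word."""
    amount %= WORD_BITS
    return ((word << amount) | (word >> (WORD_BITS - amount))) & WORD_MASK


def mask_from(b: int, w: int) -> int:
    """W consecutive '1' bits starting at bit B, counted from the left."""
    return ((1 << w) - 1) << (WORD_BITS - b - w)


def convert(row: MappingRow, external_word: int) -> int:
    """Build one internal instruction word from one external word."""
    word = row.internal_opcode
    for r, b, w in row.extractors:
        word |= rotl(external_word, r) & mask_from(b, w)
    return word


# Hypothetical row: a 6-bit internal opcode, one field moved from bits
# 10..12 to bits 6..8, one 8-bit field kept in place, one unused extractor.
row = MappingRow(internal_opcode=0b101010 << 26, external_length=4,
                 extractors=[(4, 6, 3), (0, 24, 8), (0, 0, 0)])
internal_word = convert(row, 0xF02800FF)
```

An extractor with W equal to 0 produces an all-zero mask and therefore contributes nothing, which is one simple way to model extractors that a given instruction does not use.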
  • FIG. 5B is another schematic diagram of the mapping information according to the present invention.
  • the plurality of rows of mapping information shown correspond to one external instruction; that is, the external instruction corresponds to a plurality of internal instructions.
  • taking an external instruction that corresponds to three internal instructions as an example, the mapping information for the three internal instructions is mapping information 551, 561, and 571, respectively.
  • mapping information 551 is located in the direct access area of the memory 201 and can be addressed directly by the opcode extracted from the external instruction.
  • mapping information 561 and 571 are located in the indirect access area of the memory 201 and must be addressed according to index values stored in other mapping information (e.g., mapping information 551 in the direct access area).
  • like mapping information 501, mapping information 551 is composed of an internal instruction opcode 503, an external instruction length 505, several items of extractor configuration information (e.g., extractor configuration information 507, 509, 511, and 513), and an end flag 515.
  • mapping information 561 and 571 also include an internal instruction opcode 503, several items of extractor configuration information (such as extractor configuration information 507, 509, 511, and 513), and an end flag 515, but need not include the external instruction length 505.
  • the external instruction length 505 in mapping information 551 is the instruction word length of the external instruction itself; it is sent to the aligner 203 as the external instruction length value 325 and is used to calculate the start point of the next external instruction.
  • the end flag fields of mapping information 551 and 561 do not indicate an end; instead they hold the address of the next mapping information. Other arrangements are also possible.
  • for example, the instruction word length fields 505 of mapping information 551 and 561 may each store an index pointing to the subsequent mapping information. That is, the instruction word length field 505 of mapping information 551 stores the index value of mapping information 561 in the memory 201, and the instruction word length field 505 of mapping information 561 stores the index value of mapping information 571 in the memory 201.
  • in that case, mapping information 571, which is the information for the last of the plurality of internal instructions corresponding to the external instruction, gives the instruction length of the external instruction in its instruction word length field 505. As the last row of mapping information corresponding to the external instruction, mapping information 571 stores all '0's in its end flag 515. Thus, the first row of mapping information can be found from the complete opcode extracted by the opcode extractors, and then, under the control of the end flag 515 in each row of mapping information, the memory 201 can correctly output the mapping information of all internal instructions corresponding to one external instruction, so that the instruction set conversion is performed correctly.
  • using the complete opcode extracted by the opcode extractors as the addressing address, the corresponding internal instruction opcode 503 is read from the memory 201 and sent to the instruction splicer 207 via the bus 230, and the extraction information for each instruction field of the external instruction is read out and sent to the respective other-field extractors.
  • each other-field extractor moves the corresponding instruction field of the external instruction to a specific position according to the field start position, field width, and shift amount in the extraction information, and performs a mask operation so that the output of the other-field extractor is '0' everywhere except in the shifted instruction field.
  • the memory 201 is divided into a direct access area 531 and an indirect access area 533.
  • the addresses of the indirect access area are higher than those of the direct access area: if the address formed by the external instruction opcode is n bits wide, the address of the memory 201 is n+1 bits wide.
  • each row of mapping information in the memory 201 contains a two-bit end flag (composed of a Y bit and a Z bit), which indicates the conversion relationship between the external and internal instructions for that row of mapping information, that is, whether one external instruction corresponds to one internal instruction, one external instruction corresponds to multiple internal instructions, or multiple external instructions correspond to one internal instruction; it thereby controls how the converter handles the next instruction.
  • the value '00' of the flag 535 in FIG. 5C indicates that the row of mapping information corresponds only to the current external instruction, that is, one external instruction corresponds to one internal instruction;
  • the value '10' of the flag 545 in FIG. 5D indicates that the row of mapping information corresponds not only to the current external instruction but also to the next external instruction, that is, multiple external instructions correspond to one internal instruction;
  • the value '01' of the flag 555 in FIG. 5E indicates that the row of mapping information, together with the mapping information pointed to by the index value it contains, corresponds to the current external instruction, that is, one external instruction corresponds to multiple internal instructions.
  • the Y bit of the flag indicates whether the next external instruction is to be converted.
  • if the Y bit is '0', the conversion of the current external instruction (or of several consecutive external instructions including the current one) has been completed, and the next cycle begins the conversion of the next external instruction. If the Y bit is '1', the conversion of the current external instruction has not been completed; the next cycle continues that conversion, and the conversion of the next external instruction cannot yet start.
  • the flag is stored in register 537 while the index value in the row mapping information is stored in register 539.
  • when the current external instruction is processed, the flag of the previous external instruction stored in register 537 is used to control the selector 541 (controlled by the Y bit of the flag) and the address splicing logic 543 (controlled by the Z bit of the flag).
  • the Y output of register 537 in the figure controls a two-way selector. When the Y value is '0', the opcode from the external instruction is selected as the address of the memory 201; when the Y value is '1', the index value that was read from the memory 201 and is held in register 539 is selected, that is, the index value obtained for the previous instruction is used as the address of the memory 201 for the current conversion.
  • the Z value is spliced, as the high-order address bit, onto the address formed from the opcode of the external instruction. When the Z value is '0', the address presented to the memory 201 points to the direct access area; when the Z value is '1', the address presented to the memory 201 points to the indirect access area.
  • the circles in the figure indicate bus splicing.
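The address selection just described can be written as a small function. This is a sketch under assumptions: the opcode-derived address is taken to be n bits wide, the index value held in register 539 is assumed to be a complete (n+1)-bit address, and the function name `memory_address` is hypothetical.

```python
def memory_address(y: int, z: int, opcode: int, index: int, n: int) -> int:
    """Form the address presented to memory 201 for the current cycle."""
    # Address splicing logic 543: Z becomes the high-order (n+1)-th bit.
    spliced = (z << n) | (opcode & ((1 << n) - 1))
    # Selector 541: Y = 0 picks the opcode-derived address,
    # Y = 1 picks the index value saved from the previous mapping row.
    return index if y else spliced


N = 8  # assumed opcode address width
direct = memory_address(0, 0, 0x2A, 0x105, N)      # direct access area
indexed = memory_address(1, 0, 0x2A, 0x105, N)     # follow the saved index
high_half = memory_address(0, 1, 0x2A, 0x105, N)   # Z selects the upper half
```

With an 8-bit opcode address, values below 0x100 fall in the direct access area and values from 0x100 upward fall in the indirect access area.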
  • if the flag value YZ corresponding to the previous external instruction is '00', the internal instruction for the previous external instruction has been generated from its mapping information by the method described above, and the current external instruction must correspond to at least one new internal instruction.
  • in this case, one input of the address splicing logic 543 is the complete opcode of the current external instruction from the opcode splicer 209, and the other input is the Z bit ('0') of the flag in register 537; that is, a '0' is spliced in front of the complete opcode, so the output of the address splicing logic 543 is still the complete opcode of the current external instruction, which corresponds to an address in the direct access area 531.
  • the selector 541, controlled by the Y bit ('0') of the flag, selects the output of the address splicing logic 543 as the addressing address of the memory 201.
  • the mapping information corresponding to the current external instruction can thus be read from the direct access area 531 of the memory 201; the corresponding instruction fields are shifted and masked by the method described above and sent to the instruction splicer 207.
  • since the Y bit in the flag is '0', the next cycle can start converting the next external instruction.
  • FIG. 5F is an embodiment of the instruction splicer of the present invention.
  • Register 563 stores the internal instruction obtained once conversion is complete, or the intermediate result obtained while conversion is not yet complete.
  • The Z bit of the flag is stored in register 561 and sent to AND logic 567 in the next cycle; its inverse, output by inverter 565, is a signal indicating whether the internal instruction in register 563 has been fully converted.
  • The other input of AND logic 567 is the value stored in register 563; the output of AND logic 567 is sent to OR logic 569.
  • The other input of OR logic 569 is the shift-and-mask result of the extractors, delivered on bus 559.
  • The output of register 563 is the output 667 of instruction splicer 207.
  • When the flag value corresponding to the previous external instruction is '10', that external instruction corresponds to multiple internal instructions, and the internal instruction produced from the previous mapping information does not complete the conversion. The current external instruction therefore cannot be converted until conversion of the previous external instruction is finished.
  • In this case, register 539 holds the index value contained in the previous mapping information, that is, the address in indirect access area 533 of the mapping information that follows it (both pieces of mapping information correspond to the previous external instruction).
  • Selector 541, controlled by the Y bit ('1') of the flag, selects the index value output by register 539.
  • The address in memory 201 corresponding to the index value lies in indirect access area 533. The mapping information for the previous external instruction can therefore be read from indirect access area 533, and the corresponding instruction fields are shifted and masked as described above and sent to instruction splicer 207. Because the Y bit of the flag is '1', conversion of the same external instruction continues in the next cycle, and conversion of the next external instruction cannot begin.
  • The Z bit of the flag is '0', so in the next cycle the output of AND logic 567 is '0' and the output of OR logic 569 is simply the shift-and-mask result of the extractors.
  • These results are spliced in register 563 into a complete internal instruction.
  • The output of inverter 565 is then '1' (the inverse of the Z bit above), indicating that conversion is complete and that register 563 holds the converted internal instruction.
  • Thus, while one external instruction is being converted into multiple internal instructions, one of those internal instructions is generated at a time and output in the following cycle.
  • When the flag value corresponding to the previous external instruction is '01', that external instruction and the one that follows it (the current external instruction) correspond to the same internal instruction. Conversion must therefore continue with the current external instruction until the single internal instruction corresponding to these external instructions has been generated.
  • In this case, one input of address splicing logic 543 is the complete opcode of the current external instruction from opcode splicer 209, and the other input is the Z bit ('1') of the flag in register 537. An extra high-order address bit is thus spliced in front of the complete opcode, so the output of address splicing logic 543 is an address in indirect access area 533.
  • Selector 541, controlled by the Y bit ('0') of the flag, selects the output of the splicing logic as the address of memory 201.
  • The corresponding mapping information, that is, the mapping information for the previous and current external instructions together, can then be read from indirect access area 509 of memory 201. The corresponding instruction fields are then shifted and masked as previously described and sent to instruction splicer 207.
  • The Z bit of the flag is '1', so in the next cycle the output of AND logic 567 is the value stored in register 563 (the intermediate conversion result), and the output of OR logic 569 is the combination (for example, a bitwise OR) of the current extractor shift-and-mask result with that intermediate result.
  • The converter then begins converting the next external instruction and repeats the above process. The shifted and masked instruction fields of successive external instructions are combined by OR logic 569, converting multiple external instructions into one internal instruction, until the Z bit is '0', which indicates that the current external instruction is the last of the external instructions corresponding to that internal instruction.
  • The output of inverter 565 is then '1' (the inverse of the Z bit above), indicating that conversion is complete and that register 563 holds the converted internal instruction. In this way, the conversion of multiple external instructions into one internal instruction is completed.
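  • The flag-controlled addressing and splicing described above can be illustrated with a minimal Python sketch. It is only a behavioural model under stated assumptions: the opcode width `OPCODE_BITS` and the function names are hypothetical, and the Z bit is simply placed in the bit position just above the opcode.

```python
OPCODE_BITS = 8  # hypothetical width of the complete opcode (assumption)


def memory_201_address(opcode, y, z, index_539):
    """Address presented to memory 201 for one conversion step."""
    if y == 1:
        # Selector 541 picks the index value saved in register 539.
        return index_539
    # Address splicing logic 543: Z becomes the bit just above the opcode,
    # so Z = 0 addresses the direct area and Z = 1 the indirect area.
    return (z << OPCODE_BITS) | opcode


def splicer_step(reg_563, z_561, extracted_fields):
    """One cycle of instruction splicer 207; returns the new register 563 value."""
    and_567 = reg_563 if z_561 == 1 else 0  # AND logic 567 gates the old value
    return and_567 | extracted_fields       # OR logic 569 merges the new fields
```

  • With Z = 1 the previous intermediate result is kept and merged; with Z = 0 only the newly extracted fields reach register 563.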
  • Memory 201 may be a rewritable random access memory (RAM), into which different mapping information is written by configuration according to the external instruction sets that need to be supported. It may also be a read-only memory (ROM), which fixedly supports one or more external instruction sets, or it may be built from logic circuits that perform the same function and likewise fixedly support one or more external instruction sets.
  • A portion of a cache can also be designated for use as memory 201 rather than for caching.
  • Aligner 203 can be omitted from converter 200.
  • Converter 200 can support different external instruction sets depending on its configuration. When the instruction length of one of the external instruction sets is the same as the width of the extractors, selector 204 can select the external instruction directly and send it to each extractor; otherwise selector 204 selects the output of aligner 203 and sends it to each extractor. Other operations are the same as in the previous embodiments and are not repeated here.
  • Instructions of different instruction sets can be stored in different cache levels of the processor system to improve its performance.
  • For example, external instructions can be stored in the L2 cache of the processor system and internal instructions in the L1 cache, with instruction set conversion performed while the external instructions are filled into the L1 cache.
  • FIG. 6 is an embodiment of a processor system including a multi-level cache according to the present invention.
  • The processor system comprises processor core 601, active table 604, scan converter 608, track table 610, replacement module 611, tracker 614, block address mapping module 620, offset address mapping module 618, offset address converter 622, subtractor 928, level 1 cache 602, level 2 cache 606, and selectors 640, 660, 680, 638, 692, 694 and 696.
  • The open circles in FIG. 6 represent bus splicing.
  • A controller receives the outputs of block address mapping module 620, scan converter 608, active table 604, track table 610 and replacement module 611, and controls the operation of each functional module.
  • Level 2 cache 606 stores external instructions, and level 1 cache 602 stores the corresponding internal instructions.
  • the first address and the second address may be used to indicate location information of the instruction in the level 1 cache or the level 2 cache.
  • The first address and the second address may together form an addressing address of the level 1 cache or of the level 2 cache.
  • BN1X denotes the level 1 block number of the instruction block in which an internal instruction is located (that is, it points to the corresponding level 1 instruction block in the level 1 cache), and BN1Y denotes the intra-block offset of the internal instruction (its relative position within the level 1 instruction block).
  • BN2X denotes the level 2 block number of the instruction block in which an external instruction is located (that is, it points to the corresponding level 2 instruction block in the level 2 cache), and BN2Y denotes the intra-block offset of the external instruction (its relative position within the level 2 instruction block).
  • For brevity, BN1 denotes BN1X together with BN1Y, and BN2 denotes BN2X together with BN2Y.
  • An internal instruction stored in the level 1 cache can be represented by BN1 or by BN2.
  • the entries in the active table 604 correspond one-to-one with the storage blocks in the secondary cache 606.
  • Each entry of active table 604 stores a matching pair consisting of a level 2 instruction block address and a level 2 block number BN2X, indicating which memory block of level 2 cache 606 holds the level 2 instruction block with that address.
  • A level 2 instruction block address can be matched against active table 604, yielding a BN2X if the match succeeds; conversely, a BN2X can be used to address active table 604 and read out the corresponding level 2 instruction block address.
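  • The two access modes of the active table can be sketched as follows. This is an illustrative model only: the class name, the method names and the linear search are assumptions, whereas a hardware table would typically match all entries in parallel.

```python
class ActiveTable:
    """Toy model of active table 604: one entry per level 2 memory block."""

    def __init__(self, num_blocks):
        # The list index plays the role of BN2X; None marks an empty entry.
        self.block_addr = [None] * num_blocks

    def fill(self, bn2x, block_address):
        """Record that memory block bn2x now holds this instruction block."""
        self.block_addr[bn2x] = block_address

    def match(self, block_address):
        """Content lookup: block address -> BN2X, or None if not cached."""
        for bn2x, addr in enumerate(self.block_addr):
            if addr == block_address:
                return bn2x
        return None

    def read(self, bn2x):
        """Direct addressing: BN2X -> stored block address."""
        return self.block_addr[bn2x]
```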
  • When external instructions are filled from level 2 cache 606 into level 1 cache 602, scan converter 608 calculates the branch target addresses of the branch instructions among them, and instruction converter 200 inside scan converter 608 converts the external instructions into internal instructions.
  • The calculated branch target address is sent to active table 604 and matched against the instruction block addresses stored there to determine whether the branch target is already stored in level 2 cache 606. If the match fails, the instruction block containing the branch target instruction has not yet been filled into level 2 cache 606; while that instruction block is filled from lower-level memory into level 2 cache 606, a matching pair of the corresponding level 2 instruction block address and level 2 block number is established in active table 604.
  • Scan converter 608 converts and examines each instruction block (of external instructions) filled from level 2 cache 606 into level 1 cache 602, extracts the track point information corresponding to the internal instructions, and fills it into the corresponding entries of track table 610, thereby establishing the track of at least one level 1 instruction block corresponding to the level 2 instruction block.
  • To establish a track, replacement module 611 first generates a BN1X pointing to an available track.
  • Replacement module 611 can determine the available track using a replacement algorithm, such as the LRU algorithm.
  • Scan converter 608 examines each external instruction filled from level 2 cache 606 into level 1 cache 602 and extracts certain information, such as the instruction type, the instruction source address and the branch increment of a branch instruction, and calculates the branch target address from this information.
  • The instruction block address can be read from active table 604 and sent directly to the adder in scan converter 608.
  • Alternatively, a register holding the current instruction block address can be added to scan converter 608, so that active table 604 need not supply the instruction block address in real time.
  • The branch target address of a direct branch instruction is generated by scan converter 608, and the branch target address of an indirect branch instruction is generated by processor core 601; both are external instruction addresses.
  • Scan converter 608 also converts each external instruction into one or more corresponding internal instructions. The branch increment of a branch instruction is not changed during conversion; that is, the branch increment in an external branch instruction equals the branch increment in the corresponding internal branch instruction, which guarantees that the branch target addresses of indirect branch instructions generated by processor core 601 are correct.
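  • As a rough illustration of the branch target calculation for a direct branch, the sketch below assumes that the target is simply the branch source address plus the branch increment, and that the low `OFFSET_BITS` bits of an external address form the intra-block offset. Both are assumptions made for the example; the block size and the function name are hypothetical.

```python
OFFSET_BITS = 6  # hypothetical: 64-byte level 2 instruction blocks (assumption)


def direct_branch_target(block_address, bn2y, branch_increment):
    """Return (target block address, target intra-block offset)."""
    # Rebuild the external address of the branch source instruction.
    source = (block_address << OFFSET_BITS) | bn2y
    # Assumed rule: target = source address + branch increment.
    target = source + branch_increment
    # The high part is matched in the active table; the low part is BN2Y.
    return target >> OFFSET_BITS, target & ((1 << OFFSET_BITS) - 1)
```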
  • Each row of block address mapping module 620 corresponds to one L2 cache block and has multiple entries. Each entry stores the block number (BN1X) of the L1 cache block corresponding to a part of the L2 cache block (called a sub-block of the L2 cache block) and the starting offset (BN2Y) of that sub-block within the L2 cache block.
  • The BN2Y values in the entries of a row are arranged in address order from left to right.
  • The rows of block address mapping module 620, the rows of active table 604 and the memory blocks of level 2 cache 606 correspond one-to-one and are pointed to by the same BN2X.
  • Block address mapping module 620 stores the correspondence between level 2 block numbers and level 1 block numbers, as shown in FIG. 6; its entry format 680 includes the level 1 block number BN1X and the intra-block offset.
  • A row of block address mapping module 620 can be located by a BN2X, and the BN2Y is then compared with the valid BN2Y values stored in the entries of that row. Either the BN1X in the entry whose comparison succeeds is read out (this is the BN1X of the internal instruction corresponding to the external instruction at that BN2Y), thereby converting the BN2X into the corresponding BN1X, or the comparison fails (the internal instruction corresponding to the external instruction at that BN2Y has not yet been stored in level 1 cache 602).
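  • A row lookup of this kind might be modelled as below. The text does not spell out the comparison rule, so the sketch assumes that a BN2Y matches the entry with the largest starting offset not exceeding it, and that every sub-block extends up to the next entry; the function name and the row representation are hypothetical.

```python
from bisect import bisect_right


def lookup_bn1x(row, bn2y):
    """row: list of (sub-block start BN2Y, BN1X), sorted by start offset.

    Returns (BN1X, offset within the sub-block), or None when no valid
    entry covers bn2y (the instruction is not yet in the level 1 cache).
    """
    starts = [start for start, _ in row]
    i = bisect_right(starts, bn2y) - 1  # last entry with start <= bn2y
    if i < 0:
        return None
    start, bn1x = row[i]
    # The difference mirrors the role of subtractor 928.
    return bn1x, bn2y - start
```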
  • The entries of track table 610 have format 686 or 688.
  • Format 686 consists of three parts: type (TYPE), level 2 block number (BN2X) and level 2 intra-block offset (BN2Y).
  • The type field contains the instruction type, which is one of non-branch instruction, unconditional direct branch instruction, conditional direct branch instruction, unconditional indirect branch instruction and conditional indirect branch instruction.
  • The conditional direct, unconditional direct, conditional indirect and unconditional indirect branch instructions may be collectively referred to as branch instructions, and the corresponding track points are branch points.
  • The type field also contains the address type, which in format 686 is the level 2 cache address BN2.
  • Format 688 likewise consists of three parts: type (TYPE), level 1 block number (BN1X) and level 1 intra-block offset (BN1Y). The instruction type in format 688 is the same as in format 686, but the address type is fixed as the level 1 cache address BN1.
  • The format of memory 920 in block address mapping module 620 is shown as 684: a combination of the level 1 cache block address BN1X and the level 2 cache intra-block offset address BN2Y.
  • Track table 610 contains a plurality of track points.
  • A track point is an entry in the track table and may contain information about at least one instruction, such as instruction type information and a branch target address.
  • The track point address of a track point is related to (corresponds to) the instruction address of the instruction it represents; a branch instruction track point contains the track point address of its branch target, and that track point address is related to the branch target instruction address.
  • The consecutive track points corresponding to a level 1 instruction block, which is formed by a series of consecutive internal instructions in level 1 cache 602, are referred to as a track.
  • A level 1 instruction block and its corresponding track are indicated by the same level 1 block number BN1X.
  • The track table contains at least one track.
  • The total number of track points in a track can equal the total number of entries in one row of track table 610.
  • The track table thus becomes a table in which a branch instruction is represented by an entry whose address corresponds to the branch source address and whose content corresponds to the branch target address.
  • An additional block number entry can be added to each row to record the BN2 of the external instruction corresponding to the first track point of the row.
  • In this way, a BN1 in another track table row that has this row as its branch target can be converted into the corresponding BN2, so that the row can be overwritten by another instruction row without causing an error.
  • Because the possible paths of program execution are recorded in track table 610, tracker 614 can follow the program flow according to the information in track table 610 and the feedback from processor core 601. Since the internal instructions corresponding to the track table entries are stored in level 1 cache 602, level 1 cache 602 uses the output bus 631 of tracker 614 as its read address, so that the instructions along the program flow followed by tracker 614 are sent over bus 695 for execution by processor core 601. Some branch targets in track table 610 are recorded as level 2 cache addresses BN2; the purpose is to convert and store into the level 1 cache only those external instructions that are needed, so that the level 1 cache can have a smaller capacity and a higher speed than the level 2 cache.
  • When tracker 614 reads an entry in which a BN2 is recorded, the BN2 is sent to modules such as block address mapping module 620 for matching, or to scan converter 608, to obtain the BN1 address, and the instruction is filled into level 1 cache 602. The BN1 address is also written back to that entry of the track table, and tracker 614 continues along the BN1 and, according to the instruction execution results fed back by processor core 601, controls level 1 cache 602 to output instructions to processor core 601 for execution.
  • the first address and the second address may be used to represent positional information of the track point in the track table.
  • The instruction type of a direct branch point can also indicate whether the branch target addressing address is represented by BN1 (a direct branch instruction whose branch target is BN1) or by BN2 (a direct branch instruction whose branch target is BN2).
  • For a branch point whose branch target is BN2, the external instruction block containing the branch target external instruction has already been stored in the memory block of level 2 cache 606 pointed to by the BN2X, and the branch target external instruction can be found in it according to the BN2Y; however, it cannot be directly determined whether the internal instruction corresponding to the branch target external instruction has already been stored in level 1 cache 602.
  • Offset address mapping module 618 stores the correspondence between external instruction offset addresses in level 2 cache 606 and internal instruction offset addresses in level 1 cache 602.
  • Offset address converter 622 can use the mapping relationship (between BN2Y and BN1Y) read from offset address mapping module 618 by BN1X to convert a received BN2Y into the corresponding BN1Y, or a received BN1Y into the corresponding BN2Y.
  • When BN2 needs to be converted to BN1, BN2X and BN2Y are first converted to BN1X in block address mapping module 620, and BN2Y is then converted into BN1Y according to the mapping relationship in the row of offset address mapping module 618 pointed to by that BN1X, completing the conversion from BN2 to BN1.
  • Conversely, when BN1 needs to be converted to BN2, the level 2 block number of the external instruction block corresponding to the internal instruction block pointed to by the BN1X is obtained, and BN1Y is converted to BN2Y using the mapping relationship for that BN1X, completing the conversion from BN1 to BN2.
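  • The offset conversion in both directions can be sketched as follows. The text does not define the storage format of the mapping relationship, so this example assumes, purely for illustration, that each row is a list whose element i is the external offset of the external instruction from which internal instruction i was converted. The function names are hypothetical.

```python
def bn2y_to_bn1y(ext_offsets, rel_bn2y):
    """External offset -> index of the first internal instruction derived from it.

    Returns None if no internal instruction starts at that external offset.
    """
    try:
        return ext_offsets.index(rel_bn2y)
    except ValueError:
        return None


def bn1y_to_bn2y(ext_offsets, bn1y):
    """Internal offset -> offset of the external instruction it came from."""
    return ext_offsets[bn1y]
```

  • In the row `[0, 0, 3, 5]`, for instance, the external instruction at offset 0 was converted into two internal instructions, so internal offsets 0 and 1 both map back to external offset 0.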
  • There are three main types of bus in this embodiment: external instruction address buses, BN1 buses and BN2 buses.
  • The external instruction address buses are mainly buses 657, 683 and 675.
  • The BN1 buses are mainly buses 631 and 693.
  • The BN2 buses are mainly buses 633 and 687.
  • There are also other buses, such as BN2X bus 639, BN2Y bus 637 and mapping relationship bus 691.
  • The content on bus 675 is the external instruction block address (that is, the level 2 cache block address) read from active table 604 by BN2X. This address is sent back to scan converter 608 to calculate the branch target address of a direct branch instruction.
  • The content on bus 657 is the branch target instruction address of a direct branch instruction, output by scan converter 608 when its examination finds a branch instruction; the content on bus 683 is the branch target instruction address output by processor core 601 when it executes an indirect branch instruction.
  • Buses 657 and 683 have the same format, which is the external instruction address format.
  • The block address part (the high-order part) is selected by selector 680 and sent via bus 681 to active table 604, where it is matched against the stored external instruction block addresses to obtain a level 2 block number BN2X, with which the external instruction can be read from level 2 cache 606 via bus 671.
  • The format of bus 671 is BN2X. It is spliced with the BN2Y, that is, the offset part (the low-order part) of the external instruction address, into a complete BN2 address, which is sent to track table 610 for storage. The BN2X on bus 671 is also sent to selector 640. Selector 640 selects either the BN2X on bus 671 or the BN2X output by track table 610 via bus 633 and places it on bus 639, where it is used to read one row of block address mapping module 620 for the BN2 to BN1 mapping.
  • Bus 637 is the output of three-input selector 638, which selects the BN2Y on bus 633, 657 or 683 and sends it to block address mapping module 620 to match the corresponding BN1X in the row pointed to by the BN2X on bus 639.
  • Bus 633 is the output of track table 610, and its content can be in BN1 or BN2 format.
  • When the format is BN2, the content is sent to block address mapping module 620 and offset address mapping module 618 to map BN2X to BN1X. The mapping also requires subtractor 928 to subtract the start address of the corresponding level 2 sub-block, output by block address mapping module 620, from the BN2Y of the BN2 sent via bus 637. This yields the correct intra-block offset address for offset address converter 622, which converts BN2Y to BN1Y.
  • The BN1X and BN1Y are merged into BN1 and written back to track table 610.
  • The BN2X on bus 633 can also be sent to active table 604 to read out the corresponding external instruction block address via bus 657; this address is sent to scan converter 608, where it is combined with the BN2Y sent from bus 633 to form an external instruction address.
  • The BN2X on bus 633 can also be sent via bus 673 to level 2 cache 606 to read the corresponding external instruction block.
  • Bus 631 is the output of tracker 614 and has the format BN1. It is sent to level 1 cache 602 as the address for reading instructions for use by processor core 601.
  • Bus 693 is the output of replacement module 611, in BN1X format. It provides scan converter 608 with the next available level 1 block number BN1X (or track number), into which scan converter 608 fills the converted internal instructions.
  • The BN1X on bus 693 is also combined with the BN2Y on bus 657 and placed on bus 665 (together with the contents of the entries of block address mapping module 620), and sent to selector 940 for storage in block address mapping module 620 in address order. The format on bus 665 is therefore BN1X and BN2Y.
  • Bus 693 supplies the write block address of level 1 cache 602 and, together with BN1Y bus 669 output by scan converter 608, controls the filling of the internal instructions converted by scan converter 608 into level 1 cache 602 via bus 667.
  • Bus 693 and bus 669 also jointly address track table 610, so that the type corresponding to each internal instruction (sent by scan converter 608 via bus 687) and the branch target (the BN2X on bus 671 spliced with the BN2Y on bus 657 onto bus 687) are written synchronously into track table 610 via bus 687.
  • Bus 687 carries the instruction type and the BN2Y, which are spliced with the BN2X from bus 671 into a complete track point content and sent to track table 610 for storage.
  • Bus 954 is the output of block address mapping module 620. Its BN1X is used to read the corresponding offset address mapping information from offset address mapping module 618, which is sent to offset address converter 622; its BN2Y output is sent to subtractor 928, where it is subtracted from the BN2Y value sent from bus 633, and the result is sent to offset address converter 622.
  • Offset address converter 622 maps the BN2Y to a BN1Y address based on these inputs. The BN1X from bus 954 and the BN1Y from offset address converter 622 are spliced into a complete BN1 and sent via bus 685 to one input of three-input selector 692.
  • Selector 692 selects the BN1 on bus 685, the BN2 on bus 687 or the BN1 on bus 693 (where the BN1X sent from bus 693 is spliced with a BN1Y of '0' into a complete BN1) and sends it to track table 610 as the track point content to be written.
  • FIG. 7A is an embodiment of a track table based cache structure according to the present invention.
  • As described in the previous embodiment, the rows of track table 610 correspond one-to-one with the memory blocks of level 1 cache 602, and the number of entries (track points) in a track table row (track) is one more than the number of instructions in a level 1 memory block.
  • The last track point of a track stores the position of the next track in sequential execution order; the remaining entries correspond one-to-one with the instructions in the level 1 memory block and store program execution flow information (such as the instruction type and branch target address). The addresses of the track points increase from left to right along the track.
  • The read port of track table 610, addressed by read pointer 631 of tracker 614, outputs the content of the corresponding track point onto bus 633, and the controller examines the content on bus 633.
  • If the instruction type in that content is a non-branch instruction, selector 738 selects the output of incrementer 736, so that the tracker moves right to the next (larger) address.
  • If the instruction type is an unconditional branch instruction, selector 738 selects the branch target address on bus 633, so that read pointer 631 moves to the track point position corresponding to that branch target address.
  • If the instruction type is a conditional branch instruction, tracker 614 pauses updating and waits until processor core 601 produces the TAKEN signal 635 indicating whether the branch is taken. If the branch is not taken, the tracker proceeds as for a non-branch instruction; if the branch is taken, it proceeds as for an unconditional branch instruction.
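  • The read pointer update described above amounts to a small selection function. The following sketch is a simplified model in which the pointer is a (BN1X, BN1Y) pair and the track point is a dictionary; the type names, the field names and the use of `None` for a TAKEN signal that is not yet available are all assumptions made for the example.

```python
def next_read_pointer(pointer, track_point, taken=None):
    """Return the next (BN1X, BN1Y) read pointer for tracker 614."""
    bn1x, bn1y = pointer
    kind = track_point["type"]
    if kind == "non_branch":
        # Selector 738 takes the incrementer 736 output: move one step right.
        return (bn1x, bn1y + 1)
    if kind == "unconditional":
        # Selector 738 takes the branch target read from the track table.
        return track_point["target"]
    if kind == "conditional":
        if taken is None:
            return pointer  # pause until the core reports TAKEN
        return track_point["target"] if taken else (bn1x, bn1y + 1)
    raise ValueError(f"unknown track point type: {kind}")
```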
  • The write address for the write port of track table 610 has two parts, supplied by selector 694 (BN1X) and selector 696 (BN1Y).
  • When a track is being established, replacement module 611 outputs the row address BN1X and scan converter 608 outputs the column address BN1Y.
  • When the track point content read by tracker 614 contains a BN2, the BN2 is sent to block address mapping module 620 or scan converter 608 to look up or generate a BN1, and that BN1 is written back to the track point.
  • Likewise, when an indirect branch target address generated by processor core 601 is sent to active table 604, block address mapping module 620 and so on to look up or generate a BN1, that BN1 also needs to be written back to the track point.
  • In these cases the write address of track table 610 is the current read address.
  • The write content of track table 610 has three sources, buses 685, 687 and 693, one of which is selected by selector 692.
  • The value on bus 685 is the BN1 output by block address mapping module 620 and offset address converter 622; the value on bus 687 is a branch target address in level 2 cache address form (BN2); and the value on bus 693 is the BN1 of the next track, which is written into the last entry of a track so that execution can continue to the next track.
  • the scan converter 608 is scanned while the external command is converted to the internal command. Review and extract the corresponding information.
  • the track table content has three parts: if the internal instruction is a non-branch instruction or an indirect branch instruction, the selector 694 selects the replacement module 611.
  • the generated BN1X 693 corresponding to the internal instruction is the first address in the write address of the track table 610, and the selector 696 selects the scan converter 608.
  • the output of the branch internal instruction is within the block offset 669 in its instruction block as the track table 610
  • the second address in the write address is written into the track table 610 as the write content (that is, the non-branch instruction or the indirect branch instruction), and the establishment of the track point is completed.
  • scan converter 608 Calculate the branch target address.
  • the block address in the branch target address is sent to the active table 604 for matching via bus 657. If the match is successful, the BN2X corresponding to the successful entry is obtained via the bus 671, 639.
  • the block address mapping module 620 is sent to, and the intra-block offset (i.e., BN2Y) in the branch target address is sent to the block address mapping module 620 via the buses 657, 637.
  • Block address mapping module Find the corresponding BN1X in the line pointed to by the BN2X in 620. If there is a valid BN1X, the BN1X is read from the offset address mapping module 618.
  • the mapping relationship in the pointed row is sent to the offset address converter 622 to convert the BN2Y to BN1Y .
  • the selector 694 selects the corresponding internal command generated by the replacement module 611 BN1X 693 is the first address in the track table 610 write address, and selector 696 selects the intra-block offset of the branch internal instruction output by scan converter 608 in its instruction block 669
  • the BN1X and BN1Y are merged into BN1 and placed on the bus 693 and passed through the selector 692.
  • the content of the track point is written into the track table 610 together with the extracted instruction type, and the establishment of the track point is completed. At this point, the track point contains BN1.
  • If there is no valid BN1X, the selector 694 selects the BN1X 693 generated by the replacement module 611 for the internal instruction as the first address in the write address of the track table 610, and the selector 696 selects the intra-block offset 669 of the branch internal instruction within its instruction block, output by the scan converter 608, as the second address. The BN2X on bus 671 and the BN2Y output by the scan converter 608 are spliced into BN2, placed on the bus 687, selected by the selector 692, and written into the track table as the track point content together with the extracted instruction type, which completes the establishment of the track point. At this point, the track point contains BN2.
  • If the match is unsuccessful, a block number BN2X of a secondary storage block is allocated according to a replacement algorithm (such as the LRU algorithm), and the branch target address is sent to the lower-level memory to retrieve the corresponding instruction block and store it in the storage block of the secondary cache 606 pointed to by the BN2X.
  • The selector 694 selects the BN1X 693 generated by the replacement module 611 for the internal instruction as the first address in the write address of the track table 610, and the selector 696 selects the intra-block offset 669 of the instruction within its instruction block, output by the scan converter 608, as the second address. The BN2X and the intra-block offset address in the branch target address (i.e., BN2Y) are merged directly into BN2, placed on the bus 687, selected by the selector 692, and written into the track table as the track point content together with the extracted instruction type, which completes the establishment of the track point. At this point, the track point contains BN2.
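  • The cases above (no target, BN1 target, BN2 target) can be summarised in a minimal software sketch. It is not the hardware described in the text: the function name, the dictionary-based `active_table` and `block_map`, and the `convert_offset` and `allocate_bn2x` callbacks are all hypothetical stand-ins for the active table 604, the block address mapping module 620, the offset address converter 622, and the replacement logic. In particular, the real module 620 holds several entries per row, which is simplified here to one BN1X per BN2X.

```python
def build_track_point(instr_type, target, active_table, block_map,
                      convert_offset, allocate_bn2x):
    """Return the content written into a track point (illustrative model only)."""
    # Non-branch and indirect branch instructions store only the type.
    if instr_type != "direct_branch":
        return {"type": instr_type}

    block_addr, bn2y = target
    bn2x = active_table.get(block_addr)
    if bn2x is None:
        # Miss in the active table: allocate an L2 block and store BN2.
        bn2x = allocate_bn2x(block_addr)
        active_table[block_addr] = bn2x
        return {"type": instr_type, "target": ("BN2", bn2x, bn2y)}

    bn1x = block_map.get(bn2x)
    if bn1x is None:
        # Block is in L2 but not yet converted: store BN2.
        return {"type": instr_type, "target": ("BN2", bn2x, bn2y)}

    # Block already converted: translate BN2Y to BN1Y and store BN1.
    return {"type": instr_type,
            "target": ("BN1", bn1x, convert_offset(bn1x, bn2y))}
```

  • Storing BN2 when the target has not yet been converted defers the conversion until the branch point is actually read by the tracker, which is the behaviour the later paragraphs describe.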
  • The first address (BNX) in the write address of the track table 610 also points, via the bus 745, to the corresponding row in the offset address mapping module 618, so that the mapping relationship between each internal instruction block and the corresponding external instructions is stored in that row.
  • The excess portion is sequentially filled into the primary storage block pointed to by a BN1X newly generated by the replacement module 611, and the corresponding track is established.
  • The tracker 614 is composed of the register 740, the incrementer 736, and the selector 738. Its read pointer 631 (i.e., the output of the register 740) points to the track point in the track table 610 that corresponds to the instruction to be executed by the processor core 601 (i.e., the current instruction), and the content of that track point is read out and sent to the selector 738 via bus 633.
  • The read pointer 631 also addresses the level 1 cache 602, from which the current instruction is read and sent to the processor core 601 for execution.
  • The selector 738 selects the output of the incrementer 736, i.e., the value of the register 740 incremented by one, and sends it back to the register 740, so that the value of the register 740 is incremented by one in the next cycle. That is, the read pointer 631 points to the next track point, and the corresponding internal instruction is read out from the level 1 cache 602 for execution by the processor core 601.
  • The selector 738 selects the BN1 as the output sent back to the register 740, so that the value of the register 740 is updated to the BN1 in the next cycle. That is, the read pointer 631 points to the track point corresponding to the branch target internal instruction, and the branch target internal instruction is read from the level 1 cache 602 for execution by the processor core 601.
  • The selector 738 is controlled by the TAKEN signal 635, which the processor core generates when it executes the branch instruction and which indicates whether the branch is taken. The update of the register 740 is suspended until the processor core 601 sends the TAKEN signal 635. If the value of the TAKEN signal 635 is '1', the branch is taken, and the BN1 output from the track table is selected and sent back to the register 740, so that the value of the register 740 is updated to the BN1 in the next cycle. That is, the read pointer 631 points to the track point corresponding to the branch target internal instruction, and the branch target internal instruction is read out from the level 1 cache 602 for execution by the processor core 601. If the value of the TAKEN signal 635 is '0', the branch is not taken, and the output of the incrementer 736 (the value of the register 740 incremented by one) is selected and sent back to the register 740, so that the value of the register 740 is incremented by one in the next cycle. That is, the read pointer 631 points to the next track point, and the corresponding internal instruction is read out from the level 1 cache via the bus 695 for execution by the processor core 601.
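  • The read pointer update performed by the tracker can be modelled in a few lines. This is only a sketch under assumptions: the pointer is represented as a `(BN1X, BN1Y)` tuple, the instruction type names are hypothetical labels, and `taken=None` stands for the cycles in which the TAKEN signal 635 has not yet arrived.

```python
def next_read_pointer(pointer, track_point, taken=None):
    """Model of selector 738: choose the next value of register 740."""
    bn1x, bn1y = pointer
    incremented = (bn1x, bn1y + 1)      # output of incrementer 736
    kind = track_point["type"]

    if kind == "non_branch":
        return incremented
    if kind == "unconditional_branch":
        return track_point["target"]    # BN1 read from the track table
    if kind == "conditional_branch":
        if taken is None:
            return pointer              # register update is suspended
        return track_point["target"] if taken else incremented
    raise ValueError(f"unhandled instruction type: {kind}")
```

  • For example, with the pointer at `(72, 1)` and a conditional branch whose target is `(73, 0)`, the sketch returns `(73, 0)` when `taken` is true and `(72, 2)` when it is false.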
  • The BN2 of the direct branch instruction (whether conditional or unconditional) is sent to the block address mapping module 620. If the block address mapping module 620 holds a valid BN1X corresponding to the BN2, the BN1X is output, the BN2Y in the BN2 is converted into the corresponding BN1Y by the offset address converter 622, and the BN1X and BN1Y are merged into BN1 and placed on the bus 685.
  • The selector 694 selects the BN1X in the value of the read pointer 631 (i.e., the BN1 of the branch point corresponding to the branch instruction itself) as the first address in the write address, the selector 696 selects the BN1Y in the value of the read pointer 631 as the second address in the write address, and the selector 692 selects the BN1 on the bus 685 as the write content to be written back to the branch point. If no valid BN1X corresponding to the BN2 exists, the replacement module 611 generates a BN1X that designates an available track in the track table 610 (and the corresponding storage block in the level 1 cache 602).
  • The internal instruction corresponding to the branch target external instruction must be the first instruction in the level 1 storage block; that is, the value of BN1Y is '0'.
  • The branch target instruction of the branch point is stored in the level 1 cache 602, and the BN2X in the BN2 is converted into the BN1X corresponding to the branch target internal instruction (generated by the replacement module 611), which is combined with the BN1Y (value '0') into BN1 and placed on the bus 693.
  • The selectors 694 and 696 select the value of the read pointer 631 (i.e., the branch point corresponding to the branch instruction itself) as the write address, and the selector 692 selects the BN1 on the bus 693 as the write content to be written back to the branch point.
  • The track point content output by the track table 610 then contains BN1.
  • the subsequent operations are the same as those in the direct branch instruction in which the branch target is BN1, and will not be described here.
  • The processor core 601 sends the block address in the branch target address generated when the branch instruction is executed to the active table 604 for matching. If the match is successful, the BN2X corresponding to the matching entry is obtained.
  • If the match is unsuccessful, the active table 604 allocates a block number BN2X of a secondary storage block according to a replacement algorithm (such as the LRU algorithm), and sends the branch target address to the lower-level memory to retrieve the corresponding instruction block and store it in the storage block of the secondary cache 606 pointed to by the BN2X.
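  • As an illustration of the match-then-allocate behaviour, the following sketch models an active table with LRU replacement. The class and method names are hypothetical, the table is reduced to a mapping from external block address to BN2X, and fetching the block from lower-level memory is left out; the text names LRU only as one possible replacement algorithm.

```python
from collections import OrderedDict


class ActiveTable:
    """Toy model: external block address -> BN2X, with LRU replacement."""

    def __init__(self, block_numbers):
        self.free = list(block_numbers)   # unused L2 block numbers
        self.entries = OrderedDict()      # least recently used entry first

    def match(self, block_addr):
        """Return the BN2X on a hit (and refresh its age), else None."""
        if block_addr not in self.entries:
            return None
        self.entries.move_to_end(block_addr)
        return self.entries[block_addr]

    def allocate(self, block_addr):
        """Assign a BN2X to a missing block, evicting the LRU entry if full."""
        if self.free:
            bn2x = self.free.pop(0)
        else:
            _, bn2x = self.entries.popitem(last=False)
        self.entries[block_addr] = bn2x
        return bn2x
```

  • A caller would try `match` first and call `allocate` only on a miss, after which the instruction block would be fetched into the L2 block identified by the returned BN2X.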
  • The external instruction block is converted and filled into the level 1 cache 602, the corresponding track is established, the mapping relationship is recorded, and the BN2 is converted into BN1.
  • The controller accordingly determines that the indirect branch instruction has been executed before, so the BN1 address can be used for speculative execution. The corresponding external instruction address must, however, be derived from the BN1 address (for example, the BN2X stored in the track corresponding to the BN1X addresses the active table 604 to read the external instruction block address, and the offset within the external instruction block is obtained through conversion by the offset address mapping module 618, which together give the complete external instruction address). Once the processor core 601 has produced the branch target address, that address is compared with the external instruction address obtained by this reverse lookup.
  • scan converter 608 is responsible for converting external instructions into internal instructions and filling them into the level one cache.
  • In the process, the scan converter 608 also calculates the branch target address of the external instruction, extracts the instruction type, and fills the target address and type information into the track table entry corresponding to the internal instruction filled into the level 1 cache.
  • FIG. 7B is an embodiment of the scan converter of the present invention.
  • scan converter 608 accepts inputs from two sources.
  • The first source applies when the track table 610 sends out, via the bus 633, the BN2 of a direct branch external instruction address and this BN2 misses in the block address mapping module 620. In that case the required external instruction block is already stored in the secondary cache 606 and the active table 604 holds the upper bits of the corresponding external instruction address (PC), but the block has not yet been converted into internal instructions stored in the level 1 cache 602.
  • The BN2X address on bus 633 is sent to the active table 604, the corresponding PC upper bits are read out and sent to the scan converter 608 via bus 675, and the intra-block offset BN2Y on bus 633 is also sent to the scan converter 608.
  • The selector 660 also places the BN2X from bus 633 on the bus 673 to provide the block address to the secondary cache 606.
  • The second source applies when the track table 610 sends out, via the bus 633, an indirect branch instruction type whose address format is the external instruction address format, indicating that the target of the indirect branch instruction needs to be calculated by the processor core 601.
  • The controller sends the external branch target address obtained when the processor core 601 executes the corresponding indirect conditional branch instruction to the active table 604 for matching, via the bus 683, the selector 680, and the bus 681. If there is no match, the external instruction block of the branch target is not in the level 2 cache 606. In that case the active table sends the external instruction address on the bus 681 to the lower-level memory to read the corresponding instruction block and fill it into the level 2 cache block of the secondary cache 606 that is allocated by the active table 604 and pointed to via the selector 660 and the bus 673.
  • The upper bits of the external instruction address are stored in the corresponding tag field of the active table. If there is a match, the active table 604 points, via the selector 660 and the bus 673, to the level 2 cache block corresponding to the matching tag.
  • the PC address on bus 683 is sent to scan converter 608.
  • The scan converter 608 includes the converter 200, the direct branch target address calculator 792, the intra-block offset map generator 796, the controller 790, and the input selectors 798 and 799.
  • The controller 790 accepts the status signals from each module and controls the modules so that they work together.
  • The selector 798 selects the PC high address from bus 675 or bus 683 to be stored in register 788.
  • The selector 799 selects the PC lower address (BN2Y) from bus 633 or 683 to be stored in register 321.
  • For the first source, the address is used to convert the BN2 in the track table into a BN1 address, during which the corresponding external instructions are translated into internal instructions and stored in the level 1 cache 602.
  • For the second source, the address is used to translate the external instruction of the indirect branch target into an internal instruction, store it in the level 1 cache, and store the level 1 block number BN1X along with the intra-block offset BN1Y into the entry of the track table 610 corresponding to the indirect branch instruction. Regardless of which source the selectors 798 and 799 select, the operation is the same.
  • The following takes the conversion of BN2 into a BN1 address as an example.
  • The address of the level 2 cache 606 is BN2, whose format in this example is '8XYY', where '8X' is the block address BN2X with a value of '80' ~ '82'.
  • Each L2 cache block in the L2 cache 606 (one row in the figure) has 32 bytes, and its intra-block offset BN2Y is the intra-block byte address 'YY', with a value of '0' ~ '31'. The block stores variable-length external instructions in units of bytes.
  • The address of the level 1 cache 602 is BN1, whose format is '7XY', where '7X' is the block address BN1X with a value of '70' ~ '75'.
  • Each level 1 instruction block in the level 1 cache 602 (one row in the figure) has 4 fixed-length internal instructions, and its intra-block offset BN1Y is the intra-block word address 'Y'. To make it easy to understand and to distinguish from BN2Y, its value is written as A ~ D in this embodiment.
  • Note: in this embodiment the length of an internal instruction is one word; an internal instruction may also have other lengths.
  • Each row in the track table 610 has five entries A ~ E, of which the four entries A ~ D correspond to the four internal instructions A ~ D in the level 1 cache 602, and the entry E stores the address of the next level 1 cache block in sequential order.
  • The direct branch target address calculator 792 has a three-input adder 760 used to calculate the direct branch target address. It also has a boundary comparator 772 whose input is connected to bus 679. The boundary comparator 772 stores the largest address in an L2 cache block ('31' in this embodiment). When the BN2Y value on bus 679 crosses the boundary of the L2 cache block (is greater than '31'), the boundary comparator 772 generates a signal notifying the controller 790 that the level 2 cache block boundary has been crossed.
  • The direct branch target address calculator 792 also has a selector 774 which, under the control of the controller 790, sends either the branch offset output by the converter 200 or all '0' to the adder 760. When all '0' is selected, the address of the next sequential external instruction block is calculated.
  • The BN2 address is '8024', which denotes byte '24' of level 2 cache block number '80' in the level 2 cache 606.
  • The BN2 address is sent via bus 633 to the block address mapping module 620 for matching. Its BN2X value, selected by the selector 640 and sent via the bus 639, selects row '80' in the block address mapping module 620, and the BN2Y values stored in the entries of that row are compared with the BN2Y on the bus 633, which is selected by the selector 638 and sent via the bus 637.
  • The result of the comparison is a miss; that is, the branch target instruction is an external instruction stored in the level 2 cache that has not yet been converted into an internal instruction and stored in the level 1 cache.
  • On receiving the miss signal, the controller uses the BN2X on bus 633 to read the tag (assumed to be '9132') from row '80' of the active table 604 and sends it via bus 675 to the scan converter 608. Referring to FIG. 7B, the controller also controls the selector 798 in the scan converter 608 to select the bus 675 and the selector 799 to select the bus 633, and notifies the controller 790 in the scan converter 608 to start converting instructions.
  • The controller 790 controls the register 756 to store the output of the selector 798 ('9132'), and also controls the register 321 to store the output of the selector 799 ('24', binary '1100'). That is, the PC address of the branch target is '913224', which is stored in row '80' of the level 2 cache, so its BN2 address is '8024'. Assuming that the L2 cache 606 reads 16 bytes at a time, only the highest bit of the 4-bit intra-block offset address in the register 321 is taken from the bus 679.
  • The binary value '1100' in register 321 is sent to the decoder 762 via bus 679 and translated into the one-hot code '00000000000000000000000010000000', which is stored in register 766 via the OR gate 764.
  • The counter 776 is set to '0' when the conversion of external instructions begins, and the value '000' on its output bus 669 is translated by the decoder 778 into the one-hot code '1000' and sent via the logic gate 780 to the register 782 for storage.
  • The value on bus 679 is stored in the register 770 when the conversion of an external instruction segment begins. Here '1100' is stored in register 770 to control the shifter 768 to shift left by 24 bits, so that the information corresponding to byte '24' in register 766 is shifted to the byte '0' position on the bus 691.
  • The replacement module 611 assigns level 1 cache block number '72' in the level 1 cache 602 to the internal instructions being converted.
  • The controller controls the selector 692 to select the BN1X address '72' on the bus 693, together with the BN1Y address A ('00'), for writing into the track table 610.
  • The selectors 694 and 696 select the address on the bus 631, so the BN1 address '72A' is written into the entry in place of the original BN2 address '8024', without changing the original instruction type.
  • The bus 631 then carries '72A', pointing to the first entry in row '72' of the track table 610, and execution continues.
  • The replacement module 611 sends the BN1X address '72' via bus 693 to select level 1 cache block number '72' in the level 1 cache 602, and also to select row '72' in the track table 610 and in the offset address mapping module 618, so that they can be filled with the internal instructions, the corresponding program flow, and the intra-block offset information generated by the scan converter 608.
  • The bus 669 output by the scan converter 608 is sent to the level 1 cache 602 and the track table 610 as the intra-block offset address BN1Y of the level 1 cache block, and is used to fill the level 1 cache block and the corresponding track table row.
  • The external instruction in the secondary cache 606 starting at the BN2 address '8024' is converted by the converter 200 into a non-branch internal instruction, which is sent via bus 667 to the level 1 cache 602 and filled into entry A of level 1 cache block '72' (intra-block offset '00'). Its instruction type (non-branch instruction) is also sent by the memory 201 via the bus 687 to the track table 610 and stored in entry '72A'.
  • The controller also selects, via the selector 698, the BN2Y value '24' on bus 633 and splices it with the BN1X address '72' on bus 693 into the 'BN1X, BN2Y' form '7224', which is written via bus 665 into the block address storage module 920 of the block address mapping module 620, in the leftmost entry of row '80'. That row is addressed via the bus 639 by the BN2X on bus 633, selected through the selector 640. The position of the entry is determined by comparing the BN2Y value '24' on bus 633, which is selected by the selector 638 and sent via the bus 637 to the block address mapping module 620, with the BN2Y value of each entry in the row ('32').
  • The value and its position indicate that the external instruction segment starting from byte '24' of L2 cache block '80' is stored in level 1 cache block '72', and that the external instructions in row '80' of the L2 cache with byte addresses less than '24' have not yet been converted into internal instructions.
  • the specific structure and operation are shown in the embodiment of Fig. 8.
  • The memory 201 detects during the conversion that the length of the above external non-branch instruction is 2 bytes, and via the bus 325 controls the aligner 203 to shift the external instructions input via the bus 677 left by 2 byte positions so that the next instruction conversion can start. This byte length is also sent to the adder 323 and added to the contents of the register 321, and the sum '26' is stored back into register 321.
  • The output of register 321 is again translated by the decoder 762 into the one-hot code '00000000000000000000000000100000' and combined with the contents of register 766 in a bitwise 'or' operation via the OR gate 764. The result '00000000000000000000000010100000' is stored back into register 766, which means that byte '24' and byte '26' of L2 cache block '80' are each the start byte of an external instruction.
  • The converter 200 then converts the external instruction starting at byte '26', which is found during the conversion to be a direct branch instruction 4 bytes long. The converter does not modify its branch offset, which is put directly on the bus 667 together with the other parts of the internal instruction obtained by the conversion.
  • Its branch instruction type is also output via bus 687 as in the previous example.
  • Counter 776 is also incremented by '1' under the control of bus 786, with a bus 669 value of '001'.
  • Because the instruction is a branch instruction, the controller 790 controls the adder 760 to add the PC high address in register 756, the intra-block offset in register 321, and the corresponding branch offset 798 from bus 667 (assuming the value is '24' at this time). The sum is the branch target PC address '913316', which is output on the bus 657.
  • Its low-order part BN2Y (the portion that is not larger than the number of bytes in an L2 cache block) is output on bus 687.
  • The upper bits of the PC address on bus 657 are sent via the selector 680 to the active table 604 for matching, and the result is a miss.
  • The active table 604 sends the PC upper address '9133' via bus 681 to the lower-level memory to read the corresponding external instruction block. The active table 604 also allocates L2 cache block '81' in the level 2 cache for this external instruction block.
  • The level 2 cache block number BN2X ('81') is also sent via bus 671 to be spliced with the low-order BN2Y on bus 687.
  • The value '001' on bus 669 is also translated by the decoder 778 into the one-hot code '0100', which is combined with the value in register 782 in an 'or' operation. The result '1100' is stored in register 782, meaning that the first and second entries of the internal instruction block each correspond to the start of an external instruction. If an internal instruction does not correspond to the start byte of an external instruction (that is, it is an internal instruction following the first one when an external instruction is converted into multiple internal instructions), the signal sent from the memory 201 via bus 788 controls the OR gate 780 so that the corresponding bit in register 782 is '0'.
  • When the external instructions have been converted into the corresponding internal instructions, registers 782 and 766 contain the same number of '1's, although at different positions.
  • The position of each '1' in register 766 is the byte address of the start byte of an external instruction.
  • The position of each '1' in register 782 is the instruction address of the first internal instruction corresponding to an external instruction.
  • The memory 201 detects during the conversion that the external instruction starting at byte '26' is 4 bytes long, and via the bus 325 controls the aligner 203 to shift the external instructions input via bus 677 left by 4 byte positions so that the next conversion can start. This byte length is also sent to the adder 323 and added to the contents of register 321, and the sum '30' is stored back into register 321.
  • The output of register 321 is again translated by the decoder 762 into a one-hot code and combined with the contents of register 766 in a bitwise 'or' operation. The result '00000000000000000000000010100010' is stored back into register 766.
  • The counter 776 is also incremented by '1' as in the previous example, so that bus 669 points to entry C.
  • During the conversion, the converter 200 reads from the memory 201 via bus 325 that the external instruction starting at byte '30' is 4 bytes long. This byte length is also sent to the adder 323 and added to the contents of register 321, and the sum '34' is stored back into register 321.
  • The output 679 of register 321 is compared with the largest byte address of an L2 cache block ('31') stored in the comparator 772, and the comparison result notifies the controller 790 that the L2 cache block boundary has been crossed. The controller 790 accordingly controls the selector 774 to select all '0', and controls the adder 760 to add the PC high address in register 756, the intra-block offset in register 321, and the all-'0' value selected in place of the branch offset from bus 667, which yields the address in the next external instruction block.
  • The resulting PC address '913302' is output on bus 657, and its PC high address '9133' is sent to the active table 604 for matching, which yields the BN2X value '81' (allocated earlier by the active table 604 when the PC address '913326' missed).
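  • The boundary check and the all-'0' addition in this step can be reproduced with a small sketch. The function names are hypothetical, and the block address is treated as a plain integer that counts 32-byte blocks, which is an assumption about how the '9132' / '9133' notation of the example is meant to be read.

```python
BLOCK_BYTES = 32  # bytes per L2 cache block in this embodiment


def crosses_boundary(offset, block_bytes=BLOCK_BYTES):
    """Model of boundary comparator 772: offset beyond the last byte."""
    return offset > block_bytes - 1


def add_address(pc_high, offset, branch_offset, block_bytes=BLOCK_BYTES):
    """Model of the three-input adder 760; returns (PC high, BN2Y)."""
    total = pc_high * block_bytes + offset + branch_offset
    return divmod(total, block_bytes)


def as_text(pc_high, bn2y):
    """Render an address in the '913302' style used in the example."""
    return f"{pc_high}{bn2y:02d}"


# Register 321 holds 30 + 4 = 34 after the instruction starting at byte 30.
offset = 30 + 4
if crosses_boundary(offset):
    # Selector 774 supplies all '0' instead of a branch offset.
    next_address = as_text(*add_address(9132, offset, 0))
```

  • With these inputs `next_address` is `'913302'`, matching the value placed on bus 657 in the example.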
  • The BN2X value selects, via the selector 660 and the bus 673, L2 cache block '81' in the level 2 cache 606.
  • As in the previous example, the converter 200 reads from byte '0' of L2 cache block '81'.
  • The controller 790 accordingly controls the converter 200 to stop converting instructions, and the counter 776 is also incremented by one so that the address on bus 669 points to entry '72D'.
  • The controller also sends the BN2X value '81' on the bus 671, via the selector 640 and the bus 639, to the block address storage module 920 in the block address mapping module 620. The contents of row '81' are read out and compared with the BN2Y address '02', which is sent to the block address mapping module 620 via the bus 657, the selector 638, and the bus 637.
  • If the match hits, the controller 790 stores the BN1 obtained from the match, together with an unconditional branch instruction type generated by the controller, into entry '72D' of the track table 610 via the bus 685 and the selector 692. Here the match result is a miss, which means that the corresponding external instruction block is already in the L2 cache but has not yet been converted into internal instructions. In this case the controller 790 generates a direct branch instruction type on bus 687, which is output together with the low-order BN2Y '02' (the byte offset within the block) from the adder 760.
  • The controller combines the BN2X value on bus 671 with the BN2Y address on bus 687 into the BN2 address '8102', which is written, along with the unconditional branch instruction type, into entry '72D' of the track table via the selector 692. No corresponding internal instruction exists at this point, so entry '72D' in the level 1 cache 602 is not filled.
  • The controller 790 also controls the shifter 768 to shift the contents of register 766 left by 24 bits, giving the value '10100010'. This is the data format of row 751 in FIG. 8B.
  • The controller 790 also places the contents of register 782, '1110', on the bus 691.
  • The format in register 782 is the data format of row 771 in FIG. 8B.
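  • The two bit rows built up during this walkthrough can be recomputed with a short sketch. It models registers 766 and 782 as Python lists whose index 0 is the leftmost bit, which is an assumption about the bit ordering of the printed codes; the function names and the `internal_counts` parameter (the number of internal instructions produced per external instruction) are hypothetical.

```python
def mark_external_starts(first_offset, lengths, block_bytes=32):
    """Set a '1' at the start byte of each external instruction (register 766)."""
    marks = [0] * block_bytes
    offset = first_offset                 # running value of register 321
    for length in lengths:
        if offset >= block_bytes:         # boundary crossed, stop marking
            break
        marks[offset] = 1
        offset += length
    return marks, offset


def mark_internal_starts(internal_counts, entries=4):
    """Set a '1' at the first internal instruction of each external one (register 782)."""
    marks = [0] * entries
    index = 0                             # running value of counter 776
    for count in internal_counts:
        if index >= entries:
            break
        marks[index] = 1
        index += count
    return marks


def bits(marks):
    return "".join(str(b) for b in marks)


# Example from the text: instructions of 2, 4 and 4 bytes starting at byte 24,
# each converted into exactly one internal instruction.
external, end_offset = mark_external_starts(24, [2, 4, 4])
row_751 = bits(external[24:])             # shifted left by 24 bit positions
row_771 = bits(mark_internal_starts([1, 1, 1]))
```

  • Here `row_751` evaluates to `'10100010'` and `row_771` to `'1110'`, and both contain three '1's, one per converted external instruction.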
  • The contents of bus 691 are sent to the offset address mapping module 618 and written into row '72', which is pointed to by the level 1 cache replacement module 611, for later mapping between the intra-block offsets of the external and internal instructions of that row.
  • In this way the scan converter 608 cooperates with the other modules to complete the conversion of external instructions and to extract the program flow information contained in the instructions. The program flow information and the converted internal instructions are stored in the corresponding entries of the track table 610 and the level 1 cache 602.
  • In this embodiment, the tracker 614 can read and follow the program flow in the track table 610 and supply the corresponding internal instructions to the processor core for execution.
  • the values in the block address mapping module 620 and the track table 610 can be referred to FIG. 9A.
  • The level 1 cache block may be filled before the level 2 instruction segment ends.
  • The counter 776 therefore also has a comparator, equivalent to the boundary comparator 772, that notifies the controller 790 when the boundary of the level 1 cache block is crossed. In this case the controller 790 requests a new level 1 cache block from the level 1 cache replacement module 611 and writes the BN1X address of the new cache block, along with a BN1Y address of '0', via bus 693 and selector 692 into the last entry of the track table row that has just been filled. Each row in the track table has one more entry than the corresponding level 1 cache block, so that the program flow can continue to the next new track when the level 1 cache block is full.
  • the external instruction set may be a fixed length instruction set or a variable length instruction set.
  • a variable length external instruction set is mainly taken as an example, and a fixed length external instruction set can be used as a special case of a variable length external instruction set.
  • An external instruction block is 16 bytes long (from byte 0 to byte 15), and each internal instruction is 4 bytes in length.
  • The external instruction block 701 contains six variable-length instructions. As described in the previous embodiment, byte 0 of the external instruction block is the last byte of the previous instruction and therefore belongs to the previous external instruction block; that is, the external instructions in this external instruction block start from byte 1 of the block.
  • The external instruction 703 occupies 3 bytes (bytes 1, 2, and 3), external instruction 705 occupies 5 bytes (bytes 4, 5, 6, 7, and 8), external instruction 707 occupies 2 bytes (bytes 9 and 10), external instruction 709 occupies 1 byte (byte 11), external instruction 711 occupies 3 bytes (bytes 12, 13, and 14), and external instruction 713 occupies 1 byte in this external instruction block, with the rest in the next external instruction block.
  • the external instruction 705 can be converted into two internal instructions (ie, internal instructions 725 and 727).
  • The external instructions 703, 707, 709, 711, and 713 can each be converted into one internal instruction, namely internal instructions 723, 729, 731, 733, and 735 respectively.
  • The internal instruction block 721 obtained by the conversion of the scan converter 608 contains seven internal instructions (from internal instruction 0 to internal instruction 6).
  • When the scan converter 608 performs the instruction block conversion, it also generates the correspondence between the offset address BN2Y in the external instruction block and the offset address BN1Y in the internal instruction block. The correspondence is stored in the offset address mapping module 618.
  • an external instruction may be converted into one or more internal instructions.
  • an external instruction corresponding to a plurality of internal instructions is taken as an example, and an external instruction corresponding to an internal instruction is a special case. That is, when an external instruction corresponds to an internal instruction, the first internal instruction and the last internal instruction corresponding to the external instruction are the internal instructions corresponding to the external instruction.
  • FIG. 8B is an embodiment of the storage form of the offset address mapping relationship according to the present invention.
  • Rows 751 and 771 constitute a set of mapping relationships, corresponding to the external instruction block and the internal instruction block respectively, and store the offset address mapping relationship between the external instructions and the internal instructions in the embodiment of FIG. 8A.
  • Row 751 has 16 entries, each corresponding to one external instruction offset address. Each entry contains only one bit of data (i.e., '0' or '1'), where '0' indicates that the external instruction offset address corresponding to the entry is not the start position of an external instruction, and '1' indicates that the external instruction offset address corresponding to the entry is the start position of an external instruction.
  • Each entry in row 771 corresponds to an internal instruction offset address, so the number of entries equals the maximum number of internal instructions that the internal instruction block may contain. Each entry stores only one bit of data ('0' or '1'): '0' indicates that the internal instruction corresponding to the entry is not the first internal instruction of its external instruction, and '1' indicates that it is the first internal instruction of its external instruction.
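The two bit rows can be illustrated with a minimal Python sketch. The function name and the `layout` list are hypothetical; the layout (start offset and number of internal instructions per external instruction) is not taken from FIG. 8A but was chosen so that the rows match the bit patterns quoted later in the text.

```python
def build_offset_maps(external, block_offsets=16):
    """Build the two one-bit-per-entry rows of the offset mapping.

    external: list of (start_offset, internal_count) pairs, one per
    external instruction that starts in this block (hypothetical input).
    """
    row_751 = ["0"] * block_offsets      # one bit per external offset address
    row_771 = []                         # one bit per internal instruction
    for start, internal_count in external:
        row_751[start] = "1"             # this offset starts an external instruction
        row_771.append("1")              # first internal instruction of that instruction
        row_771.extend("0" * (internal_count - 1))  # remaining internal instructions
    return "".join(row_751), "".join(row_771)


# Assumed layout: six external instructions, the second one expanding
# into two internal instructions.
layout = [(1, 1), (4, 2), (9, 1), (11, 1), (12, 1), (15, 1)]
row_751, row_771 = build_offset_maps(layout)
print(row_751)
print(row_771)
```

The sketch leaves row 771 unpadded; in the text the row has as many entries as the internal instruction block can hold.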
  • FIG. 8C is the offset address converter 622 of the present invention.
  • the conversion of an external instruction offset address into an internal instruction offset address is taken as the example for description.
  • the mapping relationship sent from the offset address mapping module 618 is as described in the embodiment of FIG. 8B.
  • the number of selector columns in the selector array 801 equals the number of offset addresses contained in the external instruction block, and the number of rows is one more than that, giving 17 rows and 16 columns. For clarity, only 4 rows and 3 columns are shown in FIG. 8C: the first 4 rows from bottom to top and the first 3 columns from left to right.
  • row numbering starts at 0 for the bottom row, and the row numbers increase upward.
  • column numbering starts at 0 for the leftmost column and increases to the right; each column corresponds to one offset address in the external instruction block.
  • Inputs A and B of every selector in column 0 are '0', except that input A of the row 0 selector is '1'. Input B of every selector in row 0 is '0'. For the selectors in the other columns, input A comes from the output of the selector in the same row of the previous column, and input B comes from the output of the selector one row below in the previous column.
  • the selector array 803 has a structure similar to that of the selector array 801 and the same number of rows. The difference is that the number of selector columns in the selector array 803 equals the number of instructions contained in the internal instruction block.
  • FIG. 8C shows only the first 4 rows from bottom to top and the first 5 columns from left to right.
  • rows and columns are numbered in the same way as in the selector array 801.
  • input B of every selector in row 0 of the selector array 803 is '0', and input A of every selector in the last row (row 16) is '0'. The output of each selector in row 0 is sent to the encoder 809, which encodes the column position of the output.
  • For the other selectors, input A comes from the output of the selector one row above in the previous column, and input B comes from the output of the selector in the same row of the previous column. In column 0, input A comes from the output of the selector one row above in the selector array 801, and input B comes from the output of the selector in the same row of the selector array 801.
  • the decoder 805 decodes the external instruction offset address and sends the resulting mask value to the masker 807. Since an external instruction block contains 16 offset addresses, the mask value is 16 bits wide: the mask bit corresponding to the external instruction offset address and all mask bits before it are '1', and the mask bits after it are '0'. The mask value is then ANDed bitwise with row 751 of the mapping relationship sent from the offset address mapping module 618. This retains the values of row 751 at and before the external instruction offset address and clears the rest, yielding a 16-bit control word that is sent to the selector array 801.
  • Each bit of the control word controls a column of selectors in selector array 801.
  • in the selector array 801, when a control bit is '1', all selectors of the corresponding column select input B; when the bit is '0', they select input A. That is, for each column of selectors in the selector array 801, if the corresponding control bit is '1', the output of the row below in the previous column is selected, so the output of the previous column is shifted up by one row as a whole and '0' is filled into the bottom row; if the control bit is '0', the output of the previous column is passed through unchanged.
  • because the numbers of rows and columns are set by the number of offset addresses contained in the external instruction block, the output of the selector array 801 contains exactly one '1', and the row in which this '1' lies is determined by the control word.
  • each bit of the control word controls a column of selectors in selector array 803.
  • when a control bit is '1', the selectors of the corresponding column select input A.
  • when a control bit is '0', the selectors of the corresponding column select input B.
  • For each column of selectors in the selector array 803, if the corresponding control bit is '1', the output of the row above in the previous column is selected, so the output of the previous column is shifted down by one row as a whole and '0' is filled into the top row; if the corresponding control bit is '0', the output of the same row in the previous column is selected, so the output of the previous column is passed through unchanged.
  • each '1' in the control word thus shifts the input of the selector array 803 down by one row, that is, the single '1' in the input moves down one row. Therefore, when the encoder 809 receives the '1' from the bottom row of the selector array 803, it can generate the corresponding internal instruction offset address from the position of the column in which that '1' appears.
  • the mask value output by the masker 807 is '1111111111000000'; ANDing it bitwise with the value '0100100001011001' in row 751 gives '0100100001000000', that is, the control word contains three '1' bits.
  • the '1' in the input of the selector array 801 is therefore shifted up by three rows, so the output '1' lies in row 3. In the selector array 803, this '1' must pass through three selector columns whose control bit is '1' before it reaches the encoder 809.
  • because the value in row 771 is '1101111', the selector array 803 shifts its input down by one row in each of columns 0, 1, and 3, and the value output to the encoder 809 in column 3 is finally '1', corresponding to the fourth instruction in the internal instruction block (offset address '3').
  • the encoder 809 encodes this position as '3', thereby converting the external instruction offset address value '4' to the internal instruction offset address value '3'.
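The conversion can be sketched behaviourally in Python. This is a simplified model under stated assumptions, not the circuit itself: it tracks only the row position of the single '1' rather than every selector, it assumes the leftmost bit of each row is offset 0, and the function name is hypothetical. Under that bit order the ten leading '1' bits of the quoted mask correspond to a 0-based offset of 9, which is the value used in the example call.

```python
def external_to_internal(ext_offset, row_751, row_771):
    """Model of decoder 805, masker 807, arrays 801/803 and encoder 809."""
    # Decoder 805: ones up to and including the requested offset.
    mask = "1" * (ext_offset + 1) + "0" * (len(row_751) - ext_offset - 1)
    # Masker 807: bitwise AND of the mask with row 751 gives the control word.
    control = "".join(
        "1" if m == "1" and r == "1" else "0" for m, r in zip(mask, row_751)
    )
    # Selector array 801: the single '1' starts in row 0 and moves up one
    # row for every '1' in the control word.
    row = 0
    for bit in control:
        if bit == "1":
            row += 1
    # Selector array 803: the '1' moves down one row for every '1' in
    # row 771; the column in which it arrives at row 0 is the answer.
    for column, bit in enumerate(row_771):
        if bit == "1":
            row -= 1
            if row == 0:
                return column, control   # encoder 809 output, control word
    return None, control                 # no instruction start at or before offset


bn1y, control_word = external_to_internal(9, "0100100001011001", "1101111")
print(control_word, bn1y)
```

Counting the set bits and then locating the matching set bit in row 771 is a rank-and-select operation; the two selector arrays perform it with shifts instead of adders.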
  • the BN2Y value to be stored is compared with the BN2Y values already held in the entries of the block address mapping module 620, so that the BN1X and BN2Y currently being written are stored in the correct position.
  • FIG. 8D is an embodiment of the block address mapping module of the present invention.
  • the block address mapping module 620 includes a block address storage module 920, a comparison module 924, a shifter 926, a multiplexer 940, a multiplexer 942, and some selector control logic.
  • Each functional module is divided into multiple, essentially identical columns (for example R, S, and T).
  • Each column has its own block address storage module 920, comparison module 924, shifter 926, multiplexer 940, and multiplexer 942.
  • the block address storage module 920 is a memory array consisting of multiple entries organized into rows and columns (such as storage modules 970, 971, and 972 in FIG. 8D).
  • each entry has two parts: the level 1 cache block number (BN1X) and the offset within the level 2 cache block (BN2Y).
  • on a read, the address on bus 639 selects one row of the memory array, which is output on bus 950; on a write, bus 639 likewise selects a row, and the data on bus 952 is written into it.
  • Each column of the block address storage module 920 has a corresponding comparator in the comparison module 924 for comparing the intra-block offset BN2Y. Except for the comparison module 924, the functional modules and buses have a bit width equal to the entry width of the block address storage module 920, so that they can transfer a whole entry.
  • the comparison module 924 consists of greater-than comparators whose bit width is that of BN2Y. When the BN2Y of a column on bus 950 is greater than the BN2Y on bus 637, that column's comparator output is '1'; when the BN2Y on bus 950 is less than or equal to the BN2Y on bus 637, that column's comparator output is '0'.
  • when the comparator output of a column is '0', the selector 940 of that column selects the entry on that column's bus 950 and places it on bus 952.
  • when the comparator output of a column is '1', the selector of the column to its right selects the data of this column's bus 950, as shifted by the shifter 926, and places it on bus 952.
  • that is, the controller moves the data on this column's bus 950 one column to the right.
  • if the comparator output of a column is '1' and the comparator output of the column to its left is '0', the selector of that column places the data on bus 665 onto bus 952.
  • The outputs of the selectors 940 on bus 952 are sent back to the block address storage module 920 column by column. For example, the output of selector 976 is sent back only to storage module 970, and the output of selector 977 only to storage module 971.
  • the control logic selects the data on the column bus 950 to be placed on the bus 954 and sent to the track table 610.
  • the comparison in the comparison module 924 results in comparator outputs 973, 974, and 975 all being '1' (output 973 being '1' means that the block address storage module 920 holds no valid entry corresponding to the BN2Y on bus 637). This causes selectors 977 and 978 of the selectors 940 to select input C, placing the output of the shifter 926 on bus 952, and causes selector 976 to select the data on bus 665 and place it on bus 952.
  • the data on bus 952 is written to the same row that was just read from the block address storage module 920.
  • as a result, the entry in storage module 970 stores the data sent from bus 665, the entry in storage module 971 stores the data formerly in storage module 970, and the entry in storage module 972 stores the data formerly in storage module 971.
  • for the columns further to the right that are not shown in the figure, the BN2Y input to each comparator from bus 950 is '32', which is greater than '18', so each comparison result is '1' and each controls the data of its column to shift right. That is, every entry whose BN2Y value is larger than that of the new data is shifted right, and the entries, including the new data, remain arranged in ascending order of BN2Y.
  • The controller checks output 973 of the leftmost comparator in the comparison module 924 to determine whether the input BN2Y value has a corresponding level 1 cache block. If comparator output 973 is '1', the input BN2Y has no corresponding level 1 cache block; if it is '0', the input BN2Y has a corresponding level 1 cache block.
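The shift-and-insert behaviour can be sketched in Python as follows. The list-of-tuples row and the function name are hypothetical; the A/B/C choice per column follows the selector inputs described in the text, and the entry values replay the order in which row '81' is filled in the FIG. 9 walkthrough (with `None` standing for an unused BN1X, an assumption of this sketch).

```python
EMPTY_BN2Y = 32  # reset value: first byte of the next sequential level 2 block


def insert_entry(row, bn1x, bn2y):
    """Insert (bn1x, bn2y) into a row kept in ascending BN2Y order.

    Assumes the caller only inserts a BN2Y that is not yet in the row.
    """
    greater = [entry_bn2y > bn2y for _, entry_bn2y in row]  # comparator outputs
    new_row = []
    for col, (entry, gt) in enumerate(zip(row, greater)):
        if not gt:
            new_row.append(entry)           # input A: keep this column's entry
        elif col == 0 or not greater[col - 1]:
            new_row.append((bn1x, bn2y))    # input B: new data from bus 665
        else:
            new_row.append(row[col - 1])    # input C: left neighbour, shifted right
    return new_row


row_81 = [(None, EMPTY_BN2Y)] * 5           # entries R, S, T, U, V after reset
for bn1x, bn2y in [(70, 18), (71, 27), (73, 2), (74, 10)]:
    row_81 = insert_entry(row_81, bn1x, bn2y)
print(row_81)
```

Because every column decides independently from its own comparator and its left neighbour's, the whole row can be rewritten in one pass, which is what lets the hardware do the insertion in a single read-modify-write of the row.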
  • Comparator output 973 causes selector 976 to select input A, which places the data on that column's bus 950 onto bus 952; comparator output 974 causes selector 978 to select input C, the output of the shifter 926; comparator output 975 causes the selector of the column to the right to select input C, the output of the shifter.
  • Comparator output 973 being '0' and comparator output 974 being '1' together cause selector 977 to select input B, the data on bus 665.
  • the BN2Y value of the entry in storage module 970 is '18', the BN2Y value of the entry in storage module 971 is '27', and the BN2Y value of the entry in storage module 972 is '32', as it is in all other entries to the right. The entries are thus sorted by their BN2Y values, so the corresponding level 1 cache block numbers are also ordered by offset within the level 2 storage block, and the BN1 address of the corresponding internal instruction can be obtained by mapping from the BN2 address of an external instruction.
  • BN2 address '8123' is sent from buses 639 and 637.
  • row '81' is read; the BN2Y values in the entries of storage modules 970, 971, and 972 are '18', '27', and '32', respectively.
  • the BN2Y value fed in on bus 637 is '23'.
  • the comparison in the comparison module 924 yields '0' at comparator output 973 and '1' at outputs 974 and 975. At this time only one control signal to the selector driving bus 954 is '1', namely the one for the column whose comparator output is '0' while the output of the column to its right is '1'.
  • from this offset and the mapping relationship, the offset address converter 622 obtains the corresponding level 1 cache block offset BN1Y (in the mapping, the bit for level 2 cache offset byte 5 must be '1', marking the first byte of an external instruction, and the offset address converter 622 finds the level 1 cache offset of the internal instruction corresponding to that external instruction).
  • the BN1X on bus 954 is concatenated with this BN1Y to obtain the level 1 cache address BN1 corresponding to the level 2 cache address '8123'.
  • this BN1 can be placed in an entry of the track table 610 for the tracker to use.
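The column selection for a lookup can be sketched in Python. The function name is hypothetical, and the BN1X values '70' and '71' are borrowed from the later FIG. 9 walkthrough rather than stated for this example, so treat them as placeholders. The boundary test mirrors the exclusive OR of adjacent comparator outputs that the text describes for signal 981.

```python
def lookup_block(row, bn2y):
    """Return (BN1X, offset relative to that block's first byte) or None."""
    greater = [entry_bn2y > bn2y for _, entry_bn2y in row]  # comparator outputs
    if greater[0]:
        return None                      # output 973 is '1': nothing converted yet
    for col in range(len(row)):
        # Treat the (non-existent) column right of the last one as '1'.
        right = greater[col + 1] if col + 1 < len(row) else True
        if greater[col] ^ right:         # boundary between the '0's and the '1's
            bn1x, start = row[col]
            return bn1x, bn2y - start    # difference fed to the offset converter
    return None


row_81 = [("70", 18), ("71", 27), (None, 32)]
print(lookup_block(row_81, 23))
```

Because the row is sorted, the comparator outputs always read as a run of '0's followed by a run of '1's, so at most one column satisfies the boundary condition.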
  • FIG. 9A to FIG. 9F are schematic diagrams of the operating process of an embodiment.
  • the runtime block address storage module 920 and the secondary cache 606 are shown in Figures 9A-9F.
  • each row of the block address storage module 920 and each block of the level 2 cache 606 corresponds to an external block address in the active table 604.
  • each row of the offset address mapping module 618 and of the track table 610 corresponds to a block of the level 1 cache 602.
  • the active table 604 in FIG. 6 is also responsible for allocating, by a replacement rule, a level 2 cache block in the level 2 cache 606 for each newly fetched external instruction block; the replacement module 611 is responsible for allocating, by a replacement rule, a level 1 cache block in the level 1 cache 602 for internal instructions.
  • the shaded portion of the level 1 cache 602 in the figures represents the internal instructions that have already been filled.
  • the address of the level 2 cache 606 is BN2, in the format '8XYY', where '8X' is the block address BN2X.
  • the level 2 cache 606 is a set-associative cache. Its block address is an index address with values '80' to '82', and the corresponding tag (that is, the external block address) is stored in the row with the same index address in the active table.
  • Each level 2 cache block in the level 2 cache 606 (one row in the figure) has 32 bytes, and its intra-block offset BN2Y is the intra-block byte address 'YY', with values '0' to '31'.
  • it stores variable-length external instructions; each partition in the figure represents one external instruction, and in this embodiment external instructions are 2 to 8 bytes long.
  • the level 1 cache 602 is a fully associative cache under the cooperative control of the track table 610 and the block address storage module 920. Its address is BN1, in the format '7XY', where '7X' is the block address BN1X with values '70' to '75'.
  • Each level 1 instruction block in the level 1 cache 602 (one row in the figure) holds 4 fixed-length internal instructions. Its intra-block offset BN1Y is the intra-block word address 'Y'; for ease of understanding and to distinguish it from BN2Y, its values are written as the letters A to D in this embodiment. In this embodiment an internal instruction is one word long, although internal instructions may have other lengths.
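The two address notations can be made concrete with a small Python sketch. The helper names are hypothetical, and telling BN1 from BN2 by the length of the written string is an artifact of the notation used in this description, not a property of the hardware.

```python
WORD_LETTERS = "ABCD"   # BN1Y is written as a letter in this embodiment


def parse_bn2(text):
    """'8118' -> (BN2X, BN2Y) = (81, 18): block number, byte offset 0..31."""
    return int(text[:2]), int(text[2:])


def parse_bn1(text):
    """'72B' -> (BN1X, BN1Y) = (72, 1): block number, word offset 0..3."""
    return int(text[:2]), WORD_LETTERS.index(text[2])


def is_bn2(text):
    """In this notation BN2 has four characters and BN1 has three."""
    return len(text) == 4


print(parse_bn2("8118"), parse_bn1("72B"), is_bn2("8118"), is_bn2("72B"))
```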
  • Each row in the track table 610 also has four entries A to D, corresponding to the four internal instructions A to D in the level 1 cache 602. Each row in the track table 610 also has an E entry that holds the address of its next instruction block. Each entry in the track table 610 stores a type, and the tracker determines the next address based on the type. An entry can also store the target address of the instruction it represents, which can be either a BN2 or a BN1 address.
  • each row of the offset address mapping module 618 corresponds to one level 1 cache block and its corresponding track table row.
  • Each row in the block address storage module 920 corresponds to a level 2 cache block in the level 2 cache 606.
  • Each row of the block address storage module 920 has multiple entries (such as R, S, T, U, V). Each entry can correspond to one level 1 instruction block in the level 1 cache.
  • Each entry of the block address storage module 920 contains the block address BN1X of its corresponding level 1 cache block and BN2Y, the address within the level 2 cache block of the external instruction corresponding to the first internal instruction of that level 1 cache block. When a level 2 cache block is written, the BN2Y values in its corresponding row of the block address storage module 920 are reset to '32', which denotes the first byte of the next sequential level 2 cache block.
  • FIG. 9A shows the starting state, in which level 2 cache block '80' in the level 2 cache 606 has been filled and level 2 cache blocks '81' and '82' have not.
  • the external instructions starting from byte '24' of block '80' are being converted by the scan converter 608 into internal instruction format and filled in order, via bus 667, into level 1 cache block '72' of the level 1 cache 602.
  • the scan converter 608 finds that the external instruction starting at byte '26' of block '80' is a branch instruction. It calculates the branch target by adding the intra-block offset '26' and the branch offset to the block address stored in the active table 604 for this cache block.
  • the high-order bits of the branch target are sent via bus 657 to the active table 604 for matching. The match misses, and the active table 604 allocates a new level 2 cache block, block '81' (that is, BN2X is '81').
  • the active table 604 also sends the high-order bits of the branch target to the lower-level memory to read the corresponding external instruction block and store it in cache block '81'.
  • correspondingly, the BN2Y values of row '81' in the scan converter 608 are all reset to '32'.
  • the newly allocated level 2 cache block number, sent by the active table 604 via bus 671, and the low-order bits of the branch target (byte '18') on bus 657 output by the scan converter 608 are concatenated into a BN2 address on bus 687.
  • the scan converter 608 also determines that the internal instruction address corresponding to the external instruction '8026' (byte '26' of block '80') is '72B' (the second word of level 1 block '72'). The address bus 669 of the scan converter 608 therefore points to the entry in row '72', column B of the track table, and the content from bus 687 is written into that entry. The '72B' entry of the track table 610 thus contains the BN2 address '8118'.
  • the low-order bits of the branch target on bus 657 (BN2Y value '18') are selected by selector 638 and placed on input 637 of the comparison module 924, where they are compared with the contents of each entry of row '81' read from the block address storage module 920 (the BN2X value '81' assigned by the active table 604 is selected by selector 640 and supplied via bus 639). The value '18' is found to be less than the contents of all entries (that is, '18' < '32'), so the BN1X value '72' and the BN2Y value '18' (the external instruction at byte '18' is the branch target, associated here with level 1 block '72') are written into the R entry of row '81' in the block address storage module 920. At this time the value of the R entry is '7218'.
  • the scan converter 608 continues format conversion up to byte '30' of block '80' in the level 2 cache 606 and finds that the instruction there is 4 bytes long, extending 2 bytes beyond this block. It therefore adds '30' (the intra-block offset) and '4' (the instruction length in bytes) to the block address of this level 2 cache block to generate the address of the next external instruction.
  • this address is likewise sent via bus 657 to the active table 604 for matching, which finds that the external instruction block is already in (or is being read from the lower-level memory into) level 2 cache block '81'.
  • the transfer of the instruction stream from the last instruction of an instruction block to the next sequential instruction is treated as an unconditional branch: the BN2 address on bus 687 is used as the target address and placed in the track table entry following that of the block's last instruction (address '72C'), with the type set to unconditional branch. The scan converter 608 therefore sends the address '72D' via bus 661 and controls the track table to write the BN2 address '8102' into entry D of row '72'.
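The fall-through target arithmetic can be checked with a short Python sketch. The function names are hypothetical; the block number of the following block is passed in explicitly because, in the text, it is assigned by the active table 604 and need not be the current number plus one.

```python
BLOCK_BYTES = 32   # bytes per level 2 cache block in this embodiment


def next_instruction_location(bn2y, length):
    """Return (blocks_ahead, BN2Y) of the instruction that follows.

    bn2y: intra-block offset of the current external instruction.
    length: its length in bytes.
    """
    end = bn2y + length
    return end // BLOCK_BYTES, end % BLOCK_BYTES


def format_bn2(bn2x, bn2y):
    """Write a BN2 address in the '8XYY' notation of the description."""
    return f"{bn2x}{bn2y:02d}"


# 4-byte instruction at byte 30 of block '80'; the active table maps the
# following block to '81' (value taken from the walkthrough).
ahead, offset = next_instruction_location(30, 4)
print(ahead, format_bn2(81, offset))
```

An instruction that ends exactly on the block boundary gives offset 0 in the following block, which is how a target such as '8200' arises.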
  • the tracker 614 reads the track table starting from entry A of row '72'. Because entry A is not a branch instruction, the tracker continues reading to the right.
  • the tracker 614 reads '8118' from entry B of row '72' and determines that it is a BN2 address, so the address is sent via bus 631 to the block address storage module 920 and the level 2 cache 606.
  • the BN2 address is used to read the contents of row '81' from the block address storage module 920.
  • the control logic finds that all level 1 cache block numbers in row '81' of the block address storage module 920 are invalid and concludes that the external instructions at this BN2 address have not yet been converted into internal instructions. It therefore controls the level 2 cache 606 to read level 2 cache block '81' sequentially, starting from address '8118', and supply it to the scan converter 608 for format conversion, up to the external instruction at '8131' (the last byte of block '81').
  • the scan converter 608 therefore also requests a replaceable level 1 instruction block number from the replacement module 611.
  • the replacement module 611 determines the replaceable level 1 storage blocks according to a rule such as the LRU replacement algorithm; in this case they are '70', '71', '73', '74', and '75'. Level 1 storage block '70' is therefore provided, in order, for filling.
  • the scan converter 608 then fills the internal instructions converted from the external instructions starting at '8118' in the level 2 cache 606 sequentially into entries A, B, C, and D of storage block '70' in the level 1 cache 602, starting at BN1 address '70A'.
  • when the scan converter 608 finds that entry D of level 1 block '70' has been filled, the instructions at addresses '8118' to '8131' in level 2 storage block '81' have not all been converted; conversion has only reached the external instruction at address '8126'. It therefore requests another replaceable level 1 instruction block number from the replacement module 611, which provides level 1 storage block '71' in order.
  • as in the previous example, the controller then writes the BN1 value '71A' generated by the replacement module 611 (the address of the first instruction in level 1 cache block '71'), together with the unconditional branch instruction type generated by the controller, into the E entry of row '70' in the track table 610.
  • the entry is used by the tracker 614 to jump to the first instruction of cache block '71'.
  • the scan converter 608 also continues to convert external instructions and fills level 1 block '71' in sequence.
  • the scan converter 608 also stores the mapping between the intra-block offset address BN2Y of the first byte of each external instruction at addresses '8118' to '8126' and the intra-block offset address BN1Y of the corresponding internal instruction, in the format of the 7B example, into row '70' of the intra-block offset mapping module 618, the row pointed to by the tracker pointer 631.
  • the BN2Y value '27' sent from bus 657 is sent to the comparison module 924 and compared with each entry of row '81'. It is greater than the BN2Y value '18' in the R entry but less than the BN2Y value in the S entry and the other entries (all '32'). The value '7127' is therefore written into the S entry of row '81' in the block address storage module 920. The R entry value '7018' is unchanged, and the entries formerly at and to the right of S shift right by one entry.
  • the scan converter 608 finds no branch instruction among the external instructions at '8118' to '8131', so no branch target is recorded in entries A, B, C, and D of row '70' in the track table 610.
  • the scan converter 608 finds that the external instruction starting from byte '26' in row '81' ends at byte '31' and does not extend into the next instruction block, and that the corresponding internal instructions end at entry B of storage block '71'. Therefore, as in the previous example, the address of the next external instruction is calculated and matched, and the resulting address '8200' is stored in entry C of row '71' in the track table 610.
  • as in the previous example, the active table 604 reads the corresponding external instruction block from the lower-level memory to fill level 2 cache block '82'. See FIG. 9C.
  • the processor core executes the branch instruction at entry '72B' of the track table, and the branch decision is sent to the tracker 614 via signal 635. In this case the branch is not taken.
  • the tracker 614 moves to the next track point '72C' in the same row of the track table, finds that it is a non-branch instruction, and moves on to the next entry '72D'. On reading it, the tracker finds an unconditional branch with target '8102'.
  • the controller determines that this is the BN2 address and is sent via bus 633.
  • the upper bit of bus 633 is sent to the block address storage module.
  • the 'number one level cache block fills the internal instructions of the conversion in order.
  • the value '7302' (meaning that the internal instructions corresponding to the external instruction at BN2Y '02' are placed in level 1 instruction block '73') is written into the R entry of row '81', while the existing entries of row '81' each shift right by one entry.
  • the BNY2 value '18' in the entry to which the new value is written (in this case, the R entry) is sent to the scan converter 608 to notify the scan converter 608.
  • while the converted internal instructions are filled into level 1 cache block '73', the BN1 value '73A' generated by the replacement module 611 is written, together with the unconditional branch instruction type generated by the controller, into entry '72D' of the track table 610, replacing the BN2 value '8102' with the BN1 value '73A'.
  • the tracker 614 read pointer 631 still points to the '72D' entry, so the value of '73A' is read on bus 633.
  • Control logic judges this is BN1
  • when the scan converter 608 has converted the external instruction ending at byte '9' of row '81', it finds that entry D of level 1 instruction block '73' has been filled. It therefore requests a new block, obtains level 1 instruction block '74', and continues converting and filling from the external instruction starting at byte '10'.
  • as in the previous example, the BN1 value '74A' generated by the replacement module 611 is written, together with the unconditional branch instruction type generated by the controller, into the E entry of row '73' in the track table 610.
  • as in the previous example, the BN2Y value '10' sent from bus 657 is sent to the comparison module 924 and compared with the entries of row '81'.
  • it is greater than the BN2Y value '02' in the R entry but less than the BN2Y value '18' in the S entry and the BN2Y values in the other entries.
  • the value '7410' is therefore written into the S entry of row '81' in the block address storage module 920; the value of the R entry is unchanged, and the entries formerly at and to the right of S shift right by one entry.
  • the scan converter 608 continues to convert external instructions and fill them into the level 1 cache 602. The external instruction ending at byte '17' is filled into entry B of level 1 cache block '74'. At this point the scan converter 608 finds that it has reached the limit '18' previously sent by the comparison module 924. Using this limit, it obtains '70' from row '81' of the block address storage module 920, that is, the BN1 address '70A', which is stored together with the unconditional branch instruction type in entry C of row '74' in the track table 610. In another embodiment, the BN2 address '8118' can instead be stored in entry C of row '74' in the track table 610 and mapped when the tracker reads it. See FIG. 9D.
  • the tracker 614 continues along track '73'. Because entries '73B', '73C', and '73D' of the track table are non-branch instructions, the tracker does not stop at them; from entry '73E' it reads the unconditional branch target '74A' and transfers to row '74', continuing from entry A.
  • the tracker reads the unconditional branch target '70A' from entry '74C' and transfers to row '70'; from entry '70E' it reads an unconditional branch with target '71A'.
  • the tracker 614 transfers to row '71' and continues; the content read from entry '71C' is an unconditional branch with target '8200'.
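This walk through the track table can be replayed with a small Python sketch. The dictionary layout and function name are hypothetical, only unconditional branches are modelled, `None` marks a non-branch entry, and entry '73A' is assumed to be a non-branch instruction because the text does not say otherwise.

```python
# Track table contents as described for FIG. 9D (partly assumed, see above).
track_table = {
    "73": {"A": None, "B": None, "C": None, "D": None, "E": "74A"},
    "74": {"A": None, "B": None, "C": "70A"},
    "70": {"A": None, "B": None, "C": None, "D": None, "E": "71A"},
    "71": {"A": None, "B": None, "C": "8200"},
}


def walk(table, start):
    """Follow unconditional branches until a BN2 target is reached."""
    row, col = start[:2], start[2]
    visited = []
    while True:
        entries = table[row]
        columns = list(entries)
        i = columns.index(col)
        while entries[columns[i]] is None:   # skip non-branch entries
            i += 1
        target = entries[columns[i]]
        visited.append((row + columns[i], target))
        if len(target) == 4:                 # four characters: BN2 in this notation
            return visited                   # needs mapping or conversion first
        row, col = target[:2], target[2]     # BN1 target: jump within level 1


path = walk(track_table, "73A")
print(path)
```

The walk stops at '8200' because a BN2 target cannot address the level 1 cache directly; the following steps in the text convert block '82' and rewrite the entry with a BN1 value.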
  • the controller determines that the target is a level 2 cache block address and sends the address to the block address storage module 920 via bus 631.
  • the match finds that level 2 cache block '82' has no valid level 1 cache block.
  • the result of the match causes the scan converter 608 to begin converting all external instructions in cache block '82' into internal instructions, which are filled into the level 1 cache 602 starting from level 1 storage block '75' provided by the replacement module 611.
  • the scan converter 608 also fills the instruction types extracted during conversion and the calculated branch targets into the track table.
  • the controller also writes the BN1 address '75A' generated by the replacement module 611, together with the unconditional branch instruction type, into the track table 610.
  • the new content of the entry is read from the track table and sent directly via bus 631 to the level 1 cache 602 to read internal instructions for use by the processor core 601.
  • comparator outputs 973 and 974 are both '0', and output 975 is '1'.
  • signal 981 (the exclusive OR of output 974 and output 975) is therefore '1' and controls the selector driving bus 954, so that the content '7410' of the entry in storage module 971 is placed on bus 954 and sent to the intra-block offset mapping logic (comprising the intra-block offset mapping module 618, the offset address converter 622, and the subtractor 928).
  • the level 1 cache block number BN1X in the entry is used as the address to read the mapping of row '74' from the intra-block offset mapping module 618, and the mapping is sent to the offset address converter 622.
  • the subtractor 928 subtracts the BN2Y on bus 954 (the starting address, within the level 2 cache block, of the portion corresponding to this level 1 cache block) from the BN2Y offset within the level 2 cache block.
  • from this difference and the mapping, the offset address converter 622 obtains the corresponding level 1 cache block offset BN1Y.
  • the BN1X on the bus 954 is spliced with this BN1Y to obtain the address corresponding to the above secondary cache address' 8116 'Corresponding level 1 cache address BN1 value '74B'.
  • This BN1 value can be placed in the '75B' entry of track table 610 to replace the original '8116', so that tracker 614 controls level 1 cache 602 to read instructions based on the BN1 value and the feedback from processor core 601.
  • Scan converter 608 continues to convert the external instructions in row '82' of level 2 cache 606; once level 1 cache block '75' has been filled, cache block '77' is allocated as the next sequential cache block. See Figure 9F.
  • After the branch target address required by the tracker has been converted in the track table from BN2 to BN1, tracker 614, once it reads this value, can control the level 1 instruction cache to provide instructions to processor core 601 without interruption (except when waiting for the conditional branch decision that processor core 601 sends over bus 635).
  • The processor system can support not only the various external instruction sets (binary code instruction sets) of different processor platforms, but also the bytecode instruction set of a virtual machine. For example, the bytecode instructions that would be fed to a Java interpreter can be converted into one or more internal instructions for execution by the processor core, in the same manner as in the previous embodiments.
  • Because of the special nature of bytecode instructions, some improvements can be made during conversion to raise performance. For example, a bytecode instruction that operates on a constant would, under the method of the previous embodiments, be converted into a data read instruction followed by the corresponding operation instruction, since the constant is stored in a constant pool in memory.
  • In this case, when the scan converter examines the instructions and finds that a bytecode instruction reads a constant, the constant can be filled from memory into the data cache in advance.
  • Thus, when the processor core executes the first internal instruction corresponding to the bytecode instruction (i.e., the data read instruction), the data read does not cause a cache miss.
  • Alternatively, once the constant has been fetched from memory in advance, it can be embedded directly into the corresponding internal instruction (i.e., the operation instruction) as an immediate value, so that the data read instruction can be omitted.
  • Thus, when the processor core executes the internal instruction corresponding to the bytecode instruction (i.e., the operation instruction with the embedded constant), the operation can be performed directly, further improving the performance of the processor system.
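The two conversion options for a constant-using bytecode instruction can be illustrated with a minimal Python sketch. The function name, the tuple encoding of internal instructions, and the `_imm` suffix are assumptions made for illustration only; the patent text does not define an internal instruction format.

```python
def convert_constant_op(op, pool_index, constant_pool, embed_immediate):
    """Convert one constant-using bytecode instruction (hypothetical encoding).

    Returns a list of internal instructions as (mnemonic, operand) tuples.
    """
    if embed_immediate:
        # The constant is fetched at conversion time and embedded as an
        # immediate, so no separate data read instruction is emitted.
        return [(op + "_imm", constant_pool[pool_index])]
    # Baseline conversion: a data read instruction followed by the operation.
    return [("load_const", pool_index), (op, None)]
```

With embedding enabled, the converted sequence is one instruction shorter, which is the saving the passage describes.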
  • Stack operation instructions among the bytecode instructions can likewise be converted by the method of the present invention into corresponding internal instructions for execution by the processor core, eliminating the step of translating bytecode instructions into machine code instructions.
  • When a stack operation is converted into an internal instruction, the operands of this class of internal instruction are not ordinary register values in the register file but several register values at the top of the operand stack.
  • Corresponding control logic can be added to the existing register file in the processor core so that the register file can be used as a stack register.
  • FIG. 10A is an embodiment of the operand stack of the present invention.
  • In this embodiment, a stack operation that requires at most two operands and produces one result is taken as the example. Other cases can be handled analogously.
  • Register file 1001 supports two read operations and one write operation simultaneously.
  • Decoders 1003 and 1005 decode the two register numbers sent to the first and second read ports, respectively, and the corresponding register values are read out on buses 1013 and 1015.
  • Decoder 1007 decodes the number of the register to be written and sends it to the write port, so that the value on bus 1017 can be written to the corresponding register.
  • Register 1011 stores the stack top pointer value, which is the number of the register at the top of the stack when the register file is used as a stack.
  • The value in register 1011 is sent via bus 1045 to selectors 1053, 1055, and 1057, and to decrementer 1031, incrementer 1041, and controller 1019.
  • Decrementer 1031 and incrementer 1041 respectively decrement and increment by one the stack top pointer value sent on bus 1045, and send the results to selectors 1053, 1055, and 1057 via buses 1043 and 1047, respectively.
  • Because the capacity of register file 1001 is limited, the stack bottom pointer of the operand stack is moved as well, so that the operand stack can continue to provide operands.
  • Controller 1019 generates a new stack bottom pointer value according to the stack top pointer value. The new value is decoded by decoder 1009 to control register file 1001 either to store the register values between the original and the new stack bottom pointer into external memory, or to fill the corresponding operands from external memory into the registers between the original and the new stack bottom pointer.
  • An instruction field in the internal instruction indicates whether the internal instruction is a register operation instruction or a stack operation instruction. The value of this field is sent over control line 1021 to selectors 1033, 1035, and 1037.
  • Depending on this value, selectors 1033, 1035, and 1037 select either input A or input B and send it to decoders 1003, 1005, and 1007, respectively.
  • When an internal instruction is a register operation instruction, the two source register numbers and the destination register number arrive on buses 1023, 1025, and 1027, respectively, are selected by selectors 1033, 1035, and 1037, and are decoded by decoders 1003, 1005, and 1007 to address the register file and read or write the corresponding registers. This operation is similar to the prior art and is not described further here.
  • When an internal instruction is a stack operation instruction, the three instruction fields that otherwise hold register numbers are used to hold stack top pointer movement information.
  • For a stack operation with two operands and one result, the register number of one operand is the stack top pointer value stored in register 1011, the register number of the other operand is the stack top pointer value minus one, and the register number of the result is also the stack top pointer value minus one. That is, after the two operands at the top of the stack are popped, the result of the operation is pushed back onto the top of the stack.
  • In this case, selector 1053, controlled by the instruction field on bus 1023, selects input D (the current stack top pointer value), and the first operand is read from the register file; selector 1055, controlled by the instruction field on bus 1025, selects input H (the current stack top pointer value minus one), and the second operand is read from the register file; selector 1057, controlled by the instruction field on bus 1027, selects input K (the current stack top pointer value minus one), which after decoding selects the register to be written back.
  • Selector 1051, controlled by the instruction field on bus 1029, selects input N (the current stack top pointer value minus one) as the new stack top pointer value, which is written back to register 1011 to complete the stack top pointer update.
  • For a push operation, selector 1057, controlled by the instruction field on bus 1027, selects input I (the current stack top pointer value plus one), which after decoding selects the corresponding register; the operand is written to that register to implement the push.
  • Selector 1051, controlled by the instruction field on bus 1029, selects input L (the current stack top pointer value plus one) as the new stack top pointer value, which is written back to register 1011 to complete the stack top pointer update.
  • For a pop operation, selector 1053, controlled by the instruction field on bus 1023, selects input D (the current stack top pointer value), which after decoding selects the corresponding register; the operand is read from that register to implement the pop.
  • Selector 1051, controlled by the instruction field on bus 1029, selects input N (the current stack top pointer value minus one) as the new stack top pointer value, which is written back to register 1011 to complete the stack top pointer update.
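The stack-top pointer arithmetic described for FIG. 10A can be sketched in Python as follows. This is a behavioral model under stated assumptions, not the circuit itself: the class and method names are hypothetical, the register file is modeled as a list, and wrap-around of the pointer (modulo the register count) is an assumption the text does not state.

```python
class RegisterStack:
    """Behavioral model of a register file used as an operand stack."""

    def __init__(self, size=8):
        self.regs = [0] * size
        self.size = size
        self.top = size - 1  # assumed start value, so the first push lands in register 0

    def push(self, value):
        # Write to (top + 1), then make that the new stack top.
        target = (self.top + 1) % self.size
        self.regs[target] = value
        self.top = target

    def pop(self):
        # Read the register at the stack top, then move the top down by one.
        value = self.regs[self.top]
        self.top = (self.top - 1) % self.size
        return value

    def binary_op(self, fn):
        # Operands come from top and (top - 1); the result is written to
        # (top - 1), which also becomes the new stack top.
        first = self.regs[self.top]
        below = (self.top - 1) % self.size
        second = self.regs[below]
        self.regs[below] = fn(second, first)
        self.top = below
```

For example, pushing 2 and 3 and then applying an addition leaves a single value, 5, at the new stack top, matching the "pop two, push one" behavior described above.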
  • Controller 1019 compares the current stack bottom pointer value, which it stores, with the current stack top pointer value sent from register 1011. If the two values are close to a certain extent, the operand stack is nearly empty. If operands were previously stored to external memory (or cache), a certain number of them must then be fetched from external memory (or cache) and filled into the register file beyond the stack bottom, and the stack bottom pointer value updated.
  • FIG. 10B is an embodiment of updating the operand stack according to the present invention.
  • In this embodiment, the threshold '3' indicates that the operand stack is nearly empty, and one operand is filled in at a time.
  • Initially, the stack bottom pointer points to register 1073 and the stack top pointer points to register 1079.
  • After the stack top pointer moves to register 1077, controller 1019 sends a signal to fetch from external memory (or cache) the operand most recently stored there, and fills it into the register one position below the stack bottom pointer (i.e., register 1071).
  • The stack bottom pointer value is then decremented by one so that it points to register 1071, keeping the number of operands in the stack larger than '3'.
  • FIG. 10C is another embodiment of updating the operand stack according to the present invention.
  • In this embodiment, the threshold '7' indicates that the operand stack is nearly full, and one operand is stored to external memory (or cache) at a time.
  • Initially, the stack bottom pointer points to register 1081 and the stack top pointer points to register 1091.
  • After the stack top pointer moves to register 1093, controller 1019 sends a signal to store the operand pointed to by the stack bottom pointer to external memory (or cache), and increments the stack bottom pointer value by one so that it points to register 1083, keeping the number of operands in the stack less than '7'.
  • A method of filling or storing several operands at a time can be derived from the embodiments of FIG. 10B and FIG. 10C. It is similar to those embodiments and is not described further here.
  • In the above embodiments, the difference between the stack top pointer value and the stack bottom pointer value is used to judge whether the operand stack is nearly empty or nearly full.
  • The judgment can also be based on the change in the stack top pointer value. For example, if the stack top pointer value has increased or decreased by a certain amount since the stack bottom pointer value was last adjusted, the corresponding operation is performed.
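The nearly-empty and nearly-full handling of FIG. 10B and FIG. 10C can be sketched in Python as below. The thresholds 3 and 7 follow the embodiments; everything else is an assumption for illustration: the class name is hypothetical, the registers are modeled as a `deque` whose left end is the stack bottom, external memory is a plain list, and exactly one operand is moved per adjustment.

```python
from collections import deque


class SpillingStack:
    """Operand stack that spills to / refills from external memory."""

    def __init__(self, low=3, high=7):
        self.regs = deque()  # left end = stack bottom, right end = stack top
        self.memory = []     # spilled operands; the most recent spill is last
        self.low = low
        self.high = high

    def _adjust(self):
        if len(self.regs) >= self.high:
            # Nearly full: store the operand at the stack bottom externally.
            self.memory.append(self.regs.popleft())
        elif len(self.regs) <= self.low and self.memory:
            # Nearly empty: refill the most recently spilled operand
            # just below the current stack bottom.
            self.regs.appendleft(self.memory.pop())

    def push(self, value):
        self.regs.append(value)
        self._adjust()

    def pop(self):
        value = self.regs.pop()
        self._adjust()
        return value
```

Pushing nine values spills the three oldest to memory, and popping them all back still returns the values in last-in, first-out order, because each refill restores the operand that was spilled most recently.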
  • The end track point is treated as an unconditional branch point. Consequently, when tracker read pointer 631 points to the track point immediately before the end track point (i.e., the last instruction in the instruction block), and that track point is not a branch point, or is a branch point whose branch is not taken, read pointer 631 first moves to the end track point and is updated to the first track point of the next track only in the following clock cycle. During that clock cycle, level 1 cache 602 must therefore also output an empty instruction (i.e., an instruction that does not change the internal state of the processor core, such as NOP) to processor core 601.
  • Specifically, the addressing address sent to level 1 cache 602 can be examined. Once the address is found to correspond to the end track point, level 1 cache 602 need not be accessed, and the empty instruction is output directly for processor core 601 to execute.
  • The disadvantage of this approach is that processor core 601 spends one extra clock cycle executing a useless empty instruction.
  • The Figure 7A embodiment can be modified so that, when tracker read pointer 631 points to the track point immediately before the end track point, it moves in the next clock cycle directly to the branch target track point or to the first track point of the next track, according to the instruction type of that track point and the instruction execution feedback from processor core 601.
  • FIG. 11A is another embodiment of the track-table-based cache structure according to the present invention.
  • In this embodiment, processor core 601, level 1 cache 602, scan converter 608, level 2 cache 606, replacement module 611, offset address mapping module 618, and selectors 692, 696, and 694 are all the same as in the Figure 7A embodiment.
  • The difference is that track table 610 outputs the contents of two track points at a time (content 1182 of the track point pointed to by tracker read pointer 631, and content 1183 of the track point after it), and that type decoder 1152, controller 1154, and selector 1116 are added to the tracker.
  • Controller 1154 performs functions similar to those of the controller not shown in Figure 7A; it is shown here in order to illustrate more complex functions and operations.
  • Addressed by read pointer 631 output by the tracker, the read port of track table 610 outputs the contents of two adjacent track points onto bus 1117 and bus 1121. Controller 1154 detects the instruction type on bus 1117, and type decoder 1152 detects the instruction type on bus 1121. At any time, two entries are read from track table 610: the current entry 1182 and the next (right-hand) entry 1183. The content of current entry 1182 is sent via bus 1117 to one input of selector 738 and to controller 1154. The next entry 1183 is sent over bus 1121 to type decoder 1152 for decoding, and the result controls selector 1116.
  • One input of selector 1116 comes from bus 1121. The other input consists of BN1X from read pointer 631 together with the BN1Y incremented by incrementer 736 (i.e., the BN1Y value in read pointer 631 plus one).
  • Type decoder 1152 decodes only the unconditional branch instruction type. If the type on bus 1121 is an unconditional branch, it controls selector 1116 to select the content on bus 1121; for any other type, selector 1116 selects BN1X from bus 631 and the incremented BN1Y output by incrementer 736.
  • The output of selector 1116 is sent to one input of selector 738.
  • If controller 1154 decodes the instruction type on bus 1117 (i.e., in current entry 1182) as a non-branch instruction, it controls selector 738 to select the output of incrementer 736, as passed by selector 1116, as the input of register 740.
  • Control signal 1111 from processor core 601 controls the storing of this input into register 740, so that the tracker moves right to the next address (i.e., the next sequential address: BN1X unchanged, BN1Y + '1').
  • Control signal 1111 is a feedback signal provided by processor core 601 to the tracker. It is always '1' when the processor core is operating normally.
  • If the instruction type on bus 1117 is an unconditional branch, controller 1154 controls selector 738 to select the branch target address on bus 1117, so that read pointer 631 jumps to the track point corresponding to that branch target address.
  • If the instruction type on bus 1117 is a conditional branch, controller 1154 makes the tracker pause updating and wait until processor core 601 generates TAKEN signal 635, which indicates whether the branch is taken. During this time, register 740 is controlled not only by signal 1111 but also by signal 1161 from processor core 601, which indicates whether TAKEN signal 635 is valid. Register 740 is updated only when signal 1161 shows that TAKEN signal 635 is valid and control signal 1111 is also active.
  • If the branch is not taken (TAKEN signal 635 is '0'), selector 738 selects the output of selector 1116, as for a non-branch instruction. If the branch is taken (TAKEN signal 635 is '1'), selector 738 selects the branch target address on bus 1117, which is stored in register 740; pointer 631 then points to the track table entry of the branch target, and the branch target instruction is read from level 1 cache 602 for processor core 601 to execute.
  • If the branch target address in the entry is in BN2 format, controller 1154 makes register 740 in the tracker pause updating and wait while the BN2 address is converted to a BN1 address as in the previous example and written back to the original indirect branch entry in the track table. This entry is then read via bus 1117, and subsequent processing is the same as in the previous example.
  • The tracker then proceeds along BN1 addresses and, according to the instruction execution results fed back by processor core 601 (e.g., the result of executing a branch instruction), controls level 1 cache 602 to output instructions to processor core 601 for execution.
  • If the branch is not taken, operation proceeds as for a non-branch instruction; if the branch is taken, operation proceeds as for an unconditional branch instruction.
  • Controller 1154 makes register 740 in the tracker pause updating and wait until processor core 601 sends the branch target address via bus 683; that address is sent to active table 604 and block address mapping module 620 as in the previous example.
  • The subsequent operations are the same as in the previous example.
  • When the next entry 1183 is of the unconditional branch type, type decoder 1152 decodes the instruction type on bus 1121 and makes selector 1116 select the branch target on bus 1121 instead of the BN1 provided by incrementer 736 (that BN1 being BN1X, BN1Y + '1'). Therefore, after processor core 601 executes the instruction corresponding to entry 1182, the instruction corresponding to entry 1183 is not executed (entry 1183 may be an end track point, which has no corresponding instruction in level 1 cache 602); instead, the instruction at the branch target address contained in entry 1183 is executed directly.
  • If entry 1182 is a non-branch instruction, the next instruction executed after it is, as described above, the instruction pointed to by the branch target in entry 1183.
  • If entry 1182 is an unconditional branch instruction, the next instruction executed after it is the instruction pointed to by the branch target in entry 1182; entry 1183 has no effect on the process.
  • If entry 1182 is a conditional branch instruction, the next instruction executed after it depends on TAKEN signal 635 generated by processor core 601. If the branch is taken (TAKEN signal 635 is '1'), selector 738 selects the branch target on bus 1117, and signal 1161, which indicates that TAKEN signal 635 is valid, controls the storing of that target into register 740. Pointer 631 then points to the branch target, and the next instruction executed is the one pointed to by the branch target in entry 1182.
  • If the branch is not taken (TAKEN signal 635 is '0'), selector 738 selects the branch target on bus 1121, and signal 1161 together with control signal 1111 controls the storing of the unconditional branch target from entry 1183 into register 740. Pointer 631 then points to that target, and the next instruction executed is the one pointed to by the unconditional branch target address in entry 1183.
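The read-pointer update rules of FIG. 11A described above can be summarized in a short Python sketch. It is a behavioral model only: the type names, the `TrackPoint` structure, and the function name are hypothetical, and stalls, the validity signal 1161, and BN2-format targets are deliberately left out.

```python
from typing import NamedTuple, Optional, Tuple

Address = Tuple[int, int]  # (BN1X, BN1Y)


class TrackPoint(NamedTuple):
    kind: str                         # "nonbranch", "uncond", or "cond"
    target: Optional[Address] = None  # branch target, if any


def next_read_pointer(pointer, current, following, taken=False):
    """Return the next read pointer given the current and following entries."""
    bn1x, bn1y = pointer
    # Role of selector 1116: if the following entry is an unconditional
    # branch (e.g. an end track point), use its target; else BN1Y + 1.
    if following.kind == "uncond":
        fall_through = following.target
    else:
        fall_through = (bn1x, bn1y + 1)
    # Role of selector 738: choose between the current entry's branch
    # target and the fall-through address.
    if current.kind == "uncond":
        return current.target
    if current.kind == "cond" and taken:
        return current.target
    return fall_through
```

In this model a non-branch instruction directly before an end track point jumps straight to the first track point of the next track, so no clock cycle is spent on the end track point itself.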
  • The unconditional branch target in the end track point can also be a level 2 cache address BN2.
  • If type decoder 1152, decoding bus 1121, finds that the address in the entry read out is in BN2 format, the BN2 output on bus 1121 can likewise be placed on bus 1117, converted to BN1 as in the previous example, and stored back into the entry. For clarity and ease of illustration, this path is not shown in Figure 11A.
  • The type of an unconditional branch instruction can be determined in four ways.
  • The first way is to have only one unconditional branch type; that is, no distinction is made between the unconditional branch instructions in the program and the unconditional jump operations in the end track points, which the present invention adds to control the jump to the starting entry of the next track.
  • In this way, an original unconditional branch instruction in the program is skipped and not executed by processor core 601, but because program flow is controlled by track table 610 and the tracker, the target instruction of the branch and its subsequent instructions are still executed correctly. The clock cycle that the original unconditional branch instruction would occupy is thus saved. However, because processor core 601 does not execute the instruction, its program counter (PC) value will be in error.
  • Nevertheless, the cache system of the present invention can supply the processor core with the instructions it is to execute correctly and without interruption, without needing a PC. If the PC value must be obtained at some moment (for example, when debugging), it can be reconstructed, because each row of the track table records the level 2 cache block address BN2X and the level 2 cache sub-block address corresponding to the level 1 instruction block. BN2X is used to read the corresponding tag from active table 604; splicing this tag with the level 2 cache block address, the sub-block address, and the BNY value in pointer 631 yields the PC value of the instruction being executed.
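The splicing step for recovering a PC value can be sketched as simple bit-field concatenation. The field widths below are purely illustrative assumptions (the text does not give them), and the function and parameter names are hypothetical.

```python
def splice_pc(tag, block_index, sub_block, offset,
              index_bits=8, sub_bits=1, offset_bits=5):
    """Concatenate address fields, most significant first (assumed widths)."""
    pc = tag                                  # tag read from the active table
    pc = (pc << index_bits) | block_index     # level 2 cache block address
    pc = (pc << sub_bits) | sub_block         # level 2 sub-block address
    pc = (pc << offset_bits) | offset         # BNY value from the read pointer
    return pc
```

With these assumed widths, `splice_pc(0x1, 0x2, 1, 3)` evaluates to `0x40A3`.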
  • The second way is to have two unconditional branch types. One is the end-point unconditional branch type, which corresponds to the end track point of each track.
  • When type decoder 1152 decodes this type, it recognizes that the end point does not correspond to an instruction in the program and controls selector 1116 to select the branch target on bus 1121; after the instruction on bus 1117 is executed, execution jumps directly to the branch target address on bus 1121.
  • The other is the type of the unconditional branch instructions in the program. Type decoder 1152 does not treat this type as a branch when decoding it, and controls selector 1116 to select the output of incrementer 736.
  • The next instruction executed is then the next sequential instruction, that is, the original unconditional branch instruction in the program. In this way, the PC in the processor core keeps the correct value.
  • The third way is to improve the embodiment of Figure 11A at scan converter 608.
  • While examining an instruction block, if scan converter 608 finds that the last instruction of a level 1 instruction block is not a branch instruction, it merges the end track point into the track point corresponding to that last instruction. That is, the instruction type of the last instruction is marked as an unconditional branch, and the BN1 or BN2 corresponding to the first instruction of the next instruction block (if it is a BN2, the tracker converts it to BN1 as in the previous example) is stored as the track point content in the track point corresponding to the last instruction.
  • When read pointer 631 points to that track point, the instruction is read from level 1 cache 602 for processor core 601 to execute normally, and in addition controller 1154 decodes the instruction type on bus 1117 as an unconditional branch. It therefore controls selector 738 to select bus 1117, so that in the next clock cycle read pointer 631 is updated to the branch target BN1 of the unconditional branch (i.e., the BN1 corresponding to the first instruction of the next instruction block). Processor core 601 then does not need to waste a clock cycle executing an empty instruction.
  • While examining an instruction block, if scan converter 608 finds that the last instruction of a level 1 instruction block (corresponding to the last track point in a track) is a branch instruction, it does not merge the end track point into the track point of that instruction. Instead, the content of the end track point is placed after (to the right of) the track point corresponding to the last instruction of the track.
  • If the last instruction is an unconditional branch, controller 1154, based on the unconditional branch type on bus 1117, controls selector 738 to select the branch target on bus 1117 and place it in pointer 631, which jumps to the target; the end track point is not executed.
  • If the last instruction is a conditional branch, controller 1154, based on the conditional branch type on bus 1117, pauses the tracker and waits for branch decision signal 635 generated by processor core 601. Meanwhile, type decoder 1152 decodes the instruction type on bus 1121 as an unconditional branch and controls selector 1116 to select bus 1121.
  • If the branch is taken, controller 1154 controls selector 738 to select the conditional branch target on bus 1117 and place it in pointer 631.
  • If the branch is not taken, controller 1154 controls selector 738 to select the output of selector 1116, placing the unconditional branch target on bus 1121 in pointer 631. Level 1 cache 602 outputs instructions according to pointer 631 for processor core 601 to execute.
  • All three ways above apply to both fixed-length and variable-length instructions; that is, the end track point is not required to be at a fixed position in the track. In addition, if the position of the end track point in the track is fixed, whether the last instruction has been reached can be determined from the BN1Y value in read pointer 631.
  • The fourth way is to have only one unconditional branch type in the track table, while the tracker treats it as one of two types depending on where in the track it occurs. In this way, the BN1Y in pointer 631 is sent to type decoder 1152, and the instruction type on bus 1121 does not need to be decoded.
  • When BN1Y points to the last entry in a track, type decoder 1152 controls selector 1116 to select the branch target on bus 1121, and execution jumps directly to the branch target address on bus 1121 after the instruction on bus 1117 is executed.
  • When BN1Y points to any other entry, type decoder 1152 controls selector 1116 to select the output of incrementer 736.
  • After the instruction corresponding to the entry on bus 1117 is executed, the next instruction executed is then the next sequential instruction. In this way, the PC in the processor core always keeps the correct value. This way is suited to fixed-length instructions.
  • In addition, the present invention can control processor core 601 to perform speculative execution along one direction of a branch, improving processor execution efficiency.
  • See FIG. 11B, which is an embodiment of the present invention supporting speculative execution.
  • Compared with the tracker in Figure 11A, the tracker in Figure 11B adds selector 1162 and register 1164.
  • These are used to select and temporarily store the other branch path, the one not chosen by the branch guess, for use if the guess turns out to be wrong. The guessed direction can come from existing static or dynamic branch prediction techniques, or can be determined by a branch prediction field stored in the track table entry of the corresponding branch instruction.
  • If the guess is that the branch is not taken, then when controller 1154 decodes the instruction type on bus 1117 as a conditional branch, it controls selector 1162 and register 1164 so that the branch target address on bus 1117 is selected and stored in register 1164.
  • At the same time, controller 1154 controls selector 738 to select the output of selector 1116 (the next sequential instruction after the branch instruction) for storage in register 740, so that pointer 631 controls level 1 cache 602 to provide the next sequential instruction after the branch instruction to processor core 601 for execution; the instruction is marked to the processor core as speculatively executed.
  • Pointer 631 also points to the first sequential entry after the branch instruction in track table 610, so that this entry is placed on bus 1117.
  • Controller 1154 determines the subsequent direction of the tracker from the instruction type on bus 1117 and continues to provide instructions to the processor core. All of these instructions are marked as speculatively executed.
  • When the processor core produces the branch decision, controller 1154 compares the predicted branch direction with the branch direction on signal 635. If they agree, execution continues in the guessed direction. If they differ, controller 1154 sends a 'guess error' signal to processor core 601, causing the processor core to clear all instructions carrying the speculative execution mark, along with their intermediate results.
  • At the same time, controller 1154 controls selector 738 to select the output of register 1164, so that the branch path not chosen by the guess is used to control level 1 cache 602 to provide instructions to processor core 601, and execution continues from there.
  • If the guess is that the branch is taken, then when controller 1154 decodes the instruction type on bus 1117 as a conditional branch, it controls selector 1162 and register 1164 so that the output of selector 1116 (the next sequential instruction after the branch instruction) is selected and stored in register 1164.
  • At the same time, controller 1154 controls selector 738 to select the branch target address on bus 1117 for storage in register 740, so that pointer 631 controls level 1 cache 602 to provide the branch target instruction of the branch instruction to processor core 601 for execution; the instruction is marked to the processor core as speculatively executed. Pointer 631 also points to the entry in track table 610 addressed by the branch target address on bus 1117, so that this entry is placed on bus 1117.
  • Controller 1154 determines the subsequent direction of the tracker from the instruction type on bus 1117 and continues to provide instructions to the processor core. All of these instructions are marked as speculatively executed.
  • When the processor core produces the branch decision, controller 1154 compares the predicted branch direction with the branch direction on branch decision signal 635. If they agree, execution continues in the guessed direction. If they differ, controller 1154 sends a 'guess error' signal to processor core 601, causing the processor core to clear all instructions carrying the speculative execution mark, along with their intermediate results.
  • At the same time, controller 1154 controls selector 738 to select the output of register 1164, so that the branch path not chosen by the guess is used to control level 1 cache 602 to provide instructions to processor core 601, and execution continues from there.
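The save-and-recover behavior of FIG. 11B can be condensed into the following Python sketch. The function names are hypothetical, addresses are treated as opaque values, and the flushing of speculatively marked instructions in the processor core is reduced to a boolean flag; this is an illustration of the control decisions only.

```python
def start_speculation(predict_taken, branch_target, sequential):
    """Return (address to follow now, alternate address to keep in reserve).

    The first value plays the role of what is stored in register 740,
    the second what is stored in register 1164.
    """
    if predict_taken:
        return branch_target, sequential
    return sequential, branch_target


def resolve(predict_taken, actually_taken, pointer, saved):
    """Return (next pointer, flush) once the branch decision is known."""
    if predict_taken == actually_taken:
        # Correct guess: keep going along the speculated path.
        return pointer, False
    # Wrong guess: discard speculative work and resume on the saved path.
    return saved, True
```

For example, guessing "not taken" follows the sequential address and reserves the branch target; if the branch is then found to be taken, `resolve` returns the reserved target together with a flush indication.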
  • Existing instruction set conversion techniques typically use a fixed instruction conversion module (sometimes called a decoder) to convert an external computer instruction set into an internal instruction set (sometimes called micro-ops) for execution by a processor core that executes the internal instruction set.
  • The conversion module is located between the cache that stores the external instructions and the processor core. The external instruction address provided by the processor core addresses the cache to read an external instruction, which the conversion module converts into internal instructions supplied to the processor core for execution.
  • Repeated conversion of external instructions not only greatly increases power consumption but also requires a deeper instruction buffer on the critical path of instruction execution, which greatly deepens the pipeline of the processor core and thereby increases both the hardware overhead and the performance loss when branch prediction fails.
  • If internal instructions are stored in the cache, they are directly executable by the processor core. However, internal instructions (usually fixed-length) and external instructions (which can be variable-length) are generally not in one-to-one correspondence, so the external instruction address of a branch target produced by a branch instruction (generally the sum of the branch offset generated by the external-instruction compiler and the external branch instruction address) cannot by itself reliably locate the corresponding internal instruction.
  • The lack of a method and system for converting the external instruction address into an internal instruction address, and thereby addressing the correct internal instruction in the cache, is the reason existing processors accept the power, performance, and cost penalties of repeatedly converting the same instruction, placing the instruction conversion module between the cache and the processor core and storing external instructions in the level 1 instruction cache.
  • The instruction set conversion system and method of the present invention store the converted internal instructions in the cache, and the address mapping module converts the external instruction address generated by the processor core into an internal instruction address. The processor core can therefore directly address internal instructions already stored in the cache, instead of repeatedly addressing a memory that stores external instructions and having the instruction converter convert them into internal instructions for execution.
  • Because the same external instruction in the L1 cache is no longer converted repeatedly, the power consumption, the long delay on the critical path, and the additional hardware cost described above are avoided.
  • the configurable instruction converter of the present invention can convert any kind of unspecified external instruction set into an internal instruction set according to the configuration.
  • the instruction set conversion system of the present invention mainly consists of two parts: a converter and an address mapping module.
  • the converter of the present invention may be fixed or configurable.
  • The configurable converter can be used together with a processor core to convert external instructions into internal instructions for execution by the processor core.
  • the branch target address of a branch instruction among the external instructions is the same as the branch target address of the internal instruction corresponding to that branch instruction, so no mapping from external address to internal address is required.
  • Figure 12 is an embodiment of a processor system including a configurable converter of the present invention.
  • The external instructions arriving on bus 1205 are converted by the configurable converter 1202 and stored in the instruction memory 1203.
  • The instruction memory 1203 therefore stores internal instructions. The function and structure of the configurable converter 1202 are similar to those of the converter 200 in the earlier embodiment. Since external and internal instructions correspond one to one, the external instruction address is the same as the internal instruction address.
  • When the processor core 1201 executes a branch instruction and the branch is not taken, the branch instruction address is incremented to obtain the address of the next instruction, which is sent to the instruction memory 1203 to read the internal instruction for execution by the processor core 1201. If the branch is taken, the external branch target address, generated by adding the branch offset of the external instruction to the branch instruction address, is the same as the internal branch target address. The external branch target address can therefore address the instruction memory 1203 directly to read the branch target internal instruction, with no need to convert an external instruction address into an internal instruction address. When a non-branch instruction is executed, the address of the next instruction is generated in the same way as for a branch that is not taken.
  • a processor system employing the configurable converter of the present invention can be configured as needed to execute different sets of external instructions.
  • The memory 201 stores the conversion rules between the external instruction set and the internal instruction set, as described above. The extractor 1302 (that is, the opcode extractors 211, 213, 215 in Fig. 3) extracts the opcode from the external instruction arriving on bus 1205 and sends it via bus 1307 as an address to the memory 201, which reads out the conversion rule corresponding to that external instruction. The mask and shift control signals in the rule control the shift module 1303 via bus 1308 (i.e., 221, 223 in FIG.
  • In this way the configurable converter of the present invention completes the conversion of an external instruction into an internal instruction. Changing the conversion rules in memory 1301 allows the instruction converter, together with the processor core that executes the internal instructions, to execute a different external instruction set.
  • A register can be added to the configurable converter to record whether the external instructions are fixed length or variable length.
  • When this register is configured as fixed length (for example, set to '0'), the boundaries of the external instructions in an external instruction block are aligned, so conversion can start from the start address of the external instruction block.
  • When this register is configured as variable length (for example, set to '1'), the boundaries of the external instructions in an external instruction block are not necessarily aligned. In this case only the not-yet-converted instructions from the target instruction through the last instruction in the external instruction block can be converted.
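  • A minimal sketch of how such a length-mode register could decide where conversion starts. The names `FIXED_LENGTH`, `VARIABLE_LENGTH` and `conversion_start` are hypothetical; the '0'/'1' encoding is the example encoding given above.

```python
FIXED_LENGTH = 0      # example encoding from the text: '0' means fixed length
VARIABLE_LENGTH = 1   # example encoding from the text: '1' means variable length


def conversion_start(length_mode: int, block_start: int, target_addr: int) -> int:
    """Return the external address at which conversion of a block may begin."""
    if length_mode == FIXED_LENGTH:
        # Instruction boundaries are aligned, so the whole block is convertible.
        return block_start
    if length_mode == VARIABLE_LENGTH:
        # Boundaries before the target are unknown, so start at the target itself.
        return target_addr
    raise ValueError("unknown length mode")
```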
  • the registers 212, 214, 216 for extracting the external instruction opcode are controlled.
  • A register is added to store the base address, in memory 201, of the conversion rules for the instruction set used by the thread.
  • The above registers are replicated into multiple groups, each group corresponding to one external instruction set, and a selector chooses among the groups.
  • The thread number memory (generally in the TLB) of the processor's memory management unit (MMU) is given an additional storage field for each thread, which holds the selection signal used to choose among these register groups.
  • Figure 13B is an embodiment of a memory in a configurable converter of the present invention.
  • The register group 1311 stores the opcode extraction position of instruction set P and the base address 'm' of the corresponding conversion rules in the memory 201.
  • The register group 1313 stores the opcode extraction position of instruction set Q and the base address 'n' of the corresponding conversion rules in the memory 201.
  • The selection signal of thread J in the MMU 316 controls the selector 315 to select the output of register group 1311.
  • Under the control of register group 1311, the opcode extractor 1302 (i.e., the opcode extractors 211, 213, 215 in Fig. 3) extracts the opcode of the external instruction being converted. The adder 1318 adds the opcode to the base address 'm', also from register group 1311, and the sum addresses the conversion rule memory 201.
  • The rule read out controls the operation of the instruction converter, so that the instruction of instruction set P is converted into an internal instruction and stored in the instruction memory 1203.
  • For thread K, the corresponding selection signal controls the selector 315 to select the output of register group 1313.
  • Under the control of register group 1313, the opcode extractor 1302 extracts the opcode of the external instruction being converted. The adder 1318 adds the opcode to the base address 'n', also from register group 1313, and the sum addresses the conversion rule memory 201.
  • The rule read out controls the operation of the instruction converter, so that the instruction of instruction set Q is converted into an internal instruction and stored in the instruction memory 1203. When the processor core switches from thread J to thread K, it therefore in effect switches from executing instructions of instruction set P to executing instructions of instruction set Q. This enables a program containing instructions of several external instruction sets to run in a virtual machine of the present invention. Of course, the same function can be achieved with multiple instruction converters, each responsible for converting one external instruction set.
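  • The per-thread selection described above can be sketched as follows. This is only an illustration under stated assumptions: the field layouts, the rule names, and the identifiers `RegisterGroup`, `RULE_MEMORY`, `THREAD_SELECT` and `lookup_rule` are hypothetical and do not come from the patent; only the structure (selector, opcode extractor, adder, shared rule memory) follows the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterGroup:
    opcode_shift: int   # opcode extraction position (hypothetical layout)
    opcode_mask: int    # mask applied after shifting
    base: int           # base address of this instruction set's rules


# One shared rule memory: rules for set P start at m = 0, for set Q at n = 4.
RULE_MEMORY = ["P:add", "P:sub", "P:load", "P:store", "Q:mov", "Q:jmp"]
GROUPS = {
    "P": RegisterGroup(opcode_shift=6, opcode_mask=0x3, base=0),
    "Q": RegisterGroup(opcode_shift=0, opcode_mask=0x1, base=4),
}
THREAD_SELECT = {"J": "P", "K": "Q"}   # per-thread selection field


def lookup_rule(thread: str, external_instruction: int) -> str:
    group = GROUPS[THREAD_SELECT[thread]]   # selector picks a register group
    # Opcode extractor, controlled by the selected register group.
    opcode = (external_instruction >> group.opcode_shift) & group.opcode_mask
    # Adder forms base + opcode, which addresses the rule memory.
    return RULE_MEMORY[group.base + opcode]
```

  • The same external bit pattern can thus select different rules depending on which thread is running, which is the effect the text attributes to switching register groups.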
  • In some computer instruction sets the fields of an instruction are orthogonal, that is, the fields are independent of one another.
  • Besides the opcode field, some instruction sets use codes in other fields of the instruction to designate specific memories or registers. These fields also need to be mapped by conversion rules, rather than simply shifted out of the external instruction, in order to meet the requirements of the internal instructions.
  • a plurality of conversion rule memories and corresponding logics may be used to correspond to the plurality of orthogonal instruction fields, so that the total number of entries (row numbers) of the conversion rule memory is controlled to a reasonable number.
  • Figure 13C is another embodiment of the memory in the configurable converter of the present invention.
  • Figure 13C adds a conversion rule memory 1321 together with its dedicated extractor 1322 (same function as 1302) and shift logic 1323 (same function as 1303). Register groups like the register groups 1311 and 1313 in the example of Figure 13B (not shown in Figure 13C) are also added to control the new memory 1321 and its corresponding logic.
  • The outputs of the new memory 1321 and of the mask shift logic 1323 are sent to the merger 1304, where they are merged with the outputs of the original memory 201 and mask shift logic 1303.
  • The two sets of memory and their corresponding logic can work together on the same computer instruction set, each responsible for converting some of the fields of the external instruction, with the results merged into one internal instruction in the merger 1304.
  • the two sets of memory and their corresponding logic can also operate independently, each independently responsible for converting an external instruction into an internal instruction, as shown in Figure 13B.
  • a writable register can be added, and the state of this register determines whether the instruction converter of Figure 13C operates in a cooperative or independent manner.
  • The merge module 1304 in FIG. 13A also generates, following the order in which internal instructions are produced, the mapping relationship between internal and external instructions, such as the examples shown in Figure 8A or Figure 8B, for filling the intra-block offset mapper YMAP.
  • The merge module 1304 also generates the write address that controls the filling of internal instructions into the instruction memory 1203, and so on. If internal instructions are fixed length, each time an instruction is written to the instruction memory 1203 the level 1 cache write address is advanced by the fixed length, for example 4 bytes.
  • If internal instructions are variable length, the length of each instruction is recorded in its conversion rule in memory 1301. Each time an instruction is written to the instruction memory 1203, the level 1 cache write address is advanced by the instruction length output from memory 1301 to give the start address of the next instruction. It is also possible to collect the internal instructions of an internal instruction block in a buffer and write the entire internal instruction block into the instruction memory 1203 at once.
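  • The write-address sequence for both cases can be illustrated with a short sketch. The function name `write_addresses` and the sample lengths are hypothetical; in the fixed-length case every length equals the fixed size (4 bytes in the example above), in the variable-length case each length is the value recorded in the conversion rule.

```python
def write_addresses(lengths, start=0):
    """Yield the write address of each converted internal instruction in order."""
    addr = start
    for length in lengths:
        yield addr
        addr += length   # becomes the start address of the next instruction


fixed = list(write_addresses([4, 4, 4]))      # fixed-length internal instructions
variable = list(write_addresses([2, 6, 4]))   # lengths taken from the rules
```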
  • The above mapping relationship and write address can also be generated by other modules, such as the scan converter responsible for scanning in Figure 7A and Figure 7B.
  • the processor system employing the configurable converter of the present invention can operate with the external instruction set in one-to-one correspondence with the instructions of the internal instruction set.
  • When the instructions of the two instruction sets do not correspond one to one, an external instruction may be converted into multiple internal instructions, or multiple external instructions may be merged into one internal instruction; or at least one of the external and internal instruction sets may be variable length. In such cases the branch target address of an external instruction may not correspond to the branch target address of the corresponding internal instruction.
  • The address mapping module and the instruction converter of the present invention can then be used to implement both instruction set conversion and the mapping of instruction addresses. Please refer to Figure 14.
  • The external instructions are converted by the converter 1202 and stored in the instruction memory 1203 for the processor core 1201 to execute directly. That is, the instruction memory 1203 stores internal instructions and outputs the corresponding internal instruction according to the internal instruction address.
  • converter 1202 also generates a correspondence between external instructions and corresponding internal instructions to store in address mapping module 1404.
  • When the processor core 1201 executes the internal instructions in instruction memory 1203 sequentially, its program counter PC is incremented by '1' each time, so the corresponding internal instruction address is incremented by '1' and addresses the instruction memory 1203 to read the next internal instruction.
  • When the processor core 1201 executes a branch instruction and generates a branch target address, the branch target address is expressed as an external instruction address. It is sent to the address mapping module 1404, converted to the corresponding internal instruction address as described above, and then sent to the instruction memory 1203 to read the corresponding internal instruction (i.e., the branch target instruction).
  • If the address mapping module 1404 already stores the mapping relationship for the external instruction address, the internal instruction corresponding to the external instruction is already stored in the instruction memory 1203, and the external instruction address can be converted directly into an internal instruction address and output.
  • If the address mapping module 1404 does not store the mapping relationship for the external instruction address, the external instruction has not yet been converted into an internal instruction. In that case the converter 1202 converts at least one external instruction, including that external instruction, stores the result in the instruction memory 1203, and stores the corresponding mapping relationship in the address mapping module 1404. The external instruction address can then be converted into an internal instruction address and output.
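  • The hit and miss handling just described can be sketched as below. This is a software illustration only, under stated assumptions: `AddressMapper` and `toy_converter` are hypothetical names, a dictionary stands in for address mapping module 1404, a list stands in for instruction memory 1203, and the toy converter output is invented for the example.

```python
class AddressMapper:
    def __init__(self, convert_block):
        self.convert_block = convert_block   # stands in for converter 1202
        self.mapping = {}                    # external address -> internal address
        self.instruction_memory = []         # stands in for instruction memory 1203
        self.conversions = 0

    def internal_address(self, external_addr):
        if external_addr not in self.mapping:      # no mapping: not converted yet
            self.conversions += 1
            for ext, internal_ops in self.convert_block(external_addr):
                # Record where the first internal instruction of `ext` is stored.
                self.mapping[ext] = len(self.instruction_memory)
                self.instruction_memory.extend(internal_ops)
        return self.mapping[external_addr]


def toy_converter(external_addr):
    # Hypothetical block: a 3-byte external instruction that expands to two
    # internal instructions, followed by one that maps to a single instruction.
    return [(external_addr, ["op_a1", "op_a2"]), (external_addr + 3, ["op_b"])]


mapper = AddressMapper(toy_converter)
first = mapper.internal_address(0x40)    # miss: triggers conversion
second = mapper.internal_address(0x43)   # hit: converted together with 0x40
again = mapper.internal_address(0x40)    # hit: no repeated conversion
```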
  • The converter 1202 may be a converter that converts one specific external instruction set into the internal instruction set, or may be FIG. 2 and FIG.
  • The address mapping module 1404 can consist of a mapping table.
  • The mapping table is addressed by the external instruction address, and each entry stores the address of the corresponding internal instruction. On this basis, the mapping table can be implemented in several ways.
  • Each entry in the mapping table is addressed by the minimum unit (for example, a byte) of the external instruction address. For the internal instruction corresponding to that external instruction, the entry stores the block address of its internal instruction block (i.e., the block number of the internal instruction block in the instruction memory 1203) and the intra-block offset address of the internal instruction within that block.
  • The entry of the mapping table is addressed by the external instruction address, and the internal instruction block address and intra-block offset address in the entry are read out, which completes the address conversion.
  • The mapping table of method 1 may be compressed to eliminate empty entries.
  • Taking byte-addressed external instructions as an example: since the external instruction length is not fixed, only the start-address byte of each external instruction occupies an entry, which stores the intra-block offset of the external instruction and the corresponding offset address within the internal instruction block. The remaining bytes, which are not start addresses, do not occupy entries.
  • Each row of the mapping table corresponds to one external instruction block and is addressed by the external instruction block address.
  • The row is addressed by the block address of the external instruction and its entire content is read out.
  • The intra-block offset address of the external instruction is matched against the external intra-block offsets stored in all entries of the row, and the internal instruction address stored in the matching entry is selected and output, which completes the address conversion.
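  • A minimal sketch of this compressed, row-matched table. The table contents and the names `MAPPING_TABLE` and `translate` are hypothetical; the loop models sequentially what hardware would do as a parallel compare across the entries of a row.

```python
# One row per external instruction block. Each entry pairs the external
# intra-block offset of an instruction start with the internal address
# (internal block number, internal intra-block offset).
MAPPING_TABLE = {
    0x12: [(0, (7, 0)), (3, (7, 1)), (4, (7, 3)), (7, (7, 4))],
}


def translate(block_addr, ext_offset):
    row = MAPPING_TABLE[block_addr]             # address the row by block address
    for stored_offset, internal_addr in row:    # match against every entry
        if stored_offset == ext_offset:
            return internal_addr
    return None                                 # offset is not an instruction start
```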
  • Each row in the mapping table consists of two parts.
  • The first part contains as many data bits as the external instruction block contains minimum units (for example, the number of bits equals the number of bytes in the external instruction block).
  • The second part contains as many data bits as the maximum number of internal instructions an internal instruction block may contain.
  • In the first part, the bit corresponding to the start address (i.e., the start byte) of each external instruction is set to '1', and the rest are '0'.
  • In the second part, the bit corresponding to the first internal instruction of each external instruction is set to '1', and the rest are '0'.
  • The specific format can be seen in Figure 8B.
  • The row of the mapping table is addressed by the block address of the external instruction, and the entire row (both parts) is read out. Then, using the intra-block offset address of the external instruction, the '1' bits in the first part are counted from the start of the row up to and including the bit for that offset byte, adding one for each '1'. The count is then decremented by one for each '1' encountered in the second part until it reaches '0'. The position reached in the second part is the intra-block offset address of the internal instruction, which completes the address conversion.
  • The above operation can be performed efficiently by the device of Figure 8C.
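  • The counting procedure can be written out as a short sketch. The bit patterns and the name `map_offset` are hypothetical; the device of Figure 8C is not modeled, only the count-up and count-down steps described in the text.

```python
def map_offset(ext_starts, int_firsts, ext_offset):
    """Map an external intra-block offset to an internal intra-block offset.

    ext_starts[i] is 1 when byte i starts an external instruction.
    int_firsts[j] is 1 when internal slot j holds the first internal
    instruction of some external instruction.
    """
    if ext_starts[ext_offset] != 1:
        raise ValueError("offset is not the start of an external instruction")
    # Count the '1' bits of the first part up to and including the target byte.
    count = sum(ext_starts[: ext_offset + 1])
    # Count down on each '1' of the second part until the count reaches zero.
    for position, bit in enumerate(int_firsts):
        if bit:
            count -= 1
            if count == 0:
                return position
    raise ValueError("row is inconsistent")


# External instructions start at bytes 0, 3, 4 and 7. The one at byte 3
# expands to two internal instructions, so slot 2 is not a first instruction.
EXT_STARTS = [1, 0, 0, 1, 1, 0, 0, 1]
INT_FIRSTS = [1, 1, 0, 1, 1, 0]
```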
  • The external instruction block can have a fixed correspondence with the internal instruction block (for example, an L2 cache block storing external instructions can be divided equally into two L2 cache sub-blocks, each corresponding to one level 1 cache block that stores internal instructions).
  • The mapping between external and internal instructions can then be decomposed into two parts, a mapping of the block address (easy to implement because of the fixed correspondence) and a mapping of the intra-block offset address, which simplifies the mapping.
  • In such a level 1 cache block, not every entry necessarily holds a valid internal instruction. The following description assumes that internal instructions are placed in a level 1 instruction block in ascending order starting from the smallest intra-block offset (generally '0').
  • Each instruction block also needs to store the offset address of its last instruction (the one with the largest offset address), so that the system knows when to provide, in the next cycle, the level 1 cache block address of the next instruction in program order.
  • An intra-block offset mapper is also required to provide the intra-block offset mapping of branch targets according to the mapping relationship between a level 2 instruction cache sub-block and its corresponding level 1 instruction cache block (for example, using one of the three methods above).
  • FIG. 15 is another embodiment of a processor system including a configurable converter and an address mapping module of the present invention.
  • The converter 1202, the instruction memory 1203, and the processor core 1201 are the same as in Figures 12 and 14, and a specific implementation of the address mapping module is also given.
  • On a miss in the instruction memory 1203, the corresponding external instruction address can be sent to the external memory to obtain the corresponding external instruction block, which the instruction converter 1202 converts and fills into the instruction memory 1203 as described above.
  • the description of the following embodiments assumes that the instruction memory 1203 always hits.
  • The address mapping module is composed of a tag memory 1505 (corresponding to the active table 604 in the foregoing embodiment), an intra-block offset mapper 1504 (for simplicity, 1504 includes the functions of the offset address mapping module 618 and the offset address mapper 622 in Figure 6), and an end flag memory 1506.
  • The rows of these three correspond to the internal instruction blocks in the instruction memory 1203. Each row of the end flag memory 1506 stores the position of the last internal instruction in the corresponding internal instruction block of the instruction memory 1203.
  • While the processor core 1201 reads an internal instruction, the end flag memory 1506 can be read at the same time to check whether that internal instruction is the last one in the current internal instruction block. If it is not, the intra-block offset address of the next internal instruction is the offset address of the current internal instruction plus one; otherwise, the next internal instruction is the first internal instruction of the next internal instruction block.
  • Each row of the tag memory 1505 stores an external instruction block address (i.e., a tag). Tag matching therefore locates the position in the instruction memory 1203 of the internal instruction block corresponding to the external instruction block that contains a given external instruction, and, in the same row as that internal instruction block, the corresponding mapping relationship in the intra-block offset mapper 1504 and the position of the last internal instruction of the block in the end flag memory 1506.
  • The tag memory 1505 and instruction memory 1203 can have different organizations. Taking a direct-mapped structure as an example, the block address of the external instruction is further divided into a tag and an index. The index addresses a row of the tag memory 1505, and the content of that row is read and compared with the tag in the block address. If they are equal, the match succeeds; otherwise it fails.
  • If the match fails, the external instruction address can be used to obtain the corresponding external instruction block from the lower-level instruction memory.
  • The converter 1202 converts it into an internal instruction block, which is written into the instruction memory 1203 according to the cache replacement rule. The tag of the external instruction address is written into the same row of the tag memory 1505, the intra-block offset mapping relationship generated by the instruction converter 1202 is stored in the same row of the intra-block offset mapper 1504, and the intra-block offset of the last instruction of the instruction block, also generated by 1202, is stored in the same row of the end flag memory 1506.
  • The tag memory 1505 and instruction memory 1203 can also use any other suitable organization (for example, a set-associative or fully associative structure). The matching method is then the same as for a cache of the corresponding organization, and details are not described here again.
  • the direct mapping structure is taken as an example in the following embodiments, and it is assumed that the tag matching is successful.
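  • A minimal sketch of the direct-mapped tag match and fill described above. The class name `TagMemory`, the constant `INDEX_BITS`, and the four-row size are hypothetical; the offset mapper and end flag memory would be filled in the same row returned by `fill`.

```python
INDEX_BITS = 2   # hypothetical size: 4 rows


class TagMemory:
    def __init__(self):
        self.rows = [None] * (1 << INDEX_BITS)   # None marks an empty row

    @staticmethod
    def split(block_addr):
        """Split an external block address into (tag, index)."""
        return block_addr >> INDEX_BITS, block_addr & ((1 << INDEX_BITS) - 1)

    def match(self, block_addr):
        tag, index = self.split(block_addr)
        return self.rows[index] == tag           # compare stored tag with address tag

    def fill(self, block_addr):
        tag, index = self.split(block_addr)
        self.rows[index] = tag                   # direct mapped: replace the row
        return index                             # same row is used by 1504 and 1506


tags = TagMemory()
miss_before = tags.match(0b1101)
row = tags.fill(0b1101)
hit_after = tags.match(0b1101)
```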
  • The processor core 1201 provides different instruction addresses on bus 1508 depending on whether it needs to branch or jump.
  • When no branch is taken, the instruction address output on bus 1508 controls the instruction memory 1203 to read an instruction for execution by the processor core 1201. The block address on 1508 is also sent to the end flag memory 1506 to read the end address of that row, which is compared with the internal intra-block offset address on 1508 to check whether the internal instruction is the last one in its internal instruction block.
  • If the instruction is not the last instruction in the internal instruction block, the signal 1507 output by the end flag memory controls the processor core 1201 to keep the instruction block address unchanged for the next clock cycle and to place the intra-block offset, incremented by '1', on bus 1508.
  • If the instruction is the last one, the signal 1507 output by the end flag memory controls the processor core 1201 to output in the next cycle the external instruction block address of the next instruction block (the current instruction block address incremented by '1') with '0' as the intra-block offset address of the internal instruction, combined into the instruction address placed on bus 1508.
  • The signal 1507 also causes the instruction block address on 1508 to be sent to the tag memory 1505 for matching. If it matches, the address on bus 1508 is the correct address of the next instruction.
  • The branch decision signal 1509 controls the selector 1510 to select the intra-block offset address on bus 1508 to address the instruction memory 1203, reading the internal instruction of the next cycle for execution by the processor core 1201.
  • The block address used for the instruction memory 1203 always comes from bus 1508.
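  • The sequential next-address rule can be illustrated as follows. The function name `next_sequential` and the table `END_FLAGS` are hypothetical; `END_FLAGS` stands in for the end flag memory 1506, holding the offset of the last internal instruction of each block.

```python
END_FLAGS = {5: 2, 6: 0}   # block address -> offset of its last internal instruction


def next_sequential(block_addr, offset):
    """Next (block address, internal intra-block offset) when no branch is taken."""
    if offset != END_FLAGS[block_addr]:
        return block_addr, offset + 1    # same block, offset incremented by 1
    return block_addr + 1, 0             # last instruction: next block, offset 0
```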
  • the mapping device is operated in reverse: the internal instruction address is sent to the decoder 805, the internal instruction mapping relationship is sent to 807 as input, and the external instruction mapping relationship controls the matrix 803. The output of the device is then the external instruction address.
  • It is also possible, when the instruction conversion is performed, to add the intra-block offset of the external branch instruction within its external block to the branch offset of the branch instruction and to record the sum as the branch offset in the internal branch instruction. When the processor core 1201 executes the branch instruction, it then only needs to add the corrected branch offset recorded in the branch instruction to the instruction block address (with intra-block offset '0'). The sum is the correct external branch target address, which eliminates the step of mapping the branch instruction's offset within the internal instruction block back to its offset within the external instruction block.
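  • A worked example of the corrected branch offset, assuming a hypothetical external block size of 16 bytes and hypothetical function names. Since the external branch address equals the block base plus the intra-block offset, folding the intra-block offset into the branch offset at conversion time gives the same target from the block base alone.

```python
BLOCK_SIZE = 16   # hypothetical external instruction block size in bytes


def corrected_offset(ext_branch_addr, branch_offset):
    """Computed once at conversion time and stored in the internal branch."""
    return (ext_branch_addr % BLOCK_SIZE) + branch_offset


def target_from_block(block_base, corrected):
    """Computed at execution time: block address (offset 0) plus corrected offset."""
    return block_base + corrected


ext_branch_addr = 0x4A          # block base 0x40, intra-block offset 10
branch_offset = -22
direct_target = ext_branch_addr + branch_offset
via_block = target_from_block(0x40, corrected_offset(ext_branch_addr, branch_offset))
```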
  • The block address in the external branch target address is sent via bus 1508 to the tag memory 1505 for matching, and is also sent to the intra-block offset mapper 1504.
  • The mapping relationship of the addressed row is read out and used to map the external intra-block offset on 1508 to the internal intra-block offset 1512.
  • The branch decision signal 1509 controls the selector 1510 to select 1512 as the intra-block offset sent to the instruction memory 1203.
  • The block address on bus 1508 is also sent to the instruction memory 1203.
  • If the match in the tag memory succeeds, the branch target instruction is read at this address for the processor core to execute.
  • The block address of the next instruction on bus 1508 (comprising the tag and index parts of the instruction address) is always in external instruction address form.
  • The index part is used for row addressing of all the memories, such as 1505, 1504, 1506 and 1203.
  • The intra-block offset address of the next instruction on the bus may be in external or internal instruction address form, depending on the type of the instruction and other conditions. If the current instruction is a non-branch instruction, or a branch instruction whose branch is not taken, and it is not the last instruction in its internal instruction block, the intra-block offset address of the next instruction is in internal instruction format (the current instruction address incremented by '1', pointing to the internal instruction that follows the current one).
  • When the intra-block offset address of the next instruction is '0', it can be regarded as being in either external or internal instruction format. If the current instruction is a branch instruction and the branch is taken, the intra-block offset address of the next instruction is in external instruction format and is mapped by the intra-block offset mapper 1504 into an internal intra-block offset that can be used to read instructions from the instruction memory 1203. If the index part of the external address is regarded as the block address of the internal instruction address, the instruction memory 1203 is addressed by an internal instruction address at all times.
  • For an organization with multiple ways, the block address of the internal instruction similarly consists of the way number together with the index part of the external instruction address. That is, the address mapping module in the virtual machine disclosed in this embodiment can directly map the external instruction address generated by the external instruction compiler to an internal instruction address, which accesses the instruction memory storing internal instructions for execution by the processor core.
  • The block address of the internal instruction address may be regarded as equivalent to the block address (tag portion and index portion) of the external instruction address.
  • The virtual machine avoids the inefficiency of existing software virtual machines, which map external instruction addresses to internal instruction addresses in software, and the overhead of storing a huge address mapping table.
  • It also avoids the high power consumption of existing hardware virtual machines, in which the external instruction address addresses an instruction memory storing external instructions, the external instruction is read and converted by the instruction converter into an internal instruction, and only then executed by the processor core, so that the same instruction is converted many times.
  • One of the technical features of this virtual machine is that external instructions are first stored in the instruction cache after being converted by the instruction converter. Therefore, the instruction cache stores internal instructions, which can be directly executed without instruction conversion.
  • A branch target table can also be added to record the internal instruction address of the branch target instruction, so that the external instruction address of the branch target need not be converted into an internal instruction address every time the same branch instruction is executed again.
  • The tag memory 1505, intra-block offset mapper 1504 and end flag memory 1506 are the same as in Figure 15.
  • The difference is the addition of a branch target memory (BTB) 1607, and that the selector 1608 is connected differently from the selector 1510 of Figure 15.
  • The branch target memory 1607 stores branch target history recorded in internal instruction address form, that is, the internal instruction address of the branch instruction itself, the internal instruction address of its branch target, and prediction information on whether the branch was previously taken. The rows of the branch target memory 1607 do not necessarily correspond to the rows of the other memories. The branch target memory 1607 outputs the branch prediction signal 1511 to control the selector 1608, which selects the instruction address from either bus 1508 or the branch target memory 1607.
  • Processor core 1201 outputs the internal instruction address to instruction memory 1203 via bus 1508.
  • The internal instruction address is also sent to branch target memory 1607, where it is matched against the internal instruction addresses of all branch instructions stored there; the branch target internal instruction address and the prediction information contained in the matching entry are output.
  • When selector 1608 is controlled to select the instruction address on bus 1508 to access instruction memory 1203, the operation is the same as that of the Figure 15 embodiment executing the same instruction, and the details are not described again here.
  • Alternatively, branch prediction selection signal 1511 controls selector 1608 to select the internal instruction branch target address output by branch target memory 1607 to access instruction memory 1203.
  • If the current instruction is a branch instruction but does not hit in branch target memory 1607, an entry is allocated in branch target memory 1607 according to the replacement rule to store the internal instruction address of the branch instruction. If the branch is judged to be 'execution branch', then, as in Figure 15, processor core 1201 generates an external instruction address and sends it out via bus 1508.
  • the instruction block address confirmed by the tag memory 1505 as shown in Fig.
  • The branch prediction value is stored in the corresponding field of the newly allocated entry in branch target memory 1607.
  • The internal instruction branch target address is also passed through the bypass of branch target memory 1607 to access instruction memory 1203 via selector 1608. If the branch is judged to be 'no branch', the new entry in branch target memory 1607 is invalidated, and branch prediction selection signal 1511 controls selector 1608 to select the instruction address on bus 1508 (in this case, the address of the next sequential internal instruction after the branch instruction) to access instruction memory 1203. The address on bus 1508 is generated in the same way as in the Figure 15 example under the same conditions and is not described again.
  • On a misprediction, processor core 1201 clears the intermediate results of the instructions executed under the wrong prediction, executes the correct branch path, and updates the branch prediction stored in branch target memory 1607.
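  • The behaviour of branch target memory 1607 and selector 1608 described above can be sketched in software. This is a minimal illustration under stated assumptions, not the circuit of the embodiment: the class and method names (`BranchTargetMemory`, `select`, `resolve`) are hypothetical, the table is an unbounded dictionary with no replacement rule, and prediction is reduced to a single "taken last time" bit.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BtbEntry:
    target: int          # internal instruction address of the branch target
    predict_taken: bool  # prediction recorded from earlier executions


class BranchTargetMemory:
    """Toy model of branch target memory 1607, keyed by internal addresses."""

    def __init__(self) -> None:
        # Assumption: unbounded table, so no replacement rule is modelled.
        self.entries: Dict[int, BtbEntry] = {}

    def select(self, branch_addr: int, bus_addr: int) -> int:
        """Model selector 1608: BTB target on a predicted-taken hit,
        otherwise the address supplied on the bus."""
        entry = self.entries.get(branch_addr)
        if entry is not None and entry.predict_taken:
            return entry.target
        return bus_addr

    def resolve(self, branch_addr: int, taken: bool,
                target: Optional[int]) -> None:
        """Record the actual outcome once the core has resolved the branch."""
        entry = self.entries.get(branch_addr)
        if entry is None:
            # Miss: a taken branch keeps its new entry; for a not-taken
            # branch the freshly allocated entry is left invalid (absent).
            if taken and target is not None:
                self.entries[branch_addr] = BtbEntry(target, True)
            return
        # Hit: update the stored prediction, and the target when taken.
        entry.predict_taken = taken
        if taken and target is not None:
            entry.target = target
```

  • In this sketch a first execution of a branch misses and falls back to the bus address; after `resolve` records a taken outcome, the next `select` for the same branch returns the stored internal target without any external-to-internal address conversion.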
  • FIG. 17 is another embodiment of the processor system of the present invention, including the branch target table and the tracker.
  • Tag memory 1505, intra-block offset mapper 1504, end flag memory 1506 and branch target memory 1607 are the same as in the preceding embodiment.
  • This example additionally includes lower block address memory 1709, selector 1711, OR logic 1707 and the tracker. The internal instruction address is generated by the tracker, so that processor core 1721 only needs to output the external instruction address.
  • The rows of lower block address memory 1709, newly added in this embodiment, correspond one-to-one with the rows of tag memory 1505, intra-block offset mapper 1504 and end flag memory 1506.
  • Each row of the lower block address memory contains two parts: the first part 1801 stores the X address of the internal instruction block that sequentially precedes the internal instruction block corresponding to the row; the second part 1802 stores the X address of the internal instruction block that sequentially follows it.
  • By addressing the lower block address memory with the block address of the current internal instruction block (i.e., the X address output by the tracker), the X addresses of the sequentially previous and next internal instruction blocks can be read.
  • the selector 1711 based on the branch transfer of the processor core 1201, generates a TAKEN signal 1713 for the next internal instruction block of the lower block address memory 1709.
  • the first internal instruction block of the next internal instruction block formed by the X address and the Y address ' 0 ', and the branch target internal instruction address output by the branch target memory 1607 are sent to the selector 1705 .
  • OR Logic 1707 Control selector 1705 selects the input from selector 1711 when the current internal instruction is the last instruction of the internal instruction block or a branch transition occurs.
  • The tracker consists of register 1701, incrementer 1703 and selector 1705. Register 1701 stores and outputs the current internal instruction address 1723, composed of a block address (hereinafter the X address) and an offset address within the internal instruction block (hereinafter the Y address). The current internal instruction address 1723 addresses instruction memory 1203 to read an internal instruction from one of its rows and send it to processor core 1721 for decoding. At the same time it accesses the corresponding rows of lower block address memory 1709 and end flag memory 1506, and it is also sent to branch target memory (BTB) 1607 for matching.
  • The X address on 1723 addresses end flag memory 1506 to read the contents of the corresponding row, which are compared with the Y address on 1723 to check whether the instruction is the last one in the internal instruction block. If the instruction is not the last one and processor core 1721 decodes it as not being a branch instruction, OR logic 1707 controls selector 1705 to select the X address output by register 1701 together with the incremented Y address output by incrementer 1703; the result is stored in register 1701 as the current internal instruction address for the next clock cycle.
  • When the instruction is the last one in its block or a branch is taken, selector 1705, under the control of OR logic 1707, selects the output of selector 1711, which is stored in register 1701 as the current internal instruction address for the next cycle. Specifically, if branch judgment signal (TAKEN) 1713 is 'no branch' at that time, selector 1711 is controlled to select the address of the first internal instruction of the next internal instruction block, read from lower block address memory 1709 using the current internal instruction address 1723; after selection by selector 1705 it is stored in register 1701. If branch judgment signal (TAKEN) 1713 is 'execution branch', selector 1711 is controlled to select the branch target internal instruction address obtained by matching the current internal instruction address 1723 in branch target memory 1607; it is selected by selector 1705 and stored in register 1701.
  • The branch prediction value stored in branch target memory 1607 can also be used in place of branch judgment signal 1713 generated by processor core 1721 to control selectors 1711 and 1705. This approach requires a mechanism that verifies whether the branch prediction is correct and corrects execution once a misprediction is found.
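  • The next-address selection performed by the tracker (register 1701, incrementer 1703, selectors 1705 and 1711, OR logic 1707) can be summarised by a small function. This is only a sketch: the function name `tracker_next` and its parameters are hypothetical, and addresses are modelled as (X, Y) tuples rather than bit fields.

```python
from typing import Tuple

Address = Tuple[int, int]  # (X block address, Y offset within the block)


def tracker_next(current: Address, is_last: bool, taken: bool,
                 next_block_x: int, branch_target: Address) -> Address:
    """Return the internal instruction address for the next cycle.

    is_last       -- end flag: current instruction is the last in its block
    taken         -- TAKEN signal 1713 (or the stored prediction)
    next_block_x  -- part 1802 read from the lower block address memory
    branch_target -- (X, Y) read from the branch target memory
    """
    x, y = current
    # OR logic 1707: use selector 1711's output on end-of-block or taken branch.
    use_selector_1711 = is_last or taken
    if not use_selector_1711:
        return (x, y + 1)          # same block, incrementer 1703 advances Y
    if taken:
        return branch_target       # selector 1711 picks the branch target
    return (next_block_x, 0)       # first instruction of the next block
```

  • For example, `tracker_next((3, 5), False, False, 7, (9, 2))` stays in block 3 and advances the offset, while the same call with `is_last=True` moves to offset 0 of block 7.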
  • In this embodiment, the internal instruction address that controls instruction memory 1203 and the other memories is provided by the tracker.
  • Processor core 1721 needs to provide external instruction address 1708 only when the current internal instruction address 1723 misses in branch target memory 1607, or addresses an invalid entry in lower block address memory 1709, and the branch judgment and end-of-block determination select that missing or invalid address as the address for the next cycle.
  • On a miss in branch target memory 1607, processor core 1721 calculates external instruction branch target address 1708 in the same manner as in the example of Fig. 16 and sends it to tag memory 1505 for matching and to intra-block offset mapper 1504 for mapping.
  • The internal instruction branch target address obtained by this matching and mapping is stored in the branch target memory 1607 entry in the same manner as in the example of Fig. 16, and is also stored in register 1701 as the current internal instruction address 1723.
  • When an invalid entry is read from lower block address memory 1709, processor core 1721 calculates the block address of external instruction address 1708 in the same manner as in FIG. 16 and sends it to tag memory 1505 for matching.
  • The internal instruction block address obtained by the match is stored in field 1802 of the invalid entry, and the corresponding block address is also stored in lower block address memory 1709.
  • In this way, internal instruction blocks with sequential addresses are linked through the information stored in lower block address memory 1709; that is, the address of the next internal instruction block can be read by addressing lower block address memory 1709 with the X address of the current internal instruction block.
  • When an internal instruction block is replaced, lower block address memory 1709 can be addressed with the X address of that block to read the X address of the previous internal instruction block stored in part 1801. Lower block address memory 1709 is then accessed with the X address from 1801 to find the corresponding row, and part 1802 of that row, which stores the X address of the next internal instruction block (i.e., the replaced internal instruction block), is set to invalid, so that the address relationship reflects the replacement.
  • The row address of the next instruction block of an instruction block is the row address of that block incremented by '1' and can therefore be left implicit;
  • in that case the 1801 and 1802 fields only need to record the way number to achieve their function.
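  • The linking and invalidation behaviour of lower block address memory 1709 resembles a doubly linked list indexed by X address. The sketch below illustrates it under assumptions: the class name `NextBlockLinks` is hypothetical, `None` stands for an invalid entry, and clearing the replaced block's own row as well as guarding the invalidation are choices made for this sketch, not statements from the text.

```python
from typing import List, Optional


class NextBlockLinks:
    """Toy model of lower block address memory 1709.

    prev[x] plays the role of part 1801 (previous block's X address) and
    next[x] the role of part 1802 (next block's X address).
    """

    def __init__(self, rows: int) -> None:
        self.prev: List[Optional[int]] = [None] * rows
        self.next: List[Optional[int]] = [None] * rows

    def link(self, current_x: int, next_x: int) -> None:
        """Record that block next_x sequentially follows block current_x."""
        self.next[current_x] = next_x
        self.prev[next_x] = current_x

    def replace(self, x: int) -> None:
        """Block x is being replaced: invalidate its predecessor's 1802."""
        p = self.prev[x]
        # Guard (assumption): only invalidate if the predecessor still
        # points at x, so a stale 1801 value cannot break another link.
        if p is not None and self.next[p] == x:
            self.next[p] = None
        # Assumption: the row now describes a new block with no links yet.
        self.prev[x] = None
        self.next[x] = None
```

  • After `link(0, 1)` and `link(1, 2)`, calling `replace(1)` leaves `next[0]` invalid, so the tracker would have to request the external address again when it runs off the end of block 0.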
  • Figure 19 is an embodiment of the processor system of the present invention including a two-level instruction memory.
  • Converter 1202, instruction memory 1203, processor core 1201, intra-block offset mapper 1504, end flag memory 1506, branch target memory 1607, lower block address memory 1709, selector 1711, OR logic 1707 and the tracker are the same as in Figure 16.
  • Instruction memory 1203, intra-block offset mapper 1504, lower block address memory 1709, end flag memory 1506 and branch target memory 1607 together form the first level of the instruction storage hierarchy, while instruction memory 1903, tag memory 1905 and block address mapping module 1904 (similar in function to 620 in Figure 6) together form the second level.
  • Instruction memory 1203 (hereinafter the first-level instruction buffer, to distinguish it clearly from 1903) stores internal instructions, while instruction memory 1903 stores external instructions.
  • Before being executed by processor core 1201, the external instructions in instruction memory 1903 are converted by converter 1202 into the corresponding internal instructions and stored in first-level instruction buffer 1203 for use by processor core 1201.
  • One external instruction block may correspond to a plurality of internal instruction blocks.
  • Instruction memory 1903 contains the external instructions corresponding to all internal instructions in first-level instruction buffer 1203, so a single tag memory 1905 can serve both storage levels.
  • The rows of tag memory 1905 correspond one-to-one with the external instruction blocks in instruction memory 1903 and store the tag addresses of the corresponding external instruction blocks.
  • A block address mapping module 1904 is also added, whose rows likewise correspond to the rows of tag memory 1905.
  • Each row stores the 1X addresses, together with valid signals, of the one or more internal instruction blocks in first-level instruction buffer 1203 that correspond to the external instruction block (when an internal instruction block corresponding to the external instruction block has not yet been stored in 1203, the valid signal of its 1X address is invalid).
  • Figure 18C is a schematic diagram of the external instruction address format in the virtual machine system with two storage levels.
  • The external instruction address consists of the block address, sub-block number 1813 and intra-block offset address 1814.
  • The block address, which corresponds to an external instruction block in instruction memory 1903, can be further divided into tag 1811 and index number 1812. Index number 1812 addresses a row of tag memory 1905 to read the tag information stored there, which is compared with tag 1811 in the address to determine whether the external instruction block address matches.
  • Index number 1812 can also address the memory in block address mapping module 1904 to select one of its rows, and sub-block number 1813 selects one of the columns of that memory.
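  • The address format of Figure 18C can be illustrated by splitting an integer into its four fields. The field widths below (`OFFSET_BITS`, `SUBBLOCK_BITS`, `INDEX_BITS`) and the function name are assumptions chosen for the example; the text does not specify them.

```python
from typing import NamedTuple

# Hypothetical field widths, chosen only for illustration.
OFFSET_BITS = 4    # intra-block offset address 1814
SUBBLOCK_BITS = 1  # sub-block number 1813 (two sub-blocks per block)
INDEX_BITS = 6     # index number 1812


class ExternalAddress(NamedTuple):
    tag: int        # tag 1811, compared against tag memory 1905
    index: int      # index 1812, selects a row in 1905 and 1904
    sub_block: int  # sub-block number 1813, selects a column in 1904
    offset: int     # offset 1814 inside the sub-block


def split_external_address(addr: int) -> ExternalAddress:
    """Split an external instruction address into its Figure 18C fields."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    addr >>= OFFSET_BITS
    sub_block = addr & ((1 << SUBBLOCK_BITS) - 1)
    addr >>= SUBBLOCK_BITS
    index = addr & ((1 << INDEX_BITS) - 1)
    tag = addr >> INDEX_BITS
    return ExternalAddress(tag, index, sub_block, offset)
```

  • With these widths, the tag and index together form the block address; the index alone picks the row in both tag memory 1905 and block address mapping module 1904.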
  • FIG. 20 shows a block address mapping module 1904 according to the present invention.
  • The block address mapping module is composed of a write module 2001 and an output selector 2007.
  • In this embodiment, each external instruction block is divided into two sub-blocks, and the external instructions in each sub-block are converted by instruction converter 1202 into internal instructions stored in one first-level instruction block in the first-level instruction buffer.
  • Therefore, each row of the memory in 1904 corresponds to one (second-level) external instruction block in second-level instruction cache 1903, and the memory is divided into two columns, 2003 and 2005, which correspond to the two sub-blocks of each external instruction block and are selected by sub-block number 1813.
  • Each entry of the memory corresponds to one sub-block and stores the first-level block address (1X address) of the (first-level) internal instruction block corresponding to that external instruction sub-block.
  • Block address mapping module 1904 thus maps an external instruction block address to its corresponding internal instruction block address, associating each external instruction sub-block with its corresponding internal instruction block.
  • The internal instruction block corresponding to an external instruction sub-block can be placed in any first-level cache block of the first-level instruction cache, so the first-level instruction cache can be organized as fully associative.
  • When the memory in block address mapping module 1904 is written, sub-block number 1813 in the external instruction address controls write driver 2001 to drive memory column 2003 or 2005, and index address 1812 selects a row of the memory, into which the corresponding internal instruction 1X address (i.e., 1X in Figure 20) is written.
  • When the memory in block address mapping module 1904 is read, index address 1812 selects one row of the memory, and sub-block number 1813 in the external instruction address controls output selector 2007 to output the data of column 2003 or 2005.
  • If the 1X address read out is valid, it is sent back via bus 1906 to the first-level instruction storage hierarchy to fill the invalid entry in lower block address memory 1709. Alternatively, the 1X address addresses intra-block offset mapper 1504 so that the offset address within the external instruction block on bus 1708 is mapped to the offset address within the internal instruction block (the 1Y address); the 1X address together with that offset constitutes the internal instruction branch target address, which is stored in the branch target memory 1607 entry that missed.
  • The operation is the same as in the example of Fig. 17.
  • If the 1X address read out is invalid, the external instruction address on bus 1708 addresses the second-level cache, and the corresponding external instruction sub-block is sent to instruction converter 1202, converted into an internal instruction block, and stored in the first-level cache block specified by the cache replacement logic in first-level instruction buffer 1203. The 1X address of that first-level cache block is stored in the entry of 1904 pointed to by the external instruction address (that is, the entry originally read as invalid), and the entry is set to valid.
  • The intra-block offset mapping relationship and the end flag generated during instruction conversion are also written into intra-block offset mapper 1504 and end flag memory 1506.
  • The 1X address is then sent back to the first-level instruction storage hierarchy via bus 1906 as in the previous case and stored in the invalid entry of lower block address memory 1709, or is stored, together with the offset address within the internal instruction block produced by the mapping, in the branch target memory 1607 entry that missed. The operation is the same as in the example of Fig. 17.
  • If the tag match in tag memory 1905 misses, the external instruction address on bus 1708 is sent to the lower-level memory, and the external instruction block read is filled into the second-level cache block specified by the cache replacement logic in second-level instruction cache 1903. At the same time, tag 1811 of the external instruction address on bus 1708 is stored in the row of tag memory 1905 corresponding to that second-level cache block, and the two entries of block address mapping module 1904 corresponding to that second-level cache block are set to invalid. Processing then continues as in the case above, where the tag match hits but the 1X address read from the block address mapping module is invalid.
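  • The three cases above (tag miss, tag hit with invalid 1X, tag hit with valid 1X) can be traced with a small model of the Figure 20 module. This is a sketch under assumptions: the second-level cache is treated as direct-mapped so that the index alone names the row, `None` stands for an invalid entry, and `allocate_l1` is a hypothetical callback standing in for instruction conversion plus first-level cache allocation.

```python
from typing import Callable, List, Optional


class BlockAddressMapper:
    """Toy model of module 1904: one row per external block, two columns."""

    def __init__(self, rows: int) -> None:
        # columns[0] ~ column 2003, columns[1] ~ column 2005
        self.columns: List[List[Optional[int]]] = [
            [None] * rows, [None] * rows]

    def write(self, index: int, sub_block: int, x1: int) -> None:
        self.columns[sub_block][index] = x1

    def read(self, index: int, sub_block: int) -> Optional[int]:
        return self.columns[sub_block][index]

    def invalidate_row(self, index: int) -> None:
        for column in self.columns:
            column[index] = None


def lookup_1x(mapper: BlockAddressMapper, tags: List[Optional[int]],
              tag: int, index: int, sub_block: int,
              allocate_l1: Callable[[], int]) -> int:
    """Return the 1X address for an external address, filling on a miss."""
    if tags[index] != tag:
        # Second-level miss: refill the block (not modelled), store the
        # new tag and invalidate both mapping entries of the row.
        tags[index] = tag
        mapper.invalidate_row(index)
    x1 = mapper.read(index, sub_block)
    if x1 is None:
        # Sub-block not yet converted: convert it, place it in a
        # first-level block and record that block's 1X address.
        x1 = allocate_l1()
        mapper.write(index, sub_block, x1)
    return x1
```

  • A second lookup of the same external address returns the recorded 1X address without calling `allocate_l1` again, which is the point of keeping the mapping: conversion happens once per sub-block for as long as the mapping stays valid.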
  • For a fixed-length external instruction set, the boundary of an external instruction block or sub-block coincides with the starting point of an external instruction. Therefore, whether the external instruction block (or sub-block) is entered by sequential execution or by a branch, the complete block or sub-block can be converted, starting from its boundary, into the corresponding internal instruction block and stored in the internal instruction memory.
  • If the external instruction set is a variable-length instruction set, the starting address of the first external instruction in an external instruction block (or sub-block) does not necessarily coincide with the boundary of the block (or sub-block).
  • External instruction block 2101 is one row (an external instruction block or sub-block) of instruction memory 1903, and internal instruction block 2102 is the row of first-level instruction cache 1203 corresponding to external instruction block 2101.
  • If the target instruction of the first branch into the block is external instruction 2105, conversion can start from target instruction 2105 and continue to the end of the instruction block, with the results stored in the internal instruction cache block. The internal instructions can still be stored in order of increasing address, but the highest address of all converted internal instructions is aligned with the highest address (MSB) of internal instruction block 2102 (i.e., the rightmost position of internal instruction block 2102 in Figure 21).
  • Internal instruction 2106, corresponding to external instruction 2105, is stored at the location shown in the figure, and the internal instructions corresponding to all external instructions from instruction 2105 onward in external instruction block 2101 are stored in address order in the shaded part of internal instruction block 2102 shown in Figure 21.
  • A pointer is added to each row of instruction memories 1903 and 1203, pointing respectively to the first external instruction in the external instruction block that has been converted (see pointer 2103 to external instruction 2105 in Figure 21) and to the first internal instruction that has been stored in the internal instruction block (see pointer 2104 to internal instruction 2106 in Figure 21).
  • On a later entry into the block, the offset address within the external instruction block at the time of entry can be compared with the pointer to determine whether the target instruction has already been converted.
  • In intra-block offset mapper 1504, the mapping relationship is likewise stored aligned to the high-order end, consistent with the internal instruction cache block. The two pointers above can be implemented in each row of intra-block offset mapper 1504.
  • Each row of the lower block address memory includes, in addition to first part 1801 and second part 1802 of the embodiment of Fig. 18A, a third part 1803 that stores the 1Y address of the first internal instruction in the next internal instruction block of the internal instruction block corresponding to the row.
  • Second part 1802 and third part 1803 together constitute the address of the first internal instruction of the next internal instruction block. Thus, even when the internal instructions do not start at the lowest address (LSB) of an internal instruction block because the external instruction boundary is not aligned, the lower block address memory can still be addressed with the block address (i.e., the 1X address) of the current internal instruction block to find the first instruction of the next internal instruction block.
  • The formats of Figure 21 and Figure 18B can also be applied to the embodiments of Fig. 15, Fig. 16 and Fig. 17 when the external instruction start address is not aligned with the external instruction block boundary.
  • Figure 21 and Figure 18B describe an embodiment that solves the misalignment between instructions and block boundaries when external instruction sub-blocks and internal instruction blocks are in strict one-to-one correspondence.
  • Figure 22 is another embodiment of the block address mapping module of the present invention. It implements a flexible mapping between external instruction blocks and internal instruction blocks to solve the misalignment between instructions and block boundaries, and can be applied to the embodiment above that contains block address mapping module 1904.
  • In this embodiment, the instructions in one external instruction block, once converted into internal instructions, can be placed in up to three internal instruction blocks (any other number is also possible).
  • The main part of the block address mapping module is divided into three memories.
  • Each row of the three memories corresponds to one external instruction block, and each row entry consists of two fields: the intra-block offset address, within its external block, of an external instruction (2Y in the figure), and the block address in first-level instruction buffer 1203 of the corresponding internal instruction block (1X in the figure).
  • When an external instruction block is entered at a branch target, all complete external instructions in the block from the intra-block offset address of the branch target (2Y) onward are converted into internal instructions and placed in one internal block.
  • The 2Y value and the block address (1X) of that internal instruction block are stored in memory 2201 in the figure, in the row pointed to by the external instruction block address (2X).
  • The entry thus records that the first internal instruction of the internal instruction block at block address 1X corresponds to the external instruction at intra-block offset 2Y in external instruction block 2X.
  • If the converted internal instructions exceed the capacity of that internal instruction block, another internal instruction block is allocated to store the overflowing internal instructions, and the intra-block offset address (2Y) of the external instruction corresponding to the first overflowing internal instruction is stored in memory 2202 together with the block address (1X) of the newly allocated internal instruction block.
  • The intra-block offset mapping relationship between external instructions and internal instructions is also stored in intra-block offset mapper 1504 of Figure 19.
  • The external offset address 2Y of the branch target is mapped to an offset within the internal instruction block, 1Y, according to the mapping relationship in the row of intra-block offset mapper 1504 pointed to by the corresponding internal instruction block address 1X.
  • At this point, the external instructions starting from the branch target have been converted into internal instructions by instruction converter 1202; external instruction block address 2X has been mapped to internal instruction block address 1X by block address mapping module 1904; and external intra-block offset address 2Y has been mapped to the internal instruction offset address 1Y by intra-block offset mapper 1504.
  • The branch target internal instruction address (1X, 1Y) can then be stored in branch target memory 1607 for use by the tracker.
  • On a later access, external block address 2X in the access address addresses memories 2201, 2202 and 2203, and the entries read out are sent to comparator 2204.
  • In comparator 2204, the instruction offset address 2Y within the external instruction block in the access address is compared with the 2Y value read from each memory, and the 1X value stored in the first memory whose 2Y value is less than the 2Y value in the access address is selected as output 1906 of block address mapping module 1904. Subsequent operations are the same as described above. If the 2Y value in memory 2201 (the smallest among all memories in block address mapping module 1904) is still greater than the 2Y (BN2Y) of the access address, the access target instruction has not yet been converted into an internal instruction.
  • In that case, the system controls instruction converter 1202 to convert the external instructions from the access target up to, but not including, the 2Y value stored in memory 2201 into internal instructions, which are stored in the first-level cache block specified by the first-level cache block replacement logic.
  • The row of memory 2202 pointed to by external instruction block address 2X in the access address is moved right to the same row of 2203, the row of memory 2201 pointed to by 2X is moved right to the same row of 2202, and the 2Y value of the access target together with the newly specified 1X value is stored in memory 2201.
  • In this way, an external instruction block is converted, starting from the entry points of multiple accesses, into several internal instruction blocks, and the mapping relationships are recorded in block address mapping module 1904 with the structure described here.
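  • One row of the flexible mapping can be modelled as a short list of (2Y, 1X) pairs kept in ascending 2Y order, with position 0 playing the role of memory 2201. The sketch below is an interpretation, not the described circuit: the class name `ElasticRow` is hypothetical, the comparison is taken as "less than or equal" so that re-entering at a recorded start offset hits (the text says "less than"), the entry chosen is the one with the largest qualifying 2Y, and an entry shifted out of the third position is simply dropped.

```python
from typing import List, Optional, Tuple

MAX_BLOCKS = 3  # memories 2201, 2202 and 2203


class ElasticRow:
    """Toy model of one 2X row across memories 2201-2203."""

    def __init__(self) -> None:
        # (2Y start offset, 1X block address), ascending by 2Y.
        self.entries: List[Tuple[int, int]] = []

    def lookup(self, y2: int) -> Optional[int]:
        """Comparator 2204: 1X of the block that covers offset y2."""
        hit: Optional[int] = None
        for start, x1 in self.entries:
            if start <= y2:   # assumption: equality counts as a hit
                hit = x1      # keep the largest start not above y2
        return hit

    def add_overflow(self, y2: int, x1: int) -> None:
        """Overflow block: recorded after the existing entries."""
        if len(self.entries) < MAX_BLOCKS:
            self.entries.append((y2, x1))

    def add_earlier_entry(self, y2: int, x1: int) -> None:
        """New entry point below all recorded ones: shift right, store first."""
        self.entries.insert(0, (y2, x1))
        del self.entries[MAX_BLOCKS:]  # assumption: the last entry is dropped
```

  • A `lookup` that returns `None` corresponds to the case where even the smallest recorded 2Y is greater than the access offset, which triggers conversion of the instructions between the access target and that recorded 2Y.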
  • A track table can also be incorporated into the processor system.
  • This embodiment is a processor system including a track table according to the present invention.
  • Since the track table of the present invention itself already contains the branch target address information, the next instruction block address information and the end track point information, track table 2301 can replace lower block address memory 1709, end flag memory 1506 and branch target memory 1607.
  • Tag memory 1905, block address mapping module 1904, converter 1202, first-level instruction buffer 1203, processor core 1201, intra-block offset mapper 1504, selector 1711, OR logic 1707 and the tracker are the same as in Figure 19.
  • A scanner 2302 is also added, which examines the external instructions being converted and, for each branch instruction, calculates the external instruction address BN2 of the branch target; this is later converted to the corresponding internal instruction address BN1.
  • Internal instruction address BN1 is the address of first-level instruction buffer 1203. The internal instructions in first-level instruction buffer 1203 correspond to the track points in track table 2301, and the track point corresponding to a branch instruction contains the internal instruction address of the branch target. The tracker can therefore address the track table as described above, read the content of the track point, and, depending on the outcome of the branch instruction, select either the current tracking address incremented by '1' or the branch target tracking address in the track point as the tracking address of the next internal instruction.
  • The content of a track point in the track table also determines whether the last instruction of the internal instruction block has been reached. For example, a flag bit in the track point may indicate whether the track point corresponds to the last instruction; when the tracker read pointer points to the track point, the flag value read on bus 2313 indicates whether the last instruction has been reached.
  • Via bus 2311, track table 2301 can simultaneously output the branch target BN1 address of the track point addressed by the tracker read pointer and the BN1 address of the next internal instruction block; both are supplied to selector 1711 as in the earlier embodiment.
  • A further difference between this embodiment and the embodiment of Fig. 19 is the addition of selector 2315. It selects between the BN1 internal instruction address (also the first-level instruction cache address), formed by the BN1X sent by block address mapping module 1904 over bus 1906 and the BN1Y address sent by intra-block offset mapper 1504, and the BN2 second-level cache address output by scanner 2302, and stores the selected address in track table 2301.
  • Scanner 2302 examines the external instructions sent from second-level instruction cache 1903 toward first-level instruction buffer 1203 and, for each branch instruction, calculates the external instruction address of the branch target as the address of the external branch instruction plus the branch offset carried in the instruction.
  • The index portion of the calculated branch target address addresses the tag memory, and the content read is matched against the tag portion of the branch target address. If they do not match, the external instruction block is read from the lower-level memory using the external instruction address and stored in the second-level cache block of 1903 specified by the cache block replacement logic; the tag portion of the external instruction address is stored in the corresponding row of tag memory 1905, and all valid bits in the corresponding row of block address mapping module 1904 are set to 'invalid'.
  • Index number 1812 of the external instruction address (together with the way number, if second-level cache 1903 is set-associative) is the second-level cache block address BN2X; sub-block number 1813 and intra-block offset address 1814 together form BN2Y; BN2X and BN2Y together form the second-level cache address BN2.
  • This BN2 is stored in the entry of track table 2301 for the internal branch instruction corresponding to the external branch instruction.
  • Thus, when an external branch instruction has been converted into an internal instruction and stored in first-level instruction buffer 1203, the second-level cache address BN2 of its branch target is already present in the track table entry corresponding to the internal branch instruction.
  • When tracker read pointer 1723 (first-level cache address BN1) addresses first-level instruction buffer 1203 to read the internal branch instruction for execution by processor core 1721, it also addresses track table 2301 to read the track table entry corresponding to that instruction.
  • If output 2311 of track table 2301 is in BN2 format, selector 1711 places the BN2 on bus 2304, and the BN2 addresses block address mapping module 1904. If the mapped output is 'invalid', the instruction block containing the branch target instruction has not yet been converted into an internal instruction block and stored in first-level instruction buffer 1203.
  • the processor system controls to address the L2 cache with the BN2. 1903 reads the external instruction block and sends it to the scanner 2302.
  • the branch target of the branch instruction in the foregoing calculation block is also sent to the instruction converter 1202 as described above.
  • the converted internal instruction block is stored in the first-level instruction buffer at the BN1X address given by its cache block replacement logic.
  • the system also stores the BN1X address in the originally 'invalid' entry of the block address mapping module 1904.
  • the offset address mapping generated by the instruction converter 1202 is also stored in the row of the intra-block offset mapper 1504 that the BN1X points to.
  • the virtual machine system maps the external instruction offset address 1814 to the internal instruction offset BN1Y according to the mapping row in 1504 pointed to by the above BN1X.
  • the first-level cache address BN1 of the branch target internal instruction, formed from BN1X and BN1Y, is written to the track table entry of the corresponding branch instruction, replacing the original BN2.
  • the branch target external instruction and the subsequent external instruction block have been converted into internal instruction blocks and stored in the level 1 cache.
  • the level 1 cache address of the internal branch target instruction has been stored in the track table entry corresponding to its branch source instruction.
  • the first-level cache address 1723 (BN1) output by the tracker addresses the first-level instruction buffer 1203. While the internal branch instruction is read for execution by the processor core 1721, the track table 2301 is also addressed to read the track entry corresponding to the instruction.
  • the output 2311 of the track table is in BN1 format, the BN1
  • the selectors 1711 and 1705 are controlled by the branch decision signal 1713; if 1713 indicates 'no branch', the tracker read pointer, that is, the first-level cache address 1723, is incremented by 1703.
  • Level 1 cache address 1723 directly addresses the Level 1 Instruction Buffer 1203 and reads the internal instructions for execution by the Processor Core 1721.
  • Figure 6 shows a concrete implementation of the structure of Figure 23.
  • the end track point in the track table is treated in the same way: when external instructions are converted into internal instructions and stored in a first-level cache block, the scanner 2302 also calculates the external address of the sequentially next instruction block (the current external instruction block address plus one) and sends it to the tag memory 1905 for matching. If there is no match, the external instruction block is fetched from lower-level memory into the second-level cache 1903 as described above.
  • the block is cached at the BN2X address specified by the cache block replacement logic, and the tag memory 1905 and the corresponding row in the block address mapping module 1904 are updated. The BN2X thus obtained, or the BN2X obtained on a match, is stored in the track table 2301.
  • the BN2 is read from the end of the track table track 2309, its BN2X
  • branch target instruction address BN2 is read out by 2311, it is generally sent to the block address mapping module 1904 via bus 2304 to map to BN1X (such as the BN1X). If the address is invalid, as in the previous example, the BN2 addresses the L2 buffer 2302 to convert the external instruction into an internal instruction and store it in the first-level instruction buffer 1203.
  • the cache block replacement logic is BN1X.
  • the BN1X and BN1Y constitute a BN1 address, which is stored in the track table 2301 via the selector 2315 to replace the original BN2.
  • for the above branch target address or next-block address, the corresponding entry of the block address mapping module 1904 can be checked for validity when tag matching is first performed. If the entry is valid, the branch target instruction or next-block instruction has already been stored in the first-level instruction buffer 1203 in the form of internal instructions, and the BN1X in the 1904 entry is then used as described above.
  • BN2Y is mapped to BN1Y and BN1 is directly stored in the track table.
  • FIG. 24 is an embodiment of a processing system for implementing a stack operation function by using a register file according to the present invention.
  • register file 2402 in the processor core can be configured for stack use.
  • the stack controller 2404 is based on the decoding result of the instruction and the current register file.
  • the adjusted output addresses 2405 and 2406 are sent to the register file 2402 as the stack top pointer value and the stack bottom pointer value, respectively.
  • the stack controller 2404 can be implemented with the controller 1019, register 1011, decrementer 1031, incrementer 1041 and selector shown in Figure 10A.
  • the register 1011 stores the current stack top pointer value.
  • the two most basic stack operations are pop (POP) and push (PUSH).
  • the decrementer 1031 and the incrementer 1041 decrease and increase the current stack top pointer value by '1', corresponding to the pop case (stack top pointer value minus '1') and the push case (stack top pointer value plus '1') respectively.
  • the operands read from the memory 2403 can be pushed in sequence onto the register file 2402 (the stack top pointer value is increased by '1' each time) to implement stack-based data reading; several operands can be popped from the register file 2402 (the stack top pointer value is decreased by '1' each time) to the execution unit 2401 and, after the corresponding arithmetic or logic operation, the result pushed back to the register file 2402 (the stack top pointer value is increased by '1') to implement stack-based computation; stack operands can also be stored from the register file 2402 to the memory (the stack top pointer value is decreased by '1') to implement stack-based data storage.
  • the stack top pointer operation (plus '1', unchanged, or minus '1') can be applied to each read port or write port by having the processor instruction control three bits in the register file address field of that port.
  • the stack top pointer value and the stack bottom pointer value can be compared to determine whether the stack is full (or nearly full) and whether it is empty (or close to empty).
  • when the stack built in the register file 2402 is full (or nearly full), some data near the bottom of the stack can be saved temporarily to the memory 2403 under the control of the stack controller 2404, and the stack bottom pointer adjusted to point to the new stack bottom, so that the stack formed by the register file 2402 frees part of its storage space for subsequent stack operations. The storage space in the memory 2403 can itself be organized as a stack, so that the temporarily saved data is stored and retrieved by stack operations (push and pop) and its original order is preserved.
  • when the stack built in the register file 2402 is empty (or nearly empty), some of the previously saved data can be read back from the stack in the memory 2403, in stack order, into the corresponding registers under the control of the stack controller 2404, and the stack bottom pointer adjusted to point to the new stack bottom. This restores the state the data had before it was saved to the memory 2403, so that the stack built in the register file 2402 still holds data for subsequent stack operations. In this way, the stack operation function can be implemented using the register file.
  • the software interpreter translates the intermediate code into several machine instructions in real time and then executes them by the hardware platform. Therefore, the execution of the intermediate code is not efficient.
  • stack operation instructions can be directly executed (ie, each stack operation instruction is converted into a corresponding one internal instruction), thereby greatly improving the execution efficiency of the processor system.
  • the multi-instruction set processor system of the present invention implements the virtual machine entirely in hardware as compared to the prior art, which typically uses software to implement a virtual machine.
  • the virtual machine system is used to execute a program consisting of variable length instructions, that is, the external instructions are variable length instructions.
  • the value of the value is written.
  • the value of the register that controls the start of the conversion of the instruction starts from the entry address of the (branch target or sequence).
  • the variable length instruction required by the processor core 1201 has been stored in the instruction memory 1903, and the instruction memory 1903 is addressed.
  • the instruction block in which the variable length instruction is located is sent to the scanner 2302 and the converter 1202.
  • during the scan/conversion, if the first-level instruction buffer 1203 already stores the internal instruction corresponding to the branch target, the variable length instruction address of the branch target can be translated (by the tag memory 1905, block address mapping module 1904 and intra-block offset mapper 1504, as described above) into the corresponding internal instruction address BN1, which is stored in the track table as the track point content. If the first-level instruction buffer 1203 does not store the internal instruction corresponding to the branch target but the instruction memory 1903 already stores the branch target, the variable length instruction address BN2 of the branch target can be stored in the track table as the track point content.
  • the branch target can be filled from the outer memory into the row of the instruction memory 1903 determined by the replacement algorithm, and the variable length instruction address BN2 of the branch target is stored in the track table as the track point content.
  • the track table 2301 contains the address information of the branch target of the variable length branch instruction.
  • the tracker, based on the content read from the track table 2301 and the execution results of the processor core 1201, controls the first-level instruction buffer 1203 to output the corresponding internal instructions for execution by the processor core 1201.
  • the internal instruction address of the branch target BN1 can be output according to 2311 of the track table 2301.
  • the corresponding internal instructions are found directly from the level one instruction cache 1203 for execution by the processor core 1201.
  • the track table 2301 outputs the variable length instruction address BN2 of the branch target. If the internal instruction corresponding to the variable length instruction was stored in the instruction memory 1203 during earlier operation, the variable length instruction address can be converted into the corresponding internal instruction address by the address conversion described above, and the corresponding internal instruction is then found in the first-level instruction buffer 1203 by that address for execution by the processor core 1201.
  • the corresponding variable length instruction is found in the instruction memory 1903 by the variable length instruction address; scan/conversion then proceeds as described above from that variable length instruction to the last unconverted instruction in the instruction block, the corresponding internal instruction block is stored in the first-level instruction buffer 1203, the corresponding track is established in the track table 2301, and the internal instructions obtained by converting the variable length instructions are supplied to the processor core 1201 for execution.
  • execution of the internal instructions by the processor core 1201 generates corresponding execution results, such as the TAKEN signal indicating whether a branch is taken when an internal branch instruction is executed, and these are sent to the tracker.
  • as described above, the tracker selects among multiple address sources based on the TAKEN signal and on the signal, sent by the track table 2301 via bus 2313, indicating whether the last instruction of the instruction block has been reached, thereby controlling the program flow so that execution continues.
  • the processor system executes the program consisting of the variable length instruction
  • the program consisting of the fixed length instruction is executed instead.
  • the operation of the processor core is stopped, the state in the processor core and in each memory is invalidated, and the conversion rules and register settings between the fixed length instruction set and the internal instruction set are loaded into the memory and registers of the converter 1202, replacing the previously stored conversion rules for the variable length instruction set.
  • the value of the register that controls the start point of the instruction conversion is the conversion from the lowest address of the external instruction block or sub-block.
  • when executing fixed length instructions, if the fixed length instruction required by the processor core 1201 has already been stored in the instruction memory 1903, the instruction memory 1903 is addressed.
  • the instruction block in which the fixed length instruction is located is sent to the scanner 2302 and the converter 1202. The fixed length instruction block is scanned/converted: the branch target addresses of its branch instructions are calculated and converted into the corresponding internal instruction addresses, and the converted internal instruction block is stored in the first-level instruction buffer at the location given by the replacement algorithm.
  • the corresponding track is created in the corresponding row of the track table 2301.
  • the track created in 2301 is basically the same, except that the entire fixed length external instruction block is scanned and converted.
  • the process by which the tracker, based on the content read from the track table 2301 and the execution results of the processor core 1201, controls the first-level instruction buffer 1203 to output the corresponding instructions for execution by the processor core 1201 is the same as for the variable length instructions described above.
  • the converter 1202 can be reconfigured by switching the instruction set.
  • the specific manner is similar to the above description of changing from one instruction set to another, except that the track table 2301, the instruction buffer 1203, the instruction memory 1903 and the other memories do not all need to be cleared in the process. Because the tracks of different threads in the track table 2301 do not interfere with each other, and the other memories are tied to the track table, the threads are independent of each other and have independent track spaces. When the instruction set or thread is switched, execution simply follows the tracker of the corresponding thread.
  • a memory can be used in the tracker to hold the read pointer of each thread so that the corresponding read pointer can be conveniently restored when the thread (or virtual machine) switches.
  • similarly, a memory holding a copy for each thread can be provided for every status register of the processor core 1701. The time taken to switch between threads is then only the time required to exchange data between the read pointer and the read pointer memory, and between the processor core status registers and the status memories.
  • the processor system of the present invention can also be combined with the method of Figure 13B by converter 1202.
  • the external instructions are converted according to different thread numbers by using corresponding instruction set correspondences, so that the processor system does not need to reconfigure the converter by suspending the processor core if the instruction sets corresponding to different threads are different.
  • the instructions can be executed without interruption.
  • before the program is executed, the correspondences for all external instruction sets that may be used can be loaded, by the method described in the embodiment of Figure 13B, into the memory of the converter 1202, in the storage spaces addressed by thread number.
  • when converting external instructions, the thread number is first used to address the memory of the converter 1202 to find the corresponding storage space, and the external instruction is then converted into an internal instruction according to the correspondence in that storage space, following the foregoing method.
  • the apparatus and method proposed by the present invention can be used in various applications related to instruction set conversion, and the efficiency of the processor system can be improved.
  • the apparatus and method proposed by the present invention can also be used in various virtual machine related applications, implementing virtual machines in hardware, and improving the efficiency of the virtual machine system.
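
The register-file stack of Figure 24 described in the bullets above (push increases the stack top pointer, pop decreases it, and the stack bottom is spilled to memory when the register file fills up and refilled when it runs empty) can be sketched as follows. This is a minimal software model under stated assumptions, not the circuit of the stack controller 2404: the class name `RegisterStack` and its fields are hypothetical, pointers are modelled as unbounded counters reduced modulo the register count, and one value is spilled or refilled exactly when the register file is full or empty, whereas the text also allows acting when it is nearly full or nearly empty.

```python
class RegisterStack:
    """Operand stack held in a small register file, with spill to memory."""

    def __init__(self, size: int):
        self.regs = [None] * size  # models register file 2402
        self.top = 0               # stack top pointer (next free slot)
        self.bottom = 0            # stack bottom pointer (oldest value in registers)
        self.memory = []           # models memory 2403, itself organized as a stack

    def push(self, value):
        if self.top - self.bottom == len(self.regs):
            # Register file full: save the bottom value and move the bottom up.
            self.memory.append(self.regs[self.bottom % len(self.regs)])
            self.bottom += 1
        self.regs[self.top % len(self.regs)] = value
        self.top += 1              # push: stack top pointer plus 1

    def pop(self):
        if self.top == self.bottom:
            # Register file empty: bring back the most recently saved value.
            if not self.memory:
                raise IndexError("pop from empty stack")
            self.bottom -= 1
            self.regs[self.bottom % len(self.regs)] = self.memory.pop()
        self.top -= 1              # pop: stack top pointer minus 1
        return self.regs[self.top % len(self.regs)]
```

Because the spilled values are themselves kept in last-in, first-out order, refilling restores them in their original order, which is the property the text asks of the memory-side stack.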


Abstract

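An instruction set conversion system and method are provided that convert external instructions into internal instructions for execution by a processor core (1201), and that allow the instruction sets supported by the processor system to be extended easily through configuration. A method for real-time conversion between internal instruction addresses and external instruction addresses is also provided, so that the processor core (1201) can read internal instructions directly from a higher-level cache, reducing pipeline depth.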

Description

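Instruction Set Conversion System and Method
Technical Field
The present invention relates to the fields of computers, communications and integrated circuits.
Background Art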
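At present, when programs belonging to different instruction sets need to be executed on one processor core, the most common approach is a software virtual machine (or virtualization layer). The virtual machine translates or interprets a program composed of an instruction set that the processor core does not support (the external instruction set), producing the corresponding instructions of the instruction set that the processor core itself supports (the internal instruction set) for execution. With interpretation, the virtual machine uses software at run time to fetch each field of an external instruction in turn, including the opcode and operands, and then uses a stack structure implemented in memory to operate on the operands according to the opcode. Many internal instructions must therefore be executed to realize the function of any single external instruction, which is very inefficient. With translation, a process similar to software compilation is run before the program executes, converting the program into a form composed entirely of internal instructions. Execution is then fairly efficient, but the software compilation itself still carries considerable overhead.
A second solution is to include, inside the processor core, instruction decoders for the different instruction sets; when instructions of a given instruction set are executed, the corresponding decoder decodes them and passes them to the subsequent pipeline stages. This approach loses almost nothing in execution efficiency, but the additional instruction decoders increase hardware overhead and raise the cost of the processor chip. Moreover, because the decoders are implemented in hardware inside the processor core in advance, the approach lacks extensibility and cannot support new instruction sets.
A third solution is to add a conversion module outside the processor core that converts the external instruction set into the internal instruction set for execution by the processor core. The conversion module can be implemented in software, but although software interpretation is easy to extend, it is generally too inefficient. The conversion module can also be implemented in hardware, but it is then hard to extend and cannot make full use of the cache to store the converted internal instructions.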
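Technical Problem
Specifically, if the conversion module sits between the cache and the processor core, the cache stores external instructions, which must be converted before the processor core can execute them. The conversion step is then needed whether or not the cache hits, so the same external instructions are converted repeatedly. This not only increases power consumption but also deepens the pipeline of the processor core, increasing hardware overhead and the performance penalty of a branch misprediction.
If the conversion module sits outside the cache (that is, the cache lies between the conversion module and the processor core), the cache stores the converted internal instructions and is addressed by internal instruction address, whereas the branch target instruction address that the processor core computes when executing a branch instruction is an external instruction address. Because internal and external instructions do not correspond one to one (for example, one external instruction may correspond to several internal instructions), the correspondence between internal and external instruction addresses must be recorded so that, on a branch, the external address of the branch target instruction can be converted into an internal address and used to find the correct instruction in the cache. The difficulty in recording this correspondence lies in storing it efficiently and converting efficiently. Otherwise, once a branch is taken, the instruction can only be read by external address from lower-level memory beyond the conversion module, converted by the conversion module, stored into the cache and then executed by the processor core, which still severely degrades execution efficiency. One approach to this problem is to replace the conventional address-matching cache with a trace cache based on program execution paths. A trace cache, however, stores many instructions that have the same address but lie on different paths, which wastes a great deal of capacity and limits its performance.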
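Technical Solution
The method and system proposed by the present invention directly address one or more of the above or other difficulties.
The present invention proposes an instruction set conversion method comprising: converting external instructions into internal instructions and establishing a mapping between external instruction addresses and internal instruction addresses; storing the internal instructions in a cache that the processor core can access directly; and either addressing the cache directly with an internal instruction address to read out the corresponding internal instruction for execution by the processor core, or converting an external instruction address output by the processor core into an internal instruction address according to the mapping and then addressing the cache to read out the corresponding internal instruction for execution by the processor core.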
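Optionally, in the method, subsequent instructions are supplied to the processor core according to the program execution flow and feedback from the processor core's execution of instructions; the feedback may be the signal, generated when the processor core executes a branch instruction, indicating whether the branch is taken.
Optionally, in the method, for an external instruction to be converted, the instruction fields of the external instruction, including the instruction type, are extracted; the extracted instruction type is used to look up the corresponding internal instruction type and instruction conversion control information; the extracted instruction fields are shifted according to the conversion control information; and the internal instruction type and the shifted instruction fields are spliced to form the corresponding internal instruction, thereby converting the external instruction into an internal instruction.
Optionally, in the method, one external instruction is converted into one internal instruction, in which case the address of the external instruction corresponds to the address of the internal instruction; or one external instruction is converted into several internal instructions, in which case the address of the external instruction corresponds to the address of the first of those internal instructions.
Optionally, in the method, several external instructions are converted into one internal instruction, in which case the address of the first of those external instructions corresponds to the address of the internal instruction.
Optionally, in the method, a mapping between external instruction addresses and internal instruction addresses is established.
Optionally, in the method, the mapping between external and internal instruction addresses comprises a mapping between external instruction addresses and internal instruction block addresses, and a mapping between external in-block addresses and internal in-block addresses.
Optionally, in the method, a data structure may be used to represent the mapping between external instruction addresses and internal instruction block addresses; the data structure stores internal instruction block addresses, sorted by both external instruction block address and external in-block address.
Optionally, in the data structure, if the internal instruction block address corresponding to an external instruction address exists, the corresponding position can be found from the external instruction block address and external in-block address contained in that external instruction address, and the internal instruction block address stored there read out.
Optionally, in the data structure, if the internal instruction block address corresponding to an external instruction address does not exist, its insertion position can be found from the external instruction block address and external in-block address contained in that external instruction address, and the corresponding internal instruction block address stored at that position.
Optionally, in the method, an external instruction address can be converted into the corresponding internal instruction block address according to the mapping between external instruction block addresses and internal instruction block addresses.
Optionally, in the method, an external in-block address can be converted into the corresponding internal in-block address according to the mapping between external and internal in-block addresses.
Optionally, in the method, for any external instruction address, forward shift logic starts from an initial value and counts the external instructions between the start address of the external instruction block containing that address and the address itself, shifting forward by one position for each external instruction passed, to obtain a shift result; reverse shift logic then starts from the start address of the internal instruction block corresponding to that external instruction block and counts the internal instructions that are the first internal instruction of an external instruction, shifting backward by one position for each such internal instruction passed, until the shift result returns to the initial value; the internal in-block address reached at that point corresponds to the in-block address of the external instruction.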
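
The forward and reverse shift logic described above amounts to counting up over external instructions and counting back down over internal ones. The sketch below models the shift register as an integer counter. It rests on assumptions that the text does not state: the two boolean marker lists and the name `map_offset` are hypothetical, and the forward count is taken to include the target instruction itself.

```python
def map_offset(ext_starts, int_firsts, ext_offset):
    """Map an external in-block offset to the matching internal in-block offset.

    ext_starts[i] is True when byte i of the external block begins an instruction.
    int_firsts[j] is True when internal slot j holds the first internal
    instruction generated for some external instruction.
    """
    if not ext_starts[ext_offset]:
        raise ValueError("offset does not begin an external instruction")

    # Forward shift: one step per external instruction from the block start
    # up to and including the target instruction.
    count = sum(1 for i in range(ext_offset + 1) if ext_starts[i])

    # Reverse shift: one step back per first internal instruction, until the
    # counter is back at its initial value of zero.
    for slot, is_first in enumerate(int_firsts):
        if is_first:
            count -= 1
            if count == 0:
                return slot
    raise LookupError("internal block holds no instruction for this offset")


# External instructions start at bytes 0, 2 and 5; the second one expands
# into two internal instructions, so the internal slots are 0, 1-2 and 3.
ext_starts = [True, False, True, False, False, True, False, False]
int_firsts = [True, True, False, True]
offset_of_third = map_offset(ext_starts, int_firsts, 5)
```

Only one marker bit per external byte and one per internal slot has to be stored for each block, which is what makes this form of mapping compact.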
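Optionally, in the method, stack register operations are converted by address calculation into operations on the register file, so that the register file inside the processor core can be used as stack registers.
Optionally, in the method, the conversion can convert instructions of one or more instruction sets into instructions of one instruction set.
The present invention also proposes an instruction set conversion system comprising: a processor core for executing internal instructions; a converter for converting external instructions into internal instructions and establishing a mapping between external and internal instruction addresses; an address mapping module for storing the mapping and converting between external and internal instruction addresses; and a cache for storing the converted internal instructions and outputting the corresponding internal instruction, according to an internal instruction address, for execution by the processor core.
Optionally, in the system, the converter further comprises: a memory for storing the correspondence between external and internal instruction types and the correspondence between the instruction fields of the corresponding external and internal instructions; an aligner for shifting and aligning external instructions and, when an external instruction crosses an instruction block boundary, shifting it into a single instruction block and aligning it; an extractor for extracting the instruction fields of the external instruction, where the extracted instruction type is used to address the memory to read out the instruction conversion control information and the corresponding internal instruction type, and the extracted instruction fields are shifted according to the control information; and an instruction splicer for splicing the internal instruction type and the shifted instruction fields to form an internal instruction.
Optionally, in the system, the address mapping module further comprises: a block address mapping module for storing the mapping between external and internal instruction block addresses and converting an external instruction block address into an internal instruction block address; and an offset address mapping module for storing the mapping between external and internal in-block addresses and converting an external in-block address into an internal in-block address.
Optionally, the system further comprises a tracking system; according to the program execution flow stored in it and feedback from the processor core's execution of instructions, the tracking system addresses both the program execution flow and the cache, and reads subsequent instructions from the cache and sends them to the processor core for execution; the feedback may be the signal, generated when the processor core executes a branch instruction, indicating whether the branch is taken.
Optionally, in the system, the address mapping module further contains forward shift logic and reverse shift logic; for any external instruction address, the forward shift logic starts from an initial value and counts the external instructions between the start address of the external instruction block containing that address and the address itself, shifting forward by one position for each external instruction passed, to obtain a shift result; the reverse shift logic then starts from the start address of the internal instruction block corresponding to that external instruction block and counts the internal instructions that are the first internal instruction of an external instruction, shifting backward by one position for each such internal instruction passed, until the shift result returns to the initial value; the internal in-block address reached at that point corresponds to the in-block address of the external instruction.
Optionally, in the system, the register file inside the processor core can be used as stack registers; the system further comprises: a stack top pointer register for storing the current stack top pointer, which points to a register in the register file; an adder for computing the stack top pointer plus one, corresponding to the register above the current stack top; a subtractor for computing the stack top pointer minus one, corresponding to the register below the current stack top register; and a stack bottom control module for detecting whether the stack registers are about to become empty or full, which, when they are about to become full, sends the value of at least one register at the stack bottom to memory and adjusts the stack bottom pointer accordingly so that the stack registers do not overflow, or, when they are about to become empty, adjusts the stack bottom pointer accordingly and stores the value of at least one register previously saved to memory back to the stack bottom, so that the stack registers can continue to supply operands for the processor core.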
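Optionally, in the method, instructions filled into the first-level cache are examined and the corresponding instruction information extracted; how the first read pointer is updated is determined by this instruction information rather than by the function of the instruction itself.
Optionally, in the method, when the first read pointer points to a conditional branch instruction that is immediately followed by an unconditional branch instruction, then according to the processor core's execution result for the conditional branch instruction: if the branch is taken, the first read pointer is updated to the branch target addressing address of the conditional branch instruction; if the branch is not taken, the first read pointer is updated to the branch target addressing address of the unconditional branch instruction; the processor core therefore does not need a separate clock cycle to execute the unconditional branch instruction.
Optionally, in the method, when the processor core reaches a branch instruction, one of the sequential next instruction and the branch target instruction is selected by branch prediction as the subsequent instruction to execute, and the addressing address of the other is saved; if the branch outcome agrees with the prediction, execution of the subsequent instructions continues; if it does not, the pipeline is flushed and execution restarts from the instruction at the saved addressing address.
Optionally, in the system, how the first read pointer is updated is determined by the instruction information rather than by the function of the instruction itself.
Optionally, in the system, the instruction information stored in the track point pointed to by the first read pointer and in the following track point is read from the track table at the same time.
Optionally, in the system, when the first read pointer points to a conditional branch instruction that is immediately followed by an unconditional branch instruction, then according to the processor core's execution result for the conditional branch instruction: if the branch is taken, the first read pointer is updated to the branch target addressing address of the conditional branch instruction; if the branch is not taken, the first read pointer is updated to the branch target addressing address of the unconditional branch instruction; the processor core therefore does not need a separate clock cycle to execute the unconditional branch instruction.
Optionally, in the system, the tracking system further comprises a register for storing the addressing address of one of the sequential next instruction and the branch target instruction; when the processor core reaches a branch instruction, one of the two is selected by branch prediction as the subsequent instruction to execute and the addressing address of the other is stored in the register; if the branch outcome agrees with the prediction, execution of the subsequent instructions continues; if it does not, the pipeline is flushed and execution restarts from the instruction at the addressing address saved in the register.
Optionally, in the system, an end track point is added after the last track point of each track in the track table; the instruction type of the end track point is unconditional branch, and its branch target addressing address is the addressing address of the first track point of the sequentially next track; when the first read pointer points to the end track point, the first-level cache outputs a null instruction.
Optionally, in the system, an end track point is added after the last track point of each track in the track table; the instruction type of the end track point is unconditional branch, and its branch target addressing address is the addressing address of the first track point of the sequentially next track; when the track point preceding the end track point is not a branch point, the instruction type and branch target addressing address of the end track point may be used as the instruction type and branch target addressing address of that track point.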
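The present invention also proposes a processor system capable of executing one or more instruction sets, comprising: a first memory for storing a plurality of computer instructions belonging to a first instruction set; an instruction converter for converting the plurality of computer instructions belonging to the first instruction set into a plurality of internal instructions belonging to a second instruction set; a second memory for storing the plurality of internal instructions produced by the instruction converter; and a processor core connected to the second memory, for reading and executing the plurality of internal instructions from the second memory without accessing the plurality of computer instructions and without involvement of the instruction converter.
Optionally, in the system, the instruction converter contains a memory that can be configured to store the mapping between the first instruction set and the second instruction set; the instruction converter converts the plurality of computer instructions belonging to the first instruction set into the plurality of internal instructions belonging to the second instruction set according to the mapping stored in it.
Optionally, the system further comprises an address translator connecting the instruction converter and the processor core, for translating a target computer instruction address among the plurality of computer instructions into the internal address of the target instruction among the plurality of internal instructions.
Optionally, in the system, when the address translator translates an address, it: maps the target computer instruction address to an internal instruction block address; maps the target computer instruction address to the in-block offset address of the internal instruction within the instruction block corresponding to that block address; and merges the block address and the in-block offset address to form the internal address.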
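
The three translation steps above (map the block, map the in-block offset, merge the two) can be sketched as follows. This is an illustrative model only: the dictionary tables, the function name `translate` and the `offset_bits` parameter are assumptions made for the example, not structures named by the source.

```python
def translate(block_map, offset_maps, ext_block, ext_offset, offset_bits):
    """Translate an external (block, offset) address into an internal address.

    block_map maps an external block address to an internal block address.
    offset_maps[internal_block] maps external in-block offsets to internal ones.
    Returns None when the external block has not been converted yet.
    """
    int_block = block_map.get(ext_block)
    if int_block is None:
        return None  # no mapping yet: the block must be converted first
    int_offset = offset_maps[int_block][ext_offset]
    # Merge: block address in the high bits, in-block offset in the low bits.
    return (int_block << offset_bits) | int_offset


block_map = {0x40: 3}                   # external block 0x40 -> internal block 3
offset_maps = {3: {0: 0, 2: 1, 5: 3}}   # external byte offset -> internal slot
internal_address = translate(block_map, offset_maps, 0x40, 5, offset_bits=4)
```

Returning `None` for an unmapped block mirrors the case in which the target has not yet been converted and must first be fetched and passed through the converter.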
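Optionally, in the system, the block address is produced by mapping according to the block address mapping between computer instruction block addresses and internal instruction block addresses.
Optionally, in the system, the address translator stores the block address mapping, and hardware logic produces the in-block offset address by mapping according to a mapping table.
Optionally, the system further comprises an end flag memory for storing the internal instruction address of the end instruction of an internal instruction block; the end instruction is the last internal instruction before the transfer to the internal instruction block that is next in address order.
Optionally, the system further comprises: a next-block address memory for storing the block address of the internal instruction block that is next in address order; and a branch target buffer for storing the internal instruction addresses of branch targets.
Optionally, in the system, the first memory stores a plurality of computer instructions belonging to a third instruction set; the instruction converter is configured so that its memory stores the mapping between the third instruction set and the second instruction set; and the instruction converter converts the plurality of computer instructions belonging to the third instruction set into the plurality of internal instructions belonging to the second instruction set according to that mapping.
Optionally, a first thread instruction sequence and a second thread instruction sequence run on the system, where the first thread instruction sequence consists of a plurality of computer instructions of the first instruction set and the second thread instruction sequence consists of a plurality of computer instructions of the third instruction set; the instruction converter is configured so that its memory simultaneously stores the mapping between the first and second instruction sets and the mapping between the third and second instruction sets; and the instruction converter selects one of the two mappings according to the thread number and converts that thread's plurality of computer instructions into the plurality of internal instructions belonging to the second instruction set.
Optionally, in the system, each of the plurality of computer instructions contains at least one instruction field holding an instruction type; each of the plurality of internal instructions contains at least one instruction field holding an instruction type; the plurality of computer instructions and the plurality of internal instructions correspond one to one; and the mapping includes the mapping between the instruction type of each computer instruction and that of each internal instruction, and the mapping between the instruction fields other than the instruction type in each computer instruction and those in each internal instruction.
Optionally, in the system, each of the plurality of computer instructions contains at least one instruction field holding an instruction type; each of the plurality of internal instructions contains at least one instruction field holding an instruction type; the total numbers of the plurality of computer instructions and of the plurality of internal instructions are not equal; and each of the plurality of computer instructions is mapped to one or more of the plurality of internal instructions.
Optionally, in the system, the mapping includes shift logic; an instruction field of at least one of the plurality of internal instructions is produced by shifting the corresponding instruction field of the corresponding computer instruction.
Optionally, in the system, the instruction fields of the computer instruction include at least an instruction type; the instruction converter uses at least the instruction type to address the memory in the instruction converter and read out the corresponding mapping.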
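The present invention also proposes a method for a processor system that executes one or more instruction sets, the method comprising: storing a plurality of computer instructions belonging to a first instruction set in a first memory; converting, by an instruction converter, the plurality of computer instructions into a plurality of internal instructions belonging to a second instruction set; storing the plurality of internal instructions produced by the instruction converter in a second memory; and reading and executing the plurality of internal instructions from the second memory, by a processor core connected to the second memory, without accessing the plurality of computer instructions and without involvement of the instruction converter.
Optionally, in the method, the instruction converter is configured by storing the mapping between the first and second instruction sets in the memory of the instruction converter; the instruction converter converts the plurality of computer instructions belonging to the first instruction set into the plurality of internal instructions belonging to the second instruction set according to the mapping stored in it.
Optionally, in the method, an address translator connecting the instruction converter and the processor core translates a target computer instruction address among the plurality of computer instructions into the internal address of the target instruction among the plurality of internal instructions.
Optionally, in the method, when the address translator translates an address: the target computer instruction address is mapped to an internal instruction block address; the target computer instruction address is mapped to the in-block offset address of the internal instruction within the instruction block corresponding to that block address; and the block address and the in-block offset address are merged to form the internal address.
Optionally, in the method, the block address is produced by mapping according to the block address mapping between computer instruction block addresses and internal instruction block addresses.
Optionally, in the method, the address translator stores the block address mapping, and hardware logic produces the in-block offset address by mapping according to a mapping table.
Optionally, the method further comprises: storing, in an end flag memory, the internal instruction address of the end instruction of an internal instruction block; the end instruction is the last internal instruction before the transfer to the internal instruction block that is next in address order.
Optionally, the method further comprises: storing, in a next-block address memory, the block address of the internal instruction block that is next in address order; and storing, in a branch target buffer, the internal instruction addresses of branch targets.
Optionally, in the method, a plurality of computer instructions belonging to a third instruction set is stored in the first memory; the instruction converter is configured so that its memory stores the mapping between the third instruction set and the second instruction set; and the instruction converter converts the plurality of computer instructions belonging to the third instruction set into the plurality of internal instructions belonging to the second instruction set according to that mapping.
Optionally, in the method, a first thread instruction sequence and a second thread instruction sequence are run, where the first thread instruction sequence consists of a plurality of computer instructions of the first instruction set and the second thread instruction sequence consists of a plurality of computer instructions of the third instruction set; the instruction converter is configured so that its memory simultaneously stores the mapping between the first and second instruction sets and the mapping between the third and second instruction sets; and the instruction converter selects one of the two mappings according to the thread number and converts that thread's plurality of computer instructions into the plurality of internal instructions belonging to the second instruction set.
Optionally, in the method, each of the plurality of computer instructions contains at least one instruction field holding an instruction type; each of the plurality of internal instructions contains at least one instruction field holding an instruction type; the plurality of computer instructions and the plurality of internal instructions correspond one to one; and the mapping includes the mapping between the instruction type of each computer instruction and that of each internal instruction, and the mapping between the instruction fields other than the instruction type in each computer instruction and those in each internal instruction.
Optionally, in the method, each of the plurality of computer instructions contains at least one instruction field holding an instruction type; each of the plurality of internal instructions contains at least one instruction field holding an instruction type; the total numbers of the plurality of computer instructions and of the plurality of internal instructions are not equal; and each of the plurality of computer instructions is mapped to one or more of the plurality of internal instructions.
Optionally, in the method, an instruction field of at least one of the plurality of internal instructions is produced by shifting the corresponding instruction field of the corresponding computer instruction.
Optionally, in the method, the instruction fields of the computer instruction include at least an instruction type; the instruction converter uses at least the instruction type to address the memory in the instruction converter and read out the corresponding mapping.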
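Those skilled in the art may also understand and appreciate other aspects of the present invention in light of the description, claims and drawings.
Advantageous Effects
In the processor system of the present invention, the cache closest to the processor core (the higher-level cache) stores instructions of the internal instruction set that the processor core itself supports, while main memory or the lower-level cache stores instructions of the external instruction set. By configuring the converter, the corresponding external instruction set can be converted into the internal instruction set for execution by the processor core. The instruction sets supported by the processor system can therefore be extended easily.
According to the program execution flow and feedback from the processor core's execution of instructions, the present invention supplies internal instructions to the processor core directly from the higher-level cache, which reduces pipeline depth and improves pipeline efficiency. In particular, fewer pipeline cycles are wasted when a branch is mispredicted.
Other advantages and applications of the present invention will be apparent to those skilled in the art.
Description of Drawings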
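FIG. 1 is a schematic diagram of the processor system of the present invention;
FIG. 2 is an embodiment of the converter of the present invention;
FIG. 3A is an embodiment of the aligner of the present invention;
FIG. 3B is an embodiment of the operation of the aligner;
FIG. 4A is an embodiment of the extractor of the present invention;
FIG. 4B is an embodiment of the operation of the extractor;
FIG. 5A is a schematic diagram of the mapping information of the present invention;
FIG. 5B is another schematic diagram of the mapping information;
FIG. 5C is an embodiment of the operation of the mapping information memory;
FIG. 5D is another embodiment of the operation of the mapping information memory;
FIG. 5E is another embodiment of the operation of the mapping information memory;
FIG. 5F is an embodiment of the instruction splicer of the present invention;
FIG. 6 is an embodiment of the processor system containing a multi-level cache;
FIG. 7A is an embodiment of the track-table-based cache structure of the present invention;
FIG. 7B is an embodiment of the scan converter of the present invention;
FIG. 8A is a schematic diagram of the correspondence between external instruction blocks and internal instruction blocks;
FIG. 8B is an embodiment of the storage form of the offset address mapping;
FIG. 8C is an embodiment of the offset address converter of the present invention;
FIG. 8D is an embodiment of the block address mapping module of the present invention;
FIGS. 9A to 9F are schematic diagrams of the operation of the processor system containing a multi-level cache;
FIG. 10A is an embodiment of the operand stack of the present invention;
FIG. 10B is an embodiment of updating the stack bottom;
FIG. 10C is another embodiment of updating the stack bottom;
FIG. 11A is another embodiment of the track-table-based cache structure of the present invention;
FIG. 11B is an embodiment of the present invention supporting speculative execution;
FIG. 12 is an embodiment of the processor system containing a configurable converter;
FIG. 13A is a block diagram embodiment of the configurable converter;
FIG. 13B is an embodiment of the memory in the configurable converter;
FIG. 13C is another embodiment of the memory in the configurable converter;
FIG. 14 is an embodiment of the processor system containing a configurable converter and an address mapping module;
FIG. 15 is another embodiment of the processor system containing a configurable converter and an address mapping module;
FIG. 16 is an embodiment of the processor system containing a branch target table;
FIG. 17 is another embodiment of the processor system, containing a branch target table and a tracker;
FIG. 18A is an embodiment of the format of the next-block address memory;
FIG. 18B is another embodiment of the format of the next-block address memory;
FIG. 18C is a schematic diagram of the external instruction address format in the processor system with two storage levels;
FIG. 19 is an embodiment of the processor system containing two levels of instruction memory;
FIG. 20 is a schematic diagram of the tag memory structure in the processor system with two storage levels;
FIG. 21 is an embodiment of the instruction memory storing internal instructions when external instruction boundaries are not aligned;
FIG. 22 is another embodiment of the block address mapping module of the present invention;
FIG. 23 is an embodiment of the processor system containing a track table;
FIG. 24 is an embodiment of the processing system that implements stack operations using a register file.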
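Best Mode for Carrying Out the Invention
FIG. 14 shows the best mode of the present invention.
Embodiments of the Invention
The high-performance cache system and method proposed by the present invention are described in further detail below with reference to the drawings and specific embodiments. The advantages and features of the invention will become clearer from the following description and the claims. Note that the drawings are greatly simplified and not drawn to precise scale; they serve only to illustrate the embodiments conveniently and clearly.
Note that, to explain the invention clearly, several embodiments are given to illustrate its different implementations; these embodiments are illustrative, not exhaustive. In addition, for brevity, content already covered in an earlier embodiment is often omitted in later ones, so for content not mentioned in a later embodiment, refer to the earlier embodiments.
Although the invention can be extended through modifications and substitutions in many forms, the specification sets out some specific embodiments and describes them in detail. It should be understood that the inventor's intent is not to limit the invention to the specific embodiments described; on the contrary, the intent is to protect all improvements, equivalent transformations and modifications made within the spirit or scope defined by the claims. The same component numbers may be used throughout the drawings to denote the same or similar parts.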
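In the present invention, an instruction address (Instruction Address) refers to the storage address of an instruction in main memory, that is, the address at which the instruction can be found in main memory. For simplicity, the virtual address is assumed here to equal the physical address; the method of the present invention also applies where address mapping is required. In the present invention, the current instruction may refer to the instruction currently being executed or fetched by the processor core, and the current instruction block may refer to the instruction block containing the instruction currently being executed by the processor.
For ease of description, in this specification the term 'external instruction set (Guest Instruction Set)' denotes the instruction set of the programs executed by the processor system of the present invention, and the instructions it contains are 'external instructions'; the term 'internal instruction set (Host Instruction Set)' denotes the instruction set that the processor core in the processor system itself supports, and the instructions it contains are 'internal instructions'; the term 'instruction block' denotes a group of consecutive instructions whose instruction addresses share the same high-order bits; the term 'instruction field' denotes a contiguous region (Field) of the instruction word that represents one item of content, such as a first opcode (Op-code) field, second opcode field, first source register (Source Register) field, second source register field, target register (Target Register) field or immediate field. In addition, in the present invention the internal instruction set is a fixed-length instruction set, that is, the word length of every target instruction is fixed (for example, 32 bits), whereas the external instruction set may be either fixed-length or variable-length. If the external instruction set is variable-length and the bytes occupied by a variable-length external instruction do not all share the same high-order address bits, that is, the instruction spans two instruction blocks, then that external instruction is treated as the last instruction of the earlier block and the instruction following it as the first instruction of the later block.
In the present invention, a branch instruction (Branch Instruction) or branch point (Branch Point) refers to any suitable form of instruction that can cause the processor core to change the execution flow (Execution Flow), for example by executing instructions or micro-operations out of sequence. The branch instruction address is the address of the branch instruction itself and consists of an instruction block address and an instruction offset address. The branch target instruction is the instruction to which a taken branch transfers control, and the branch target instruction address is the address of that instruction.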
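According to the technical solution of the present invention, each external instruction is first converted into one or more internal instructions, or several external instructions are converted into one or more internal instructions, which the processor core then executes, achieving the same function as executing the external instructions directly. Refer to FIG. 1, a schematic diagram of the processor system of the present invention. The memory 103 stores the executable code of the program to be executed, and this code consists of instructions of the external instruction set; each external instruction is first sent to the converter 200 and converted into the corresponding one or more internal instructions, which are then sent to the processor core 101 for execution. The converter 200 may have a fixed structure, supporting conversion of only a specific external instruction set into the internal instruction set, or it may be configurable, converting one or more external instruction sets into the internal instruction set according to its configuration. The fixed-structure converter can be regarded as a special case of the configurable converter, so this specification describes only the configurable converter.
Refer to FIG. 2, an embodiment of the converter of the present invention. In this embodiment the converter 200 consists of a memory 201, an aligner 203, an extraction array 205, an instruction splicer 207 and an opcode splicer 209. The aligner 203 shifts and aligns external instructions and, when an external instruction crosses an instruction block boundary, shifts it into a single instruction block and aligns it.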
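Refer to FIG. 3A, an embodiment of the aligner of the present invention. In this embodiment the aligner 203 consists of a controller 301, buffers 303 and 305, and a cyclic shifter 307. It is assumed here that the length of an external instruction is measured in bytes and that one instruction block can hold all the bytes of the longest external instruction. This embodiment therefore uses two buffers that store two consecutive instruction blocks. The external instruction being processed may then lie entirely within the instruction block in buffer 303, or it may cross the block boundary (its head at the end of the instruction block in buffer 303 and the remainder at the start of the instruction block in buffer 305). The selectors 312, 314, 316, 318 and 320 each correspond to one byte, in order from left to right, and under the control of the decoder 327 each selects, for its byte, the content of buffer 303 or 305 to send to the input of the cyclic shifter 307.
The controller 301 contains a register 321 and an adder 323, each m bits wide, where 2^m equals the byte width of the buffers 303 and 305. Register 321 holds the start offset address (SA, Start Address) of the external instruction currently being converted. After being encoded by the encoder 327, the SA serves as the select signal for the output selectors 312, 314, 316, 318 and 320 of buffers 303 and 305, which select from buffer 303 the bytes whose offset address is greater than or equal to SA and from buffer 305 the bytes whose offset address is less than SA, and send them together to the cyclic shifter 307. The SA is also sent over bus 313 to the cyclic shifter 307 as the shift amount (Shift Amount).
In the input to the cyclic shifter 307, the part whose offset address is greater than or equal to SA is thus the head 353 of the external instruction, and the part whose offset address is less than SA is its tail 355, possibly followed by part of the next external instruction. The cyclic shifter 307 therefore rotates left by the shift amount received on bus 313 (that is, SA), which moves the head 353 of the external instruction to the start of the instruction block and places the tail 355 immediately to its right in the same instruction block, and outputs the resulting instruction block.
Once the external instruction in this instruction block has been examined, its length is sent out by the memory 201 on bus 325. The length goes to the adder 323 in the controller 301, where it is added to the shift amount on bus 313; the result is the start offset address SA of the next external instruction and is stored in register 321. If the carry output of the adder 323 is '0', the next external instruction starts in buffer 303 and can be aligned directly as described above. If the carry output of the adder 323 is '1', the next external instruction starts in buffer 305. In that case, under control of the carry output, the content of buffer 305 is filled into buffer 303 and a new following instruction block is filled into buffer 305, so that the start of the next external instruction again lies in buffer 303, and alignment proceeds as described above.
Refer to FIG. 3B, an embodiment of the operation of the aligner. The external instruction 351 crosses an instruction block boundary: its head 353 lies in instruction block 357 and its tail 355 in instruction block 359. According to the technical solution of the present invention, instruction blocks 357 and 359 are stored in buffers 303 and 305 respectively and, after selection and splicing by the selectors, form instruction block 361 as the input to the cyclic shifter 307. Instruction block 361 consists of three parts, from left to right: the tail 355 of external instruction 351, a part 363 of the instruction following external instruction 351, and the head 353 of external instruction 351.
The shifter 307 rotates left using the offset address 313 of the first byte of the head 353 within the instruction block as the shift amount, so that the start of external instruction 351 is aligned with the start of the instruction block. In this embodiment the rotated instruction block 365 contains, besides external instruction 351, the part 363 of its following instruction; this part has no effect on subsequent operations and can be ignored.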
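
The aligner's byte selection, left rotation and start-address update can be sketched as follows. This is a minimal model under stated assumptions rather than the circuit itself: buffers 303 and 305 are modelled as equal-length `bytes` objects, the function names `align` and `next_start` are hypothetical, and an instruction is assumed to be no longer than one block so that a single carry bit is enough.

```python
def align(buf_a: bytes, buf_b: bytes, sa: int) -> bytes:
    """Return the block in which the instruction starting at offset `sa` is aligned.

    buf_a models buffer 303 (current block), buf_b models buffer 305 (next block).
    """
    width = len(buf_a)
    if len(buf_b) != width or not 0 <= sa < width:
        raise ValueError("buffers must have equal length and sa must be in range")
    # Per-byte selectors: offsets >= sa come from buf_a, offsets < sa from buf_b.
    selected = bytes(buf_a[i] if i >= sa else buf_b[i] for i in range(width))
    # Cyclic left shift by sa bytes brings the instruction head to position 0.
    return selected[sa:] + selected[:sa]


def next_start(sa: int, length: int, width: int):
    """Model of the adder: new start offset and the carry output."""
    total = sa + length
    return total % width, total >= width


aligned = align(b"ABCDEFGH", b"IJKLMNOP", 6)  # instruction head is "GH"
new_sa, carry = next_start(6, 4, 8)           # a 4-byte instruction at offset 6
```

A carry of `True` corresponds to the case in which buffer 305 is copied into buffer 303 and a new block is loaded into buffer 305 before the next alignment.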
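Returning to FIG. 2, the external instruction shifted and aligned by the aligner 203 is sent to the extraction array 205, where its instruction fields are extracted according to the instruction type. The extraction array 205 consists of several extractors of identical structure. The number of extractors is greater than or equal to the maximum number of instruction fields in any instruction of the external instruction set. If, across all external instruction sets supported by the processor system of the present invention, an instruction contains at most n instruction fields, the extraction array 205 consists of n extractors; every extractor receives the same external instruction as input and outputs the information to be extracted according to the control signals sent by the memory 201.
Refer to FIG. 4A, an embodiment of the extractor of the present invention. In this embodiment the extractor consists of a cyclic shifter 401 and a masker 403. The cyclic shifter 401 rotates the input external instruction word by the shift amount it receives, moving a particular instruction field to the required position. The masker 403 then performs a bitwise AND (Bit AND) of the shifted instruction with a mask word, so that every part of the extractor's output other than that instruction field is all '0'. In this way an instruction field of the external instruction can be moved to the position of the corresponding field in the internal instruction.
Refer to FIG. 4B, an embodiment of the operation of the extractor, which illustrates the shifting and masking of the instruction field 453 in the external instruction 451. The shift amount of the cyclic shifter 401 equals the difference between the field's positions in the external and internal instructions. For example, if the instruction field 453 occupies bits 10, 11 and 12 of the external instruction 451 and must occupy bits 6, 7 and 8 of the corresponding internal instruction, the shift is a left shift by 4 bits ('10' minus '6'). After shifting by the cyclic shifter 401, the external instruction 451 takes the form of the shifted instruction 455 in FIG. 4B.
In this embodiment, because the field occupies bits 6, 7 and 8 of the internal instruction, bits 6, 7 and 8 of the mask word 457 are '1' and all other bits are '0'. The shifted instruction 455 is ANDed bitwise with the mask word 457 in the masker 403 to produce the extractor output, shown as extractor output 459 in FIG. 4B.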
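
The rotate-then-mask behaviour of a single extractor can be sketched as follows, using the numbers of the FIG. 4B example. Two assumptions are made that the text does not state outright: the word is 32 bits wide, and bit 0 is the leftmost bit (which is what makes a move from bit 10 to bit 6 a left shift). The helper names `rotate_left`, `place` and `extract` are hypothetical.

```python
WIDTH = 32                      # assumed instruction word length in bits
ALL_ONES = (1 << WIDTH) - 1


def rotate_left(word: int, amount: int) -> int:
    """Cyclically shift a WIDTH-bit word left by `amount` bits."""
    amount %= WIDTH
    return ((word << amount) | (word >> (WIDTH - amount))) & ALL_ONES


def place(value: int, first_bit: int, nbits: int) -> int:
    """Put `value` into bits first_bit..first_bit+nbits-1, bit 0 being leftmost."""
    return (value & ((1 << nbits) - 1)) << (WIDTH - first_bit - nbits)


def extract(word: int, shift: int, mask: int) -> int:
    """Model of one extractor: cyclic shifter followed by a bitwise-AND masker."""
    return rotate_left(word, shift) & mask


# Field value 0b101 sits in bits 10..12 of the external word, among other bits.
external = place(0b101, 10, 3) | place(0b11, 0, 2) | place(0b1111, 28, 4)
mask = place(0b111, 6, 3)       # ones in bits 6..8, zeros elsewhere
result = extract(external, 4, mask)
```

After the call, `result` holds the field in bits 6 to 8 and zeros everywhere else, so the outputs of several extractors can later be combined without interfering with one another.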
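Returning to FIG. 2, some of the extractors in the extraction array 205 extract the opcode fields of the external instruction and the others extract its remaining instruction fields. For example, assuming instructions of the external instruction set have at most three opcode fields, extractors 211, 213 and 215 in the extraction array 205 extract the opcode fields (and are called opcode extractors), and the remaining extractors (such as extractors 221, 223, 225 and 227) extract the other instruction fields (and are called other-field extractors). The opcodes extracted by extractors 211, 213 and 215 are shifted to different, non-overlapping positions and sent to the opcode splicer 209, which ORs them bitwise (Bit OR) to obtain the complete opcode. The complete opcode is sent to the memory 201 as the addressing address.
When an extractor is used to extract an opcode field, its control signals (such as the shift amount and mask word) all come from a corresponding register. In FIG. 2, for example, the control signals in register 212 are selected by selector 222 to control extractor 211; those in register 214 are selected by selector 224 to control extractor 213; and those in register 216 are selected by selector 226 to control extractor 215.
When an extractor is used to extract another instruction field, its control signals all come from the memory 201. The memory 201 consists of rows of mapping information and is divided into a direct access area and an indirect access area. Each row of mapping information corresponds to one addressing address. Because each addressing address corresponds to one complete internal instruction opcode, one or more rows of mapping information correspond to one or more external instructions of the external instruction set and store the corresponding extraction information. The extraction information includes the opcode of the internal instruction corresponding to the external instruction, the start position and width of each instruction field of the external instruction other than the opcode fields, and the positional relationship between those fields and the fields of the corresponding internal instruction.
In the present invention, the direct access area of the memory 201 can be addressed directly by the opcode of the external instruction to find the corresponding row of mapping information. Specifically, the complete opcode output by the opcode splicer 209 can itself be used as the address into the direct access area to read out the mapping information in the corresponding row. The indirect access area of the memory 201, by contrast, must be accessed using an index value (row address information) held in another row of mapping information. For example, when one external instruction corresponds to several internal instructions, the complete opcode of the external instruction is first used as the address to read from the direct access area the mapping information for the first of those internal instructions, from which the first internal instruction is produced. That mapping information contains the index value, in the indirect access area, of the mapping information for the second internal instruction. Using the index value, the mapping information for the second internal instruction is found in the indirect access area and the second internal instruction is produced. This repeats until the last of the internal instructions has been produced.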
请参考图 5A ,其为本发明所述映射信息的一个示意图。图 5A 所示的映射信息一行即对应一条外部指令,即该外部指令对应一条内部指令。映射信息 501 由内部指令操作码 503 、外部指令长度 505 、若干个提取器配置信息(如提取器配置信息 507 、 509 、 511 、 513 )和结束标志 515 构成。其中,内部指令操作码 503 就是所述外部指令对应的内部指令的操作码。外部指令长度 505 是所述外部指令本身的指令字长度,被送到对齐器 203 作为外部指令长度值 325 与当前指令起始点相加用于计算下一外部指令的起始点。结束标志 515 中存储了全' 0 ',用以表示该行是外部指令对应的最后一行内部指令映射信息。
映射信息 501 中提取器配置信息的数目与提取器的个数相同,且一一对应。每个提取器配置信息由三部分构成:移位位数( R )、掩码值中' 1 '的开始位置( B )和掩码值中' 1 '的个数( W )。其中,移位位数 R 被送到相应提取器用于控制循环移位器 401 的移位;开始位置 B 和个数 W 则被用于确定掩码值中' 1 '的位置,即从 B 开始的连续 W 个掩码位的值为' 1 ',其余掩码位的值为' 0 '。
请参考图 5B ,其为本发明所述映射信息的另一个示意图。图 5B 所示的映射信息多行对应一种外部指令,即该外部指令对应多条内部指令。在此以一条外部指令对应三条内部指令为例,这三条内部指令相应信息对应的映射信息分别为映射信息 551 、 561 和 571 。其中映射信息 551 位于存储器 201 中的直接访问区可由外部指令中提取的操作码译码后直接寻址的地址区域。而映射信息 561 和 571 位于存储器 201 中的间接访问区,必须根据直接访问区内映射信息(如映射信息 551 )中存储的索引值寻址访问。与图 5A 中的映射信息 501 类似,映射信息 551 也由内部指令操作码 503 、外部指令长度 505 、若干个提取器配置信息(如提取器配置信息 507 、 509 、 511 、 513 )和结束标志 515 构成。而映射信息 561 和 571 中也包含了内部指令操作码 503 、若干个提取器配置信息(如提取器配置信息 507 、 509 、 511 、 513 )和结束标志 515 ,但可以不包含外部指令长度 505 。其中,三行映射信息中的内部指令操作码 503 分别对应了所述外部指令对应的三条内部指令的操作码。映射信息 551 中的外部指令长度 505 是所述外部指令本身的指令字长度,被送到对齐器 203 作为外部指令长度值 325 用于计算下一外部指令的起始点。在本实施例中,映射信息 551 和 561 的结束区标示并非结束而是指向下一条映射信息的地址。对于其他情况也可以此类推。映射信息 551 和 561 的指令字长度 505 中各存储了指向后续映射信息的索引。即,映射信息 551 的指令字长度 505 中存储了映射信息 561 在存储器 201 中的索引值,映射信息 561 的指令字长度 505 中存储了映射信息 571 在存储器 201 中的索引值。而映射信息 571 中作为多条内部指令对应一条外部指令的最后一条内部指令信息,其指令字长度 505 才给出该外部指令的指令长度。作为所述外部指令对应的最后一行映射信息,映射信息 571 的结束标志 515 中存储了全' 0 '。这样,根据操作码提取器提取到的完整操作码可以找到第一行映射信息,之后在各行映射信息的结束标志 515 的控制下,存储器 201 可以正确地输出一条外部指令对应的所有内部指令的映射信息,从而正确地进行指令集转换。
回到图 2 ,对于对齐器 203 输出的任意一条外部指令,由操作码提取器提取出的完整操作码作为寻址地址可以从存储器 201 中读出相应的内部指令操作码 503 经总线 230 送往指令拼接器 207 ,及读出对应该外部指令各个指令域的提取信息并分别送到各个其他域提取器。每个其他域提取器根据所述提取信息中的域起始位置、域宽度及移位位数信息,将外部指令相应的指令域移至特定位置,并进行掩码操作,使得其他域提取器的输出除了所述移位后的指令域之外均为' 0 '。
这样,所述外部指令中除操作码域以外的所有指令域在各个其他域提取器中被移至内部指令所需的指令域后,输出到指令拼接器 207 进行按位或操作,并拼接在存储器 201 输出的内部指令操作码之后,形成符合内部指令集格式的内部指令。该内部指令被送往处理器核执行,从而实现相应的外部指令的功能。
请参考图 5C 、 5D 和 5E ,其为本发明所述映射信息存储器运行的三个实施例。在这些实施例中,存储器 201 分为直接访问区 531 和间接访问区 533 。其中,间接访问区的地址比直接访问区高,例如外部指令操作码所形成的地址为 n 位,存储器 201 的地址为 n+1 位。当地址的最高位为' 0 '访问直接访问区 531 ,当地址的最高位为' 1 '访问间接访问区 533 。
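Each row of mapping information in memory 201 contains a two-bit end flag (formed by the Y bit and the Z bit in the figures) that indicates the conversion relationship between the external and internal instructions associated with that row: one external instruction to one internal instruction, one external instruction to multiple internal instructions, or multiple external instructions to one internal instruction. The flag thereby controls how the converter handles the next instruction. Specifically, the value '00' of flag 535 in FIG. 5C indicates that the row corresponds only to the current external instruction, i.e., one external instruction corresponds to one internal instruction; the value '10' of flag 545 in FIG. 5D indicates that the row, together with the mapping information pointed to by the index value in the row, corresponds to the current external instruction, i.e., one external instruction corresponds to multiple internal instructions; the value '01' of flag 555 in FIG. 5E indicates that the row corresponds not only to the current external instruction but also to the external instruction that follows it, i.e., multiple external instructions correspond to one internal instruction. The Y bit of the flag indicates whether conversion may move on to the next external instruction. If Y is '0', conversion of the current external instruction (or of several consecutive external instructions including the current one) has finished, and conversion of the next external instruction begins in the next cycle. If Y is '1', conversion of the current external instruction has not finished; the next cycle continues the same conversion, and conversion of the next external instruction cannot begin.
In this embodiment, the flag is stored in register 537, and the index value in the same row of mapping information is stored in register 539, for use when the next instruction is converted. The flag of the previous external instruction held in register 537 can then be used, while the current external instruction is processed, to control selector 541 (controlled by the Y bit of the flag) and address concatenation logic 543 (controlled by the Z bit of the flag).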
图中寄存器 537 的 Y 输出控制一个二路选择器。当 Y 值为' 0 '时,选择来自于从外部指令中操作码作为存储器 201 的地址,当 Y 值为' 1 '时,选择存在寄存器 539 中的从存储器 201 中前一条指令的索引值作为当前指令转换时的存储器 201 地址。 Z 值是作为一个地址高位拼接到来自于从外部指令中操作码而形成的地址。当 Z 值为' 0 '时,存储器 201 上的地址指向直接访问区,当 Z 值为' 1 '时,存储器 201 上的地址指向间接访问区。图中圆圈表示总线拼接。
在图 5C 实施例中,前一外部指令对应的结束值 YZ 为' 00 ',且该内部指令可以根据相应的映射信息按之前所述方法产生,那么当前外部指令应对应至少一条新的内部指令。此时地址拼接逻辑 543 的一个输入是来源于操作码拼接器 209 的当前外部指令的完整操作码,另一个输入是寄存器 537 中标志的 Z 位(' 0 '),即在所述完整操作码之前拼接全' 0 ',因此地址拼接逻辑 543 的输出依然是当前外部指令的完整操作码,对应直接访问区 531 的地址。而选择器 541 受标志中的 Y 位(' 0 ')控制,选择来源于或逻辑的输出作为存储器 201 的寻址地址。这样,即可从存储器 201 的直接访问区 531 中读出当前外部指令对应的映射信息,按之前所述方法对相应指令域移位及掩码,并送往指令拼接器 207 。又因为标志中的 Y 位为' 0 ',因此下一周期即可开始对下一外部指令进行转换。
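Please refer to FIG. 5F, which shows an embodiment of the instruction splicer of the present invention. Register 563 stores either an internal instruction whose conversion has finished or an intermediate result of a conversion that has not yet finished. The Z bit of the flag is stored in register 561; in the next cycle it is sent to AND logic 567 and, after inversion by inverter 565, is output as a signal indicating whether the internal instruction in register 563 has been completely converted. The other input of AND logic 567 is the value stored in register 563, and its output is sent to OR logic 569. The other input of OR logic 569 is the shifted and masked results sent from the extractors over bus 559. The output of register 563 is the output 667 of instruction splicer 207.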
对于图 5C 实施例所述情况,由于标志的 Z 位为' 0 ',因此在下一周期,与逻辑 567 的输出为' 0 ',则或逻辑 569 的输出就是各个提取器移位掩码得到的结果。这些结果在寄存器 563 中拼接成一条完整的内部指令。此时反相器 565 输出的值为' 1 '(即上述 Z 位的反相值),表示转换完毕,且寄存器 563 中存储的内容就是转换得到的内部指令。这样,就完成了一条外部指令向一条内部指令的转换,并在下一周期输出,同时转换器开始读下一条外部指令的转换。
在图 5D 实施例中,前一外部指令对应的标志值为' 10 ',表示该外部指令对应多条内部指令,且上一次映射信息对应的内部指令尚不足以完成该转换,那么在完成对前一外部指令的转换之前,不能对当前外部指令进行转换。此时,寄存器 539 中存储的就是所述上一次映射信息中包含的索引值,即所述上一次映射信息的后一个映射信息(这两个映射信息均对应所述前一外部指令)在间接访问区 533 中的寻址地址。选择器 541 则受标志中的 Y 位(' 1 ')控制,选择寄存器 539 输出的所述索引值。由于该索引值对应的存储器 201 地址空间位于间接访问区 533 ,因此可从间接访问区 533 中读出所述前一外部指令对应的映射信息,按之前所述方法对相应指令域移位及掩码,并送往指令拼接器 207 。由于标志中的 Y 位为' 1 ',因此下一周期继续对当前外部指令转换,不能开始下一外部指令的转换。
此时,在指令拼接器 207 中,标志的 Z 位为' 0 ',因此在下一周期,与逻辑 567 的输出为' 0 ',则或逻辑 569 的输出就是各个提取器移位掩码得到的结果。这些结果在寄存器 563 中拼接成一条完整的内部指令。此时反相器 565 输出的值为' 1 '(即上述 Z 位的反相值),表示转换完毕,且寄存器 563 中存储的内容就是转换得到的内部指令。这样,在一条外部指令向对应多条内部指令转换过程中,生成了所述多条内部指令中的一条,并在下一周期输出。同时,从下一周期开始,重复上述过程,直到对应的映射信息中的标志的 Y 位为' 0 ',表示该映射信息对应的内部指令是所述多条内部指令中的最后一条,并在下一周期输出该内部指令,完成了一条外部指令向多条内部指令的转换,同时转换器开始读下一条外部指令的转换。
在图 5E 实施例中,前一外部指令对应的标志值为' 01 ',表示该外部指令及其后一条外部指令(即当前外部指令)对应同一条内部指令,那么应该继续对当前外部指令进行转换,直至产生所述多条外部指令对应的同一条内部指令。此时,地址拼接逻辑 543 的一个输入是来源于操作码拼接器 209 的当前外部指令的完整操作码,另一个输入是寄存器 537 中标志的 Z 位(' 1 '),即在所述完整操作码之前拼接一个额外地址,使得地址拼接逻辑 543 的输出是对应间接访问区 533 的地址。而选择器 541 受标志中的 Y 位(' 0 ')控制,选择来源于或逻辑的输出作为存储器 201 的寻址地址。这样,即可从存储器 201 的间接访问区 533 中读出相应的映射信息,即根据所述前一外部指令及当前外部指令共同对应的映射信息。之后,按之前所述方法对相应指令域移位及掩码,并送往指令拼接器 207 。
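At this point, in instruction splicer 207, the Z bit of the flag is '1', so in the next cycle the output of AND logic 567 is the value stored in register 563 (i.e., the intermediate conversion result), and the output of OR logic 569 is the combination (e.g., bitwise OR) of the current shifted and masked results from the extractors with that intermediate result. The combined result is stored in register 563 as the new intermediate result. The output of inverter 565 is now '0' (the inverse of the Z bit), indicating that the conversion is not yet complete. Because the Y bit of the flag is '0', the converter starts converting the next external instruction and the process repeats: the shifted and masked results of the fields of several consecutive external instructions are combined through OR logic 569, turning multiple external instructions into one internal instruction, until the Z bit is '0', which indicates that the current external instruction is the last of the external instructions that correspond to that internal instruction. At that point, the output of inverter 565 is '1' (the inverse of the Z bit), indicating that the conversion is complete and that register 563 holds the resulting internal instruction. This completes the conversion of multiple external instructions into one internal instruction.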
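The three cases of FIGS. 5C, 5D, and 5E can be summarized in a small behavioural model. The sketch below is a simplification under stated assumptions: it collapses the cycle-level timing of registers 537, 539, 561, and 563 into one loop iteration per mapping row, represents the extractor outputs of a row by a single integer `fields`, and uses hypothetical names (`Row`, `convert`, `n_bits`). The Y/Z semantics follow the descriptions of FIGS. 5D and 5E in the walkthrough above (YZ = '10' for one external instruction to several internal instructions, YZ = '01' for several external instructions to one internal instruction).

```python
from dataclasses import dataclass


@dataclass
class Row:
    fields: int     # stand-in for the shifted and masked extractor outputs of this row
    y: int          # 1: the next row is fetched via `index` for the same external instruction
    z: int          # 1: the next external instruction is merged into the same internal instruction
    index: int = 0  # address of the follow-up row in the indirect access area


def convert(opcodes, memory, n_bits):
    """Return the internal instructions produced for a sequence of external opcodes."""
    out, acc = [], 0
    y = z = index = 0           # previous row's flag and index (registers 537 / 539)
    stream = iter(opcodes)
    while True:
        if y:                   # selector 541 picks the stored index value
            addr = index
        else:                   # otherwise Z becomes the top address bit of the opcode
            op = next(stream, None)
            if op is None:
                break
            addr = (z << n_bits) | op
        row = memory[addr]
        acc = (acc if z else 0) | row.fields   # AND logic 567 followed by OR logic 569
        if not row.z:                          # inverter 565: result is complete
            out.append(acc)
        y, z, index = row.y, row.z, row.index
    return out


memory = {
    0x01: Row(0x0A, 0, 0),              # one external -> one internal
    0x02: Row(0xB0, 1, 0, index=0x10),  # one external -> three internal
    0x10: Row(0xB1, 1, 0, index=0x11),
    0x11: Row(0xB2, 0, 0),
    0x03: Row(0x0C, 0, 1),              # two external -> one internal
    0x14: Row(0xC0, 0, 0),              # second half, found in the indirect area
}
result = convert([1, 2, 3, 4], memory, n_bits=4)
```

In this toy table the four external opcodes yield five internal instructions: one for opcode 1, three for opcode 2, and a single fused instruction for opcodes 3 and 4.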
需要说明的是,在本发明中,存储器 201 可以由可重复擦写的随机存储器( RAM )构成,根据所需支持的不同外部指令集向该随机存储器写入不同的映射信息;也可以由只读存储器( ROM )构成,即固定支持一种或多种外部指令集;还可以由能够实现同样功能的逻辑电路构成,固定支持一种或多种外部指令集。可以将缓冲器的一部分指定作为存储器 201 使用而不做缓存使用。
此外,如果外部指令是定长的,且规定提取器的长度等于指令字的长度,则转换器 200 中可以省去对齐器 203 。根据本发明技术方案,转换器 200 可以根据配置支持不同的外部指令集。那么当其中一种外部指令集的指令长度与提取器的长度相同时,可以由选择器 204 直接选择外部指令送往各个提取器;否则选择器 204 选择对齐器 203 的输出送往各个提取器。其他操作与之前实施例所述相同,在此不再赘述。
根据本发明技术方案,可以在处理器系统不同层次的缓存中存储不同的指令集的指令,以提高处理器系统的性能。例如,可以在处理器系统的二级缓存中存储外部指令,在一级缓存中存储内部指令,并在所述外部指令被填充到一级缓存的过程中进行指令集转换。请参考图 6 ,其为本发明所述包含多层缓存的处理器系统的一个实施例。
在 图 6 中,处理器系统由处理器核 601 、主动表 604 、扫描转换器 608 、轨道表 610 、替换模块 611 、循迹器 614 、块地址映射模块 620 、偏移地址映射模块 618 、偏移地址转换器 622 、减法器 928 、一级缓存 602 ,二级缓存 606 和选择器 640 、 660 、 680 、 638 、 692 、 694 及 696 构成。图 6 中的空心圆圈表示总线的拼接。图 6 中没有显示的还有一个控制器,该控制器接收从块地址映射模块 620 、扫描转换器 608 、主动表 604 、轨道表 610 及替换模块 611 的输出控制各功能模块的操作。
在本发明中,二级缓存 606 中存储的是外部指令,而一级缓存 602 中存储的是相应的内部指令。可以用第一地址和第二地址来表示指令在一级缓存或二级缓存中的位置信息。在此,第一地址和第二地址可以是一级缓存的寻址地址,也可以是二级缓存的寻址地址。
当一条内部指令已经被存储在一级缓存 602 中时,可以用 BN1X 表示该内部指令所在指令块的一级块号(即指向一级缓存中相应的一个一级指令块),并用 BN1Y 表示该内部指令的一级块内偏移量(即该内部指令在一级指令块中的相对位置)。当一条外部指令已经被存储在二级缓存 606 中时,可以用 BN2X 表示该外部指令所在指令块的二级块号(即指向二级缓存中相应的一个二级指令块),并用 BN2Y 表示该外部指令的二级块内偏移量(即该外部指令在二级指令块中的相对位置)。为了便于说明,可以用 BN1 代表 BN1X 和 BN1Y ,用 BN2 代表 BN2X 和 BN2Y 。由于本发明中凡在一级缓存中的内部指令对应的外部指令在二级缓存中均有存储,因此对于一级缓存中存储的内部指令,可以用 BN1 或 BN2 表示。
主动表 604 中的表项与二级缓存 606 中的存储块一一对应。主动表 604 中的每个表项存储了一个二级指令块地址与一个二级块号 BN2X 的匹配对,指明了该指令块地址对应的二级指令块存储在二级缓存 606 中的哪个存储块中。在本发明中,可以根据一个二级指令块地址在主动表 604 中进行匹配,并在匹配成功的情况下得到一个 BN2X ;也可以根据一个 BN2X 对主动表 604 寻址,以读出对应的二级指令块地址。
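When external instructions are filled from level-2 cache 606 into level-1 cache 602, scan converter 608 calculates the branch target addresses of the branch instructions among them, and the external instructions are converted into internal instructions by instruction converter 200 inside scan converter 608. Each calculated branch target address is sent to active table 604 and matched against the instruction block addresses stored there to determine whether the branch target is already stored in level-2 cache 606. If the match fails, the instruction block containing the branch target instruction has not yet been filled into level-2 cache 606; in that case, while the instruction block is filled from lower-level memory into level-2 cache 606, the corresponding pair of level-2 instruction block address and level-2 block number is established in active table 604.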
扫描转换器 608 对从二级缓存 606 向一级缓存 602 填充的指令块(外部指令)进行转换和审查,并提取出对应内部指令的轨迹点信息填充到轨道表 610 的相应表项中,从而建立该二级指令块对应的至少一个一级指令块的轨道。具体地,在建立轨道时,首先由替换模块 611 产生一个 BN1X 指向一条可用轨道。在本发明中,替换模块 611 可以根据替换算法(如 LRU 算法)确定可用轨道。
具体地,扫描转换器 608 对从二级缓存 606 填充到一级缓存 602 的每一条外部指令进行审查,并提取出某些信息,如:指令类型、指令源地址和分支指令的分支增量,并基于这些信息计算出分支目标地址。对于直接分支指令,可以通过对该指令所在指令块的块地址、该指令在指令块中的偏移量和分支增量( Branch Offset )三者相加得到分支目标地址。所述指令块地址可以是从主动表 604 中读出并被直接送往扫描转换器 608 中加法器的。也可以在扫描转换器 608 中增加用于存储当前指令块地址的寄存器,这样主动表 604 就不需要实时地送出指令块地址。在本实施例中,直接分支指令的分支目标地址由扫描转换器 608 产生,而间接分支指令的分支目标地址由处理器核 601 产生,且这两者对应的都是外部指令地址。此外,扫描转换器 608 还将所述每一条外部指令转换为对应的一条或多条内部指令,且在转换过程中不改变分支指令的分支增量,即外部分支指令中的分支增量与其对应的内部分支指令中的分支增量相等,保证处理器核 601 产生的间接分支指令的分支目标地址的正确性。
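The three-way addition used for direct branch targets can be written out as a short Python sketch. It assumes a 32-byte level-2 instruction block and treats the block address as a block index; `BLOCK_BYTES` and `branch_target` are names introduced here for illustration only.

```python
BLOCK_BYTES = 32  # assumed size of a level-2 instruction block in bytes


def branch_target(block_addr: int, offset: int, branch_offset: int):
    """Add block address, intra-block offset, and branch offset.

    Returns (target block address, target intra-block offset). The branch
    offset may be negative; floor division keeps the offset in range.
    """
    pc = block_addr * BLOCK_BYTES + offset + branch_offset
    return pc // BLOCK_BYTES, pc % BLOCK_BYTES


forward = branch_target(9132, 26, 24)   # crosses into the next block
backward = branch_target(9132, 4, -8)   # crosses into the previous block
```

The block-address part of the result is what gets matched in the active table, while the offset part becomes the BN2Y of the target.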
块地址映射模块 620 中对应每个二级缓存块的每一行有复数个表项,每个表项中存储与此二级缓存块中一部分(称为二级缓存块的子块)对应的一级缓存块的一级块号( BN1X )以及该二级缓存子块在二级缓存块内的起始偏移量( BN2Y )。其中各表项中的 BN2Y 由左向右递增排列。当一个新的表项被加入块地址映射模块 620 中的一行时,其 BN2Y 会由比较器 924 与该行上现有其他表项的 BN2Y 值比较,并由移位器 926 将 BN2Y 值大于新表项的 BN2Y 值的表项右移,空出位置供新表项存放。
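The rows of block address mapping module 620 correspond one-to-one to the rows of active table 604 and the memory blocks of level-2 cache 606, and are pointed to by the same BN2X. Block address mapping module 620 stores the correspondence between level-2 block numbers and level-1 block numbers; as shown in FIG. 6, its entry format 684 consists of a level-1 block number BN1X and a level-2 intra-block offset. Thus, for a given BN2, the BN2X selects a row of block address mapping module 620, and the BN2Y is compared with the valid BN2Y values stored in the entries of that row. Either the BN1X in the entry whose comparison succeeds is read out (i.e., the BN1X of the internal instruction corresponding to the external instruction at that BN2Y), so that the BN2X is converted into the corresponding BN1X, or the comparison fails (i.e., the internal instruction corresponding to the external instruction at that BN2Y is not yet stored in level-1 cache 602).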
本实施例中,轨道表 610 的格式为 686 或 688 。 686 由三部分构成:格式( TYPE ),二级块号( BN2X )及二级块内偏移( BN2Y )。其中格式中含指令类型地址,包括非分支指令,无条件直接分支指令,有条件直接分支指令,无条件间接分支指令,有条件间接分支指令。在此,有条件直接分支指令、无条件直接分支指令、有条件间接分支指令和无条件间接分支指令可以统称为分支指令,其对应的轨迹点为分支点。格式中还包含地址类型,其在 686 格式中是二级缓存地址 BN2 。 688 的格式也由三部分构成:格式( TYPE ),一级块号( BN1X )及一级块内偏移( BN1Y )。 688 格式中指令类型与 686 的相同,但是地址类型在 688 中固定为一级缓存地址 BN1 。本实施例中,块地址映射模块 620 中的存储器 920 的格式如 684 所示,其为一级缓存块地址 BN1X 与二级缓存块内偏移地址 BN2Y 的组合。
轨道表 610 含有复数个轨迹点( track point )。一个轨迹点是轨道表中的一个表项,可含有至少一条指令的信息,比如指令类别信息、分支目标地址等。在本发明中轨迹点本身的轨迹点地址与该轨迹点所代表指令的指令地址相关( correspond );而分支指令轨迹点中含有分支目标的轨迹点地址,且该轨迹点地址与分支目标指令地址相关。与一级缓存 602 中一系列连续内部指令所构成的一个一级指令块相对应的复数个连续的轨迹点称为一条轨道。该一级指令块与相应的轨道由同一个一级块号 BN1X 指示。轨道表含有至少一条轨道。一条轨道中的总的轨迹点数可以等于轨道表 610 中一行中的表项总数。这样,轨道表就成为一个以轨道表项地址对应分支源地址、表项内容对应分支目标地址来代表一条分支指令的表。此外,在轨道表 610 的每一行中还可以额外增加一个二级块号表项,用于记录该行第一个轨迹点对应的外部指令的 BN2 。这样,就可以在某个一级指令块被替换时,将以该行为分支目标的其他轨道表行中的该 BN1 转换为相应的 BN2 ,使该行可以被其他指令行写入而不致引起错误。
轨道表 610 中记录了程序运行的可能路径或程序执行流的可能流向,因此循迹器 614 可以根据轨道表 610 中的程序流和处理器核 601 的反馈沿程序流循迹。因为在一级缓存器 602 中存有与轨道表表项相应的内部指令,一级缓存器 602 以循迹器 614 的输出总线 631 为读地址,跟随循迹器 614 所遵循的程序流而通过总线 695 送出指令供处理器核 601 执行。轨道表 610 中某些分支目标是用二级缓存器地址 BN2 记录的,其目的是仅将需用的外部指令转换成内部指令存到一级缓存器,使得一级缓存器可以有较二级缓存器更小的容量和更快的速度。当循迹器 614 读出的表项中分支以 BN2 记录时,此时要将该 BN2 送往块地址映射模块 620 等模块匹配或扫描转换模块 608 转换获得 BN1 地址,将指令填充到一级缓存 602 中 BN1 地址,也将该 BN1 地址填回轨道表中该表项中,循迹器 614 沿该 BN1 并根据处理器核 601 反馈的指令执行结果(如:分支指令的执行结果),控制一级缓存 602 向处理器核 601 输出指令供执行。
在本发明中,可以用所述第一地址和第二地址来表示轨迹点在轨道表中的位置信息。而直接分支点的指令类型中则还可以包含分支目标寻址地址是以 BN1 表示(即分支目标为 BN1 的直接分支指令)还是以 BN2 表示(即分支目标为 BN2 的直接分支指令)的信息。当一个分支点中存储的是 BN1 时,说明该分支点的分支目标内部指令所在的内部指令块已经被存储在一级缓存 602 中由该 BN1X 指向的存储块中,且根据该 BN1Y 可以从中找到所述分支目标内部指令。当一个分支点中存储的是 BN2 时,说明该分支点的分支目标外部指令所在的外部指令块已经被存储在二级缓存 606 中由该 BN2X 指向的存储块中,且根据该 BN2Y 可以从中找到所述分支目标外部指令,但无法直接确定该分支目标外部指令对应的内部指令是否已经存储在一级缓存 602 中。
偏移地址映射模块 618 中的行与轨道表 610 中的行及一级缓存 602 中的存储块一一对应,并由同一个 BN1X 指向。偏移地址映射模块 618 用于存储二级缓存 606 中的外部指令偏移地址与一级缓存 602 中的内部指令偏移地址之间的对应关系。偏移地址转换器 622 则可以根据偏移地址映射模块 618 送来的由 BN1X 指向的映射关系(即 BN2Y 和 BN1Y 的映射关系),将接收到的 BN2Y 转换为相应的 BN1Y ,或将接收到的 BN1Y 转换为相应的 BN2Y 。
因此,当需要将 BN2 转换为 BN1 时,首先根据 BN2X 和 BN2Y ,在块地址映射模块 620 中转换得到 BN1X ,再根据偏移地址映射模块 618 中由该 BN1X 指向的行中的映射关系,将 BN2Y 转换为 BN1Y ,从而完成 BN2 向 BN1 的转换。
当需要将 BN1 转换为 BN2 时,首先从轨道表 610 中由 BN1X 指向的行中的所述额外表项中读出相应的 BN2 ,其中 BN2X 就是所述 BN1X 指向的内部指令块对应的外部指令块号, BN2Y 就是所述 BN1X 指向的内部指令块对应的外部指令在其所在外部指令块中的起始位置。根据偏移地址映射模块 618 中由该 BN1X 指向的行中的映射关系及所述起始位置,即可将 BN1Y 转换为 BN2Y ,从而完成 BN1 向 BN2 的转换。
在图 6 中,主要的总线有三类:外部指令地址总线、 BN1 总线和 BN2 总线。其中,外部指令地址总线主要有总线 657 、 683 和 675 ; BN1 总线主要有总线 631 和 693 ; BN2 总线主要有总线 633 和 687 。此外,还有其他一些总线,如 BN2X 总线 639 、 BN2Y 总线 637 ,以及映射关系总线 691 。
具体地,总线 675 上的内容是主动表 604 中由 BN2X 指向的行中存储的外部指令块地址(即二级缓存块地址)。该地址被送回扫描转换器 608 以计算直接分支指令的分支目标地址。
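The content of bus 657 is the branch target instruction address of a direct branch instruction, output by scan converter 608 when it finds a branch instruction during examination; the content of bus 683 is the branch target instruction address output by processor core 601 when it executes an indirect branch instruction. Buses 657 and 683 both use the external instruction address format. The block address part (the high-order part) is selected by selector 680 and sent over bus 681 to active table 604, where it is matched against the stored external instruction block addresses to obtain a level-2 block number BN2X; external instructions are then read from level-2 cache 606 by way of bus 671. Bus 671 carries a BN2X, which is concatenated with the BN2Y in the offset part (the low-order part) of the external instruction address on bus 657 to form a complete BN2 address that is sent to track table 610 for storage. The BN2X on bus 671 is also sent to selector 640. Selector 640 selects either bus 671 or the BN2X output by track table 610 over bus 633 and places it on bus 639, which reads one row of block address mapping module 620 for BN2-to-BN1 mapping.
Bus 637 is the output of three-input selector 638, which selects the BN2Y on bus 633, 657, or 683 and sends it to block address mapping module 620, where the corresponding BN1X is matched in the row pointed to by the BN2X on bus 639.
Bus 633 is the output of track table 610; its format can be BN1 or BN2. When the format is BN2, it is sent to block address mapping module 620 and offset address mapping module 618 to map BN2X to BN1X. The mapping also requires the BN2Y of that BN2 to be sent over bus 637 to subtractor 928, where the start address of the corresponding level-2 sub-block output by block address mapping module 620 is subtracted from it to obtain the net intra-block offset that offset address converter 622 uses to convert BN2Y into BN1Y. The BN1X and BN1Y are merged into a BN1 and written back to track table 610. The BN2X on bus 633 can also be sent to active table 604 to read out the corresponding external instruction block address, which is sent over bus 675 to scan converter 608 and, together with the BN2Y sent directly from bus 633 to scan converter 608, forms an external instruction address. In addition, the BN2X on bus 633 can be sent over bus 673 to level-2 cache 606 to read out the corresponding external instruction block.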
总线 631 是循迹器 614 的输出,其格式为 BN1 。该输出被送到一级缓存 602 作为地址以读取指令供处理器核 601 使用。
总线 693 是替换模块 611 的输出,格式为 BN1X ,其意义为向扫描转换器 608 提供下一个可用的一级块号 BN1X (或轨道号),供扫描转换器 608 填充转换所得的内部指令。总线 693 上的 BN1X 也与来自总线 657 的 BN2Y 共同放上总线 665 (及构成块地址映射模块 620 中表项内容)送往选择器 940 以供按地址顺序存放于块地址映射模块 620 。因此, 665 总线上的格式为 BN1X 和 BN2Y 。总线 693 控制一级缓存 602 的写入块地址与来自扫描转换模块 608 输出的 BN1Y 总线 669 作为写入块内偏移地址,控制将扫描转换器 608 转换得到的内部指令经总线 667 填充到一级缓存 602 。同时,总线 693 与总线 669 还共同寻址将与内部指令相应的格式(由扫描转换模块 608 经总线 687 送出)、分支目标(由总线 671 上的 BN2X 与总线 657 上的 BN2Y 拼接到总线 687 )经总线 687 同步写入轨道表 610 。
总线 687 将其上的指令类型、 BN2Y 与来自总线 671 的 BN2X 拼接成一个完整的轨迹点内容送往轨道表 610 存储。
总线 954 是块地址映射模块 620 的输出,其中的 BN1X 用于从偏移地址映射模块 618 中读取相应的偏移地址映射信息送往偏移地址转换器 622 ;其中的 BN2Y 输出送往减法器 928 与总线 633 上送来的 BN2Y 值相减,其结果被送往偏移地址转换器 622 。偏移地址转换器 622 根据输入将总线 954 上的 BN2Y 映射成 BN1Y 地址。来自总线 954 的 BN1X 地址与偏移地址转换器 622 输出的 BN1Y 地址被拼接成完整 BN1 ,经总线 685 送往三输入选择器 692 的一个输入端。
选择器 692 选择总线 685 上的 BN1 、总线 687 上的 BN2 或总线 693 上的 BN1 (其中总线 693 送来的 BN1X 被加上为' 0 '的 BN1Y 拼接成完整的 BN1 )送往轨道表 610 作为写入的轨迹点内容。
请参考图 7A ,其为本发明所述基于轨道表的缓存结构的实施例。为了便于说明,在图 7A 中只显示了部分器件或部件。如之前实施例所述,轨道表 610 的行与一级缓存 602 的存储块一一对应,且轨道表行(即轨道)中的表项(即轨迹点)数目比一级存储块中的指令数目多一个。其中,轨道的最后一个轨迹点中存储了指向顺序执行的下一轨道的位置,其余表项与一级存储块中的指令一一对应,并存储了程序执行流信息(如指令类型、分支目标地址等),且轨道中从左向右的每个轨迹点对应的地址递增。
轨道表 610 的读端口在循迹器 614 输出的读指针 631 的寻址下,输出相应轨迹点内容并放上总线 633 ,而控制器则检测所述总线 633 上的内容。
如果该内容中的指令类型是非分支指令,选择器 738 选择增量器 736 的输出使得循迹器向右移动达到下一个地址(即更大的地址)。
如果该内容中的指令类型是无条件分支,则选择器 738 选择总线 633 上的分支目标地址,使得读指针 631 转到由总线 633 上分支目标地址对应的轨迹点位置。
如果该内容中的指令类型是有条件分支,则循迹器 614 暂停更新并等待,直到处理器核 601 产生分支转移是否发生的 TAKEN 信号 635 。如果分支转移没有发生,则如之前非分支指令的做法运行,如果分支转移发生,则如之前无条件分支指令的做法运行。
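The three cases for updating the read pointer can be condensed into one selection function. The Python sketch below is a minimal model of selector 738 and incrementer 736; the string labels for instruction types and the function name `next_pointer` are assumptions made for illustration, and a BN1 is modeled as a `(BN1X, BN1Y)` tuple.

```python
def next_pointer(bn1, kind, target=None, taken=None):
    """Return the next read pointer for the track point at `bn1`.

    bn1    -- current (BN1X, BN1Y) read pointer
    kind   -- "non_branch", "unconditional", or "conditional" (hypothetical labels)
    target -- branch target (BN1X, BN1Y) stored in the track point
    taken  -- TAKEN signal from the processor core, or None if not yet available
    """
    x, y = bn1
    if kind == "non_branch":
        return (x, y + 1)            # incrementer: move right along the track
    if kind == "unconditional":
        return target                # jump to the branch target track point
    if kind == "conditional":
        if taken is None:
            return bn1               # hold until the core reports TAKEN
        return target if taken else (x, y + 1)
    raise ValueError(f"unknown instruction type: {kind}")


pointer = next_pointer((72, 1), "conditional", target=(73, 0), taken=True)
```

Holding the pointer while `taken` is unknown corresponds to the tracker pausing its update until the processor core delivers a valid TAKEN signal.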
轨道表 610 写端口对应的写入地址有两个来源,分别是选择器 694 ( BN1X )和 696 ( BN1Y )。当建立轨道时,替换模块 611 输出行地址 BN1X ,而扫描转换器 608 输出列地址 BN1Y 。当循迹器 614 读出的轨迹点内容中存储的是 BN2 时,该 BN2 被送往块地址映射模块 620 或扫描转换器 608 等产生 / 生成 BN1 ,该 BN1 需要被写回该轨迹点中(即读出、修改及写回, read modify write );当循迹器 614 读出的轨迹点内容中的指令类型是间接分支指令时,将处理器核 601 产生的间接分支目标地址送往主动表 604 、块地址映射模块 620 等产生 / 生成 BN1 ,该 BN1 也需要被写回该轨迹点中。在这两种情况下,轨道表 610 的写入地址均是当时的读出地址。
轨道表 610 写端口本身有三个来源:总线 685 、 687 和 693 ,经选择器 692 选择后作为写入内容。其中总线 685 上的值是块地址映射模块 620 和偏移地址转换器 622 输出的 BN1 ,总线 687 上的值是二级缓存地址形式( BN2 )的分支目标地址,而总线 693 上的值是将被写入轨道最后一个表项中的指向顺序执行下一轨道的 BN1 。
在本实施例中,在外部指令被转换为内部指令的同时,扫描转换器 608 审查、提取出相应信息。本实施例中,轨道表内容共有三部分:若该内部指令是非分支指令或间接分支指令,则选择器 694 选择由替换模块 611 产生的该内部指令对应的 BN1X 693 作为轨道表 610 写地址中的第一地址,选择器 696 选择扫描转换器 608 输出的该分支内部指令在其所在指令块中的块内偏移量 669 作为轨道表 610 写地址中的第二地址,将该指令类型(即非分支指令或间接分支指令)作为写入内容写入轨道表 610 中,完成该轨迹点的建立。
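If the internal instruction is a direct branch instruction, scan converter 608 calculates the branch target address. The block address in the branch target address is sent over bus 657 to active table 604 for matching. If the match succeeds, the BN2X of the matching entry is sent over buses 671 and 639 to block address mapping module 620, and the intra-block offset in the branch target address (i.e., BN2Y) is sent over buses 657 and 637 to block address mapping module 620. The row of block address mapping module 620 pointed to by the BN2X is searched for the corresponding BN1X. If a valid BN1X exists, the mapping in the row of offset address mapping module 618 pointed to by that BN1X is read out and sent to offset address converter 622, which converts the BN2Y into a BN1Y. Selector 694 selects the BN1X 693 generated by replacement module 611 for this internal instruction as the first address of the write address of track table 610, and selector 696 selects the intra-block offset 669, output by scan converter 608, of this internal branch instruction within its instruction block as the second address of the write address. The BN1X and BN1Y are merged into a BN1 that is placed on bus 685 and, after being selected by selector 692, written into track table 610 together with the extracted instruction type as the track point content, which completes the track point. The track point now contains a BN1.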
若在块地址映射模块 620 不存在该 BN2X 及 BN2Y 对应的有效 BN1X ,则选择器 694 选择由替换模块 611 产生的该内部指令对应的 BN1X 693 作为轨道表 610 写地址中的第一地址,选择器 696 选择扫描转换器 608 输出的该分支内部指令在其所在指令块中的块内偏移量 669 作为轨道表 610 写地址中的第二地址,将总线 671 上的该 BN2X 和扫描转换器 608 输出的该 BN2Y 拼接成 BN2 放上总线 687 并经选择器 692 选择后与所述提取出的指令类型一同作为轨迹点内容写入轨道表 610 中,完成该轨迹点的建立。此时该轨迹点中包含的是 BN2 。
若所述分支目标地址中的块地址在主动表 604 中不成功,表示该分支目标地址对应的外部指令尚未存储在二级缓存 606 中,则根据替换算法(如 LRU 算法)分配一个二级存储块的块号 BN2X ,并将该分支目标地址送往更低层次的存储器取回相应指令块存储到二级缓存 606 由所述 BN2X 指向的存储块中。选择器 694 选择由替换模块 611 产生的该内部指令对应的 BN1X 693 作为轨道表 610 写地址中的第一地址,选择器 696 选择扫描转换器 608 输出的该分支内部指令在其所在指令块中的块内偏移量 669 作为轨道表 610 写地址中的第二地址,直接将该 BN2X 及所述分支目标地址中的块内偏移地址(及 BN2Y )合并为 BN2 放上总线 687 并经选择器 692 选择后与所述提取出的指令类型一同作为轨迹点内容写入轨道表 610 中,完成该轨迹点的建立。此时该轨迹点中包含的是 BN2 。
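During the above process, the first address (BN1X) of the write address of track table 610 also points, over bus 745, to the corresponding row of offset address mapping module 618, so that the mapping between each internal instruction block and its external instructions is stored in that row. In addition, if the external instructions being converted and filled produce more internal instructions than one level-1 memory block can hold, the excess is filled in order into the level-1 memory block pointed to by a new BN1X generated by replacement module 611, and the corresponding track is established. By repeating this process, instructions are converted and filled from the level-2 cache into the level-1 cache and the corresponding tracks are built.
Tracker 614 consists of register 740, incrementer 736, and selector 738. Its read pointer 631 (i.e., the output of register 740) points to the track point in track table 610 that corresponds to the instruction processor core 601 is about to execute (the current instruction), and the track point content is read out and sent over bus 633 to selector 738. At the same time, read pointer 631 addresses level-1 cache 602 to read out the current instruction and send it to processor core 601 for execution.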
若所述轨迹点内容中的指令类型显示该指令为非分支指令,则选择器 738 选择来源于增量器 736 的对寄存器 740 的值增一的结果作为输出送回寄存器 740 ,使得下一周期寄存器 740 的值增一,即读指针 631 指向下一个轨迹点并从一级缓存 602 中读出对应内部指令供处理器核 601 执行。
若所述轨迹点内容中的指令类型显示该指令为分支目标为 BN1 的无条件直接分支指令,则选择器 738 选择该 BN1 作为输出送回寄存器 740 ,使得下一周期寄存器 740 的值被更新为该 BN1 ,即读指针 631 指向分支目标内部指令对应的轨迹点并从一级缓存 602 中读出该分支目标内部指令供处理器核 601 执行。
若所述轨迹点内容中的指令类型显示该指令为分支目标为 BN1 的有条件直接分支指令,则选择器 738 根据处理器核执行该分支指令时产生的表示分支转移是否发生的 TAKEN 信号 635 进行选择,同时暂停寄存器 740 的更新直至处理器核 601 送来有效的 TAKEN 信号 635 。此时,若 TAKEN 信号 635 的值为' 1 ',表示分支转移发生,选择轨道表输出的 BN1 作为送回寄存器 740 ,使得下一周期寄存器 740 的值被更新为该 BN1 ,即读指针 631 指向分支目标内部指令对应的轨迹点并从一级缓存 602 中读出该分支目标内部指令供处理器核 601 执行。若 TAKEN 信号 635 的值为' 0 ',表示分支转移没有发生,则选择增量器 736 的对寄存器 740 的值增一的结果作为输出送回寄存器 740 ,使得下一周期寄存器 740 的值增一,即读指针 631 指向下一个轨迹点并从一级缓存 602 中读出对应内部指令经总线 695 供处理器核 601 执行。
若所述轨迹点内容中的指令类型显示该指令为分支目标为 BN2 的直接分支指令(包括有条件、无条件两种情况),则该 BN2 被送往块地址映射模块 620 。在块地址映射模块 620 中,若存在该 BN2 对应的有效 BN1X ,则输出该 BN1X ,且由偏移地址转换器 622 将该 BN2 中的 BN2Y 转换为对应的 BN1Y ,并将所述 BN1X 和 BN1Y 合并为 BN1 放上总线 685 。此时,选择器 694 选择读指针 631 值(即分支指令本身对应的分支点 BN1 )中的 BN1X 作为写地址中的第一地址,选择器 696 选择读指针 631 值中的 BN1Y 作为写地址中的第二地址,选择器 692 选择总线 685 上的 BN1 作为写入内容回该分支点中。若不存在该 BN2 对应的有效 BN1X ,则由替换模块 611 产生一个 BN1X ,在轨道表 610 (及一级缓存 602 )中指定一条可用轨道(及对应的一个存储块)。同时,将二级缓存 606 中从所述 BN2 对应的外部指令开始直至其所在二级指令块结束的所有外部指令经扫描转换器 608 转换及审查,提取出对应内部指令的轨迹点信息填充到轨道表 610 中由所述 BN1X 指向的行,并将产生的 BN1X 和 BN2X 之间的映射关系存储到偏移地址映射模块 618 中,以及将转换得到的内部指令填充到一级缓存 602 中由所述 BN1X 指向的存储块中。需要说明的是,由于从分支目标外部指令开始转换、填充,因此该分支目标外部指令对应的内部指令在其所在一级存储块中必定是第一条指令,即 BN1Y 的值为' 0 '。这样,所述分支点的分支目标指令就被存储在一级缓存 602 中,且所述 BN2 中的 BN2X 被转换为分支目标内部指令对应的 BN1X (由替换模块 611 产生),与 BN1Y (值为' 0 ')一同合并为 BN1 放上总线 693 。此时,选择器 694 、 696 选择读指针 631 的值(即分支指令本身对应的分支点)作为写地址,选择器 692 选择总线 693 上的 BN1 作为写入内容写回该分支点中。如此,轨道表 610 输出的轨迹点内容包含的是 BN1 。之后的操作与上述分支目标为 BN1 的直接分支指令中的情况相同,在此不再赘述。
若所述轨迹点内容中的指令类型显示该指令为间接分支指令(包括有条件、无条件两种情况),则将处理器核 601 对该分支指令执行时产生的分支目标地址中的块地址送往主动表 604 匹配。若匹配成功,则可以得到匹配成功项对应的 BN2X ,并以分支目标地址中的块内偏移量作为 BN2Y ,并以该 BN2X 和 BN2Y 值送往块地址映射模块 620 匹配,如命中获得相应的 BN1 值,则之后的操作与上述分支目标为 BN1 的直接分支指令中的情况相同;若不命中,则之后的操作与上述分支目标为 BN2 的直接分支指令中的情况相同,在此不再赘述。若匹配不成功,表示该分支目标地址对应的外部指令尚未存储在二级缓存 606 中,则根据替换算法(如 LRU 算法)由主动表 604 分配一个二级存储块的块号 BN2X ,并将该分支目标地址送往更低层次的存储器取回相应指令块存储到二级缓存 606 由所述 BN2X 指向的存储块中。再按之前所述方法,将该外部指令块转换后填充到一级缓存 602 中并建立相应轨道、记录映射关系,以及将所述 BN2 被转换为 BN1 填回该分支点中(在此过程中产生的 BN2 并不会被填充到轨道表 610 中,而直接将对应的 BN1 填充到轨道表 610 中),使得轨道表 610 输出的轨迹点内容包含的是 BN1 。之后的操作与上述分支目标为 BN1 的直接分支指令中的情况相同,在此不再赘述。
若下一次循迹器重新读出含该间接分支目标的表项时,该表项的指令类型是间接分支指令,但是地址类型是 BN1 ,控制器据此认定该间接分支指令此前已访问过,可以用该 BN1 地址猜测执行,但是经过 BN1 地址反求出相应的外部指令地址(如:通过该 BN1X 对应的轨道中存储的 BN2X 对主动表 604 寻址读出外部指令块地址,并通过 618 转换得到外部指令块内地址,从而得到完整的外部指令地址),待处理器核 601 执行该间接分支指令产生分支目标地址时将该分支目标地址与反求出的外部指令地址比较。如果相同,则继续执行。如果不相同,则清空分支点后的指令,不保存其结果,从处理器核 601 提供的分支目标地址开始执行并将该地址如前例映射成 BN1 后存入该分支点。
回到图 6 ,扫描转换器 608 负责将外部指令转换成内部指令填充到一级缓存。过程中扫描转换器 608 也计算外部指令的分支目标地址,提取指令的类型并将目标地址与类型信息填充到与一级缓存内部指令填充的相应轨道表表项。请参考图 7B ,其为本发明所述扫描转换器的一个实施例。
在本实施例中,扫描转换器 608 接受来自两个来源的输入。第一个来源是当轨道表 610 经总线 633 送出一个直接分支外部指令地址 BN2 ,此 BN2 在块地址映射模块 620 中匹配未命中,此时所需的外部指令块已经存储在二级缓存 606 中,主动表 604 中也有相应的外部指令( PC )高位地址,但尚未被转换成内部指令存储在一级缓存 602 中。总线 633 上的 BN2X 地址被送往主动表 604 中读出相应的 PC 高位,经总线 675 送往扫描转换器 608 ,总线 633 上的块内偏移量 BN2Y 也被送到扫描转换器 608 中。此时,选择器 660 也选择将总线 633 上的 BN2X 放上总线 673 向二级缓存 606 提供块地址。
第二个来源是当轨道表 610 经总线 633 送出一个间接分支外部指令类型且其地址格式为外部指令地址格式,表示该间接分支指令的目标需由处理器核 601 计算。此时,控制器将处理器核 601 执行相应间接条件分支指令时得到的外部分支目标地址经总线 683 、选择器 680 、总线 681 送往主动表 604 匹配。如果不匹配,表示分支目标的外部指令块尚不在二级缓存 606 中,此时主动表将总线 681 上外部指令地址送往低层存储器读取相应指令块并填充到二级缓存 606 中由主动表 604 分配,经选择器 660 、总线 673 指向的二级缓存 606 中的二级缓存块。同时,该外部指令的高位被存入主动表中的对应标签域。如果匹配,主动表 604 经选择器 660 和总线 673 指向与匹配标签相应的二级缓存 606 中的二级缓存块。同时,总线 683 上的 PC 地址被送进扫描转换器 608 。
请参照图 7B 的扫描转换器 608 内部结构。扫描转换器 608 中包含转换器 200 ,直接分支目标地址计算器 792 ,块内偏移映射生成器 796 ,控制器 790 与输入选择器 798 , 799 。其中,控制器 790 接受各模块来的状态信号并控制各模块协同工作。
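Selector 798 selects the high-order PC address from bus 675 or bus 683 and stores it in register 756. Selector 799 selects the low-order PC address (BN2Y) from bus 633 or bus 683 and stores it in register 321. The addresses from buses 675 and 633 are used to convert a BN2 in the track table into a BN1 address; in the process, the corresponding external instructions are translated into internal instructions and stored in level-1 cache 602. The address from bus 683 is used to translate the external instructions at an indirect branch target into internal instructions, store them in the level-1 cache, and store the level-1 block number BN1X together with the intra-block offset BN1Y in the track table 610 entry of the indirect branch instruction. Whatever the source, the operation after selection by selectors 798 and 799 is the same. The conversion of a BN2 into a BN1 address is used as the example below.
The address of level-2 cache 606 is a BN2, whose format in this example is '8XYY'. Here '8X' is the block address BN2X, with values '80' to '82'. Each level-2 cache block of level-2 cache 606 (one row in the figure) has 32 bytes; the intra-block offset BN2Y is the byte address 'YY' within the block, with values '0' to '31', and the bytes hold variable-length external instructions. The address of level-1 cache 602 is a BN1, whose format is '7XY', where '7X' is the block address BN1X, with values '70' to '75'. Each level-1 instruction block of level-1 cache 602 (one row in the figure) holds 4 fixed-length internal instructions; the intra-block offset BN1Y is the word address 'Y' within the block which, for ease of understanding and to distinguish it from BN2Y, is labeled with the letters A to D in this embodiment. In this embodiment an internal instruction is one word long, although internal instructions may have other lengths. Each row of track table 610 has five entries, A to E: entries A to D correspond to the four internal instructions A to D in level-1 cache 602, and entry E holds the address of the level-1 cache block that sequentially follows the row.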
直接分支目标地址计算器 792 中有三输入加法器 760 用以计算直接分支目标地址。直接分支目标地址计算器 792 中还有一个边界比较器 772 ,其输入与总线 679 相连。边界比较器 772 中存储了一个二级缓存块中的最大地址(此实施例中为' 31 '),总线 679 上的 BN2Y 值越过二级缓存块的边界时(大于' 31 '),边界比较器 772 会产生一个二级缓存地址越界信号通知控制器 790 。直接分支目标地址计算器 792 还有一个选择器 774 ,控制器 790 可以控制该选择器选择转换器 200 输出的分支偏移量或者全' 0 ',送往加法器 760 。选择全' 0 '时,计算顺序下一外部指令块地址。
请参照图 6 ,设循迹器 614 指向轨道表中某表项并从该表项中读出其类型为直接分支指令,其分支目标为 BN2 地址' 8024 ',其意义为二级缓存 606 中第' 80 '号二级缓存块中块内偏移量为' 24 '的外部指令。该 BN2 地址经总线 633 送往块地址映射模块 620 匹配。其 BN2X 值经选择器 640 选择后经总线 639 选择块地址映射模块 620 中块地址存储模块 920 存储的' 80 '行表项内容中的 BN2Y ,与总线 633 上经选择器 638 选择后经总线 637 送进的 BN2Y 比较。比较结果为不命中,即该分支指令是存储于二级缓存器的外部指令,但尚未转换成内部指令存入一级缓存 602 。控制器接收到该不命中信号,即控制以总线 633 上的 BN2X 在主动表 604 中第' 80 '行读取其中标签(假设为' 9132 ')经总线 675 送至扫描转换模块 608 。请参考图 7B ,控制器也控制 608 中选择器 798 选择总线 675 ,选择器 799 选择总线 633 ,也通知扫描转换器 608 中控制器 790 开始转换指令。
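Controller 790 controls register 756 to store the output of selector 798 ('9132') and register 321 to store the output of selector 799 ('24', binary '11000'). That is, the PC address of the branch target is '913224'; it is stored in row '80' of the level-2 cache, so its BN2 address is '8024'. Assume level-2 cache 606 reads 16 bytes at a time. Only the most significant bit of the 5-bit intra-block offset in register 321 is sent over bus 679 to level-2 cache 606, where it combines with the block address from bus 673 to form address '8016'; the corresponding bytes are read from level-2 cache 606 and sent over bus 677 to aligner 203 in converter 200. The lowest byte at the input of aligner 203 is now byte '16', so converter 200 uses the low 4 bits of register 321, binary '1000', as the initial shift amount to move byte '24' to the lowest byte of the output of aligner 203 and starts converting instructions. For each instruction, memory 201 in converter 200 issues a signal 786 that makes intra-block offset mapping generator 796 record the intra-block offset of that instruction. Memory 201 also drives bus 788, which controls logic gates 780 and 764 in generator 796 to suppress the recording of certain intra-block offsets, so that the mapping remains correct when multiple internal or external instructions correspond to one internal or external instruction.
The binary value '11000' in register 321 is sent over bus 679 to decoder 762 and decoded into the one-hot code '00000000000000000000000010000000', which is stored in register 766 through AND-OR gate 764. Correspondingly, counter 776 is reset to '0' when conversion of a segment of external instructions begins; the value '000' on its output bus 669 is decoded by decoder 778 into the one-hot code '1000' and sent through logic gate 780 to register 782 for storage. Intra-block offset mapping generator 796 also contains shifter 768 and register 770. The value on bus 679 is stored in register 770 when conversion of an external instruction segment begins and controls the shift of shifter 768. In this example, '11000' is stored in register 770 and makes shifter 768 shift left by 24 positions, so that the information for byte '24' in register 766 is moved to the position of byte '0' and placed on bus 691.
Following its replacement rule, replacement module 611 has allocated level-1 cache block '72' of level-1 cache 602 to the internal instructions being generated. The controller makes selector 692 select the BN1X address '72' on bus 693, together with the BN1Y address A ('00'), for writing into track table 610. Selectors 694 and 696 select the address on bus 631 at this time, so the BN1 address '72A' is written into the entry being read, replacing the original BN2 address '8024' without changing the original instruction type. If tracker 614 decides at that entry, based on the instruction type and/or a control signal from processor core 601, to take the branch, it places '72A' on bus 631, which points to the first entry of row '72' of track table 610, and execution continues from there.
Replacement module 611 sends the BN1X address '72' over bus 693 to select level-1 cache block '72' in level-1 cache 602 and row '72' in both track table 610 and offset address mapping module 618, which are filled with the internal instructions, the corresponding program flow, and the intra-block offset information produced by scan converter 608. Bus 669 leaves scan converter 608 and goes to level-1 cache 602 and track table 610 as the intra-block offset BN1Y used to fill the level-1 cache block and the corresponding track table row. The external instruction that starts at BN2 address '8024' in level-2 cache 606 is converted by converter 200 into a non-branch internal instruction, which is sent over bus 667 to level-1 cache 602 and filled into entry A (intra-block offset '00') of level-1 cache block '72'; its instruction type (non-branch instruction) is output by memory 201 and sent over bus 687 to track table 610, where it is stored in entry '72A'.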
控制器也控制将总线 633 上的 BN2Y 值' 24' 经选择器 698 选择后与总线 693 上的 BN1X 地址' 72 '拼接成 BN1X , BN2Y 的形式' 7224 '经总线 665 写入块地址映射模块 620 中的块地址存储模块 920 中由总线 633 上的 BN2X 经选择器 640 选择后由总线 639 寻址的' 80 '行中最左面的表项。该表项由总线 633 上的 BN2Y ' 24 '值经选择器 638 选择后经总线 637 送入块地址映射模块 620 中与该行中各表项的 BN2Y 值' 32 '比较决定而确定。该值及其位置表示二级缓存器中' 80 '号二级缓存块第' 24 '字节开始的外部指令段被存于' 72 '号一级缓存块中,且二级缓存器' 80 '行中字节地址小于' 24 '的外部指令还未被转换成内部指令。具体结构与操作见图 8 实施例。
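During conversion, memory 201 detects that the above non-branch external instruction is 2 bytes long and, over bus 325, makes aligner 203 shift the external instructions arriving over bus 677 a further 2 bytes to the left before the next instruction is converted. This byte length is also sent to adder 323 and added to the content of register 321; the sum, '26', is stored back in register 321. The output of register 321 is again decoded by decoder 762, this time into the one-hot code '00000000000000000000000000100000', and ORed bit by bit with the content of register 766 through AND-OR gate 764. The result, '00000000000000000000000010100000', is stored back in register 766; it means that bytes '24' and '26' of level-2 cache block '80' are each the first byte of an external instruction.
Converter 200 then converts the external instruction that starts at byte '26' and finds that it is a 4-byte direct branch instruction. The converter leaves its branch offset unmodified and places it, together with the other parts of the resulting internal instruction, directly on bus 667. Its branch instruction type is output on bus 687 as in the previous example. Under the control of bus 786, counter 776 is incremented by '1', so the value on bus 669 is '001'. Because the instruction is a branch instruction, controller 790 makes adder 760 add the high-order PC address in register 756, the intra-block offset in register 321, and the part 798 of bus 667 that carries the branch offset (assumed here to be '24'). The sum is the PC address of the branch target, '913318', which is output on bus 657. The low-order BN2Y part of the sum (the part no larger than the number of bytes in a level-2 cache block) is spliced onto bus 687 for output.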
总线 657 上的 PC 地址的高位经选择器 680 ,总线 681 被送往主动表 604 匹配,其结果为不命中。主动表 604 即将该' 9133 ' PC 高位地址经总线 681 送往低层存储器读取相应的外部指令块。主动表 604 也分配二级缓存器中' 81 '号二级缓存块供此外部指令块存放。二级缓存块号 BN2X (' 81 ')也经总线 671 送出与总线 687 上的低位 BN2Y (' 18 ')拼接成完整的 BN2 与在总线 687 上的直接分支指令类型,经选择器 692 送往轨道表 610 ,写入由替换模块 611 经总线 693 指向的' 72 '行中由总线 669 指向的 B 项(地址' 001 ')。同时,转换所得的内部分支指令经由总线 667 被写进一级缓存器 602 中' 72B '项。
总线 669 上的值' 001 '也被译码器 778 译为独热码' 0100 '与寄存器 782 中的值作'或'操作值' 1100 '存回寄存器 782 ,代表内部指令块中第一条和第二条都各自对应一条外部指令。如果一条内部指令对应的不是一条外部指令的起始字节(即一条外部指令转换成多条内部指令时相应的第一条内部指令后的内部指令),则存储器 201 的内容通过总线 788 送出的信号(如图 5D 中的结束值 YZ 为' 10 '的情况)会控制与或门 780 ,使得寄存器 782 中的信号与全' 0 '进行'或'操作,使寄存器 782 中相应该指令的位被记录成' 0 ',表示该内部指令不对应一条外部指令,使得该指令不会成为一个分支目标。另一方面,当有多条外部指令被融合成一条内部指令时,存储器 201 的内容通过总线 788 (如图 5E 中的结束值 YZ 为' 01 '的情况)送出的信号会控制与或门 764 '擦去'该多条外部指令第一条指令后的其他指令的相应记录,使得外部指令和内部指令的条数能够一致。当一段外部指令转换成相应的内部指令后,寄存器 782 与 766 中' 1 '的数目是一样的,虽然所处的位置不相同。在寄存器 766 中' 1 '的位置是代表外部指令起始字节的字节地址。在寄存器 782 中' 1 '的位置是代表内部指令起始指令的指令地址。
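During conversion, memory 201 detects that the external instruction starting at byte '26' is 4 bytes long and, over bus 325, makes aligner 203 shift the external instructions arriving over bus 677 a further 4 bytes to the left before the next instruction is converted. This byte length is also sent to adder 323 and added to the content of register 321; the sum, '30', is stored back in register 321. The output of register 321 is again decoded by decoder 762 into a one-hot code and ORed bit by bit with the content of register 766; the result, '00000000000000000000000010100010', is stored back in register 766. Counter 776 is also incremented by '1' as before, so bus 669 points to entry C.
During conversion, converter 200 reads from memory 201 over bus 325 that the external instruction starting at byte '30' is 4 bytes long. This byte length is also sent to adder 323 and added to the content of register 321; the sum, '34', is stored back in register 321. The output 679 of register 321 is compared with the largest byte address of a level-2 cache block, '31', held in comparator 772, and the comparison result notifies controller 790 that the level-2 cache block boundary has been crossed. Controller 790 therefore makes selector 774 select all '0's and makes adder 760 add the high-order PC address in register 756, the intra-block offset in register 321, and the all-'0' value to obtain the address of the sequentially next external instruction block. The resulting PC address, '913302', is output on bus 657; its high-order part, '9133', is sent to active table 604 for matching and yields the BN2X value '81' (allocated by active table 604 earlier, when PC address '913318' missed). This BN2X value selects level-2 cache block '81' in level-2 cache 606 through selector 660 and bus 673. As before, converter 200 reads bytes '0' to '15' of level-2 cache block '81', extracts bytes '0' and '1' from them, and shifts and splices them after bytes '30' and '31' of level-2 cache block '80' already in converter 200, completing the conversion of this external instruction. The resulting internal instruction is sent over bus 667 to entry '72C' of level-1 cache 602. The content of register 782 is likewise updated to '1110'.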
因在转换指令时已越过二级缓存块边界。控制器 790 此时据此控制转换器 200 停止转换指令,也控制计数器 776 再增一位,使总线 669 上的地址指向' 72D '项。控制器也使总线 671 上的 BN2X 值' 81 ',经选择器 640 及总线 639 送至块地址映射模块 620 中块地址存储模块 920 ,选择其中' 81 '行的内容读出与经总线 657 、选择器 638 、总线 637 送进块地址映射模块 620 的 BN2Y 地址' 02 '进行比较。如果匹配命中则将匹配所得的 BN1 连同控制器产生的无条件分支指令类型经总线 685 ,选择器 692 存进轨道表 610 中' 72D '表项。现在匹配结果为不命中,其意义为相应的外部指令块已在二级缓存中,但尚未被转换为内部指令。此时控制器 790 产生一个直接分支指令类型放上总线 687 与来自加法器 760 的低位 BN2Y (对应于块内偏移字节数)' 02 '一同由总线 687 输出。控制器使总线 671 上的 BN2X 值与已在总线 687 上的 BN2Y 地址拼合成 BN2 地址' 8102 ',连同无条件分支指令类型一起经选择器 692 写入轨道表' 72D '项。此时,并无相应的内部指令,所以一级缓存 602 中的' 72D '项没有被填充。
此时,控制器 790 也控制将寄存器 766 中内容经移位器 768 左移 24 位后,其值为' 10100010 ',此格式即图 8B 中行 751 的数据格式。放上总线 691 ;控制器 790 也控制将寄存器 782 中内容' 1110 '放上总线 691 。寄存器 782 中的格式即如图 8B 中行 771 的数据格式。总线 691 上的内容被送往偏移地址映射模块 618 中由一级缓存置换器 611 指向的第' 72 '行写入,以供以后对该行进行外部与内部指令的块内偏移映射时用。
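The bookkeeping in this walkthrough (recording instruction start bytes, shifting the record so the segment start lands at position 0, and detecting the block boundary) can be reproduced with a few lines of Python. This sketch assumes a 32-byte level-2 block, a 4-entry level-1 block, and one internal instruction per external instruction, as in the example; the function name `scan_segment` and the constants are hypothetical.

```python
BLOCK_BYTES = 32  # assumed level-2 block size in bytes
L1_SLOTS = 4      # assumed number of internal instructions per level-1 block


def scan_segment(start: int, lengths):
    """Record start positions for a run of external instructions.

    Returns (external start bits shifted so `start` is at position 0,
             internal start bits,
             offset of the next instruction in the following block).
    Assumes each external instruction yields exactly one internal instruction
    and that the run fits into one level-1 block.
    """
    ext = ["0"] * BLOCK_BYTES
    internal = ["0"] * L1_SLOTS
    offset = start
    for slot, length in enumerate(lengths):
        ext[offset] = "1"        # one-hot decode ORed into the external record
        internal[slot] = "1"     # one-hot decode ORed into the internal record
        offset += length         # adder: next start offset
        if offset >= BLOCK_BYTES:
            break                # boundary comparator: block boundary crossed
    shifted = "".join(ext[start:])   # left shift by `start` positions
    return shifted, "".join(internal), offset % BLOCK_BYTES


ext_bits, int_bits, next_offset = scan_segment(24, [2, 4, 4])
```

For a segment starting at byte 24 with instruction lengths 2, 4, and 4, this yields the external record '10100010', the internal record '1110', and a continuation offset of 2 in the next block, matching the values in the walkthrough.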
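At this point, scan converter 608, together with the other modules, has converted a segment of external instructions, extracted the program flow information in that segment, and stored the program flow information and the resulting internal instructions in the corresponding entries of track table 610 and level-1 cache 602. This lets the embodiment use tracker 614 to read and follow the program flow in track table 610 and supply the corresponding internal instructions to the processor core for execution. The values in block address mapping module 620 and track table 610 at this point are shown in FIG. 9A.
While a segment of external instructions is being converted, the level-1 cache block may fill up before the level-2 instruction segment ends. Counter 776 contains a comparator equivalent to boundary comparator 772 that notifies controller 790 when the boundary of the level-1 cache block is crossed. In that case controller 790 requests a new level-1 cache block from replacement module 611 and has the BN1X address of the new block, together with a BN1Y address of '0', written over bus 693 and through selector 692 into the last entry of the track table row that has just been filled. Every track table row has one more entry than the corresponding level-1 cache block, so that the program flow can continue into the newly added track when the level-1 cache block is full. Because the new level-1 cache block is filled starting from its first entry, its BN1Y address is fixed at '00'. Counter 776 is then reset. Replacement module 611 points, over bus 693, to the new level-1 cache block and the corresponding track table row. Internal instructions converted afterwards, and their program flow information, are filled starting from entry A of the cache block and track table row pointed to by bus 693.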
请参考图 8A ,其为本发明所述外部指令块与内部指令块对应关系的示意图。在本发明中,外部指令集可以是定长指令集,也可以是变长指令集。为了不失一般性,在本说明书中主要以变长的外部指令集为例进行说明,定长外部指令集可以作为变长外部指令集的一种特例。
在本实施例中,假设一个外部指令块的长度为 16 个字节(从字节 0 到字节 15 ),且每条内部指令的长度为 4 个字节。如图 8A 所示,外部指令块 701 中包含了 6 条变长指令。如之前实施例所述,外部指令块中的字节 0 是上一条指令的最后一个字节,因此属于上一外部指令块,即本外部指令块中的外部指令从指令块的字节 1 开始。其中,外部指令 703 占 3 个字节(字节 1 、 2 和 3 ),外部指令 705 占 5 个字节(字节 4 、 5 、 6 、 7 和 8 ),外部指令 707 占 2 个字节(字节 9 和 10 ),外部指令 709 占 1 个字节(字节 11 ),外部指令 711 占 3 个字节(字节 12 、 13 和 14 ),外部指令 713 在本外部指令块中占 1 个字节,其余部分在下一外部指令块中。
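In this embodiment, assume that external instruction 705 can be converted into 2 internal instructions (internal instructions 725 and 727) and that external instructions 703, 707, 709, 711, and 713 can each be converted into 1 internal instruction, namely internal instructions 723, 729, 731, 733, and 735 respectively. Internal instruction block 721, obtained after conversion by scan converter 608, then contains 7 internal instructions (internal instruction 0 through internal instruction 6). While it converts the instruction block, scan converter 608 also produces the correspondence between the external intra-block offsets BN2Y and the internal intra-block offsets BN1Y. This correspondence is stored in offset address mapping module 618.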
需要说明的是,在本发明中,一条外部指令可能被转换为一条或多条内部指令。为了不失一般性,在本说明书中主要以一条外部指令对应多条内部指令为例进行说明,而一条外部指令对应一条内部指令的情况是一种特例。即,当一条外部指令对应一条内部指令时,该外部指令对应的第一条内部指令和最后一条内部指令均是所述外部指令对应那一条内部指令。
请参考图 8B ,其为本发明所述偏移地址映射关系存储形式的一个实施例。在本实施例中,行 751 和 771 构成一组映射关系,分别对应外部指令块和内部指令块,以存储图 8A 实施例中的外部指令和内部指令之间的偏移地址映射关系。其中,行 751 有 16 个表项,每个表项内只存储一位( bit )数据(即' 0 '或' 1 '),其中' 0 '表示该表项对应的外部指令偏移地址不是一条外部指令的起始位置,' 1 '表示该表项对应的外部指令偏移地址是一条外部指令的起始位置。
每组映射关系中的第二行(即行 771 )中的每个表项对应一个内部指令偏移地址,即表项数目与内部指令块最大可能包含的内部指令个数相同。且每个表项内也只存储一位数据(即' 0 '或' 1 '),其中' 0 '表示该表项对应的内部指令不是其相应外部指令的第一条内部指令,' 1 '表示该表项对应的内部指令是其相应外部指令的第一条内部指令。
这样,通过分别对行 751 和 771 中的' 1 '进行操作就可以将外部指令偏移地址转换为内部指令偏移地址。请参考图 8C ,其为本发明所述偏移地址转换器 622 的一个实施例。在本实施例中,以外部指令偏移地址转换为内部指令偏移地址为例进行说明。其中,从偏移地址映射模块 618 送来的映射关系格式如图 8B 实施例所述。
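In selector array 801, the number of columns equals the number of offset addresses in an external instruction block, and the number of rows is the number of columns plus one, i.e., 17 rows and 16 columns. For clarity, FIG. 8C shows only 4 rows and 3 columns: the first 4 rows counting from the bottom and the first 3 columns counting from the left. The bottom row is row 0, and row numbers increase upward. The leftmost column is column 0, and column numbers increase to the right; each column corresponds to one offset address in the external instruction block. In column 0, inputs A and B of every selector are '0', except that input A of the row-0 selector is '1'. Input B of every selector in row 0 is '0'. For the other columns, input A of a selector comes from the output of the selector in the same row of the previous column, and input B comes from the output of the selector one row below in the previous column.
Selector array 803 is similar in structure to selector array 801 and has the same number of rows. The difference is that its number of columns equals the number of instructions an internal instruction block holds. Again for clarity, FIG. 8C shows only 4 rows and 5 columns: the first 4 rows counting from the bottom and the first 5 columns counting from the left. Rows and columns are numbered as in array 801. In addition, input B of every selector in row 0 of selector array 803 is '0', and input A of every selector in the last row (row 16) is '0'; the outputs of the row-0 selectors are all sent to encoder 809, which encodes the column position of the output. For the other selectors, input A comes from the output of the selector one row above in the previous column, and input B comes from the output of the selector in the same row of the previous column; for column 0, input A comes from the output of the selector one row above in selector array 801, and input B comes from the output of the selector in the same row of selector array 801.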
译码器 805 对外部指令偏移地址进行译码,得到的掩码值送往掩码器 807 。由于一个外部指令块包含 16 个偏移地址,因此该掩码值的宽度为 16 位,其中该外部指令偏移地址对应的掩码位及其之前的掩码位的值均为' 1 ',该外部指令偏移地址对应的掩码位之后的掩码位的值均为' 0 '。之后,将该掩码值与从偏移地址映射模块 618 送来的映射关系中的行 751 进行按位与操作,从而保留行 751 中该外部指令偏移地址对应的掩码位及其之前掩码位对应的值,并将其余值清零,得到一个 16 位的控制字送往选择器阵列 801 。
该控制字的每一位控制选择器阵列 801 中的一列选择器。当该位为' 1 '时,相应列的选择器全部选择输入 B ;当该位为' 0 '时,相应列的选择器全部选择输入 A 。即,对于选择器阵列 801 中的每一列选择器,若对应的控制位为' 1 ',则选择来源于前一列下一行的输出值作为输入,使得前一列的输出值整体上移一行,并在最下一行补' 0 ',作为本列的输出;若对应的控制位为' 0 ',则选择来源于前一列同一行的输出值作为输入,保持前一列的输出值作为本列的输出。这样,控制字中有多少个' 1 ',选择器阵列 801 第一列的输入就会被上移多少行,即选择器阵列 801 的输入中的唯一一个' 1 '被上移了相应行数。由于选择器阵列 801 的行数和列数与外部指令块包含的偏移地址个数相等,因此选择器阵列 801 的输出中包含且仅包含一个' 1 ',且这个' 1 '所在的行的位置由控制字确定。
同时,从偏移地址映射模块 618 送来的映射关系中的行 771 直接作为控制字被送往选择器阵列 803 。与选择器阵列 801 中类似,该控制字的每一位控制选择器阵列 803 中的一列选择器。当该位为' 1 '时,相应列的选择器全部选择输入 A ;当该位为' 0 '时,相应列的选择器全部选择输入 B 。即,对于选择器阵列 803 中的每一列选择器,若对应的控制位为' 1 ',则选择来源于前一列上一行的输出值作为输入,使得前一列的输出值整体下移一行,并在最上一行补' 0 ',作为本列的输出;若对应的控制位为' 0 ',则选择来源于前一列同一行的输出值作为输入,保持前一列的输出值作为本列的输出。这样,每经过控制字中的一个' 1 ',选择器阵列 803 的输入就会被下移一行,即所述输入中的唯一一个' 1 '被下移了一行。因此,当编码器 809 接收到从选择器阵列 803 最下一行送来的' 1 '时,即可根据该' 1 '所在的列的位置生成对应的内部指令偏移地址。
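Taking the mapping in FIG. 8B as an example, if the external instruction offset address is '9' (i.e., the tenth byte of the external instruction block, which is the start of the third instruction), the mask value supplied to masker 807 is '1111111111000000'. ANDing it bit by bit with the value of row 751, '0100100001011001', gives '0100100001000000', a control word containing three '1's. The '1' at the input of selector array 801 is therefore moved up three rows, so the output '1' is in row 3. In selector array 803, this '1' reaches encoder 809 after passing 3 selector columns whose control bit is '1': because the value of row 771 is '1101111', selector array 803 lowers the '1' by one row in each of columns 0, 1, and 3, and it is finally output to encoder 809 from column 3, which corresponds to the fourth instruction of the internal instruction block (offset address '3'). Encoder 809 encodes this as '3', so the external instruction offset address '9' is converted into the internal instruction offset address '3'.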
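Functionally, the two selector arrays count the instruction starts up to the given external offset and then locate the matching start in the internal row. A minimal Python sketch of that behaviour follows; it models the result rather than the selector hardware, represents rows 751 and 771 as bit strings, and uses the hypothetical function name `ext_to_int_offset`.

```python
def ext_to_int_offset(row_751: str, row_771: str, bn2y: int) -> int:
    """Convert an external intra-block offset into an internal one.

    row_751 -- '1' marks the first byte of each external instruction
    row_771 -- '1' marks the first internal instruction of each external instruction
    """
    # Mask and count: number of external instruction starts at or before bn2y
    # (the number of rows the single '1' is moved up in array 801).
    k = row_751[: bn2y + 1].count("1")
    # Walk row 771: each '1' lowers the marker one row; the column in which it
    # reaches row 0 is the internal offset (array 803 and encoder 809).
    seen = 0
    for column, bit in enumerate(row_771):
        if bit == "1":
            seen += 1
            if seen == k:
                return column
    raise ValueError("offset does not map to an internal instruction")


bn1y = ext_to_int_offset("0100100001011001", "1101111", 9)
```

With the rows from FIG. 8B, external offset 9 maps to internal offset 3, and external offset 4 (the two-instruction case) maps to internal offset 1, the first of its two internal instructions.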
根据本发明技术方案,可以通过将待排序的 BN2Y 值与块地址映射模块 620 各表项中存储的 BN2Y 值比较以将当前被写入的 BN1X 和 BN2Y 存储到正确位置。请参考图 8D ,其为本发明所述块地址映射模块的一个实施例。
在本实施例中,块地址映射模块 620 包含块地址存储模块 920 、比较模块 924 、移位器 926 、多路选择器 940 、多路选择器 942 及一些选择器控制逻辑。各个功能模块又分成基本相同的复数个列(如: R 、 S 和 T )。其每列中各有其自有的块地址存储模块 920 、比较模块 924 、移位器 926 、多路选择器 940 及多路选择器 942 。其中块地址存储模块 920 为一个由复数个表项组织成复数行与复数列组成(如图 8D 中存储模块 970 、 971 及 972 )的存储器阵列。其每个表项中有两部分:一级缓存块号( BN1X )及二级缓存块内位移( BN2Y )。存储器阵列由地址 639 选出其中一行由总线 950 输出;也同样由总线 639 选出一行将总线 952 上的数据写入该行。块地址存储模块 920 中某列被排序功能模块中每一列各有其相应的比较模块 924 用于比较块内偏移 BN2Y 。除比较模块 924 外各功能模块与总线的位宽均等于块地址存储模块 920 表项宽度用于传输表项。比较模块 924 是位宽为 BN2Y 的大于比较器,当某列中总线 950 上的 BN2Y 大于从总线 635 上送入的 BN2Y 时,该列比较器输出为' 1 ';当总线 950 上的 BN2Y 小于等于总线 635 上的 BN2Y 时,该列比较器输出为' 0 '。当比较器输出为' 0 '时,选择器 940 选择本列总线 950 上的表项内容放上总线 952 。当比较器输出为' 1 '时,其右面一列的选择器选择比较器所在列 950 上数据经移位器 926 移位后的数据放上总线 952 。即当比较器输出为' 1 '时,控制器将本列 950 上数据右移一列。当某列的比较器输出为' 1 ',而其左面一列比较器输出为' 0 '时,则该某列选择总线 665 上数据放上总线 952 。总线 952 按列将选择器 940 的输出送至块地址存储模块 920 。例如:选择器 976 的输出只送回存储模块 970 、选择器 977 的输出只送回存储模块 971 。当某列的比较器输出为' 0 ',而其右面的一列比较器的输出为' 1 '时,则控制逻辑选择该列总线 950 上数据放上总线 954 送往轨道表 610 与块内偏移映射器逻辑 618 等。
假设每行二级指令块中的最大偏移地址为' 31 '(即偏移地址范围为' 0 ' ~ ' 31 '),则当一个二级指令块被写入二级缓存 606 时,其二级块内偏移地址( BN2Y ) 982 均被设为' 32 ',其意义是本行最大偏移地址加' 1 '。现假设总线 639 上的高位( BN2X )为' 81 '选出 620 中的一行,其中存储模块 970 、 971 和 972 列中表项内的 BN2Y 均为' 32 '。从总线 637 送进的是 BN2Y ,此时值为' 18 '。其意义为以 BN2 地址' 8118 '匹配排序。比较模块 924 比较的结果为比较器输出 973 、 974 和 975 均为' 1 '(输出 973 为' 1 '即表示在块地址存储模块 920 中尚无对应总线 637 上的 BN2Y 的有效表项),控制选择器 940 中选择器 977 和 978 选择 C 输入,即移位器 926 的输出放上总线 952 ;而选择器 976 选择总线 665 上的数据放上总线 952 。总线 952 上的数据被写入块地址存储模块 920 中刚才读出的同一行。其结果是存储模块 970 中表项存储了从总线 665 送进的数据,存储模块 971 中表项存储了原来存储模块 970 中表项数据,存储模块 972 中表项存储了原来存储模块 971 中表项数据。图中未显示的右方各列相应的比较器的来自总线 950 的 BN2Y 输入都为' 32 '大于' 18 ',所以比较结果都为' 1 ',各自控制相应列的数据右移。即 BN2Y 值大于新来的数据的 BN2Y 值的都被右移使得包括新数据在内的各表项按 BN2Y 值的升序排列。控制器检测比较模块 924 中最左边的一个比较器的输出 973 以判定输入的 BN2Y 值有无对应的一级缓存块。如比较器输出 973 为' 1 ',表示输入的 BN2Y 无对应的一级缓存块。如比较器输出 973 为' 0 ',表示输入的 BN2Y 有对应的一级缓存块。
假设上述行又被总线 639 上' 81 '号地址读出,此时存储模块 970 、 971 和 972 表项中相应 BN2Y 值为' 18 '、' 32 '和' 32 ',与从总线 637 送来的 BN2Y 值' 27 '由比较模块 924 中的相应比较器相比。其结果为比较器输出 973 为' 0 ',比较器输出 974 及 975 均为' 1 '。比较器输出 973 使得选择器 976 选 A 输入将本列总线 950 上数据放上总线 952 ;比较器输出 974 使得选择器 978 选 C 输入,即移位器 926 的输出;比较器输出 975 使得其右方一列的选择器选 C 输入,即移位器的输出。而比较器输出 973 为' 0 '与比较器输出 974 为' 1 '使得选择器 977 选择 B 输入即总线 665 上的数据。写回块地址存储模块 920 之后存储模块 970 中表项数据的 BN2Y 值为' 18 ',存储模块 971 中表项数据的 BN2Y 值为' 27 ',存储模块 972 中表项数据的 BN2Y 值为' 32 ',在其他右方各项中均为' 32 '。如此则表项中数据是根据其 BN2Y 值排序,其相应的一级缓存块号也被按二级存储块内偏移排序,使得可以根据一个外部指令的 BN2 地址映射得到相应的内部指令的 BN1 地址。
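The sorted write into one row of block address storage module 920 can be sketched as follows. This is a simplified software model under stated assumptions: a row is a list of `(BN1X, BN2Y)` pairs, an unused entry is represented as `(None, 32)`, and the BN1X values 70 and 71 are taken from the Fig. 9 walk-through. The name `insert_sorted` is hypothetical.

```python
EMPTY_BN2Y = 32  # reset value: largest in-block offset (31) plus one


def insert_sorted(row, bn1x, bn2y):
    """Return a new row with (bn1x, bn2y) inserted in ascending BN2Y order."""
    # One comparator per column: True ('1') when the stored BN2Y is greater
    # than the incoming BN2Y.
    greater = [stored > bn2y for _, stored in row]
    new_row = []
    for col, entry in enumerate(row):
        if not greater[col]:
            new_row.append(entry)            # input A: keep this column's entry
        elif col == 0 or not greater[col - 1]:
            new_row.append((bn1x, bn2y))     # input B: take the new entry
        else:
            new_row.append(row[col - 1])     # input C: left neighbour shifts right
    return new_row


row = [(None, EMPTY_BN2Y)] * 3
row = insert_sorted(row, 70, 18)   # all comparators '1': new entry lands in column R
row = insert_sorted(row, 71, 27)   # comparators 0, 1, 1: new entry lands in column S
print(row)
```

As in the hardware, the rightmost entry is dropped when every column to its left shifts right, so the row length stays fixed.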
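Suppose a new BN2 address '8123' arrives on buses 639 and 637. Row '81' is read out, and the BN2Y values in the entries of storage modules 970, 971 and 972 are '18', '27' and '32' respectively. The BN2Y value delivered on bus 637 is '23'. Comparison module 924 yields comparator output 973 = '0' and outputs 974 and 975 = '1'. Among the controls that select what drives bus 954, only signal 979 is '1' (signal 979 is the XOR of comparator outputs 973 and 974), so the content of the entry in storage module 970 is placed on bus 954 and sent to the in-block offset mapping logic (comprising in-block offset mapping module 618, offset address converter 622 and subtractor 928). The level-1 cache block number BN1X in the entry is used as an address to read, from in-block offset mapping module 618, the mapping stored in the row for that level-1 cache block, which is sent to offset address converter 622. Subtractor 928 subtracts the BN2Y on bus 954 (the start address, within the level-2 cache block, of the level-2 sub-block that corresponds to this level-1 cache block) from the BN2Y on bus 637 (the offset within the level-2 cache block); the difference (23 - 18 = 5) is the net offset of the BN2Y on bus 637 within that level-2 sub-block. From this net offset and the mapping above, offset address converter 622 derives the corresponding level-1 in-block offset BN1Y (in the mapping, the bit for byte 5 of the level-2 offsets must be '1', marking the first byte of an external instruction, and offset address converter 622 derives the level-1 offset of the internal instruction corresponding to that external instruction). Concatenating the BN1X on bus 954 with this BN1Y gives the level-1 cache address BN1 that corresponds to the level-2 cache address '8123'. This BN1 can be written into an entry of track table 610 for the tracker to look up.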
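The lookup described above (pick the last entry whose BN2Y does not exceed the incoming BN2Y, then subtract) can be illustrated with a short model. It assumes a row is a list of `(BN1X, BN2Y)` pairs sorted by ascending BN2Y, with unused entries holding BN2Y = 32; the function name `lookup` and the `None` return for "no level-1 block yet" are choices made for this sketch only.

```python
def lookup(row, bn2y):
    """Return (bn1x, net_offset) for the sub-block covering bn2y, or None."""
    hit = None
    for bn1x, start in row:          # row is sorted by ascending start offset
        if start > bn2y:             # comparator output '1': first greater entry
            break
        hit = (bn1x, bn2y - start)   # subtractor 928: offset within the sub-block
    return hit


row_81 = [(70, 18), (71, 27), (None, 32)]
print(lookup(row_81, 23))            # BN2 address '8123'
```

The net offset returned here is what offset address converter 622 would then translate into BN1Y using the per-block mapping; that second step is not modelled in this sketch.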
以下结合图 6 、图 8D 、图 9A~ 图 9F 进行说明,其中 图 9A~9F 为图 6 实施例运行过程的示意图。
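Figs. 9A to 9F show the run-time contents of block address storage module 920, level-2 cache 606, offset address mapping module 618, track table 610 and level-1 cache 602. Each row of block address storage module 920 corresponds to one level-2 cache block in level-2 cache 606 and to one external instruction block address in active table 604. Each row of offset address mapping module 618 and of track table 610 corresponds to one level-1 cache block in level-1 cache 602. In Fig. 6, active table 604 is also responsible for allocating level-2 cache blocks in level-2 cache 606 to newly fetched external instruction blocks according to the replacement policy, and replacement module 611 is responsible for allocating level-1 cache blocks in level-1 cache 602 to internal instructions according to the replacement policy. The shaded areas of level-1 cache 602 in the figures indicate internal instructions that have already been filled.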
二级缓存 606 的寻址地址为 BN2 ,其格式为' 8XYY '。其中' 8X '为块地址 BN2X 。为便于说明,此例中二级缓存 606 是一个路组缓存,其块地址即为索引地址( index ),其值为' 80 ' ~ ' 82 ', 其相应标签(即块地址)存放在主动表中相同索引地址的行。二级缓存 606 中每个二级缓存块(图中一行)有 32 个字节,其块内偏移量 BN2Y 为其块内字节( byte )地址' YY ',其值为' 0 ' ~ ' 31 '。其中储存变长的外部指令,图中每个分隔代表一条不同长度的外部指令,在此实施例中外部指令的长度从 2 个字节到 8 个字节不等。
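Level-1 cache 602 is a fully associative cache under the joint control of track table 610 and block address storage module 920. Its address is BN1, in the format '7XY', where '7X' is the block address BN1X with values '70' to '75'. Each level-1 instruction block in level-1 cache 602 (one row in the figures) holds 4 fixed-length internal instructions, and its in-block offset BN1Y is the in-block word address 'Y'; to make it easy to follow and to distinguish it from BN2Y, its values are labelled with the letters A to D in this embodiment. In this embodiment an internal instruction is one word long, although internal instructions may have other lengths. Each row of track table 610 likewise has four entries A to D corresponding to the four internal instructions A to D in level-1 cache 602. Each row of track table 610 also has an entry E that holds the address of the next instruction block. Every entry of track table 610 stores a type, from which the tracker decides the next address. An entry may also store a pointer to the target address of the instruction that the entry represents, in either BN2 or BN1 format. Each row of offset address mapping module 618 corresponds to one level-1 cache block and to the matching row of the track table.
Each row of block address storage module 920 corresponds to one level-2 cache block in level-2 cache 606. Each row of block address storage module 920 has a plurality of entries (for example R, S, T, U and V), and each entry can correspond to one level-1 instruction block in the level-1 cache. An entry of block address storage module 920 contains the block address BN1X of its level-1 cache block, and the address BN2Y, within the level-2 cache block, of the external instruction that corresponds to the first internal instruction of that level-1 cache block. When a level-2 cache block is written, all BN2Y addresses in the corresponding row of block address storage module 920 are reset to '32', which denotes the first byte of the sequentially next level-2 cache block.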
图 9A 为开始状态,其时二级缓存 606 中二级缓存块' 80 '已被填充,而二级缓存块' 81 '和' 82 '尚未填充。' 80 '号块中从字节' 24 '开始的外部指令正被扫描转换器 608 转换为内部指令格式经总线 667 向一级缓存 602 中一级缓存块' 72 '按顺序填充。' 80 '块中字节' 24 ' ~ ' 25 '是一条外部指令,其相应内部指令被填充进' 72 '号块 A 项;' 80 '块中字节' 26 ' ~ ' 29 '是一条外部指令,其相应内部指令被填充进' 72 '号块 B 项;' 80 '块中从字节' 30 '开始是一条 4 个字节长的外部指令,其相应内部指令将被填充进' 72 '号块 C 项。
转换格式过程中,扫描转换器 608 发现开始于' 80 '号块' 26 '字节的外部指令是一条分支指令,并以该缓存块在主动表 604 中存储的缓存块地址加块内偏移量' 26 ',再加上分支偏移量计算出其分支目标。该分支目标高位经总线 657 送往主动表 604 中匹配未命中,经主动表 604 按分配新的二级缓存块' 81 '号缓存块(即 BN2X 为' 81 ');主动表 604 也向低层存储器送出该分支目标高位以读取相应外部指令块存入' 81 '号缓存块。相应扫描转换器 608 中' 81 '行的 BN2Y 全被重置为' 32 '。该新分配的二级缓存块号由主动表 604 经总线 671 送出,与扫描转换器 608 输出的总线 657 上的分支目标低位(' 18 '号字节)在总线 687 上拼接成一个 BN2 地址。扫描转换器 608 也得出与外部指令' 8026 '(即' 80 '块' 26 '字节)相应的内部指令地址为' 72B '(即' 72 '号一级存储块中第二个字),于是扫描转换器 608 的地址总线 669 指向轨道表 610 中' 72 '行 B 列中表项,写入经总线 687 传来的表项内容。因此,轨道表 610 中' 72B '表项内容为 BN2 地址' 8118 '。
总线 657 上的分支目标的低位( BNY2 值' 18 ')被选择器 638 选择后放上比较模块 924 的一个输入 637 与来自块地址存储模块 920 的' 81 '行(由主动表 604 分配的 BN2X 值' 81 '被选择器 640 选择及经总线 639 送达)的各表项内容比较,发现该为' 18 '的值小于所有表项内容(即' 18 ' < ' 32 '),因此 BN1X 值' 72 '与 BN2Y 值' 18 '(地址在第' 18 '号字节的分支目标外部指令被写入了' 72 '号第一存储块)被写入块地址存储模块 920 中的' 81 '行中的 R 项。此时, R 项的值为' 7218 '。
扫描转换器 608 继续转换格式到二级缓存 606 中' 80 '块的字节' 30 ',发现该指令长度为 4 个字节,超出本块 2 个字节,于是在本二级缓存块地址加' 30 '(块内偏移)加' 4 '(指令字节数),产生下一外部指令块地址。该下一缓存块地址也被总线 657 送往主动表 604 匹配,发现该外部指令块已在(或正在从低层存储器读入)' 81 '号二级缓存块,扫描转换器 608 即从' 81 '号缓存块中读取需要的数据以完成开始于' 80 '号二级缓存块字节' 30 '的外部指令的转换,并将转换所得的内部指令按顺序填充进一级缓存器' 72 '块 C 项。因为这是' 80 '号二级存储块上最后一条外部指令,扫描转换器 608 要向轨道表 610 提供顺序下一条指令地址。此时,匹配所得的 BN2X 地址被主动表 604 从总线 671 送出,与总线 657 上的低位 BN2Y ( 30+4=34 ,抛弃超出 32 个字节宽的部分, 得值' 2 ')在总线 687 上合成一个 BN2 地址' 8102 '。本实施例处理指令流从一个指令块的最后一条指令转移到其顺序下一条指令的方式是将其视为一条无条件分支指令,即把总线 687 上的 BN2 地址作为一个目标地址,放到轨道表中一个指令块最后一道指令(地址' 72C ')之后的表项,且类型设为无条件分支。因此,扫描转换器 608 经总线 661 送出其值为' 72D '的地址,控制轨道表在' 72 '行 D 项写入 BN2 地址' 8102 '。
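The arithmetic used above to form the in-block offset of the sequentially next external instruction (30 + 4 = 34, discard the part beyond 32 bytes, giving '2') can be written out as a small sketch. It assumes 32-byte level-2 blocks as in this embodiment; the function name `next_sequential` is hypothetical, and the BN2X of the following block still has to come from the match in active table 604, which is not modelled here.

```python
BLOCK_BYTES = 32  # bytes per level-2 cache block in this embodiment


def next_sequential(bn2y, length, block_bytes=BLOCK_BYTES):
    """Return (in_next_block, next_bn2y) for the instruction after this one."""
    end = bn2y + length
    # If the end offset reaches or passes the block size, the next instruction
    # starts in the following block at the wrapped offset.
    return end >= block_bytes, end % block_bytes


print(next_sequential(30, 4))   # 4-byte instruction starting at byte 30 of block '80'
```

The same function covers the case in which an instruction ends exactly at byte 31: the next instruction then starts at offset 0 of the following block.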
循迹器 614 从轨道表' 72 '行 A 项开始读取轨道表中内容,因该行中 A 项不是分支指令,循迹器继续往右读取。循迹器 614 从' 72 '行 B 项读出' 8118 '判断是一个 BN2 地址,即将该地址经总线 631 送往块地址存储模块 920 及二级缓存 606 。该 BN2 地址从块地址存储模块 920 中读出其' 81 '行的表项内容。控制逻辑发现块地址存储模块 920 中' 81 '行所有的一级缓存块号都为无效,据此判断该 BN2 地址的相应外部指令尚未被转换成内部指令,即控制二级缓存 606 顺序读出地址从' 8118 '开始的部分' 81 '号二级缓存块直到' 8131 '(' 81 '块最后一个字节)上的外部指令提供给扫描转换器 608 进行格式转换。
扫描转换器 608 也因此向替换模块 611 请求一个可被替换的一级指令块号。替换模块 611 遵循一定的规则,比如 LRU 替换算法,以确定可替换的一级存储块,此时按顺序是' 70 '、' 71 '、' 73 '、' 74 '、' 75 '。因此,按顺序提供' 70 '号一级存储块供填充。扫描转换器 608 即据此将从二级缓存 606 中' 8118 '开始的外部指令转换成的内部指令按顺序填入一级缓存 602 中' 70 '号存储块中 A 、 B 、 C 、 D 各项, 并将 BN1 地址' 70A '写入轨道表 610 中' 72B '表项,代替原 BN2 地址' 8118 '。这是基于在二级缓存器中' 8118 '地址开始的外部指令的相应内部指令被存储在从' 70A '开始的一级缓存块内。请见图 9B 。
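Scan converter 608 finds that, after entry D of level-1 block '70' has been filled, the instructions at addresses '8118' to '8131' in level-2 block '81' have not all been converted: conversion has only reached the external instruction at address '8126'. It therefore requests another replaceable level-1 instruction block number from replacement module 611, which supplies level-1 block '71' as next in order. The controller then writes the BN1 value '71A' derived from that block number (the address of the first instruction in level-1 cache block '71'), together with the unconditional branch type generated by the controller, into entry E of row '70' of track table 610 as in the earlier example, so that tracker 614 jumps to the first instruction of cache block '71' when it reaches this point. Scan converter 608 goes on converting external instructions and fills them in order into level-1 block '71'. Scan converter 608 also stores the in-block offset BN2Y of the first byte of each external instruction at addresses '8118' to '8126', together with the in-block offset BN1Y of the corresponding internal instruction, in the format of the Fig. 7B example, into row '70' of in-block offset mapper 618, the row addressed by tracker pointer 631.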
从总线 657 中送出的 BNY2 值' 27 '被送往比较模块 924 与' 81 '行各表项比较。结果发现该 BNY2 值大于 R 表项中的 BNY2 值' 18 ',但小于 S 表项及其他表项中的 BNY2 值(均为' 32 ')。值' 7127 '被填入块地址存储模块 920 中' 81 '行的 S 表项,原 R 表项值' 7018 '不变,原 T 、 U 、 V 表项的值均右移一个表项。
因为扫描转换器 608 在' 8118 ' ~ ' 8131 '的外部指令中未发现分支指令,所以轨道表 610 中' 70 '行中的 A 、 B 、 C 、 D 各项中没有分支目标的记录。
扫描转换器 608 发现' 81 '行中从' 26 '字节开始的外部指令结束于' 31 '字节,没有延伸到下一个指令块,而且该外部指令相应的内部指令结束于' 71 '号存储块 B 项。因此,将如前例计算,匹配而分配得到的下一外部指令地址' 8200 '存入轨道表 610 中' 71 '行 C 项。主动表 604 如前例向低层存储器读取' 82 '号二级缓存块的相应外部指令块以填充' 82 '号二级缓存块。请见图 9C 。
处理器核执行轨道表中' 72B '项中的分支指令,其判断结果经信号 635 送往循迹器 614 。此时,该结果为不分支。循迹器 614 据此移向轨道表中同一行中下一轨迹点' 72C '读出后,发现为非分支指令,移向下一个表项' 72D '。读出后发现是一条目标为' 8102 '的无条件分支地址。控制器判断此为 BN2 地址,经总线 633 送出。总线 633 中高位被送到块地址存储模块 920 ,读出其中' 81 '行各表项内容送入比较模块 924 的一组输入端,而总线 633 中低位(其值为' 02 ')经选择器 638 选择后送到比较模块 924 的另一个输入端 637 相比较。比较结果为 637 上的 BNY2 值小于所有表项中的值,控制逻辑据此判断 BN2 地址为' 8102 '的外部指令尚未有相应的内部指令存于一级指令块中。控制逻辑控制二级缓存 606 从总线 633 送来的 BN2X 地址' 81 '及总线 679 上送来的地址' 00 '开始送外部指令到扫描转换器 608 以转换为内部指令。
扫描转换器 608 如前例请求并获得' 73 '号一级缓存块以顺序填充转换所得的内部指令。同时,因为总线 637 上的 BNY2 地址' 02 '小于所有' 81 '行中所有表项内容,如前例,值' 7302 '(代表 BNY2 为' 02 '的外部指令的相应内部指令被放入' 73 '号一级指令块)被放入' 81 '行 R 表项中,而原' 81 '行各表项都各右移一个表项。并且新值被写入的表项(此时为 R 表项)中的 BNY2 值' 18 '被送往扫描转换器 608 以通知扫描转换器 608 只需转换到' 18 '字节前一个字节,即' 17 '字节即可。
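While the converted internal instructions are being filled into level-1 cache block '73', the BN1 value '73A' produced by replacement module 611, together with the unconditional branch type generated by the controller, is written into entry '72D' of track table 610, replacing the BN2 value '8102' there with the BN1 value '73A'. Read pointer 631 of tracker 614 still points to entry '72D', so the value '73A' is read out on bus 633. The control logic determines that this is a BN1 value and accordingly has the level-1 cache read the internal instruction at address '73A' for processor core 601.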
扫描转换器 608 转换到' 81 '行第' 9 '字节结束的外部指令时,发现第' 73 '号一级指令块已填到 D 项,据此请求得到' 74 '号一级指令块继续转换并填充从第' 10 '字节开始的外部指令。如前例替换模块 611 产生的 BNX 值' 74A '连同控制器产生的无条件分支指令类型填入轨道表 610 中' 73 '行 E 项。从总线 657 中送出的 BNY2 值' 10 '如前例被送往比较模块 924 与' 81 '行各表项比较。结果发现该 BNY2 值大于 R 表项中的 BNY2 值' 02 ',但小于 S 表项中的 BNY2 值' 18 '及其他表项中的 BNY2 值。依前例,值' 7410 '被填入块地址存储模块 920 中' 81 '行的 S 表项,原 R 表项值不变,原 T 、 U 、 V 表项的值均右移一个表项。
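Scan converter 608 goes on converting external instructions and filling them into level-1 cache 602. The external instruction that ends at byte '17' is filled into entry B of level-1 cache block '74'. At this point scan converter 608 finds that it has reached the limit '18' supplied earlier by comparison module 924; it matches that limit against row '81' of block address storage module 920 and obtains '70', so '70A' and the unconditional branch type are stored in entry C of row '74' of track table 610. In an alternative implementation, the BN2 address '8118' is stored in entry C of row '74' of track table 610 and is mapped to a BN1 address later, when the tracker reads it out. See Fig. 9D.
While this conversion and the filling of level-1 cache 602 are in progress, tracker 614 continues along track '73'. Because entries '73B', '73C' and '73D' of the track table are all non-branch instructions, the tracker does not stop at them; it reads the unconditional branch target '74A' from entry '73E' and moves to row '74', advancing from entry A. At entry '74C' the tracker reads the unconditional branch target '70A' and moves to row '70'; at entry '70E' it reads an unconditional branch with target '71A'. Tracker 614 then moves to row '71' and, at entry '71C', reads an unconditional branch with target '8200'. The controller determines that this target is a level-2 cache block address and sends it over bus 631 to block address storage module 920, where matching shows that level-2 cache block '82' has no valid level-1 cache block. As a result, scan converter 608 starts converting all external instructions in cache block '82' into internal instructions and fills them into level-1 cache 602, beginning with level-1 block '75' supplied by replacement module 611. Scan converter 608 also fills the instruction types extracted during conversion and the computed branch targets into the corresponding entries of track table 610. The controller further has the BN1 address '75A' produced by replacement module 611, together with the unconditional branch type, written into entry '71C' of track table 610, to which tracker 614 currently points. The new content of that entry is read from the track table and sent directly over bus 631 to level-1 cache 602, which reads out the internal instruction for processor core 601.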
请见图 9E. 循迹器 614 沿' 75 '行前行在' 75B '处遇到一条条件分支指令,其目标为' 8116 ',该值为' 8116 '的 BN2 被送往块地址存储模块 920 匹配,发现其 BN2Y 值' 16 '大于' 81 '行 S 表项中 BN2Y 值 '10' ,但小于 T 表项中 BN2Y 值 '18' 。
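In Fig. 8D, comparison module 924 yields comparator outputs 973 and 974 = '0' and output 975 = '1'. Among the controls that select what drives bus 954, only signal 981 is '1' (signal 981 is the XOR of outputs 974 and 975), so the content '7410' of the entry in storage module 971 is placed on bus 954 and sent to the in-block offset mapping logic (comprising in-block offset mapping module 618, offset address converter 622 and subtractor 928). The level-1 cache block number BN1X in the entry is used as an address to read the mapping in row 74 of in-block offset mapping module 618, which is sent to offset address converter 622. Subtractor 928 subtracts the BN2Y on bus 954 (the start address, within the level-2 cache block, of the level-2 sub-block that corresponds to this level-1 cache block) from the BN2Y on bus 637 (the offset within the level-2 cache block); the difference (16 - 10 = 6) is the net offset of the BN2Y on bus 637 within that level-2 sub-block. From this net offset and the mapping above, offset address converter 622 derives the corresponding level-1 in-block offset BN1Y. Concatenating the BN1X on bus 954 with this BN1Y gives the level-1 cache address BN1, with value '74B', that corresponds to the level-2 cache address '8116'. This BN1 value can be written into entry '75B' of track table 610, replacing the original '8116', so that tracker 614 can control level-1 cache 602 to read instructions according to this BN1 value and the feedback from processor core 601. Scan converter 608 goes on converting the external instructions in row '82' of level-2 cache 606 and, after filling level-1 cache block '75', is allocated cache block '77' as the next sequential cache block. See Fig. 9F.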
在轨道表中循迹器需用的分支指令地址都由 BN2 转换为 BN1 后,循迹器 614 读出该等值后即可直接无间断(除等待处理器核 601 经总线 635 送来的条件分支决定外)控制一级指令缓存向处理器核 601 提供指令。
进一步地,根据本发明技术方案,所述处理器系统不但可以支持对应不同处理器平台的各种外部指令集(二进制码指令集),也可以支持对应虚拟机的字节码指令集,如作为 JAVA 解释器输入的字节码指令。此时,可以采用与之前实施例相同的方法将一条字节码指令转换为一条或多条内部指令供处理器核执行。鉴于字节码指令的特殊性,还可以在转换过程中做一些改进以提高性能。例如,对于一条需要常数进行运算的字节码指令,因为该常数是存储在存储器中的常量池内,因此按之前实施例所述方法会被转换为一条数据读取指令及相应的运算指令。然而,可以在扫描转换器审查发现该字节码指令是读取常数的指令时,提前将该常数从存储器中填充到数据缓存中。这样,当处理器核执行到该字节码指令对应的第一条内部指令(即数据读取指令)时,不会发生因数据读取造成的缓存缺失。
更进一步地,还可以在提前从存储器中获取到该常数时,直接将该常数以立即数的形式嵌入到相应的内部指令(即运算指令)中,从而可以省去所述数据读取指令。这样,当处理器核执行到该字节码指令对应的内部指令(即已嵌入该常数的运算指令)时,可以直接进行运算,从而进一步提高了处理器系统的性能。
此外,对于字节码指令中的栈运算指令,也可以用本发明所述的方法转换为对应的内部指令供处理器核执行,从而省去将字节码指令翻译为机器码指令的过程。在本发明中,一次栈运算被转换为一条内部指令,且该类内部指令的操作数不是寄存器堆中的寄存器值,而是操作数栈中位于栈顶的若干个寄存器值。此时,可以对处理器核中现有的寄存器堆增加相应的控制逻辑,使得该寄存器堆能用做栈寄存器。
请参考图 10A ,其为本发明所述操作数栈的一个实施例。在本实施例中,以一个栈运算最多需要两个操作数并得到一个运算结果为例进行说明。对于其他情况,也可以此类推。
在图 10A 中,寄存器堆 1001 同时支持两个读操作和一个写操作。其中,译码器 1003 、 1005 分别对送来的两个寄存器号译码后分别送往第一读端口和第二读端口,从总线 1013 和 1015 读出对应的寄存器值。译码器 1007 则对将被写入的寄存器的寄存器号进行译码,并送往写端口,使得总线 1017 上的值可以被写入对应的寄存器。寄存器 1011 中存储了栈顶指针值,即该寄存器堆作为操作数栈使用时栈顶指向的寄存器号。寄存器 1011 中的值通过总线 1045 被送到选择器 1053 、 1055 和 1057 ,以及减量器 1031 、增量器 1041 和控制器 1019 。其中,减量器 1031 和增量器 1041 分别对总线 1045 送来的栈顶指针值进行减一和增一的操作,并将相应结果分别通过总线 1043 和 1047 送往选择器 1053 、 1055 和 1057 。由于寄存器堆 1001 的容量有限,在作为操作数栈使用时若容量已满或接近满(即栈顶指针离栈底指针的距离达到一定程度)时,需要将栈底的一部分操作数按顺序存储到外部存储器(或缓存)中,并移动栈底指针,使得这部分寄存器可以容纳新被压入栈中的操作数,从而构成一个类似循环缓冲( Circular Buffer )的结构。同样地,当操作数栈已空或接近空(即栈顶指针离栈底指针的距离达到一定程度)时,需要将之前存储到外部存储器(或缓存)中的那部分操作数按逆序填充回操作数栈中,同时移动栈底指针,使得操作数栈能继续提供操作数。在本实施例中,控制器 1019 根据该栈顶指针值,产生一个新的栈底指针值经译码器 1009 译码后控制寄存器堆 1001 将原栈底指针和所述新栈底指针之间的寄存器值存储到外部存储器中,或从外部存储器将相应操作数填充到寄存器堆 1001 中原栈底指针和所述新栈底指针之间的寄存器中。
相应地,在内部指令中有一个指令域表示该内部指令是寄存器运算指令还是栈运算指令,该指令域的值通过控制线 1021 被送到选择器 1033 、 1035 和 1037 。当该内部指令是栈运算指令时,选择器 1033 、 1035 和 1037 均选择输入 A 并分别送往译码器 1003 、 1005 和 1007 ;当该内部指令是寄存器运算指令时,选择器 1033 、 1035 和 1037 均选择输入 B 并分别送往译码器 1003 、 1005 和 1007 。
这样,若一条内部指令是寄存器运算指令,则两个源寄存器号和一个目标寄存器号分别通过总线 1023 、 1025 和 1027 被选择器 1033 、 1035 和 1037 选择及经译码器 1003 、 1005 和 1007 译码后对寄存器堆寻址,从而读出及写入相应的寄存器值。该操作与现有技术类似,在此不再赘述。
若一条内部指令是栈运算指令,则上述三个存储寄存器号的指令域被用于存储栈顶指针移动信息。例如,对于一条从栈顶取出两个操作数运算并将结果存回栈顶减一的栈运算指令,其中一个操作数对应的寄存器号就是寄存器 1011 中存储的栈顶指针值,另一个操作数对应的寄存器号是该栈顶指针值减一,而运算结果对应的寄存器号也是该栈顶指针值减一。即,将位于栈顶的两个操作数出栈运算后,再将运算结果压回栈顶。此时,选择器 1053 受总线 1023 上的指令域控制,选择输入 D (当前栈顶指针值),从寄存器堆中读出第一个操作数;选择器 1055 受总线 1025 上的指令域控制,选择输入 H (当前栈顶指针值减一),从寄存器堆中读出第二个操作数;选择器 1057 受总线 1027 上指令域控制,选择输入 K (当前栈顶指针值减一),经译码后选中将被写回的寄存器。同时,选择器 1051 受总线 1029 上指令域控制,选择输入 N (当前栈顶指针值减一)作为新的栈顶指针值写回寄存器 1011 ,完成栈顶指针更新。
又如,对于一条将操作数压入操作数栈的指令,选择器 1057 受总线 1027 上指令域控制,选择输入 I (当前栈顶指针值加一),经译码后选中相应寄存器,从而将操作数写入该寄存器,实现压栈操作。同时,选择器 1051 受总线 1029 上指令域控制,选择输入 L (当前栈顶指针值加一)作为新的栈顶指针值写回寄存器 1011 ,完成栈顶指针更新。
又如,对于一条将操作数从操作数栈出栈的指令,选择器 1053 受总线 1023 上指令域控制,选择输入 D (当前栈顶指针值),经译码后选中相应寄存器读出操作数,实现出栈操作。同时,选择器 1051 受总线 1029 上指令域控制,选择输入 N (当前栈顶指针值减一)作为新的栈顶指针值写回寄存器 1011 ,完成栈顶指针更新。
此外,控制器 1019 中存储了当前栈底指针值,并对从寄存器 1011 送来的当前栈顶指针值进行判断。若栈底指针值与栈顶指针值接近到一定程度,说明操作数栈接近空,若之前有操作数被存储到外部存储器(或缓存),则需要将一定数目的操作数从外部存储器(或缓存)填充到寄存器堆中栈底以外部分,并更新栈底指针值。相应地,若栈底指针值与栈顶指针值远离到一定程度,说明操作数栈接近满,则需要将一定数目的操作数从寄存器中栈底开始部分存储到外部存储器(或缓存)中,并更新栈底指针值。
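See Fig. 10B, an embodiment of updating the stack bottom according to the present invention. In this embodiment, the operand stack is assumed to be nearly empty when the stack-bottom pointer and the stack-top pointer differ by '3', and one operand is filled at a time. At some moment the stack-bottom pointer points to register 1073 and the stack-top pointer to register 1079. After a pop, the stack-top pointer points to register 1077. The two pointers now differ by '3', so controller 1019 issues a signal to fetch from external memory (or cache) the last operand previously stored out, fills it into the register one position below the stack bottom (register 1071), and decrements the stack-bottom pointer so that it points to register 1071, keeping the number of operands in the stack greater than '3'.
See Fig. 10C, another embodiment of updating the stack bottom according to the present invention. In this embodiment, the operand stack is assumed to be nearly full when the stack-bottom pointer and the stack-top pointer differ by '7', and one operand is stored to external memory (or cache) at a time. At some moment the stack-bottom pointer points to register 1081 and the stack-top pointer to register 1091. After a push, the stack-top pointer points to register 1093. The two pointers now differ by '7', so controller 1019 issues a signal to store the operand at the stack-bottom pointer to external memory (or cache) and increments the stack-bottom pointer so that it points to register 1083, keeping the number of operands in the stack less than '7'.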
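The register-file operand stack of Figs. 10A to 10C, with its stack-top pointer, stack-bottom pointer and spill/fill behaviour, can be sketched as a circular buffer. This is a simplified model under assumptions that are not in the text: the register file has 8 registers, "differ by" is interpreted as the number of operands held in registers, and the class name `OperandStack` and its method names are hypothetical.

```python
class OperandStack:
    """Circular-buffer model of a register file used as an operand stack."""

    def __init__(self, size=8, low=3, high=7):
        self.regs = [None] * size
        self.size, self.low, self.high = size, low, high
        self.top = size - 1   # stack-top pointer; the first push lands in register 0
        self.bottom = 0       # stack-bottom pointer
        self.count = 0        # operands currently held in registers
        self.memory = []      # operands spilled to external memory (or cache)

    def _spill_if_nearly_full(self):
        if self.count >= self.high:                  # Fig. 10C: store one operand out
            self.memory.append(self.regs[self.bottom])
            self.bottom = (self.bottom + 1) % self.size
            self.count -= 1

    def _fill_if_nearly_empty(self):
        if self.count <= self.low and self.memory:   # Fig. 10B: fetch one operand back
            self.bottom = (self.bottom - 1) % self.size
            self.regs[self.bottom] = self.memory.pop()
            self.count += 1

    def push(self, value):
        self.top = (self.top + 1) % self.size        # write at stack top plus one
        self.regs[self.top] = value
        self.count += 1
        self._spill_if_nearly_full()

    def pop(self):
        if self.count == 0:
            raise IndexError("pop from empty operand stack")
        value = self.regs[self.top]                  # read at current stack top
        self.top = (self.top - 1) % self.size
        self.count -= 1
        self._fill_if_nearly_empty()
        return value

    def binary_op(self, fn):
        """Read top and top-1, write the result to top-1, move top down by one."""
        if self.count < 2:
            raise IndexError("binary operation needs two operands")
        below = (self.top - 1) % self.size
        self.regs[below] = fn(self.regs[below], self.regs[self.top])
        self.top = below
        self.count -= 1
        self._fill_if_nearly_empty()


stack = OperandStack()
for v in range(1, 11):
    stack.push(v)
stack.binary_op(lambda a, b: a + b)
print(stack.pop())
```

The thresholds are kept as constructor parameters because the text leaves "a certain degree" open; filling or storing several operands at a time would only change the two helper methods.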
根据本发明技术方案,每次填充或存储多个操作数的方法与图 10B 和图 10C 的实施例中所述类似,在此不再说明。此外,在上述实施例中通过对栈顶指针值和栈底指针值之间的差值做判断以确定操作数栈是否接近空或满。然而,也可以根据栈顶指针值的变化来判断。例如,自上一次调整栈底指针值以来,若栈顶指针值累计增加或减少到一定程度,即可进行相应的操作。
在图 7A 实施例中,将结束轨迹点视为一个无条件分支点,因此当循迹器读指针 631 指向结束轨迹点之前的那个轨迹点(即指令块中的最后一条指令),且该轨迹点不是分支点,或是分支转移没有发生的分支点时,循迹器读指针 631 继续更新、移动到结束轨迹点,并输出 BN1 送往一级缓存 602 。由于结束轨迹点不对应于真实的指令,循迹器读指针 631 要到下一个时钟周期才会更新为下一轨道的第一个轨迹点,因此在本时钟周期内,一级缓存 602 还需要向处理器核 601 输出一条空指令(即不会改变处理器核内部状态的指令,例如 NOP )供执行。在本发明中,可以对送到一级缓存 602 的寻址地址进行判断,一旦发现寻址地址对应结束轨迹点,则不需要访问一级缓存 602 ,直接输出空指令供处理器核 601 执行。然而,这样做的缺点是使得处理器核 601 多花费一个时钟周期用于执行无用的空指令。因此,可以对图 7A 进行改进,使得循迹器读指针 631 指向结束轨迹点的前一轨迹点时,根据该轨迹点的指令类型及处理器核 601 执行该指令的反馈,在下一时钟周期直接指向分支目标轨迹点或下一轨道的第一个轨迹点。
请参考图 11A ,其为本发明所述基于轨道表的缓存结构的另一个实施例。本实施例中的处理器核 601 、一级缓存 602 、扫描转换器 608 、二级缓存 606 、替换模块 611 、偏移地址映射模块 618 和选择器 692 、 696 、 694 均与图 7A 实施例相同。不同之处在于,轨道表 610 每次输出两个轨迹点的内容(循迹器读指针 631 指向的轨迹点内容 1182 及其后的一个轨迹点内容 1183 ),而循迹器中则增加了类型译码器 1152 、控制器 1154 和选择器 1116 。其中控制器 1154 执行图 7A 中未显示的控制器的类似功能,此处将其显示以便于说明较复杂的功能与操作。
在本实施例中,轨道表 610 的读端口在循迹器输出的读指针 631 的寻址下,输出两个相邻轨迹点的内容并放上总线 1117 与总线 1121 ,控制器 1154 则检测所述总线 1117 上的指令类型,类型译码器 1152 检测所述总线 1121 上的指令类型。在任一时刻,从轨道表 610 中读出两个表项:当前表项 1182 及其顺序下一个(右方)表项 1183 。当前表项 1182 中的内容经总线 1117 读出送往选择器 738 的一个输入及控制器 1154 。下一表项 1183 则经总线 1121 送出,送往类型译码器 1152 译码,其结果控制选择器 1116 。选择器 1116 的一个输入来源于总线 1121 ,另一个输入来源于读指针 631 中的 BN1X 及增量器 736 送来的增一后的 BN1Y (即读指针 631 中的 BN1Y 值增一)。类型译码器 1152 只对无条件分支指令类型译码,若总线 1121 上的类型为无条件分支指令类型,则控制选择器 1116 选择输出总线 1121 上的内容;若任何其他类型,则选择来源于总线 631 的 BN1X 与增量器 736 输出的增一后的 BN1Y 。
以下先考虑总线 1121 上的类型(即顺序下一个表项)不是无条件分支指令类型。此时,选择器 1116 选择来自增量器 736 的输出送往选择器 738 的一个输入。
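If controller 1154 decodes the instruction type on bus 1117 (the content of current entry 1182) as a non-branch instruction, it makes selector 738 choose the output of incrementer 736, as passed by selector 1116, as the input of register 740. Control signal 1111 from processor core 601 causes that input to be latched into register 740, so that the tracker moves right to the next address (the next address in sequence: BN1X unchanged, BN1Y + '1'). In this embodiment control signal 1111 is a feedback signal that processor core 601 provides to the tracker. It stays at '1' while the processor core operates normally, so register 740 in the tracker is updated every clock cycle and read pointer 631 points to a new entry in the track table and a new instruction in level-1 cache 602 for the processor core to execute. When processor core 601 operates abnormally and must stall the pipeline or cannot execute new instructions, control signal 1111 is '0', register 740 stops updating, the tracker and pointer 631 keep their state, and level-1 cache 602 pauses supplying new instructions to processor core 601.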
如果总线 1117 上该内容中的指令类型是无条件分支,则控制器 1154 控制选择器 738 选择总线 1117 上的分支目标地址,使得读指针 631 跳转到由总线 1117 上分支目标地址对应的轨迹点位置。
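If the instruction type on bus 1117 is a direct conditional branch, controller 1154 makes the tracker stop updating and wait until processor core 601 produces TAKEN signal 635, which indicates whether the branch is taken. In this case register 740 is controlled not only by control signal 1111 but also by signal 1161, generated by processor core 601 to indicate whether TAKEN signal 635 is valid; register 740 is updated only when signal 1161 shows TAKEN signal 635 to be valid and control signal 1111 is also asserted. If the branch is not taken (TAKEN signal 635 is '0'), selector 738 chooses the output of selector 1116 and operation proceeds as for a non-branch instruction. If the branch is taken (TAKEN signal 635 is '1'), selector 738 chooses bus 1117 and the branch target address on it is latched into register 740; pointer 631 then points to the entry for the branch target in the track table and to the branch target instruction in level-1 cache 602, which is read out for processor core 601 to execute.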
如果总线 1117 上的指令类型是 BN2 分支类型,则控制器 1154 控制循迹器中寄存器 740 暂停更新并等待,按前例将该 BN2 转换获得 BN1 地址,并写回轨道表中的原来间接分支表项。该表项经总线 1117 读出,此后处理与前例相同。循迹器沿该 BN1 并根据处理器核 601 反馈的指令执行结果(如:分支指令的执行结果),控制一级缓存 602 向处理器核 601 输出指令供执行。
如果分支转移没有发生,则如之前非分支指令的做法运行,如果分支转移发生,则如之前无条件分支指令的做法运行。
如果该内容中的指令类型是间接分支,控制器 1154 控制循迹器中寄存器 740 暂停更新,并等待处理器核 601 经总线 683 送出分支目标地址,并如前例被送往主动表 604 、块地址映射模块 620 匹配,以后操作与上例同。
如果表项 1183 中是无条件分支指令,则分支类型译码器 1152 对总线 1121 上的指令类型译码,使得选择器 1116 选择总线 1121 上的分支目标而不选择经增量器 736 提供的 BN1 (所述 BN1 即 BN1X 、 BN1Y+ ' 1 '),如此当处理器核 601 执行完表项 1182 相应的指令后,并不执行表项 1183 对应的指令(因为表项 1183 对应的可能是结束轨迹点,在一级缓存 602 中并无指令与其对应),而是直接执行表项 1183 中所含的分支目标地址的相应指令。
如果表项 1182 中是一条非分支指令,则如上所述执行完该指令后所执行的下一条指令就是表项 1183 中的分支目标所指向的指令。如果表项 1182 中是一条无条件分支指令,则执行完该指令后所执行的下一条指令就是表项 1182 中的分支目标所指向的指令,表项 1183 对该过程不产生影响。如果表项 1182 中是一条条件分支指令,则执行完该指令后所执行的下一条指令取决于处理器核 601 所产生的 TAKEN 信号 635 。如判断为分支转移发生( TAKEN 信号 635 为' 1 '),则选择器 738 选择总线 1117 上的分支目标,表示 TAKEN 信号 635 有效的信号 1161 控制将该目标存入寄存器 740 ,使指针 631 指向该分支目标,下一条执行的指令就是表项 1182 中分支目标地址所指向的指令。如判断为分支转移不发生( TAKEN 信号 635 为' 0 '),则选择器 738 选择经选择器 1116 输出的总线 1121 上的分支目标,表示 TAKEN 信号 635 有效的信号 1161 与控制信号 1111 控制将来自 1183 的无条件分支目标存入寄存器 740 使指针 631 指向该分支目标,下一条执行的指令就是表项 1183 中的无条件分支目标地址所指向的指令。
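The selection made by selectors 1116 and 738 for direct branches can be summarised in a few lines. The sketch below is a behavioural model only: the entry kinds `"plain"`, `"uncond"` and `"cond"` and the names `TrackEntry` and `next_pointer` are hypothetical, addresses are `(BN1X, BN1Y)` tuples, and BN2 targets and indirect branches (which need address conversion first) are left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Addr = Tuple[int, int]  # (BN1X, BN1Y)


@dataclass
class TrackEntry:
    kind: str                      # "plain", "uncond" or "cond" (labels assumed here)
    target: Optional[Addr] = None  # branch target, if any


def next_pointer(pointer: Addr, current: TrackEntry, following: TrackEntry,
                 taken: bool = False) -> Addr:
    """Value loaded into the tracker register for the next cycle."""
    bn1x, bn1y = pointer
    # Selector 1116: if the following entry is an unconditional branch (for
    # example an end track point), its target replaces the incremented address.
    if following.kind == "uncond":
        fall_through = following.target
    else:
        fall_through = (bn1x, bn1y + 1)
    # Selector 738: the current entry's target wins for an unconditional branch
    # or a taken conditional branch; otherwise use the fall-through address.
    if current.kind == "uncond" or (current.kind == "cond" and taken):
        return current.target
    return fall_through


end_point = TrackEntry("uncond", (73, 0))
print(next_pointer((72, 2), TrackEntry("plain"), end_point))
```

With this arrangement the end track point never occupies a cycle of its own, which is the improvement over Fig. 7A that the text describes.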
结束轨迹点中的无条件分支目标其地址也可以是二级缓存地址 BN2 。类型译码器 1152 在译码总线 1121 上读出的表项的指令类型时如果发现该地址为 BN2 格式,也可以将该总线 1121 输出的 BN2 放上总线 1117 按前例转换为 BN1 存回该表项。为了清晰及便于说明起见,这个路径在图 11A 中没有画出。
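In the Fig. 11A example, the type of this unconditional branch can be determined in four ways. In the first way there is only one unconditional branch type: no distinction is made between unconditional branch instructions originally in the program and the unconditional jump that the present invention adds at the end track point to transfer to the first entry of the next track. With this approach an unconditional branch instruction originally in the program is skipped and not executed by processor core 601, yet under the control of track table 610 and the tracker the program flow still correctly executes the branch's target instruction and the instructions that follow. This saves the clock cycle that executing the unconditional branch would have taken. Because processor core 601 has not executed that instruction, however, its program counter (PC) value will be off, and compensation is needed if an exact PC value must be maintained. The cache system of the present invention does not need the PC to supply processor core 601 with the instructions it is about to execute, without interruption. When the PC value at a given moment is needed (for example during debugging), it can be reconstructed: each row of the track table records the level-2 cache block address BN2X and the level-2 sub-block address that correspond to that level-1 instruction block. The tag read from active table 604 using BN2X, concatenated with the level-2 cache block address, the sub-block address and the BNY value in pointer 631, is the PC value of the instruction being executed.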
第二种方式为有两种无条件分支类型。其中,一种为结束点无条件分支类型对应轨道中每条轨道的结束点。对于这种结束点无条件分支类型,类型译码器 1152 将其视为该结束点不对应程序中一条指令,由此控制选择器 1116 选择总线 1121 上的分支目标,在执行完总线 1117 上的指令后直接跳转到总线 1121 上的分支目标地址。另一类对应程序中的无条件分支类型,类型译码器 1152 在译出这种类型时不将其作为分支处理,控制选择器 1116 选择增量器 736 的输出。当执行完总线 1117 上的表项内容的相应指令后,下一条执行的指令为其顺序下一条指令,即程序中原有的无条件分支指令。这种方式下处理器核中的 PC 则一直保持正确的值。
第三种方式为对图 11A 实施例进行改进,在扫描转换器 608 对指令块审查的过程中,如果发现一级指令块的倒数第二条指令不是有条件分支指令,且最后一条指令为非分支指令,扫描转换器 608 在这种情况下将结束轨迹点合并到该最后一条指令对应的轨迹点中。即,将该最后一条指令的指令类型标记为无条件分支指令,并将下一指令块第一条指令对应的 BN1 或 BN2 (若是 BN2 则循迹器读出时会按前例将其转换为 BN1 )作为轨迹点内容存储在该最后一条指令对应的轨迹点中。这样,当循迹器读指针 631 指向该指令对应的轨迹点时,除了从一级缓存 602 中读出该指令供处理器核 601 正常执行外,控制器 1154 将总线 1117 上指令类型译码发现是无条件分支类型,因此控制选择器 738 选择总线 1117 ,在下一时钟周期将读指针 631 更新为该无条件分支的分支目标 BN1 (即下一指令块第一条指令对应的 BN1 )。此时,处理器核 601 不需要浪费一个时钟周期执行空指令。
在扫描转换器 608 对指令块审查的过程中,如果发现一级指令块的最后一条指令(对应一条轨道中最后一个轨迹点)为分支指令,扫描转换器 608 在这种情况下不将结束轨迹点合并到该指令对应的轨迹点中,而将结束轨迹点的内容放置在每条轨道最后一条指令对应的轨迹点之后(右方)的轨迹点。当该最后一条指令是无条件分支指令时,控制器 1154 按总线 1117 上的无条件分支类型控制选择器 738 选择总线 1117 上的分支目标放上指针 631 ,跳转至该目标,结束轨迹点不会被执行。当该最后一条指令是条件分支指令时,控制器 1154 按总线 1117 上的条件分支类型控制循迹器暂停,等待处理器核 601 产生的分支判断信号 635 。此时类型译码器 1152 译出总线 1121 上的指令类型为无条件分支,控制选择器 1116 选择总线 1121 。当分支判断信号 635 为'分支',控制器 1154 控制选择器 738 选择总线 1117 上的条件分支目标放上指针 631 。当分支判断信号 635 为'不分支',控制器 1154 控制选择器 738 选择 1116 选择器的输出,将总线 1121 上的无条件分支目标放上指针 631 。一级缓存 602 按指针 631 送出指令供处理器核 601 执行。
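The three ways above apply to both fixed-length and variable-length instructions; that is, they do not require the end track point to sit at a fixed position in the track. If the position of the end track point in the track is fixed, however, the value of BN1Y in read pointer 631 can be used to tell whether the last instruction has been reached. In the fourth way, the track table has only one unconditional branch type, but the tracker treats it as two types according to where it sits in the track. Here BN1Y of pointer 631 is fed into type decoder 1152 and the instruction type on bus 1121 need not be decoded. When BN1Y points to the last entry of a track, type decoder 1152 makes selector 1116 choose the branch target on bus 1121, and after the instruction on bus 1117 has been executed control jumps directly to the branch target address on bus 1121. When BN1Y points to any entry other than the last entry of a track, type decoder 1152 makes selector 1116 choose the output of incrementer 736; after the instruction corresponding to the entry on bus 1117 has been executed, the next instruction executed is the sequentially next one. In this way the PC in the processor core always holds the correct value. This way suits fixed-length instructions.
In addition, when control module 1154 decodes the type of the track table 610 entry read on bus 1117 as a conditional branch instruction, the present invention can have processor core 601 execute speculatively along one of the two paths (speculative execution) to improve the processor's efficiency. See Fig. 11B, an embodiment of the present invention that supports speculative execution. Compared with the tracker of Fig. 11A, the tracker of Fig. 11B adds selector 1162 and register 1164, which select and hold the address of the path not chosen for speculative execution, for use if the guess proves wrong. The direction of speculation can be decided by existing static prediction or dynamic branch prediction techniques, or by a branch prediction field stored in the track table entry of the branch instruction.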
以猜测不分支为例,控制器 1154 在译出总线 1117 上的一个条件分支类型并获得不分支的预测值时,控制选择器 1162 及寄存器 1164 选择总线 1117 上的分支目标地址存入寄存器 1164 。同时控制器 1154 控制选择器 738 选择 1116 选择器的输出(其为分支指令的顺序下一条指令)供存入寄存器 740 ,使指针 631 控制一级缓存 602 提供分支指令后的顺序下一条指令供处理器核 601 执行,并向处理器核标记这条指令为猜测执行。指针 631 也指向轨道表 610 中分支指令后顺序第一个表项,使其被放上总线 1117 。之后控制器 1154 按总线 1117 上的指令类型决定循迹器的后续方向,继续向处理器核提供指令。所有这些指令都被标记为猜测执行。当总线 1161 通知分支判断信号 635 为有效时,控制器 1154 将预测的分支方向与 635 上的分支方向比较。若比较结果相同,则沿原猜测方向继续执行。若比较结果不同,此时控制器 1154 向处理器核 601 发送'猜测错误'的信号,使处理器核清除所有带猜测执行标记的指令及其中间执行结果。同时控制器 1154 控制选择器 738 选择寄存器 1164 的输出,使分支未被猜测执行的一支的地址被用于控制一级缓存器 602 向处理器核 601 提供指令,并沿此继续执行。
若猜测为分支,则控制器 1154 在译出总线 1117 上的一个条件分支类型并获得进行分支的预测值时,控制选择器 1162 及寄存器 1164 选择 1116 选择器的输出(其为分支指令的顺序下一条指令)存入寄存器 1164 。同时控制器 1154 控制选择器 738 选择总线 1117 上的分支目标地址供存入寄存器 740 ,使指针 631 控制一级缓存 602 提供分支指令的分支目标指令供处理器核 601 执行,并向处理器核标记这条指令为猜测执行。指针 631 也指向总线 1117 上的分支目标地址指向的轨道表 610 中表项,使其被放上总线 1117 。之后控制器 1154 按总线 1117 上的指令类型决定循迹器的后续方向,继续向处理器核提供指令。所有这些指令都被标记为猜测执行。当总线 1161 通知分支判断信号 635 为有效时,控制器 1154 将预测的分支方向与分支判断信号 635 上的分支方向比较。若比较结果相同,则沿原猜测方向继续执行。若比较结果不同,此时控制器 1154 向处理器核 601 发送'猜测错误'的信号,使处理器核清除所有带猜测执行标记的指令及其中间执行结果。同时控制器 1154 控制选择器 738 选择寄存器 1164 的输出,使分支未被猜测执行的一支的地址被用于控制一级缓存器 602 向处理器核 601 提供指令,并沿此继续执行。
现有的指令集转换技术通常用一个固定指令转换模块(有时将其称为译码器)将一种外部计算机指令集转换为内部指令集(有时称为微操作)后供执行内部指令集的处理器核执行。通常该转换模块位于存储外部指令的缓存和处理器核之间,处理器核提供的外部指令地址寻址缓存读出外部指令经转换模块转换为内部指令后供给处理器核执行。对外部指令的重复转换,不仅大幅增加功耗,而且时延较长的指令转换器在指令执行的关键路径上,需要较深的指令缓冲器( Instruction Buffer ),大幅加深了处理器核的流水线,从而增加了硬件开销和分支预测失败时的性能损失。当转换模块位于缓存之前时,缓存内存储的是可被处理器核直接执行的内部指令,但因为内部指令(一般是定长指令)和外部指令(可以是变长指令)一般不是一一对应的,因此在分支转移时缺乏可靠地将分支目标指令的外部指令地址(一般由外部指令编译器产生的分支偏移量及外部分支指令地址相加产生, 上述两者都以外部指令地址表达)转换为内部指令地址并以此在缓存中寻址正确的内部指令的方法及系统。导致现有的处理器宁愿承受上述因重复转换同一指令导致的功耗、性能、成本等损失,而将指令转换模块置于缓存与处理器核之间,而在一级指令缓存器存储外部指令的原因。虽然使用跟踪缓存、指令循环缓冲器等可以在程序执行路径( trace )命中或执行循环代码时避免所述实时地址转换,但跟踪缓存中会同时重复存储位于不同路径上的同一指令,造成很大的容量浪费,导致跟踪缓存的性能不高。这些存储器在某些特定条件下可以用特定的指令地址寻址,但无法让处理器核在任意条件下用指令地址可靠,高效地对存储内部指令的存储器实现如同正常缓存方式的寻址,不可避免地要经常,重复读取外部指令将其经转换器转换为内部指令,或者使用低效的软件方式将外部指令地址翻译成内部指令地址。总之,现有技术缺乏可靠,高效的方法及系统将外部指令地址转换为内部指令地址,是影响虚拟机效率的瓶颈。另外现有的指令转换器都是将固定的一种或少数种特定的外部指令集转换为内部指令集。
采用本发明所述的指令集转换系统和方法则可以将转换得到的内部指令存储在缓存中,并由地址映射模块完成对处理器核产生的外部指令地址向内部指令地址的转换,使得处理器核可以直接对已经存储在缓存中的内部指令寻址,而不需要处理器核反复对存储外部指令的缓存寻址,读出外部指令后经指令转换器转换为内部指令后供处理器核执行,多次反复的对一级缓存中的同一外部指令进行转换,从而避免上述功耗,关键路径上长时延,及额外的硬件开销成本问题。本发明所述的可配置指令转换器可以根据配置将任意种不特定的外部指令集转换为内部指令集。
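The instruction set conversion system of the present invention consists of two main parts, a converter and an address mapping module. The converter may be fixed or configurable. According to the technical solution of the present invention, when the instructions of the instruction set a processor core can execute (the internal instruction set) correspond one-to-one with the instructions of whatever instruction set needs to run (the external instruction set), a configurable converter can be used with the processor core to convert external instructions into internal instructions for the core to execute. In this case the branch target address of an external branch instruction is the same as the branch target address of the corresponding internal instruction, and no mapping from external to internal addresses is needed. See Fig. 12, an embodiment of a processor system containing a configurable converter according to the present invention. In this embodiment, external instructions 1205 are converted by configurable converter 1202 and stored in instruction memory 1203 for processor core 1201 to execute directly. Instruction memory 1203 thus holds internal instructions, and the function and structure of configurable converter 1202 are similar to those of converter 200 in the Fig. 2 embodiment. Because external and internal instructions correspond one-to-one, external and internal instruction addresses are identical. When processor core 1201 executes a branch instruction and the branch is not taken, the branch instruction address plus '1' is sent to instruction memory 1203 as the address of the next instruction, and the internal instruction is read for processor core 1201 to execute. If the branch is taken, the external branch target address, formed by adding the branch offset of the external instruction to the address of the branch instruction, equals the internal branch target address, so it can be used directly to address instruction memory 1203 and read the branch target internal instruction; no conversion from external to internal instruction address is required. For a non-branch instruction, the address of the next instruction is generated in the same way as for a branch that is not taken.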
采用本发明所述的可配置转换器的处理器系统能够根据需要进行配置,从而执行不同的外部指令集。请参考图 13A ,其为本发明所述可配置转换器的一个框图实施例。在本实施例中,存储器 201 如图 2 所述存储了内部指令集和外部指令集的转换规则。提取器 1302( 即图 3 中操作码提取器 211 , 213 , 215) 则如前所述从总线 1205 送来的外部指令中抽取出外部指令操作码作为寻址地址经总线 1307 送到存储器 201 寻址读出对应于该外部指令的转换规则,其中的掩膜及移位控制信号经总线 1308 控制移位模块 1303 (即图 2 中 221 , 223 , 225 , 227 )对外部指令中各个指令域(如操作数的寄存器堆地址)提取,掩膜并移位到内部指令的格式规定的位置;其中的内部指令操作码也经总线 1309 送出, 并按规则移位到内部指令格式规定的位置,与上述掩码、移位后的指令在合并模块 1304 (与图 2 中 207 相似)中合并为内部指令,经总线 1306 输出。这样,本发明所述可配置转换器就完成了将外部指令转换为内部指令的操作;改变存储器 1301 中的转换规则就可以使指令转换器与执行内部指令的处理器核的组合执行不同的外部指令集。
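The rule-driven conversion (extract the opcode, look up the rule, mask and shift each field, merge) can be illustrated with a toy model. Everything concrete in it is an assumption made for the sketch: the 16-bit external format, the 32-bit internal format, the opcode values, and the names `RULES` and `convert` do not come from the text.

```python
# Hypothetical rule table, standing in for the conversion rule memory:
# external opcode -> (internal opcode, list of field moves).
# Each field move is (right_shift, mask, left_shift).
RULES = {
    0x1: (0x20, [(8, 0xF, 11),    # rd: external bits 11..8 -> internal bits 15..11
                 (4, 0xF, 21),    # rs: external bits 7..4  -> internal bits 25..21
                 (0, 0xF, 16)]),  # rt: external bits 3..0  -> internal bits 20..16
}

EXT_OPCODE_SHIFT, EXT_OPCODE_MASK = 12, 0xF   # where the external opcode sits
INT_OPCODE_SHIFT = 26                         # where the internal opcode goes


def convert(external: int) -> int:
    """Convert one external instruction word into one internal instruction word."""
    opcode = (external >> EXT_OPCODE_SHIFT) & EXT_OPCODE_MASK  # opcode extractor
    internal_opcode, moves = RULES[opcode]                      # rule lookup
    internal = internal_opcode << INT_OPCODE_SHIFT              # place new opcode
    for right, mask, left in moves:                             # mask-and-shift step
        internal |= ((external >> right) & mask) << left
    return internal                                             # merged result


print(hex(convert(0x1234)))
```

Supporting a different external instruction set in this model means replacing `RULES` and the two opcode-position constants, which mirrors the statement that changing the stored conversion rules changes the instruction set the system executes.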
此外,还可以在所述可配置转换器中增加一个寄存器用于存储外部指令是定长( Fix Length )还是变长( Variable Length )的信息。当该寄存器被配置为定长(例如被配置为' 0 ')时,则表示外部指令在外部指令块中的边界是对齐的,因此在转换时可以从外部指令块的起始地址开始转换。当该寄存器被配置为变长(例如被配置为' 1 ')时,则表示外部指令在外部指令块中的边界不一定对齐,此时只能对目标指令开始直至该外部指令块中最后一条尚未被转换的指令进行转换。
进一步,可以在存储器 1301 中存储复数种外部指令集的转换规则,其中每种外部指令集各有其地址空间,不同的程序线程选择不同的转换规则地址空间。
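In this case, besides registers 212, 214 and 216 in Fig. 2 that control extraction of the external instruction opcode, a further register is added to store the base address in memory 201 of the instruction set conversion rules for the thread. These registers are replicated into a plurality of groups, one group per external instruction set, chosen by a selector. In addition, the thread-number storage of the processor's memory management unit (MMU), typically located in the TLB, gains one field per thread that stores the signal selecting among those register groups. See Fig. 13B, an embodiment of the memory in the configurable converter of the present invention. For example, register group 1311 stores the opcode extraction positions for instruction set P and the base address 'm' of its conversion rules in memory 201; register group 1313 stores the opcode extraction positions for instruction set Q and the base address 'n' of its conversion rules in memory 201.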
当线程 J 的外部指令由指令转换器转换时, MMU 中 J 线程的选择信号 316 控制选择器 315 选择寄存器组 1311 的输出。此时,操作码提取器 1302( 即图 3 中操作码提取器 211 , 213 , 215) 受寄存器组 1311 的控制对被转换的外部指令提取操作码;该操作码与也来自寄存器组 1311 的基地址' m '被加法器 1318 相加后作为地址对转换规则存储器 201 寻址, 控制指令转换器的操作,将 P 指令集指令转换成内部指令存入图 12 中指令存储器 1203 。当线程 K 的外部指令由指令转换器转换时, MMU 中 K 线程的存储器的选择信号 316 控制选择器 315 选择寄存器组 1313 的输出。此时,操作码提取器 1302 受寄存器组 1313 的控制对被转换的外部指令提取操作码;该操作码与也来自寄存器组 1313 的基地址' n '被加法器 1318 相加后作为地址对转换规则存储器 201 寻址, 控制指令转换器的操作,将Q指令集指令转换成内部指令存入图 12 中指令存储器 1203 。如此处理器核在从 J 线程切换到 K 线程时,实际上是从执行 P 指令集指令转换为执行Q指令集指令。如此可实现在一个本发明所公开的虚拟机中执行含有复数种外部指令集中的指令的程序。当然用复数个指令转换器,每个负责转换一种外部指令集,也可实现同样的功能。
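In some computer instruction sets, several fields of an instruction are orthogonal, that is, independent of one another. For example, apart from the opcode field, some instruction sets use encodings in certain fields to address particular memories or registers; these fields also have to be mapped by conversion rules, since merely shifting the address in the external instruction does not satisfy the requirements of the internal instruction. In such cases a plurality of conversion rule memories, each with its own logic, can serve the orthogonal fields, keeping the total number of entries (rows) in the conversion rule memories to a reasonable figure. See Fig. 13C, another embodiment of the memory in the configurable converter of the present invention. Compared with Fig. 13A, Fig. 13C adds a conversion rule memory 1321 with its dedicated extractor 1322 (same function as 1302) and shift logic 1323 (same function as 1303). Register groups like 1311 and 1313 of the Fig. 13B example (not shown in Fig. 13C) are also added to control the new memory 1321 and its logic. The outputs of the new memory 1321 and of mask-and-shift logic 1323 are sent to merger 1304, where they are merged with the outputs of the original memory 201 and mask-and-shift logic 1303. The two sets of memory and logic can cooperate on a single computer instruction set, each converting some of the fields of the external instruction, with the results merged into an internal instruction in merger 1304. They can also operate independently, each converting one external instruction set into internal instructions, which provides the function of Fig. 13B. A writable register can be added for this purpose, whose state determines whether the instruction converter of Fig. 13C operates in cooperative or independent mode.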
此外图 13A 中合并模块 1304 还要根据内部指令的转换顺序产生与外部指令的映射关系,例如图 8A 或图 8B 所示的例子,以供填写块地址偏移映射器 YMAP 等。合并模块 1304 还产生写地址,控制将内部指令填入指令存储器 1203 等。如果内部指令是定长的则每对指令存储器 1203 写进一条指令,一级缓存写地址加上一个固定长度,如 4 个字节。如果内部指令是变长的则存储器 1301 中对应该指令的转换规则中要记载该指令的长度,每对指令存储器 1203 写进一条指令,一级缓存写地址加上从存储器 1301 输出的该指令的长度, 作为下一指令的起始地址。也可以将一个内部指令块的复数条内部指令分次存入一个缓冲器,将整个内部指令块一起写入指令存储器 1203 。也可以由其他模块产生上述映射关系及写地址,如在图 7A ,图 7B 中由扫描转换器中负责扫描的部分负责。
采用本发明所述可配置转换器的处理器系统可以在外部指令集与内部指令集的指令一一对应的情况下工作。然而,当两种指令集的指令不一一对应时,会有一条外部指令被转换为多条内部指令,或多条外部指令被合并为一条内部指令的情况发生;又或者外部指令或内部指令中至少一种为变长指令;从而有可能导致外部指令的分支目标地址与相应内部指令的分支目标地址不一一对应。此时,可以用本发明所述的地址映射模块结合指令转换器实现指令集转换及指令地址的映射。请参考图 14 ,其为本发明所述包含指令转换器和地址映射模块的处理器系统的一个实施例。在本实施例中,外部指令经转换器 1202 转换后被存储在指令存储器 1203 中,供处理器核 1201 直接执行。即指令存储器 1203 中存储的是内部指令,且指令存储器 1203 根据内部指令地址寻址输出相应的内部指令。在转换过程中,转换器 1202 还产生外部指令与相应内部指令的对应关系存储到地址映射模块 1404 中。当处理器核 1201 按指令顺序执行指令存储器 1203 中的内部指令时,其程序计数器 PC 每次增加' 1 ',使得相应的内部指令地址增' 1 ',从而对指令存储器 1203 寻址以读出下一条内部指令。当处理器核 1201 执行分支指令产生分支目标地址时,由于该分支目标地址是以外部指令地址形式表示的,因此被送到地址映射模块 1404 按前述方法转换为对应的内部指令地址后再送到指令存储器 1203 寻址以读出相应的内部指令(即分支目标指令)。具体地,若地址映射模块 1404 中已经存储了所述外部指令地址对应的映射关系,则说明该外部指令对应的内部指令已经被存储在指令存储器 1203 中,可以直接将所述外部指令地址转换为内部指令地址输出。若地址映射模块 1404 中尚未存储所述外部指令地址对应的映射关系,则说明该外部指令尚未被转换为内部指令。此时,由转换器 1202 将包含所述外部指令在内的至少一条外部指令转换后存储到指令存储器 1203 中,并将对应的映射关系存储到地址映射模块 1404 中,这样就可以将所述外部指令地址转换为内部指令地址输出。在此,转换器 1202 可以是固定将一种特定外部指令转换为内部指令的转换器,也可以是图 2 ,图 3 ,图 4 ,图 5 及图 13A 、 B 中公开的可配置指令转换器。
根据本发明技术方案,地址映射模块 1404 可以由映射表构成。所述映射表可以由外部指令地址寻址,其表项内存储了相应内部指令的地址。在此基础上,所述映射表可以有多种具体实现方式。
方式一:映射表中的每个表项均由外部指令地址的最小单位(例如:字节)寻址,每个表项中都存储了该表项对应的外部指令对应的内部指令所在的内部指令块的块地址(即内部指令块在指令存储器 1203 中的块号),以及内部指令在所述内部指令块中的块内地址偏移地址。这样,在对外部指令地址进行转换时,可以根据所述外部指令地址对映射表的表项寻址,读出相应表项中的内部指令块地址及块内偏移地址,完成地址转换。
方式二:当外部指令的长度不固定时,可以对方式一所述映射表进行压缩以消除空的表项。以外部指令按字节寻址为例,由于外部指令长度不固定,只有每条外部指令起始地址字节才占据一个表项,存储该外部指令的块内偏移以及相应的内部指令块内偏移地址,而其余外部地址非起始地址字节不占据表项。在此,映射表每行对应一个外部指令块,可以由外部指令块地址寻址。这样,在对外部指令地址进行转换时,可以根据所述外部指令的块地址对映射表的行寻址,读出整行内容。之后,用所述外部指令的块内偏移地址对该行所有表项中的外部指令块内偏移地址进行匹配,选择并输出匹配成功项中存储的内部指令地址,完成地址转换。
方式三:映射表中的每行由两部分组成,第一部分包含数据的位数与一个外部指令块包含多少个最小地址单位相同(例如,数据位数与外部指令块包含的字节数相同),第二部分包含数据的位数与一个内部指令块可能包含的最多内部指令数相同。第一部分中对应各外部指令起始地址(即起始字节)的数据置为' 1 ',其余均为' 0 ',而第二部分中对应各外部指令相应第一条内部指令对应的数据置为' 1 ',其余均为' 0 ',具体格式可以参考图 8B 。这样,在对外部指令地址进行转换时,可以根据所述外部指令的块地址对映射表的行寻址,读出整行内容(包括两个部分)。之后,根据所述外部指令的块内偏移地址对第一部分直至该块内偏移地址字节对应的数据为止的' 1 '进行加' 1 '计数,再对计数结果根据第二部分中的' 1 '进行减' 1 '计数,直到计数结果为' 0 ',此时第二部分中的计数位置对应的就是内部指令的块内偏移地址,完成地址转换。 如图 8C 的装置可以高效完成上述映射。
进一步,可使外部指令块与内部指令块有固定的对应关系(如一个存储外部指令的二级缓存块可以等分成两个二级缓存子块,其中每个子块对应一个存储内部指令的一级缓存块)。如此可以将外部指令与内部指令的映射操作分解为块地址的映射操作(因为有对应关系,所以易于实现),以及块内偏移地址的映射两个部分来实现以简化映射的难度。如此一级缓存块不一定每个表项含有有效的内部指令。以下以一级指令在一级指令块中从最小块内偏移地址(一般为' 0' )开始增序排列。如此对应每个指令块还需要存储其偏移地址最大的指令的偏移地址以提醒系统下个周期要提供按程序顺序下一条指令的一级缓存块地址。此外也需要一个块内偏移映射器根据该二级指令子缓存块与其对应的一级指令缓存块之间的映射关系(如上述三种方式等)提供分支目标的块内偏移映射。
请参考图 15 ,其为本发明所述包含可配置转换器和地址映射模块的处理器系统的另一个实施例。本实施例中转换器 1202 、指令存储器 1203 和处理器核 1201 均与图 12 、 14 中的相同,另外还给出了地址映射模块的一种具体实施方式。在本例中,如果指令存储器 1203 缺失,可将相应的外部指令地址送往更外层次存储器获取相应的外部指令块经指令转换器 1202 如前述转换并填充到指令存储器 1203 中。以下各实施例的说明均假设指令存储器 1203 总是命中。
地址映射模块由标签存储器 1505 (相当于前述实施例中的主动表 604 )、块内偏移映射器 1504 (为了简单明了, 1504 中包含图 6 中 618 偏移地址映射模块和 622 偏移地址映射器的功能)和结束标志存储器 1506 构成,三者的行均与指令存储器 1203 中的内部指令块对应。其中,结束标志存储器 1506 的每行存储了指令存储器 1203 中对应内部指令块的最后一条内部指令的块内偏移地址。可以在处理器核 1201 读取内部指令的同时在结束标志存储器 1506 中检查该内部指令是否为当前内部指令块中的最后一条。若该内部指令不是当前内部指令块中的最后一条,则下一内部指令的块内偏移地址就是该内部指令的偏移地址加一;否则,下一内部指令就是下一内部指令块的第一条内部指令。
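The role of end flag memory 1506 in generating the sequential next address can be shown with a minimal sketch. It assumes that `end_offsets` maps a block address to the in-block offset of its last internal instruction and that the sequentially next block has address `block + 1`; both the dictionary representation and the function name `next_internal_address` are illustrative only.

```python
def next_internal_address(block, offset, end_offsets):
    """Return (block, offset) of the sequentially next internal instruction."""
    if offset == end_offsets[block]:
        # Last instruction of the block: continue at the first instruction
        # of the next block.
        return block + 1, 0
    # Otherwise stay in the same block and advance the in-block offset by one.
    return block, offset + 1


end_offsets = {5: 3, 6: 2}   # hypothetical contents of the end flag memory
print(next_internal_address(5, 3, end_offsets))
```

Because the end offset is stored per block, internal blocks do not all have to contain the same number of valid instructions.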
标签存储器 1505 中的每行存储了外部指令块地址(即标签),因此可以通过标签匹配找到该外部指令所在指令块对应的内部指令块在指令存储器 1203 中的位置,以及与该内部指令块同一行中块内偏移映射器 1504 中相应的映射关系、结束标志存储器 1506 中该指令块中最后一条内部指令的位置信息。与缓存结构类似,对于不同的存储器组织形式,标签存储器 1505 和指令存储器 1203 可以有不同的结构。具体地,以直接映射存储结构为例,所述外部指令的块地址可以被进一步分为标签和索引号,根据索引号对标签存储器 1505 中的行寻址读出相应行的内容后与块地址中的标签进行比较,若相等则匹配成功,否则匹配不成功。当匹配不成功时可以用该外部指令地址从更低的指令存储器中取得相应的外部指令块经指令转换器 1202 如前转换为内部指令块后按缓存替换规则写入指令存储器 1203 中,并由将外部指令中的标签写入标签存储器 1505 的同一行,将指令转换器 1202 产生的块内偏移映射关系存入块内偏移映射器 1504 、将 1202 产生的该指令块最后一条指令的块内偏移量存入结束标志存储器 1506 的同一行。当然,标签存储器 1505 和指令存储器 1203 也可以被组织为其他任何合适的组织结构(例如:组相联或全相联结构),其具体匹配方法均与缓存中相应组织结构情况下的匹配方法相同,在此不再赘述。为了便于描述,在后面的实施例中均以直接映射结构为例进行说明,且假设标签匹配均成功。
处理器核 1201 根据是否需要分支或跳转通过总线 1508 提供不同的指令地址。当一条指令地址经总线 1508 输出以控制指令存储器 1203 读取指令供处理器核 1201 执行时, 1508 上的块地址也被送到结束标志存储器 1506 中寻址读出该行的结束地址,与 1508 上的内部指令块内偏移地址进行匹配以检查该内部指令是否为内部指令块中的最后一条。如该指令不是内部指令块中最后一条指令时,则结束标志存储器输出的 1507 信号控制处理器核 1201 在下一时钟周期的指令块地址不变,块内偏移量增' 1 '在下个周期放上总线 1508 。若是最后一条,则结束标志存储器输出的 1507 信号控制处理器核 1201 在下个周期输出下一指令块的外部指令块地址(由当前指令块地址增' 1 '所得)并以' 0 '作为内部指令的块内偏移地址,组合成指令地址放上总线 1508 。此时 1507 也控制将 1508 上的指令块地址送往标签存储器 1505 匹配,如匹配,则总线 1508 上就是下一条指令的正确地址。在被执行的指令为非分支指令时,分支判断信号 1509 控制选择器 1510 选择总线 1508 上的块内偏移地址对指令存储器 1203 寻址读取下一周期的内部指令供处理器核 1201 执行。用于对指令存储器 1203 的块地址则在任何时候都来自总线 1508 。
但当处理器核 1201 译码上述来自指令存储器 1203 的指令发现其为一条分支指令时,则按照指令进行分支判断。如果分支判断为'不分支',则下一周期产生的地址如同上述一般。分支判断信号 1509 控制选择器 1510 选择总线 1508 上地址。如果分支判断为'执行分支',则以分支指令的外部指令地址加上分支指令中所含有的分支偏移量获得分支目标的外部指令地址经总线 1508 在下一周期送出。为了减少对地址值的存储,实际上处理器核只记录了分支指令(或其他指令)的内部指令地址。可以用例如图 8B 的映射关系,以图 8C 中的映射装置进行逆运算,即以内部指令地址送到译码器 805 ,及内部指令映射关系送到 807 作为输入,外部指令的映射用以控制矩阵 803 ,则该装置的输出即为外部指令地址。也可以在进行指令转换时,将外部分支指令的外部块内偏移量加到该分支指令的分支偏移量,将其和作为分支偏移量记录到内部分支指令中。如此在处理器核 1201 执行到分支指令时, 只需要将指令块地址(块内偏移量为' 0 ')加上分支指令中所记录的修正后分支偏移量,其和就是正确的外部分支目标地址,省却分支指令内部指令块内偏移量映射到外部指令块内偏移量的操作。
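The second option above, folding the branch instruction's external in-block offset into the branch offset at conversion time, is plain arithmetic and can be checked with a short sketch. The addresses and the function names `corrected_offset` and `branch_target` are hypothetical.

```python
def corrected_offset(branch_in_block_offset, branch_displacement):
    """Computed once, during instruction conversion, and stored in the internal branch."""
    return branch_in_block_offset + branch_displacement


def branch_target(block_base, stored_offset):
    """Computed at execution time from the block address (in-block offset 0)."""
    return block_base + stored_offset


block_base, in_block, displacement = 0x1000, 9, -0x20   # example values only
stored = corrected_offset(in_block, displacement)
# The result equals the conventional target: branch address plus displacement.
print(hex(branch_target(block_base, stored)), hex(block_base + in_block + displacement))
```

The benefit is that the processor core never needs the external in-block offset of the branch instruction itself, so the reverse mapping from internal to external offset is avoided.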
该外部指令分支目标地址中的块地址经总线 1508 被送到标签存储器 1505 匹配,也被送到块内偏移映射器 1504 读出该行的映射关系将 1508 上的外部块内偏移量映射为内部指令块内偏移量 1512 。分支判断信号 1509 控制选择器 1510 选择 1512 作为块内偏移量送往指令存储器 1203 。 1508 总线上的块地址也被送往指令存储器 1203 。若标签存储器 1505 匹配成功,则以该地址取分支目标指令供处理器核执行。
实际上在本实施例中,总线 1508 上的下一条指令的块地址(包含指令地址中标签及索引部分)一直是外部指令地址。其中索引部分被用于对所有存储器如 1505 , 1504 , 1516 及 1203 做行寻址。 而 1508 上的下一条指令的块内偏移地址则根据指令的类型等可以是外部指令地址或内部指令地址。如果当前指令的类型为非分支指令或分支指令但不执行分支时,且该指令非内部指令块中最后一条指令,则下一条指令的块内偏移地址是内部指令格式(当前指令地址增' 1' ,指向当前内部指令的下一条内部指令)。如果当前指令的类型为非分支指令或分支指令但不执行分支时,且该指令是内部指令块中最后一条指令,则下一条指令的块内偏移地址' 0 '可以被视为外部指令格式,也可被视为内部指令格式。如果当前指令的类型为分支指令且执行分支时,则下一条指令的块内偏移地址是外部指令格式,要经过块内偏移映射器 1504 映射为内部块内偏移指令地址才可用于从指令存储器 1203 中读取指令。如果将外部地址中的索引部分视为内部指令地址的块地址,则指令存储器 1203 在任何时刻都是由内部指令地址寻址。 如果指令存储器 1203 及指令地址映射模块的组织方式是多路组,则相似地内部指令的块地址由路号( Way number )与外部指令中的索引部分组成。即本实施例中公开的虚拟机中的地址映射模块,可以将由外部指令编译器产生的外部指令地址直接映射为内部指令地址访问存储内部指令的指令存储器,供处理器核执行。或者也可以将内部指令地址的块地址视为等同于外部指令地址的块地址(包含标签部分及索引部分)。本虚拟机既避免了现有软件虚拟机通过软件将外部指令地址映射为内部指令时的低效及储存庞大的地址映射表的开销;也避免了现有硬件虚拟机由外部指令地址对存有外部指令的指令存储器寻址,读出外部指令,将其由指令转换器转换为内部指令再由处理器核执行,因为多次重复转换同一条指令导致的高功耗。 本虚拟机的一个技术特征是外部指令先经过指令转换器转换才存入指令缓存,因此指令缓存中储存的是内部指令,可不需指令转换直接执行。
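According to the technical solution of the present invention, a branch target table can also be added to record the internal instruction addresses of branch target instructions, so that when the same branch instruction is executed repeatedly and its branch is taken, the external instruction address of the branch target need not be converted to an internal instruction address every time. Refer to FIG. 16, which shows an embodiment of a processor system including a branch target table according to the present invention. In this embodiment, configurable converter 1202, instruction memory 1203, processor core 1201, tag memory 1505, in-block offset mapper 1504 and end flag memory 1506 are the same as in FIG. 15. The differences are that a branch target memory (BTB) 1607 is added and that selector 1608 is connected differently from selector 1510 in FIG. 15. Branch target memory 1607 stores branch target history in the form of internal instruction addresses: the internal instruction address of the branch instruction itself, the internal instruction address of its branch target, and prediction information on whether the branch was taken when the instruction was previously executed. Branch target memory 1607 need not correspond row by row to the other memories. Branch target memory 1607 outputs branch prediction signal 1511 to control selector 1608 to select the instruction address from bus 1508 or from branch target memory 1607.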
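Thus, while processor core 1201 outputs an internal instruction address on bus 1508 to address instruction memory 1203, that internal instruction address is also sent to branch target memory 1607, where it is matched against the internal instruction addresses of all the branch instructions stored there, and the branch target internal instruction address and prediction information of the matching entry are output. When the current instruction is a non-branch instruction, or a branch instruction predicted not taken, branch prediction selection signal 1511 in the next clock cycle controls selector 1608 to select the instruction address on bus 1508 to access instruction memory 1203; the operation is the same as that of the FIG. 15 embodiment for the same instruction and is not repeated here. When the current instruction is a branch instruction predicted taken, branch prediction selection signal 1511 controls selector 1608 to select the internal branch target address output by branch target memory 1607 to access instruction memory 1203. When the current instruction is a branch instruction but misses in branch target memory 1607, an entry is allocated in branch target memory 1607 according to the replacement policy to store the internal instruction address of the branch instruction. If the branch decision is 'taken', processor core 1201 generates the external instruction address and sends it out on bus 1508 as in the FIG. 15 example. The instruction block address confirmed by matching this external instruction address in tag memory 1505, together with internal in-block offset 1512 obtained from in-block offset mapper 1504 as in the FIG. 15 example, forms the internal branch target address, which is stored with the branch prediction value in the corresponding fields of the newly allocated entry of branch target memory 1607. This internal branch target address is also bypassed by branch target memory 1607 through selector 1608 to access instruction memory 1203. If the branch decision is 'not taken', the newly allocated entry in branch target memory 1607 is marked invalid, and branch prediction selection signal 1511 controls selector 1608 to select the instruction address on bus 1508 (at this point the address of the internal instruction sequentially following the branch instruction) to access instruction memory 1203; the address on bus 1508 is then the same as the one generated under the same conditions in the FIG. 15 example and is not described again. When execution of a branch instruction shows that the branch prediction was wrong, processor core 1201 discards the intermediate results of the instructions executed along the mispredicted path, executes the correct branch, and updates the branch prediction stored in branch target memory 1607.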
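The hit, miss and update behaviour of branch target memory 1607 can be modelled with a small table. The sketch below makes several simplifying assumptions: the class and method names are hypothetical, addresses are `(block, offset)` tuples, capacity and the replacement policy are ignored, and a one-bit "last outcome" prediction is used.

```python
class BranchTargetMemory:
    """Toy model of branch target memory 1607 (no capacity limit)."""

    def __init__(self):
        # branch internal address -> [target internal address, predict_taken]
        self.entries = {}

    def select_fetch(self, fetch_addr, sequential_addr):
        """Model of selector 1608: use the stored target only on a
        hit whose prediction is 'taken'; otherwise go sequential."""
        entry = self.entries.get(fetch_addr)
        if entry is not None and entry[1]:
            return entry[0]
        return sequential_addr

    def fill_on_miss(self, branch_addr, taken, internal_target):
        """After a miss: the entry is kept only if the branch was
        taken (a not-taken branch leaves the entry invalid)."""
        if taken:
            self.entries[branch_addr] = [internal_target, True]

    def update_prediction(self, branch_addr, taken):
        """Record the actual outcome after a misprediction."""
        if branch_addr in self.entries:
            self.entries[branch_addr][1] = taken
```

Once an entry is filled, repeated taken executions of the same branch fetch the target directly from the table and skip the external-to-internal address translation.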
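Refer to FIG. 17, which shows another embodiment of a processor system according to the present invention, including a branch target table and a tracker. In this embodiment, converter 1202, instruction memory 1203, processor core 1721, tag memory 1505, in-block offset mapper 1504, end flag memory 1506 and branch target memory 1607 are the same as in FIG. 16. The difference is that this example also includes next block address memory 1709, selector 1711, OR logic 1707 and a tracker; the tracker generates the internal instruction addresses, so that processor core 1721 only needs to output external instruction addresses.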
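Next block address memory 1709, newly added in this embodiment, corresponds row by row to tag memory 1505, in-block offset mapper 1504 and end flag memory 1506; see the embodiment of FIG. 18A for its format. In this example each row of the next block address memory contains two parts: first part 1801 stores the X address of the internal instruction block preceding the internal instruction block of that row, and second part 1802 stores the X address of the internal instruction block following it. Thus, addressing next block address memory 1709 with the block address of the current internal instruction block (that is, the X address output by the tracker) reads out the X addresses of the sequentially previous and next internal instruction blocks. Selector 1711, controlled by TAKEN signal 1713, which processor core 1201 outputs to indicate whether a branch is taken, selects between the address of the first internal instruction of the next internal instruction block (formed from the X address of the next internal instruction block output by next block address memory 1709 and Y address '0') and the branch target internal instruction address output by branch target memory 1607, and sends the result to selector 1705. OR logic 1707 controls selector 1705 to select the input from selector 1711 when the current internal instruction is the last instruction of its internal instruction block or when a branch is taken.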
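The tracker consists of register 1701, incrementer 1703 and selector 1705. Register 1701 stores and outputs current internal instruction address 1723, which consists of a block address (hereinafter the X address) and an in-block offset address of the internal instruction (hereinafter the Y address). Current internal instruction address 1723 addresses instruction memory 1203 to read an internal instruction from one of its rows and send it to processor core 1721 for decoding; at the same time it accesses the corresponding rows of next block address memory 1709 and end flag memory 1506, and is sent to branch target memory (BTB) 1607 for matching. The X address in 1723 addresses end flag memory 1506, and the content read from the corresponding row is matched against the Y address in 1723 to check whether the instruction is the last one in its internal instruction block. If it is not the last one and processor core 1721 determines from decoding that it is not a branch instruction, OR logic 1707 controls selector 1705 to select the X address output by register 1701 and the Y address incremented by '1' by incrementer 1703, which are stored in register 1701 as the current internal instruction address for the next clock cycle.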
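If the instruction is the last internal instruction or is a branch instruction, selector 1705, under control of OR logic 1707, selects the output of selector 1711, which is stored in register 1701 as the current internal instruction address for the next cycle. Specifically, if branch decision signal (TAKEN) 1713 is then 'not taken', it controls selector 1711 to select the address of the first internal instruction of the next internal instruction block, provided by next block address memory 1709 as addressed by current internal instruction address 1723; this address passes through selector 1705 and is stored in register 1701. If branch decision signal (TAKEN) 1713 is then 'taken', it controls selector 1711 to select the branch target internal instruction address obtained by matching current internal instruction address 1723 in branch target memory 1607; this address passes through selector 1705 and is stored in register 1701. The branch prediction value stored in branch target memory 1607 can also be used instead of branch decision signal 1713 generated by processor core 1721 to control selectors 1711 and 1705. This approach requires a mechanism that verifies whether the branch prediction is correct and corrects it once a misprediction occurs.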
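The tracker's choice of the next (X, Y) address reduces to two nested selections. The function below is a minimal sketch, assuming that OR logic 1707 fires when the instruction is the last in its block or when a branch is taken, as stated in the description of 1707; the function and parameter names are hypothetical.

```python
def tracker_next(x, y, is_last, taken, next_block_x, branch_target):
    """Return the (X, Y) stored in register 1701 for the next cycle.

    x, y           current internal instruction address 1723
    is_last        end flag match for the current instruction
    taken          TAKEN signal 1713 (False for non-branch instructions)
    next_block_x   field 1802 read from next block address memory 1709
    branch_target  (X, Y) read from branch target memory 1607, or None
    """
    # Selector 1711: branch target if taken, else first instruction
    # of the sequentially next internal block (Y address 0).
    from_1711 = branch_target if taken else (next_block_x, 0)
    # OR logic 1707 steering selector 1705.
    if is_last or taken:
        return from_1711
    # Incrementer 1703: same block, Y + 1.
    return (x, y + 1)
```

A not-taken branch in the middle of a block therefore simply advances Y by one, exactly like a non-branch instruction.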
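In this embodiment the internal instruction addresses that control instruction memory 1203 and the other memories are provided by the tracker. Processor core 1721 needs to provide external instruction address 1708 as the address for the next cycle only when current internal instruction address 1723 misses in branch target memory 1607, or addressing next block address memory 1709 encounters an invalid entry, and the branch decision and end-of-block determination select that missing or invalid instruction address. Specifically, on a miss in branch target memory 1607, processor core 1721 computes external branch target address 1708 in the same way as in the FIG. 16 example and sends it to tag memory 1505 for matching and to in-block offset mapper 1504 for mapping. The internal branch target address obtained by matching and mapping is stored in an entry of branch target memory 1607 in the same way as in the FIG. 16 example, and is also stored in register 1701 as current internal instruction address 1723. When addressing next block address memory 1709 encounters an invalid entry, processor core 1721 computes external next-block address 1708 in the same way as in the FIG. 16 example and sends it to tag memory 1505 for matching. The internal next-block address obtained by matching is stored in field 1802 of the invalid entry, and the block address of the current block is stored in field 1801 of the row of next block address memory 1709 pointed to by the next-block address obtained by matching.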
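Note that the internal instruction blocks at sequential addresses are linked together by the information stored in next block address memory 1709: addressing next block address memory 1709 with the X address of the current internal instruction block reads out X address 1802 of the next internal instruction block. If an internal instruction block is replaced out of instruction memory 1203, next block address memory 1709 can be addressed with the X address of that block to read out X address 1801 of the previous internal instruction block stored there; next block address memory 1709 is then addressed with the X address in 1801 to find the corresponding row, and part 1802 of that row, which stores the X address of the next internal instruction block (the block being replaced), is set invalid so as to reflect the address relationship after replacement. If the instruction memory is organized as set-associative, the row address of the block following an instruction block is the row address of that block incremented by '1' and can be omitted; fields 1801 and 1802 then only need to record the way number.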
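The previous/next links and their invalidation on replacement behave like a doubly linked list indexed by X address. The following sketch illustrates that idea; the class name, the use of `None` for an invalid field, and the clearing of the evicted row's own fields are assumptions not spelled out in the text.

```python
class NextBlockMemory:
    """Toy model of next block address memory 1709 (FIG. 18A format)."""

    def __init__(self, rows):
        self.prev = [None] * rows  # field 1801: X address of previous block
        self.next = [None] * rows  # field 1802: X address of next block

    def link(self, current_x, next_x):
        """Record that block next_x sequentially follows block current_x."""
        self.next[current_x] = next_x
        self.prev[next_x] = current_x

    def evict(self, x):
        """Block x is replaced: invalidate the predecessor's 1802 field."""
        p = self.prev[x]
        # Guard against a stale 1801 value pointing at an unrelated row.
        if p is not None and self.next[p] == x:
            self.next[p] = None
        # Assumption: the evicted row's own links are cleared as well.
        self.prev[x] = None
        self.next[x] = None
```

After `evict`, a tracker that reaches the end of the predecessor block finds an invalid 1802 field and falls back to the external next-block address, as described for invalid entries.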
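Furthermore, the above techniques can be extended to systems containing more levels of instruction memory. Refer to FIG. 19, which shows an embodiment of a processor system containing two levels of instruction memory according to the present invention. In this example, converter 1202, instruction memory 1203, processor core 1201, in-block offset mapper 1504, end flag memory 1506, branch target memory 1607, next block address memory 1709, selector 1711, OR logic 1707 and the tracker are the same as in FIG. 17. The difference is that instruction memory 1203, in-block offset mapper 1504, next block address memory 1709, end flag memory 1506 and branch target memory 1607 together form the first level of the instruction memory hierarchy, while instruction memory 1903, tag memory 1905 and block address mapping module 1904 (similar in function to 620 in FIG. 6) together form the second level. Instruction memory 1203 (hereinafter called the level-1 instruction cache to distinguish it clearly from 1903) stores internal instructions, whereas instruction memory 1903 stores external instructions. Before being executed by processor core 1201, the external instructions in instruction memory 1903 are converted by converter 1202 to the corresponding internal instructions and stored in level-1 instruction cache 1203 for processor core 1201 to fetch.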
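In this embodiment one external instruction block can correspond to multiple internal instruction blocks. In this example, instruction memory 1903 contains the external instructions corresponding to all the internal instructions in level-1 instruction cache 1203, so a single tag memory 1905 can serve both levels of the hierarchy.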
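In this embodiment the rows of tag memory 1905 correspond one to one with the external instruction blocks in instruction memory 1903 and store the tag addresses of those external instruction blocks. In addition, block address mapping module 1904 is added in this example; it also corresponds row by row to tag memory 1905, and each row stores the 1X addresses and valid signals of the one or more internal instruction blocks in level-1 instruction cache 1203 that correspond to the external instruction block (when an internal instruction block corresponding to the external instruction block has not yet been stored in 1203, the valid signal of its 1X address is invalid). Refer to FIG. 18C, a schematic diagram of the external instruction address format in the virtual machine system with two levels of memory hierarchy. Here the external instruction address consists of a block address, sub-block number 1813 and in-block offset address 1814. The block address corresponds to an external instruction block in instruction memory 1903 and can be further divided into tag 1811 and index 1812. Index 1812 addresses a row of tag memory 1905 to read out the tag information stored there, which is compared with tag 1811 in the address to determine whether the external instruction block address matches. Index 1812 also addresses the memory in block address mapping module 1904 to select one of its rows, and sub-block number 1813 selects one column of that memory.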
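Refer to FIG. 20, a schematic diagram of the structure of block address mapping module 1904 according to the present invention. The block address mapping module consists of write module 2001, output selector 2007 and a memory. In this example each external instruction block is divided into two sub-blocks, and the external instructions in each sub-block are converted by instruction converter 1202 to internal instructions and stored in one level-1 instruction block of the level-1 instruction cache. Each row of the memory in 1904 therefore corresponds to one (level-2) external instruction block in level-2 instruction cache 1903, and the memory is divided into two columns, 2003 and 2005, corresponding to the two sub-blocks of each external instruction block and selected by sub-block number 1813. Each entry of the memory corresponds to one sub-block and holds the level-1 instruction block address (1X address) of the (level-1) internal instruction block corresponding to that external instruction sub-block. Block address mapping module 1904 can thus map an external instruction block address to the corresponding internal instruction block address, linking each external instruction sub-block to its internal instruction block. Moreover, the internal instruction block of an external instruction sub-block can be placed in any level-1 cache block of the level-1 instruction cache, so the level-1 instruction cache can be fully associative.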
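Specifically, when the memory in block address mapping module 1904 is written, sub-block number 1813 of the external instruction address controls write driver 2001 to drive memory column 2003 or 2005, and index address 1812 selects the row into which the corresponding internal instruction 1X address (1X in FIG. 20) is written. When the memory is read, index address 1812 selects the row to be read, and sub-block number 1813 of the external instruction address controls selector 2007 to select the data output of memory column 2003 or 2005.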
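The address split of FIG. 18C and the row/column access of FIG. 20 can be illustrated together. This is a sketch under assumptions: the field widths passed to `split_external` are arbitrary example values, `None` stands for an invalid 1X entry, and the class and function names are hypothetical.

```python
def split_external(addr, index_bits, sub_bits, offset_bits):
    """Split an external address into tag 1811, index 1812,
    sub-block number 1813 and in-block offset 1814."""
    offset = addr & ((1 << offset_bits) - 1)
    addr >>= offset_bits
    sub_block = addr & ((1 << sub_bits) - 1)
    addr >>= sub_bits
    index = addr & ((1 << index_bits) - 1)
    tag = addr >> index_bits
    return tag, index, sub_block, offset


class BlockAddressMapper:
    """Toy model of module 1904: rows by index, columns by sub-block."""

    def __init__(self, rows, sub_blocks=2):
        self.table = [[None] * sub_blocks for _ in range(rows)]

    def write(self, index, sub_block, x1):
        # Write driver 2001: the sub-block number picks the column.
        self.table[index][sub_block] = x1

    def read(self, index, sub_block):
        # Output selector 2007: returns None when the 1X is invalid.
        return self.table[index][sub_block]

    def invalidate_row(self, index):
        # Used when a new external block is filled into this row.
        self.table[index] = [None] * len(self.table[index])
```

Because any 1X value can be written into any entry, the level-1 cache block holding a converted sub-block is unconstrained, which is what allows the fully associative organization mentioned above.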
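Returning to FIG. 19, the first level of the instruction memory hierarchy works and operates much as in the FIG. 17 embodiment. The only difference lies in the handling when current internal instruction address 1723 misses in branch target memory 1607, or addressing next block address memory 1709 encounters an invalid entry, and the branch decision and end-of-block determination select that missing or invalid instruction address. As in the FIG. 17 embodiment, processor core 1721 then provides external instruction address 1708 as the address for the next cycle. The difference is that this external instruction address is no longer converted directly by tag memory 1505 at this level as in FIG. 17. Instead, index 1812 of the external instruction address reads out an entry of tag memory 1905, which is matched against tag 1811 of the external instruction address, and index 1812 and sub-block number 1813 of the external instruction address address block address mapping module 1904. If the tag matches and the 1X address read from 1904 is valid, the required internal instructions are already stored in level-1 instruction cache 1203. In that case the 1X address read out is sent back over bus 1906 to the first level to fill the invalid entry in next block address memory 1709; or the 1X address addresses in-block offset mapper 1504, which maps the external in-block offset address on bus 1708 to an internal in-block offset address (1Y address), and the 1X address together with that internal in-block offset address forms the internal branch target address, which is stored in the entry of branch target memory 1607 that missed. Subsequent operation is the same as in the FIG. 17 example.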
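If the tag matches but the 1X address read from 1904 is invalid, the required internal instructions are not yet stored in level-1 instruction cache 1203. In that case level-2 cache 1903 is addressed with the external instruction address on bus 1708, and the corresponding external instruction sub-block is sent to instruction converter 1202, converted to an internal instruction block, and stored in the level-1 cache block of level-1 instruction cache 1203 designated by the cache replacement logic. The 1X address of that level-1 cache block is stored in the entry of 1904 pointed to by the external instruction address (the entry that originally read as invalid), and that address is marked valid. The in-block offset mapping relationship and the end flag produced during conversion are also written to the rows of in-block offset mapper 1504 and end flag memory 1506 pointed to by that 1X address. As in the previous case, the 1X address is sent back over bus 1906 to the first level and stored in the invalid entry of next block address memory 1709, or stored together with the internal in-block offset address produced by mapping in the entry of branch target memory 1607 that missed. Subsequent operation is the same as in the FIG. 17 example.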
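If the tag does not match, the required instructions are not yet stored in level-2 instruction cache 1903. In that case the external instruction address on bus 1708 is sent to lower-level memory to fetch the external instruction block, which fills a level-2 cache block of level-2 instruction cache 1903 designated by the cache replacement logic. At the same time tag 1811 of the external instruction address on bus 1708 is stored in the entry of tag memory 1905 corresponding to that level-2 cache block, and both entries of block address mapping module 1904 corresponding to that level-2 cache block are set invalid. Processing then continues as in the case above, where the tag matches but the 1X address obtained from the block address mapping module is invalid.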
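When the external instructions are fixed-length, the boundary of an external instruction block or sub-block coincides with the start of an external instruction. Therefore, whether an external instruction block (or sub-block) is entered by sequential execution or by a branch, the complete block or sub-block can be converted, starting from its boundary, to the corresponding internal instruction block and stored in the internal instruction memory. When the external instruction set is variable-length, the start address of the first external instruction in an external instruction block (or sub-block) does not necessarily coincide with the block (or sub-block) boundary. In that case, when a branch enters an external instruction block, only the part of the block from the branch target instruction to the end of the block can be converted and stored in an internal instruction cache block for execution by the processor core. The instructions before the branch target can be converted, and the resulting internal instructions added to that internal instruction block, only when a later branch or sequential entry into the external instruction block has its starting point on those instructions. The storage format of converted internal instructions in level-1 instruction cache 1203 can be modified to accommodate this, and each external instruction is defined as belonging to the external instruction block that contains its start address.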
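Refer to FIG. 21, which shows an embodiment of how the instruction memory stores internal instructions when external instructions are not aligned with block boundaries according to the present invention. External instruction block 2101 is one row, an external instruction block (or sub-block), of instruction memory 1903, and internal instruction block 2102 is the row of level-1 instruction cache 1203 corresponding to external instruction block 2101. Suppose the target of the first branch is external instruction 2105. Conversion can then proceed from target instruction 2105 until the end of the instruction block, and the result is stored in the internal instruction cache block. The internal instructions can still be stored in order of increasing address, but with the highest address of all the converted internal instructions aligned to the most significant (MSB) end of internal instruction block 2102 (the far right of internal instruction block 2102 in FIG. 21). Internal instruction 2106, corresponding to external instruction 2105, is thus stored at the position shown in FIG. 21, and the internal instructions corresponding to all the external instructions of block 2101 from instruction 2105 onward are stored in address order in the shaded part of internal instruction block 2102 shown in FIG. 21.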
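In addition, in this embodiment a pointer is added to each row of instruction memories 1903 and 1203. One points to the first external instruction in the external instruction block that has already been converted (pointer 2103 to external instruction 2105 in FIG. 21), and the other points to the first internal instruction already stored in the internal instruction block (pointer 2104 to internal instruction 2106 in FIG. 21). When the external instruction block is entered again by sequential execution or by a branch, the external in-block offset address at entry can be compared with pointer 2103 to determine whether the target instruction has already been converted. If the new target instruction has not yet been converted, all the external instructions of block 2101 from the new target instruction up to, but not including, the external instruction pointed to by pointer 2103 are converted; the highest address of the resulting internal instructions is aligned to the address immediately before the one pointed to by pointer 2104 in internal instruction block 2102, and the internal instructions are again stored in order of increasing address. Pointers 2103 and 2104 are then updated to point to the position of the new target instruction in external instruction block 2101 and to the position of its corresponding internal instruction in internal instruction block 2102, respectively. The internal instruction mapping relationships in in-block offset mapper 1504 are likewise stored high-aligned, consistent with the internal instruction cache block. Both pointers can be implemented in each row of in-block offset mapper 1504.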
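The high-aligned storage of FIG. 21 amounts to filling an internal block from its top end downward while keeping address order. The sketch below models only the internal side (pointer 2104); the class name, the use of a Python list for the block, and the error on overflow are illustrative assumptions.

```python
class AlignedInternalBlock:
    """Toy model of internal instruction block 2102 with pointer 2104."""

    def __init__(self, size):
        self.slots = [None] * size
        # Index of the first stored internal instruction;
        # equal to size while the block is still empty.
        self.first = size

    def prepend(self, instrs):
        """Store newly converted instructions so that their highest
        address lands just below the current pointer position."""
        start = self.first - len(instrs)
        if start < 0:
            raise ValueError("internal block overflow")
        self.slots[start:self.first] = instrs
        self.first = start  # pointer 2104 now marks the new first instruction
        return start
```

A first conversion of three instructions lands in the top three slots; a later entry at an earlier target fills the slots directly beneath them, so the block always reads in increasing address order from the pointer onward.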
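According to the technical solution of the present invention, when the internal instruction storage scheme of the FIG. 21 embodiment is used, the first instruction of an internal instruction block is not necessarily at the start address of the block (Y address '0'). The next block address memory of the processor system therefore needs to be modified accordingly. Refer to FIG. 18B, which shows another embodiment of the next block address memory format according to the present invention. In this example each row of the next block address memory contains, in addition to first part 1801 and second part 1802 of the FIG. 18A embodiment, a third part 1803, which stores the 1Y address of the first internal instruction in the internal instruction block that follows the block of that row. Second part 1802 and third part 1803 together form the address of the first internal instruction of the next internal instruction block, so that even when internal instructions are not stored starting from the LSB of the internal instruction block because external instruction boundaries are not aligned, the next block address memory can still be addressed with the block address of the current internal instruction block (the 1X address) to read out the address of the first instruction of the next internal instruction block. The formats of FIG. 21 and FIG. 18B can also be applied to the embodiments of FIG. 15, FIG. 16 and FIG. 17 to handle external instruction start addresses that are not aligned with external instruction block boundaries.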
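FIG. 21 and FIG. 18B describe an embodiment that solves the misalignment between instructions and block boundaries when external instruction sub-blocks and internal instruction blocks are in strict one-to-one correspondence. FIG. 22 shows another embodiment of the block address mapping module according to the present invention, describing an implementation that solves the same problem by elastic mapping between external instruction blocks and internal instruction blocks; it can be applied in the embodiment of FIG. 19. In this example the instructions of one external instruction block can be converted to internal instructions placed in at most three internal instruction blocks (any number is possible). The main body of the block address mapping module is accordingly divided into three memories, 2201, 2202 and 2203. Each row of these three memories corresponds to one external instruction block and consists of two fields: one stores the in-block offset address, within its external instruction block, of the first external instruction of the external instruction segment (2Y in the figure), and the other stores the block address in level-1 instruction cache 1203 of the internal instruction block corresponding to that segment (1X in the figure). In addition, paths 2205 and 2206 connect the corresponding rows of the three memories, so that the content of any row of memory 2201 can be shifted right into the corresponding row of memory 2202, and the content of any row of memory 2202 can be shifted right into the corresponding row of memory 2203.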
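When an external instruction block is accessed as a branch target for the first time, all the complete instructions of the block from the external in-block offset address (2Y) of the branch target onward are converted to internal instructions and placed in order in one internal instruction block. This 2Y value and the block address (1X) of that internal instruction block are stored in the row of memory 2201 in FIG. 22 pointed to by the external instruction block address (2X), recording that the first internal instruction of the internal instruction block with block address 1X corresponds to the external instruction at in-block offset 2Y of external instruction block 2X. If that internal instruction block fills up and internal instructions remain, another internal instruction block is allocated to hold the overflow, and the in-block offset address (2Y) of the external instruction corresponding to the first overflowing internal instruction, together with the block address (1X) of the newly allocated internal instruction block, is stored in the row of memory 2202 pointed to by 2X. The in-block offset mapping relationships between external and internal instructions are also stored in the rows of in-block offset mapper 1504 in FIG. 19 addressed by 1X.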
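Furthermore, external offset address 2Y of the branch target is mapped to internal in-block offset 1Y using the mapping relationship in in-block offset mapper 1504 pointed to by the corresponding internal instruction block address 1X. At this point the external instruction block starting from the branch target has been converted to internal instructions by instruction converter 1202; external instruction block address 2X has been mapped to internal instruction block address 1X by block address mapping module 1904; and external in-block offset address 2Y has been mapped to internal instruction offset address 1Y by in-block offset mapper 1504. The branch target internal instruction address 1X, 1Y can further be stored in branch prediction module 1607 for use by the tracker.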
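Returning to FIG. 22, the next time the external instruction block is accessed, external instruction block address 2X of the access address addresses memories 2201, 2202 and 2203, and the same row is read from each and sent to comparator 2204. External in-block offset address 2Y of the access address is compared in comparator 2204 with the 2Y values read from the memories, and the 1X value stored in the first memory whose 2Y value is smaller than the 2Y value of the access address is selected as output 1906 of block address mapping module 1904. Subsequent operation is as described above. If the 2Y in memory 2201 (which is the smallest among all the memories of block address mapping module 1904) is still greater than the 2Y of the access address, the access target instruction has not yet been converted to internal instructions. The system then controls instruction converter 1202 to convert the external instructions from the access target up to, but not including, the 2Y value stored in memory 2201 to internal instructions, which are stored in the level-1 cache block designated by the level-1 cache replacement logic. At the same time, the row of memory 2202 in FIG. 22 pointed to by external instruction block address 2X of the access address is shifted right into the same row of 2203, the row of memory 2201 pointed to by 2X is shifted right into the same row of 2202, and the 2Y value of the access target and the newly designated 1X value are stored in memory 2201. In this way one external instruction block is converted to several internal instruction blocks, each beginning at the starting point of one of several accesses, and the mapping relationships are recorded in block address mapping module 1904 with the structure of FIG. 22. The operation of a block address mapping module with the structure of FIG. 22 is described in detail in the FIG. 8 embodiment. Once the 1X and 1Y addresses of the internal instruction corresponding to the external instruction have been obtained, subsequent operation is as described above and is not repeated here.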
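The elastic mapping of FIG. 22 can be pictured as a short, ordered list of (2Y, 1X) pairs per external block. The sketch below is one possible reading, with several assumptions made explicit: the comparator is taken to select the entry with the largest 2Y not greater than the accessed offset, a right shift that runs past the last memory simply drops the rightmost entry, and all names are hypothetical.

```python
class ElasticRow:
    """One row (one external block 2X) across memories 2201..2203."""

    def __init__(self, ways=3):
        self.ways = ways
        # (2Y start offset, 1X block address), ascending 2Y;
        # entries[0] plays the role of memory 2201.
        self.entries = []

    def lookup(self, y2):
        """Return the 1X of the segment covering offset y2, or None
        if y2 lies before every converted segment."""
        found = None
        for start, x1 in self.entries:
            if start <= y2:
                found = x1  # keep the entry with the largest start <= y2
        return found

    def add_overflow(self, y2, x1):
        """Record an overflow block to the right of the existing ones."""
        if len(self.entries) >= self.ways:
            raise ValueError("no free memory for the overflow segment")
        self.entries.append((y2, x1))

    def add_earlier_segment(self, y2, x1):
        """Shift the row right and store the new segment in 2201."""
        self.entries = [(y2, x1)] + self.entries[: self.ways - 1]
```

A `None` result from `lookup` corresponds to the case in which the 2Y stored in memory 2201 is still greater than the accessed offset, so conversion of the earlier instructions is required before `add_earlier_segment` records the new block.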
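According to the technical solution of the present invention, a track table can also be incorporated into the processor system. Refer to FIG. 23, which shows an embodiment of a processor system including a track table according to the present invention. Because the track table of the present invention already contains the branch target address information, the next instruction block address information and the end track point information, track table 2301 can replace next block address memory 1709, end flag memory 1506 and branch target memory 1607 in this embodiment. Tag memory 1905, block address mapping module 1904, converter 1202, level-1 instruction cache 1203, processor core 1201, in-block offset mapper 1504, selector 1711, OR logic 1707 and the tracker are the same as in FIG. 19. Scanner 2302 is also added in this example; as described earlier, it examines the external instructions being converted, computes external instruction address BN2 of the branch target for each branch instruction among them, and converts it to the corresponding internal instruction address BN1. In this example internal instruction address BN1 is the address of level-1 instruction cache 1203, the internal instructions in level-1 instruction cache 1203 correspond one to one with the track points in track table 2301, and the track point of a branch instruction contains the internal instruction address of the branch target. The tracker can therefore, as described earlier, address track table 2301 to read out the content of a track point and, according to the outcome of the branch instruction, select either the current tracking address incremented by '1' or the branch target tracking address in the track point as the tracking address of the next internal instruction.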
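In addition, whether the last instruction of an internal instruction block has been reached can be determined from the content of the track points in track table 2301. For example, a flag bit in each track point can indicate whether the track point corresponds to the last instruction; when the tracker read pointer points to that track point, the value of the flag bit read on bus 2313 shows that the last instruction has been reached.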
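In this example track table 2301 can simultaneously output on bus 2311 the content of the track point pointed to by tracker read pointer 1723, and on bus 2309 the content of the end track point of the track containing that track point (which holds the address of the starting point of the sequentially next internal instruction block). As in the FIG. 19 embodiment, selector 1711 is thereby provided with both the branch target tracking BN1 address and the next internal instruction block BN1 address at the same time.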
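Another difference between this embodiment and the FIG. 19 embodiment is the addition of selector 2315, which selects between the BN1 internal instruction address (also the level-1 instruction cache address), formed by merging the BN1X sent by block address mapping module 1904 over bus 1906 with the BN1Y sent by in-block offset mapper 1504, and the BN2 level-2 cache address output by scanner 2302; the selected address is stored in track table 2301.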
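Specifically, when scanner 2302 examines the external instructions sent from level-2 instruction cache 1903 to level-1 instruction cache 1203, it computes for each branch instruction among them the external instruction address of the branch target, by adding the external branch offset carried in the instruction to the external address of the branch instruction. The index portion of the computed external branch target address addresses the tag memory, and the content read out is matched against the tag portion of the external branch target address. On a miss, the external instruction block is read from lower-level memory using that external address and stored in the level-2 cache block of level-2 cache 1903 designated by the cache block replacement logic; the tag portion of the external address is stored in the corresponding row of tag memory 1905, and all valid bits in the corresponding row of block address mapping module 1904 are set to 'invalid'. On a hit, index 1812 of the external address (together with the way number if level-2 cache 1903 is organized as set-associative) serves as level-2 cache block address BN2X, and sub-block number 1813 with in-block offset address 1814 serves as BN2Y; together they form level-2 cache address BN2. This BN2 is stored in the track table 2301 entry of the internal branch instruction corresponding to the external branch instruction. Thus, by the time an external branch instruction has been converted to an internal instruction and stored in level-1 instruction cache 1203, its branch target has been stored at least in external form in level-2 instruction cache 1903, and the track table entry of the internal branch instruction holds level-2 cache address BN2 of the branch target.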
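Later, when tracker read pointer 1723 (level-1 cache address BN1) addresses level-1 instruction cache 1203 to read out the internal branch instruction for execution by processor core 1721, it also addresses track table 2301 to read out the track table entry for that instruction. When output 2311 of track table 2301 is in BN2 format and branch decision 1713 is 'taken', selector 1711 places the BN2 on bus 2304, and the BN2 addresses block address mapping module 1904. If the mapping output is 'invalid', the instruction block containing the branch target instruction has not yet been converted to an internal instruction block and stored in level-1 instruction cache 1203. The processor system then uses the BN2 to address level-2 cache 1903 and read out the external instruction block, which is sent to scanner 2302 to compute the branch targets of the branch instructions in the block as described above, and to instruction converter 1202 to be converted to an internal instruction block, which is stored as described above in the level-1 instruction cache block pointed to by the BN1X address supplied by the cache block replacement logic. The system also stores the BN1X address in the formerly 'invalid' entry of block address mapping module 1904, and stores the offset address mapping relationship produced by instruction converter 1202 in the row of in-block offset mapper 1504 pointed to by the BN1X. The virtual machine system then maps external instruction offset address 1814 to internal instruction BN1Y using the mapping row of 1504 pointed to by the BN1X. Level-1 cache address BN1 of the branch target internal instruction, formed from the BN1X and BN1Y, is written to the track table entry of the branch instruction in place of the original BN2. At this point the branch target external instruction and the external instructions following it in the block have been converted to an internal instruction block and stored in level-1 cache 1203, and the level-1 cache address of the internal branch target instruction has been stored in the track table entry of its branch source instruction.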
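Later, when level-1 cache address 1723 (BN1) output by the tracker addresses level-1 instruction cache 1203 to read out an internal branch instruction for execution by processor core 1721, it also addresses track table 2301 to read out the track table entry for that instruction. When output 2311 of the track table is in BN1 format, the BN1 is selected by selectors 1711 and 1705 under control of branch decision signal 1713 and other signals. If 1713 is 'not taken', the tracker read pointer (level-1 cache address 1723) incremented by '1' by incrementer 1703 becomes level-1 cache address 1723 for the next cycle; if 1713 is 'taken', the BN1 output by the track table becomes the tracker read pointer (level-1 cache address 1723) for the next cycle. Level-1 cache address 1723 directly addresses level-1 instruction cache 1203 to read out internal instructions for execution by processor core 1721. The FIG. 6 embodiment is a concrete implementation of the structure of FIG. 23.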
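The end track points of the track table are handled in the same way. When external instructions are converted to internal instructions and stored in a level-1 cache block, scanner 2302 also computes the external address of the sequentially next instruction block (the current external instruction block address plus one) and sends it to tag memory 1905 for matching. If it does not match, the external instruction block is fetched from lower-level memory as described above and stored in the cache block of level-2 cache 1903 designated by the cache block replacement logic with a BN2X address, and the corresponding rows of tag memory 1905 and block address mapping module 1904 are updated. The BN2X obtained in this way, or the BN2X obtained on a match, is stored in the end track point of the track table 2301 row corresponding to that level-1 cache block. Later, when cache read pointer 1723 points to this row, the BN2 is read from the end track point of the track table over 2309. Its BN2X, just as when branch target address BN2 was read over 2311 in the previous example, is sent over bus 2304 to block address mapping module 1904 and mapped to BN1X (if the BN1X address is invalid, then as in the previous example the BN2 addresses level-2 cache 1903, the external instructions are converted to internal instructions and stored in the level-1 cache block of level-1 instruction cache 1203 designated by the cache block replacement logic with BN1X, and tag memory 1905 and block address mapping module 1904 are updated), and the BN2Y on bus 2304 is mapped to BN1Y by in-block offset mapper 1504 using that BN1X. The BN1X and BN1Y form the BN1 address, which is stored through selector 2315 in track table 2301 in place of the original BN2. For either a branch target address or a next-block address, the validity of the corresponding block address mapping module 1904 entry can also be checked when the tag is first matched. If the entry is valid, the branch target instruction or next-block instruction is already stored in level-1 instruction cache 1203 as internal instructions; the BN1X in the 1904 entry is then used to map BN2Y to BN1Y as described above, and the BN1 is stored in the track table directly.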
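Refer to FIG. 24, which shows an embodiment of a processing system that implements stack operations with a register file according to the present invention. For ease of explanation only some of the modules and devices are shown in FIG. 24. In this example register file 2402 in the processor core can be configured for use as a stack. Stack controller 2404 then adjusts output addresses 2405 and 2406, according to the decoding result of the instruction and the current storage state of register file 2402, and sends them to register file 2402 as the top-of-stack pointer value and the bottom-of-stack pointer value respectively.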
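Stack controller 2404 can be implemented with controller 1019, register 1011, decrementer 1031, incrementer 1041 and the selectors of FIG. 10A. Register 1011 stores the current top-of-stack pointer value. The most basic stack operations are pop (POP) and push (PUSH). Decrementer 1031 and incrementer 1041 decrement and increment the current top-of-stack pointer value by '1', corresponding to pop (top-of-stack pointer minus '1') and push (top-of-stack pointer plus '1') respectively. According to the instruction decoding result, operands read from memory 2403 can thus be pushed in turn onto register file 2402 (the top-of-stack pointer being incremented by '1' each time) to implement stack-based data loading; several operands can be popped in turn from register file 2402 (the top-of-stack pointer being decremented by '1' each time), sent to execution unit 2401 for the corresponding arithmetic or logic operation, and the result pushed back onto register file 2402 (the top-of-stack pointer being incremented by '1') to implement stack-based computation; and an operand can be popped from register file 2402 and stored in memory 2403 (the top-of-stack pointer being decremented by '1') to implement stack-based data storing. Specifically, three bits in the register file address field that controls each read or write port in the register-file processor instruction set can control the top-of-stack pointer operation for that read or write port (increment by '1', no change, or decrement by '1').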
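During operation, the top-of-stack pointer value can be compared with the bottom-of-stack pointer value to determine whether the stack is full (or nearly full) or empty (or nearly empty). Once the stack formed by register file 2402 is full (or nearly full), some of the data near the bottom of the stack can be temporarily saved to memory 2403 under control of stack controller 2404, and the bottom-of-stack pointer adjusted to point to the new bottom, freeing part of the stack formed by register file 2402 for subsequent stack operations. The original order of these data can be preserved by organizing storage space in memory 2403 as a stack and saving the data with stack operations (push and pop). Then, once the stack formed by register file 2402 is empty (or nearly empty), some of the previously saved data can be read from that stack in memory 2403 in pop order under control of stack controller 2404 and stored back in the corresponding registers of register file 2402, and the bottom-of-stack pointer adjusted to point to the new bottom. This restores those data to the state they had before being saved to memory 2403, so that the stack formed by register file 2402 still holds data for subsequent stack operations. In this way stack operations can be implemented with a register file.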
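The spill and fill behaviour around the top-of-stack and bottom-of-stack pointers can be tried out with a small circular-buffer model. This is a sketch under assumptions: the register count, the number of entries moved per spill, the trigger points (exactly full, exactly empty) and all names are illustrative choices, not values from the specification.

```python
class RegisterStack:
    """Toy model of register file 2402 used as a stack, backed by a
    spill stack in memory 2403."""

    def __init__(self, num_regs=4, spill_count=2):
        self.regs = [None] * num_regs
        self.spill_count = spill_count
        self.top = 0      # running count; register index is top % num_regs
        self.bottom = 0   # running count of entries spilled below the file
        self.memory = []  # spill area, itself organized as a stack

    def push(self, value):
        if self.top - self.bottom == len(self.regs):
            self._spill()
        self.regs[self.top % len(self.regs)] = value
        self.top += 1

    def pop(self):
        if self.top == self.bottom:
            if not self.memory:
                raise IndexError("pop from empty stack")
            self._fill()
        self.top -= 1
        return self.regs[self.top % len(self.regs)]

    def _spill(self):
        # Move entries nearest the bottom to memory, bottom-most first.
        for _ in range(self.spill_count):
            self.memory.append(self.regs[self.bottom % len(self.regs)])
            self.bottom += 1

    def _fill(self):
        # Restore in pop order so the original ordering is preserved.
        for _ in range(min(self.spill_count, len(self.memory))):
            self.bottom -= 1
            self.regs[self.bottom % len(self.regs)] = self.memory.pop()
```

Pushing six values onto a four-register file forces one spill of two entries; popping all six back returns them in exact reverse order, showing that the memory-side stack preserves the original ordering.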
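To be portable across hardware platforms, some programming languages generate intermediate code consisting mainly of stack-operation instructions, and at run time a software interpreter translates each intermediate-code instruction on the fly into several machine instructions that the hardware platform then executes; the intermediate code therefore executes inefficiently. The processor system of the present invention can execute such stack-operation instructions directly (that is, each stack-operation instruction is converted to one corresponding internal instruction), which greatly improves the execution efficiency of the processor system. Moreover, whereas the prior art usually implements virtual machines in software, the multi-instruction-set processor system of the present invention implements the virtual machine entirely in hardware.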
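Several practical applications of the technology of the present invention are described below, taking the structure of FIG. 23 as an example. The corresponding methods and operations can also be applied to any other suitable structure proposed by the present invention (for example the structures of FIG. 15, 16, 17 and 19). For ease of explanation the following description uses only a variable-length instruction set, a fixed-length instruction set and a stack-operation instruction set as examples of external instruction sets, but any other suitable computer instruction set can serve as an external instruction set in the present invention.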
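First, the virtual machine system is used to execute a program made up of variable-length instructions; that is, the external instructions are variable-length. The instruction mapping and conversion rules between the variable-length instruction set and the internal instruction set are first loaded into memory 1301 of converter 1202, and the values of the control registers in 1202, such as register 212, are written. The register that controls the starting point of instruction conversion is set so that conversion starts from the entry address (branch target or sequential). Accordingly, when variable-length instructions are executed, if the variable-length instruction required by processor core 1201 is already stored in instruction memory 1903, instruction memory 1903 is addressed to read out the instruction block containing that instruction and send it to scanner 2302 and converter 1202. The variable-length instructions from that instruction up to the last not-yet-converted variable-length instruction in the block are scanned and converted, and the branch target addresses of branch instructions are computed and converted to the corresponding internal instruction addresses. The internal instruction blocks obtained by converting these variable-length instructions are stored in turn in the rows of level-1 instruction cache 1203 chosen by the replacement algorithm, and the corresponding tracks are built in the corresponding rows of track table 2301. Specifically, during scanning and conversion, if level-1 instruction cache 1203 already holds the internal instruction corresponding to a branch target, the variable-length instruction address of that branch target can be translated (by tag memory 1905, block address mapping module 1904 and in-block offset mapper 1504, as described earlier) to the corresponding internal instruction address BN1, which is stored in the track table as track point content. If level-1 instruction cache 1203 does not yet hold the internal instruction corresponding to the branch target but instruction memory 1903 already holds the branch target, variable-length instruction address BN2 of the branch target can be stored in the track table as track point content. If instruction memory 1903 does not yet hold the branch target, the branch target can be filled from outer memory into the row of instruction memory 1903 determined by the replacement algorithm, and variable-length instruction address BN2 of the branch target stored in the track table as track point content. Track table 2301 thus contains the address information of the branch targets of the variable-length branch instructions.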
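The tracker then controls level-1 instruction cache 1203 to output the appropriate internal instructions for execution by processor core 1201, according to the content read from track table 2301 and the results of processor core 1201 executing internal branch instructions. During sequential execution by internal instruction address, the appropriate internal instruction can be found directly in level-1 instruction cache 1203 either by incrementing the tracking address (the internal instruction address) by '1' with incrementer 1703 or by selecting the next internal instruction block address output by track table 2301 on bus 2309.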
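When a branch is taken, the appropriate internal instruction can be found directly in level-1 instruction cache 1203 for execution by processor core 1201 using internal instruction address BN1 of the branch target output by track table 2301 on 2311. When track table 2301 outputs variable-length instruction address BN2 of the branch target, and the internal instruction corresponding to that variable-length instruction was already stored in instruction memory 1203 earlier in the run, the variable-length instruction address can be translated as described above to the corresponding internal instruction address BN1, which is used to find the internal instruction in level-1 instruction cache 1203 for execution by processor core 1201. Otherwise the variable-length instruction is located in instruction memory 1903 using the variable-length instruction address, and, as described above, the variable-length instructions from that instruction up to the last not-yet-converted variable-length instruction in the block are scanned and converted; the resulting internal instruction block is stored in level-1 instruction cache 1203, the corresponding track is built in track table 2301, and the internal instruction obtained from the variable-length instruction is supplied to processor core 1201 for execution. Processor core 1201 executes the internal instructions and produces the corresponding results; for example, executing an internal branch instruction produces the TAKEN signal, which indicates whether the branch is taken and is sent to the tracker. As described above, the tracker selects among several address sources according to the TAKEN signal and the signal sent by track table 2301 on bus 2313 indicating whether the last instruction of the block has been reached, and thereby keeps the program flow running.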
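In this example, after the processor system has finished the program made up of variable-length instructions, it goes on to execute a program made up of fixed-length instructions. In this case, after the last variable-length instruction has finished, the processor core can be stopped, the state in the processor core and in the memories set invalid, and the conversion rules between the fixed-length instruction set and the internal instruction set, together with the register settings, loaded into the memory and registers of converter 1202 in place of the variable-length conversion rules previously stored. The register that controls the starting point of instruction conversion is set so that conversion starts from the lowest address of the external instruction block or sub-block. When fixed-length instructions are executed, if the fixed-length instruction required by processor core 1201 is already stored in instruction memory 1903, instruction memory 1903 is addressed to read out the instruction block containing that instruction and send it to scanner 2302 and converter 1202. The whole fixed-length instruction block is scanned and converted, and the branch target addresses of branch instructions are computed and converted to the corresponding internal instruction addresses; the resulting internal instruction block is stored in the row of level-1 instruction cache 1203 chosen by the replacement algorithm, and the corresponding track is built in the corresponding row of track table 2301. The operations are essentially the same as the scanning, conversion, storing in level-1 instruction cache 1203 and track building in track table 2301 described above for variable-length instructions, except that here the whole fixed-length external instruction block is scanned and converted. The process by which the tracker controls level-1 instruction cache 1203 to output instructions for execution by processor core 1201, according to the content read from track table 2301 and the results of processor core 1201 executing internal branch instructions, is the same as described above for variable-length instructions.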
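Next, suppose the processor system goes on to execute a program that mixes code of the variable-length and fixed-length instruction sets. Different external instruction sets can then be switched in real time by reconfiguring converter 1202 whenever the instruction set changes. The procedure is similar to the change from one instruction set to another described above, except that track table 2301, instruction caches 1203 and 1903 and the other memories need not be cleared. Because the tracks of different threads in track table 2301 do not interfere with one another, and the other memories are all tied to the track table, the threads are independent of one another and each has its own track space. On an instruction set or thread switch, it is enough to save tracker read pointer 1723 and the register state of the processor core for a thread, and to restore these data when the thread is to resume; execution then resumes from the point at which the thread was switched out. A memory in the tracker can hold the tracker read pointer of each thread, so that the right read pointer is easily restored on a thread (or virtual machine) switch. Similarly, a memory with an entry per thread can be provided for the state registers of the processor core, so that the time taken to switch between threads is only the time needed to exchange data between the read pointer and processor core state registers on the one hand and the read pointer memory and state memory on the other.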
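Finally, the processor system of the present invention can also be combined with the method of FIG. 13B, in which converter 1202 converts external instructions using the instruction set correspondence selected by the thread number. When different threads use different instruction sets, the processor system then does not need to pause the processor core to reconfigure the converter and can execute instructions without interruption. Specifically, before the program is executed, the correspondences of all external instruction sets that may be used can be loaded, by the method of the FIG. 13B embodiment, into storage areas of the memory of converter 1202 addressed by thread number. When an external instruction is converted, the thread number first addresses the memory of converter 1202 to find the corresponding storage area, and the external instruction is then converted to internal instructions as described above according to the correspondence in that storage area. The other operations are the same as in the preceding embodiments and are not repeated here. Because each virtual machine comprises different threads, the method of this example allows several different virtual machines to run simultaneously on the same processor system. As noted above, the tracks of different threads in track table 2301 do not interfere with one another, so these virtual machines do not interfere with one another either, even though external instructions of the same or different computer instruction sets belonging to different threads coexist in level-2 cache 1903. The tracker read pointer and the register state of the processor core are saved as described above. In this approach several virtual machines executing the same instruction set can also run on the same processor system: only the conversion rules of one external instruction set are stored in instruction converter 1202, and the base addresses of all the threads point to that one set of rules. Different threads (different virtual machines) are independent of one another, and on a thread (virtual machine) switch the tracker pointer and processor core register state are exchanged as before.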
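Selecting conversion rules by thread number is, in software terms, a table of tables. The sketch below illustrates only that selection step; the opcode strings, the class name and the representation of a rule as a list of internal opcodes are all hypothetical and stand in for the field-level mapping held in the converter's memory.

```python
class ThreadedConverter:
    """Toy model of converter 1202 with per-thread rule storage."""

    def __init__(self):
        # thread number -> {external opcode: [internal opcodes]}
        self.rule_tables = {}

    def load_rules(self, thread_no, table):
        """Done once before execution, one storage area per thread."""
        self.rule_tables[thread_no] = table

    def convert(self, thread_no, external_ops):
        """The thread number picks the rule table; each external
        instruction maps to one or more internal instructions."""
        table = self.rule_tables[thread_no]
        internal = []
        for op in external_ops:
            internal.extend(table[op])
        return internal
```

Two threads that run the same external instruction set can be given the same table object, which mirrors the case where all thread base addresses point to a single stored rule set.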
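Just as the above embodiments switch seamlessly between two different instruction sets, a processor according to the present invention that can directly execute stack-operation instructions can also switch seamlessly between an instruction set of register operations and an instruction set of stack operations, executing instructions of several different instruction sets without interruption. Specifically, before the program is executed, the conversion rules of all register-operation or stack-operation instruction sets that may be used can be loaded, by the method of the FIG. 13B embodiment, into storage areas of the memory of converter 1202 addressed by thread number. When a register-operation or stack-operation instruction is converted, the thread number first addresses the memory of converter 1202 to find the corresponding storage area, and the instruction is then converted to internal instructions as described above according to the correspondence in that storage area. In the definition of the internal instructions, one bit is added beyond the instruction fields ordinarily used to control register-operation instructions; this bit is control signal 1021 in FIG. 10A. When a register-operation instruction set is converted to internal instructions, the bit is set to '0', so that signal 1021 controls selectors 1033, 1035 and 1037 to select the register file address fields of the internal instruction to address register file 1001 directly and control its reads and writes. When a stack-operation instruction set is converted to internal instructions, the bit is set to '1', so that signal 1021 controls selectors 1033, 1035 and 1037 to select top-of-stack pointer 1045 and its increments and decrements, as selected by selectors 1053, 1055 and 1057 under control of the internal instruction fields that control top-of-stack pointer adjustment (these can be the register address fields used by register-operation instructions), to address register file 1001 and control its reads and writes. The processor can thus switch seamlessly between a register-operation instruction set and a stack-operation instruction set while running. Provided a suitable condition, such as the thread number described above, directs instruction converter 1202 to use the correct conversion rules to convert external instructions to internal instructions, stack-operation instructions can be embedded in a program of a register-operation instruction set and executed seamlessly, and vice versa. The other operations are the same as in the preceding embodiments and are not repeated here.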
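Any other suitable modification can be made in accordance with the technical solution and concept of the present invention. For a person of ordinary skill in the art, all such substitutions, adjustments and improvements shall fall within the scope of protection of the claims appended to the present invention.
Industrial Applicability
The apparatus and method proposed by the present invention can be used in a variety of applications related to instruction set conversion and can improve the efficiency of processor systems. They can also be used in a variety of applications related to virtual machines, implementing the virtual machine in hardware, and can improve the efficiency of virtual machine systems.
Sequence Listing Free Text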

Claims (62)

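  1. An instruction set conversion method, characterized by comprising:
    converting external instructions to internal instructions, and establishing a mapping relationship between external instruction addresses and internal instruction addresses;
    storing the internal instructions in a cache directly accessible by a processor core; and
    addressing the cache directly with the internal instruction address to read out the corresponding internal instruction for execution by the processor core; or
    converting an external instruction address output by the processor core to an internal instruction address according to the mapping relationship, and then addressing the cache to read out the corresponding internal instruction for execution by the processor core.
  2. The instruction set conversion method according to claim 1, characterized in that subsequent instructions are provided to the processor core according to the program execution flow and feedback from the processor core executing instructions; the feedback from the processor core executing instructions may be a signal, produced when the processor core executes a branch instruction, indicating whether the branch is taken.
  3. The instruction set conversion method according to claim 1, characterized in that, for an external instruction to be converted:
    the instruction fields of the external instruction, including the instruction type, are extracted;
    the instruction type of the corresponding internal instruction and instruction conversion control information are looked up according to the extracted instruction type;
    the corresponding extracted instruction fields are shifted according to the instruction conversion control information; and
    the internal instruction type and the shifted instruction fields are concatenated to form the corresponding internal instruction, thereby converting the external instruction to an internal instruction.
  4. The instruction set conversion method according to claim 3, characterized in that
    one external instruction is converted to one internal instruction, wherein the instruction address of the external instruction corresponds to the instruction address of the internal instruction; or
    one external instruction is converted to multiple internal instructions, wherein the instruction address of the external instruction corresponds to the instruction address of the first of the multiple internal instructions.
  5. The instruction set conversion method according to claim 4, characterized in that
    multiple external instructions are converted to one internal instruction, wherein the instruction address of the first of the multiple external instructions corresponds to the instruction address of the internal instruction.
  6. The instruction set conversion method according to claim 3, characterized in that a mapping relationship between external instruction addresses and internal instruction addresses is established.
  7. The instruction set conversion method according to claim 6, characterized in that the mapping relationship between external instruction addresses and internal instruction addresses comprises:
    a mapping relationship between external instruction addresses and internal instruction block addresses;
    a mapping relationship between external in-block addresses and internal in-block addresses.
  8. The instruction set conversion method according to claim 7, characterized in that the mapping relationship between external instruction addresses and internal instruction block addresses may be represented by a data structure;
    the data structure stores internal instruction block addresses, and the internal instruction block addresses are ordered by both external instruction block address and external in-block address.
  9. The instruction set conversion method according to claim 8, characterized in that, in the data structure, if the internal instruction block address corresponding to an external instruction address exists, the corresponding position in the data structure can be found according to the external instruction block address and external in-block address of the external instruction address, and the internal instruction block address stored there read out.
  10. The instruction set conversion method according to claim 8, characterized in that, in the data structure, if the internal instruction block address corresponding to an external instruction address does not exist, its insertion position can be found according to the external instruction block address and external in-block address of the external instruction address, and the internal instruction block address corresponding to the external instruction address stored at that position.
  11. The instruction set conversion method according to claim 7, characterized in that, according to the mapping relationship between external instruction block addresses and internal instruction block addresses, an external instruction address can be converted to the corresponding internal instruction block address.
  12. The instruction set conversion method according to claim 11, characterized in that, according to the mapping relationship between external in-block addresses and internal in-block addresses, an external in-block address can be converted to the corresponding internal in-block address.
  13. The instruction set conversion method according to claim 6, characterized in that, for any external instruction address, forward shift logic counts, starting from an initial value, the number of external instructions from the start address of the external instruction block containing that address up to that external instruction address, shifting forward by one position for each such external instruction, to finally obtain a shift result;
    reverse shift logic counts, starting from the start address of the internal instruction block corresponding to the external instruction block, the first internal instructions corresponding to each external instruction, shifting backward by one position for each such internal instruction, until the shift result is restored to the initial value; and
    the internal in-block address at that point corresponds to the in-block address of the external instruction.
  14. The instruction set conversion method according to claim 6, characterized in that stack register operations are converted by address calculation to operations on a register file, so that the register file inside the processor core can be used as stack registers.
  15. The instruction set conversion method according to claim 6, characterized in that the conversion can convert instructions of one or more instruction sets to instructions of one instruction set.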
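  16. An instruction set conversion system, characterized by comprising:
    a processor core for executing internal instructions;
    a converter for converting external instructions to internal instructions and establishing a mapping relationship between external instruction addresses and internal instruction addresses;
    an address mapping module for storing the mapping relationship between external instruction addresses and internal instruction addresses, and for converting between external instruction addresses and internal instruction addresses;
    a cache for storing the converted internal instructions and outputting the corresponding internal instructions according to internal instruction addresses for execution by the processor core.
  17. The instruction set conversion system according to claim 16, characterized in that the converter further comprises:
    a memory for storing the correspondence between external instruction types and internal instruction types, and the correspondence between the instruction fields of corresponding external and internal instructions;
    an aligner for shifting external instructions into alignment and, when an external instruction crosses an instruction block boundary, shifting the external instruction into one instruction block and aligning it;
    an extractor for extracting the instruction fields of an external instruction, wherein the extracted instruction type is used to address the memory to read out the instruction conversion control information and the corresponding internal instruction type for the external instruction, and the extracted instruction fields are shifted according to the control information;
    an instruction concatenator for concatenating the internal instruction type and the shifted instruction fields to form an internal instruction.
  18. The instruction set conversion system according to claim 17, characterized in that the address mapping module further comprises:
    a block address mapping module for storing the mapping relationship between external instruction block addresses and internal instruction block addresses, and converting external instruction block addresses to internal instruction block addresses; and
    an offset address mapping module for storing the mapping relationship between external in-block addresses and internal in-block addresses, and converting external in-block addresses to internal in-block addresses.
  19. The instruction set conversion system according to claim 18, characterized in that the system further comprises a tracking system; the tracking system, according to the program execution flow stored in it and feedback from the processor core executing instructions, addresses both the program execution flow and the cache, and reads subsequent instructions from the cache and sends them to the processor core for execution;
    the feedback from the processor core executing instructions may be a signal, produced when the processor core executes a branch instruction, indicating whether the branch is taken.
  20. The instruction set conversion system according to claim 19, characterized in that the address mapping module further contains forward shift logic and reverse shift logic;
    for any external instruction address, the forward shift logic counts, starting from an initial value, the number of external instructions from the start address of the external instruction block containing that address up to that external instruction address, shifting forward by one position for each such external instruction, to finally obtain a shift result;
    the reverse shift logic counts, starting from the start address of the internal instruction block corresponding to the external instruction block, the first internal instructions corresponding to each external instruction, shifting backward by one position for each such internal instruction, until the shift result is restored to the initial value; and
    the internal in-block address at that point corresponds to the in-block address of the external instruction.
  21. The instruction set conversion system according to claim 20, characterized in that the register file in the processor core can be used as stack registers; the system further comprises:
    a top-of-stack pointer register for storing the current top-of-stack pointer, which points to a register in the register file;
    an adder for computing the top-of-stack pointer plus one, corresponding to the position of the register above the current top of stack;
    a subtractor for computing the top-of-stack pointer minus one, corresponding to the position of the register below the current top-of-stack register;
    a bottom-of-stack control module for detecting whether the stack registers are about to become empty or full, and, when the stack registers are about to become full, sending the value of at least one register at the bottom of the stack to memory for saving and adjusting the bottom-of-stack pointer accordingly, so that the stack registers do not overflow; or
    when the stack registers are about to become empty, adjusting the bottom-of-stack pointer accordingly and storing the value of at least one register previously saved to memory back at the bottom of the stack, so that the stack registers can continue to provide operands for the processor core.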
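  22. The caching method according to claim 1, characterized in that instructions filled into the level-1 cache are examined and the corresponding instruction information is extracted; how the first read pointer is updated is determined by the instruction information rather than by the function of the instruction itself.
  23. The caching method according to claim 1, characterized in that, when the first read pointer points to a conditional branch instruction and the following instruction is an unconditional branch instruction, then according to the result of the processor core executing the conditional branch instruction:
    if the branch is taken, the first read pointer is updated to the branch target addressing address value of the conditional branch instruction; if the branch is not taken, the first read pointer is updated to the branch target addressing address value of the unconditional branch instruction;
    so that the processor core does not need a separate clock cycle to execute the unconditional branch instruction.
  24. The instruction set conversion method according to claim 1, characterized in that, when the processor core reaches a branch instruction, one of the sequentially next instruction and the branch target instruction is selected according to branch prediction and executed as the subsequent instruction, and the addressing address of the other is saved;
    if the branch outcome agrees with the branch prediction, execution of the subsequent instructions continues;
    if the branch outcome disagrees with the branch prediction, the pipeline is flushed and execution restarts from the instruction corresponding to the saved addressing address.
  25. The instruction set conversion system according to claim 19, characterized in that how the first read pointer is updated is determined by the instruction information rather than by the function of the instruction itself.
  26. The instruction set conversion system according to claim 19, characterized in that the instruction information stored in the track point pointed to by the first read pointer and in the following track point is read from the track table at the same time.
  27. The instruction set conversion method according to claim 26, characterized in that, when the first read pointer points to a conditional branch instruction and the following instruction is an unconditional branch instruction, then according to the result of the processor core executing the conditional branch instruction:
    if the branch is taken, the first read pointer is updated to the branch target addressing address value of the conditional branch instruction; if the branch is not taken, the first read pointer is updated to the branch target addressing address value of the unconditional branch instruction;
    so that the processor core does not need a separate clock cycle to execute the unconditional branch instruction.
  28. The instruction set conversion system according to claim 19, characterized in that the tracking system further comprises a register for storing the addressing address of one of the sequentially next instruction and the branch target instruction;
    when the processor core reaches a branch instruction, one of the sequentially next instruction and the branch target instruction is selected according to branch prediction and executed as the subsequent instruction, and the addressing address of the other is stored in the register;
    if the branch outcome agrees with the branch prediction, execution of the subsequent instructions continues;
    if the branch outcome disagrees with the branch prediction, the pipeline is flushed and execution restarts from the instruction corresponding to the addressing address saved in the register.
  29. The instruction set conversion system according to claim 19, characterized in that an end track point is added after the last track point of each track in the track table; the instruction type of the end track point is unconditional branch, and its branch target addressing address is the addressing address of the first track point of the sequentially next track; when the first read pointer points to an end track point, the level-1 cache outputs a null instruction.
  30. The instruction set conversion system according to claim 29, characterized in that an end track point is added after the last track point of each track in the track table; the instruction type of the end track point is unconditional branch, and its branch target addressing address is the addressing address of the first track point of the sequentially next track; and
    when the track point before the end track point is not a branch point, the instruction type and branch target addressing address of the end track point can be used as the instruction type and branch target addressing address of that track point.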
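  31. A processor system capable of executing one or more instruction sets, characterized by comprising:
    a first memory for storing a plurality of computer instructions belonging to a first instruction set;
    an instruction converter for converting the plurality of computer instructions belonging to the first instruction set to a plurality of internal instructions, the internal instructions belonging to a second instruction set;
    a second memory for storing the plurality of internal instructions produced by the instruction converter; and
    a processor core connected to the second memory, for reading the plurality of internal instructions from the second memory and executing them without accessing the plurality of computer instructions and without involvement of the instruction converter.
  32. The system according to claim 31, characterized in that:
    the instruction converter contains a memory that can be configured to store the mapping relationship between the first instruction set and the second instruction set; and
    the instruction converter converts the plurality of computer instructions belonging to the first instruction set to the plurality of internal instructions belonging to the second instruction set according to the mapping relationship between the first instruction set and the second instruction set stored in it.
  33. The system according to claim 31 or 32, characterized by further comprising:
    an address converter connecting the instruction converter and the processor core, for converting a target computer instruction address among the plurality of computer instructions to the internal address of the target instruction among the plurality of internal instructions.
  34. The system according to claim 33, characterized in that, when the address converter converts an address:
    the target computer instruction address is mapped to an internal instruction block address;
    the target computer instruction address is mapped to the in-block offset address of the internal instruction within the instruction block corresponding to the block address; and
    the block address and the in-block offset address are merged to form the internal address.
  35. The system according to claim 34, characterized in that:
    the block address is produced by mapping according to the block address mapping relationship between the computer instruction block address and the internal instruction block address.
  36. The system according to claim 35, characterized in that the block address mapping relationship is stored by the address converter.
  37. The system according to claim 35, characterized in that the in-block offset address is produced by hardware logic mapping according to a mapping relationship table.
  38. The system according to claim 34, characterized by further comprising:
    an end flag memory for storing the internal instruction address of the end instruction of an internal instruction block; the end instruction is the last internal instruction before transfer to the internal instruction block at the sequentially next address.
  39. The system according to claim 34, characterized by further comprising a next block address memory for storing the block address of the internal instruction block at the sequentially next address.
  40. The system according to claim 34, characterized by further comprising a branch target buffer for storing the internal instruction addresses of branch targets.
  41. The system according to claim 32, characterized in that:
    the first memory stores a plurality of computer instructions belonging to a third instruction set;
    the instruction converter, as configured, stores in the memory the mapping relationship between the third instruction set and the second instruction set; and
    the instruction converter converts the plurality of computer instructions belonging to the third instruction set to the plurality of internal instructions belonging to the second instruction set according to the mapping relationship between the third instruction set and the second instruction set stored in it.
  42. The system according to claim 41, characterized in that a first thread instruction sequence and a second thread instruction sequence run on the system, wherein:
    the first thread instruction sequence consists of a plurality of computer instructions of the first instruction set;
    the second thread instruction sequence consists of a plurality of computer instructions of the third instruction set;
    the instruction converter, as configured, stores in the memory both the mapping relationship between the first instruction set and the second instruction set and the mapping relationship between the third instruction set and the second instruction set; and
    the instruction converter selects, according to the thread number, one of the mapping relationship between the first instruction set and the second instruction set and the mapping relationship between the third instruction set and the second instruction set, and converts the plurality of computer instructions of that thread to the plurality of internal instructions belonging to the second instruction set.
  43. The system according to claim 32, characterized in that:
    each of the plurality of computer instructions contains at least one instruction field whose content is an instruction type;
    each of the plurality of internal instructions contains at least one instruction field whose content is an instruction type;
    the plurality of computer instructions and the plurality of internal instructions are in one-to-one correspondence; and
    the mapping relationship includes the mapping between the instruction type of each computer instruction and the instruction type of each internal instruction, and the mapping between the instruction fields other than the instruction type of each computer instruction and the instruction fields other than the instruction type of each internal instruction.
  44. The system according to claim 32, characterized in that:
    each of the plurality of computer instructions contains at least one instruction field whose content is an instruction type;
    each of the plurality of internal instructions contains at least one instruction field whose content is an instruction type;
    the total numbers of the plurality of computer instructions and of the plurality of internal instructions are not equal; and
    each of the plurality of computer instructions is mapped to one or more of the plurality of internal instructions.
  45. The system according to claim 43 or 44, characterized in that:
    the instruction fields of the computer instruction contain at least an instruction type; and
    the instruction converter uses at least the instruction type to address the memory in the instruction converter and read out the corresponding mapping relationship.
  46. The system according to claim 45, characterized in that:
    the mapping relationship includes shift logic; and
    an instruction field of at least one of the plurality of internal instructions is produced by shifting the corresponding instruction field of the corresponding computer instruction.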
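  47. A method for a processor system that executes one or more instruction sets, characterized by comprising:
    storing a plurality of computer instructions belonging to a first instruction set in a first memory;
    converting, by an instruction converter, the plurality of computer instructions to a plurality of internal instructions belonging to a second instruction set;
    storing the plurality of internal instructions produced by the instruction converter in a second memory; and
    reading the plurality of internal instructions from the second memory and executing them, by a processor core connected to the second memory, without accessing the plurality of computer instructions and without involvement of the instruction converter.
  48. The method according to claim 47, characterized in that:
    the instruction converter is configured by storing the mapping relationship between the first instruction set and the second instruction set in the memory of the instruction converter; and
    the instruction converter converts the plurality of computer instructions belonging to the first instruction set to the plurality of internal instructions belonging to the second instruction set according to the mapping relationship between the first instruction set and the second instruction set stored in it.
  49. The method according to claim 47 or 48, characterized in that:
    an address converter connecting the instruction converter and the processor core converts a target computer instruction address among the plurality of computer instructions to the internal address of the target instruction among the plurality of internal instructions.
  50. The method according to claim 49, characterized in that, when the address converter converts an address:
    the target computer instruction address is mapped to an internal instruction block address;
    the target computer instruction address is mapped to the in-block offset address of the internal instruction within the instruction block corresponding to the block address; and
    the block address and the in-block offset address are merged to form the internal address.
  51. The method according to claim 50, characterized in that:
    the block address is produced by mapping according to the block address mapping relationship between the computer instruction block address and the internal instruction block address.
  52. The method according to claim 51, characterized in that the block address mapping relationship is stored by the address converter.
  53. The method according to claim 51, characterized in that the in-block offset address is produced by hardware logic mapping according to a mapping relationship table.
  54. The method according to claim 50, characterized by further comprising:
    storing, by an end flag memory, the internal instruction address of the end instruction of an internal instruction block; the end instruction is the last internal instruction before transfer to the internal instruction block at the sequentially next address.
  55. The method according to claim 50, characterized by further comprising storing, by a next block address memory, the block address of the internal instruction block at the sequentially next address.
  56. The method according to claim 50, characterized by further comprising storing, by a branch target buffer, the internal instruction addresses of branch targets.
  57. The method according to claim 48, characterized in that:
    a plurality of computer instructions belonging to a third instruction set are stored in the first memory;
    the instruction converter, as configured, stores in the memory the mapping relationship between the third instruction set and the second instruction set; and
    the instruction converter converts the plurality of computer instructions belonging to the third instruction set to the plurality of internal instructions belonging to the second instruction set according to the mapping relationship between the third instruction set and the second instruction set stored in it.
  58. The method according to claim 57, characterized in that a first thread instruction sequence and a second thread instruction sequence are run, wherein:
    the first thread instruction sequence consists of a plurality of computer instructions of the first instruction set;
    the second thread instruction sequence consists of a plurality of computer instructions of the third instruction set;
    the instruction converter, as configured, stores in the memory both the mapping relationship between the first instruction set and the second instruction set and the mapping relationship between the third instruction set and the second instruction set; and
    the instruction converter selects, according to the thread number, one of the mapping relationship between the first instruction set and the second instruction set and the mapping relationship between the third instruction set and the second instruction set, and converts the plurality of computer instructions of that thread to the plurality of internal instructions belonging to the second instruction set.
  59. The method according to claim 48, characterized in that:
    each of the plurality of computer instructions contains at least one instruction field whose content is an instruction type;
    each of the plurality of internal instructions contains at least one instruction field whose content is an instruction type;
    the plurality of computer instructions and the plurality of internal instructions are in one-to-one correspondence; and
    the mapping relationship includes the mapping between the instruction type of each computer instruction and the instruction type of each internal instruction, and the mapping between the instruction fields other than the instruction type of each computer instruction and the instruction fields other than the instruction type of each internal instruction.
  60. The method according to claim 48, characterized in that:
    each of the plurality of computer instructions contains at least one instruction field whose content is an instruction type;
    each of the plurality of internal instructions contains at least one instruction field whose content is an instruction type;
    the total numbers of the plurality of computer instructions and of the plurality of internal instructions are not equal; and
    each of the plurality of computer instructions is mapped to one or more of the plurality of internal instructions.
  61. The system according to claim 59 or 60, characterized in that:
    the instruction fields of the computer instruction contain at least an instruction type; and
    the instruction converter uses at least the instruction type to address the memory in the instruction converter and read out the corresponding mapping relationship.
  62. The method according to claim 61, characterized in that:
    an instruction field of at least one of the plurality of internal instructions is produced by shifting the corresponding instruction field of the corresponding computer instruction.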
PCT/CN2014/092313 2013-11-27 2014-11-26 一种指令集转换系统和方法 WO2015078380A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2016534248A JP6591978B2 (ja) 2013-11-27 2014-11-26 命令セット変換システム及び方法
US15/100,250 US10387157B2 (en) 2013-11-27 2014-11-26 System and method for instruction set conversion based on mapping of both block address and block offset
KR1020167017252A KR20160130741A (ko) 2013-11-27 2014-11-26 명령 세트 전환 시스템 및 방법
EP14865998.0A EP3076288A4 (en) 2013-11-27 2014-11-26 Instruction set conversion system and method

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201310625156.4 2013-11-27
CN201310625156 2013-11-27
CN201310737869.X 2013-12-24
CN201310737869.XA CN104679480A (zh) 2013-11-27 2013-12-24 一种指令集转换系统和方法

Publications (1)

Publication Number Publication Date
WO2015078380A1 true WO2015078380A1 (zh) 2015-06-04

Family

ID=53198376

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/092313 WO2015078380A1 (zh) 2013-11-27 2014-11-26 一种指令集转换系统和方法

Country Status (6)

Country Link
US (1) US10387157B2 (zh)
EP (1) EP3076288A4 (zh)
JP (1) JP6591978B2 (zh)
KR (1) KR20160130741A (zh)
CN (1) CN104679480A (zh)
WO (1) WO2015078380A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388429A (zh) * 2018-09-29 2019-02-26 古进 Mhp异构多流水线处理器的任务分发方法
CN112559039A (zh) * 2020-12-03 2021-03-26 类人思维(山东)智慧科技有限公司 一种计算机编程用指令集生成方法及系统

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012103359A2 (en) 2011-01-27 2012-08-02 Soft Machines, Inc. Hardware acceleration components for translating guest instructions to native instructions
WO2012103367A2 (en) 2011-01-27 2012-08-02 Soft Machines, Inc. Guest to native block address mappings and management of native code storage
CN109358948B (zh) 2013-03-15 2022-03-25 英特尔公司 用于支持推测的访客返回地址栈仿真的方法和装置
WO2014151652A1 (en) 2013-03-15 2014-09-25 Soft Machines Inc Method and apparatus to allow early dependency resolution and data forwarding in a microprocessor
EP3001313A1 (de) * 2014-09-23 2016-03-30 dSPACE digital signal processing and control engineering GmbH Verfahren zur Simulation eines Anwendungsprogramms eines elektronischen Steuergeräts auf einem Computer
TWI601059B (zh) * 2015-11-19 2017-10-01 慧榮科技股份有限公司 資料儲存裝置與資料儲存方法
EP3173935B1 (en) * 2015-11-24 2018-06-06 Stichting IMEC Nederland Memory access unit
US10318356B2 (en) * 2016-03-31 2019-06-11 International Business Machines Corporation Operation of a multi-slice processor implementing a hardware level transfer of an execution thread
CN106775587B (zh) * 2016-11-30 2020-04-14 上海兆芯集成电路有限公司 计算机指令的执行方法以及使用此方法的装置
US10713750B2 (en) * 2017-04-01 2020-07-14 Intel Corporation Cache replacement mechanism
US10325341B2 (en) 2017-04-21 2019-06-18 Intel Corporation Handling pipeline submissions across many compute units
CN107220103A (zh) * 2017-05-27 2017-09-29 郑州云海信息技术有限公司 一种宿主物理机的cpu加速方法及装置
CN109101275B (zh) * 2018-06-26 2021-07-23 飞腾技术(长沙)有限公司 一种基于移位的指令提取与缓冲方法及超标量微处理器
CN111984323A (zh) * 2019-05-21 2020-11-24 三星电子株式会社 将微操作分配到微操作高速缓存器的处理设备及其操作方法
US10783082B2 (en) * 2019-08-30 2020-09-22 Alibaba Group Holding Limited Deploying a smart contract
CN110704213B (zh) * 2019-10-10 2020-07-10 北京航空航天大学 一种数字孪生数据虚实高效实时交互方法和系统
CN110989408A (zh) * 2019-12-09 2020-04-10 深圳市康冠商用科技有限公司 一种设备控制方法、装置、设备及可读存储介质
CN111414199B (zh) * 2020-04-03 2022-11-08 中国人民解放军国防科技大学 一种指令融合的实现方法及装置
CN113535231A (zh) * 2020-04-17 2021-10-22 中科寒武纪科技股份有限公司 减少指令跳转的方法及装置
CN111913745B (zh) * 2020-08-28 2022-06-28 中国人民解放军国防科技大学 一种嵌入式多指令集处理器设计方法
CN114168495A (zh) * 2020-09-10 2022-03-11 西部数据技术公司 存储设备的增强的预读能力
US11366774B2 (en) * 2020-09-24 2022-06-21 Adesto Technologies Corporation Memory latency reduction in XIP mode
CN112564686B (zh) * 2020-11-12 2023-08-15 东南大学 基于动态电路的大扇入独热码数据选择器电路
US20220318015A1 (en) * 2021-03-31 2022-10-06 Advanced Micro Devices, Inc. Enforcing data placement requirements via address bit swapping
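CN115827063B (zh) * 2023-02-16 2023-06-13 MetaX Integrated Circuits (Nanjing) Co., Ltd. Write storage system and method based on the Fill Constant instruction
CN117608667B (zh) * 2024-01-23 2024-05-24 HEXIN Technologies (Suzhou) Co., Ltd. Instruction set processing system, method, and electronic device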

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050223192A1 (en) * 2000-10-09 2005-10-06 Pts Corporation Instruction sets for processors
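CN1682181A (zh) * 2002-09-20 2005-10-12 ARM Limited Data processing system having external and internal instruction sets
CN103235724A (zh) * 2013-05-10 2013-08-07 PLA Information Engineering University Integrated multi-source binary code translation method based on atomic operation semantic description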

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0683615A (ja) * 1992-09-02 1994-03-25 Fujitsu Ltd Computer performing instruction set emulation
JP2002063031A (ja) * 2000-08-18 2002-02-28 Fainaaku Kk Processor supporting architecture switching and reconfiguration
US7251811B2 (en) * 2002-01-02 2007-07-31 Intel Corporation Controlling compatibility levels of binary translations between instruction set architectures
US7478224B2 (en) * 2005-04-15 2009-01-13 Atmel Corporation Microprocessor access of operand stack as a register file using native instructions
US7966474B2 (en) * 2008-02-25 2011-06-21 International Business Machines Corporation System, method and computer program product for translating storage elements
US8479196B2 (en) * 2009-09-22 2013-07-02 International Business Machines Corporation Nested virtualization performance in a computer system
US8527707B2 (en) * 2009-12-25 2013-09-03 Shanghai Xin Hao Micro Electronics Co. Ltd. High-performance cache system and method
US9141388B2 (en) * 2009-12-25 2015-09-22 Shanghai Xin Hao Micro Electronics Co., Ltd. High-performance cache system and method
WO2012103367A2 (en) * 2011-01-27 2012-08-02 Soft Machines, Inc. Guest to native block address mappings and management of native code storage
US9280398B2 (en) * 2012-01-31 2016-03-08 International Business Machines Corporation Major branch instructions
US10474465B2 (en) * 2014-05-01 2019-11-12 Netronome Systems, Inc. Pop stack absolute instruction
KR101963725B1 (ko) * 2014-05-12 2019-04-01 Intel Corporation Method and apparatus for providing hardware support for self-modifying code

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
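CN109388429A (zh) * 2018-09-29 2019-02-26 Gu Jin Task distribution method for an MHP heterogeneous multi-pipeline processor
CN109388429B (zh) * 2018-09-29 2024-01-02 Gu Jin Task distribution method for an MHP heterogeneous multi-pipeline processor
CN112559039A (zh) * 2020-12-03 2021-03-26 Humanoid Thinking (Shandong) Intelligent Technology Co., Ltd. Instruction set generation method and system for computer programming
CN112559039B (zh) * 2020-12-03 2022-11-25 Humanoid Thinking (Shandong) Intelligent Technology Co., Ltd. Instruction set generation method and system for computer programming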

Also Published As

Publication number Publication date
EP3076288A1 (en) 2016-10-05
KR20160130741A (ko) 2016-11-14
US10387157B2 (en) 2019-08-20
EP3076288A4 (en) 2018-03-07
JP2016539423A (ja) 2016-12-15
US20170003967A1 (en) 2017-01-05
CN104679480A (zh) 2015-06-03
JP6591978B2 (ja) 2019-10-16

Similar Documents

Publication Publication Date Title
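WO2015078380A1 (zh) Instruction set conversion system and method
WO2015096688A1 (zh) Cache system and method
CN104679481B (zh) Instruction set conversion system and method
JP3794917B2 (ja) Branch selectors associated with byte ranges within an instruction cache for rapidly identifying branch predictions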
EP0853780B1 (en) Superscalar microprocessor with risc86 instruction set
US6106573A (en) Apparatus and method for tracing microprocessor instructions
US6708330B1 (en) Performance improvement of critical code execution
US5625789A (en) Apparatus for source operand dependendency analyses register renaming and rapid pipeline recovery in a microprocessor that issues and executes multiple instructions out-of-order in a single cycle
JP3182740B2 (ja) Method and system for fetching non-contiguous instructions in a single clock cycle
EP0941508B1 (en) Variable instruction set computer
WO2012175058A1 (en) High-performance cache system and method
WO2014139466A2 (en) Data cache system and method
US7203800B2 (en) Narrow/wide cache
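JP3803723B2 (ja) Branch prediction mechanism employing branch selectors to select a branch prediction
WO2015024493A1 (zh) Cache system and method based on an instruction read buffer
JP3794918B2 (ja) Branch prediction that classifies the type of branch prediction using return selection bits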
US5872943A (en) Apparatus for aligning instructions using predecoded shift amounts
EP0853783B1 (en) Instruction decoder including two-way emulation code branching
US5961580A (en) Apparatus and method for efficiently calculating a linear address in a microprocessor
US6047368A (en) Processor architecture including grouping circuit
JP2001522082A (ja) Approximating a larger number of branch predictions using a smaller number of branch predictions and an alternate target
US6336182B1 (en) System and method for utilizing a conditional split for aligning internal operation (IOPs) for dispatch
EP0912927B1 (en) A load/store unit with multiple pointers for completing store and load-miss instructions
EP4250096A1 (en) Processor micro-operations cache architecture for intermediate instructions
US7213129B1 (en) Method and system for a two stage pipelined instruction decode and alignment using previous instruction length

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 14865998; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2016534248; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
WWE WIPO information: entry into national phase (Ref document number: 15100250; Country of ref document: US)
REEP Request for entry into the European phase (Ref document number: 2014865998; Country of ref document: EP)
WWE WIPO information: entry into national phase (Ref document number: 2014865998; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 20167017252; Country of ref document: KR; Kind code of ref document: A)