WO2016169518A1 - Instruction and data push-based processor system and method - Google Patents

Instruction and data push-based processor system and method

Info

Publication number
WO2016169518A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
cache
level
instruction
buffer
Prior art date
Application number
PCT/CN2016/080039
Other languages
French (fr)
Chinese (zh)
Inventor
林正浩
Original Assignee
上海芯豪微电子有限公司 (Shanghai Xinhao Microelectronics Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201510233007.2A (published as CN106201913A)
Application filed by 上海芯豪微电子有限公司 (Shanghai Xinhao Microelectronics Co., Ltd.)
Priority to US15/568,715 (published as US20180088953A1)
Publication of WO2016169518A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems

Definitions

  • the invention relates to the field of computers, communications and integrated circuits.
  • the central processor in a stored-program computer generates an address and sends it to the memory; the instruction or data read at that address is sent back for execution by the central processing unit, and the execution result is sent back to the memory for storage.
  • as the execution speed of the central processor increases, memory access latency becomes a bottleneck for improving computer performance.
  • the stored-program computer uses a buffer (cache) to mask the memory access latency and alleviate this bottleneck. But the central processor fetches instructions or data from the buffer in the same way: the processor core in the central processing unit generates an address and sends it to the buffer.
  • if the information is in the buffer, the buffer sends it directly to the processor core for execution, thus avoiding the memory access delay.
  • as the capacity of the buffer grows, the buffer access latency increases, as does the latency of the access channel.
  • meanwhile the execution speed of the processor core keeps increasing, so buffer access latency is now a serious bottleneck for computer performance improvement.
  • the manner in which the processor core fetches information (instructions and data) from the memory for execution can be regarded as the processor core pulling (pull) information from the memory.
  • pulling information incurs the channel delay twice: once when the processor core sends the address to the memory, and once when the memory sends the information back to the processor core.
  • all processors in stored-program computers have modules for generating and recording instruction addresses, and the instruction pipeline must include an instruction-fetch stage.
  • fetching instructions in a modern stored-program computer usually requires several pipeline stages, which deepens the pipeline and increases the penalty of branch mispredictions.
  • generating and recording long instruction addresses also consumes more energy.
  • a computer that converts variable-length instructions into fixed-length micro-operations must convert the fixed-length micro-operation address back to the variable-length instruction address in order to address the cache, which has a cost.
  • the method and system apparatus proposed by the present invention can directly address one or more of the above or other difficulties.
  • the present invention provides a processor system comprising a push buffer and a corresponding processor core, wherein: the processor core does not generate or maintain an instruction address, and its pipeline has no instruction-fetch stage; the processor core only provides branch decisions to the push buffer, and provides a base address stored in the register file when an indirect branch instruction is executed; the push buffer extracts control flow information from the instructions it stores, and pushes instructions to the processor core for execution according to the stored control flow information and the branch decisions; upon encountering an indirect branch instruction, the push buffer provides the correct indirect branch target instruction to the processor core for execution, based on the base address received from the processor core.
  • the push buffer may provide the processor core with both the instruction that sequentially follows a branch instruction and the branch target instruction; the branch decision generated by the processor core selects one of the two for execution, thereby masking the delay of passing the branch decision from the processor core to the push buffer.
  • the push buffer may store the base address of an indirect branch instruction together with the corresponding indirect branch target address, which reduces or eliminates the delay in pushing the indirect branch target instruction and partially or completely masks the delay of sending the base address from the processor core to the push buffer.
  • the push buffer may push instructions to the processor core in advance based on the control flow information stored in it, partially or completely masking the delay of transmitting information from the push buffer to the processor core.
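The push model above can be sketched in a few lines: the buffer supplies both the fall-through and the branch-target instruction before the branch resolves, and the core contributes only the taken/not-taken decision. This is an illustrative toy model, not the patent's hardware; all class and parameter names are assumptions.

```python
class PushBuffer:
    """Toy push buffer: holds instructions plus the branch-target map
    (the extracted control flow information) and supplies BOTH possible
    next instructions before the branch decision arrives."""
    def __init__(self, instructions, branch_targets):
        self.mem = instructions        # address -> instruction
        self.targets = branch_targets  # branch address -> target address

    def push_pair(self, pc):
        # The sequential successor is always pushed; if pc is a branch,
        # the branch-target instruction is pushed alongside it.
        seq = self.mem.get(pc + 1)
        tgt_addr = self.targets.get(pc)
        tgt = self.mem.get(tgt_addr) if tgt_addr is not None else None
        return seq, tgt

def core_select(seq, tgt, taken):
    """The core contributes only the branch decision; it never sends
    an address back to the buffer, so that round trip is hidden."""
    return tgt if taken else seq
```

Because both candidates arrive before the decision, the core-to-buffer latency no longer sits on the critical path of instruction supply.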
  • the processor core of the processor system proposed by the present invention does not need to have a pipeline stage for fetching instructions, nor does it need to generate and record an instruction address.
  • the invention proposes an organization for a multi-level cache hierarchy in which the last (lowest) level cache (Last Level Cache, LLC) is set-associative and has a virtual-to-physical address translation lookaside buffer (TLB) and a tag unit (TAG). The TLB translates a virtual memory address into a physical address, which is matched against the contents of the TAG to obtain the buffer address of the LLC. Because the LLC buffer address is mapped from the physical memory address, the LLC buffer address effectively stands for the physical address. The resulting LLC buffer address can be used to address the LLC's information memory (RAM) and also to index an LLC active table.
  • the LLC active table stores the mapping between LLC cache blocks and cache blocks in the higher-level buffers: it is addressed by the LLC buffer address, and each entry holds the corresponding higher-level cache block address.
  • all buffer levels other than the LLC are fully associative; they are addressed directly by their own buffer addresses and require neither a tag unit (TAG) nor a TLB.
  • the buffer address of each level is mapped to the higher-level buffer address through an active table.
  • each such active table is similar to the LLC active table: it is addressed by the buffer address of its level, and its entries store the higher-level buffer addresses.
  • the highest-level buffer has a corresponding track table, which stores the control flow information extracted by the scanner when instructions are reviewed and stored into the highest-level buffer memory (RAM).
  • the track table is addressed by the highest-level buffer address, and its entries store the branch target addresses of branch instructions.
  • the tracker generates a highest-level buffer address that addresses the first read port of the highest-level buffer memory to output sequential instructions, which are pushed to the processor core; the same address also reads the corresponding branch target address from the corresponding track table entry, and that branch target address addresses the second read port of the highest-level buffer memory to output the branch target instruction, which is likewise pushed to the processor core.
  • the processor core executes the branch instruction to generate a branch decision, and selects one of the two instructions to execute and discards the other branch.
  • the branch determination also controls the tracker to select one of the two buffer addresses accordingly, addressing the highest level buffer to continue pushing instructions to the processor core.
  • the present invention proposes a cache replacement method that determines replaceable cache blocks based on the degree of association between cache blocks.
  • the path from a branch source to its branch target is recorded in the track table.
  • the related table records the lower-level buffer address corresponding to the contents of each cache block, the branch source paths that jump into the cache block, and the count of branch sources that jump into the cache block.
  • the association degree of a cache block may be defined as the count of branch sources jumping into it; the smaller the count, the lower the association degree, and the sooner the block may be replaced.
  • cache blocks may also be replaced in the order of earlier replacements, oldest first, so that a block that has only just been filled is not immediately replaced again.
  • when a block is replaced, the track table entries addressed by the recorded jump-in branch source paths are rewritten with the lower-level buffer address of the block's contents taken from the related table, keeping the control flow information intact. The above is based on the degree of association within the same storage level.
  • the minimum-association-degree replacement method can also be applied between different storage levels.
  • the method records, for each cache block, the number of higher-level cache blocks whose contents are the same as that cache block as its association degree; the smaller the count, the lower the association degree, and the cache block with the smallest degree is replaced first.
  • This method may also be called the Least Children method, where a child is a higher-level cache block with the same contents as the cache block.
  • the number of track table entries that take the cache block as a branch target is also recorded (the cache block and the track table can be at different storage levels). When both counts are '0', the cache block can be replaced. If the child count is not '0', the child cache blocks can be replaced first to make the cache block replaceable.
  • if the branch-target count is not '0', the replacement may wait until it becomes '0', or the lower-level buffer address containing the block's contents may be substituted into those track table entries, after which the cache block is replaced.
  • the minimum-association-degree method between storage levels can also be combined with the oldest-replaced-first method described above.
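A compact sketch of the victim-selection rule described above, combining the association count with the oldest-replaced-first tie-break. The field names are illustrative assumptions, not taken from the patent.

```python
def pick_victim(blocks):
    """Choose the replaceable cache block: smallest association degree
    first ('assoc' = count of branch sources jumping into the block);
    among equal counts, prefer the block replaced longest ago ('age' =
    cycles since it was last filled, larger = older), so that a freshly
    filled block is not evicted immediately."""
    # Tuple ordering: compare assoc first, then prefer larger age.
    return min(blocks, key=lambda b: (b['assoc'], -b['age']))['id']
```

A real implementation would also check the child count and branch-target count described above before declaring a block replaceable; this sketch shows only the ordering criterion.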
  • the present invention provides a method of saving the register states of the tracker and processor core to a memory identified by a thread number.
  • threads are switched by swapping register states between that memory and the tracker and processor core. Because the instructions of different threads in the push cache of the present invention are independent, the cache does not need to be flushed on a thread switch, and no thread will execute another thread's instructions.
  • the present invention proposes a method and processor system that can execute instructions provided by a plurality of memory levels simultaneously.
  • the invention proposes a function call and function return method and system based on a track table.
  • the invention provides a computer memory hierarchy organization method and system in which all storage levels above the hard disk, including traditional main memory, are organized as caches and managed by hardware, without the operating system allocating memory.
  • This method needs no tag-unit matching when an instruction or data is read, which reduces delay.
  • the invention proposes a fully associative caching method that preserves the relationships between data level by level, and avoids the compare-and-match operation between addresses and tags by using bidirectional address mappings between the data at different levels.
  • the cache system pushes (serves) data to the processor core in advance, according to the previous executions of the same load instruction, the retained stride information, and the recorded relationships between data.
  • the present invention proposes a method and system for extracting and recording the interrelationships between logically organized data (i.e., data that contain the address information of other data).
  • the method and system learn autonomously from the results of executing load instructions, and the extracted logical relationships between data are retained in a data track table.
  • the entries in the data track table correspond one-to-one with the data storage table entries.
  • the data track entry corresponding to a 'data' item in the data memory retains the 'data type' produced by analyzing the relationships between the data.
  • the data track entry corresponding to an 'address' item in the data memory retains the mapped 'address pointer' of that address.
  • the 'address pointer' can directly address the data memory to read data, without tag-unit matching.
  • the method and system push data to the processor core according to the previously extracted logical relationships between the data.
  • the cache system reads data in advance and pushes it to the processor core, according to the logical relationships retained in the data track table from previous executions of the same load instruction, together with the comparison results provided by the processor core when executing the related instructions.
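The "retained stride information" mentioned above can be illustrated with a minimal stride learner: the cache observes the addresses touched by previous executions of the same load instruction and pushes the predicted next datum before the load executes again. This is a hedged sketch of the general stride-prediction idea, not the patent's exact mechanism; all names are assumptions.

```python
class DataPusher:
    """Toy stride-based data pusher: per load instruction, remember the
    last data address and the stride between consecutive executions."""
    def __init__(self):
        self.last_addr = {}  # load instruction id -> last data address
        self.stride = {}     # load instruction id -> learned stride

    def observe(self, load_id, addr):
        # Learn the stride from two consecutive executions of the load.
        if load_id in self.last_addr:
            self.stride[load_id] = addr - self.last_addr[load_id]
        self.last_addr[load_id] = addr

    def predict_next(self, load_id):
        # Address whose datum can be pushed ahead of the next execution,
        # once a stride has been learned; None before that.
        if load_id in self.stride:
            return self.last_addr[load_id] + self.stride[load_id]
        return None
```

Pointer-chasing data (the logically organized data above) would instead use the recorded 'address pointer' entries rather than a stride.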
  • the memory hierarchy method and system of the present invention actively pushes most instructions and data to the processor core; most of the time the processor core only needs to provide branch decisions or comparison results, and the processor's pipeline stall signal.
  • the present invention provides a memory hierarchy and method that can access a memory hierarchy at the other end of a communication channel with a uniform memory address.
  • the present invention provides a processor system including a processor core and a buffer that pushes instructions and data to the processor core for execution and processing by the processor core.
  • the system and method of the present invention can provide a fundamental solution to the two-way delay of processor core accesses to the buffer in a processor system.
  • in a conventional system, the processor core sends a memory address to the buffer, and the buffer transmits information (instructions or data) to the processor core based on that memory address.
  • the system and method of the present invention exploit the correlation between instructions to push instructions from the buffer to the processor core, thereby avoiding the delay of the processor core transmitting the memory address to the buffer.
  • the push buffer of the present invention is not in the pipeline structure of the processor core, so instructions can be pushed in advance to mask the delay from the buffer to the processor core.
  • the system and method of the present invention also provide a multi-level cache organization in which virtual-to-physical address translation and address mapping are performed only at the lowest level cache (LLC), rather than at the highest-level cache as in a conventional virtually addressed cache, and the address mapping is cached at each level.
  • Each cache level in this multi-level organization can be addressed by a buffer address derived from the physical-address mapping of the memory, so that the cost and power consumption of a fully associative cache approach those of a direct-mapped cache.
  • the system and method of the present invention also provides a cache replacement method based on the degree of association between cache blocks, which is suitable for a buffer that utilizes an inter-instruction relationship (control information flow).
  • FIG. 1 is an embodiment of a track table based cache system of the present invention
  • FIG. 3 is another embodiment of the processor system of the present invention.
  • FIG. 5 is another embodiment of the processor system of the present invention.
  • FIG. 6 is an address format of a processor system in the embodiment of FIG. 5;
  • Figure 7 is a partial storage table format of the processor system in the embodiment of Figure 5;
  • FIG. 8 is another embodiment of the processor system of the present invention.
  • FIG. 10 is a schematic diagram of a pipeline structure of a processor core in the processor system of the present invention.
  • FIG. 11 is another embodiment of the processor system of the present invention.
  • Figure 12 is an embodiment of the processor/memory system of the present invention.
  • Figure 13 is another embodiment of the processor/memory system of the present invention.
  • Figure 14 is a format of each storage table in the embodiment of Figure 13;
  • Figure 15 is an address format of a processor system in the embodiment of Figure 13 of the present invention.
  • FIG. 16 is a format of a data track table, a data active table, and a data related table according to the present invention
  • Figure 18 is another embodiment of the processor/memory system of the present invention.
  • FIG. 19 is a schematic diagram of the action mechanism of the data cache hierarchy in the embodiment of FIG. 18 of the present invention.
  • FIG. 21 is an embodiment of prefetching data organized in a logical relationship
  • FIG. 22 is an embodiment of processing a function call and a function return instruction
  • FIG. 1 shows an example of a cache system including a track table of the present invention.
  • 10 is an embodiment of the track table of the present invention.
  • the track table 10 has one row per cache block of the level 1 buffer 22; each row is a track corresponding to a level 1 cache block, and each entry on a track corresponds to one instruction in that cache block.
  • each L1 cache block in the L1 cache contains at most 4 instructions, whose intra-block offset addresses BNY are 0, 1, 2, and 3 respectively.
  • five instruction blocks in the level 1 buffer 22, whose level 1 buffer block addresses BN1X are 'J', 'K', 'L', 'M', and 'N', are described as an example. Accordingly there are 5 corresponding tracks in the track table 10; the up to 4 entries in each track correspond to the up to 4 instructions in the corresponding level 1 cache block, and BNY also addresses the entries within a track.
  • the track table 10 and the corresponding level 1 buffer 22 can both be addressed by the level 1 buffer address BN1, formed by the level 1 buffer block address BN1X and the intra-block offset address BNY, to read a track table entry and the corresponding instruction.
  • the fields 11, 12, and 13 in FIG. 1 form the entry format of the track table 10.
  • field 11 is the instruction type; according to the type of the corresponding instruction, entries fall into two categories: non-branch instructions and branch instructions.
  • the type of a branch instruction may be further subdivided into direct and indirect branches along one dimension, or into conditional and unconditional branches along another dimension.
  • Stored in field 12 is the buffer block address, and in field 13 the intra-block offset.
  • field 12 is in the level 1 buffer BN1X format
  • field 13 is in the BNY format.
  • the buffer address may also use other formats, in which case address format information can be added to field 11 to indicate the format of the addresses in fields 12 and 13.
  • a track table entry for a non-branch instruction stores only the non-branch type in its instruction type field 11; an entry for a branch instruction has the BNX field 12 and the BNY field 13 in addition to the instruction type field 11.
  • the value 'J3' in the entry 'M2' indicates that the branch target of the instruction corresponding to the 'M2' entry has the level 1 cache address 'J3'.
  • the corresponding instruction can be identified as a branch instruction from field 11 of the entry, and fields 12 and 13 show that its branch target is the instruction at address 'J3' in the level 1 buffer; that is, the instruction at BNY '3' of the 'J' instruction block in the level 1 buffer is the branch target instruction.
  • besides the columns for BNY '0' to '3', the track table 10 also contains an additional end column 16. Each end entry has only fields 11 and 12: field 11 stores an unconditional branch type, and field 12 stores the BN1X of the instruction block that sequentially follows the corresponding instruction block, so that the next instruction block can be found directly in the L1 cache from this BN1X, together with the track corresponding to the next instruction block in the track table 10.
  • the blank entries in the track table 10 correspond to non-branch instructions, and the remaining entries correspond to branch instructions.
  • the entries also show the level 1 cache address (BN1) of the branch target (instruction) of the corresponding branch instruction.
  • for a non-branch entry, the next instruction to be executed can only be the instruction represented by the entry to its right on the same track; for the last entry in a track, the next instruction to be executed can only be the first valid instruction in the level 1 cache block pointed to by the content of the end entry of that track; for a branch instruction entry on a track, the next instruction to be executed may be either the instruction represented by the entry to its right or the instruction pointed to by the BN in that entry, as selected by the branch decision. Therefore, the track table 10 contains all the program control flow information of all the instructions stored in the level 1 cache.
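The three control-flow rules above can be modeled directly. Below is an illustrative Python toy model (not the patent's hardware) of a track table holding the 'M2 → J3' branch entry of FIG. 1, with an end column supplying the next sequential block; the hypothetical end-column values are assumptions for the example.

```python
# One track per L1 cache block; branch entries map BNY -> target BN1.
track_table = {
    'M': {2: ('branch', ('J', 3))},   # instruction M2 branches to J3
    'J': {},                          # no branch instructions in block J
}
end_column = {'M': 'N', 'J': 'K'}     # next sequential block per track

def next_bn1(bnx, bny, taken):
    """Next BN1 = (block, offset), computed from control flow
    information alone -- the processor core never produces an address,
    it only supplies the branch decision 'taken'."""
    entry = track_table.get(bnx, {}).get(bny)
    if entry and taken:               # branch entry + taken: follow target
        return entry[1]
    if bny == 3:                      # last slot: follow the end column
        return (end_column[bnx], 0)
    return (bnx, bny + 1)             # otherwise fall through in the block
```

Every possible successor of every instruction is thus recoverable from the table, which is what lets the buffer push instructions ahead of execution.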
  • FIG. 2 is an embodiment of the processor system of the present invention.
  • a level 1 cache 22, a processor core 23, a controller 27, and a track table 20 like the track table 10 of FIG. 1 are included.
  • The incrementer 24, the selector 25, and the register 26 form a tracker 47 (within the dotted line).
  • the processor core 23 controls the selector 25 in the tracker with the branch decision 31, and controls the register 26 in the tracker with the pipeline stop signal 32.
  • the selector 25 is controlled by the controller 27 and the branch decision 31 to select the output 29 of the track table 20 or the output of the incrementer 24.
  • the output of the selector 25 is registered by register 26, and the output 28 of register 26 is called the read pointer (Read Pointer, RPT); its format is BN1.
  • the data width of the incrementer 24 equals the width of BNY; it increments only the BNY of the read pointer by '1' without affecting the value of BN1X. If the increment result overflows (i.e., exceeds the capacity of a level 1 cache block) and the carry output of the incrementer 24 is '1', the system looks up the BN1X of the next level 1 cache block in the end column to replace the current block's BN1X; the following embodiments behave the same unless otherwise stated.
  • the tracker in this embodiment addresses the track table 20 with the read pointer 28 to output an entry on bus 29, and also addresses the level 1 cache 22 to read the corresponding instruction for execution by the processor core 23.
  • the controller 27 decodes field 11 of the entry output on bus 29. If the instruction type in field 11 is non-branch, the controller 27 makes the selector 25 select the output of the incrementer 24; the read pointer is incremented by '1' in the next clock cycle, and the sequential next (fall-through) instruction is read from the level 1 cache 22.
  • if the instruction type is an unconditional direct branch, the controller 27 makes the selector 25 select fields 12 and 13 on bus 29; in the next cycle the read pointer 28 points to the branch target, and the branch target instruction is read from the level 1 cache 22. If the instruction type in field 11 is a direct conditional branch, the controller 27 lets the branch decision 31 control the selector 25: if the decision is not to take the branch, in the next cycle the read pointer 28 is incremented by '1' by the incrementer 24 and the sequential instruction is read from the level 1 cache 22; if the decision is to take the branch, in the next cycle the read pointer points to the branch target and the branch target instruction is read from the level 1 cache 22. When the pipeline in the processor core 23 stalls, the pipeline stall signal 32 halts the update of register 26 in the tracker, so the cache system stops providing new instructions to the processor core 23.
  • the non-branch entries in the track table 10 can be discarded to compress the track table.
  • the compressed track table entry format adds, in addition to the original fields 11, 12, and 13, a source BNY (SBNY) field 15 that records the (source) intra-block offset address of the branch instruction itself, because compression shifts entries horizontally in the table: the order between branch entries is preserved, but they can no longer be addressed directly by BNY.
  • the compressed track table 14 stores the same control flow information as the track table 10 in the compressed entry format. Only the SBNY field 15, the BNX field 12, and the BNY field 13 are shown in the track table 14.
  • the entry '1N2' in row K indicates that the entry represents the branch instruction at address K1, whose branch target is N2.
  • the end column 16 is the rightmost column of the track table 14 and is output through an independent read port 30.
  • the BN1X of the read pointer is used to read out the SBNY field 15 values of all entries in the corresponding row.
  • each SBNY value is sent to the comparator of its column (such as comparator 18), where it is compared with the BNY part 17 of the read pointer.
  • each of these comparators outputs '0' if the SBNY value of its column is less than the BNY, and '1' otherwise.
  • for convenience of description, the cache system is assumed to provide one instruction per clock cycle.
  • FIG. 3 is another embodiment of the processor system of the present invention.
  • 20 is the track table of the level 1 cache
  • 22 is the memory RAM of the first-level buffer
  • 39 is the Instruction Read Buffer (IRB)
  • 47 is the tracker.
  • 91 is a register
  • 92 is a selector
  • 23 is a processor core.
  • The instruction read buffer IRB 39 may store part of a level 1 instruction cache block, or one or more whole level 1 instruction cache blocks, and is addressed by the read pointer 28 of the tracker 47.
  • Read pointer 28 also addresses track table 20.
  • the branch target address output by the track table on bus 29 addresses the level 1 cache 22 and is also sent to the tracker 47.
  • The IRB 39 together with the level 1 buffer memory 22 forms a dual-read-port memory.
  • the IRB 39 provides the first read port, the memory 22 provides the second read port, and the register 91 temporarily stores the data output by the second read port.
  • The output of the IRB 39 and the output of the register 91 (holding the level 1 buffer 22 output) are selected by the selector 92 under control of the branch decision 31 output by the processor core 23, and the output of the selector 92 is sent to the processor core 23 for execution.
  • Each entry in the end column 16 of table 14 is of the unconditional direct branch type.
  • the other entries in 14 are of the direct conditional branch type.
  • At the beginning the read pointer 28 points to the address 'L0'; the corresponding instruction is read from the IRB 39, and the default value of the branch decision 31 makes the selector 92 select the instruction from the IRB 39 for execution by the processor core 23.
  • At the same time the address 'L0' on the read pointer 28 addresses the track table 14; the entry '0M1' is output on bus 29, the level 1 buffer 22 is addressed by 'M1' on bus 29, and the corresponding branch target instruction is stored in the register 91.
  • the controller 27 compares the SBNY field 15 on bus 29 with the BNY field 13 of the read pointer 28 and finds them equal, so the selector 92 is controlled by the branch decision 31. Assuming 31 is 'no branch' at this time, 31 makes the selector 92 select the output of the IRB 39 in the next clock cycle.
  • the read pointer 28 steps to the address 'L1'; the corresponding instruction is read from the IRB 39 and selected by the selector 92 for execution by the processor core 23.
  • At the same time the address 'L1' on the read pointer 28 addresses the track table 14; the entry '3J0' is output on bus 29, the level 1 buffer 22 is addressed by 'J0' on bus 29, and the corresponding instruction is read into the register 91 as the branch target instruction.
  • the controller 27 compares the SBNY field 15 on bus 29 with the BNY field 13 of the read pointer 28 and finds them unequal, so the selector 92 by default selects the output of the IRB 39 for execution by the processor core 23.
  • the read pointer 28 steps to the address 'L2'; the controller 27 finds that the SBNY field 15 on bus 29 and the BNY field 13 of the read pointer 28 are still unequal, so 27 still makes the selector 92 select the output of the IRB 39 for execution by the processor core 23.
  • the read pointer 28 steps to the address 'L3'; the controller 27 now finds that the SBNY field 15 on bus 29 and the BNY field 13 of the read pointer 28 are equal, so the selector 92 is controlled by the branch decision 31.
  • assuming the branch is taken this time, the branch decision 31 makes the selector 92 select the output of the register 91, i.e., the branch target instruction at address 'J0', for execution by the processor core 23.
  • the branch decision 31 also controls the tracker 47 to load 'J0' from bus 29 into the read pointer 28, and controls the 'J' level 1 cache block to be stored into the IRB 39.
  • the read pointer 28 then steps to 'J1' and addresses the IRB 39 to output the corresponding instruction, which is selected by the selector 92 for execution by the processor core 23.
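The walkthrough above, simplified to the single compressed entry '3J0' of block 'L', can be replayed as a toy simulation (illustrative Python, not the hardware; the read pointer starts at 'L0' and the branch decision sequence drives it):

```python
def run(track_entry, decisions):
    """Replay the walkthrough: track_entry is (sbny, target); decisions
    is the branch outcome per cycle. The entry fires only when the read
    pointer's BNY equals its SBNY; a taken branch then redirects the
    pointer to the target. Returns the executed (block, bny) trace."""
    sbny, target = track_entry
    block, bny, trace = 'L', 0, []
    for taken in decisions:
        trace.append((block, bny))
        if bny == sbny and taken:
            block, bny = target   # branch taken: load target address
        else:
            bny += 1              # otherwise step the read pointer
    return trace
```

Running it with the decision pattern of the walkthrough reproduces the execution order L0, L1, L2, L3, J0.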
  • FIG. 4 is another embodiment of the processor system of the present invention.
  • 40 is the secondary active table (Active List 2, AL2)
  • 41 is the address translation buffer TLB and tag unit TAG of the level 2 cache
  • 42 is the memory RAM of the level 2 cache
  • 43 is the scanner
  • 44 is the selector
  • 20 is the track table of the level 1 cache
  • 22 is the memory RAM of the level 1 buffer
  • 27 is the controller
  • 33 is the selector
  • 39 is the instruction read buffer IRB.
  • the incrementer 24, the selector 25, together with the register 26 constitute a tracker 47
  • the incrementer 34, the selector 35, together with the register 36 constitute a tracker 48
• 23 is the processor core, which can receive two instructions and, under branch control, select one to complete execution while abandoning execution of the other branch
  • 45 is a register for temporarily storing the state of each thread of the processor.
• The scanner 43 examines the instruction blocks stored from the L2 cache memory 42 into the L1 cache memory 22, and calculates the branch target address of each direct branch instruction therein by adding the branch offset to the memory address of the branch instruction itself.
  • the calculated branch target address is selected by the selector 44 and sent to the TLB/tag unit 41 for matching.
• The level 2 cache address BN2 obtained by matching addresses the level 2 active table 40. If the instruction corresponding to that L2 cache address has already been stored in the L1 cache memory 22, the corresponding entry in 40 is valid; the BN1X block address in the entry, the branch instruction type generated by the scanner 43, and the intra-block offset BNY are then merged into one track table entry. If the corresponding entry in 40 is invalid, the L2 cache address BN2 (including the intra-block offset BNY) and the branch instruction type generated by the scanner 43 are merged into one track table entry.
• Each track table entry thus generated for an instruction block is written in instruction order into the track in the track table 20 corresponding to that instruction block in the memory 22; that is, the extraction and storage of the program flow in the instruction block is completed.
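• As a hedged illustration of this track-building step, the sketch below computes each direct branch target as the branch's own address plus its offset and records it in BN1 or BN2 format depending on the active-table lookup; the data layout, block arithmetic and names are this sketch's own simplifications:

```python
# Illustrative track-building sketch: for each direct branch in an instruction
# block, compute target = branch address + branch offset, then record the target
# in BN1 format (target block already in L1) or BN2 format (not yet in L1).
# al2 stands in for the level 2 active table 40: {bn2x: bn1x or absent}.

def build_track(block_addr, instrs, al2):
    """instrs: list of (is_branch, offset) per BNY position.
    Returns track entries (sbny, fmt, block_number, bny) for branches only."""
    track = []
    for bny, (is_branch, offset) in enumerate(instrs):
        if not is_branch:
            continue
        target = block_addr + bny + offset        # branch address + branch offset
        bn2x, t_bny = divmod(target, len(instrs)) # split into block number and BNY
        bn1x = al2.get(bn2x)                      # active-table lookup
        if bn1x is not None:                      # valid entry: use BN1 format
            track.append((bny, 'BN1', bn1x, t_bny))
        else:                                     # invalid: keep the BN2 address
            track.append((bny, 'BN2', bn2x, t_bny))
    return track
```

• In either case the intra-block offset BNY of the target is carried over unchanged, which mirrors the statement later in the description that the BNY portion does not change during address conversion.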
  • the read pointer 28 generated by the tracker 47 addresses the track table 20 to read the entry through the bus 29.
• The controller 27 decodes the branch type and address format in the output entry. If the branch type in the output entry is a direct branch and the cache address is in the BN2 format, the controller 27 addresses the level 2 active table 40 with the BN2 address. If the addressed entry in 40 is valid, the BN1X in that entry is filled into the track table 20 in place of the BN2X in the above entry, so that it becomes the BN1 format; if the entry in 40 is invalid, the level 2 cache memory 42 is addressed by the BN2 address.
• The instruction block read out is filled into a level 1 cache block provided by the level 1 cache replacement logic in the level 1 cache memory 22; the block number BN1X of that level 1 cache block is filled into the above invalid entry in 40 and the entry is set valid, and the BN1X is likewise filled into the above track table entry, replacing the BN2 address in the entry with a BN1 address.
  • the BN1 address of the write track table 20 described above can be bypassed onto the bus 29 and sent to the tracker 47 for later use. If the branch type output via the bus 29 is a direct branch and the buffer address is in the BN1 format, the controller 27 causes it to be sent directly to the tracker 47 for later use.
• For an indirect branch, the controller 27 controls the tracker to wait for the processor core 23 to calculate the indirect branch target address via the bus 46, which the selector 44 sends to the level 2 cache TLB/tag unit 41 for matching; the matched level 2 cache address BN2 then accesses the level 2 active table 40. If the corresponding entry in 40 is invalid, the level 2 cache memory 42 is addressed by the BN2 address as described above and the instruction block read out fills a cache block in the level 1 cache memory 22; the resulting BN1 address is bypassed to the tracker 47 for use.
• The Correlation Table (which may also be referred to as an association table) 37 is a component of the replacement logic of the level 1 cache 22; its structure and function are described in the embodiment of FIG. 7.
• The branch decision pipeline segment in the processor core 23 is preceded by two front-end pipelines: one receives sequential instructions from the instruction read buffer IRB 39 and is named the FT (Fall-Through) branch; the other receives branch target instructions from the level 1 cache memory 22 and is named the TG (Target) branch.
  • the number of front-end pipeline segments included in the two branches is determined by the pipeline structure of the processor. In this embodiment, two front-end pipeline segments are included in each of the two branches as an example.
  • the branch in the processor core 23 determines that the pipeline segment executes the branch instruction, selects one of the two instructions to complete execution based on the generated branch decision 31, and discards execution of the other branch.
• Taking as an example an IRB 39 that can store two instruction blocks, the instruction read buffer IRB 39 is addressed by the IPT read pointer 38 of the tracker 48.
• The level 1 cache memory 22, the correlation table 37 and the track table 20 are addressed by the RPT read pointer 28 of the tracker 47.
• The default value of the branch judgment 31 is '0', i.e., no branch, and the processor core 23 selects the instruction of the FT branch to execute. When the processor core 23 makes a judgment on the branch: if the judgment is 'no branch', the value of the branch judgment 31 is '0' and the processor core 23 selects the instruction of the FT branch to execute; if the judgment is 'branch', the value of the branch judgment 31 is '1' and the processor core 23 selects the instruction of the TG branch to execute.
• The selectors 33, 25 and 35 can all be controlled by the branch judgment 31: when 31 is '0', the three selectors select the input on the right; when 31 is '1', they select the input on the left. In addition, when the processor core 23 has not yet made a judgment on the branch, the selectors 33 and 25 are also controlled by the controller 27.
  • the operation of the processor system of the embodiment of Fig. 4 is described below in conjunction with the contents of track table 14 of Fig. 1.
• Assume the M instruction block is already in the instruction read buffer IRB 39, the branch decision 31 is '1', the selectors 25 and 35 select the left input, and both the IPT read pointer 38 and the RPT read pointer 28 point to address M1.
• The M1 instruction in the IRB 39 pointed to by IPT 38 is sent to the FT branch front-end pipeline in the processor core; at the same time, RPT 28 addresses the track table 20, and the value 'N' of the end entry 16 of row M is read from the independent read port 30 to address the level 1 cache 22 and output the N instruction block to the IRB 39.
• The entry '2J3' at BNY address '1' in row M of the track table 14 is output via the bus 29.
• At this time the branch judgment 31 is the default value '0', and the selector 35 selects the input of the incrementer 34, stepping IPT 38. The controller 27 compares the value '2' in the SBNY field 15 on the bus 29 with the value '1' in the BNY field 13 on RPT 28; since they are not equal, the selector 25 is controlled to select the output of the incrementer 24 to step RPT 28 to point to M2, at which point the SBNY on the bus 29 and the BNY on RPT 28 are equal. The controller 27 therefore controls the selector 33 and the selector 25 to select the input on the right, that is, the BN1 address J3 on the bus 29 is stored into the register 26. Thereafter, the controller 27 controls the RPT read pointer 28 to read J3 from the level 1 cache 22, and the J3 and subsequent K0 instructions are sent to the TG branch front-end pipeline of the processor core 23.
  • M2 is a branch instruction.
• The branch decision pipeline segment executes the M2 instruction to generate a branch decision. If the branch judgment 31 is '0', the processor core 23 selects the M3 and N0 instructions in the FT branch to continue executing, and the J3 and K0 instructions in the TG branch are discarded.
• The branch judgment 31 also controls the selectors 25 and 35 to select the output of the incrementer 34 to be stored in the registers 26 and 36, so that both RPT 28 and IPT 38 point to N1; IPT 38 controls the IRB 39 to output N1 and subsequent instructions to the FT branch of the processor core 23 for continuous execution. RPT 28 points to row N in the track table, reads the end entry of row N, and sends it to the level 1 cache 22 to read out the instruction block sequentially following the N instruction block and store it into the IRB 39.
• If the branch judgment 31 is '1', the processor core selects the J3 and K0 instructions in the TG branch to continue executing, and the M3 and N0 instructions in the FT branch are discarded.
• The branch judgment 31 also controls storing the K line instructions output by the level 1 cache 22 into the IRB 39, and controls the selectors 25 and 35 to select the output of the incrementer 24 to be stored in the registers 26 and 36, so that both RPT 28 and IPT 38 point to K1; IPT 38 controls the IRB 39 to output K1 and subsequent instructions to the FT branch of the processor core 23 for continuous execution.
• RPT 28 points to row K, and the end entry 'L' of row K is sent to the level 1 cache 22 to read out the L instruction block and store it into the IRB 39.
• Thus the processor core 23 can execute instructions without interruption and without pipeline stalls caused by branches.
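• The dual front-end behavior above can be caricatured in a few lines of code; the state tuples, the two-block IRB list and the function name are assumptions of this model, not the embodiment's structures:

```python
# Toy model of the FT/TG dual front end: each path presents its next
# (pointer, instruction-block) state, and branch decision 31 commits one
# while the other is abandoned. irb models the two-block instruction read
# buffer; on a taken branch the target block is filled from the L1 cache,
# evicting the oldest block in this sketch.

def commit(ft_state, tg_state, taken, irb):
    """ft_state/tg_state: (next_pointer, block_id). Returns the committed
    next pointer and updates irb in place."""
    ptr, block = tg_state if taken else ft_state
    if block not in irb:       # TG target block must be brought into the IRB
        irb.pop(0)
        irb.append(block)
    return ptr
```

• With the IRB holding blocks M and N, a taken branch to K1 commits the TG state and swaps K into the IRB, while a not-taken branch commits N1 and leaves the IRB unchanged, mirroring the two outcomes of the M2 branch above.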
  • the tracks corresponding to different threads in the track table are orthogonal, so they can coexist and do not affect each other.
• The indirect branch address 46 generated by the processor core in FIG. 4 is a virtual address; after being combined with the thread number it is selected by the selector 44, its index portion being sent to both the TLB and the level 2 tag unit in 41, while the virtual tag portion together with the thread number is sent to the TLB and mapped to a physical tag. The physical tag is matched against the tags of each way read out by the index address in the level 2 tag unit, and the resulting way number (Way Number) is combined with the index (Index) in the virtual address to form the level 2 cache block address.
• The level 2 cache address BN2 and the level 1 cache address BN1 obtained by the mapping are thus effectively mapped from the physical address rather than the virtual address. Therefore, the same virtual address in two different threads of the processor maps to different cache addresses BN, which avoids the address aliasing problem of the same virtual address in different programs of different threads addressing the same cache address.
• The same virtual address of the same program in different threads is mapped to the same physical address, so its mapped cache address is also the same, avoiding duplication of the same program in the cache. Based on this property of the cache address, multi-threaded operation can be implemented. 45 in FIG. 4 is a register bank in which the thread number and the state registers in the processor are stored per thread, such as the contents of the register 26 in the tracker 47 and the register 36 in the tracker 48 of FIG. 4, as well as the values of that thread's registers in the processor core 23; 45 is addressed by the thread number 49.
• When the processor is to switch threads, the values in the registers 26 and 36 in the trackers 47 and 48, and the values in the registers in the processor core 23, are all read out and stored into 45. The thread number on the bus 49 is then changed to the swapped-in thread number, which is transmitted by the bus 49 to 45, and the contents of the entry pointed to by that thread number are swapped into the registers 26 and 36 and the registers in the processor core 23. Once the instruction block pointed to by IPT 38 and the sequentially next instruction block are in the IRB 39, operation of the swapped-in thread can begin.
• The instructions of each thread in the track table 20 and in the caches 42 and 22 are orthogonal, so one thread cannot mistakenly execute an instruction of another thread.
  • FIG. 5 is another embodiment of the processor system of the present invention.
• The level 2 active table 40, the level 2 cache memory RAM 42, the level 2 scanner 43, the track table 20, the level 1 cache correlation table 37, the level 1 cache memory RAM 22, the instruction read buffer 39, the tracker 47, the tracker 48 and the processor core 23 have the same functions as the modules of the same numbers in the embodiment of FIG. 4; although the controller 27 and the selector 33 are omitted from FIG. 5 to make the drawing easy to read, operation from the level 2 cache downward is the same as in the embodiment of FIG. 4.
• Compared with FIG. 4, a level 3 cache is added, consisting of a level 3 active table 50, the TLB and tag unit TAG 51 of the level 3 cache, the level 3 cache memory 52, the level 3 scanner 53 and the selector 54, which replace the TLB and tag unit 41 of the level 2 cache and the selector 44 in FIG. 4.
• In the embodiment of FIG. 5, the last level cache (Last Level Cache), i.e., the level 3 cache 52, is organized in a set-associative manner, while the level 2 cache 42 and the level 1 cache 22 are both organized in a fully associative manner. Each L2 cache block in 42 holds four L1 cache blocks, and each L3 cache block in the L3 cache 52 holds four L2 cache blocks.
  • FIG. 6 is an address format of the processor system in the embodiment of FIG. 5.
• The memory address is divided into a tag (Tag) 61, an index (Index) 62, a level 2 subaddress 63, a level 1 subaddress 64, and an intra-block offset (BNY) 13.
• The address BN3 of the level 3 cache is composed of a way number 65, an index 62, a level 2 subaddress 63, a level 1 subaddress 64, and an intra-block offset (BNY) 13; the way number 65 and the index 62 together address a level 3 cache block.
• The address BN2 of the level 2 cache is composed of a level 2 cache block number 67, a level 1 subaddress 64, and an intra-block offset (BNY) 13; the level 2 cache block number 67 addresses a level 2 cache block, and 67 together with 64, collectively referred to as BN2X, addresses a level 1 instruction block within that level 2 cache block.
• The address BN1 of the level 1 cache is composed of a level 1 cache block number 68 (BN1X) and an intra-block offset (BNY) 13.
  • the intra-block offset (BNY) 13 in the above four address formats is the same, and the BNY portion does not change when the address conversion is performed.
  • the secondary block number 67 points to a secondary cache block
  • the primary subaddress 64 points to one of the four primary instruction blocks in the secondary cache block.
• The way number 65 and the index 62 point to a level 3 cache block; the level 2 subaddress 63 points to one of its four level 2 instruction blocks; and the level 1 subaddress 64 points to one of the four level 1 instruction blocks in the selected level 2 instruction block.
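• The field decomposition can be sketched as follows; the 2-bit widths of BNY 13 and the subaddresses 63/64 follow the four-sub-blocks organization above, while the index and way widths are arbitrary assumptions of this sketch:

```python
# Illustrative decomposition of the FIG. 6 address formats. The 2-bit subfields
# reflect "four L1 blocks per L2 block, four L2 blocks per L3 block"; the 6-bit
# index and the way width are assumptions, not taken from the description.

BNY_W, SUB1_W, SUB2_W = 2, 2, 2

def split_bn3(bn3, index_w=6):
    """Decompose a BN3 address into (way 65, index 62, sub2 63, sub1 64, BNY 13)."""
    bny   = bn3 & 0b11
    sub1  = (bn3 >> BNY_W) & 0b11
    sub2  = (bn3 >> (BNY_W + SUB1_W)) & 0b11
    index = (bn3 >> (BNY_W + SUB1_W + SUB2_W)) & ((1 << index_w) - 1)
    way   = bn3 >> (BNY_W + SUB1_W + SUB2_W + index_w)
    return way, index, sub2, sub1, bny

def bn3_to_bn2(bn2x_table, way, index, sub2, sub1, bny):
    """BN3 converts to BN2 by looking up the level 2 block number 67 recorded
    for (way, index, sub2); sub1 and BNY carry over unchanged, as stated above."""
    bn2x_block = bn2x_table[(way, index, sub2)]   # stand-in for field 80 lookup
    return (bn2x_block, sub1, bny)
```

• Note that the BNY field occupies the same low-order position in every format, which is why the description can state that the BNY portion does not change during address conversion.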
• FIG. 7 shows partial storage table formats of the processor system in the embodiment of FIG. 5, described below with reference to FIG. 5, FIG. 6 and FIG. 7.
  • the format of the tag unit in 51 of Figure 5 is the physical tag 86.
  • the CAM format of the TLB in 51 is the thread number 83 and the virtual tag 84
  • the RAM format is the physical tag 85.
• The thread number 83 and virtual tag 84 of the output selected by the selector 54 are mapped in the TLB to the physical tag 85; the physical tags 86 read out of the tag unit by the index address 62 in the virtual address are matched against 85 to obtain the way number 65. The way number 65 is combined with the index address 62 in the virtual address to form the level 3 cache block address.
• The AL3 level 3 active table 50 of FIG. 5 is organized in a set-associative manner, with the same number of ways and rows as the tag unit in 51 and the L3 cache 52, and is likewise addressed by the index address 62.
• Each row has a count field 79 and four BN2X fields 80; the 80 fields in the same row are addressed by the level 2 subaddress 63.
  • Each 80 field has its corresponding valid bit 81.
  • the same line of each way shares a three-level pointer 82.
• The AL2 level 2 active table 40 is organized in a fully associative manner, with the same number of rows as the L2 cache 42, and is addressed by the level 2 block address 67.
• The CT correlation table 37 is organized in a fully associative manner, with the same number of rows as the L1 cache 22, and is addressed by the level 1 block address 68.
• When a level 2 cache block is stored into the level 2 cache memory 42, its block number is stored into the entry 80 addressed by the level 2 subaddress 63 in the corresponding row of the level 3 active table 50, and the corresponding valid bit 81 is set to '1' (valid).
  • the instructions in the L2 cache block are decoded by a three-level scanner 53, wherein the branch offset in the branch instruction is added to the address of the instruction to obtain the branch target address.
• The address of the sequentially next L2 cache block is also obtained by adding the size of one L2 cache block to the memory address of the current L2 cache block.
• The branch target address or the next L2 cache block address is selected by the selector 54 and sent to the tag unit 51 for matching. If there is no match, the address is sent to the lower level memory to read the instructions, which are stored into the L3 cache memory 52. This ensures that for the instructions in the L2 cache memory 42, their branch targets and the sequentially next L2 cache block are at least already in the L3 cache memory 52 or in the process of being stored into 52.
• When a level 1 instruction block in one of the level 2 cache blocks in the level 2 cache 42 is stored into a level 1 cache block in the level 1 cache 22, the block number of that level 1 cache block in 22 is stored into the entry 76 addressed by the level 1 subaddress 64 in the corresponding row of the level 2 active table 40, and its corresponding valid bit 77 is set to '1' (valid).
  • the instructions in the level one cache block are decoded by the secondary scanner 43, wherein the branch offset in the branch instruction is added to the address of the instruction to obtain the branch target address.
  • the address of the next level one cache block in the first level cache block is also obtained by the memory address of the first level cache block plus the size of one level one cache block.
• The branch target address or the sequentially next level 1 cache block address is selected by the selector 54 and sent to the tag unit 51 for matching. If there is no match, the address is sent to the lower level memory to read the instructions, which are stored into the level 3 cache memory 52. The entries 80 and 81 in the level 3 active table 50 are then read out using the 65, 62, 63 portions of the matched level 3 cache address. If 81 is '0' (invalid), the level 3 cache memory 52 is addressed by the 65, 62, 63, 64 portions of the matched level 3 cache address, a level 2 cache block is read out and stored into the level 2 cache, and the block number 67 of that level 2 cache block together with the valid value '1' is written into the entries 80 and 81 addressed by the above level 3 cache address in the level 3 active table 50.
• The AL2 level 2 active table 40 then reads the entries 76 and 77 using the BN2X value (67 and 64) in the read-out entry 80. If 77 is '0' (invalid), the BN2 address (67, 64, 13) formed by combining that BN2X value with the BNY is stored into the entry corresponding to the branch instruction on the track being filled in the track table 20. If 77 is '1' (valid), the BN1 address (68, 13) formed by the BN1X in the entry 76 and the BNY is stored into the track table 20 entry corresponding to the branch instruction.
  • the branch type 11 decoded by the secondary scanner 43 is also stored in the entry of the track of the track table 20 together with the above BN2 or BN1 address.
• The address of the sequentially next level 1 cache block is also matched and addressed in the above manner. If the next instruction block is not already in the level 2 cache memory, the instruction block is stored from the level 3 cache 52 into the level 2 cache 42, and the resulting BN2 or BN1 address is stored into the end entry 16 at the far right of the above track. This ensures that for the instructions in the level 1 cache memory 22, their branch targets and the sequentially next level 1 cache block are at least already in the level 2 cache memory 42 or in the process of being stored into 42.
• This embodiment thus discloses a hierarchical prefetching function: each storage level ensures that the branch targets of its instructions are at least already in, or being written into, the next lower storage level. This allows the branch target instructions of the instruction being executed by the processor core to be in the level 1 or level 2 cache in most cases, masking the access latency of lower memory levels.
• While a level 1 instruction block is being filled into the level 1 cache memory 22, the corresponding row in the correlation table 37 is also established, and the track of that instruction block is created and filled into the corresponding track in the track table 20; the target entries hold the block numbers BN1X of level 1 cache blocks so as to maintain the integrity of the control information flow in the track table. Whenever a branch target BN1X is written into a track of the track table 20, the row in the correlation table 37 addressed by that BN1X has its count value 70 incremented by '1', recording that one more branch instruction takes this cache block as its target; the level 1 cache block number of the track being written is itself written into a 72 field of that row, and the corresponding 73 field is set to '1' (valid), recording the path (address) of the branch source. A row in the correlation table 37 is addressed and updated in the same manner when the end entry 16 of a track is written.
  • the branch target address format in the entry of the track table 20 may be in the BN2 or BN1 format as described above.
• When an entry is read out, the controller decodes the branch type 11 in it. If the address format is BN2, the controller uses the BN2X address (67 and 64) on the bus 29 to address the level 2 active table 40 and reads the entries 76 and 77. If 77 is '0' (invalid), the L2 cache memory 42 is addressed by the BN2X address, a level 1 instruction block is read out into a level 1 cache block in the level 1 cache memory 22, and that level 1 cache block number together with the valid value '1' is stored into the entries 76 and 77 pointed to by the above BN2X address in the level 2 active table 40. If 77 is '1' (valid), the BN1X 68 in 76 is written into the field 12 of the track table entry without changing the BNY in the field 13, thus replacing the original BN2 address with a BN1 address.
  • the BN1X address can be bypassed onto bus 29 for use by tracker 47.
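• A hedged model of this BN2-to-BN1 promotion follows; the dictionary-based tables and the allocation argument stand in for the active table 40, the caches 42/22 and the replacement logic:

```python
# Illustrative model of promoting a track table entry from BN2 to BN1 format.
# al2 models the level 2 active table 40 as {bn2x: (valid77, bn1x76)};
# l2 and l1 model the cache memories 42 and 22; next_bn1x stands in for the
# level 1 cache block provided by the replacement logic.

def to_bn1(entry, al2, l2, l1, next_bn1x):
    """entry: {'fmt': 'BN1'|'BN2', 'x': block number, 'bny': offset}.
    Returns the entry in BN1 format, filling L1 from L2 on a miss."""
    if entry['fmt'] == 'BN1':
        return entry
    valid, bn1x = al2.get(entry['x'], (False, None))
    if not valid:                       # 77 == '0': fill a level 1 cache block
        bn1x = next_bn1x
        l1[bn1x] = l2[entry['x']]       # read the L1 block out of the L2 block
        al2[entry['x']] = (True, bn1x)  # set 77 valid, record BN1X in field 76
    return {'fmt': 'BN1', 'x': bn1x, 'bny': entry['bny']}   # BNY unchanged
```

• A second lookup of the same BN2X then hits the now-valid entry and returns the same BN1X without touching the caches, which is the behavior the valid bit 77 exists to provide.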
• The tracker 47 addresses the track table 20 and the level 1 cache memory 22; the tracker 48 addresses the IRB 39. The process of providing the processor core 23 with uninterrupted instructions for execution is the same as in the embodiment of FIG. 4 and is not described again here.
• The cache replacement logic of this embodiment determines the cache blocks that can be replaced by a method combining Least Correlation (LC) and Earliest Replacement (ER).
• The count value 70 in the correlation table 37 is used to detect the correlation (also called the degree of association): the smaller the count value, the fewer the cache blocks that take this level 1 cache block as a branch target, and the more suitable it is for replacement.
  • the pointer 74 shared by each row in the associated table 37 points to a row that can be replaced (the count value 70 in the replaceable row must be lower than a preset value).
• When a level 1 cache block is replaced, the corresponding track pointed to by 74 in the track table 20 is also replaced with the branch types and branch targets extracted by the level 2 scanner 43 from the replacing cache block. In addition, for each 73 field that is '1' (valid) in that row of the correlation table 37, the track in the track table 20 addressed by the BN1X address in the corresponding 72 field is accessed, and the branch target addresses recorded there with the replaced level 1 cache block number are replaced with the BN2X in the 71 field of the row indicated by 74 in the correlation table 37, so that the instructions originally taken as branch targets in the replaced level 1 cache block are now taken as the same instructions in the level 2 cache; replacing the level 1 cache block therefore does not affect the control information flow.
• That BN2X is also used to address the level 2 active table 40, and the count value 75 in the entry of 40 is increased by the number of times BN1X was replaced by that BN2X value in the track table 20, recording the increased correlation of the level 2 cache block.
• The pointer 74 then moves in a single direction, stopping on the next row that satisfies least correlation; when the pointer passes the boundary of all rows in the correlation table 37, it moves to the other boundary (e.g., past the row with the largest address, least-correlation detection restarts from the row with the smallest address). The one-way movement of the pointer 74 ensures that the level 1 cache block replaced earliest is preferentially replaced, i.e., the ER above. Detecting the count value 70 of each row together with the one-way movement of the pointer 74 implements the LCER level 1 cache replacement strategy. This replacement method replaces a single level 1 cache block at a time.
• The replacement may also be continued in order or in reverse order until a level 1 cache block is encountered whose count value 70 in the corresponding correlation table 37 exceeds a preset value.
• This replacement method replaces a plurality of L1 cache blocks at a time. Singular or plural replacement can be used as needed, and the two methods can also be mixed: for example, singular replacement is used normally, and plural replacement is used when the lower level cache lacks replaceable cache blocks.
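• The LC detection combined with the one-way ER pointer can be sketched as follows; the row layout and threshold handling are simplifications of this model:

```python
# Minimal sketch of the LC+ER victim scan: pointer 74 moves one way over the
# correlation-table rows, wrapping at the boundary, and stops at the first row
# whose count value 70 is below the preset threshold (least correlation).

def next_victim(counts, ptr, threshold):
    """counts: per-row count value 70; ptr: current position of pointer 74.
    Returns (victim_row, new_ptr), or (None, ptr) if no row qualifies."""
    n = len(counts)
    for step in range(1, n + 1):
        row = (ptr + step) % n          # one-way movement with wrap-around (ER)
        if counts[row] < threshold:     # least correlation (LC)
            return row, row
        # rows at or above the threshold are skipped, not replaced
    return None, ptr
```

• Because the pointer never moves backwards, a block filled just after the pointer passed its row is not examined again until a full sweep completes, which is the earliest-replacement property the text describes.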
  • the replacement of the L2 cache is also based on the LCER strategy.
• When a level 1 cache block is replaced, the corresponding 77 field in the level 2 active table 40 is set to '0' and the count value 75 is increased; when a cache block is stored from the level 2 cache memory 42 into the level 1 cache memory 22, the corresponding valid bit 77 in the corresponding entry of the level 2 active table 40 is set to '1' and the level 1 cache block number BN1X is written into the corresponding 76 field. Each time a BN2X is written into a track table entry, the count value 75 corresponding to that BN2X in the level 2 active table 40 is incremented by '1'; each time a BN2X in a track table entry is replaced by a BN1X, the count value 75 corresponding to that BN2X in the level 2 active table 40 is decremented by '1'.
• The count value 75 thus records the number of times a level 2 cache block is used as a branch target; each valid bit 77 in the entry records whether a portion of the level 2 cache block has been stored into the level 1 cache; and each 76 field records the block address 68 of the corresponding level 1 cache block.
• Replacement of the level 2 cache likewise moves the shared level 2 pointer 78 in one direction, stopping on the next replaceable level 2 cache block. A replaceable level 2 cache block can be defined as one whose corresponding level 2 active table 40 entry has a count value 75 of '0' and all 77 fields '0', i.e., the level 2 cache block is unrelated to all instructions in the level 1 cache 22; the one-way movement of the pointer 78 guarantees the ER.
  • the replacement of the L3 cache is also based on the LCER strategy.
• When a level 2 cache block is stored into the level 2 cache, the corresponding valid bit 81 in the corresponding entry of the level 3 active table 50 is set to '1' and the level 2 cache block number BN2X is written in. The count value 79 in the entry of the level 3 active table 50 is not used in this embodiment.
• The level 3 cache is set-associative: each set (same index address) has a plurality of ways, and each set shares a pointer 82. The next replaceable way can be found by the pointer 82, a replaceable way being one in which all 81 fields are '0', i.e., the level 3 cache block is unrelated to the instructions in the level 2 cache 42 and can therefore be replaced.
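• The replaceability conditions described above for the level 2 and level 3 caches can be written as simple predicates; the dictionary entry layouts are illustrative:

```python
# Replaceability predicates matching the conditions stated above:
# a level 2 cache block is replaceable when its count value 75 is 0 and all
# four valid bits 77 are '0' (no part of it is in L1, and no track entry
# targets it); a level 3 way is replaceable when all its valid bits 81 are '0'.

def l2_replaceable(entry):
    """entry: {'count75': int, 'valid77': [bool, bool, bool, bool]}"""
    return entry['count75'] == 0 and not any(entry['valid77'])

def l3_replaceable(way_entry):
    """way_entry: {'valid81': [bool, bool, bool, bool]}"""
    return not any(way_entry['valid81'])
```

• Both predicates encode the same principle the description states for every level: a block is safe to evict only when no higher storage level still depends on it.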
• The above method of using pointers to ensure that a cache block that has just been filled is not immediately replaced again may also be substituted by other methods.
• The level 3 cache is organized in a set-associative manner. If no way in a set is replaceable (every way has at least one 81 field in the level 3 active table 50 set to '1'), the way whose 81 fields contain the fewest '1's can be selected and a plural replacement performed. For example, if only one 81 field of a certain way is '1', i.e., only one of the four level 2 instruction blocks that the level 3 cache block can hold is in the level 2 cache memory 42, the BN2X in the 80 field corresponding to that 81 field can be used to address the level 2 active table 40, from which the BN1X number in the first valid 76 field in address order (its 77 field being '1') is read out, and the number N of level 1 cache blocks from this one to the last valid level 1 cache block in the level 2 cache block is calculated. That BN1X number and the count N are sent to the level 1 cache replacement logic, which, starting from the level 1 cache block pointed to by BN1X, replaces the level 1 cache blocks together with the handling of the cache blocks that take them as targets; the above level 2 cache block can then be replaced. If the level 1 cache blocks contained in the level 3 cache block are not contiguous, a plurality of starting points and a plurality of corresponding N values are set and sent in sequence to the level 1 cache replacement logic as described above.
• In the embodiment of FIG. 7, the count values at each level, i.e., 79 in the level 3 active table 50, 75 in the level 2 active table 40, and 70 in the (level 1) correlation table 37, are used to record the degree of correlation of a cache block within the same storage level. The valid bits at each level are used to record the correlation of a cache block with the next higher storage level, such as 81 in the level 3 active table 50 recording the association with level 2 cache blocks, and 77 in the level 2 active table 40 recording the association with level 1 cache blocks.
• The 73 fields in the correlation table 37 record the branch sources that jump to the level 1 cache block. By replacing the BN1X address of the current cache block in each track table 20 entry pointed to by those branch source addresses with the BN2X address 71 of the cache block recorded in 37, the integrity of the control flow information is maintained, and the cache block can then be replaced.
  • Another replacement method can select a cache block replacement with a degree of association of '0'.
  • the cache system of the present invention operates based on control flow information, so the basic principle of cache replacement is that the integrity of the control flow information is not compromised.
  • FIG. 8 is another embodiment of the processor system of the present invention.
  • Figure 8 is a modification of the embodiment of Figure 5, wherein the three-level active table 50, the three-level cache TLB and tag unit 51, the third-level buffer memory 52, the selector 54, the secondary active table 40, the second-level cache memory 42, the track table 20, the level-one cache correlation table 37, the level-one buffer memory 22, the instruction read buffer 39, the tracker 47, the tracker 48, and the processor core 23 have the same functions as the modules of the same numbers in the embodiment of FIG. 5.
  • The difference is that the scanner 43 (which can generate the branch type) is connected to the bus from the tertiary buffer 52 to the secondary buffer 42 and is the only scanner in this embodiment; in addition, a secondary track table 88 is added.
  • the organization of each buffer in the embodiment of Fig. 8 is the same as that in the embodiment of Fig. 5.
  • Each track in the secondary track table 88 corresponds to a secondary cache block in the secondary buffer 42.
  • Each secondary track contains four primary tracks, each corresponding to one primary instruction block in the secondary cache block.
  • The format of a primary track in the secondary track table 88 also takes the SBNY 15, type 11, BNX 12, and BNY 13 format of Figure 1; the address can be in BN3 or BN2 format.
  • The scanner 43 scans each L2 cache block being stored from the L3 buffer memory 52 into the L2 cache memory 42 and calculates the branch target addresses of the branch instructions therein.
  • The branch target address is selected by the selector 54 and sent to the TLB/tag unit 51 to be matched into a BN3 address; the BN3 address then addresses the three-level active table 50 to detect whether the entry is valid (that is, whether the corresponding cache block has already been stored in the secondary cache memory 42). If valid, the BN2X address in the entry is combined with the BNY in the BN3 address to form a BN2 address, which is stored together with the SBNY 15 and type 11 generated by the scanner into the entry of the secondary track table 88 corresponding to the branch instruction; if invalid, the BN3 address is stored directly into the 88 entry together with SBNY 15 and type 11.
  • When a primary instruction block in a secondary cache block of the secondary buffer memory 42 is stored into a primary cache block in the primary buffer memory 22, the secondary track table 88 outputs the corresponding primary track over the bus 89 to be stored in the track table 20. If the address in an entry on the track is in the BN3 address format, the three-level active table 50 is addressed by that address. If the entry valid bit 81 is invalid, the corresponding secondary cache block is read from the tertiary buffer 52 in the foregoing manner and stored into an L2 cache block of the L2 buffer, and the L2 cache block number is combined with the first-level sub-address 64 of the BN3 address to form a BN2X address, which is stored into field 80 of the three-level active table 50. If the valid bit 81 is valid, the BN2X in the entry replaces the original BN3X address and is stored into the secondary track table 88. In either case the BN2X is also bypassed onto bus 89 for storage in the track table 20.
  • This embodiment uses the count value 79 in the three-level active table 50 in a manner similar to the use of the count value 75 in the secondary active table in the foregoing embodiment.
  • The BN2 address is also used to address the secondary active table 40. If the valid bit 77 of the addressed entry in 40 is invalid, the BN2 address is stored into the entry in the track table 20; if the valid bit 77 is valid, the BN1X address in the 40 entry is combined with the BNY in the BN2 address and stored into the entry in the track table 20. When a BN2 address is output from the track table 20 via the bus 29, it is likewise used to address the secondary active table 40; if the valid bit 77 in the entry is invalid, the secondary cache memory 42 is accessed with the BN2 address to read a level-one instruction block, which is stored into a level-one cache block in the level-one cache memory 22. The level-one cache block number BN1X is stored into field 76 of the secondary active table 40, the BN1X is stored into the track table 20, and the BN1X is also bypassed onto bus 29 for use by the tracker.
  • With this strategy, the address in a track entry of the secondary track table 88 may be in BN3 or BN2 format, and the address in a track entry of the track table 20 may be in BN2 or BN1 format. Another strategy is to fill the track table 20 only with BN1 addresses. When filling, if the valid bit 77 in 40 is invalid, a level-one instruction block is read from the L2 cache memory 42 and stored into the L1 cache memory 22, the level-one cache block number BN1X is stored into field 76 of the secondary active table 40 and the corresponding valid bit 77 is set to valid; the BN1X is stored into the track table 20 and may also be bypassed onto the bus 29 for use by the tracker. If the valid bit 77 in 40 is valid, the BN1X in field 76 of the table is directly filled into the track table 20 and bypassed onto the bus 29 for use.
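  • The fill strategies above reduce to one rule: store a branch target in the most precise block-address format the active tables currently allow. A minimal sketch, assuming the active tables behave as simple mappings (the dictionary names and values here are hypothetical, not the patent's structures):

```python
# Illustrative sketch (not the patent's RTL): a branch target is stored
# in the track table as BN1 if already in the L1 cache, else BN2 if in
# L2, else BN3. Each active table maps a block number at its level to
# the block number one level up when the entry is valid.

def resolve_target(bn3x, bny, l3_active, l2_active):
    """l3_active: BN3X -> BN2X or None; l2_active: BN2X -> BN1X or None."""
    bn2x = l3_active.get(bn3x)
    if bn2x is None:
        return ('BN3', bn3x, bny)          # only cached at level 3
    bn1x = l2_active.get(bn2x)
    if bn1x is None:
        return ('BN2', bn2x, bny)          # cached at level 2
    return ('BN1', bn1x, bny)              # already in the L1 cache

l3 = {40: 7}          # L3 block 40 resides in L2 block 7
l2 = {7: 3}           # that L2 block's sub-block resides in L1 block 3
assert resolve_target(40, 5, l3, l2) == ('BN1', 3, 5)
assert resolve_target(41, 2, l3, l2) == ('BN3', 41, 2)
```

Note that the BNY offset passes through unchanged at every level, matching the address formats described later.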
  • FIG. 9 is an embodiment of an indirect branch target address generator of the processor system of the present invention.
  • the indirect branch target address is generally obtained by adding a base address stored in the register file in the processor core to the branch offset contained in the indirect branch instruction.
  • In FIG. 9, 93 is an adder; 39 is an instruction read buffer (IRB); 95 is a plurality of registers with comparators and 96 is a plurality of registers, the two being organized as CAM-RAM in one-to-one correspondence; 98 is a selector.
  • 15, 11, 12, and 13 are contents of the entry of the track table 20 output via the bus 29.
  • a set of registers 95 and 96 is arranged for each indirect branch instruction.
  • The adder 93 and IRB 39 are shared by all indirect branch instructions.
  • The track table 20 entry for an indirect branch instruction has an SBNY field 15 and a type field 11 as defined in FIG. 1; however, field 12 is instead used to store a register file (RF) address, and field 13 is used to store the group number of registers 95, 96.
  • When the scanner 43 decodes a scanned instruction as an indirect branch instruction, field 15 and field 11 of the track table entry are generated as described above, the base-address register file number in the instruction is placed in field 12, and field 13 is set to 'invalid'. A field 13 that is 'invalid' causes the system to allocate a set of registers 95, 96 (a set consists of multiple rows of CAM-RAM), and the group number of the set of registers is stored in field 13 of the track table entry.
  • When the write address of the register file is the same as the address in field 12 of the track table entry, the bus 94, which writes the execution result transferred from the execution unit of the processor core back to the register file, is connected to the other input of the adder 93.
  • the output 46 of the adder 93 is the branch target address, which is sent to the TLB/tag unit 51 for matching.
  • The base address on the bus 94 is also stored into an available row of the 95 registers in the register group pointed to by field 13 of the track table entry; the BN1 address resulting from matching the branch target is stored via the bus 89 into the same row of the 96 registers.
  • selector 98 selects the BN1 address on bus 89 to be output via bus 99.
  • When the type of the entry on the bus 29 is an indirect branch instruction, the address on the bus 99 is used by the tracker 47; when the entry type is other, the address on the bus 29 is selected for use by the tracker 47.
  • When an indirect branch instruction is reached again, the register group number in field 13 of the track table entry on bus 29 selects the corresponding register banks 95 and 96, the register file address in field 12 selects the register file write-back bus, and the data on 94 is compared with the contents of the registers 95. If there is a match, the BN1 address in the corresponding row of the registers 96 is output via bus 97 and selected by the selector 98 for use by the tracker. If there is no match, the adder 93 calculates the indirect branch target address as described above, which is matched into a BN1 address on the bus 89, and the selector 98 selects the address output on the bus 89; a mismatch also causes the base address on bus 94 and the BN1 address on bus 89 to be stored into an unused row of the registers 95, 96. Replacement logic is responsible for allocating register sets 95, 96 to indirect-branch-type entries on bus 29 whose field 13 is 'invalid'; the allocation may use LRU or a similar policy.
  • In this way the base address of an indirect branch instruction can be mapped directly to a level-one buffer address BN1, and the steps of address calculation and address mapping are skipped.
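  • The CAM-RAM mechanism above amounts to a small cache from base-address values to BN1 targets. A hedged behavioral sketch follows; the class and its round-robin replacement pointer are illustrative assumptions, since registers 95/96 are hardware, not software objects:

```python
# Assumed model of the register pair 95/96: the last-seen base address of
# an indirect branch is kept beside the BN1 target it mapped to. When the
# branch recurs with the same base, the BN1 address is produced directly,
# skipping both the add (adder 93) and the address mapping.

class IndirectTargetCache:
    def __init__(self, rows=4):
        self.cam = [None] * rows   # models registers 95: base addresses (CAM side)
        self.ram = [None] * rows   # models registers 96: BN1 addresses (RAM side)
        self.next = 0              # trivial round-robin pointer (could be LRU)

    def lookup(self, base):
        for i, b in enumerate(self.cam):
            if b == base:
                return self.ram[i]        # hit: BN1 out, like bus 97
        return None                       # miss: adder + mapping needed

    def fill(self, base, bn1):
        self.cam[self.next] = base
        self.ram[self.next] = bn1
        self.next = (self.next + 1) % len(self.cam)

c = IndirectTargetCache()
assert c.lookup(0x1000) is None           # first execution: compute target
c.fill(0x1000, ('BN1', 3, 5))             # record base -> BN1 mapping
assert c.lookup(0x1000) == ('BN1', 3, 5)  # same base again: fast path
```

A changed base address misses and repopulates a row, mirroring the mismatch path described above.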
  • FIG. 10 is a schematic diagram of a pipeline structure of a processor core in a processor system according to the present invention.
  • 100 is a typical pipeline structure of a traditional computer or processor core, divided into I, D, E, M, W segments.
  • the I segment is the instruction fetch segment
  • D is the instruction decoding segment
  • E is the instruction execution segment
  • M is the data access segment
  • W is the register write segment.
  • 101 is the pipeline structure of the processor core in the present invention, which has one fewer segment than 100.
  • a conventional processor core generates an instruction address that is sent to a memory or buffer to read (pull) the instruction.
  • The cache system of the present invention automatically pushes instructions to the processor core, requiring only that the processor core provide a branch decision 31 to determine the program direction and a stall-pipeline signal 32 to synchronize the cache system with the processor core. Therefore, the pipeline structure of a processor core using the cache system of the present invention differs from the conventional pipeline structure in that no pipeline segment is needed for instruction fetching. In addition, a processor core using the cache system of the present invention does not need to maintain an instruction address (Program Counter, PC). As shown in Figure 9, the indirect branch target address is generated based on the base address in the register file, and no PC address is required; other instructions are likewise accessed by the BN addresses of the cache system, without the PC. Therefore, the PC need not be maintained in a processor core using the cache system of the present invention.
  • FIG. 11 is another embodiment of the processor system of the present invention.
  • Figure 11 is a modification of the embodiment of Figure 8, wherein the three-level active table 50, the three-level cache TLB and tag unit 51, the third-level buffer memory 52, the selector 54, the scanner 43, the secondary track table 88, the secondary active table 40, the secondary cache memory 42, the track table 20, the level-one cache correlation table 37, the level-one buffer memory 22, the instruction read buffer 39, the tracker 47, the tracker 48, and the processor core 23 have the same functions as the modules of the same numbers in the embodiment of FIG. 8.
  • The difference is that a secondary correlation table 103 and a module 102 are added; 102 is the indirect branch target address generator shown in the embodiment of FIG. 9.
  • the organization of the buffer in the embodiment of Fig. 11 is the same as that of the embodiment of Figs. 5 and 8.
  • The secondary correlation table 103 is similar in structure to the correlation table 37.
  • Each row corresponds to an L2 cache block and contains a count value, the L3 cache address corresponding to the L2 cache block, the source addresses of the branch source instructions that take the L2 cache block as branch target, and their valid signals (refer to the format of CT in FIG. 7); as in the correlation table, the count value is the number of branch source instructions.
  • When a track is filled into the secondary track table 88, each BN2-format branch target address in the filled track entries addresses a row in the secondary correlation table 103 (hereinafter the target row); the secondary buffer address of the track being filled into the secondary track table 88 (referred to as the source track) is filled into a source address field of the target row and its valid signal is set to 'valid', and the count of the target row is increased by '1'. The level-three buffer address corresponding to the source track is also filled into the row of the secondary correlation table 103 corresponding to the source track.
  • When the address in an entry of the secondary track table 88 is in the BN3 format, the entry of the three-level active table 50 is addressed by that BN3 address, and the count value 79 therein is increased by '1'.
  • When the format of an entry on the output 29 of the track table 20 is the BN2 format, it is used to address the secondary active table 40. If the corresponding entry is invalid, the instruction block addressed by the BN2 address (hereinafter the source BN2 address) is read from the secondary buffer memory 42 and fills a level-one cache block specified by the replacement logic in the level-one buffer 22. At this time, the source BN2 address also addresses the secondary track table 88 to output the corresponding track to the track table 20 for storage.
  • Each target BN3 address in that track is sent to the tertiary active table 50 to be mapped into a BN2 address (hereinafter the target BN2 address); at this time the count value in the three-level active table entry pointed to by the target BN3 is decreased by '1', and the count in the target row pointed to by the target BN2 address in the secondary correlation table 103 is increased by '1'. The target BN3 address is stored into the same target row; the source BN2 address is also stored into the same target row, and its corresponding valid bit is set to 'valid'.
  • When the secondary pointer 78 points to the target row in the secondary correlation table 103 corresponding to a replaceable L2 cache block, each valid BN2 source address is read from that row. Each BN2 source address addresses the secondary track table 88, and the BN2 target address (pointing to the above target row) in the corresponding entry is replaced with the BN3 target address stored in the target row of 103; the valid bit of each BN2 source address in the target row of 103 is then set to 'invalid'. The count value in the target row of 103 is decreased by the number of valid BN2 source addresses, and the entry of the three-level active table 50 addressed by the above BN3 target address has its count value 79 increased by the same amount.
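  • The bookkeeping above, demoting BN2 targets back to BN3 format and transferring the counts, can be modeled as follows. This is a minimal sketch with assumed dictionary-based tables, not the patent's actual hardware structures:

```python
# Assumed model: when an L2 block is evicted, every track entry that
# targeted it in BN2 format is demoted to the BN3 format kept in the
# target row; the row's count drops by the number of demoted sources and
# the L3 active-table count (field 79) rises by the same amount.

def evict_l2_block(target_row, track_entries, l3_counts):
    demoted = 0
    for src in target_row['sources']:
        if src['valid']:
            # rewrite the source entry: BN2 target -> stored BN3 target
            track_entries[src['addr']] = ('BN3', target_row['bn3'])
            src['valid'] = False
            demoted += 1
    target_row['count'] -= demoted
    l3_counts[target_row['bn3']] = l3_counts.get(target_row['bn3'], 0) + demoted

row = {'bn3': 40, 'count': 2,
       'sources': [{'addr': 10, 'valid': True}, {'addr': 20, 'valid': True}]}
track = {10: ('BN2', 7), 20: ('BN2', 7)}
l3 = {40: 0}
evict_l2_block(row, track, l3)
assert row['count'] == 0 and l3[40] == 2
assert track[10] == ('BN3', 40)    # control-flow info preserved at level 3
```

The count transfer keeps the total degree of association constant across levels, so no control flow information is lost by the eviction.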
  • A lock signal bit may be added to the correlation table entry corresponding to a high-level cache block. When the lock signal bit is '0', operation is the same as above; when the lock signal bit is '1', the corresponding cache block can be replaced only when its degree of association is '0', that is, when no branch instruction takes the cache block as its target (here the end entry of the preceding instruction block is also regarded as holding an unconditional branch instruction).
  • Thus a level-one cache block whose lock signal bit is '1' can be replaced only when its corresponding count value 70 is '0' and all its valid bits 73 are '0'; likewise, an L2 cache block whose lock signal bit is '1' can be replaced only when its corresponding count value and all its valid bits are '0'.
  • When a level-three cache block is to be replaced, the entry of the three-level active table 50 addressed by the BN3 address on the level-3 pointer 83 is read, and every valid BN2 address in it addresses a row in the secondary correlation table 103 and sets the lock signal there to '1'; thereafter the level-three cache block can be replaced, and after the replacement the buffer works in a non-inclusive state. Since the level-three cache block corresponding to an L2 cache block whose lock signal is '1' has been replaced, control flow information can no longer be maintained by replacing the BN2 address in an entry of the secondary track table 88 with the corresponding BN3 address; therefore such a secondary cache block cannot be replaced until its degree of association is '0'.
  • Conversely, if a high-level cache block can be replaced only when its degree of association is '0', and a cache block is set to be replaceable when, in its active table entry, the valid bits of all its higher-level sub-cache blocks (such as 81 in the three-level active table 50) are all '1' and the count value in the entry (such as 79 in 50) is '0', then the buffer is an exclusive organization. It is also possible to set the buffer replacement method so that the cache blocks at all cache levels are replaceable only when their degree of association is '0'.
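  • The lock-bit variant above can be summarized as a predicate. This is an illustrative simplification under assumed arguments; as the text notes, the exact fields consulted differ per cache level:

```python
# Illustrative predicate for the policies above. Arguments are
# assumptions: 'count' models the degree of association (e.g. field
# 70/75/79) and 'valid_bits' the higher-level copy bits (e.g. 73/77/81).

def can_replace(lock, count, valid_bits):
    if lock:
        # Locked: the lower-level copy is gone, so track-table references
        # cannot be demoted; no branch source may still target the block.
        return count == 0 and not any(valid_bits)
    # Unlocked (inclusive operation): references can first be rewritten
    # to the lower-level address, so the block is eventually replaceable.
    return True

assert can_replace(False, 3, [True])          # unlocked: demote, then evict
assert can_replace(True, 0, [False, False])   # locked but unreferenced
assert not can_replace(True, 1, [False])      # still a branch target
assert not can_replace(True, 0, [True])       # copy exists one level up
```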
  • 102 is the indirect branch target address generator of the embodiment of FIG. 9; it is controlled by the entry on the bus 29 output from the track table 20, obtains the base address 94 from the processor core 23, and generates the indirect branch target address 46, which is sent through the selector 54 to 51 for virtual-to-physical address translation and address mapping, outputting a BN1 branch target address 99 for use by the tracker 47. When the type of the entry on the bus 29 is an indirect branch, the tracker 47 selects the address 99 output by 102; when the type of the entry on the bus 29 is another instruction, the tracker 47 selects the address on the bus 29 output from the track table 20.
  • In FIG. 12 an embodiment of a processor/memory system of the present invention is shown.
  • the embodiment of Figure 12 applies the method to a memory external to the processor based on the embodiment of Figure 11, and other embodiments may be deduced by analogy.
  • Below the dashed line in Fig. 12 are the functional blocks and connections in the processor, which are identical to the embodiment of Fig. 11 except that there is no tertiary cache memory 52.
  • These are the three-level active table 50, the three-level cache TLB and tag unit 51, the selector 54, the scanner 43, the secondary track table 88, the secondary active table 40, the second-level cache memory 42, the secondary correlation table 103, the indirect branch target address generator 102, the track table 20, the level-one cache correlation table 37, the level-one buffer memory 22, the instruction read buffer 39, the tracker 47, the tracker 48, and the processor core 23, which have the same functions as the modules of the same numbers in the embodiment of FIG. 11.
  • the memory 111 and its address bus 113 are added above the dotted line in FIG. 12; the memory 112 and its address bus 114 are also added; the bus 115 sends the information block outputted by the memory 112 to the second level buffer memory 42 in the processor below the dotted line.
  • the instructions in the information are also scanned by the scanner 43 and the branch instruction information is extracted as described in the previous embodiment.
  • The memory 111 is organized as a memory and is addressed by the memory address 113 generated when no match is obtained in the TAG of 51 (the virtual memory address generated by the source 102 or 43 is mapped by the TLB in 51 to a physical address). The memory 112 is organized as a buffer and is addressed by the tertiary buffer address 114, which is generated by a match obtained in the TAG of 51 or is output by the secondary track table 88 via 89.
  • That is, the memory 112 outside the processor is actually used as the tertiary buffer memory, in place of 52 in the embodiment of FIG. 11.
  • the memory 111 is a low level memory not shown but described in Figures 4, 5, 8, and 11.
  • The embodiment of FIG. 12 differs from the embodiment of FIG. 11 only in that the last-level (level-three) cache memory (52 in FIG. 11) in the processor is moved outside the processor (112 in FIG. 12); the two embodiments are logically equivalent.
  • the organization of the buffer (including the memory 112 as a three-level buffer memory) in the embodiment of Fig. 12 is the same as that of the embodiment of Fig. 11.
  • the structure in the embodiment of Figure 12 can have several different applications.
  • the first application form is that the memory 111 is a memory having a large capacity and a large access delay; and the memory 112 is a memory having a small capacity but a small access delay. That is, the memory 112 serves as a cache of the memory 111.
  • The memory can be constructed of any suitable storage device, such as a register or register file, static memory (SRAM), dynamic memory (DRAM), flash memory (Flash Memory), hard disk (HD), solid state drive (SSD), any other suitable storage device, or a future new form of memory.
  • The scanner 43 scans the instruction blocks sent from the memory 112 to the secondary buffer memory 42 via the bus 115, calculates the virtual branch target addresses of the direct branch instructions, and sends them to the selector 54 (102 also generates the virtual branch target address of an indirect branch instruction and sends it to 54 via the bus 46).
  • The virtual address is mapped by the TLB in 51 to a physical address, and the physical address is matched against the TAG in 51. If there is no match, the physical address is sent to the memory 111 via the address bus 113 to read the corresponding instruction block, which is stored into the replaceable level-three cache block in the memory 112 indicated by the third-level buffer replacement logic; the level-three cache block number is merged with the lower address output by the selector 54 into a BN3 address and stored into the secondary track table 88. If a match is obtained, the matched way number is merged with the index address output by the selector 54 into a BN3 address for addressing the three-level active table 50; if the entry in 50 is valid, the BN2 address in it is stored into the secondary track table 88, and if the entry in 50 is 'invalid', the BN3 address itself is directly stored into 88. The rest of the operations are the same as those in the foregoing embodiments and are not described here again.
  • A specific embodiment of the first application may use a flash memory as the memory 111 and a DRAM as the memory 112.
  • Flash memory has a large capacity and low cost, but the access latency is large and the number of writes is limited.
  • DRAM has a smaller capacity and higher cost, but its access latency is small and the number of writes is unlimited. Therefore, the structure in the embodiment of Fig. 12 exploits the respective advantages of flash memory and DRAM while masking their respective disadvantages.
  • 111 and 112 are collectively used as the main memory of the computer system. There are lower storage levels, such as hard drives, outside of 111.
  • the first application is suitable for existing computer systems and can use existing operating systems.
  • The memory is managed by the storage manager in the operating system, which records which memory is in use and which memory is free; when a process needs memory it is allocated, and after the process finishes using it the memory is released. Because this storage management is done by software, its execution efficiency is relatively low.
  • The second application of the embodiment of Fig. 12 uses a nonvolatile memory (such as a hard disk, solid state drive, or flash memory) as the memory 111, and a volatile or nonvolatile memory as the memory 112. Here 111 is used as the hard disk of a computer and 112 as the main memory of a computer, but 112 is organized as a buffer, and thus the hardware of the processor can perform storage management on 112. In this system architecture, the storage manager in the operating system is used little or not at all.
  • The instructions in the memory 111 are stored into the memory 112 in blocks as previously described. In a particular embodiment, the instruction blocks may be pages of a virtual memory (Virtual Memory) system. The addresses in this embodiment use the format shown in FIG. 6.
  • The memory 111 (hard disk) address 113 is divided into a tag 61, an index 62, a secondary sub-address 63, a primary sub-address 64, and a primary intra-block offset (BNY) 13. The memory 111 (hard disk) address in this example may have a larger address space than a normal main memory address so as to address the entire hard disk, wherein 63, 64, and 13 combined form the offset address within a page, and 61 and 62 combined form the page number.
  • The address BN3 of the memory 112 (the main memory, that is, the third-level buffer in the foregoing embodiments) is composed of a way number 65, an index 62, a secondary sub-address 63, a primary sub-address 64, and an intra-block offset (BNY) 13. The way number 65 combined with the index 62 forms the block address of the main memory 112, one block being one page as above; 65, 62, and 63 combined address a secondary instruction block within a main memory instruction block (page); all the fields except the intra-block offset 13 are collectively referred to as BN3X, addressing a primary instruction block within the page.
  • The address BN2 of the secondary buffer is composed of a secondary cache block number 67, a primary sub-address 64, and an intra-block offset (BNY) 13, wherein the secondary cache block number 67 addresses a secondary cache block; the fields except the intra-block offset 13 are collectively referred to as BN2X, addressing a level-one instruction block within the secondary cache block.
  • the address BN1 of the primary buffer is composed of a primary cache block number 68 (BN1X) and an intra-block offset (BNY) 13.
  • the intra-block offset (BNY) 13 in the above four address formats is the same, and the BNY portion does not change when the address conversion is performed.
  • the secondary block number 67 points to a secondary cache block
  • the primary subaddress 64 points to one of the four primary instruction blocks in the secondary cache block.
  • The way number 65 and the index 62 in the BN3 address format point to a main memory instruction block, the secondary sub-address 63 points to one of several secondary instruction blocks in the main memory instruction block, and the primary sub-address 64 points to one of the primary instruction blocks in the selected secondary instruction block.
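  • The four address formats can be illustrated by bit-slicing, assuming example field widths (the patent fixes no widths; the 2-bit sub-addresses and 4-bit BNY here are chosen only for the sketch). Only the BNY field 13 is common to all formats and passes through every conversion unchanged:

```python
# Illustrative bit layout: memory address 113 = tag 61 | index 62 |
# secondary sub-address 63 | primary sub-address 64 | BNY 13.
# Widths below are assumptions for the sketch, not patent values.

BNY_BITS, SUB1_BITS, SUB2_BITS = 4, 2, 2

def split_mem_addr(addr, index_bits=6):
    bny  = addr & ((1 << BNY_BITS) - 1);   addr >>= BNY_BITS    # field 13
    sub1 = addr & ((1 << SUB1_BITS) - 1);  addr >>= SUB1_BITS   # field 64
    sub2 = addr & ((1 << SUB2_BITS) - 1);  addr >>= SUB2_BITS   # field 63
    index = addr & ((1 << index_bits) - 1)                      # field 62
    tag = addr >> index_bits                                    # field 61
    return tag, index, sub2, sub1, bny

tag, idx, sub2, sub1, bny = split_mem_addr(0b1011_000101_10_01_0011)
assert (sub2, sub1, bny) == (0b10, 0b01, 0b0011)
# A BN2 address keeps only {block number 67, sub-address 64, BNY 13}:
bn2 = (7 << (SUB1_BITS + BNY_BITS)) | (sub1 << BNY_BITS) | bny
assert bn2 & ((1 << BNY_BITS) - 1) == bny   # BNY unchanged across formats
```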
  • When a new thread starts, the address of its starting point (in the memory 111 address format) is passed through the selector 54 (it is assumed in this embodiment that the selector 54 has a third input for the starting address to enter) and sent to 51. The index 62 in the starting address addresses the tag unit TAG in 51, and the tag contents of each way are read out to be matched against the tag 61 in the starting address. If there is no match, the 61 and 62 fields of the starting address are sent out on the bus 113 to read the corresponding page (instruction block) from the memory 111, which is stored into the memory 112 in the set indicated by the index 62 of the starting address, in the way specified by the way number 65 supplied by the main memory (that is, the third-level buffer in the foregoing embodiments) replacement logic; at the same time the 61 and 62 fields of the starting address are also stored into the same way of the same set in the tag unit in 51.
  • The system controller then reads a secondary instruction block from the memory 112 (main memory) using the way number 65, the index 62 of the starting address, and the secondary sub-address 63, and stores it in the secondary buffer memory 42 in a secondary cache block specified by the secondary cache replacement logic via the secondary block number 67; the secondary block number 67 is stored into field 80 of the entry of the tertiary active table 50 pointed to by the above 65, 62, and 63, and the valid bit 81 in that entry is set to 'valid'.
  • the scanner 43 scans the above-mentioned two-level instruction block, extracts the branch instruction information therein, and generates a track to be stored in the secondary track table 88.
  • The system controller further reads, from the above secondary cache block 67, the primary instruction block pointed to by the primary sub-address 64, and stores it in the primary buffer memory 22 in a primary cache block specified by the primary cache replacement logic via a primary block number 68; the primary block number 68 is stored into field 76 of the entry of the secondary active table 40 pointed to by the above 67 and 64, and the valid bit 77 in that entry is set to 'valid'.
  • The system controller combines the above primary block number 68 with the intra-block offset BNY in the starting address to form a BN1 address, which is placed into the tracker 47 so that the read pointer 28 points to the start instruction of the above thread in the level-one buffer memory 22 and also points to the corresponding entry in the track table 20.
  • the push operation to the processor core thereafter is similar to the previous embodiments.
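  • The start-up sequence above (TAG fill, page fill into main memory, secondary block fill into L2, primary block fill into L1, BN1 into the tracker) can be sketched as follows. The block numbers 67 and 68 stand in for whatever the replacement logic would select, and all data structures here are illustrative assumptions:

```python
# Assumed-name sketch: a new thread's starting address is pushed down
# the hierarchy until its first instruction block sits in the L1 cache;
# thereafter the tracker's read pointer follows the track table and the
# processor core needs no PC.

def start_thread(start_addr, tag_unit, mem111, mem112, l2_cache, l1_cache):
    tag, index, sub2, sub1, bny = start_addr        # decomposed address fields
    way = tag_unit.setdefault((tag, index), len(tag_unit) % 4)  # TAG match/fill
    page = mem111[(tag, index)]                     # fetch page on a TAG miss
    mem112[(way, index)] = page                     # fill main memory (L3 role)
    l2_block = page[sub2]                           # select secondary block
    l2_cache[67] = l2_block                         # 67: block chosen by L2 logic
    l1_cache[68] = l2_block[sub1]                   # 68: block chosen by L1 logic
    return ('BN1', 68, bny)                         # loaded into read pointer 28

mem111 = {(0b1011, 5): [[['i0', 'i1'], ['i2', 'i3']]]}   # one page of nested blocks
bn1 = start_thread((0b1011, 5, 0, 1, 0), {}, mem111, {}, {}, {})
assert bn1 == ('BN1', 68, 0)
```

From this point the push operation proceeds as in the earlier embodiments, with no further software involvement.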
  • Thereafter, a new thread start address injected by the operating system, or a hard disk address generated by the scanner 43 or the indirect branch address generator 102, is selected by the selector 54 and sent to the tag unit in 51 for matching. When the match is successful, the resulting BN3 address addresses the three-level active table 50. If the entry output by 50 is valid, the secondary active table 40 is addressed by the BN2 in that entry; if the entry output by 50 is 'invalid', the memory 112 (main memory) is directly addressed by the above BN3 address to output the secondary instruction block to the secondary buffer memory 42. When a main memory cache block is replaced, the newly read instruction block overwrites the instruction block originally present in that cache block. This replacement process from hard disk to main memory is completely controlled by hardware, and essentially no software operation is required.
  • The replacement logic can use various algorithms such as LRU, NRU (not recently used), FIFO, clock, etc.
  • The translation lookaside buffer (TLB) in 51 is not required in this application of the embodiment of FIG. 12, because the hard disk address is a physical address.
  • the starting address injected by the operating system is a physical address, whereby the resulting main memory address BN3 (for addressing memory 112) of the address mapping is a mapping of physical addresses.
  • the remaining BN2 addresses, which are mappings of BN3 addresses, are also mappings of physical addresses.
  • the memory 111 (hard disk) is the virtual memory of the memory 112 (main memory), and the memory 112 (main memory) is the buffer of the memory 111 (hard disk). Therefore, there is no case where the address space of the program is larger than the address space of the main memory.
  • the same program executed at the same time has the same BN3 address, and the BN3 addresses of different programs executed at the same time must be different. Therefore, the same virtual address of different programs at the same time will be mapped to different BN addresses without confusion.
  • the processor core in the push architecture does not generate an instruction address. Therefore, the physical hard disk address can be directly used as the address of the processor. It is not necessary to generate a virtual address by a processor core as in an existing processor system, and then map to a physical address to access the memory.
  • the memory 111 and the memory 112 in the embodiment of Fig. 12 can be packaged in a package as a memory.
  • the interface between the processor and the memory in the embodiment of FIG. 12 additionally adds a cache address BN3 bus 114.
• although the boundary between the memory and the processor in the embodiment of Fig. 12 is shown as a broken line, some of the functional blocks may also be moved from one side of the boundary to the other.
• the three-level active table 50 and the TLB and tag unit TAG in 51 can be placed on the memory side above the dotted line, which remains logically equivalent to the embodiment of FIG. 12.
• the non-volatile memory 111 chip, the single or multiple memory 112 chips, and the memory chips below the dotted line in FIG. 12 can be interconnected through TSVs (through-silicon vias) and packaged in a single package.
  • Figure 13 is another embodiment of the processor/memory system of the present invention.
  • the embodiment of Figure 13 is a more general representation of the embodiment of Figures 8, 11, and 12.
  • a four-level active table 120, a four-level correlation table 121 and a four-level buffer memory 122 are added, which are addressed by the BN4 bus 123 generated by 51.
• a three-level track table 118 and a three-level correlation table 117 are also added; the count values extracted from the three-level active table 50 in the embodiments of FIG. 8, FIG. 11 and FIG. 12 are stored in 117, so that the format of each level's active table is consistent. That is, in the embodiment of Fig. 13 there is no count value in 50; the count value is stored in 117.
  • the lowest level 111 of the memory hierarchy in the embodiment of Figure 13 is a memory, addressed by memory address 113.
• the remaining memory levels other than 111 are each addressed by the corresponding BN cache address.
• the lowest-level cache, that is, the four-level buffer 122 in the figure, has a set-associative structure.
• the remaining, higher cache levels are all fully associative.
• the scanner 43 is located between the four-level buffer memory 122 and the three-level buffer memory 112.
  • TLB/TAG 51 is in the level 4 cache.
  • Each cache level higher than the level of the scanner 43 has a track table such as 118, 88, 20.
  • Each cache level except the highest cache level has active tables such as 120, 50, and 40.
  • Each cache level has a related table such as 121, 117, 103, 37.
  • the format of each storage table is shown in Figure 14.
  • Figure 14 is a diagram showing the format of each storage table in the embodiment of Figure 13.
• the format of the tag unit in 51 of the Fig. 13 embodiment is the physical tag 86.
  • the CAM format of the TLB in 51 is the thread number 83 and the virtual tag 84, and the RAM format is the physical tag 85.
• the thread number 83 and the virtual tag 84 selected by the selector 54 in FIG. 13 are mapped to the physical tag 85 in the TLB; the physical tags 86 read out of the tag unit at the index address 62 of the virtual address are matched against 85 to obtain the way number 65.
• the way number 65 and the index address 62 in the virtual address are joined together to form a four-level cache block address 123.
  • each entry in the track table contains type 11, cache block address BNX 12 and BNY13, may also contain SBNY 15 to determine the branch execution time point.
• the cache block address 12 in each level of track table may be in the BN format of that level or of a lower level; for example, 12 in the three-level track table 118 may be in BN3X or BN4X format.
• the active table entry has a buffer block number 76 for the corresponding sub-block, whose format is the cache block number of the level one higher than the current level (for example, the BN2X stored in the third-level active table 50), and a corresponding valid bit 77.
  • the function of the active table is to map the cache address of this level to a higher level cache address.
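The mapping performed by an active table can be sketched as a small table of (higher-level block number, valid) pairs, indexed by this level's cache block address and sub-address. This is an illustrative model only; all class names, sizes, and values below are assumptions, not taken from the embodiment.

```python
# Sketch of an active-table (AL) lookup, assuming each entry holds a
# higher-level cache block number (field 76) and a valid bit (field 77).
# All names and sizes here are illustrative.

class ActiveTable:
    def __init__(self, num_blocks, sub_blocks_per_block):
        # One row per cache block at this level; one (block_no, valid)
        # pair per sub-block of the next higher (faster) level.
        self.rows = [[(None, False)] * sub_blocks_per_block
                     for _ in range(num_blocks)]

    def fill(self, block_no, sub_addr, higher_block_no):
        self.rows[block_no][sub_addr] = (higher_block_no, True)

    def lookup(self, block_no, sub_addr):
        # Returns the higher-level block number, or None if the
        # sub-block has not been filled into the higher level yet.
        higher, valid = self.rows[block_no][sub_addr]
        return higher if valid else None

# Example: a three-level active table mapping a BN3 address to a BN2X block.
al3 = ActiveTable(num_blocks=16, sub_blocks_per_block=4)
al3.fill(block_no=5, sub_addr=2, higher_block_no=9)   # BN2X = 9
assert al3.lookup(5, 2) == 9      # hit: sub-block already at the higher level
assert al3.lookup(5, 3) is None   # miss: must fetch from this level
```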
• the correlation table has a count value 70, whose meaning is the number of entries in the track tables of this or higher storage levels that take the cache block as a branch target; a lower-level cache block number 71 corresponding to the cache block; and the track table entry addresses 72, with their corresponding valid bits 73, of the entries in this storage level that take the cache block as a branch target.
• the pointer 74 shared by all ways points to the cache block that has not been replaced for the longest time, as described above; if the count value 70 corresponding to that cache block is smaller than the preset replacement threshold, the cache block can be replaced.
• when replacing, the track table entries addressed by the 72 addresses whose 73 bits are 'valid' have their this-level cache block number replaced with the lower-level cache block number 71.
• the exception is the four-level correlation table 121, which has only the count value 70 and no 71, 72, 73; since there is no track table at that level, no address replacement in track table entries is needed.
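The replacement test described in the preceding bullets can be sketched as follows. The threshold value and all names are illustrative assumptions; the embodiment only states that a block whose branch-target count 70 is below a preset threshold may be replaced, scanning from the shared pointer 74.

```python
# Sketch of correlation-table-guided replacement: a cache block may be
# replaced only when its count (field 70, the number of track-table
# entries targeting it) is below a preset threshold. Names are illustrative.

REPLACE_THRESHOLD = 1  # assumed preset threshold

def find_replaceable(counts, pointer):
    """Scan from the shared pointer (74) for the first block whose
    branch-target count allows replacement; return its index."""
    n = len(counts)
    for i in range(n):
        idx = (pointer + i) % n
        if counts[idx] < REPLACE_THRESHOLD:
            return idx
    return None  # no block currently replaceable

counts = [0, 3, 1, 0]          # count value 70 per cache block (toy data)
assert find_replaceable(counts, pointer=1) == 3  # skips blocks 1 and 2
assert find_replaceable(counts, pointer=0) == 0
```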
  • the scanner 43 extracts the information of the branch address in the instruction block, generates a track entry type, and also calculates a branch target address.
• the branch target address is selected by the selector 54 and sent to 51 to match against the tag unit. If there is no match, the memory 111 is addressed by the branch target address via the bus 113, and the corresponding instruction block is read into the memory 122 at the four-level cache block selected by the four-level cache replacement logic (the four-level active table 120 and the four-level correlation table 121, etc.). If matched, the matched BN4X address 123 addresses the four-level active table 120.
• the BN3X address in the entry is combined with the BNY of the branch target address into a BN3 address and stored in the three-level track table via the bus 125.
• FIG. 15 shows the address formats of the processor system in the embodiment of FIG. 13.
• the memory address is divided into a tag 61, an index 62, a tertiary sub-address 126, a secondary sub-address 63, a primary sub-address 64, and an intra-block offset (BNY) 13.
• the address BN4 of the quaternary buffer is composed of a way number 65, an index 62, a three-level sub-address 126, a second-level sub-address 63, a first-level sub-address 64, and an intra-block offset (BNY) 13; the portion other than 13 is collectively referred to as BN4X.
• the address BN3 of the third-level buffer is composed of a third-level cache block number 128, a second-level sub-address 63, a first-level sub-address 64, and an intra-block offset (BNY) 13; the portion other than 13 is collectively referred to as BN3X.
• the address BN2 of the secondary buffer is composed of a secondary cache block number 67, a primary sub-address 64, and an intra-block offset (BNY) 13; the portion other than 13 is collectively referred to as BN2X.
  • the address BN1 of the primary buffer is composed of a primary cache block number 68 (BN1X) and an intra-block offset (BNY) 13.
  • the intra-block offset (BNY) 13 in the above four address formats is the same, and the BNY portion does not change when the address conversion is performed.
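The invariant just stated, that BNY occupies the same low bits in every format and never changes under address conversion, can be modeled with assumed field widths. Only the widths below are assumptions; the field ordering follows the formats described above.

```python
# Sketch of the FIG. 15 cache-address formats with illustrative field
# widths. The intra-block offset BNY (13) occupies the same low bits in
# every format, so format conversion never alters it.

BNY_BITS = 4      # intra-block offset 13 (width assumed)
SUB1_BITS = 2     # primary sub-address 64 (width assumed)
SUB2_BITS = 2     # secondary sub-address 63 (width assumed)

def split_bn4(addr):
    bny  = addr & ((1 << BNY_BITS) - 1)
    bn4x = addr >> BNY_BITS          # way 65 + index 62 + 126 + 63 + 64
    return bn4x, bny

def make_bn3(bn3_block_no, sub2, sub1, bny):
    # BN3 = block number 128 | sub-address 63 | sub-address 64 | BNY 13
    return (((bn3_block_no << SUB2_BITS | sub2) << SUB1_BITS | sub1)
            << BNY_BITS | bny)

bn4x, bny = split_bn4(0b1011_0110)
assert bny == 0b0110
bn3 = make_bn3(bn3_block_no=3, sub2=1, sub1=2, bny=bny)
assert bn3 & 0b1111 == 0b0110     # BNY preserved across conversion
```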
• when the corresponding track is read out of the tertiary track table 118 via the bus 119, a BN4-format address in a track entry addresses the four-level active table 120. The above BN4 address on the bus 119 addresses the memory 122, and the corresponding instruction block read out fills the memory 112 at the three-level cache block pointed to by the BN3X address given by the three-level cache replacement logic (the three-level active table 50 and the three-level correlation table 117, etc.). The BN3X address is stored in the entry of the four-level active table 120 pointed to by the BN4 address, and in the corresponding entry in the three-level track table 118.
• the BN3X address is bypassed onto the bus 119 and also stored in the corresponding entry in the secondary track table 88. If the output on the bus 119 is already a BN3X address, that BN3X address addresses the three-level active table 50. If the 50 entry is valid, the BN2X address is stored in the corresponding entry in the secondary track table 88; if the 50 entry is invalid, the memory 112 is addressed by the BN3X address on 119.
• the BN2X is also stored in the entry of the three-level active table 50 addressed by the above BN3X, and in the secondary track table 88.
• when the corresponding track is read out of the secondary track table 88 via the bus 89, a BN3-format address in a track entry addresses the tertiary active table 50. If the 50 entry is valid, the BN2X address is filled into the track entry in 88, bypassed onto the bus 89, and also stored in the corresponding entry in the primary track table 20. If the 50 entry is invalid, the above BN3 address on the bus 89 addresses the memory 112, and the corresponding instruction block read out fills the memory 42 at the secondary cache block pointed to by the BN2X address given by the L2 cache replacement logic (the secondary active table 40 and the secondary correlation table 103, etc.).
• the BN2X address is stored in the entry of the three-level active table 50 pointed to by the BN3 address, and in the corresponding entry in the secondary track table 88.
• the BN2X address is bypassed onto the bus 89 and also stored in the corresponding entry in the primary track table 20. If the output on the bus 89 is already a BN2X address, that BN2X address addresses the secondary active table 40. If the 40 entry is valid, the BN1X address is stored in the corresponding entry in the primary track table 20; if the 40 entry is invalid, the memory 42 is addressed by the BN2X address on 89.
• the instruction block read out is filled into the level 1 cache memory 22, to be pushed to the processor core 23 or IRB 39.
• when the corresponding track is read out of the primary track table 20 via the bus 29, a BN2-format address in a track entry addresses the secondary active table 40. If the 40 entry is valid, the BN1X address is filled into the track entry in 20 and bypassed onto the bus 29; if the 40 entry is invalid, the BN2 address on the bus 29 addresses the memory 42, and the corresponding instruction block read out fills the memory 22 at the first-level cache block pointed to by the BN1X address given by the first-level cache replacement logic (the level 1 correlation table 37, etc.).
• the BN1X address is stored in the entry in the secondary active table 40 pointed to by the BN2 address, and in the corresponding entry in the primary track table 20. If the output on the bus 89 is already a BN1 address, that BN1 address is stored in the register in the tracker 47, becomes the read pointer 28, and addresses the track table 20 and the level 1 cache memory 22 to push instructions to the processor core 23 or IRB 39. This ensures that for the instructions in the level 1 cache memory 22, their branch targets and the sequentially next level 1 cache blocks are at least already in the level 2 cache memory 42 or are being stored into 42. The rest of the operations are as described in the previous embodiments and will not be described again.
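The fill-down sequence in the preceding bullets repeats the same promote-or-fetch step at every level: a track entry holding a lower-level address is either promoted to the higher-level address via the active table, or the block is fetched and both levels' tables are updated. A minimal Python sketch with illustrative names and toy data structures (not the embodiment's actual tables):

```python
# Sketch of one promote-or-fetch step, e.g. turning a BN2 track entry
# into a BN1 block number. All names, formats, and data are illustrative.

def resolve(entry, active_table, lower_cache, this_cache, alloc_block):
    """entry = ('BN2', block, sub) or ('BN1', block). Returns a BN1 block."""
    if entry[0] == 'BN1':
        return entry[1]
    _, bn2_block, sub = entry
    bn1 = active_table.get((bn2_block, sub))
    if bn1 is None:                       # active-table entry 'invalid'
        bn1 = alloc_block()               # replacement logic picks a block
        this_cache[bn1] = lower_cache[bn2_block][sub]   # fill this level
        active_table[(bn2_block, sub)] = bn1            # record the mapping
    return bn1

lower = {7: {0: 'blk-A', 1: 'blk-B'}}     # toy lower-level cache contents
l1, al = {}, {}
b = resolve(('BN2', 7, 1), al, lower, l1, alloc_block=lambda: 0)
assert b == 0 and l1[0] == 'blk-B' and al[(7, 1)] == 0
assert resolve(('BN2', 7, 1), al, lower, l1, lambda: 99) == 0  # now a hit
```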
• although FIG. 13 shows an instruction-push memory/processor system that executes two branches simultaneously, its memory hierarchy can also be applied to processor cores of other architectures, such as an out-of-order multi-issue processor system whose level 1 cache or instruction read buffer is addressed by addresses generated by a processor core.
  • the method and system of the embodiment of Figure 13 can be applied to data memory hierarchies and data pushes such that the memory hierarchy also pushes data to the processor core.
• the following embodiment assumes that the data memory has the same storage hierarchy as the instruction memory, that is, there are a memory, a four-level cache, a three-level cache, a two-level cache, a one-level cache and a data read buffer, corresponding to the instruction memory levels.
• each BN address has a corresponding DBN (Data Block Number) address, distinguished from the BN address to accommodate separate instruction caches and data caches; otherwise the hierarchical address naming follows the BN convention.
  • Each storage hierarchy also requires data track table DTT, data active table DAL, data related table DCT and pointers to support the operation of data memory storage.
  • FIG. 16 is a format of the data track table, the data active table, and the data related table.
• the branch target address does not need to be stored in the data track table DTT, so only the block address DBNX 132 of the sequentially next data block and its valid bit 133 are stored; optionally, the block address 130 of the preceding data block in storage order and its valid bit 131 can be added, for use in reverse-order access to data.
  • the data track table can be completely eliminated.
• the format of the data active table DAL is the same as the active table AL format (76, 77) shown in Figure 14, where the 134 field stores the block address DBNX and the 135 field stores the corresponding valid bit.
• a row of this level's DAL is addressed by a data block address (e.g., block 2 address 67 in FIG. 15), and a group of 134, 135 in the row is addressed by a sub-address (such as sub-address 2 in FIG. 15). If the valid bit 135 is 'valid', the higher-level block address in 134 is read from the DAL to access the higher-level data store. That is, the data active table DAL maps this storage level's address to the address of the next higher storage level.
  • the data active table DAL can map the storage hierarchical address to the corresponding high-level storage hierarchical address
  • the data correlation table DCT can map the storage hierarchical address to the lower one storage hierarchical address (the DBLNX represents a low-level address in FIG. 16). ).
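The complementary mappings just described, DAL up toward the processor and DCT down toward memory, can be sketched as a pair of lookup tables. All names and values below are illustrative assumptions.

```python
# Sketch of one data storage level's mapping tables: the data active
# table (DAL, fields 134/135) maps up to the next higher level; the data
# correlation table (DCT, DBLNX field) maps down to the next lower level.

class DataLevelMaps:
    def __init__(self):
        self.dal = {}   # (block, sub_addr) -> higher-level DBNX
        self.dct = {}   # block -> lower-level DBNX (DBLNX)

    def link(self, block, sub_addr, higher_block, lower_block):
        self.dal[(block, sub_addr)] = higher_block
        self.dct[block] = lower_block

    def up(self, block, sub_addr):
        # A missing key models a cleared valid bit 135.
        return self.dal.get((block, sub_addr))

    def down(self, block):
        return self.dct.get(block)

maps = DataLevelMaps()
maps.link(block=4, sub_addr=1, higher_block=2, lower_block=12)
assert maps.up(4, 1) == 2     # DAL: level-N address -> level-(N-1) block
assert maps.down(4) == 12     # DCT: level-N address -> level-(N+1) block
assert maps.up(4, 0) is None  # unfilled sub-block
```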
  • the pointer 137 is used for buffer replacement.
• the data cache can be replaced in the same way as the instruction cache disclosed in the present invention, but there is no count value in the data cache correlation table, because no branch instruction jumps into the data cache as a target; therefore there is no need to consider replacing addresses in track tables that target the data cache block, nor to record branch source addresses.
• the level 1 data cache only needs to record the last replaced cache block with the pointer 137, which is moved by one-way traversal, or replacement can use LRU, LFU, and the like.
• the second, third, and fourth level data caches are replaced in the same manner as the instruction cache, as long as the cache block has no corresponding cache blocks at a higher level.
• each of the entries in the active table can be read by one-way traversal of this level's pointer 137; if all the address fields in an entry are 'invalid', the corresponding cache block can be replaced.
  • the L1 cache replacement method of the instruction cache disclosed by the present invention may also be implemented by LRU, LFU or the like.
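The replacement scan described above can be sketched as a one-way traversal of the per-level pointer 137 over the active-table rows, stopping at the first block whose higher-level address fields are all 'invalid'. Names and data below are illustrative.

```python
# Sketch of the data-cache replacement scan: a cache block is replaceable
# when every higher-level address field in its active-table row is
# 'invalid', i.e. no higher level still holds any of its sub-blocks.

def next_victim(dal_rows, pointer):
    """dal_rows[i] is the list of valid bits (field 135) for block i."""
    n = len(dal_rows)
    for step in range(n):
        idx = (pointer + step) % n
        if not any(dal_rows[idx]):      # all sub-block entries invalid
            return idx
    return None                          # every block still has copies above

rows = [[False, True], [False, False], [True, True]]
assert next_victim(rows, pointer=0) == 1   # block 0 has a live sub-block
assert next_victim(rows, pointer=2) == 1   # wraps past blocks 2 and 0
```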
  • the data push memory hierarchy also uses the step size table 150 to record the difference between two adjacent data access addresses of the same data access instruction.
• FIG. 17 shows the step size table format and working principle.
• 150 is a memory in which each row corresponds to a data access instruction (such as LD or ST), addressed by the instruction address of the data access instruction.
• each row contains a data address 138. The format of 138 is DBN1, the primary data cache address, consisting of DBN1X and DBNY similar to 68 and 13 in Figure 15. The 139 field is the status bit of 138.
• each row also contains a set of step sizes, one of which is selected according to the loop (branch) level of the instruction segment containing the data access instruction.
  • the straight line represents sequential instructions that are executed sequentially in the direction of the arrow, the arc represents the reverse branch, the intersection represents the branch instruction, and the triangle represents the data access instruction.
• 146 is a data access instruction, and the upper row of the step size table 150 in FIG. 17 corresponds to 146. The inner-loop step size of the data access instruction 146 is stored in the step field 140 of the 150 row corresponding to 146. When the branch of the branch instruction 140 is judged as 'no branch' and the branch of the branch instruction 142 is judged as 'execution branch', the middle-loop step size of 146 is stored in the step field 142 of that row. When the branches of the branch instructions 140 and 142 are both judged as 'no branch' and the branch of the branch instruction 143 is judged as 'execution branch', the outer-loop step size of 146 is stored in the step field 143 of that row.
• the branch judgments have priority: the reverse branch instruction immediately following the data access instruction has the highest priority, and the priorities of the other reverse branch instructions decrease in order; a higher-priority branch instruction judged as 'execution branch' masks the lower-priority branch instructions so that they do not affect the readout of the step size table 150. Forward branch instructions are not recorded in the step size table. An adder can add the data address DBN1 in a row of 150 to the step size selected by the branch judgments (such as 140) to obtain the next data address, which accesses the data memory hierarchy to acquire data in advance and push it to the processor core.
  • FIG. 18 is another embodiment of the processor/memory system of the present invention.
  • the left half of Fig. 18 is an instruction push processor system similar to the embodiment of Fig. 13, and the right half is a data push memory hierarchy.
• the processor core 23 has the same function as the module of the same number in the embodiment of Fig. 13.
• the functions of the memory 111, the four-level active table 120, the four-level correlation table 121 and the four-level buffer memory 122 are similar to those in the embodiment of FIG. 13, except that they store not only instructions but also data and data-related auxiliary information such as data cache block numbers.
  • the entry of the four-level active table 120 may store a three-level instruction cache address BN3 or a three-level data cache address DBN3.
• the selector 54 is now a three-input selector. In addition to performing the instruction scan function of the FIG. 13 embodiment, the scanner 43 also calculates the sequentially next data block address (or the reverse-order data block address) for the data blocks passing over the bus 115.
• the right half has a three-level data buffer memory 160, a secondary data buffer memory 161, a primary data buffer memory 162, a data read buffer 163, a step size table 150, a three-level data track table 164, a secondary data track table 165, and a primary data track table 166;
• the memory 111 in FIG. 18 is addressed by a memory address, the memory 122 has a set-associative cache organization, and the buffers at the other levels have fully associative cache organizations.
• the memory 111 of the embodiment of Fig. 18 can be used as the main memory of the processor/memory system, in which case 122 is the processor's last level cache (Last Level Cache), a unified cache. In another system organization, 111 is used as the hard disk of the system, 122 is the main memory organized as a cache, 112 is the processor's last-level instruction cache, and 160 is the processor's last-level data cache.
• the entries of the data read buffer (Data Read Buffer, DRB) 163 correspond one-to-one with the entries of the instruction read buffer IRB 39.
  • the task of the data storage hierarchy is to pre-fill the data to be used by the processor core into the entries in the DRB corresponding to the data access instructions in the IRB.
• the data is pushed to the processor core 23 together with the instructions (data and instructions are not necessarily pushed at the same time, because a data load instruction executed by the processor core and its corresponding data are not needed in the same pipeline stage of the processor core).
• the memory 111 is addressed by the data address via the bus 113, as described in the foregoing FIG. 13 embodiment, to read a four-level data block, which is stored in the memory 122 at the four-level cache block specified by the way number (65 in FIG. 15) given by the four-level cache replacement logic. The data address is stored in the entry of the tag unit in 51 that is likewise pointed to by 65 and 62.
• the system further reads the three-level data block out of the memory 122, addressed by the above 65, 62 together with the three-level sub-address 126 of the data address, and stores it via the bus 115 into the three-level data buffer memory 160 at the three-level cache block specified by the three-level data block number 128 given by the three-level data cache replacement logic. It stores the three-level block number 128 into the entry field pointed to by 65, 62 and 126 in the four-level active table 120 and sets the field to 'valid'; and stores the 65 and 62 (four-level block number) into the entry in the three-level correlation table 174 pointed to by the above 128.
• the scanner 43 calculates the address of the sequentially next three-level data block (that is, the data address plus the size of one three-level data block) and sends it to the tag unit in 51 for matching to obtain a BN4 address; accessing the four-level active table 120 with that BN4 address maps it to a DBN3X address, which is concatenated with the DBNY 13 of the data address to obtain the DBN3 address.
• the resulting DBN3 or BN4 address is stored in the 132 field of the entry pointed to by the above 128 in the three-level data track table 164.
• if the sequentially next three-level data block is still in the same cache block, '1' is added to the above 126 and combined with the original 65, 62 to obtain the DBN3 address of the next three-level data block, without mapping through the tag unit in 51.
• the sequentially next three-level data block may also be filled into the three-level buffer memory 160 at this time, with the corresponding entries in 120 and 174 filled as described above; but generally it is not required that the sequentially next three-level data block also be filled into 160.
• the system further reads the secondary data block out of the tertiary data buffer memory 160, addressed by the above 128 together with the secondary sub-address 63 of the data address, and stores it in the secondary data buffer memory 161 at the secondary cache block specified by the secondary data block number 67 given by the secondary data cache replacement logic. It stores the secondary block number 67 into the entry field pointed to by 128 and 63 in the tertiary data active table 167 and sets the field to 'valid'; the 128 (three-level block number) is stored in the corresponding correlation table entry.
• '1' is added to the above 63, and the tertiary active table 167 is addressed by 128 together with the incremented 63. If the entry is 'valid', the sequentially next secondary data block is already in the secondary cache; if the entry is 'invalid', the next secondary data block is read from the tertiary data buffer memory 160, addressed by 128 and the incremented 63, and stored in the secondary data buffer.
• the entry pointed to by the above 128 in the three-level track table 164 is read out via the bus 190. If the content of the entry is in the BN4 format, the BN4 address accesses the four-level active table 120 via the bus 197. If the 120 entry is 'valid', the DBN3 address stored in that entry replaces the original BN4 in the 164 entry; if the 120 entry is 'invalid', the memory 122 is accessed, addressed by the BN4 address on the bus 197.
• the three-level data block read out is stored in the memory 160, and the corresponding entries in 164, 167, 174 and 120 are filled in the manner described above. This ensures that when the contents of a three-level data block are being stored into the secondary data buffer, the sequentially next three-level data block is stored into the three-level data buffer.
• if the entry pointed to by the above 128 in the three-level track table 164 is in the DBN3 format, the DBN3 addresses the three-level active table 167 via the bus 190 as described above, so that while the secondary buffer memory 161 is being filled, the sequentially next secondary data block is also filled into 161.
• the data blocks in reverse order can also be stored in the data cache as needed, using the 130 field in the track table. It is also possible to eliminate the data track tables 164, 165, 166 entirely; in that case the system does not automatically fill in order or in reverse order across three-level or secondary data cache block boundaries. Pre-filling of the other data storage levels is done in the same way.
• the system further reads the primary data block out of the secondary data buffer memory 161, addressed by the above 67 together with the primary sub-address 64 of the data address, and stores it in the primary data buffer memory 162 at the first-level cache block specified by the primary data block number 68 given by the primary data cache replacement logic. It stores the first-level block number 68 into the entry field of the secondary data active table 168 pointed to by 67 and 64 and sets the field to 'valid'; and stores the 67 (secondary block number) into the entry of the primary correlation table 176 pointed to by the above 68.
  • the entry pointed to by the above 67 in the secondary track table 165 is read.
• if it contains a BN3 address, the three-level active table 167 is addressed by that BN3 address via the bus 185. If the 167 entry is 'valid', the BN2X address in the 167 entry is written back to 165 via the bus 189 to replace the BN3X address. If the 167 entry is 'invalid', the three-level data buffer memory 160 is addressed by the address on 185, and the secondary data block read out is stored in the secondary data buffer memory 161 at the secondary cache block pointed to by another 67 given by the cache replacement logic.
• this other 67 is also stored in the entry addressed by 185 in the three-level data active table 167, and in the secondary data track table 165 in place of the BN3X address.
• corresponding entries are also established for this secondary cache block in the secondary data active table 168 and the secondary data correlation table 175, addressed by the 67, with the 175 entry storing the BN3X address. This ensures that when the contents of a secondary data block are being stored into the primary data buffer, the sequentially next secondary data block is stored into the secondary data cache.
• the system further concatenates the above 68 with the DBNY 13 of the data address to form the primary data cache address DBN1, stores it via the bus 193 into the 138 field of the row corresponding to the data load instruction in the step size table 150, and sets the 139 status field of that row to '1'.
  • the system accesses the primary data buffer memory 162 by the above DBN1, and the read data is stored in the DRB.
  • the data can be pushed to the processor core 23 for processing with the instruction.
  • the system begins prefetching the next data to the DRB for pushing the next time the same data load instruction is executed.
• while the status field 139 is '1', the process of prefetching data for pushing is exactly as above, except that when the new 68 and 13 (DBN1) are generated, the previous DBN1 in the 138 field of the step size table 150 is first subtracted from this DBN1, and the difference is stored as a step size into the entry selected by the branch judgments, such as 140. The new DBN1 is then written into the 138 field to replace the old address, and the status field 139 is set to '2'.
• after the system pushes the second data to the processor core 23, when a branch instruction following the data load instruction determines its branch to be an 'execution branch', the system starts prefetching the next data into the DRB, to be pushed the next time the same data load instruction executes. With the status field 139 at '2', the system no longer waits for the processor core 23 to calculate the data address. Instead, the step size table 150 directly outputs the DBN1 address in the 138 field of the row corresponding to the data load instruction and the step size selected by the branch judgments (such as 140), and adds them in the adder 173. The system makes a boundary determination on the output 181 of 173.
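The 139 status protocol described above can be sketched as a small state machine for one row of the step size table: state '1' after the first address is recorded, state '2' once a step has been learned, after which addresses come from the step table rather than the processor core. All names and addresses below are illustrative.

```python
# Sketch of the per-row status-field (139) protocol of the step size table.

class StepRow:
    def __init__(self):
        self.dbn1 = None   # field 138, last data address
        self.state = 0     # field 139
        self.step = None   # selected step field (e.g. 140)

    def observe(self, dbn1):
        """Called with the core-computed data address (states 0 and 1)."""
        if self.state == 1:
            self.step = dbn1 - self.dbn1   # difference of two accesses
            self.state = 2
        else:
            self.state = 1
        self.dbn1 = dbn1

    def prefetch(self):
        """In state '2', produce the next address without the core."""
        assert self.state == 2
        self.dbn1 += self.step
        return self.dbn1

row = StepRow()
row.observe(0x100)            # first execution: record address, state 1
row.observe(0x108)            # second execution: step = 8, state 2
assert row.state == 2 and row.step == 8
assert row.prefetch() == 0x110   # pushed ahead of the third execution
```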
• if 181 does not exceed the primary data cache block boundary, the selector 192 selects 181 to access the primary data buffer memory 162, the read data is stored in the corresponding entry in the DRB for pushing, and the address on 181 is stored as DBN1 in the 138 field of the corresponding row in the step size table. If 181 is beyond the boundary of the primary data cache block but does not exceed the adjacent primary cache block boundary, the primary data track table 166 is addressed by 181, and the DBN1X address 132 of the next primary data block is read out.
• (or, for reverse-order access, the DBN1X address 130 of the preceding data block) is output via the bus 191, selected by the selector 192, and combined with the DBNY address 13 on 181 to access the memory 162; the read data is stored in the corresponding entry in the DRB for pushing.
• the above concatenated address DBN1 is stored in the 138 field of the corresponding row in the step size table 150. In both cases, the status field 139 in 150 remains unchanged at '2'. If the address 132 output by 166 is in BN2X format, the system addresses the secondary data active table 168 via 191; if the 168 entry is 'valid', the BN1X address in the 168 entry is written back onto the bus 184 to replace the BN2X address.
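The boundary determination on the adder output (181) can be sketched as follows: if the sum stays inside the current primary cache block its DBN1X part is reused; if it crosses into an adjacent block, the track-table address of the next (132) or previous (130) block is substituted while the new DBNY is kept. Field widths below are illustrative assumptions.

```python
# Sketch of the boundary determination on the step-table adder output.

BNY_BITS = 4  # intra-block offset 13 (width assumed)

def bound_check(dbn1, step, next_block, prev_block):
    new = dbn1 + step
    old_x, new_x = dbn1 >> BNY_BITS, new >> BNY_BITS
    bny = new & ((1 << BNY_BITS) - 1)
    if new_x == old_x:
        return new                              # still in the same block
    if new_x == old_x + 1:
        return (next_block << BNY_BITS) | bny   # field 132: next block
    if new_x == old_x - 1:
        return (prev_block << BNY_BITS) | bny   # field 130: previous block
    return None  # beyond adjacent blocks: needs the lower-level mapping

assert bound_check(0x25, 0x2, next_block=7, prev_block=1) == 0x27
assert bound_check(0x2E, 0x4, next_block=7, prev_block=1) == 0x72
assert bound_check(0x21, -0x4, next_block=7, prev_block=1) == 0x1D
```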
• if the 168 entry is 'invalid', the secondary data buffer memory 161 is addressed by the address on 191, and the primary data block read out is stored in the primary data buffer memory 162 at the primary cache block address given by the cache replacement logic. The 68 is also stored in the entry addressed by the 191 in the secondary data active table 168, and in the primary data track table 166 in place of the BN2X address.
• the system addresses the primary correlation table 176 with the DBN1 address in 138, mapping the DBN1 address to the DBN2 address for output via the bus 182.
• the adder 172 adds the step size 140 and the DBN2 address on 182, and its output 183 addresses the secondary data active table 168. If the entry is 'valid', the DBN1X address in the entry is concatenated with the DBNY 13 on 183 to access the primary data buffer memory 162 via the bus 184, and the read data is stored into the DRB entry to be pushed; the DBN1 address on 184 is also stored into the 138 field of the corresponding row in the step size table 150, keeping the 139 field unchanged at '2'.
• if the entry is 'invalid', the secondary data buffer memory 161 is addressed by 183, and the primary data block read out is stored in the primary data buffer memory 162 at the first-level cache block specified by the primary data block number 68 given by the primary data cache replacement logic. The system concatenates the 68 with the DBNY on 183 as the DBN1 address to access 162, stores the read data into the DRB entry to be pushed, and stores the DBN1 address into the 138 field of the corresponding row in the step size table, keeping the 139 field unchanged at '2'.
• the system addresses the secondary correlation table 175 with the DBN2 address on the bus 182 and maps the DBN2 address to a DBN3 address output via the bus 186.
• the adder 171 adds the step size 140 and the DBN3 address on 186, and its output 188 addresses the three-level data active table 167. If the entry in 167 is 'valid', the DBN2X address in the entry is concatenated with the DBNY 13 on 188, and the secondary data active table 168 is addressed via the bus 189. If the entry in 168 is 'valid', the DBNY on the bus 188 is directly concatenated with the DBN1X address in the entry to access 162;
• the read data is stored in the DRB entry to be pushed, the DBN1 address is stored in the corresponding row's 138 field in the step table, and the 139 field remains '2'. If the entry in 168 is 'invalid', the secondary data buffer memory 161 is addressed by the DBN2 address on the bus 189, and the read data block is stored by the primary data cache replacement logic in the primary cache block of the primary data buffer memory 162 pointed to by the primary data cache block number 68; the 68 is also stored in the entry in 168 addressed by the bus 189, and that entry is set to 'valid'.
• the system concatenates the 68 with the DBNY on 189 as the DBN1 address to access 162, and the read data is stored in the DRB entry to be pushed; the DBN1 address is also stored in the corresponding row's 138 field in the step table, and the 139 field remains '2'.
• the system addresses the level 3 correlation table 174 with the DBN3 address on the bus 186 and maps the DBN3 address to a BN4 address output via the bus 196.
• the adder 170 adds the step size 140 and the BN4 address on 196, and its output 197 addresses the four-level active table 120. If the entry in 120 is 'valid', the DBN3X address in the entry is concatenated with the DBNY 13 on 197, and the three-level data active table 167 is addressed via the bus 125. If the entry in 167 is 'valid', the DBNY on the bus 125 is directly concatenated with the DBN2X address in the entry,
• and the secondary data active table 168 is accessed via the bus 189 as a DBN2 address. If the entry in 168 is 'invalid', the secondary data buffer memory 161 is addressed by the DBN2 address on the bus 189, and the read data block is stored by the primary data cache replacement logic in the primary cache block of the primary data buffer memory 162 pointed to by the primary data cache block number 68; the 68 is also stored in the entry in 168 addressed by the bus 189, and that entry is set to 'valid'. Accessing the secondary data active table 168 with the DBN2 address on the bus 189 and the subsequent operations are the same as described in the previous paragraph.
• the system accesses 162 with the DBN1 address, and the read data is stored in the DRB 163 entry to be pushed; the DBN1 address is also stored in the corresponding row's 138 field in the step table, and the 139 field remains '2'.
• the system addresses the tag unit in 51 with the BN4 address on the bus 196, reads the corresponding tag 61, and sends it to the adder 169 via the bus 113. 169 adds the tag 61 and the step size 140; the sum 198 is selected by the selector 54 and sent to the tag unit in 51 for matching. If the match yields a new BN4 address, the four-level active table 120 is addressed via the bus 123 with the new BN4 address; if the entry is 'valid', the DBN3X address in the entry addresses the three-level active table 167 via the bus 125.
• finally, the system accesses 162 with the DBN1 address obtained through the level-by-level active table mappings, and the read data is stored in the DRB entry to be pushed; the DBN1 address is also stored in the 138 field of the corresponding row in the step table, and the 139 field remains '2'. If, in this process, the corresponding data block does not exist at some storage level, the system automatically reads the data block from the next lower storage level and stores it in the cache block specified by that level's cache replacement logic; the cache block address is also stored in the lower level's active table, and the lower-level cache block number is stored in that level's related table, establishing a two-way mapping.
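The level-by-level escalation just described can be condensed into a small model. This is a sketch under invented parameters: a purely binary fan-out (two child blocks per parent, as drawn in FIG. 19) and an eight-word level-one block; the real DBN widths and table contents are hardware-defined.

```python
L1_BLOCK = 8   # words per level-one cache block (assumed for illustration)
FANOUT = 2     # child blocks per parent block, as drawn in FIG. 19

def levels_to_climb(dbn1_offset, step):
    """Number of correlation-table mappings (DBN1->DBN2->DBN3->BN4) needed
    before the stepped address falls inside a block at that level;
    4 means even the level-four block is exceeded (tag match in 51)."""
    target = dbn1_offset + step
    block = L1_BLOCK
    for level in range(4):
        if 0 <= target < block:
            return level          # 0: sum 181 usable directly as the DBN1 address
        block *= FANOUT
    return 4
```

With these assumptions, a step that stays inside the level-one block needs no mapping at all, matching the fast path where the sum directly addresses 162.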
• Data stores can be handled in a similar way, or stored first in a write buffer (Write Buffer); when the data cache is idle, the data in the write buffer is written back to the data cache.
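A write buffer of the kind mentioned can be sketched as below. The draining-when-idle policy and forwarding of buffered stores to later loads are the essential behaviors; all names are invented and the cache is modeled as a plain dictionary.

```python
class WriteBuffer:
    """Minimal write-buffer sketch: queues stores, drains when the cache is idle."""

    def __init__(self):
        self.pending = []                     # (address, value) pairs in store order

    def store(self, addr, value):
        self.pending.append((addr, value))    # the store completes immediately

    def drain(self, cache, cache_idle):
        # write queued data back to the data cache while its port is idle
        while cache_idle and self.pending:
            addr, value = self.pending.pop(0)
            cache[addr] = value

    def forward(self, addr):
        # a load must also check the write buffer (newest entry wins)
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return None                           # not buffered: read the cache
```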
• the processor core is still required to send the correct data address over the bus 49 to compare with the guessed DBN1 address. If they differ, the speculatively loaded data and its subsequent execution results must be discarded, the data is loaded with the correct data address on the bus 49, the corresponding 139 field is set to '0', and the step size is recalculated and stored in 150.
• the guessed load address is also compared with the addresses in the write buffer to ensure that the loaded data is up to date.
• for the comparison, the DBN address can be mapped to a data address and compared with the data address on 49; alternatively, the upper part of the address on 49 can be mapped to a DBN address and compared with the DBN address the system guessed. Further, if the valid bit of the step size selected by the branch decision from the step size table 150 reads 'invalid', the step size is computed and stored in the corresponding step size field as described above.
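The verification step can be sketched as follows. `map_to_dbn` stands in for the tag-unit/active-table mapping of the correct address on bus 49, and the state dictionary stands in for the 139/138 fields of the step table row; both are assumptions for illustration.

```python
def verify_guess(guessed_dbn, correct_addr, map_to_dbn, step_row):
    """Compare the speculated DBN1 with the one mapped from the real address.

    Returns True when the speculation was correct; otherwise records the
    corrective actions (discard results, reload, reset state field to '0').
    """
    actual_dbn = map_to_dbn(correct_addr)
    if actual_dbn == guessed_dbn:
        return True                 # speculation correct, results are kept
    # wrong guess: the correct data must be reloaded and the step relearned
    step_row['state'] = 0           # corresponds to setting field 139 to '0'
    step_row['dbn1'] = actual_dbn   # corresponds to refilling field 138
    return False
```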
• the data memory hierarchy in the embodiment of FIG. 18 has a set-associative lowest cache level; that level has a tag unit and may also have a TLB for virtual-to-physical address translation. That level can be addressed by matching a memory address against the tag unit in 51, or directly by the buffer address BN4.
  • the rest of the data cache is fully associative and is addressed by the buffer address DBN.
  • the mapping between the DBN and the BN4 is performed by the active table and the related table.
  • the role of the active table is to map the low-level buffer address to the high-level buffer address; the role of the related table is to map the high-level buffer address to the low-level buffer address. Please refer to Figure 19 for its mechanism of action.
  • FIG. 19 is a schematic diagram of the action mechanism of the data cache hierarchy in the embodiment of FIG. 18.
• 200 is a level-four cache block containing two level-three cache blocks 201 and 202.
  • Each L3 cache block further contains two L2 cache blocks, such as 201 containing L2 cache blocks 203 and 204.
  • Each L2 cache block further contains two L1 cache blocks, such as 203 containing L1 cache blocks 205 and 206.
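The containment in FIG. 19 (200 holds 201 and 202; 201 holds 203 and 204; 203 holds 205 and 206) is a perfect binary nesting, so if the level-one blocks under one root are numbered 0-7, the enclosing block at any level is obtained by shifting. A sketch, assuming this purely binary layout:

```python
def ancestor_index(l1_index, level):
    """Index of the block enclosing a level-one block, 'level' levels up
    (1 -> its L2 block, 2 -> its L3 block, 3 -> its L4 root)."""
    return l1_index >> level
```

For example, level-one block 5 sits in L2 block 2, L3 block 1, and L4 root 0 under these numbering assumptions.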
• based on the size of the step 140, the system obtains the next data cache address for the same data load instruction with the fewest mapping steps and the least delay, reads the data from the primary data cache memory 162 in advance, and stores it in the corresponding entry in the DRB.
• if the 138 address points into 205 and its sum with 140 does not exceed the boundary of 205, the sum 181 is used as the new primary data cache address to address the primary data cache 162, and the read data is stored in the DRB.
• when the sum exceeds the boundary of 205, the adder 172 adds the 182 address and the step size 140, and the sum 183 addresses the entry corresponding to the L2 cache block 203 in the L2 active table 168, from which the DBN1X address of the L1 cache block 206 is read; concatenated with the DBNY 13 of 183 it forms the DBN1 that addresses the primary data buffer memory 162, and it is also stored in the 138 field of 150. If the address of the sequentially next cache block 206 is stored in the corresponding entry of the 205 cache block in the primary data track table 166, the address of 206 can also be obtained by directly addressing 166 with 181 (ignoring the overflow bit in 181).
• in that case the DBN1 format of the 138 address needs first to be mapped into the DBN2 format 182 via the primary correlation table 176. If the sum of 181 exceeds the boundary of the level-two cache block 203, the DBN2 format is further mapped into the DBN3 format 186 via the secondary correlation table 175; 186 is added to the step size 140, and the sum addresses the entry corresponding to the level-three cache block 201 in the three-level active table 167, from which the DBN2 address 189 of the level-two cache block 204 is read; the secondary active table 168 is then addressed by 189, yielding the address DBN1 of the level-one cache block 207.
• that address can address the first-level buffer memory 162 via the bus 184 to read the data and store it in the DRB 163, and the address is stored in the 138 field of 150. If the sum of 181 exceeds the boundary of the level-three cache block 201, the DBN1 address in 138 is mapped via 176 to a DBN2 format address, then via 175 to a DBN3 format address, then via 174 to a BN4 format address; that address addresses the four-level active table 120 to obtain the DBN3 format address 125; the three-level active table 167 is addressed by the DBN3 address to obtain the DBN2 address 189; the secondary active table 168 is addressed by the DBN2 address to obtain the address DBN1 of the level-one cache block 207. That address can again address the first-level buffer memory 162 via the bus 184 to read the data and store it in the DRB 163, and the address is stored in the 138 field of 150.
  • the cache blocks of each level in the data cache hierarchy form a tree structure.
• the level-four cache block is the root of the tree, and the cache blocks of the other levels are its branches and leaves at different levels; each cache block is in turn the root of the sub-tree formed by the higher-level cache blocks it contains.
• the branches and leaves are connected into a tree by the two-way address mapping. From one leaf (a level-one cache block), any other leaf under the same root (the same level-four cache block) can be reached through the mapping. Only when the target lies beyond the range of the root is a match in the tag unit in 51 required.
• if the target leaf and the source leaf belong to the same low-level root, fewer mapping levels need to be traversed;
• if the target leaf and the source leaf belong to different roots, more mapping levels need to be traversed.
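Under the same binary-nesting assumption as before (level-one blocks numbered 0-7 under each root), the number of mapping levels between two leaves is set by their lowest common ancestor, which the following sketch computes:

```python
def mapping_levels(src_l1, dst_l1):
    """Levels to climb until the two level-one blocks share an ancestor
    block; 4 means they lie under different level-four roots, requiring
    a tag match in 51."""
    for level in range(4):
        if src_l1 >> level == dst_l1 >> level:
            return level
    return 4
```

Leaves in the same level-one block need no climb at all; leaves under different level-four roots need the full chain plus a tag match.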
  • the embodiment of Figure 18 can be modified to reduce the mapping hierarchy.
  • FIG. 20 is a modified embodiment of the data cache hierarchy in the embodiment of FIG. 18.
• in FIG. 20, the three-level data buffer memory 160, the secondary data buffer memory 161, the primary data buffer memory 162, the data read buffer 163, the step size table 150, the three-level data track table 164, the secondary data track table 165, and the primary data track table 166 are as in FIG. 18;
• the primary data correlation table 176 format is as shown in 209: each entry stores not only the secondary data cache block number DBN2X of the level-one cache block, but also the corresponding three-level data cache block number DBN3X and four-level cache block number DBN4X.
  • the operation is similar to that of the embodiment of Fig. 18.
• the DBN1 address in the 138 field of the row corresponding to the data load instruction and the step size selected by the branch decision (e.g., 140) are added in the adder 173.
• the system makes a boundary determination on the output 181 of 173. If 181 is judged to lie within the level-one cache block, the level-one data buffer memory 162 is addressed directly by 181. If 181 is judged to lie outside the level-one cache block, then
• a row 209 of the primary correlation table 176 is addressed by the upper part of the 138 address, one of the cache addresses in 209 is selected according to the boundary determination, and it is added to the step size 140 by the adder 172, producing the sum 183. If the boundary determination is the level-two cache block, DBN2X in 209 is selected and added to 140, and the sum 183 is sent by the system to address the secondary active table 168; if the determination is the level-three cache block,
• DBN3X in 209 is selected and added to 140, and the sum 183 is sent by the system to address the three-level active table 167; if the determination is the level-four cache block, DBN4X in 209 is selected and added to 140, and 183 is sent by the system
• to address the four-level active table 120. The remaining operations are the same as in the embodiment of FIG. 18 and are not repeated.
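The boundary judgment and level selection just described can be sketched as a single routing step. Block sizes and the field names of entry 209 are invented for illustration, with the binary fan-out of FIG. 19 assumed:

```python
L1_BLOCK = 8   # words per level-one block (assumed; doubles at each level)

def route_prefetch(dbn1_offset, step, entry209):
    """Return (destination table, base address) for the stepped access.

    entry209 models a row of correlation table 176 in FIG. 20, holding the
    block numbers of all enclosing levels (field names assumed)."""
    target = dbn1_offset + step
    if 0 <= target < L1_BLOCK:
        return ('DL1', None)                 # within L1 block: use sum directly
    if 0 <= target < L1_BLOCK * 2:
        return ('AL2', entry209['DBN2X'])    # level-two active table (168)
    if 0 <= target < L1_BLOCK * 4:
        return ('AL3', entry209['DBN3X'])    # three-level active table (167)
    return ('AL4', entry209['DBN4X'])        # four-level active table (120)
```

This is why FIG. 20 skips the leaf-to-root remapping chain: the right base address is available in one lookup.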
• the embodiment of FIG. 20 thus saves the reverse-mapping steps, and their delays, from branch to root.
• an adder may additionally be provided to add the address formed by BN4X in 209 and the DBNY 13 in 138 to 140; the sum is used to address the tag unit in 51, mapping the BN address to a data address for comparison with the correct data address on the bus 49.
  • FIG. 21 is an embodiment of prefetching data organized in logical relationships.
  • Data can contain address pointers, which are organized logically.
• this example illustrates prefetching data organized as a binary tree; prefetching data organized by other logical relationships can be deduced by analogy.
  • 220-222 is the data in the memory, where 220 is the data, 221 is the address pointer of the left branch of the binary tree, and 222 is the address pointer of the right branch of the binary tree.
• the data buffer memory 162, the data read buffer 163, the data track table 166, the selector 192, the instruction memory 22, the IRB 39, and the processor core 23 have the same functions as the same-numbered modules in the earlier figures; some modules are not shown in the figure.
  • the entries in the data track table (DTT) 166 in this embodiment correspond one by one to the respective data entries of the data memory (DL1) 162.
  • the engine 226 is responsible for generating an entry for the data track table (DTT) 166.
• 230-232 are the entries in DTT 166 corresponding to the data 220-222 in 162.
• each entry in 166 has a 'valid bit'; the data type entry 230 corresponds to the data entry 220, and the pointer entries 231 and 232 contain the address pointers of 221 and 222, respectively, in DBN format.
• data type entries and pointer entries each carry an identifier to distinguish the two.
  • the DBN format can directly address the data store 162.
• the data read pointer 181 controls reading a row of track from the data track table 166. If the DBNY value in the pointer is near the end of a row, the next row in address order is also read, according to the BN address in the row's end track point, and both are sent to the shifter 225. 225 shifts the one or two rows of track left by the amount indicated by the DBNY in the data read pointer 181.
• the learning engine 226 receives the shifted entries, identifies the data type entry 230 from the identifiers in the entries, and determines its operation on the pointer entries 231 and 232 according to the data type in 230.
• the comparison result 228 generated by the processor core 23 controls the selector 227 to select among the pointers output by 226 and place the result on the data read pointer 181, which addresses the data memory (DL1) 162 to provide data to the processor core 23.
• suppose the data value in the 220 entry of the data memory 162 is '6', the 221 entry is the 32-bit address 'L', and the 222 entry is the 32-bit address 'R'.
• suppose further that the data type in the 230 entry of the data track table 166 is binary tree, and that the control signal is the comparison result 228 produced when the processor core 23 executes the instruction whose address is 'YYY'; 231 contains the DBN format address pointer 'DBNL' obtained by mapping the 'L' address pointer in 221, and 232 contains the DBN format address pointer 'DBNR' obtained by mapping the 'R' address in 222.
• the learning engine 226 examines the entries from the shifter 225 and selects the data type entry 230 by its identifier. Based on the binary tree data type in 230, 226 routes the 231 and 232 entries from the shifter 225 to the two inputs of the selector 227. Suppose the instruction at address 'YYY' compares the value '8' being searched for with the value '6' loaded from 220 in (DL1) 162, producing a comparison result 228 of '1', meaning the sought value is greater than the value in the current node 220.
• the data type in 230 causes 226 to observe the address 28 that controls the level-one memory 22; after 28 reaches 'YYY', the comparison result 228 generated by the processor core controls the selector 227. Here 228 is '1', so 227 selects the right-branch pointer 'DBNR' in the entry 232 and outputs it to the data read pointer 181. If the valid bit in the entry 232 is 'valid', the data pointed to by the right-branch pointer in 232 becomes the new current data.
• the selector 192 selects 181 to address 162 (DL1), and the new current data output is stored in the DRB 163. 181 also addresses DTT 166, causing 166 to output the corresponding data track containing the new current data to the shifter 225.
• the intra-block offset portion DBNY of the address on 181 controls the shifter 225 to shift the data track left so that the data type, DBNL address, and DBNR address (formats such as 230, 231, 232) are aligned with the inputs of the learning engine 226.
• each entry of the DRB 163 corresponds to an intra-block offset address (Offset, DBNY); 162 (DL1) stores the entire data block into 163 (if the data specified by the data type 230, such as 220-222, exceeds one data block, the read crosses from the data block starting at the 'DBNR' address to the next data block in address order).
• the processor core 23 uses the offset part of the data address (Data Address) 94, generated by executing the load instruction, to address the DRB 163 and reads the current data and its left-branch and right-branch address pointers (formats such as 220, 221, 222). The processor core 23 executes an instruction comparing the sought value '8' with the current data, producing the comparison result 228.
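The traversal loop that emerges from the description above — compare at the current node, let result 228 pick 'DBNL' or 'DBNR' from the track entries, repeat — is an ordinary binary search over pointers. A sketch with an invented three-node tree, where nodes are keyed by their DBN-like indices:

```python
def next_pointer(track_entries, cmp_228):
    """Selector 227's choice: comparison result 1 takes the right branch."""
    data_type, left, right = track_entries      # models entries 230, 231, 232
    assert data_type == 'binary_tree'
    return right if cmp_228 == 1 else left

def search(tree, tracks, root, key):
    """Follow track-table pointers until the key is found or a branch is empty."""
    node = root
    while node is not None:
        value = tree[node]                      # the current data (e.g., 220)
        if value == key:
            return node
        cmp_228 = 1 if key > value else 0       # result of the compare at 'YYY'
        node = next_pointer(tracks[node], cmp_228)
    return None
```

In the hardware this selection happens on the prefetch path, so the next node's data is already in the DRB when the core's load reaches it.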
• the learning engine 226 monitors the address 28, the comparison result 228 generated by the processor core 23, the data address 94, and the corresponding data 223 output by (DL1) 162 to generate data track (Data Track) entries to be stored in DTT 166.
• the data cache system sends the data address 94 generated by the processor core 23 to the tag unit 51 (not shown) for matching and maps it to the DBN address 184; 184 addresses the data memory 162, and the read data is output to the processor core 23 via 223.
• the learning engine 226 records the address on 94 and the data on 223 read from the data memory 162 entry it addresses. 226 also compares each newly generated data address 94 with the previously recorded data on 223.
• when a newly generated data address 94 matches recorded data on 223, the learning engine 226 stores the DBN obtained by mapping that address into the data track table 166 entry corresponding to the data entry that produced the matching 223 data,
• and sets these entries to 'valid'. That is, the address pointer 'L' in 221 is matched and the mapped 'DBNL' is stored in 231, and the address pointer 'R' in 222 is matched and the mapped 'DBNR' is stored in 232.
  • 226 can also record and compare the mapped BN format data with the address.
• 226 judges data memory 162 entries meeting the following condition to be 'data' (non-pointer) entries:
• the entry's own data address is only one or a few data-word lengths away from the above address of an entry containing an address pointer, and over several iterations of the instruction loop the data on 223 is never the same as the address on the following 94.
• the extent of the instruction loop can be determined from the address of the backward-jumping branch instruction in IRB 39 and its branch target instruction address.
  • the data track table 166 entry corresponding to the 'data' entry in the data store 162 is the data type entry.
• the learning engine 226 stores the rule obtained by monitoring (i.e., when the address on 28 is 'YYY', if 228 is '0' the BN address in 231 is selected, and if 228 is '1' the BN address in 232 is selected) into the data track table entry corresponding to the 'data' entry (here, entry 230 corresponding to 220), and sets that entry to 'valid'.
• the valid bit in a data type entry may be several bits wide, acting as a count: greater than a preset value means 'valid'; not greater than the preset value means 'invalid'.
  • the comparison result 228 generated by the processor core 23 to execute the instruction controls the selector 227 to select the address pointer to cause the data read pointer 181 to move along the binary tree.
• the learning engine 226 controls the same group of data and its address pointers (e.g., 220-222) to be read out of the data cache 162 in advance and stored in the DRB 163, to be read by the data address 94 generated by the processor core 23.
• this avoids the delay of matching in the tag unit and then addressing the data memory 162 with the data address 94.
• the access delay of the data read buffer DRB 163 is a single clock cycle, typically less than the access latency of 162.
• the data read buffer can also be organized in the manner of the embodiment of FIG. 18, that is, the entries of 163 correspond one-to-one with the entries of the instruction read buffer IRB 39.
• a field is then added to each entry in the data track table (DTT) 166 to record the address or flag of the instruction that reads the data in the data memory 162 entry corresponding to that DTT entry (for example, the position of the load instruction in the instruction read buffer).
• when the learning engine 226 controls reading the data in 162 according to an entry in 166, the data is stored in the DRB 163 entry corresponding to the flag in that entry.
• the learning engine 226 performs learning once;
• the results of the learning are stored in the data track table 166 in the form of data types and address pointers.
• the data type read from the data track table is used to control 226 itself in processing the other entries read from the track, such as routing an entry at an input of 226 to a particular output of 226, or controlling the polarity of the comparison result 228;
• the selector 227 selects the correct address pointer under the control of 228 and places it on the data read pointer 181,
• which addresses the data memory 162 to output data (e.g., 220).
• the data type also controls 226 to generate and output one or more subsequent addresses (adding an increment, an integer multiple of the data word length, to the correct pointer address) to address 162 and output the other data of the same group (e.g., 221, 222). The data type is therefore a control configuration for 226: the IRB address or flag at which the comparison result 228 is produced, the polarity of 228, and the number of subsequent addresses to generate.
• the learning engine 226 also compares the DBN address on the bus 181 with the DBN obtained by mapping the data address 94 generated by the processor core 23; if they differ,
• the valid value in the data type entry in the corresponding DTT 166 is decremented by '1', and the DBN 184 obtained by the mapping
• is placed on the bus 181 to address the data memory 162 to read the correct data; DTT 166 is also addressed to read the corresponding track entry.
• the learning engine 226 relearns 166 entries whose valid value has dropped to '0'.
  • the embodiment of Figure 21 can be used in conjunction with the embodiment of Figure 18.
• the learning engine 226 continuously monitors the data type in the data track table, the data on the data memory output 223, and the data address 94 output by the processor core 23. If the data on 223 is not the same as the address on the next 94, the valid value in the data type entry of the DTT 166 corresponding to the data memory 162 entry that output the data is decremented by '1'. If the data on 223 is the same as the address on the next 94, the valid value of the data type entry is incremented by '1'.
• for the group of data whose data type entry's valid value is greater than the preset value, the system operates in the manner of the embodiment of FIG. 21, i.e.,
• treats the data as containing address pointers.
• for data whose valid value is not greater than the preset value, i.e., data that does not contain addresses, the system operates in the manner of the embodiment of FIG. 18, reading the data memory 162 by DBN addresses generated according to the 'step size' and storing the data in the DRB 163 for use by the processor core 23.
• if the upper part of the 181 address generated in the manner of the FIG. 21 embodiment is the same as the upper part of the 94 address, the valid value is incremented by '1'; if not, it is decremented by '1'. This acts as a reward for the learning engine 226.
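The reward scheme reduces to a saturating confidence counter per data type entry; the threshold and ceiling below are invented values:

```python
THRESHOLD = 2   # 'preset value' above which the entry counts as a pointer type
MAX_CONF = 3    # saturation ceiling (assumed width of the valid field)

def update_confidence(conf, guess_matched):
    """Increment on a correct address guess, decrement otherwise; saturate."""
    if guess_matched:
        return min(conf + 1, MAX_CONF)
    return max(conf - 1, 0)

def is_pointer_entry(conf):
    """Entry is treated per FIG. 21 (pointer chasing) only above the threshold."""
    return conf > THRESHOLD
```

The same counter shape appears in conventional branch predictors; here it arbitrates between the step-size mode of FIG. 18 and the pointer mode of FIG. 21.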
  • the data type table entry 230 can further include a field in which the set of data is recorded in accordance with the FIG. 18 embodiment, or the FIG. 21 embodiment, or otherwise.
  • Figure 22 is an embodiment of a handler call (Call) and a function return (Return) instruction.
  • the selector 25 and the register 26 have the same functions as the modules of the same number in the embodiment of Fig. 2.
• the stack 233 and the selector 236 are newly added. Whether a decoded instruction is a call or return instruction is recorded, when the scanner scans the instruction and extracts the instruction type, in the instruction type field 11 of the track table entry (see FIG. 1).
• when the instruction type on the track table output 29 in FIG. 22 is a call instruction and the TAKEN signal 31 is 'branch successful', the controller (not shown) controls the BNX in the register 26 and the BNY output by the incrementer 24 to be pushed onto the stack 233.
• when the instruction type is a return instruction and the TAKEN signal 31 is 'branch successful', the controller controls the selector 236 to select the output of the stack 233,
• and the BN at the top of stack 233 is popped into the register 26, returning the program to execute the instruction following the instruction that called the function.
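The call/return mechanism is a conventional return-address stack operating on track addresses. A sketch, with the (BNX, BNY) pair in register 26 modeled as a tuple and the incremented BNY passed in explicitly (names invented):

```python
class TrackStack:
    """Models stack 233: pushes the return BN on a taken call,
    pops it back into register 26 on a taken return."""

    def __init__(self):
        self.stack = []

    def on_branch(self, insn_type, taken, reg26, next_bny):
        """Returns the new register-26 value after this branch resolves."""
        if not taken:
            return reg26                            # stack untouched
        if insn_type == 'call':
            bnx, _ = reg26
            self.stack.append((bnx, next_bny))      # return address = next BN
            return reg26                            # target comes via the track
        if insn_type == 'return':
            return self.stack.pop()                 # selector 236 picks the stack
        return reg26
```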
• the instruction type (field 11) of indirect branch instructions can also be subdivided to guide the buffer system.
• for example, an indirect branch instruction of the repeating class is recorded in the track table entry's field 11, and the generated data address and step size are recorded in the step size table 150;
• or the BNX and BNY instruction addresses are stored in the 12 and 13 fields of the track table entry respectively (see the embodiment of FIG. 1), and the step size table records only the step size. The specific operation is the same as in the embodiment of FIG. 17.
• a processor core using the cache system of the present invention does not need to retain a program counter (Program Counter) for generating instruction addresses.
• hardware breakpoints for program debugging can be mapped to BN format addresses and compared with the tracker's BN, triggering an interrupt in the same way. Accordingly, the processor core does not need the pipeline stages associated with instruction fetch.
  • FIG. 23 is another embodiment of the processor system of the present invention.
• FIG. 23 is a modification of the embodiment of FIG. 8, in which the three-level active table 50, the three-level cache's TLB and tag unit 51, the three-level buffer memory 52, the selector 54, the secondary track table 88, the secondary active table 40, the L2 cache memory 42, the track table 20, the level-one cache related table 37, the level-one buffer memory 22, the instruction read buffer 39, the tracker 47, the tracker 48, and the processor core 23 have the same functions as the same-numbered modules in the embodiment of FIG. 8.
  • a Track Read Buffer (TRB) 238 is added, as well as selectors 237, 239.
• the TRB 238 stores the track corresponding to the instruction block stored in the IRB 39.
• the processor core 23 has two front-end pipelines, FT (Fall Through, sequential) and TG (Target, branch target).
• tracker 0 (TR0) 48 provides the BNY increment 38
• to control the IRB 39 to supply the sequential instruction stream to the FT pipeline of the processor core 23; tracker 1 (TR1) 47 reads ahead along the track in the TRB for the TG addresses on the track.
• a TG address in BN1 format addresses the L1 instruction memory 22,
• and one in BN2 format addresses the instruction memory 42, to read the TG instruction; the TG instruction that may be executed next in program order is selected by the selector 239 under control of the BN1 or BN2 format and sent to the TG pipeline.
  • Taken Signal 31 selects the output of the FT or TG front-end pipeline to be executed by the back-end pipeline.
• the TG instruction block corresponding to the branch instruction, whether from L2 or L1, is selected by the selector 239 and stored in the IRB 39;
• the track from the secondary track table (TT2) 88 or the track table (TT) 20 is likewise selected by the selector 237 and stored in the TRB 238 for
• TR1 47 to read. If the TG instruction block is read from the L2 instruction memory 42 by a BN2X address on the track, it is also stored in the L1 instruction memory 22, in the primary memory block pointed to by a BN1X provided by the replacement logic.
• the BN1X is also stored in the entry of the AL2 active table 40 pointed to by the BN2X.
• a BN3 format address on the track output by the secondary track table 88 is sent via the bus 89 to
• AL3 to be mapped to a BN2 address (or, when the AL3 entry is invalid, L3 52 is addressed and the read instruction block is stored in 42).
• the BN2 address then replaces the original BN3 address on the track.
• a BN2 format address on a track output by TT2 88, TT 20, or TRB 238 can be mapped to BN1 format by AL2 40 (or the block in L2 42 is stored into L1 22 to obtain the BN1 address).
• in this embodiment TT2 88 stores TG addresses in BN3 or BN2 format;
• TT 20 stores addresses only in BN2 or BN1 format;
• TRB 238 allows TG addresses in BN3, BN2, or BN1 format.
• this restriction on the BN formats in TT2 and TT triggers instruction filling from lower memory levels to higher memory levels, which avoids the fills triggered by cache misses in the traditional cache mechanism and hence the compulsory misses; it
• guarantees that the branch target instruction is at the same or the next memory level as the direct branch instruction. Because TR1 47 reads ahead for the TG addresses on the track, it can partially or completely hide the access delay of L2 42 or L1 22. If there are dense branch instructions in an instruction segment, the TG addresses on the corresponding tracks can be deliberately staggered between BN1 and BN2 formats so that the access delays of 42 and 22 are hidden as much as possible. If an address read from the TRB is in BN3 format and the corresponding branch succeeds, the processor core 23 waits for it to be mapped from the BN3 address (the mapping begins when the track is output from TT2 88, so the AL3 or L3 delay can be partially or completely hidden); the resulting BN2 format address is
• filled into the track in 238 before the branch target instruction is executed. If the corresponding branch is unsuccessful, the processor core 23 does not wait; it directly executes the next sequential instruction, and the mapped BN2 format address is filled into the track once obtained.
• the track is filled into the row of TT 20 indicated by the BN1X provided by the above replacement logic.
  • the system can control the secondary instruction memory 42 or the primary instruction memory 22 to provide the TG command to the processor core 23 according to the track output of the secondary track table 88 or the primary track table 20, and the IRB. 39 provides sequential instructions to the processor core.
  • Proceeding to the next sequential instruction block is handled as a branch: the instruction type of the end track point of a track is set to an unconditional branch, so the processing is the same as the branch handling described above.
  • The method and system in this embodiment are also applicable to other multi-level track-based instruction cache systems, such as the embodiments of Figures 11, 12, 13, and 18.
  • The functional modules in FIG. 12 can be divided across the two ends of a communication channel with a long delay. Assume that the memory 111 in Fig. 12 is located at one end of the communication channel and the remaining modules are located at the other end.
  • The communication channel may be between one processor core and the memory of another processor core on the same chip; between one processor lane and the memory of another processor lane on the same chip; between a processor core on one chip and memory on another chip; between the processor of one computer and the memory of another computer; between a processor core or computer and memory at the other end of a wired or wireless network; or any other communication channel with a long delay.
  • An IPv6 address is 128 bits. Assuming the memory address is 64 bits, the IPv6 address and the memory address are combined into a 192-bit address that addresses the memory at the far end of the network. To support the 192-bit address, only the components 43, 51, and 113 in Figure 12 need to provide 192 bits of width, while their functions and operations remain the same; the remaining components need no change on account of this 192-bit width.
  • The TLB/TAG unit 51 must be able to store tags supporting a 192-bit address (such as a 128-bit network tag plus a 64-bit memory tag), and the scanner 43 must be able to add the 192-bit current block address provided by 51, the intra-block offset of the branch instruction, and the branch offset to obtain a 192-bit branch target address.
  • This 192-bit branch target address is matched against the contents of the tag unit TAG in 51. If there is no match, the 192-bit branch target address is sent via bus 113 to the memory 111 at the other end of the channel to fetch the instruction. If there is a match, the resulting BN3 or BN2 address is stored in the secondary track table 88 as described above for the embodiment of FIG. 12, and the details are not repeated here.
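As a rough illustration of the address composition and matching just described, the following sketch forms a 192-bit network memory address (128-bit IPv6 address concatenated with a 64-bit memory address), computes a branch target, and matches it against a tag store. All helper names, the dict-based tag store, and the 6-bit block offset are assumptions of this sketch, not the patent's design.

```python
# Sketch: 192-bit network memory address = 128-bit IPv6 address (high bits)
# concatenated with a 64-bit memory address (low bits).

IPV6_BITS = 128
MEM_BITS = 64

def make_network_memory_address(ipv6_addr: int, mem_addr: int) -> int:
    """Concatenate the network address (high bits) with the memory address."""
    assert 0 <= ipv6_addr < (1 << IPV6_BITS)
    assert 0 <= mem_addr < (1 << MEM_BITS)
    return (ipv6_addr << MEM_BITS) | mem_addr

def branch_target(block_addr_192: int, intra_block_offset: int,
                  branch_offset: int) -> int:
    """Block address + intra-block offset + branch offset (as in scanner 43)."""
    return (block_addr_192 + intra_block_offset + branch_offset) \
        % (1 << (IPV6_BITS + MEM_BITS))

def tag_lookup(tag_store: dict, target_192: int, block_bits: int = 6):
    """Match the 192-bit target against the tag unit.
    Returns a cache (BN) address on a hit, or None on a miss
    (a miss would send the address over bus 113 to the remote memory)."""
    tag = target_192 >> block_bits      # drop the intra-block offset bits
    return tag_store.get(tag)
```

Only the components carrying this address need 192 bits of width; the hit result is an ordinary cache address whose width is unrelated to the 192-bit input.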
  • The above specific embodiment, which applies the structure of Fig. 12 across a communication channel, can also be applied to the structures of Figs. 13 and 18.
  • Assume the memory 111 in FIG. 18 is located at one end of the communication channel and the remaining modules are located at the other end.
  • Operation of the instruction memory at the far end of the communication channel can be supported as long as the widths of the TLB/TAG unit 51, the scanner 43, and the bus 113 can accommodate the memory address extended with the network address prefix.
  • The instruction memory portion of the specific embodiment of FIG. 13 is the same as that of FIG. 18 described above and will not be repeated. In FIG. 18, the adder 169 that generates data addresses and its output bus 198 can likewise be widened to support the memory address with the network address prefix as described above. Apart from the modules 51, 43, and 169 and the widths of buses 113 and 198, the remaining modules in Figure 18 need not be changed, since they operate on cache addresses.
  • The network memory address (network address + memory address) is mapped to a cache address via the tag unit TAG in 51. The width of the cache address depends on the organization of the cache and is independent of the network memory address.
  • The address on the bus 113 may be transmitted as a packet: the network address portion of the network memory address is placed in the packet header, and the memory address portion is placed in the packet contents.
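One plausible encoding of such a packet is sketched below. The patent only specifies that the network address goes in the header and the memory address in the packet contents; the byte layout, field order, and sizes here are assumptions for illustration.

```python
import struct

# Hypothetical packet layout for bus 113: 128-bit network address in the
# header, 64-bit memory address in the payload (big-endian throughout).

def pack_request(net_addr: int, mem_addr: int) -> bytes:
    header = net_addr.to_bytes(16, "big")     # network address -> packet header
    payload = struct.pack(">Q", mem_addr)     # memory address -> packet contents
    return header + payload

def unpack_request(packet: bytes):
    net_addr = int.from_bytes(packet[:16], "big")
    (mem_addr,) = struct.unpack(">Q", packet[16:24])
    return net_addr, mem_addr
```

The receiving memory 111 would unpack the payload and use the memory address locally; an arbiter there orders concurrent requests, as noted in the text.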
  • An arbiter should be present in 111 to determine the order of accesses.
  • The network address corresponding to each thread is stored in a thread register in the processor core.
  • The adder 169, or the adder in the scanner 43 in Fig. 18, can use a bit width equal to that of the network memory address, but an optimized implementation only needs a bit width sufficient for the memory address.
  • The thread number of the currently executing thread reads the thread register to obtain the network address stored for that thread. The network address is concatenated with the calculated memory address to form the network memory address, which is sent to the tag unit TAG in 51 for matching.
  • The tag unit can store a number of full network memory addresses, for example 192 bits per entry, but there are several ways to optimize. One is to use two tables: each entry in Table 2 stores, in addition to the tag of the memory address, the row number of another table, Table 1, and each entry of Table 1 stores a network address. The network address portion of the network memory address is first matched against the contents of Table 1 to obtain a Table 1 row number. The obtained Table 1 row number is then combined with the memory address to match against Table 2. The matching result of Table 2 is the cache address. If there is no match, the instruction is fetched from the memory 111 via the bus 113 using the network memory address and filled into the memory 112.
  • The other way is to store in Table 2, in addition to the tag of the memory address, only the row number (or thread number) of the above thread register. The row number (or thread number) of the thread register is combined with the memory address to match against Table 2. If there is no match, the network address addressed by that row number (or thread number) in the thread register is concatenated with the memory address to form the network memory address, which is used to fetch from the memory 111 via the bus 113 into the memory 112. The actual added cost is therefore small.
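A minimal sketch of the first two-table optimization described above: Table 1 holds network addresses, and Table 2 holds a Table 1 row number plus a memory-address tag, so full 192-bit addresses are not duplicated in every tag entry. The class and its structures are simplified assumptions, not the patent's hardware.

```python
# Table 1: row -> network address.
# Table 2: (table1_row, memory tag) -> cache address (BN).

class TwoLevelTag:
    def __init__(self, block_bits: int = 6):
        self.block_bits = block_bits
        self.table1 = []      # list of network addresses; index = row number
        self.table2 = {}      # (table1_row, mem_tag) -> cache address

    def insert(self, net_addr: int, mem_addr: int, bn):
        if net_addr not in self.table1:
            self.table1.append(net_addr)
        row = self.table1.index(net_addr)
        self.table2[(row, mem_addr >> self.block_bits)] = bn

    def lookup(self, net_addr: int, mem_addr: int):
        """Return the cache address on a hit, or None on a miss
        (a miss triggers a fetch from memory 111 over bus 113)."""
        if net_addr not in self.table1:
            return None                       # unknown network address
        row = self.table1.index(net_addr)
        return self.table2.get((row, mem_addr >> self.block_bits))
```

Because many cached blocks typically share a few network addresses, Table 1 stays small while Table 2 entries shrink from 192-bit tags to a short row number plus a memory tag.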
  • In the embodiments of Figures 12, 13, and 18, the scanner 43 calculates the branch target instruction address of a branch instruction based on the instruction block address of that branch instruction obtained from the tag unit in 51.
  • Since physical addresses are stored in the tag unit in 51, the branch target instruction address calculated by the scanner 43 is a physical address.
  • As long as it does not cross a physical page boundary, the physical address of the branch target instruction can be matched directly against the contents of the tag unit in 51 without TLB mapping.
  • Likewise, the data address generated by the adder 169 based on the physical address in the tag unit of 51 is a physical address. As long as it does not cross a physical page boundary, it can be matched directly against the contents of the tag unit in 51 without TLB mapping.
  • The match result is the BN address of the last level cache.
  • Thus the scanner 43 and the data address generator 169 generate physical addresses that can be directly matched against the TAG in 51.
  • All other addresses that address the last level cache (the addresses on bus 29 in Figures 4 and 5, on buses 8 and 89 in Figures 8, 11, and 12, and on bus 119 in Figures 13 and 18) are in the cache address format BN, which can directly address the last level cache memory, the active table AL, the correlation table CT, and the tag unit TAG in 51, without mapping through the TLB or the tag unit TAG in 51.
  • The systems and methods proposed by the present invention can be used in a variety of computing and data processing systems, information and data storage systems, and communication systems.
  • The system and method proposed by the present invention can mask or significantly reduce storage system access latency and cache misses.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Advance Control (AREA)

Abstract

Provided are a processor system and a method. When applied in the processor and computer fields, the caching system can actively push instructions and data to the processor core, sparing the processor core the latency of fetching instructions and data from a cache, thereby improving processor performance.

Description

Processor system and method based on instruction and data push

Technical field
The invention relates to the fields of computers, communications, and integrated circuits.
Background art
The central processor in a stored-program computer generates an address and sends it to the memory, from which an instruction or data is read back for execution by the central processor; the result of execution is sent back to the memory for storage. As technology advances, memory capacity grows, memory access latency increases, and the channel delay of memory accesses increases, while the execution speed of the central processor keeps rising; memory access latency has therefore become a bottleneck for improving computer performance. Stored-program computers consequently use caches to mask memory access latency and relieve this bottleneck. But the central processor fetches instructions or data from the cache in exactly the same way: the processor core in the central processor generates an address and sends it to the cache, and if the address matches an address tag stored in the cache, the cache sends the corresponding information directly to the processor core for execution, thereby avoiding the delay of accessing memory. As technology advances further, cache capacity grows, cache access latency increases, and the channel delay of cache accesses increases, while the execution speed of the processor core keeps rising; cache access latency has now become a serious bottleneck for computer performance.
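For contrast with the push approach this invention introduces, the conventional pull-style lookup described above can be modeled as follows. A direct-mapped organization and the specific bit widths are chosen purely for brevity; real caches are larger and usually set-associative.

```python
# Conventional "pull": the processor core sends an address; the cache compares
# it against a stored tag and returns the data only on a match.

LINE_BITS = 4       # 16-byte lines (assumed)
INDEX_BITS = 2      # 4 sets (assumed)

def split(addr: int):
    offset = addr & ((1 << LINE_BITS) - 1)
    index = (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (LINE_BITS + INDEX_BITS)
    return tag, index, offset

def pull_lookup(cache, addr: int):
    """cache: per-set list of (tag, line_bytes) or None."""
    tag, index, offset = split(addr)
    entry = cache[index]
    if entry is not None and entry[0] == tag:
        return entry[1][offset]   # hit: data returned to the core
    return None                   # miss: the core suffers the memory delay
```

Note the round trip: the core must first produce and transmit the address before any data can come back, which is exactly the two-way delay the push model below removes.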
Technical problem
The above manner in which the processor core fetches information (including instructions and data) from memory for execution can be viewed as the processor core pulling information from the memory. Pulling information traverses the delay channel twice: once when the processor sends the address to the memory, and once when the memory sends the information back to the processor core. Moreover, to support this pull model, the processor of every stored-program computer has modules for generating and recording instruction addresses, and its pipeline necessarily contains instruction fetch stages. Instruction fetch in modern stored-program computers usually requires several pipeline stages, which deepens the pipeline and increases the penalty of branch mispredictions. Generating and recording a long instruction address also consumes considerable energy. In particular, computers that convert variable-length instructions into fixed-length micro-operations for execution must reverse-translate the address of a fixed-length micro-operation into the address of a variable-length instruction in order to address the cache, at significant cost.
The method and system proposed by the present invention directly address one or more of the above and other difficulties.
The present invention proposes a processor system comprising a push buffer and a corresponding processor core, characterized in that: the processor core neither generates nor maintains instruction addresses, and its pipeline has no instruction fetch stage; the processor core only provides branch decisions to the push buffer, and, when executing an indirect branch instruction, provides the base address stored in the register file; the push buffer extracts and stores the control flow information contained in the instructions it holds, and pushes instructions to the processor core for execution according to the control flow information and the branch decisions; when an indirect branch instruction is encountered, the push buffer provides the correct indirect branch target instruction to the processor core for execution based on the base address received from the processor core. Further, the push buffer may provide the processor core with both the sequential successor of a branch instruction and the branch target instruction, and the branch decision generated by the processor core selects one of the two for execution, thereby masking the delay of transmitting the branch decision from the processor core to the push buffer.

Further, the push buffer may store the base addresses of indirect branch instructions together with the corresponding indirect branch target addresses, reducing or eliminating the delay of pushing indirect branch target instructions and partially or completely masking the delay of sending the base address from the processor core to the push buffer. Still further, the push buffer may push instructions to the processor core ahead of time based on the control flow information it stores, partially or completely masking the delay of transmitting information from the push buffer to the processor core. The processor core of the processor system proposed by the present invention needs no instruction fetch pipeline stage, nor does it need to generate or record instruction addresses.
The present invention proposes an organization for a multi-level cache hierarchy whose last (lowest) level cache (Last Level Cache, LLC) is set-associative and has a virtual-to-physical address translation buffer (TLB) and a tag unit (TAG): a virtual memory address is translated by the TLB into a physical memory address, and the resulting physical address is matched against the TAG contents to obtain the LLC cache address. Since the LLC cache address is mapped from the physical memory address, the LLC cache address is effectively a physical address. The resulting LLC cache address can be used to address the LLC information memory (RAM) and to select entries of the LLC active table. The LLC active table stores the mapping between LLC cache blocks and cache blocks in higher-level caches; that is, the LLC active table is addressed by the LLC cache address, and its entries contain the corresponding higher-level cache block addresses. In the present invention, all cache levels other than the LLC are fully associative and are directly addressed by their own level's cache addresses, requiring no tag unit TAG or TLB.

The cache address of each level is mapped to the next higher level's cache address through an active table similar to the LLC active table: it is addressed by the current level's cache address, and its entries store higher-level cache addresses. The highest-level cache has a corresponding track table, which stores the control flow information extracted by the scanner as it scans and examines instructions being stored into the highest-level cache RAM. The track table is addressed by the highest-level cache address, and its entries store the branch target addresses of branch instructions. The tracker generates a highest-level cache address that addresses the first read port of the highest-level cache memory, whose output of sequential instructions is pushed to the processor core; the same address also selects the corresponding track table entry to read out the branch target address, and that branch target address addresses the second read port of the highest-level cache memory, whose output of branch target instructions is likewise pushed to the processor core. The processor core executes the branch instruction and produces a branch decision, selecting one of the two instructions for execution and discarding the other. The branch decision also directs the tracker to select the corresponding one of the two cache addresses and continue addressing the highest-level cache to push instructions to the processor core.
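Schematically, the LLC lookup chain just described is: virtual address → (TLB) → physical address → (TAG match) → LLC cache address, after which higher levels are reached through active tables without any further tag matching. The toy model below uses Python dicts as stand-ins for the TLB, TAG, and active table hardware; page and block sizes are assumptions.

```python
# Virtual address -> (TLB) -> physical address -> (TAG) -> LLC cache address.
# Higher cache levels are then addressed directly via active tables.

PAGE_BITS = 12   # 4 KB pages (assumed)
BLOCK_BITS = 6   # 64-byte cache blocks (assumed)

def llc_address(vaddr: int, tlb: dict, tag: dict):
    """Map a virtual address to an LLC cache address (block number, offset)."""
    vpn, page_off = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
    ppn = tlb[vpn]                              # virtual -> physical page
    paddr = (ppn << PAGE_BITS) | page_off
    block = tag.get(paddr >> BLOCK_BITS)        # physical -> LLC block number
    if block is None:
        return None                             # LLC miss
    return block, paddr & ((1 << BLOCK_BITS) - 1)

def upper_level_address(active_table: dict, llc_block: int):
    """Active table: LLC block -> higher-level cache block (if cached there)."""
    return active_table.get(llc_block)
```

Translation and tag matching thus happen exactly once, at the LLC; every level above is indexed by short cache addresses, which is why the fully associative upper levels cost roughly what a direct-mapped cache would.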
The present invention proposes a cache replacement method that determines replaceable cache blocks according to the degree of association between cache blocks. The track table records the paths by which branch sources jump to branch targets. A correlation table additionally records, for each cache block, the corresponding lower-level cache address of the block's contents, the branch source paths that jump into the block, and the number of branch sources that jump into it. The degree of association of a cache block can be defined by the count of branch sources jumping into it: the smaller the count, the lower the degree of association, and such blocks can be replaced first. Among cache blocks with the same minimum degree of association, the one whose previous replacement is oldest can be replaced, so that a block that has just been replaced is not replaced again immediately. When a cache block is replaced, the branch source paths stored in the correlation table are used to address the corresponding track table entries, and the block's address in those entries is replaced with the corresponding lower-level cache address of the block's contents from the correlation table, preserving the integrity of the control flow information. The above replacement is based on the degree of association within the same storage level.
The minimum-association replacement method can also be applied between different storage levels. Here the number of higher-level cache blocks holding the same contents as a cache block is recorded as its degree of association: the smaller the count, the lower the association, and the block with the lowest association is replaced. This method may also be called the Least Children method, where a child is a higher-level cache block with the same contents as the block in question. The number of track table entries that take the cache block as a branch target is also recorded (the cache block and the track table may be at different storage levels). When both counts are '0', the cache block can be replaced. If the child count is not '0', the block can be replaced after its child cache blocks have been replaced. If the number of track table entries taking the block as a branch target is not '0', replacement can wait until that count reaches '0', or the block's address in those track table entries can first be replaced with the lower-level cache address holding the block's contents. Minimum-association replacement between storage levels can also be combined with the earliest-replaced policy described above.
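The victim selection just described can be sketched as follows. The explicit counts and timestamps, and the dict representation, are modeling assumptions; only the policy (zero children, zero incoming branch-target references, tie-break by oldest previous replacement) comes from the text.

```python
# Pick a victim cache block: it must have no "children" (identical copies in a
# higher level) and no track table entries targeting it; among candidates,
# prefer the one whose last replacement is oldest, so a freshly filled block
# is not evicted again immediately.

def pick_victim(blocks):
    """blocks: list of dicts with 'children', 'branch_refs', 'last_replaced'."""
    replaceable = [b for b in blocks
                   if b["children"] == 0 and b["branch_refs"] == 0]
    if not replaceable:
        return None   # must first evict children or patch track table entries
    return min(replaceable, key=lambda b: b["last_replaced"])
```

When no block qualifies, the text offers two outs: replace the children first, or rewrite the referring track table entries to point at the lower-level copy of the block's contents.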
The present invention provides a method of temporarily storing the register states of the tracker and the processor core into a memory indexed by thread number. The register states in that memory and in the tracker and processor core can be exchanged by thread in order to switch threads. Because the instructions of each thread in the push cache of the present invention are independent, the cache need not be flushed when switching threads, and one thread can never execute another thread's instructions.
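A toy model of this per-thread state swap is shown below. The register names are placeholders; only the swap-by-thread-number mechanism comes from the text.

```python
# Tracker and core register state is exchanged with a memory indexed by thread
# number; no cache flush is needed, since each thread's instructions are
# independent in the push cache.

def switch_thread(live_state: dict, saved: dict,
                  old_tid: int, new_tid: int) -> dict:
    """Stash the outgoing thread's registers and restore the incoming one's."""
    saved[old_tid] = dict(live_state)        # save outgoing thread state
    return dict(saved.get(new_tid, {}))      # restore (or start fresh)
```

A switch is therefore just two register-file copies keyed by thread number, with no interaction with the cache contents at all.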
The present invention proposes a method and a processor system that can simultaneously execute instructions provided by a plurality of memory levels.
The present invention proposes a track-table-based method and system for function calls and function returns.
The present invention proposes a method and system for organizing the computer memory hierarchy in which, apart from the hard disk, all storage levels, including the traditional main memory, are organized as caches and managed by hardware, without the operating system allocating memory. In this scheme, reading an instruction or data does not require tag unit matching, which reduces latency.
The present invention proposes a fully associative caching method that preserves the interrelationships of data across levels, using bidirectional address mappings of data between levels to avoid the compare-and-match operation between addresses and tags. Before a load instruction is executed, the cache system reads the data in advance and pushes (serves) it to the processor core, based on the stride information extracted and retained when the same load instruction was executed previously, together with the interrelationships above.
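The stride-based push can be sketched as follows. A per-load stride table is implied by the text (and by the stride table of Figure 17); its exact structure here is an assumption.

```python
# Per load instruction: remember the last data address and the observed stride;
# before the load executes again, push data[last + stride] toward the core.

def observe(stride_table: dict, load_id, addr: int):
    """Record the address actually accessed by this load and update its stride."""
    last, _ = stride_table.get(load_id, (None, 0))
    stride = addr - last if last is not None else 0
    stride_table[load_id] = (addr, stride)

def predict_next(stride_table: dict, load_id):
    """Address to read ahead of time and push to the core (None if unseen)."""
    entry = stride_table.get(load_id)
    if entry is None:
        return None
    last, stride = entry
    return last + stride
```

After two executions of a load marching through an array, the table has learned the stride and every subsequent iteration's data can be pushed before the load issues.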
The present invention proposes a method and system for extracting and recording the interrelationships of logically organized data (that is, data containing the address information of related data). The method and system learn autonomously from the results of executing load instructions, extracting the logical relationships between data and retaining them in a data track table. Entries in the data track table correspond one to one with entries in the data memory. A data track entry corresponding to 'data' in the data memory retains the 'data type' produced by analyzing the relationships between data. A data track entry corresponding to an 'address' in the data memory retains the 'address pointer' obtained after address mapping. The 'address pointer' can directly address the data memory to read data, without tag unit matching. Before the logical relationships have been extracted, the method and system push data to the processor core according to the data interrelationships described above. After the logical relationships have been extracted, and before a load instruction is executed, the cache system reads data in advance and pushes it to the processor core, based on the logical relationships retained in the data track table from the previous execution of the same load instruction and on the comparison results provided by the processor core's execution of the related instructions.
The memory hierarchy method and system of the present invention actively push most instructions and data to the processor core; for most of the time the processor core only needs to provide branch decisions or comparison results, along with the processor's pipeline stall signal.
The present invention provides a memory hierarchy and method by which a uniform memory address can access a memory hierarchy located at the other end of a communication channel.
The present invention provides a processor system comprising a processor core and a cache, where the cache pushes instructions and data to the processor core for execution and processing.
The system and method of the present invention provide a fundamental solution to the two-way delay of the processor core accessing the cache in a processor system. In a conventional processor system, the processor core sends a memory address to the cache, and the cache sends information (instructions or data) to the processor core according to that address. The system and method of the present invention, which exploit the correlation between instructions, instead have the cache push instructions to the processor core, avoiding the delay of the processor core sending memory addresses to the cache. Moreover, the push cache of the present invention is not part of the processor core's pipeline, so instructions can be pushed ahead of time to mask the cache-to-core delay.
The system and method of the present invention also provide a multi-level cache organization in which virtual-to-physical address translation and address mapping are performed only at the lowest level cache (LLC), unlike conventional caches where virtual-to-physical translation is performed at the highest-level cache and address mapping at every level. In this organization every cache level can be addressed by cache addresses mapped from physical memory addresses, so a fully associative cache approaches a direct-mapped cache in cost and power consumption.
The system and method of the present invention also provide a cache replacement method based on the degree of association between cache blocks, suited to caches that exploit the relationships between instructions (control flow).
Other advantages and applications of the present invention will be apparent to those skilled in the art.
Figure 1 is an embodiment of the track-table-based cache system of the present invention;
Figure 2 is an embodiment of the processor system of the present invention;
Figure 3 is another embodiment of the processor system of the present invention;
Figure 4 is another embodiment of the processor system of the present invention;
Figure 5 is another embodiment of the processor system of the present invention;
Figure 6 is the address format of the processor system in the embodiment of Figure 5;
Figure 7 is a partial storage table format of the processor system in the embodiment of Figure 5;
Figure 8 is another embodiment of the processor system of the present invention;
Figure 9 is an embodiment of the indirect branch target address generator of the processor system of the present invention;
Figure 10 is a schematic diagram of the pipeline structure of the processor core in the processor system of the present invention;
Figure 11 is another embodiment of the processor system of the present invention;
Figure 12 is an embodiment of the processor/memory system of the present invention;
Figure 13 is another embodiment of the processor/memory system of the present invention;
Figure 14 is the format of each storage table in the embodiment of Figure 13;
Figure 15 is the address format of the processor system in the embodiment of Figure 13;
Figure 16 is the format of the data track table, data active table, and data correlation table of the present invention;
Figure 17 is the format and operating principle of the stride table of the present invention;
Figure 18 is another embodiment of the processor/memory system of the present invention;
Figure 19 is a schematic diagram of the operating mechanism of the data cache hierarchy in the embodiment of Figure 18;
Figure 20 is an improved embodiment of the data cache hierarchy in the embodiment of Figure 18;
Figure 21 is an embodiment of prefetching data organized by logical relationships;
Figure 22 is an embodiment of handling function call and function return instructions;
Figure 23 is another embodiment of the processor system of the present invention.
The preferred embodiment of the invention is shown in Figure 18.
The high-performance cache system and method proposed by the present invention are described in further detail below with reference to the accompanying drawings and specific embodiments. The advantages and features of the present invention will become clearer from the following description and claims. It should be noted that the drawings are all in greatly simplified form and use imprecise proportions, serving only to conveniently and clearly illustrate the embodiments of the present invention.
It should be noted that, in order to clearly illustrate the content of the present invention, multiple embodiments are presented to further explain different implementations, where the multiple embodiments are illustrative rather than exhaustive. In addition, for brevity of explanation, content already mentioned in an earlier embodiment is often omitted in later embodiments; content not mentioned in a later embodiment can therefore be found in the earlier embodiments.
虽然该发明可以以多种形式的修改和替换来扩展,说明书中也列出了一些具体的实施图例并进行详细阐述。应当理解的是,发明者的出发点不是将该发明限于所阐述的特定实施例,正相反,发明者的出发点在于保护所有基于由本权利声明定义的精神或范围内进行的改进、等效转换和修改。同样的元器件号码可能被用于所有附图以代表相同的或类似的部分。Although the invention may be modified in various forms of modifications and substitutions, some specific embodiments of the invention are set forth in the specification and detailed. It should be understood that the inventor's point of departure is not to limit the invention to the particular embodiments set forth, but the inventor's point of departure is to protect all improvements, equivalent transformations and modifications based on the spirit or scope defined by the claims. . The same component numbers may be used in all figures to represent the same or similar parts.
此外,在本说明书中对部分实施例进行了一定的简化,目的是为了能更清楚地表达本发明技术方案。应当理解的是,在本发明技术方案的框架下改变这些实施例的结构、时延、时钟周期差异和内部连接方式,都应属于本发明所附权利要求的保护范围。In addition, some embodiments have been simplified in the present specification in order to more clearly express the technical solutions of the present invention. It should be understood that changing the structure, delay, clock cycle difference and internal connection manner of these embodiments under the framework of the technical solution of the present invention should fall within the protection scope of the appended claims.
A data structure called a track table can be used to improve the caches in a processor system. The track table stores not only branch-target instruction information for branch instructions but also information on sequentially executed instructions. Figure 1 shows an example of a cache system of the present invention containing a track table, in which 10 is an embodiment of the track table. Track table 10 has the same number of rows and columns as level-one cache 22; each row is a track corresponding to one level-one cache block in the level-one cache, and each entry on a track corresponds to one instruction in that level-one cache block. In this example it is assumed that each level-one cache block holds at most 4 instructions, whose intra-block offset addresses BNY are 0, 1, 2, and 3, respectively. The following description uses five instruction blocks in level-one cache 22, whose level-one cache block addresses BN1X are 'J', 'K', 'L', 'M', and 'N', as an example. Track table 10 therefore has five corresponding tracks; each track holds at most 4 entries, corresponding to the at most 4 instructions in a level-one cache block, and the entries within a track are likewise addressed by BNY. In this example, track table 10 and the corresponding level-one cache 22 can both be addressed by the level-one cache address BN1, formed from the level-one cache block address BN1X and the intra-block offset address BNY, to read out a track-table entry and the corresponding instruction. Fields 11, 12, and 13 in Figure 1 form the entry format of track table 10; the entry format contains dedicated fields storing program-flow control information. Field 11 holds the instruction type; according to the type of the corresponding instruction, entries fall into two broad classes, non-branch instructions and branch instructions. Branch instructions can be further subdivided along one dimension into direct and indirect branches, and along another dimension into conditional and unconditional branches. Field 12 stores a cache block address, and field 13 stores an intra-block offset address. In Figure 1, field 12 is shown in the level-one cache BN1X format and field 13 in the BNY format. Other cache-address formats may also be used, in which case address-format information can be added to field 11 to indicate the format of fields 12 and 13. The track-table entry of a non-branch instruction contains only the instruction-type field 11 storing the non-branch type, while the entry of a branch instruction contains, in addition to instruction-type field 11, BNX field 12 and BNY field 13.
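The track-table organization just described can be sketched as a small data model. This is a minimal illustration, not part of the patent; the class and field names (`TrackEntry`, `itype`, and so on) are assumptions introduced for clarity.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrackEntry:
    # Field 11: instruction type; fields 12/13: target block address and offset.
    itype: str                    # e.g. 'nonbranch', 'cond_direct', 'uncond_direct'
    bnx: Optional[str] = None     # field 12 (BN1X format in this example)
    bny: Optional[int] = None     # field 13 (BNY)

# One track per level-one cache block: four entries (BNY 0..3) plus end column 16.
track_M = [
    TrackEntry('nonbranch'),
    TrackEntry('nonbranch'),
    TrackEntry('cond_direct', 'J', 3),    # entry 'M2': branch target 'J3'
    TrackEntry('nonbranch'),
]
end_M = TrackEntry('uncond_direct', 'N')  # end entry: next sequential block 'N'

# Addressing by BN1 = (BN1X, BNY): BN1X selects the track, BNY selects the entry.
assert track_M[2].bnx == 'J' and track_M[2].bny == 3
```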
Track table 10 in Figure 1 shows only fields 12 and 13. For example, the value 'J3' in entry 'M2' indicates that the branch target of the instruction corresponding to entry 'M2' has level-one cache address 'J3'. Thus, when entry 'M2' is read from track table 10 according to the track-table address (i.e., the level-one cache address), field 11 of the entry shows that the corresponding instruction is a branch instruction, and fields 12 and 13 show that its branch target is the instruction at address 'J3' in the level-one cache. Addressing the level-one cache then finds the instruction with BNY '3' in instruction block 'J', which is the branch target instruction. In addition to the columns with BNY '0' through '3', track table 10 contains an extra end column 16. Each end entry has only fields 11 and 12: field 11 stores an unconditional-branch type, and field 12 stores the BN1X of the instruction block sequentially following the block corresponding to that row. The next instruction block can thus be found directly in the level-one cache from this BN1X, and the track corresponding to that next block can be found in track table 10.
Blank entries in track table 10 correspond to non-branch instructions; the remaining entries correspond to branch instructions and also show the level-one cache address (BN1) of the branch target (instruction) of the corresponding branch instruction. For a non-branch entry on a track, the next instruction to execute can only be the instruction represented by the entry to its right on the same track. For the last entry on a track, the next instruction to execute can only be the first valid instruction of the level-one cache block pointed to by the content of the track's end entry. For a branch entry on a track, the next instruction to execute may be either the instruction represented by the entry to its right or the instruction pointed to by the BN in the entry, as selected by the branch decision. Track table 10 therefore contains all program control-flow information for all instructions stored in the level-one cache.
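The three next-instruction rules above can be condensed into one resolution function. The sketch below is illustrative only; the function and dictionary names are assumptions, and a single `taken` bit stands in for the branch decision.

```python
def next_bn1(track, end_bn1x, bnx, bny, taken, block_size=4):
    """Resolve the next (BN1X, BNY) per the track-table rules:
    a taken branch follows the stored target; otherwise execution falls
    through to the entry on the right, or to the first instruction of the
    end entry's block when the track end is reached."""
    entry = track[bny]
    if entry['type'] != 'nonbranch' and taken:
        return entry['target']            # branch taken: stored BN target
    if bny + 1 < block_size:
        return (bnx, bny + 1)             # fall through within the block
    return (end_bn1x, 0)                  # end of track: next sequential block

# Track 'M' from Figure 1: only 'M2' is a branch, targeting 'J3'; end entry 'N'.
track_M = [{'type': 'nonbranch'}, {'type': 'nonbranch'},
           {'type': 'cond_direct', 'target': ('J', 3)}, {'type': 'nonbranch'}]
assert next_bn1(track_M, 'N', 'M', 2, taken=True) == ('J', 3)
assert next_bn1(track_M, 'N', 'M', 2, taken=False) == ('M', 3)
assert next_bn1(track_M, 'N', 'M', 3, taken=False) == ('N', 0)
```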
Please refer to Figure 2, which is an embodiment of the processor system of the present invention. This example contains level-one cache 22, processor core 23, controller 27, and track table 20, which is the same as track table 10 in Figure 1. Incrementer 24, selector 25, and register 26 form a tracker 47 (within the dashed line). Processor core 23 controls selector 25 in the tracker with branch decision 31, and controls register 26 in the tracker with pipeline stall signal 32. Under the control of controller 27 and branch decision 31, selector 25 selects either output 29 of track table 20 or the output of incrementer 24. The output of selector 25 is registered by register 26, whose output 28 is called the read pointer (RPT); its format is BN1. Note that the data width of incrementer 24 equals the width of BNY: it increments only the BNY part of the read pointer by '1', without affecting the BN1X value. If the incremented result overflows the width of BNY (i.e., the capacity of a level-one cache block, for example when the carry output of incrementer 24 is '1'), the system looks up the BN1X of the sequentially next level-one cache block in the end column to replace the current block's BN1X; the same applies in all the following embodiments and is not described again. The tracker in this embodiment accesses track table 20 with read pointer 28 to output an entry via bus 29, and also accesses level-one cache 22 to read out the corresponding instruction for execution by processor core 23. Controller 27 decodes field 11 of the entry output on bus 29. If the instruction type in field 11 is non-branch, controller 27 directs selector 25 to select the output of incrementer 24, so the read pointer is incremented by '1' in the next clock cycle and the sequentially next (fall-through) instruction is read from level-one cache 22. If the instruction type in field 11 is an unconditional direct branch, controller 27 directs selector 25 to select fields 12 and 13 on bus 29, so in the next cycle read pointer 28 points to the branch target and the branch target instruction is read from level-one cache 22. If the instruction type in field 11 is a direct conditional branch, controller 27 lets branch decision 31 control selector 25: if the branch is judged not taken, in the next cycle read pointer 28 is incremented by '1' by incrementer 24 and the sequential instruction is read from level-one cache 22; if the branch is judged taken, in the next cycle the read pointer points to the branch target and the branch target instruction is read from level-one cache 22. When the pipeline in processor core 23 stalls, pipeline stall signal 32 suspends the updating of register 26 in the tracker, so the cache system stops supplying new instructions to processor core 23.
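The BNY-width incrementer and the end-column substitution on carry-out can be sketched as follows. This is an illustrative software model of combinational hardware; `step_read_pointer` and its arguments are names introduced here, not from the patent.

```python
def step_read_pointer(bn1x, bny, end_bn1x, block_size=4):
    """Incrementer 24 adds '1' to the BNY part of the read pointer only;
    a carry out of the BNY field swaps in the end column's BN1X, moving
    the pointer to the first instruction of the next sequential block."""
    bny += 1
    if bny == block_size:          # carry output is '1': BNY overflowed
        return (end_bn1x, 0)
    return (bn1x, bny)

assert step_read_pointer('M', 2, 'N') == ('M', 3)   # stays in block 'M'
assert step_read_pointer('M', 3, 'N') == ('N', 0)   # crosses into block 'N'
```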
Returning to Figure 1, the non-branch entries of track table 10 can be discarded to compress the track table. In addition to the original fields 11, 12, and 13, the compressed entry format adds a source BNY (SBNY) field 15 to record the (source) intra-block offset address of the branch instruction itself, because compressed entries are shifted horizontally within the table: although the order among branch entries is preserved, they can no longer be addressed directly by BNY. Compressed track table 14 stores the same control-flow information as track table 10, in the compressed entry format. Track table 14 shows only SBNY field 15, BNX field 12, and BNY field 13. For example, entry '1N2' in row K represents the instruction at address K1, whose branch target is N2. End entries 16 occupy the rightmost column of track table 14 and are output through independent read port 30. When read pointer 28 addresses track table 14, its BN1X reads out the SBNY 15 values of all entries in the corresponding row, and each SBNY value is sent to the comparator of its column (such as comparator 18) to be compared with the BNY part 17 of the read pointer. Each comparator outputs '0' if its column's SBNY value is less than the BNY, and '1' otherwise. The comparator outputs are examined from left to right to find the first '1', which controls selector 19 to output, via bus 29, the content of the entry in that '1''s column from the row selected by BN1X. For example, when the address on read pointer 28 is 'M0', 'M1', or 'M2', the outputs of the three comparators 18 and so on, from left to right, are all '011', so the entry content output on bus 29 corresponding to the first '1' is '2J3' in each case. When the Figure 2 embodiment uses a compressed track table in format 14 as its track table 20, controller 27 compares the BNY on read pointer 28 with the SBNY on track-table output bus 29. If BNY is less than SBNY, the instruction corresponding to the track-table entry accessed by read pointer 28 is still later than the instruction accessed by the same read pointer 28, and the system may continue to step. If BNY equals SBNY, the track-table entry accessed by read pointer 28 corresponds exactly to the accessed instruction, and controller 27 may then direct selector 25 to perform a branch operation according to the branch type in field 11 on bus 29. For ease of description, the cache systems in the embodiments of Figures 1 and 2 above are described as supplying one instruction per clock cycle.
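The comparator-and-select readout of the compressed table can be sketched as a priority search for the first entry whose SBNY is not less than the read pointer's BNY. This is an illustrative model under stated assumptions (entries kept as sorted `(sbny, entry)` pairs), not the hardware itself.

```python
def read_compressed_row(row, bny):
    """Comparators 18 output '1' where SBNY >= BNY; selector 19 picks the
    leftmost '1' column. row: (sbny, entry) pairs in ascending SBNY order."""
    for sbny, entry in row:
        if sbny >= bny:                # comparator output '1'
            return (sbny, entry)       # first '1' from the left wins
    return None                        # no branch at or after this BNY

row_M = [(2, ('J', 3))]                # row M of table 14 holds '2J3'
for bny in (0, 1, 2):                  # 'M0', 'M1', 'M2' all read out '2J3'
    assert read_compressed_row(row_M, bny) == (2, ('J', 3))

# Controller 27's rule: keep stepping while BNY < SBNY; when BNY == SBNY the
# entry corresponds to the current instruction and the branch is acted on.
sbny, _ = read_compressed_row(row_M, 1)
assert 1 < sbny                        # at 'M1': not yet at the branch
```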
Please refer to Figure 3, which is another embodiment of the processor system of the present invention, in which 20 is the track table of the level-one cache, 22 is the memory RAM of the level-one cache, 39 is an instruction read buffer (IRB), 47 is the tracker, 91 is a register, 92 is a selector, and 23 is the processor core. Instruction read buffer IRB 39 can store part of a level-one instruction cache block, or one or more level-one instruction cache blocks, and is addressed by read pointer 28 of tracker 47. Read pointer 28 also addresses track table 20. The branch target address output by the track table addresses level-one cache 22 via bus 29 and is also sent to tracker 47 via bus 29. IRB 39 and level-one cache memory 22 together form a dual-read-port memory: IRB 39 provides the first read port, memory 22 provides the second read port, and register 91 temporarily holds the data output by the second read port. The output of IRB 39 and the output of level-one cache 22 (held in register 91) are selected by selector 92 under the control of branch decision 31 output by processor core 23, and the instruction output by selector 92 is sent to processor core 23 for execution.
The operation of the processor system of the Figure 3 embodiment is described below in conjunction with the contents of track table 14 in Figure 1. Every entry in end column 16 of table 14 is of the unconditional direct branch type. For ease of description, in all embodiments of this disclosure the other entries in table 14 are assumed to be of the direct conditional branch type. Initially, read pointer 28 points to address 'L0' and the corresponding instruction is read from IRB 39; the default value of branch decision 31 directs selector 92 to select this instruction from IRB 39 for execution by processor core 23. At the same time, address 'L0' on read pointer 28 addresses track table 14, which outputs entry '0M1' on bus 29; level-one cache 22 is accessed with address 'M1' on bus 29, and the corresponding branch target instruction is read out and stored in register 91. Controller 27 now compares SBNY field 15 on bus 29 with BNY field 13 on read pointer 28 and finds them equal, so selector 92 is controlled by branch decision 31. Assuming 31 is 'not taken' at this point, 31 directs selector 92 to select the output of IRB 39 in the next clock cycle. In the next clock cycle, read pointer 28 steps to address 'L1'; the corresponding instruction is read from IRB 39 and selected by selector 92 for execution by processor core 23. At the same time, address 'L1' on read pointer 28 addresses track table 14, which outputs entry '3J0' on bus 29; level-one cache 22 is accessed with address 'J0' on bus 29, and the corresponding instruction is read out as the branch target instruction and stored in register 91. Controller 27 compares SBNY field 15 on bus 29 with BNY field 13 on read pointer 28 and finds them unequal, so by default it directs selector 92 to select the output of IRB 39 for execution by processor core 23. In the next clock cycle, read pointer 28 steps to address 'L2'; controller 27 finds that SBNY field 15 on bus 29 and BNY field 13 on read pointer 28 are still unequal, so it still directs selector 92 to select the output of IRB 39 for execution by processor core 23. In the next clock cycle, read pointer 28 steps to address 'L3'; controller 27 now finds SBNY field 15 on bus 29 equal to BNY field 13 on read pointer 28, so selector 92 is controlled by branch decision 31. Assuming 31 is 'taken' at this point, it directs selector 92 to select the output of register 91, namely the branch target instruction at address 'J0', for execution by processor core 23. At the same time, branch decision 31 also directs tracker 47 to select 'J0' on bus 29 onto read pointer 28, and directs the 'J' level-one cache block to be stored into IRB 39. In the next cycle, read pointer 28 steps to 'J1' and directs IRB 39 to output the corresponding instruction, which is selected by selector 92 for execution by processor core 23.
Please refer to Figure 4, which is another embodiment of the processor system of the present invention, in which 40 is a level-two active list (AL2), 41 is the address-translation buffer TLB and tag unit TAG of the level-two cache, 42 is the memory RAM of the level-two cache, 43 is a scanner, 44 is a selector, 20 is the track table of the level-one cache, 37 is the correlation table of the level-one cache, 22 is the memory RAM of the level-one cache, 27 is the controller, 33 is a selector, and 39 is the instruction read buffer IRB. Incrementer 24, selector 25, and register 26 together form tracker 47; incrementer 34, selector 35, and register 36 together form tracker 48. 23 is the processor core, which can receive two streams of instructions and, under the control of the branch decision, select one to complete execution while abandoning execution of the other; 45 is a register that temporarily stores the state of each processor thread.
Scanner 43 examines instruction blocks being stored from level-two cache memory 42 into level-one cache memory 22 and computes the branch target address of each direct branch instruction in them, by adding the branch offset contained in the branch instruction to the memory address of the branch instruction itself. The computed branch target address is selected by selector 44 and sent to TLB/tag unit 41 for matching. The level-two cache address BN2 obtained from the match is used to access level-two active list 40. If the instruction corresponding to that level-two cache address has already been stored into level-one cache memory 22, the corresponding entry in 40 is valid; in that case, the BN1X block address in the entry is merged with the branch type and intra-block offset BNY produced by scanner 43 into one track-table entry. If the instruction corresponding to that level-two cache address has not yet been stored into level-one cache memory 22, the corresponding entry in 40 is invalid; in that case, the matched level-two cache address BN2 (including intra-block offset BNY) is merged with the branch type produced by scanner 43 into one track-table entry. The track-table entries thus produced for an instruction block are written, in instruction order, into the track of track table 20 corresponding to that instruction block in memory 22, completing the extraction and storage of the program flow contained in that instruction block.
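The scanner's track-building pass can be sketched as below. This is an illustrative model under simplifying assumptions: flat integer instruction addresses, a toy `tlb_match` function standing in for TLB/tag unit 41, and a dictionary standing in for level-two active list 40; none of these names come from the patent.

```python
def scan_block(block_addr, instrs, tlb_match, al2):
    """Scanner 43 sketch: for each direct branch, target = branch instruction
    address + branch offset; the TLB/tag match yields a BN2 address, and a
    valid AL2 entry upgrades the stored address to BN1 format."""
    track = []
    for bny, (is_branch, offset) in enumerate(instrs):
        if not is_branch:
            track.append(('nonbranch',))
            continue
        target = block_addr + bny + offset           # branch addr + offset
        bn2x, tgt_bny = tlb_match(target)
        if bn2x in al2:                              # AL2 entry valid: use BN1X
            track.append(('branch', 'BN1', al2[bn2x], tgt_bny))
        else:                                        # not yet in L1: keep BN2
            track.append(('branch', 'BN2', bn2x, tgt_bny))
    return track

tlb = lambda addr: divmod(addr, 4)                   # toy translation to (X, Y)
al2 = {5: 'J'}                                       # L2 block 5 resides in L1 'J'
track = scan_block(20, [(False, 0), (True, 2), (True, 9)], tlb, al2)
assert track[0] == ('nonbranch',)
assert track[1] == ('branch', 'BN1', 'J', 3)         # 21+2=23 -> block 5 -> 'J'
assert track[2] == ('branch', 'BN2', 7, 3)           # 22+9=31 -> block 7, not in L1
```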
Read pointer 28 produced by tracker 47 addresses track table 20, and the entry read out is output via bus 29. Controller 27 decodes the branch type and address format of the output entry. If the branch type of the output entry is a direct branch and the cache address is in BN2 format, controller 27 addresses level-two active list 40 with that BN2 address. If the entry in 40 is valid, the BN1X in that entry is filled into track table 20 to replace the BN2X in the above entry, converting it to BN1 format. If the entry in 40 is invalid, that BN2 address is used to address level-two cache memory 42; the instruction block read out is filled into a level-one cache block of level-one cache memory 22 provided by the level-one cache replacement logic, the block number BN1X of that level-one cache block is filled into the above invalid entry in 40 and the entry is set valid, and, as above, the BN1X is filled into the track-table entry, replacing the BN2 address in the entry with a BN1 address. The BN1 address written into track table 20 can be bypassed onto bus 29 and sent to tracker 47 for later use. If the branch type output via bus 29 is a direct branch and the cache address is already in BN1 format, controller 27 sends it directly to tracker 47 for later use.
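The lazy BN2-to-BN1 promotion described above can be sketched as one function. The names (`promote_entry`, the dictionaries modeling AL2 and the two cache RAMs, the `replacement` list standing in for the replacement logic) are assumptions made for illustration.

```python
def promote_entry(entry, al2, l2_ram, l1_ram, replacement):
    """Controller 27 sketch: a BN2-format direct-branch entry is rewritten
    in BN1 format, filling L1 from L2 first when the AL2 entry is invalid."""
    fmt, bn2x, bny = entry
    if fmt != 'BN2':
        return entry                        # already BN1: pass through
    if bn2x not in al2:                     # AL2 entry invalid: fill L1
        bn1x = replacement.pop(0)           # block chosen by replacement logic
        l1_ram[bn1x] = l2_ram[bn2x]         # copy the instruction block
        al2[bn2x] = bn1x                    # set the AL2 entry valid
    return ('BN1', al2[bn2x], bny)          # written back over the BN2 entry

al2, l1 = {}, {}
l2 = {7: ['i0', 'i1', 'i2', 'i3']}
e = promote_entry(('BN2', 7, 3), al2, l2, l1, replacement=['K'])
assert e == ('BN1', 'K', 3) and l1['K'] == l2[7] and al2[7] == 'K'
# A later reference to the same L2 block hits the now-valid AL2 entry.
assert promote_entry(('BN2', 7, 0), al2, l2, l1, replacement=[]) == ('BN1', 'K', 0)
```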
If the branch type output via bus 29 is an indirect branch, controller 27 directs the tracker to wait for processor core 23 to compute the indirect branch target address, which is sent via bus 46 and selector 44 to level-two cache TLB/tag unit 41 for matching. The level-two cache address BN2 obtained from the match is used to access level-two active list 40; if the corresponding entry in 40 is invalid, that BN2 address is used, as above, to address level-two cache memory 42 and read the instruction block into a level-one cache block of level-one cache memory 22, and the resulting BN1 address is bypassed to tracker 47 for later use. Correlation table 37 is part of the replacement logic of level-one cache 22; its structure and function are described in the embodiment of Figure 7.
In processor core 23, there are two pipeline paths ahead of the branch-resolution pipeline stage. One receives sequential instructions from instruction read buffer IRB 39 and is named the FT (fall-through) path; the other receives branch target instructions from level-one cache memory 22 and is named the TG (target) path. The number of front-end pipeline stages in each path is determined by the processor's pipeline structure; in this embodiment, each path is assumed to contain two front-end pipeline stages. The branch-resolution pipeline stage in processor core 23 executes branch instructions and, according to the generated branch decision 31, selects one of the two paths to complete execution while abandoning execution of the other. In this embodiment, IRB 39 is assumed to store two instruction blocks; instruction read buffer IRB 39 is addressed by IPT read pointer 38 of tracker 48. Level-one instruction cache 22, correlation table 37, and track table 20 are addressed by RPT read pointer 28 of tracker 47.
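The dual-path commit step can be reduced to one selection. The sketch below is a deliberately minimal illustration of how branch decision 31 chooses between the two pre-issued paths; `resolve_branch` and the list-of-labels representation are assumptions, not the patent's hardware.

```python
def resolve_branch(ft_path, tg_path, judgment):
    """Branch decision 31 picks which pre-issued path commits: '0' keeps
    the fall-through (FT) path, '1' keeps the target (TG) path; the other
    path's in-flight instructions are abandoned."""
    keep = tg_path if judgment else ft_path
    drop = ft_path if judgment else tg_path
    return keep, drop

keep, drop = resolve_branch(['M3', 'N0'], ['J3', 'K0'], judgment=0)
assert keep == ['M3', 'N0'] and drop == ['J3', 'K0']   # not taken
keep, drop = resolve_branch(['M3', 'N0'], ['J3', 'K0'], judgment=1)
assert keep == ['J3', 'K0'] and drop == ['M3', 'N0']   # taken
```

Because both paths are already in flight when the decision arrives, neither outcome requires a refetch, which is the basis of the no-stall claim made below.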
When processor core 23 has not yet resolved a branch, branch decision 31 defaults to '0', meaning not taken, and processor core 23 selects the instructions of the FT path for execution. When processor core 23 resolves a branch, branch decision 31 is '0' if the branch is judged not taken, in which case processor core 23 selects the instructions of the FT path for execution; it is '1' if the branch is judged taken, in which case processor core 23 selects the instructions of the TG path for execution. Selectors 33, 25, and 35 can all be controlled by branch decision 31: when 31 is '0', all three selectors select their right-hand inputs; when 31 is '1', all three select their left-hand inputs. In addition, when processor core 23 has not resolved a branch, selectors 33 and 25 are also controlled by controller 27. The operation of the processor system of the Figure 4 embodiment is described below in conjunction with the contents of track table 14 in Figure 1. Initially, instruction block M is already in instruction read buffer IRB 39, branch decision 31 is '1', selectors 25 and 35 select their left-hand inputs, and both IPT read pointer 38 and RPT read pointer 28 point to address M1. The M1 instruction in IRB 39 pointed to by IPT 38 is sent into the FT front-end pipeline in the processor core. At the same time, RPT 28 points into track table 20; the value 'N' of end entry 16 of row M is read out through independent read port 30 to address level-one cache 22, and instruction block N is output and stored into IRB 39. Entry '2J3' of row M of track table 14, matched by BNY address '1', is then output via bus 29. Branch decision 31 is now at its default value '0'; selector 35 selects the input from incrementer 34, IPT pointer 38 steps, and IRB 39 is directed to output instructions M2, M3, and N0 to the FT front-end pipeline of processor core 23. Controller 27 compares the value '2' in SBNY field 15 on bus 29 with the value '1' in BNY field 13 on RPT 28; while they are unequal, it directs selector 25 to select the output of incrementer 24, stepping RPT 28 to point to M2. Now the SBNY on bus 29 equals the BNY on RPT read pointer 28, so controller 27 directs selector 33 and selector 25 to select their right-hand inputs, that is, BN1 address J3 on bus 29 is stored into register 26. Thereafter, controller 27 directs RPT read pointer 28 to read instructions J3 and K0 from level-one cache 22 and send them to the TG front-end pipeline of processor core 23.
M2 is a branch instruction. When it reaches the pipeline stage of processor core 23 that resolves branches, that stage executes M2 and produces a branch decision. If branch decision 31 is '0', processor core 23 continues executing the FT-path instructions M3 and N0 and abandons the TG-path instructions J3 and K0. Branch decision 31 then directs selectors 25 and 35 to store the output of incrementer 34 into registers 26 and 36, so that RPT 28 and IPT 38 both point to N1, and IPT 38 directs IRB 39 to issue N1 and the subsequent instructions to the FT path of processor core 23 for continued execution. Meanwhile RPT 28 points to row N of the track table, reads the end entry of row N, and sends it to level-one cache 22, so that the instruction block sequentially following the N instruction block is read out and stored into IRB 39.
If branch decision 31 is '1', the processor core selects the TG-path instructions J3 and K0 to continue execution and abandons the FT-path instructions M3 and N0. Branch decision 31 then causes the K-row instructions output by level-one cache 22 to be stored into IRB 39, and directs selectors 25 and 35 to store the output of incrementer 24 into registers 26 and 36, so that RPT 28 and IPT 38 both point to K1, and IPT 38 directs IRB 39 to issue K1 and the subsequent instructions to the FT path of processor core 23 for continued execution. RPT 28 points to row K; the 'L' in the end entry of row K is sent to level-one cache 22 to read out row L, which is stored into IRB 39. In this way processor core 23 can execute instructions without interruption, with no pipeline stalls caused by branches.
Tracks corresponding to different threads are orthogonal in the track table; they can therefore coexist without affecting each other. The indirect branch address 46 generated by the processor core in Fig. 4 is a virtual address. It is concatenated with the thread number and selected by selector 44; its index portion is sent simultaneously to the TLB and the level-two tag unit in 41, while its virtual-tag portion, together with the thread number, is sent to the TLB to be mapped to a physical tag. That physical tag is matched against the tags of each way read out of the level-two tag unit by the index address, and the way number obtained from the match, concatenated with the index of the virtual address, forms the level-two cache block address. The level-two cache address BN2, and the level-one cache address BN1 mapped from it, are therefore in effect derived from the physical address rather than from the virtual address.
Consequently, two different threads using the same virtual address in the processor actually have different cache addresses BN, which avoids the address-aliasing problem of the same virtual address in different programs of different threads selecting the same cache location. On the other hand, the same virtual address of the same program in different threads maps to the same physical address, and therefore to the same cache address, which avoids duplication of the same program in the cache. Multithreaded operation can be implemented on the basis of this property of the cache address. In Fig. 4, 45 is a register file that stores, per thread, the thread number and the processor state, for example the contents of register 26 in tracker 47 and register 36 in tracker 48, as well as the values of that thread's registers in processor core 23. Register file 45 is addressed by thread number 49. When the processor is to switch threads, the values in register 26 and register 36 of trackers 47 and 48, and the values of the registers in processor core 23, are all read out and stored into the entry of 45 pointed to by the swap-out thread number then on bus 49. The swap-in thread number is then sent to 45 over bus 49, and the contents of the entry it points to are loaded into registers 26 and 36 and into the registers of processor core 23. After IRB 39 is filled with the instruction block pointed to by IPT 38 and its sequentially next instruction block, operation on the swapped-in thread can begin. The instructions of the different threads in track table 20 and in caches 42 and 22 are orthogonal, so one thread can never erroneously execute another thread's instructions.
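The save/restore sequence around register file 45 can be sketched as below. This is a minimal illustration under assumptions: the class name, the dictionary-based state, and the example register contents are invented for clarity; only the behavior (save the outgoing thread's tracker and core registers under its thread number, then load the incoming thread's) follows the description above.

```python
# Hypothetical sketch of register file 45: per-thread saved state,
# addressed by thread number, used at thread-switch time.

class RegisterFile45:
    def __init__(self):
        self.entries = {}                 # thread number -> saved state

    def swap_out(self, thread_no, state):
        """Save the outgoing thread's tracker/core registers."""
        self.entries[thread_no] = dict(state)

    def swap_in(self, thread_no):
        """Load the incoming thread's saved registers."""
        return dict(self.entries[thread_no])

rf = RegisterFile45()
rf.swap_out(0, {"reg26": "M2", "reg36": "M2", "core_regs": [1, 2]})
rf.swap_out(1, {"reg26": "K0", "reg36": "K0", "core_regs": [7, 8]})
assert rf.swap_in(1)["reg26"] == "K0"     # incoming thread 1 resumes at K0
```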
Please refer to Fig. 5, another embodiment of the processor system of the invention. The level-two active list 40, level-two cache memory RAM 42, level-two scanner 43, track table 20, level-one correlation table 37, level-one cache memory RAM 22, instruction read buffer 39, tracker 47, tracker 48 and processor core 23 have the same functions as the identically numbered modules in the Fig. 4 embodiment. Controller 27 and selector 33 are omitted from Fig. 5 for readability, but operation at the level-two cache and below is the same as in the Fig. 4 embodiment. Fig. 5 adds a level-three cache, consisting of a level-three active list 50, a level-three TLB and tag unit TAG 51, a level-three cache memory 52, a level-three scanner 53 and a selector 54, replacing the level-two TLB and tag unit 41 and the selector 44 of Fig. 4. In the Fig. 5 embodiment the last-level cache, level-three cache 52, is organized set-associatively, while level-two cache 42 and level-one cache 22 are both fully associative. Each level-two cache block in level-two cache 42 contains four level-one cache blocks, and each level-three cache block in each way of level-three cache 52 contains four level-two cache blocks.
Please refer to Fig. 6, which shows the address formats of the processor system in the Fig. 5 embodiment. A memory address is divided into a tag 61, an index 62, a level-two sub-address (L2 sub_address) 63, a level-one sub-address (L1 sub_address) 64, and an intra-block offset (BNY) 13. The level-three cache address BN3 consists of way number 65, index 62, level-two sub-address 63, level-one sub-address 64, and intra-block offset (BNY) 13. Way number 65 concatenated with index 62 is the level-three cache block address; fields 65, 62 and 63 together address one level-two instruction block within a level-three cache block; and all fields other than intra-block offset 13 are collectively called BN3X, which addresses one level-one instruction block within a level-three cache block. The level-two cache address BN2 consists of level-two cache block number 67, level-one sub-address 64, and intra-block offset (BNY) 13; level-two cache block number 67 addresses one level-two cache block, and all fields other than intra-block offset 13 are collectively called BN2X, which addresses one level-one instruction block within a level-two cache block. The level-one cache address BN1 consists of level-one cache block number 68 (BN1X) and intra-block offset (BNY) 13.
The intra-block offset (BNY) 13 is identical in all four address formats above; the BNY portion does not change during address conversion. In the BN2 address format, level-two block number 67 points to a level-two cache block and level-one sub-address 64 points to one of the four level-one instruction blocks within that level-two cache block. Likewise, in the BN3 address format, way number 65 and index 62 point to a level-three cache block, level-two sub-address 63 points to one of its four level-two instruction blocks, and level-one sub-address 64 points to one of the four level-one instruction blocks within the selected level-two instruction block.
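The field split of a BN3 address can be made concrete with a small sketch. The bit widths here are assumptions chosen for illustration (the patent does not fix them): a 3-bit BNY and 2-bit level-one and level-two sub-addresses, with everything above them treated as the level-three block address (way number plus index).

```python
# Sketch of the Fig. 6 BN3 field split with assumed (hypothetical) widths.
# Only the BNY field is shared unchanged by all address formats.

BNY_BITS, L1_BITS, L2_BITS = 3, 2, 2   # assumed widths, not from the patent

def split_bn3(bn3):
    """Split a BN3 value into (block address, L2 sub, L1 sub, BNY)."""
    bny = bn3 & ((1 << BNY_BITS) - 1)
    l1 = (bn3 >> BNY_BITS) & ((1 << L1_BITS) - 1)
    l2 = (bn3 >> (BNY_BITS + L1_BITS)) & ((1 << L2_BITS) - 1)
    blk = bn3 >> (BNY_BITS + L1_BITS + L2_BITS)   # way number 65 + index 62
    return blk, l2, l1, bny

blk, l2, l1, bny = split_bn3(0b10_01_11_101)
assert (blk, l2, l1, bny) == (0b10, 0b01, 0b11, 0b101)
```

Converting BN3 to BN2 or BN2 to BN1 under this layout would drop the high fields and substitute a block number, leaving the BNY bits untouched, matching the statement that BNY does not change during address conversion.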
Please refer to Fig. 7, which shows partial storage table formats of the processor system in the Fig. 5 embodiment, described below in conjunction with Figs. 5, 6 and 7. The format of the tag unit in 51 of Fig. 5 is physical tag 86. The CAM format of the TLB in 51 is thread number 83 plus virtual tag 84, and its RAM format is physical tag 85. The thread number 83 and virtual tag 84 selected and output by selector 54 are mapped in the TLB to a physical tag 85; the index address 62 of the virtual address reads physical tags 86 out of the tag unit, which are matched against 85 to obtain way number 65. Way number 65 concatenated with the index address 62 of the virtual address forms the level-three cache block address.
The AL3 level-three active list 50 of Fig. 5 is organized in a multi-way set-associative manner; each way has the same number of rows as the tag units of L3 cache 52 in 51, likewise addressed by index address 62. Each row holds a count field 79 and four BN2X fields 80; the several fields 80 of a row are addressed by level-two sub-address 63. Each field 80 has a corresponding valid bit 81. The same row of all ways shares one level-three pointer 82. The AL2 level-two active list 40 is organized fully associatively, with the same number of rows as L2 cache 42, addressed by level-two block address 67. Each row holds a count field 75 and four BN1X fields 76; fields 76 are addressed by level-one sub-address 64. Each field 76 has a corresponding valid bit 77. All rows share one level-two pointer 78. The CT correlation table 37 is organized fully associatively, with the same number of rows as L1 cache 22, addressed by level-one block address 68. Each row holds a count field 70, a BN2X field 71, and several BN1X fields 72. Each field 72 has a corresponding valid bit 73. All rows share one level-one pointer 74.
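The row formats just listed can be mirrored in code to make the field roles concrete. This is purely illustrative: the class names, Python types, and the fixed count of four sub-blocks per row are assumptions for the sketch, not the hardware layout.

```python
# Hypothetical mirror of two Fig. 7 row formats (field numbers kept in the
# attribute names so they can be matched against the figure).

from dataclasses import dataclass, field

@dataclass
class AL2Row:                       # one row of level-two active list 40
    count75: int = 0                # times this L2 block is a branch target
    bn1x76: list = field(default_factory=lambda: [None] * 4)
    valid77: list = field(default_factory=lambda: [0] * 4)

@dataclass
class CTRow:                        # one row of correlation table 37
    count70: int = 0                # branches targeting this L1 block
    bn2x71: int = 0                 # L2 position of this L1 block
    bn1x72: list = field(default_factory=list)   # branch-source blocks
    valid73: list = field(default_factory=list)

row = AL2Row()
row.valid77[2], row.bn1x76[2] = 1, 0b1010   # sub-block at L1 sub-address 2
assert row.valid77 == [0, 0, 1, 0]
```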
When one level-two instruction block of a level-three cache block in level-three cache 52 is stored into a level-two cache block of level-two cache 42, the block number of that level-two cache block is stored into the entry 80 addressed by level-two sub-address 63 in the row of level-three active list 50 corresponding to that level-three cache block, and the corresponding valid bit 81 is set to '1' (valid). The instructions of that level-two cache block are decoded by level-three scanner 53; the branch offset of each branch instruction is added to that instruction's address to obtain the branch target address. The address of the sequentially next level-two cache block is likewise obtained by adding the size of one level-two cache block to the memory address of this level-two cache block. Each branch target address or sequentially next level-two block address is selected by selector 54 and sent to the tag unit in 51 for matching; if it does not match, the address is sent to lower-level memory to fetch instructions into level-three cache memory 52. This guarantees that, for the instructions in level-two cache 42, the branch targets and the sequentially next level-two cache blocks are at least already in level-three cache 52 or in the process of being stored into 52.
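The two address computations the scanner performs for every filled block can be sketched as follows. The 64-byte block size and the list format of the branches are assumptions for illustration; the arithmetic itself (target = instruction address + branch offset; next block = block address + block size) is as described above.

```python
# Sketch of the scanner's address generation for one filled cache block.

L2_BLOCK_SIZE = 64   # assumed block size, not specified by the patent

def scan_block(block_addr, branches):
    """branches: list of (instr_offset_in_block, branch_offset) pairs."""
    targets = [block_addr + off + disp for off, disp in branches]
    next_block = block_addr + L2_BLOCK_SIZE   # sequentially next block
    return targets, next_block

targets, nxt = scan_block(0x1000, [(8, 0x40), (24, -0x10)])
assert targets == [0x1048, 0x1008]
assert nxt == 0x1040
```

Each generated address would then be matched against the tag unit, and fetched from lower-level memory only on a miss, which is what gives the prefetch guarantee stated above.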
When one level-one instruction block of a level-two cache block in level-two cache 42 is stored into a level-one cache block of level-one cache 22, the block number of that level-one cache block is stored into the entry 76 addressed by level-one sub-address 64 in the row of level-two active list 40 corresponding to that level-two cache block, and the corresponding valid bit 77 is set to '1' (valid). The instructions of that level-one cache block are decoded by level-two scanner 43; the branch offset of each branch instruction is added to that instruction's address to obtain the branch target address. The address of the sequentially next level-one cache block is likewise obtained by adding the size of one level-one cache block to the memory address of this level-one cache block. Each branch target address or sequentially next level-one block address is selected by selector 54 and sent to tag unit 51 for matching. If it does not match, the address is sent to lower-level memory to fetch instructions into level-three cache memory 52. If it matches, the 65, 62 and 63 portions of the resulting level-three cache address are used to read entries 80 and 81 of level-three active list 50. If 81 is '0' (invalid), the 65, 62, 63 and 64 portions of the matched level-three cache address address level-three cache memory 52, a level-two block is read out and stored into a level-two cache block of level-two cache memory 42, and the block number 67 of that level-two cache block, together with the valid value '1', is written into the entries 80 and 81 of level-three active list 50 addressed by the above level-three cache address.
If the entry 81 read out is '1' (valid), the BN2X value (67 and 64) in the entry 80 read out addresses AL2 level-two active list 40 to read entries 76 and 77. If 77 is '0' (invalid), the BN2X value concatenated with BNY forms a BN2 address (67, 64, 13), which is stored into the entry corresponding to the branch instruction on the track currently being filled in track table 20. If 77 is '1' (valid), the BN1X of the entry concatenated with BNY forms a BN1 address (68, 13), which is stored into the entry corresponding to the branch instruction on the track being filled in track table 20. In addition, the branch type 11 decoded by level-two scanner 43 is stored into the track entry of track table 20 together with the BN2 or BN1 address. The sequentially next block address of the level-one cache block is matched and addressed in the same manner; if the sequentially next level-two instruction block is not yet in level-two cache memory, the instruction block is stored from level-three cache 52 into level-two cache 42, and the resulting BN2 or BN1 address is stored into the rightmost end entry 16 of the above track. This guarantees that, for the instructions in level-one cache 22, the branch targets and the sequentially next level-one cache blocks are at least already in level-two cache 42 or in the process of being stored into 42.
This embodiment discloses a hierarchical prefetch capability: each storage level guarantees that its branch targets are at least in, or are being written into, the next lower storage level. As a result, the branch target instructions of the instructions the processor core is executing are in most cases already in the level-one or level-two cache, hiding the access latency of the lower storage levels.
While the above level-one instruction block is filled into level-one cache memory 22 and its instructions are scanned to build the corresponding track in track table 20, the corresponding row of correlation table 37 is also created. The BN2X address (67 and 64) of the level-one cache block is filled into field 71 of that row, so that when the level-one cache block is replaced, the BN2X address can replace the block number BN1X of that level-one cache block in the track table entries that target it, preserving the integrity of the control information flow in the track table. At the same time, the BN1X of the branch target being written into the track of track table 20 addresses a row of correlation table 37; the count value 70 of that row is incremented by '1', recording that one more branch instruction targets that row; the level-one cache block number of the track being written is itself written into one of that row's fields 72, and the corresponding field 73 is set to '1' (valid), recording the path (address) of the branch source. The next sequential level-one block address stored in the track's end entry is handled similarly, addressing a row of correlation table 37 with that address.
As described above, the branch target address in a track table 20 entry may be in BN2 or BN1 format. When a track table entry is output on bus 29, the controller (27 in Fig. 4) decodes its branch type 11. If the address format is BN2, the controller addresses level-two active list 40 with the BN2X address (67 and 64) on bus 29 and reads entries 76 and 77. If 77 is '0' (invalid), level-two cache memory 42 is addressed with that BN2X address, a level-one instruction block is read out and stored into a level-one cache block of level-one cache memory 22, and that level-one cache block number, together with the valid value '1', is stored into the entries 76 and 77 of level-two active list 40 pointed to by the BN2X address. If 77 is '1' (valid), the BN1X 68 in field 76 is written into field 12 of the track table entry without changing the BNY in field 13, so the BN1 address replaces the original BN2 address. The BN1X address can also be bypassed onto bus 29 for use by tracker 47. The process by which tracker 47 addresses track table 20 and level-one cache memory 22, and tracker 48 addresses IRB 39 to supply processor core 23 with uninterrupted instructions for execution, is the same as in the Fig. 4 embodiment and is not repeated here.
The cache replacement logic of this embodiment determines which cache blocks may be replaced by combining least correlation (LC) with earliest replacement (ER), hereinafter LCER. The count value 70 in correlation table 37 is used to measure correlation (also called degree of association): the smaller the count, the fewer cache blocks target that level-one cache block, making it easier to replace. The pointer 74 shared by the rows of correlation table 37 points to a replaceable row (the count value 70 of a replaceable row must be below a preset value). When the level-one cache block pointed to by pointer 74 is replaced, the corresponding track in track table 20 pointed to by 74 is also replaced with the branch types and branch targets extracted by level-two scanner 43 from the incoming level-one cache block. In addition, for each field 73 that is '1' (valid) in the row of correlation table 37 pointed to by 74, the BN1X address in the corresponding field 72 addresses a track in track table 20, and the branch target addresses in that track that recorded the replaced level-one cache block number are replaced with the BN2X in field 71 of the row of correlation table 37 pointed to by 74. Every instruction that originally targeted an instruction in the replaced level-one cache block now targets the same instruction in level-two cache 42, so replacing the level-one cache block does not affect the control information flow. The same BN2X also addresses level-two active list 40, and the count value 75 of the entry of 40 is incremented by the number of times BN1X was replaced by the BN2X value in track table 20 as above, recording the added correlation of that level-two cache block; the valid bit 77 of that entry corresponding to the replaced level-one cache block (indicated by field 64 of the BN2X address) is set to '0' (invalid). Pointer 74 then moves in a single direction and stops at the next row satisfying least correlation; when the pointer passes the boundary of the rows of correlation table 37 it wraps to the other boundary (past the highest-address row, least-correlation detection resumes from the lowest-address row). The one-way movement of pointer 74 ensures that the level-one cache block replaced longest ago is replaced first, i.e. the ER above. Checking the count value 70 of each row, combined with the one-way movement of pointer 74, implements the LCER level-one replacement policy. This replacement scheme replaces a single level-one cache block at a time.
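The one-way, wrapping scan of pointer 74 can be sketched as below. The threshold value and table contents are illustrative; only the scan discipline (advance in one direction, wrap at the boundary, stop at the next row whose count is below the preset value) comes from the description above.

```python
# Sketch of the LCER pointer scan over the correlation-table count fields.

def advance_pointer(counts, ptr, threshold):
    """Return the index of the next replaceable row after ptr (wrapping)."""
    n = len(counts)
    for step in range(1, n + 1):
        cand = (ptr + step) % n        # single-direction move with wrap
        if counts[cand] < threshold:
            return cand
    return None                        # no replaceable row exists

counts70 = [3, 0, 5, 1, 2]             # per-row correlation counts (example)
assert advance_pointer(counts70, 0, 2) == 1
assert advance_pointer(counts70, 1, 2) == 3   # ER: skips the row just used
assert advance_pointer(counts70, 3, 1) == 1   # wraps past the boundary
```

Because the scan always resumes after the last replaced row, the most recently replaced block is visited last, which is what realizes the earliest-replacement property.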
Replacement can also proceed along program order, forward or backward. For example, when a level-one cache block is replaced, the cache block pointed to by the level-one cache block number BN1X in the end entry of its track may also be replaced: this is forward (sequential) replacement. Or, when a level-one cache block is replaced, the level-one cache block number BN1X in the field 72 of its correlation-table row corresponding to the sequentially preceding cache block may also be replaced: this is backward replacement. Replacement may even proceed both forward and backward from one level-one cache block, and may continue in either direction until a level-one cache block is encountered whose count value 70 in correlation table 37 exceeds the preset value. This scheme replaces multiple level-one cache blocks at a time. The single-block or multi-block method can be chosen as needed, and the methods can be mixed: for example, use single-block replacement normally, and multi-block replacement when the lower-level cache lacks replaceable cache blocks.
Level-two cache replacement is also based on the LCER policy. Besides setting the corresponding field 77 of level-two active list 40 to '0' and incrementing count value 75 when a level-one cache block is replaced, as above, when a cache block is stored from level-two cache memory 42 into level-one cache memory 22 the corresponding valid bit 77 of the corresponding entry of level-two active list 40 is set to '1' and the level-one cache block number BN1X is written into the corresponding field 76. Each time a BN2X obtained by matching a branch target address or the like is stored into track table 20, the count value 75 corresponding to that BN2X in level-two active list 40 is incremented by '1'; each time a BN2X in a track table entry is replaced by a BN1X, the count value 75 corresponding to that BN2X in level-two active list 40 is decremented by '1'. Count value 75 thus records how many times a level-two cache block serves as a branch target; the valid bits 77 of an entry each record whether a portion of that level-two cache block has been stored into the level-one cache; and the fields 76 record the block addresses 68 of the corresponding level-one cache blocks. Level-two replacement moves the shared level-two pointer 78 in a single direction, stopping at the next replaceable level-two cache block.
A replaceable level-two cache block may be defined as one whose entry in level-two active list 40 has count value 75 and all fields 77 equal to '0': a level-two cache block may be replaced when it is unrelated to all instructions in level-one cache 22, and the one-way movement of pointer 78 guarantees the ER.
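The replaceability condition just defined is simple enough to state directly in code. This is a sketch of the test only; the function name and list representation of the valid bits are assumptions.

```python
# Sketch of the level-two replaceability test: the active-list entry must
# have count 75 == 0 (no track entry targets this L2 block) and every
# valid bit 77 == 0 (no part of it resides in the level-one cache).

def l2_replaceable(count75, valid77):
    return count75 == 0 and all(v == 0 for v in valid77)

assert l2_replaceable(0, [0, 0, 0, 0]) is True
assert l2_replaceable(0, [0, 1, 0, 0]) is False   # a part is in L1
assert l2_replaceable(2, [0, 0, 0, 0]) is False   # still a branch target
```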
Level-three cache replacement is likewise based on the LCER policy. When a cache block is stored from level-three cache memory 52 into level-two cache memory 42, the corresponding valid bit 81 of the corresponding entry of level-three active list 50 is set to '1' and the level-two cache block number BN2X is written into the corresponding field 80. The count value 79 in level-three active list 50 entries is not used in this embodiment. The level-three cache is organized set-associatively: each set (one index address) has several ways, and the ways of a set share one pointer 82. Pointer 82 likewise seeks the next replaceable way; here a replaceable way may be one in which all fields 81 are '0', meaning the level-three cache block is unrelated to the instructions in level-two cache 42 and may therefore be replaced. The above method of using a pointer to ensure that a just-replaced cache block is not replaced again immediately may also be substituted by other methods.
In this embodiment the third-level cache is organized set-associatively. If no way in a set is replaceable (every way of the third-level active table 50 has at least one field 81 set to '1'), the way whose fields 81 contain the fewest '1's may be selected and a compound replacement performed on its first-level cache blocks. For example, if only one field 81 of a way is '1', then only one of the four second-level instruction blocks that the third-level cache block can hold resides in the second-level cache memory 42. The BN2X in the field 80 corresponding to that field 81 is then output to address the second-level active table 40, from which the BN1X number in the first valid field 76 in address order (whose field 77 is '1') is read, and the number N of first-level cache blocks from that first-level cache block up to the last valid first-level cache block in the second-level cache block is computed. That BN1X number and the count N are sent to the first-level cache replacement logic, which replaces N first-level cache blocks starting from the block pointed to by BN1X, together with any cache blocks that target those blocks; the second-level cache block can then be replaced. Afterward all fields 81 of that way in the third-level active table 50 are '0', and the corresponding third-level cache block may be replaced. If the first-level cache blocks contained in the third-level cache block are not contiguous, a plurality of starting points and a plurality of corresponding N values are set as described above and sent to the first-level cache replacement logic for replacement in turn.
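The compound-replacement selection described above can be sketched as follows (hypothetical Python; the helper names and list encodings are illustrative, not from the disclosure): `pick_least_valid_way` chooses the way with the fewest field-81 bits set, and `contiguous_runs` derives the starting points and N values handed to the first-level cache replacement logic.

```python
def pick_least_valid_way(valid_bits_81):
    """Index of the way whose field-81 bits contain the fewest '1's."""
    return min(range(len(valid_bits_81)), key=lambda w: sum(valid_bits_81[w]))

def contiguous_runs(bn1x_valid):
    """Given per-sub-block validity (field 77) in address order, return
    (start, N) pairs, one per contiguous run of valid first-level cache
    blocks; several pairs model the non-contiguous case in the text."""
    runs, start = [], None
    for i, v in enumerate(list(bn1x_valid) + [0]):  # sentinel closes last run
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i - start))
            start = None
    return runs
```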
In the embodiment of FIG. 7, the count values at each level, namely 79 in the third-level active table 50, 75 in the second-level active table 40, and 70 in the (first-level) correlation table 37, record the degree of correlation of a cache block within the same storage level. The valid bits at each level that has a higher storage level above it record the degree of correlation of a cache block with that higher level; for example, field 81 in the third-level active table 50 records correlation with second-level cache blocks, and field 77 in the second-level active table 40 records correlation with first-level cache blocks. Field 73 in the correlation table 37 records the branch source addresses that jump to a first-level cache block. Accordingly, the BN2X address 71 of the cache block held in table 37 can be written in place of the BN1X address of that cache block in each entry pointed to by those branch source addresses in the track table 20, preserving the integrity of the control-flow information; the cache block can then be replaced. As an alternative replacement policy, a cache block whose degree of correlation is '0' may be selected for replacement. In essence, the cache system of the present invention operates on control-flow information, so the basic principle of cache replacement is that the integrity of the control-flow information must not be compromised.
Please refer to FIG. 8, which is another embodiment of the processor system of the present invention. FIG. 8 is an improvement on the embodiment of FIG. 5, in which the third-level active table 50, the third-level cache TLB and tag unit 51, the third-level cache memory 52, the selector 54, the second-level active table 40, the second-level cache memory 42, the track table 20, the first-level cache correlation table 37, the first-level cache memory 22, the instruction read buffer 39, the trackers 47 and 48, and the processor core 23 function the same as the identically numbered modules in the embodiment of FIG. 4. The second-level scanner 43 (which can generate branch types) is connected to the bus from the third-level cache 52 to the second-level cache 42; it is the only scanner in this embodiment. In addition, a second-level track table 88 is added. The caches in the embodiment of FIG. 8 are organized in the same manner as in the embodiment of FIG. 5.
Each track in the second-level track table 88 corresponds to one second-level cache block in the second-level cache 42. Each second-level track contains four first-level tracks, each corresponding to one first-level instruction block within the second-level cache block. The first-level tracks in the second-level track table 88 also use the format of SBNY 15, type 11, BNX 12, and BNY 13 of FIG. 1; the address format may be BN3 or BN2. The scanner 43 examines each second-level cache block sent from the third-level cache memory 52 to the second-level cache memory 42 for storage, and computes the branch target address of each branch instruction in it. The branch target address is selected by the selector 54 and sent to the TLB/tag unit 51 to be matched into a BN3 address; the BN3 address then addresses the third-level active table 50 to check whether the entry is valid (i.e., whether the corresponding cache block is already stored in the second-level cache memory 42). If valid, the BN2X address in the entry is concatenated with the BNY of the BN3 address to form a BN2 address, which together with the SBNY 15 and type 11 produced by the scanner is stored in the entry of the second-level track table 88 corresponding to that branch instruction; if invalid, the BN3 address together with SBNY 15 and type 11 is stored directly in the entry of table 88.
When one first-level instruction block within a second-level cache block of the second-level cache memory 42 is stored into a first-level cache block of the first-level cache memory 22, the second-level track table 88 outputs the corresponding first-level track over bus 89 for storage into the track table 20. If an entry on that track holds an address in BN3 format, the third-level active table 50 is addressed with it. If the entry's valid bit 81 is invalid, the second-level cache block is moved from the third-level cache 52 into a second-level cache block of the second-level cache 42 as described above, and that second-level cache block number is concatenated with the second-level sub-address 64 of the BN3 address to form a BN2X address, which is stored into field 80 of the third-level active table 50. If the entry is valid, the BN2X in the entry is stored into the second-level track table 88 in place of the original BN3X address; the BN2X is also bypassed onto bus 89 for storage into the track table 20. This embodiment uses the count value 79 in the third-level active table 50. Similar to the use of the count value 75 in the second-level active table in the embodiment of FIG. 6, when a BN3 address is written into the second-level track table 88, the count value 79 in the corresponding entry of the third-level active table 50 is incremented; when a BN3 address output by the second-level track table 88 is mapped to a BN2 address in the third-level active table 50, the corresponding count value 79 is decremented. During third-level cache replacement, not only the value of each valid bit 81 but also the count value 79 must be checked.
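The count-79 bookkeeping described above can be sketched as follows (hypothetical Python; the class mirrors one entry of the third-level active table 50 but is not the disclosed hardware):

```python
class L3ActiveEntry:
    """One entry of the third-level active table: valid bit 81,
    BN2X field 80, and count value 79."""
    def __init__(self):
        self.valid_81 = False
        self.bn2x_80 = None
        self.count_79 = 0

def on_bn3_written_to_table_88(entry):
    # A BN3 address was written into the second-level track table 88.
    entry.count_79 += 1

def map_bn3_to_bn2(entry):
    """Map a BN3 address read out of table 88 to its BN2 address,
    releasing one reference counted by field 79."""
    assert entry.valid_81 and entry.bn2x_80 is not None
    entry.count_79 -= 1
    return entry.bn2x_80
```

A third-level block is then withheld from replacement while either a set valid bit 81 or a nonzero count 79 indicates that it is still referenced.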
The BN2 address on bus 89 is also used to address the second-level active table 40. If the valid bit 77 of the entry in table 40 is invalid, the BN2 address is stored into the entry of the track table 20; if valid bit 77 is valid, the BN1X address in the table-40 entry is concatenated with the BNY of the BN2 address and stored into the entry of the track table 20. When a BN2 address is output from the track table 20 over bus 29, it is used to address the second-level active table 40. If the valid bit 77 of the entry is invalid, the BN2 address is used to access the second-level cache memory 42; a first-level cache block is read out and stored into a first-level cache block of the first-level cache memory 22, the first-level cache block number BN1X is stored into field 76 of the second-level active table 40, and the BN1X is stored into the track table 20; the BN1X may also be bypassed onto bus 29 for use by the tracker. In this embodiment, the addresses of track entries in the second-level track table 88 may be in BN3 or BN2 format, and the addresses of track entries in the track table 20 may be in BN2 or BN1 format. Under an alternative policy, only BN1 addresses are filled into the track table 20: if the address on bus 89 is in BN2 format and the valid bit 77 of the addressed entry of the second-level active table 40 is invalid, the BN2 address is used to access the second-level cache memory 42, a first-level cache block is read out and stored into a first-level cache block of the first-level cache memory 22, the first-level cache block number BN1X is stored into field 76 of the second-level active table 40 with the corresponding field 77 set valid, and the BN1X is stored into the track table 20 (and may also be bypassed onto bus 29 for the tracker); if bit 77 in table 40 is valid, the BN1X in field 76 of the entry directly fills the track table 20 and is bypassed onto bus 29 for use.
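The alternative BN1-only policy above can be sketched as follows (hypothetical Python; the dictionary stands in for the second-level active table 40 and `load_block` for the hardware that copies a block into the first-level cache):

```python
def resolve_to_bn1(bn2x, bny, l2_active_table, load_block):
    """Return the BN1 address (BN1X, BNY) for a BN2 address, loading the
    block into the first-level cache on a first miss."""
    valid_77, bn1x = l2_active_table[bn2x]
    if not valid_77:
        bn1x = load_block(bn2x)               # read block from L2 into L1
        l2_active_table[bn2x] = (True, bn1x)  # field 76 = BN1X, field 77 valid
    return (bn1x, bny)
```

On every later resolution of the same BN2X, field 77 is already valid, so the BN1X in field 76 is used directly, matching the last case in the text.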
Please refer to FIG. 9, which is an embodiment of the indirect branch target address generator of the processor system of the present invention. An indirect branch target address is generally obtained by adding a base address stored in the register file within the processor core to the branch offset contained in the indirect branch instruction. In FIG. 9, 93 is an adder, 39 is the IRB, 95 is a plurality of registers with comparators, and 96 is a plurality of registers; 95 and 96 stand in a CAM-RAM relationship, in one-to-one correspondence. 98 is a selector. In addition, 15, 11, 12, and 13 are the contents of the entries output by the track table 20 over bus 29. The system allocates one group of registers 95 and 96 for each indirect branch instruction, while the adder 93 and the IRB 39 are shared by all indirect branch instructions. In the track-table-20 entry of an indirect branch instruction, field 15 (SBNY) and field 11 (type) are defined as in FIG. 1; field 12, however, is instead used to store a register file (RF) address, and field 13 to store the group number of registers 95, 96. When the scanner 43 decodes a scanned instruction as an indirect branch instruction, fields 15 and 11 of the track table entry are produced as described above, the base-address register file number in the instruction is placed in field 12, and field 13 is set 'invalid'. When an entry corresponding to an indirect branch instruction is output from the track table 20 over the bus for the first time, its 'invalid' field 13 causes the system to allocate it a group of registers 95, 96 (a group contains a plurality of CAM-RAM rows), and the group number of that register group is stored into field 13 of the track table entry. Field 15 of the track table entry addresses the IRB 39, from which the branch offset of the indirect branch instruction is read and sent to one input of the adder 93. The register file is addressed with field 12 of the track table entry to read the base address; or, as shown in FIG. 9, the write address of the register file is monitored, and when that write address equals the address in field 12 of the track table entry, bus 94, which writes the execution result from the execution unit in the processor core back to the register file, is connected to the other input of the adder 93. The output 46 of the adder 93 is the branch target address, which is sent to the TLB/tag unit 51 for matching. At the same time, the base address on bus 94 is stored into an available row of the registers 95 in the register group pointed to by field 13 of the track table entry; the BN1 address resulting from matching the branch target instruction is stored over bus 89 into the same row of the registers 96 of the group pointed to by field 13.
When field 13 is 'invalid', or when it is 'valid' but the base address on bus 94 does not match the contents of registers 95, the selector 98 selects the BN1 address on bus 89 for output over bus 99. When the type of the entry on bus 29 is an indirect branch instruction, the address on bus 99 is used by the tracker 47; when the entry type is any other type, the address on bus 29 is selected for the tracker 47. The next time the same indirect branch instruction is executed, the register group number in field 13 of the track table entry on bus 29 selects the corresponding register groups 95 and 96, and the register file address in field 12 selects the data on bus 94 being written back to that register file entry for comparison with the contents of registers 95. On a match, the BN1 address in the corresponding row of registers 96 is output over bus 97 and selected by the selector 98 for use by the tracker; on a mismatch, the adder 93 computes the indirect branch target address as described above, the address is matched into a BN1 address placed on bus 89, and the selector 98 selects the address output on bus 89. A mismatch also causes the base address on bus 94 and the BN1 address on bus 89 to be stored into an unused row of registers 95, 96. Replacement logic is responsible for allocating register groups 95, 96 to entries of indirect branch type on bus 29 whose field 13 is 'invalid'; the policy may be LRU or similar. In this way, this embodiment can map the base address of an indirect branch instruction directly to a first-level cache address BN1, eliminating the steps of address calculation and address mapping.
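One register group 95/96 thus behaves like a small CAM-RAM keyed on the base address; a hypothetical Python sketch (class and method names are illustrative, not from the disclosure):

```python
class IndirectBranchCache:
    """Registers 95 (keys, with comparators) and 96 (values) of one group."""
    def __init__(self, rows=4):
        self.keys_95 = [None] * rows
        self.vals_96 = [None] * rows
        self.next_row = 0  # simple rotating fill; the text permits LRU etc.

    def lookup(self, base_address):
        """Return the cached BN1 address on a match (output on bus 97),
        else None (fall back to adder 93 and TLB/tag matching)."""
        for key, val in zip(self.keys_95, self.vals_96):
            if key == base_address:
                return val
        return None

    def fill(self, base_address, bn1):
        """On a mismatch, store the base address and its BN1 address."""
        row = self.next_row % len(self.keys_95)
        self.keys_95[row], self.vals_96[row] = base_address, bn1
        self.next_row += 1
```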
Please refer to FIG. 10, which is a schematic diagram of the pipeline structure of the processor core in the processor system of the present invention. 100 is the typical pipeline of a conventional computer or processor core, divided into I, D, E, M, and W stages, where I is the instruction fetch stage, D the instruction decode stage, E the execution stage, M the data access stage, and W the register write-back stage. 101 is the pipeline of the processor core in the present invention, which has one stage fewer than 100: the I stage. A conventional processor core generates an instruction address that is sent to a memory or cache in order to read (pull) instructions. The cache system of the present invention pushes instructions to the processor core automatically; the processor core need only provide a branch decision 31 to determine the program flow and a pipeline stall signal 32 to synchronize the cache system with the processor core. The pipeline of a processor core using the cache system of the present invention therefore differs from a conventional pipeline in that no instruction fetch stage is needed. Furthermore, a processor core using the cache system of the present invention does not need to maintain an instruction address (Program Counter, PC). As described with FIG. 9, the indirect branch target address is generated from the base address in the register file, so no PC address is needed. All other instructions are likewise accessed through the BN addresses of the cache system, without a PC. Hence no PC need be maintained in a processor core using the cache system of the present invention.
Please refer to FIG. 11, which is another embodiment of the processor system of the present invention. FIG. 11 is an improvement on the embodiment of FIG. 8, in which the third-level active table 50, the third-level cache TLB and tag unit 51, the third-level cache memory 52, the selector 54, the scanner 43, the second-level track table 88, the second-level active table 40, the second-level cache memory 42, the track table 20, the first-level cache correlation table 37, the first-level cache memory 22, the instruction read buffer 39, the trackers 47 and 48, and the processor core 23 function the same as the identically numbered modules in the embodiment of FIG. 8. A second-level correlation table 103 is added, as well as 102, the indirect branch target address generator shown in the embodiment of FIG. 9. The caches in the embodiment of FIG. 11 are organized in the same manner as in the embodiments of FIG. 5 and FIG. 8.
The second-level correlation table 103 is similar in structure to the correlation table 37. For each second-level cache block it holds a count value, the third-level cache address corresponding to that second-level cache block, and the source addresses (with their valid signals) of the branch source instructions whose branch targets lie in this second-level cache block (compare the CT format in FIG. 7). As in the correlation table, the count value is the number of branch source instructions. When the scanner 43 generates the track corresponding to a second-level cache block and fills it into the second-level track table 88, the BN2-format branch target address in each filled track entry addresses a row of the second-level correlation table 103 (hereinafter the target row): the second-level cache address of the track being filled into the second-level track table 88 (hereinafter the source track) is filled into a source-address field of the target row, its valid signal is set 'valid', and the target row's count is incremented by '1'. The third-level cache address corresponding to the source track is also filled into the row of the second-level correlation table 103 corresponding to the source track. In addition, when the address in an entry filled into the second-level track table 88 is in BN3 format, the entry of the third-level active table 50 addressed by that BN3 address has its count value 79 incremented by '1'.
When the address format of an entry on output 29 of the track table 20 is BN2, it is used to address the second-level active table 40. If the corresponding entry is invalid, the instruction block must be read from the second-level cache memory 42 with that BN2 address (hereinafter the source BN2 address) and filled into the first-level cache block of the first-level cache 22 designated by the replacement logic. At this time the source BN2 address addresses the second-level track table 88, which outputs the corresponding track for storage into the track table 20. When output 89 of table 88 carries a BN3-format address (hereinafter the target BN3 address), that target BN3 address is sent to the third-level active table 50 to be mapped to a BN2 address (hereinafter the target BN2 address); at this time the count value in the third-level active table entry pointed to by the target BN3 is decremented by '1', while the count in the target row of the second-level correlation table 103 pointed to by the target BN2 address is incremented by '1'. The target BN3 address is stored into the same target row, and the source BN2 address is also stored into the same target row with its corresponding valid bit set 'valid'.
When a second-level cache block is to be replaced, the second-level pointer 78 points to the target row in the second-level correlation table 103 corresponding to that replaceable second-level cache block. Each valid BN2 source address is read from that row, and with each such BN2 source address the second-level track table 88 is addressed; the BN2 target address in the corresponding entry (which points to the said target row) is replaced with the BN3 target address held in the target row of table 103, and the valid bits of the BN2 source addresses in the target row of table 103 are set 'invalid'. At this time the count in the target row of table 103 is decreased by the number of valid BN2 source addresses, and the entry of the third-level active table 50 addressed by the BN3 target address has its count value 79 increased by the same amount by which the count in table 103 was decreased.
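The bookkeeping when a second-level cache block is replaced can be sketched as follows (hypothetical Python; plain dictionaries stand in for the row of table 103, the entry of table 50, and the track table 88):

```python
def replace_l2_block(target_row, l3_entry, l2_track_table_88):
    """Rewrite every valid BN2 source entry in table 88 back to the saved
    BN3 target, invalidate those sources, and move the reference count
    from the row of table 103 to count 79 of the third-level active table."""
    moved = 0
    for i, (src, valid) in enumerate(target_row['sources']):
        if valid:
            l2_track_table_88[src] = target_row['bn3']  # BN2 -> BN3
            target_row['sources'][i] = (src, False)
            moved += 1
    target_row['count'] -= moved
    l3_entry['count_79'] += moved
```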
The cache replacement methods described above are all based on an inclusive cache, i.e., the contents of a higher cache level must also reside in the lower cache levels. The least-correlated cache replacement method can also be applied to a non-inclusive (non-exclusive) cache. A lock signal bit may be added to the correlation-table entry corresponding to a higher-level cache block. When the lock bit is '0', operation is as described above; when the lock bit is '1', the corresponding cache block may be replaced only when its degree of correlation is '0', i.e., when no branch instruction targets the cache block (here the end entry of the sequentially preceding instruction block is also regarded as holding an unconditional branch instruction). In the correlation table 37, this means a first-level cache block whose lock bit is '1' may be replaced only when its corresponding count value 70 is '0' and all its valid bits 73 are '0'. In the second-level correlation table 103, a second-level cache block whose lock bit is '1' may be replaced only when its corresponding count value and all its valid bits are '0'.
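The lock-bit eligibility rule for the correlation table 37 can be sketched as follows (hypothetical Python; `ordinary_policy` stands for the unlocked behaviour described earlier and is an assumption of this sketch):

```python
def l1_block_replaceable(lock_bit, count_70, valid_bits_73, ordinary_policy):
    """A locked first-level block may be replaced only at correlation
    degree '0': count value 70 is 0 and every valid bit 73 is 0."""
    if lock_bit:
        return count_70 == 0 and all(b == 0 for b in valid_bits_73)
    return ordinary_policy()
```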
For example, when the third-level cache is to replace the third-level cache block of one way of a set, the BN3 address on the third-level pointer 83 can address the entry in the third-level active table 50, and all valid BN2 addresses in it can address the rows of the second-level correlation table 103 and set their lock signals to '1'. The third-level cache block can then be replaced, after which the cache operates in a non-inclusive state. For a second-level cache block whose lock signal is set to '1', the corresponding third-level cache block has already been replaced, so the integrity of the control-flow information can no longer be preserved by replacing the BN2 address in an entry of the second-level track table 88 with the corresponding BN3 address; such a second-level cache block may be replaced only once its degree of correlation reaches '0'.
If every higher-level cache is assumed to have a lock signal of '1', so that a higher-level cache block may be replaced only when its degree of correlation is '0', and a third-level cache block is made replaceable when, in the active-table entry corresponding to that cache block, the valid bits of all higher-level sub-cache blocks (e.g., 81 in the third-level active table 50) are '1' and the count value in the entry (e.g., 79 in table 50) is '0', then the cache is organized exclusively. The replacement policy may also be set so that cache blocks at all cache levels are replaced when their degree of correlation is '0'.
In FIG. 11, 102 is the indirect branch target address generator of the embodiment of FIG. 9. Controlled by the entries on bus 29 output by the track table 20, it obtains the base address 94 from the processor core 23 and generates the indirect branch target address 46, which is sent via the selector 54 to unit 51 for virtual-to-physical address translation and address mapping, outputting the BN1 branch target address 99 for use by the tracker 47. When the type of the entry on bus 29 is an indirect branch instruction, the tracker 47 selects the address 99 output by 102; when the entry type is any other instruction, the tracker 47 selects the address on bus 29 output by the track table 20. As the embodiment of FIG. 11 shows, all instructions are pushed by the cache system to the processor core 23; the processor core 23 provides the cache system only with the branch decision 31 and the base address 94 for indirect branches. The indirect branch target address generator 102 can also be applied to the embodiments of FIG. 4, FIG. 5, and FIG. 8 so that all instructions in them are pushed by the cache system to the processor.
The methods of the embodiments of FIG. 4, FIG. 5, FIG. 8, and FIG. 11 can further be applied to controlling the addressing of memory. Please refer to FIG. 12, which is an embodiment of the processor/memory system of the present invention. The embodiment of FIG. 12 applies the method, on the basis of the embodiment of FIG. 11, to memory outside the processor; the other embodiments may be extended by analogy. Below the dashed line in FIG. 12 are the functional blocks and connections within the processor, which are exactly the same as in the embodiment of FIG. 11 except that there is no third-level cache memory 52: the third-level active table 50, the third-level cache TLB and tag unit 51, the selector 54, the scanner 43, the second-level track table 88, the second-level active table 40, the second-level cache memory 42, the second-level correlation table 103, the indirect branch target address generator 102, the track table 20, the first-level cache correlation table 37, the first-level cache memory 22, the instruction read buffer 39, the trackers 47 and 48, and the processor core 23 function the same as the identically numbered modules in the embodiment of FIG. 11. Above the dashed line in FIG. 12, a memory 111 with its address bus 113 and a memory 112 with its address bus 114 are added; bus 115 sends the information blocks output by memory 112 to the second-level cache memory 42 in the processor below the dashed line for storage, and the instructions in that information are also scanned by the scanner 43, with branch instruction information extracted as in the previous embodiments. Memory 111 is organized as memory and is addressed by the memory address 113 that failed to match in the TAG of unit 51 (its source being the physical address obtained by mapping, through the TLB in 51, the virtual memory address generated by 102 or 43). Memory 112 is organized as a cache and is addressed by the third-level cache address 114, produced either by a match in the TAG of 51 or output by the second-level track table 88 over 89. In effect, the memory 112 outside the processor serves as the third-level cache memory in place of 52 in the embodiment of FIG. 11, while memory 111 is the lower-level memory described but not shown in FIGS. 4, 5, 8, and 11. Thus, compared with the embodiment of FIG. 11, the embodiment of FIG. 12 merely moves the last-level (third-level) cache memory (52 in FIG. 11) outside the processor (112 in FIG. 12); the two embodiments are logically equivalent. The caches in the embodiment of FIG. 12 (including the memory 112 serving as the third-level cache memory) are organized in the same manner as in the embodiment of FIG. 11.
The structure of the embodiment of FIG. 12 admits several different applications. In the first application, the memory 111 is a large-capacity memory with a large access latency, while the memory 112 is a smaller memory with a smaller access latency; that is, the memory 112 serves as a cache of the memory 111. Either memory may be built from any suitable storage device, such as a register or register file, static memory (SRAM), dynamic memory (DRAM), flash memory, hard disk (HD), solid-state drive (SSD), or any other suitable storage device including future forms of memory. The operation of this application is the same as in the embodiment of FIG. 11. That is, the scanner 43 scans the instruction blocks sent from the memory 112 to the second-level cache memory 42 via the bus 115 and computes the virtual branch target addresses of the direct branch instructions in them; each virtual branch target address is sent to the selector 54 (102 also generates the virtual branch target addresses of indirect branch instructions, which are sent to 54 via the bus 46). The address selected by 54 is mapped by the TLB in 51 to a physical address, which is then matched against the TAG in 51. If there is no match, the physical address is sent via the address bus 113 to the memory 111, and the corresponding instruction block is read out and stored into the memory 112 in the replaceable level-three cache block indicated by the aforementioned level-three cache replacement logic; that level-three cache block number is merged with the low-order address output by the selector 54 into a BN3 address, which is stored in the secondary track table 88. If there is a match, then as described in the previous embodiments, the way number obtained from the match is joined with the index address output by the selector 54, and so on, into a BN3 address used to address the level-three track table 50, and the BN2 address read out is stored in the secondary track table 88; if the entry in 50 is 'invalid', the BN3 address itself is stored in 88. The remaining operations are the same as in that embodiment and are not repeated here.
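The match-or-fetch flow just described can be illustrated with a minimal software model. The 4-way organization, the field values, and the round-robin stand-in for the replacement logic are illustrative assumptions only; the patent does not fix them.

```python
# Sketch of the TAG match in 51 and the formation of a BN3 address.
# 4 ways and round-robin victim selection are assumptions for illustration.

WAYS = 4

class Level3Tag:
    def __init__(self, sets):
        # tag_array[set][way] holds a stored physical tag, or None if empty
        self.tag_array = [[None] * WAYS for _ in range(sets)]
        self.next_victim = [0] * sets   # stand-in for the replacement logic

    def lookup(self, phys_tag, index):
        """Return (hit, way) for a physical address split into tag/index."""
        for way in range(WAYS):
            if self.tag_array[index][way] == phys_tag:
                return True, way
        return False, None

    def fill(self, phys_tag, index):
        """On a miss: pick a replaceable way, install the tag, return the way."""
        way = self.next_victim[index]
        self.next_victim[index] = (way + 1) % WAYS
        self.tag_array[index][way] = phys_tag
        return way

def make_bn3(way, index, low_bits):
    # BN3 = way number (65) joined with index (62) and the low-order address
    return (way, index, low_bits)
```

On a miss, `fill` models fetching the block from the memory 111 into the way chosen by the replacement logic; in both cases the resulting way number and index form the BN3 address stored into the secondary track table 88.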
A specific embodiment of the first application uses flash memory as the memory 111 and DRAM as the memory 112. Flash memory offers large capacity at low cost, but has a large access latency and a limited number of write cycles. DRAM offers smaller capacity at higher cost, but has a small access latency and an unlimited number of write cycles. The structure of the embodiment of FIG. 12 therefore exploits the respective strengths of flash and DRAM while masking their respective weaknesses. In this first application, 111 and 112 together serve as the main memory of the computer system; below 111 there are still lower storage levels such as a hard disk. The first application is suitable for existing computer systems and can use existing operating systems. In existing computers, memory is managed by the storage manager in the operating system, which records which memory is in use and which is free, allocates memory to a process when needed, and releases it after the process is done. Because this storage management is performed in software, its execution efficiency is relatively low.
The second application of the embodiment of FIG. 12 uses a non-volatile memory (such as a hard disk, solid-state drive, or flash memory) as the memory 111, and a volatile or non-volatile memory as the memory 112. In this second application, 111 serves as the hard disk of the computer, while 112 serves as the computer's main memory; but because 112 is organized as a cache, its storage management can be performed by the processor's hardware. In such a system structure, the storage manager in the operating system is used little or not at all for instructions. The instructions in the memory 111 are stored into the memory 112 block by block as described above; in one specific embodiment, each such instruction block may be a page of virtual memory, in which case each tag in the tag unit TAG in 51 may represent one page.
Assume the addresses in this specific embodiment have the format shown in FIG. 6. The memory 111 (hard disk) address 113 is divided into a tag 61, an index 62, a secondary sub-address 63, a primary sub-address 64, and an intra-block offset (BNY) 13. The memory 111 (hard disk) address in this example may have a larger address space than an ordinary main-memory address, so that it can address the entire hard disk; 63, 64, and 13 joined together form the offset within a page, and 61 and 62 joined together form the page number. The address BN3 of the memory 112 (the main memory, i.e., the level-three cache of the preceding embodiments) consists of a way number 65, the index 62, the secondary sub-address 63, the primary sub-address 64, and the intra-block offset (BNY) 13. The way number 65 joined with the index 62 forms the block address of the main memory 112, one block being one page as above; 65, 62, and 63 together address one second-level instruction block within a main-memory instruction block (page); and the fields other than the intra-block offset 13 are collectively called BN3X, which addresses one first-level instruction block within a main-memory instruction block (page). The address BN2 of the second-level cache consists of a second-level cache block number 67, the primary sub-address 64, and the intra-block offset (BNY) 13; the second-level cache block number 67 addresses one second-level cache block, and the fields other than the intra-block offset 13 are collectively called BN2X, which addresses one first-level instruction block within a second-level cache block. The address BN1 of the first-level cache consists of a first-level cache block number 68 (BN1X) and the intra-block offset (BNY) 13. The intra-block offset (BNY) 13 is the same in all four address formats, and the BNY portion does not change during address conversion. In the BN2 address format, the second-level block number 67 points to a second-level cache block, and the primary sub-address 64 points to one of the four first-level instruction blocks in that second-level cache block. Likewise, in the BN3 address format, the way number 65 and the index 62 point to a main-memory instruction block, the secondary sub-address 63 points to one of the several second-level instruction blocks in that main-memory instruction block, and the primary sub-address 64 points to one of the several first-level instruction blocks in the selected second-level instruction block.
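The field layout above can be sketched in software. The field widths used here (4-bit tag, 2 bits for each remaining field) are arbitrary assumptions chosen to keep the example small; the point is that the joined fields form each BN address and that the BNY 13 passes through every conversion unchanged.

```python
# Illustrative bit-level decomposition of the memory (hard disk) address
# into the fields 61, 62, 63, 64 and 13, with assumed field widths.

def split_memory_address(addr, tag_bits=4, index_bits=2, sub2_bits=2,
                         sub1_bits=2, bny_bits=2):
    """Split an address into tag 61, index 62, secondary sub-address 63,
    primary sub-address 64 and intra-block offset BNY 13."""
    bny  = addr & ((1 << bny_bits) - 1);   addr >>= bny_bits
    sub1 = addr & ((1 << sub1_bits) - 1);  addr >>= sub1_bits
    sub2 = addr & ((1 << sub2_bits) - 1);  addr >>= sub2_bits
    index = addr & ((1 << index_bits) - 1); addr >>= index_bits
    tag = addr & ((1 << tag_bits) - 1)
    return {"tag": tag, "index": index, "sub2": sub2, "sub1": sub1, "bny": bny}

def bn3_from_fields(way, f):
    # BN3 = way 65 | index 62 | sub2 63 | sub1 64 | BNY 13
    return (way, f["index"], f["sub2"], f["sub1"], f["bny"])

def bn2_from(block2, f):
    # BN2 = second-level cache block number 67 | sub1 64 | BNY 13;
    # note the BNY field is carried over unchanged
    return (block2, f["sub1"], f["bny"])
```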
When the operating system directs the processor of FIG. 12 to begin executing a new thread, the starting address of the new thread (in the memory 111 address format) is sent through the selector 54 (assuming, in this specific embodiment, that the selector 54 has a third input through which the starting address enters) to 51. The index 62 of the starting address addresses the tag unit TAG in 51, and the tag contents read out from each way are matched against the tag 61 of the starting address. If there is no match, fields 61 and 62 of the starting address address the memory 111 via the bus 113; the corresponding page (instruction block) is read out and stored into the memory 112, in the set indicated by the index 62 of the starting address, in the way specified by the way number 65 given by the main-memory (i.e., the level-three cache of the preceding embodiments) replacement logic; at the same time, fields 61 and 62 of the starting address are stored into the same way and the same set of the tag unit in 51.
Thereafter, or when the field 61 of the starting address matches the tag content in the tag unit, the system controller uses the above way number 65, the index 62 of the starting address, and the secondary sub-address 63 to read one second-level instruction block from the memory 112 (main memory) and store it into the second-level cache memory 42, in the second-level cache block specified by the second-level block number 67 given by the second-level cache replacement logic; that second-level block number 67 is stored into the entry 80 of the level-three active table 50 pointed to by the above 65, 62, and 63, and the valid bit 81 in that entry is set to 'valid'. The scanner 43 scans the second-level instruction block, extracts the branch instruction information in it, and generates a track that is stored in the secondary track table 88. The system controller then further joins the above second-level block number 67 with the primary sub-address 64 of the starting address to read one first-level instruction block from 42 and store it into the first-level cache memory 22, in the first-level cache block specified by the first-level block number 68 given by the first-level cache replacement logic. The corresponding track in the secondary track table 88 is also stored into the track table 20, and during this process the BN3-format addresses on the track are replaced by BN2 as described above; the first-level block number 68 is also stored into the entry 76 of the second-level active table 40 pointed to by the above 67 and 64, and the valid bit 77 in that entry is set to 'valid'. Finally, the system controller joins the above first-level block number 68 with the first-level intra-block offset BNY 13 of the starting address to form a BN1 address, which is placed into the register 26 in the tracker 47, so that the read pointer 28 points to the starting instruction of the thread in the first-level cache memory 22 and also to the corresponding entry in the track table 20. The subsequent push operations toward the processor core are similar to those of the preceding embodiments. In summary, the starting address of a new thread injected by the operating system, or a hard-disk address generated by the scanner 43 or the indirect branch address generator 102, is selected by the selector 54 and sent to the tag unit in 51 for matching. When the match succeeds, the resulting BN3 address addresses the level-three active table 50. If the entry output by 50 is 'valid', the BN2 in that entry addresses the second-level active table 40. If the entry output by 50 is 'invalid', the above BN3 address directly addresses the memory 112 (main memory), which outputs a second-level instruction block to the second-level cache memory 42. When the hard-disk address fails to match in the tag unit in 51, the memory 111 (hard disk) is addressed via the bus 113, and the corresponding instruction block (page) is read out and stored into the memory 112 (main memory), in the main-memory cache block designated by the cache replacement logic, overwriting the instruction block previously held in that cache block. This replacement from hard disk to main memory is controlled entirely by hardware and requires essentially no software operation. The replacement logic may use any of various algorithms such as LRU, NRU (not recently used), FIFO, or clock.
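As a concrete instance of one of the algorithms named above, the following is a generic textbook sketch of the clock policy selecting a replaceable block; it is not the patent's own replacement logic, only an illustration of one option it names.

```python
# Clock (second-chance) victim selection over a set of cache blocks.
# One reference bit per block; the hand sweeps, clearing bits until it
# finds an unreferenced block, which becomes the replacement victim.

class ClockReplacer:
    def __init__(self, nblocks):
        self.used = [False] * nblocks   # reference bit per cache block
        self.hand = 0

    def touch(self, block):
        self.used[block] = True         # block was accessed

    def pick_victim(self):
        while True:
            if not self.used[self.hand]:
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.used)
                return victim
            # give the block a second chance and move on
            self.used[self.hand] = False
            self.hand = (self.hand + 1) % len(self.used)
```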
If the address space of the above hard-disk address is greater than or equal to the address space of the memory 111, then in the embodiment of FIG. 12 no translation lookaside buffer TLB is needed in 51, and the hard-disk address is a physical address. The starting address injected by the operating system is then a physical address, and the main-memory address BN3 obtained by mapping this address (used to address the memory 112) is a mapping of a physical address; the remaining BN2 and BN1 addresses are mappings of the BN3 address, and therefore also mappings of physical addresses. The memory 111 (hard disk) is the virtual memory of the memory 112 (main memory), and the memory 112 (main memory) is a cache of the memory 111 (hard disk). There is therefore no situation in which a program's address space is larger than the main-memory address space. Multiple instances of the same program executing at the same time share the same BN3 addresses, while different programs executing at the same time necessarily have different BN3 addresses; thus the same virtual address in different programs at the same time is mapped to different BN addresses without confusion. In the push architecture, the processor core does not generate instruction addresses, so the physical hard-disk address can be used directly as the processor's address. There is no need, as in existing processor systems, for the processor core to generate a virtual address that is then mapped to a physical address to access memory.
The memory 111 and the memory 112 of the embodiment of FIG. 12 may be packaged together in one package as the memory. In the embodiment of FIG. 12, the interface between the processor and the memory adds, beyond the existing memory address bus 113 and instruction bus 115, the cache-address BN3 bus 114. Although the boundary between memory and processor in the embodiment of FIG. 12 is shown as the dotted line, some functional blocks may also be moved from one side of the boundary to the other. For example, the level-three active table 50 and the TLB and tag unit TAG in 51 may be placed on the memory side above the dotted line; the result is still logically equivalent to the embodiments of FIG. 12 and FIG. 11. In addition, one or more non-volatile memory 111 chips, one or more memory 112 chips, and the chip containing the portion below the dotted line in FIG. 12 (to which external interfaces may be added) can be interconnected through TSV vias and packaged in a single package as a complete computer of miniature physical size.
Refer to FIG. 13, which shows another embodiment of the processor/memory system of the present invention. The embodiment of FIG. 13 is a more general expression of the embodiments of FIGS. 8, 11, and 12. The memory 111, the level-three cache memory 112, the level-three active table 50, the level-three-cache TLB and tag unit 51, the selector 54, the scanner 43, the secondary track table 88, the secondary active table 40, the second-level cache memory 42, the secondary correlation table 103, the indirect branch target address generator 102, the track table 20, the primary correlation table 37, the first-level cache memory 22, the instruction read buffer 39, the tracker 47, the tracker 48, and the processor core 23 have the same functions as the identically numbered modules in the embodiment of FIG. 12. Newly added are a level-four active table 120, a level-four correlation table 121, and a level-four cache memory 122, addressed by the BN4 bus 123 generated by 51. Also newly added are a level-three track table 118 and a level-three correlation table 117; the latter stores the count values that in the embodiments of FIGS. 8, 11, and 12 were held in the level-three active table 50, so that the active tables of all levels have a consistent format. That is, in the embodiment of FIG. 13 there are no count values in 50; the count values are kept in 117.
In the embodiment of FIG. 13, the lowest level 111 of the memory hierarchy is a memory addressed by the memory address 113. All the other levels are caches of 111 at different levels, each addressed by the corresponding BN cache address. The lowest cache level, i.e., the level-four cache 122 in the figure, has a set-associative organization; all the higher cache levels are fully associative. The scanner 43 is located between the level-four cache memory 122 and the level-three cache memory 112. The TLB/TAG 51 belongs to the level-four cache. Every cache level above the level of the scanner 43 has a track table, namely 118, 88, and 20. Every cache level except the highest has an active table, namely 120, 50, and 40. Every cache level has a correlation table, namely 121, 117, 103, and 37. The formats of the various tables are shown in FIG. 14.
FIG. 14 shows the formats of the tables in the embodiment of FIG. 13. In the embodiment of FIG. 13, the format of the tag unit in 51 is the physical tag 86; the CAM portion of the TLB in 51 holds the thread number 83 and the virtual tag 84, and its RAM portion holds the physical tag 85. The thread number 83 and virtual tag 84 selected for output by the selector 54 in FIG. 13 are mapped to the physical tag 85 in the TLB; the index address 62 of the virtual address reads out the physical tags 86 in the tag unit, which are matched against 85 to obtain the way number 65. The way number 65 joined with the index address 62 of the virtual address forms the level-four cache block address 123. Alternatively, as described above, 51 may omit the TLB, and a physical address selected by the selector 54 is matched directly against the physical tags 86 in the TAG. In FIG. 14, each track table entry contains a type 11 and a cache block address BNX 12 and BNY 13, and may also contain SBNY 15 to determine the branch execution time point. The cache block address 12 in the track table of each level may be in the BN format of that level or of the level below; for example, 12 in the level-three track table 118 may be in BN3X or BN4X format. An active table entry contains the cache block number 76 of the corresponding sub-block, in the format of the cache block number of the level one above (for example, the level-three active table 50 stores BN2X), together with the corresponding valid bit 77. The function of an active table is to map a cache address of its level to a cache address one level higher. A correlation table entry contains a count value 70, which is the number of entries in the track tables of that storage level or higher that take the cache block as a branch target; the cache block number 71 of the corresponding block one level lower; and the track table entry addresses 72, with their corresponding valid bits 73, of the entries in that storage level that take the cache block as a branch target. The pointer 74 shared by the ways points, as described above, to the cache block that has gone unreplaced the longest; if the count value 70 of that cache block is below a preset replacement threshold, the block may be replaced. On replacement, the track table entries addressed by those 72 addresses whose 73 is 'valid' have their this-level cache block number replaced by the lower-level cache block number 71. The exception is the level-four correlation table 121, which contains only the count value 70 and none of 71, 72, and 73, because that level has no track table and no such address replacement within track table entries is needed.
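The replacement check and track-table fix-up just described can be modeled minimally as follows. The dict-based tables, the field names, and the threshold value are illustrative assumptions, not the hardware structures themselves.

```python
# A block is replaceable when its correlation-table count (field 70) is
# below a preset threshold; on replacement, every recorded track-table
# entry whose valid bit (73) is set is rewritten with the lower-level
# block number (71), so no track entry is left pointing at the evicted block.

THRESHOLD = 2   # assumed value for illustration

def try_replace(ct_entry, track_table):
    """ct_entry: {'count': int, 'lower_bn': str,
                  'sources': [(track_table_addr, valid_bit), ...]}"""
    if ct_entry["count"] >= THRESHOLD:
        return False            # still a frequent branch target; keep block
    for tt_addr, valid in ct_entry["sources"]:
        if valid:               # rewrite this-level BN with lower-level BN 71
            track_table[tt_addr] = ct_entry["lower_bn"]
    return True
```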
When an instruction block is transferred from the memory 122 (the level-four cache memory) over the bus to the level-three cache memory 112, the scanner 43 extracts the branch information in the instruction block, generates the track entry types, and also computes the branch target addresses. Each branch target address is selected by the selector 54 and sent to 51 to be matched against the tag unit. If there is no match, the branch target address addresses the memory 111 via the bus 113, and the corresponding instruction block is read out and stored into the memory 122, in a level-four cache block selected by the level-four cache replacement logic (the level-four active table 120, the level-four correlation table 121, and so on). If there is a match, the resulting BN4X address 123 addresses the level-four active table 120; if the entry in 120 is valid, the BN3X address in the entry is joined with the BNY of the branch target address to form a BN3 address, which is stored via the bus 125 into the entry of the level-three track table 118 corresponding to the branch instruction; if the entry in 120 is invalid, the BN4X address is joined directly with the above BNY to form a BN4 address that is stored into the entry in 118.
Refer to FIG. 15, which shows the address formats of the processor system in the embodiment of FIG. 13. A memory address is divided into a tag 61, an index 62, a tertiary sub-address 126, a secondary sub-address 63, a primary sub-address 64, and an intra-block offset (BNY) 13. The address BN4 of the level-four cache consists of a way number 65, the index 62, the tertiary sub-address 126, the secondary sub-address 63, the primary sub-address 64, and the intra-block offset (BNY) 13; the fields other than BNY 13 are collectively called BN4X. The address BN3 of the level-three cache consists of a level-three cache block number 128, the secondary sub-address 63, the primary sub-address 64, and the intra-block offset (BNY) 13; the fields other than the intra-block offset 13 are collectively called BN3X. The address BN2 of the level-two cache consists of a level-two cache block number 67, the primary sub-address 64, and the intra-block offset (BNY) 13; the fields other than the intra-block offset 13 are collectively called BN2X, which addresses one first-level instruction block within a level-two cache block. The address BN1 of the level-one cache consists of a level-one cache block number 68 (BN1X) and the intra-block offset (BNY) 13. The intra-block offset (BNY) 13 is the same in all four BN address formats, and the BNY portion does not change during address conversion.
When a second-level instruction block is filled from the level-three cache memory 112 into the level-two cache memory 42, the corresponding track is read out of the level-three track table 118 over the bus 119. A BN4-format address in a track entry addresses the level-four active table 120: if the entry in 120 is valid, its BN3X address is filled into the track entry in 118 and also bypassed onto the bus 119 to be stored into the corresponding entry of the secondary track table 88; if the entry in 120 is invalid, the BN4 address on the bus 119 addresses the memory 122, and the corresponding instruction block is read out and filled into the memory 112 at the level-three cache block pointed to by the BN3X address given by the level-three cache replacement logic (the level-three active table 50, the level-three correlation table 117, and so on). That BN3X address is stored into the entry of the level-four active table 120 pointed to by the above BN4 address, stored into the corresponding entry of the level-three track table 118, and also bypassed onto the bus 119 to be stored into the corresponding entry of the secondary track table 88. If the address output on the bus 119 is already a BN3X address, that BN3X address addresses the level-three active table 50: if the entry in 50 is valid, its BN2X address is stored into the corresponding entry of the secondary track table 88; if the entry in 50 is invalid, the BN3X address on 119 addresses the memory 112, and the corresponding second-level cache block is read out and stored into the level-two cache memory 42 at the level-two cache block pointed to by the BN2X address given by the level-two cache replacement logic (the secondary active table 40, the secondary correlation table 103, and so on); that BN2X is also stored into the entry of the level-three active table 50 addressed by the above BN3X, and into the secondary track table 88.
When a first-level instruction block is filled from the level-two cache memory 42 into the level-one cache memory 22, the corresponding track is read out of the secondary track table 88 over the bus 89. A BN3-format address in a track entry addresses the level-three active table 50: if the entry in 50 is valid, its BN2X address is filled into the track entry in 88 and also bypassed onto the bus 89 to be stored into the corresponding entry of the primary track table 20; if the entry in 50 is invalid, the BN3 address on the bus 89 addresses the memory 112, and the corresponding instruction block is read out and filled into the memory 42 at the level-two cache block pointed to by the BN2X address given by the level-two cache replacement logic (the secondary active table 40, the secondary correlation table 103, and so on). That BN2X address is stored into the entry of the level-three active table 50 pointed to by the above BN3 address, stored into the corresponding entry of the secondary track table 88, and also bypassed onto the bus 89 to be stored into the corresponding entry of the primary track table 20. If the address output on the bus 89 is already a BN2X address, that BN2X address addresses the secondary active table 40: if the entry in 40 is valid, its BN1X address is stored into the corresponding entry of the primary track table 20; if the entry in 40 is invalid, the BN2X address on 89 addresses the memory 42, and the corresponding first-level cache block is read out and stored into the level-one cache memory 22 at the level-one cache block pointed to by the BN1X address given by the level-one cache replacement logic (the primary correlation table 37, and so on); that BN1X is also stored into the entry of the secondary active table 40 addressed by the above BN2X, and into the primary track table 20.
When an instruction block is pushed from the level-one cache memory 22 to the processor core 23 or the IRB 39, the corresponding track is read out of the primary track table 20 over the bus 29. A BN2-format address in a track entry addresses the secondary active table 40: if the entry in 40 is valid, its BN1X address is filled into the track entry in 20 and bypassed onto the bus 29; if the entry in 40 is invalid, the BN2 address on the bus 29 addresses the memory 42, and the corresponding instruction block is read out and filled into the memory 22 at the level-one cache block pointed to by the BN1X address given by the level-one cache replacement logic (the primary correlation table 37, and so on). That BN1X address is stored into the entry of the secondary active table 40 pointed to by the above BN2 address, and into the corresponding entry of the primary track table 20. If the address output on the bus 29 is already a BN1 address, that BN1 address is stored into the register in the tracker 47 and becomes the read pointer 28, which addresses the track table 20 and the level-one cache memory 22 and pushes instructions to the processor core 23 or the IRB 39. This guarantees that, for any instruction in the level-one cache memory 22, its branch target and the sequentially next level-one cache block are at least already in the level-two cache memory 42 or in the process of being stored into 42. The remaining operations are as described in the preceding embodiments and are not repeated here.
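The per-level step repeated in the last three paragraphs — consult the active table, promote the address on a valid entry, otherwise fetch the block and record the mapping — can be condensed into one hedged sketch. The dictionary tables and callbacks stand in for the hardware arrays and buses; they are assumptions for illustration, not the patent's structures.

```python
# Generic fill step: a track-table entry holds a BN of some level; the
# active table of that level maps its block number to a block number one
# level higher (closer to the core). On a 'valid' entry the mapped BN is
# returned; on an 'invalid' entry the block is fetched into a replaceable
# block one level up and the mapping is recorded for reuse.

def resolve(entry_bn, active_table, fetch_block, alloc_block):
    """entry_bn: (level, block). active_table: dict block -> upper block.
    Returns the BN one level higher (smaller level number)."""
    level, block = entry_bn
    mapped = active_table.get(block)
    if mapped is not None:              # active-table entry 'valid'
        return (level - 1, mapped)
    new_block = alloc_block()           # replacement logic picks a block
    fetch_block(block, new_block)       # read block into the upper level
    active_table[block] = new_block     # record mapping (entry now 'valid')
    return (level - 1, new_block)
```

Applying `resolve` repeatedly models the cascade BN4 → BN3 → BN2 → BN1 that ends with the BN1 address loaded into the tracker 47.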
Although the embodiment of FIG. 13 is shown as an instruction-push memory/processor system that executes both legs of a branch simultaneously, its memory hierarchy can also serve processor cores of other organizations, such as an out-of-order multi-issue processor system in which the processor core generates addresses to access the level-one cache or the instruction read buffer. The method and system of the embodiment of FIG. 13 can also be applied to a data memory hierarchy and to data pushing, so that the memory hierarchy likewise pushes data to the processor core. For ease of explanation, the following embodiment assumes that the data memory has the same storage hierarchy as the instruction memory, i.e., a memory, a level-four cache, a level-three cache, a level-two cache, a level-one cache, and a data read buffer, corresponding to the levels of the instruction memory. The address formats of the data memory hierarchy are therefore the same as in the embodiment of FIG. 15, except that the memory address is now a data address rather than an instruction address, and each BN address may be a DBN (Data Block Number) address to distinguish it from a BN address, accommodating separate instruction and data caches. If at some storage level a single memory serves as a unified cache (storing both instructions and data), the addresses of that level are still named BN.
Each storage level likewise needs a data track table DTT, a data active table DAL, a data correlation table DCT, and pointers to support the operation of the data memory. Please refer to FIG. 16, which shows the formats of the data track table, the data active table, and the data correlation table. No branch target addresses need to be stored in the data track table DTT; it only stores the block address DBNX 132 of the sequentially next data block and its valid bit 133. Optionally, the block address 130 of the sequentially previous data block and its valid bit 131 may be added, for use when data is accessed in reverse order. Alternatively, the data track table may be omitted entirely. The format of the data active table DAL is the same as the active table AL formats 76, 77 shown in FIG. 14, in which field 134 stores the data block address DBNX and field 135 stores the corresponding valid bit. A data block address (such as block-2 address 67 in FIG. 15) addresses one row of the DAL of this level, and a sub-address (such as sub-2 address 64 in FIG. 15) addresses one group of 134, 135 within that row. If valid bit 135 is 'valid', the next-higher-level block address in 134 is read out of the DAL to access the next-higher-level data memory. That is, the data active table DAL maps a storage-level address to the address of the next higher storage level, while the data correlation table DCT stores only the corresponding next-lower storage-level address 136. Thus the DAL maps a storage-level address to the corresponding next-higher storage-level address, and the DCT maps a storage-level address to a next-lower storage-level address (DBLNX in FIG. 16 denotes a lower-level address). Pointer 137 is used for cache replacement. The data cache may use the replacement scheme disclosed in this invention for the instruction cache, but the correlation tables of the data cache contain no count values, because no branch instruction jumps into the data cache; replacement therefore need not consider replacing addresses in the track table that target a data cache block, nor record branch source addresses. The level-one cache only needs pointer 137 to record the last replaced cache block, with pointer 137 traversing in one direction, or replacement may use LRU, LFU, or similar policies. The level-two, level-three, and level-four caches are replaced as in the instruction cache: a cache block may be replaced as long as it has no corresponding cache block at a higher level. Pointer 137 of each level may traverse in one direction, reading out the entries of the active table; if all address fields in an entry are 'invalid', the corresponding cache block may be replaced. The level-one replacement scheme of the instruction cache disclosed in this invention may also use LRU, LFU, or similar policies.
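The two-way linkage between a DAL entry (mapping a block upward) and the higher level's DCT entry (mapping the same block downward) can be sketched in software terms. This is an illustrative model only, not the hardware; the class and method names are invented for the sketch.

```python
class DataActiveTable:
    """Illustrative model of one storage level's DAL, paired with the
    next higher level's DCT. The DAL maps this level's block address to
    the block address at the next higher (faster) level; the DCT at the
    higher level records this level's block address (field 136)."""

    def __init__(self, num_blocks, groups_per_row):
        # DAL: per (row, sub-address) group -> (higher-level DBNX 134, valid bit 135)
        self.dal = [[(None, False)] * groups_per_row for _ in range(num_blocks)]

    def map_up(self, block_addr, sub_addr):
        """Field 134/135 lookup: return the higher-level block address,
        or None when valid bit 135 is clear (block absent above)."""
        higher_dbnx, valid = self.dal[block_addr][sub_addr]
        return higher_dbnx if valid else None

    def link(self, block_addr, sub_addr, higher_dbnx, higher_level_dct):
        """Establish the bidirectional mapping when a block is filled
        upward: the DAL entry gains the higher-level block number, and
        the higher level's DCT records this level's block address."""
        self.dal[block_addr][sub_addr] = (higher_dbnx, True)
        higher_level_dct[higher_dbnx] = (block_addr, sub_addr)
```

A lookup that returns None corresponds to an 'invalid' valid bit 135, i.e. the block must first be fetched from this level and filled upward.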
The data-push memory hierarchy also uses a stride table 150 to record the stride, i.e., the difference between two consecutive data access addresses of the same data access instruction. Please refer to FIG. 17, which shows the stride table format and its operation. 150 is a memory in which each row corresponds to one data access instruction (such as LD or ST) and is addressed by the instruction address of that data access instruction. Each row contains a data address 138; in the following embodiments the format of 138 is DBN1, i.e., a level-one data cache address consisting of DBN1X and DBNY, similar to 68 and 13 in FIG. 15. Field 139 holds the status bits of 138. There are also several groups of strides, one of which is 140 with its corresponding valid bit 141; 142 and 143 are the strides of the other groups. Each stride group, such as 140 with its valid bit 141, is selected according to the branch-loop level of the data access instruction within the instruction segment. Please refer to the lower part of FIG. 17: the straight line represents sequential instructions executed in the direction of the arrow, the arcs represent backward branches, the crosses represent branch instructions, and the triangle represents a data access instruction. Here 146 is a data access instruction, and the row of the stride table 150 shown in the upper part of FIG. 17 corresponds to 146. When the backward branch corresponding to stride field 140 is resolved as 'taken', the inner-loop stride of data access instruction 146 is stored into stride field 140 of the row of 150 corresponding to 146. When that branch is resolved as 'not taken' and the branch corresponding to stride field 142 is resolved as 'taken', the middle-loop stride of 146 is stored into stride field 142 of that row. When the branches corresponding to 140 and 142 are both resolved as 'not taken' and the branch corresponding to stride field 143 is resolved as 'taken', the outer-loop stride of 146 is stored into stride field 143 of that row. Branch resolution is thus prioritized: the backward branch instruction immediately following the data access instruction has the highest priority, the priorities of the other backward branch instructions decrease in order, and a higher-priority branch instruction resolved as 'taken' masks the lower-priority branch instructions so that they do not affect the readout of the stride table 150. Forward branch instructions are not recorded in the stride table. An adder may add the data address DBN1 in field 138 of a row of 150 to the stride selected by branch resolution (such as 140) to obtain the next data address, with which the data storage hierarchy is accessed so that data is fetched in advance and pushed to the processor core.
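The prioritized stride selection and next-address computation described above can be illustrated as follows. This is a sketch only: the 'nearest backward branch wins' priority rule and the add-stride-to-DBN1 step come from the text, while the function names are invented.

```python
def select_stride(strides, taken_flags):
    """strides: stride fields in priority order (inner 140, middle 142,
    outer 143, ...); taken_flags: the resolutions of the corresponding
    backward branches in the same order. The first branch resolved as
    'taken' masks all lower-priority ones, as described for table 150."""
    for stride, taken in zip(strides, taken_flags):
        if taken:
            return stride
    return None  # no backward branch taken: the loop was exited

def next_data_address(dbn1, strides, taken_flags):
    """Adder stage: add the selected stride to the DBN1 address held in
    field 138 to obtain the next prefetch address."""
    stride = select_stride(strides, taken_flags)
    return None if stride is None else dbn1 + stride
```

For example, with strides 4/64/1024 for the inner/middle/outer loops, a taken inner branch always selects the stride 4 regardless of the outer branches' outcomes.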
Please refer to FIG. 18, which shows another embodiment of the processor/memory system of the invention. The left half of FIG. 18 is an instruction-push processor system similar to the FIG. 13 embodiment, and the right half is a data-push memory hierarchy. The level-three track table 118, level-three correlation table 117, level-three cache memory 112, level-three active table 50, the TLB and tag unit 51 of the level-three cache, scanner 43, level-two track table 88, level-two active table 40, level-two cache memory 42, level-two correlation table 103, indirect branch target address generator 102, track table 20, level-one correlation table 37, level-one cache memory 22, instruction read buffer 39, tracker 47, tracker 48, and processor core 23 have the same functions as the identically numbered modules in the FIG. 13 embodiment. The functions of memory 111, level-four active table 120, level-four correlation table 121, and level-four cache memory 122 are similar to those in the FIG. 13 embodiment, except that they store not only instructions but also data and data-related auxiliary information such as data cache block numbers. An entry of the level-four active table 120 may store either a level-three instruction cache address BN3 or a level-three data cache address DBN3. Selector 54 is now a three-input selector. In addition to the instruction-scanning function of the FIG. 13 embodiment, scanner 43 also calculates, for data blocks passing over bus 115, the address of the sequentially next data block (or the previous data block in reverse order). The right half contains a level-three data cache memory 160, a level-two data cache memory 161, a level-one data cache memory 162, a data read buffer 163, the stride table 150, a level-three data track table 164, a level-two data track table 165, a level-one data track table 166, a level-three data active table 167, a level-two data active table 168, adders 169, 170, 171, 172, 173, a level-three data correlation table 174, a level-two data correlation table 175, a level-one data correlation table 176, and a selector 192.
In FIG. 18, memory 111 is addressed by memory addresses, memory 122 has a set-associative cache organization, and the caches at all other levels are fully associative. As in the FIG. 13 embodiment, memory 111 of FIG. 18 may serve as the main memory of the processor/memory system, in which case 122 is the last-level cache of the processor and is a unified cache. Alternatively, the system may be organized with 111 as the hard disk, in which case 122 is the main memory organized as a cache, 112 is the last-level instruction cache of the processor, and 160 is the last-level data cache of the processor. The instruction pushing of the left half of the FIG. 18 embodiment is identical to that of the FIG. 13 embodiment and is not repeated here. The data-push process of the right half is described below. The entries of the data read buffer (DRB) 163 correspond one-to-one with the entries of the IRB instruction read buffer 39. When a data load instruction in the IRB is pushed by the IPT pointer 38 into processor core 23 for execution, the data in its corresponding DRB entry is also read out by 38 and pushed over bus 196 to processor core 23 for processing. The task of the data storage hierarchy is therefore to fill, in advance, the data that the processor core will need into the DRB entries corresponding to the data access instructions in the IRB, so that the data is pushed to processor core 23 along with the instructions (the data and the instruction are not necessarily pushed at the same time, because the pipeline stage in which the processor core executes a data load instruction is usually not the stage in which the corresponding data enters the processor core).
When a level-one instruction block is stored into IRB 39, its corresponding entries in DRB 163 are cleared. When the decoder (the instruction decoder in processor core 23, or a dedicated instruction decoder attached to IRB 39) decodes an instruction being sent to processor core 23 as a data load instruction, the system allocates a row of the stride table 150 for its exclusive use. Status field 139 of that row is set to '0'. Based on this status of '0', the system has processor core 23 execute the data load instruction; the data address it generates is output over bus 94, bypasses 102, and is sent via bus 46 and selector 54 to 51 for matching. If there is no match, then as in the foregoing FIG. 13 embodiment the data address accesses memory 111 over bus 113, a level-four data block is read out and stored into the level-four cache block of memory 122 pointed to by the way number supplied by the level-four cache replacement logic (65 in FIG. 15) concatenated with index 62 of the data address. The data address is also stored into the entry of the tag unit in 51 likewise pointed to by 65 and 62.
The system further reads a level-three data block out of memory 122 using the above 65, 62 together with the level-three sub-address 126 of the data address, and stores it over bus 115 into the level-three cache block of the level-three data cache memory 160 specified by the level-three data block number 128 supplied by the level-three data cache replacement logic. It stores that level-three block number 128 into the entry field of the level-four active table 120 pointed to by 65, 62, and 126 and sets the field to 'valid'. At the same time, the 65 and 62 (the level-four block number) are stored into the entry of the level-three correlation table 174 pointed to by the above 128. In addition, scanner 43 calculates the address of the data block sequentially following the above level-three data block (i.e., the data address plus the size of one level-three data block) and sends it to the tag unit in 51 for matching to obtain a BN4 address; that BN4 address accesses the level-four active table 120 and is mapped to a DBN3X address, which is concatenated with the DBNY 13 of the data address to obtain a DBN3 address. The resulting DBN3 or BN4 address is stored into field 132 of the entry of the level-three data track table 164 pointed to by the above 128. If the sequentially next level-three data block lies within the same cache block, '1' is added to the above 126 and the result is concatenated with the original 65, 62 to obtain the DBN3 address of the sequentially next level-three data block directly, without mapping through the tag unit in 51. Optionally, this sequentially next level-three data block may also be filled into the level-three cache memory 160, with the corresponding entries in 120 and 174 filled as above; generally the sequential successor of that next level-three data block need not also be filled into 160.
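The successor-address arithmetic performed by scanner 43 can be expressed directly. The block size and sub-address width below are illustrative assumptions, not values fixed by the specification.

```python
L3_BLOCK_SIZE = 256   # assumed bytes per level-three data block
SUB3_COUNT = 4        # assumed level-three sub-blocks per level-four block

def next_l3_block_address(data_address):
    """Scanner-43 style computation: the data address plus one
    level-three block size, aligned to a block boundary."""
    return (data_address // L3_BLOCK_SIZE + 1) * L3_BLOCK_SIZE

def next_within_l4(sub3_index):
    """Shortcut from the text: when the successor stays inside the same
    level-four cache block, sub-address 126 is simply incremented instead
    of re-mapping through the tag unit in 51. Returns None on overflow,
    i.e. when the tag-unit path is required."""
    nxt = sub3_index + 1
    return nxt if nxt < SUB3_COUNT else None
```

The None case corresponds to the successor crossing the level-four block boundary, where the full tag-unit match must be used.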
The system further reads a level-two data block out of the level-three data cache memory 160 using the above 128 together with the level-two sub-address 63 of the data address, and stores it into the level-two cache block of the level-two data cache memory 161 specified by the level-two data block number 67 supplied by the level-two data cache replacement logic. It stores that level-two block number 67 into the entry field of the level-three data active table 167 pointed to by 128, 63 and sets the field to 'valid'. At the same time, the 128 (the level-three block number) is stored into the entry of the level-two correlation table 175 pointed to by the above 67. Optionally, '1' may be added to the above 63 at this time, and the level-three active table 167 addressed with 128 concatenated with the incremented 63. If the entry is 'valid', the sequentially next level-two cache block is already in the level-two cache. If the entry is 'invalid', a level-two data block is read out of the level-three data cache memory 160 at the address formed by 128 and the incremented 63, and stored into the level-two data cache block of 161 pointed to by another level-two block number 67 supplied by the level-two cache replacement logic; that other 67 is stored into the entry of 167 addressed by 128 and the incremented 63, and the entry is set to 'valid'.
If the address of the sequentially next level-two data block crosses the boundary of the level-three cache block, the entry of the level-three track table 164 pointed to by the above 128 is read out over bus 190. If the entry content is in BN4 format, that BN4 address accesses the level-four active table 120 over bus 197. If the entry of 120 is 'valid', the DBN3 address in the entry is stored into the entry of 164 pointed to by 128, replacing the original BN4; if the entry of 120 is 'invalid', the BN4 address on bus 197 accesses memory 122 to read out the sequentially next level-three data block and store it into memory 160, and the corresponding entries in 164, 167, 174, and 120 are filled in the manner described above. This guarantees that when the content of a level-three data block is stored into the level-two data cache, its sequentially next level-three data block is stored into the level-three data cache. Optionally, when the entry of the level-three track table 164 pointed to by the above 128 is in DBN3 format, that DBN3 addresses the level-three active table 167 over bus 190 as described above, so that the sequentially next level-two data block following the level-two data block being filled into the level-two cache memory 161 is also filled into 161. The previous data block in reverse order may of course also be stored into the data cache as needed, using field 130 of the track table. The data track tables 164, 165, 166 may also be omitted entirely; in that case the system has no capability of automatically filling sequential or reverse-order level-two data blocks that cross a level-three data cache block boundary. Pre-filling at the other data storage levels proceeds in the same way.
The system further reads a level-one data block out of the level-two data cache memory 161 using the above 67 concatenated with the level-one sub-address 64 of the data address, and stores it into the level-one cache block of the level-one data cache memory 162 specified by the level-one data block number 68 supplied by the level-one data cache replacement logic. It stores that level-one block number 68 into the entry field of the level-two data active table 168 pointed to by 67, 64 and sets the field to 'valid'. At the same time, the 67 (the level-two block number) is stored into the entry of the level-one correlation table 176 pointed to by the above 68. Optionally, the entry of the level-two track table 165 pointed to by the above 67 is read out at this time. If the entry content is in BN3X format, that BN3 address addresses the level-three active table 167 over bus 185; if the entry of 167 is 'valid', the BN2X address in the 167 entry is written back to 165 over bus 189 to replace the BN3X address. If the 167 entry is 'invalid', the address on 185 addresses the level-three data cache memory 160 to read out a level-two data block and store it into the level-two cache block of 161 pointed to by another level-two cache block address 67 supplied by the cache replacement logic. That other 67 is also stored into the entry of the level-three data active table 167 addressed by 185, and is stored into the level-two data track table 165 to replace the BN3X address. With that 67 address, corresponding entries are also created for the above level-two cache block in the level-two data active table 168 and the level-two data correlation table 175, the 175 entry storing the above BN3X address. This guarantees that when the content of a level-two data block is stored into the level-one data cache, its sequentially next level-two data block is stored into the level-two data cache.
The system further stores the above 68, together with the DBNY 13 of the data address, as a level-one data cache address DBN1 over bus 193 into field 138 of the row of the stride table 150 corresponding to the above data load instruction, and sets status field 139 of that row to '1'. Based on this status of '1', the system accesses the level-one data cache memory 162 with the above DBN1 and stores the read data into the entry of DRB 163 corresponding to the above data load instruction, so that the data can be pushed to processor core 23 along with the instruction for processing. After the data is pushed to processor core 23, the system begins prefetching the next data item into the DRB, to be pushed the next time the same data load instruction executes. Since status field 139 is now '1', the process of prefetching data for pushing is exactly as described above, except that when the new 68 and 13 (DBN1) are produced, this DBN1 is first subtracted from the previous DBN1 stored in field 138 of that row of the stride table 150, and the difference is stored as the stride into the entry selected by branch resolution at that time, such as 140. The new DBN1 is then written into field 138 to replace the old address, and status field 139 is set to '2'.
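The progression of status field 139 through '0', '1', and '2' can be modelled as a small state machine. Only the states and transitions are taken from the text; the class and method names are illustrative, and the stride-group selection by branch resolution is collapsed into a single stride for brevity.

```python
class StrideRow:
    """One row of stride table 150 for one data load instruction.

    status 0: no address seen yet - wait for the core's computed address.
    status 1: one address recorded in field 138 - the next computed
              address yields the stride by subtraction.
    status 2: stride known - further prefetch addresses are generated by
              adding the stride, without waiting for the processor core.
    """

    def __init__(self):
        self.status = 0
        self.dbn1 = None      # field 138
        self.stride = None    # selected stride field, e.g. 140

    def observe_address(self, dbn1):
        """Called with a data address computed by the processor core."""
        if self.status == 0:
            self.dbn1 = dbn1
            self.status = 1
        elif self.status == 1:
            self.stride = dbn1 - self.dbn1  # difference of two accesses
            self.dbn1 = dbn1
            self.status = 2

    def predict_next(self):
        """In status 2 the next address comes from the adder, not the core."""
        if self.status != 2:
            return None
        self.dbn1 += self.stride
        return self.dbn1
```

Once a row reaches status 2, `observe_address` is no longer needed on the common path; the row generates addresses autonomously, which is what lets the hierarchy push data ahead of execution.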
After this second data item is pushed to processor core 23, when a branch instruction following the data load instruction is resolved as 'taken', the system begins prefetching the next data item into the DRB, to be pushed the next time the same data load instruction executes. Since status field 139 is now '2', the system no longer waits for processor core 23 to calculate the data address. Instead, the DBN1 address in field 138 of the row of the stride table 150 corresponding to the data load instruction, and the stride selected by branch resolution (such as 140), are output directly and added in adder 173. The system performs a boundary check on output 181 of 173. If 181 does not cross the boundary of the level-one data cache block, selector 192 selects 181 to access the level-one data cache memory 162, and the read data is stored into the corresponding DRB entry awaiting push; the address on 181 is stored as DBN1 into field 138 of the corresponding row of the stride table. If 181 crosses the boundary of the level-one data cache block but not the boundary of the adjacent level-one cache block, 181 addresses the level-one data track table 166; the DBN1X address 132 of the sequentially next level-one data block (or the DBN1X address 130 of the previous data block in reverse order) is read out over bus 191, selected by selector 192, and concatenated with the DBNY address 13 on 181 to access memory 162, and the read data is stored into the corresponding DRB entry awaiting push. The concatenated address DBN1 is stored into field 138 of the corresponding row of the stride table 150. In both cases status field 139 in 150 remains '2'. If the address 132 output by 166 is in BN2X format, the system addresses the level-two data active table 168 with that BN2X over 191; if the 168 entry is 'valid', the BN1X address in the 168 entry is written back to 166 over bus 184 to replace the BN2X address. If the 168 entry is 'invalid', the address on 191 addresses the level-two data cache memory 161 to read a level-one data block and store it into the level-one cache block of the level-one data cache memory 162 pointed to by the level-one cache block address 68 supplied by the cache replacement logic. That 68 is also stored into the entry of the level-two data active table 168 addressed by 191, and is stored into the level-one data track table 166 to replace the BN2X address.
If 181 crosses the above boundary but not a level-two cache block boundary, the system addresses the level-one correlation table 176 with DBN1 address 138, mapping the DBN1 address to a DBN2 address output over bus 182. Adder 172 adds stride 140 to the DBN2 address on 182, and its output 183 addresses the level-two data active table 168. If the entry there is 'valid', the DBN1X address in the entry is concatenated with the DBNY 13 on 183 and accesses the level-one data cache memory 162 over bus 184; the read data is stored into the DRB entry awaiting push, and the DBN1 address on 184 is stored into field 138 of the corresponding row of the stride table 150, with field 139 kept at '2'. If the entry of the level-two data active table 168 is 'invalid', 183 addresses the level-two data cache memory 161 to read a level-one data block and store it into the level-one cache block of 162 specified by the level-one data block number 68 supplied by the level-one data cache replacement logic. The system concatenates that 68 with the DBNY on 183 into a DBN1 address to access 162; the read data is stored into the DRB entry awaiting push, and the DBN1 address is stored into field 138 of the corresponding row of the stride table, with field 139 kept at '2'.
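The cascaded boundary check on adder output 181 amounts to classifying how far the stride carries the address, which then selects the mapping path (direct access, track table, or correlation-table descent). A simplified sketch with assumed power-of-two block sizes; the real logic operates on DBN address fields rather than byte addresses.

```python
# Assumed byte sizes of the cache blocks at each level (powers of two).
L1_BLOCK, L2_BLOCK, L3_BLOCK, L4_BLOCK = 64, 256, 1024, 4096

def classify_boundary(old_addr, new_addr):
    """Decide which mapping path the prefetch must take, mirroring the
    successive checks on output 181 of adder 173."""
    if old_addr // L1_BLOCK == new_addr // L1_BLOCK:
        return "same L1 block"       # use 181 directly on memory 162
    if abs(new_addr // L1_BLOCK - old_addr // L1_BLOCK) == 1:
        return "adjacent L1 block"   # use track table 166 (field 132/130)
    if old_addr // L2_BLOCK == new_addr // L2_BLOCK:
        return "within L2 block"     # map down via table 176, add at L2
    if old_addr // L3_BLOCK == new_addr // L3_BLOCK:
        return "within L3 block"     # map down via table 175, add at L3
    if old_addr // L4_BLOCK == new_addr // L4_BLOCK:
        return "within L4 block"     # map down via table 174, add at L4
    return "beyond L4"               # go to tag unit 51 / memory 111
```

The farther the stride reaches, the lower the level at which the addition is performed, which is why each correlation table maps the address down exactly one level before the corresponding adder is used.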
If 181 crosses a level-two cache block boundary but not a level-three cache block boundary, the system addresses the level-two correlation table 175 with the DBN2 address on the above bus 182, mapping the DBN2 address to a DBN3 address output over bus 186. Adder 171 adds stride 140 to the DBN3 address on 186, and its output 188 addresses the level-three data active table 167. If the entry in 167 is 'valid', the DBN2X address in that entry is concatenated with the DBNY 13 on 188 and addresses the level-two data active table 168 over bus 189. If the entry in 168 is 'valid', the DBN1X address in that entry is directly concatenated with the DBNY 13 on bus 188 as a DBN1 address to access the level-one data cache memory 162 over bus 184; the read data is stored into the DRB entry awaiting push, and the DBN1 address is stored into field 138 of the corresponding row of the stride table, with field 139 kept at '2'. If the entry in 168 is 'invalid', the DBN2 address on bus 189 addresses the level-two data cache memory 161 to read a level-one data block and store it into the level-one cache block of the level-one data cache memory 162 pointed to by the level-one data cache block number 68 supplied by the level-one data cache replacement logic; that 68 is also stored into the entry of 168 addressed by bus 189, and the entry is set to 'valid'. The system concatenates that 68 with the DBNY on 189 into a DBN1 address to access 162; the read data is stored into the DRB entry awaiting push, and the DBN1 address is stored into field 138 of the corresponding row of the stride table, with field 139 kept at '2'.
If 181 crosses a level-three cache block boundary but not a level-four cache block boundary, the system addresses the level-three correlation table 174 with the DBN3 address on the above bus 186, mapping the DBN3 address to a BN4 address output over bus 196. Adder 170 adds stride 140 to the BN4 address on 196, and its output 197 addresses the level-four active table 120. If the entry in 120 is 'valid', the DBN3X address in that entry is concatenated with the DBNY 13 on 197 and addresses the level-three data active table 167 over bus 125. If the entry in 167 is 'valid', the DBN2X address in that entry is directly concatenated with the DBNY 13 on bus 125 as a DBN2 address to access the level-two data active table 168 over bus 189. If the entry in 167 is 'invalid', the DBN2 address on bus 189 addresses the level-two data cache memory 161, a level-one data block is read out and stored into the level-one cache block of the level-one data cache memory 162 pointed to by the level-one data cache block number 68 supplied by the level-one data cache replacement logic; that 68 is also stored into the entry of 168 addressed by bus 189, and the entry is set to 'valid'. Accessing the level-two data active table 168 with the DBN2 address on bus 189 and the subsequent operations are the same as described in the preceding paragraph. Finally, the system accesses 162 with the DBN1 address; the read data is stored into the entry of DRB 163 awaiting push, and the DBN1 address is stored into field 138 of the corresponding row of the stride table, with field 139 kept at '2'.
If 181 exceeds the level-4 cache block boundary, the system addresses the tag unit in 51 with the BN4 address on bus 196 described above, reads out the corresponding tag 61, and sends it via bus 113 to adder 169. 169 adds tag 61 to step size 140; the sum 198 is selected by selector 54 and sent to the tag unit in 51 for matching. If the match yields a new BN4 address, the level-4 active table 120 is addressed via bus 123 with that new BN4 address. If the entry in 120 is ‘valid’, the DBN3X address in the entry addresses the level-3 active table 167 via bus 125; the subsequent operations are the same as addressing 167 via bus 125 in the preceding paragraph. If the entry in 120 is ‘invalid’, memory 122 is addressed with the new BN4 address on bus 123, and the level-3 data block read out is filled into level-3 data cache memory 160, operating as described above. If there is no match in the tag unit, the address on bus 198 is placed on bus 113 to address memory 111, and the level-4 data block read out is stored into level-4 cache memory 122; this process has been described earlier in this embodiment and is not repeated. Finally the system accesses 162 with the DBN1 address obtained through the active-table mappings at each level, stores the read data into the DRB entry awaiting push, stores that DBN1 address into field 138 of the corresponding row of the step table, and keeps field 139 unchanged at ‘2’. If during this process the corresponding data block does not yet exist at some storage level, the system automatically reads the block from the next lower storage level and stores it into the cache block designated by the cache replacement logic at this level; the address of that cache block is also stored into the lower level’s active table, and the lower level’s cache block number is stored into this level’s correlation table, establishing a bidirectional mapping.
The above describes the push process for data loads. Data stores may use a similar method, or a conventional method such as a write buffer: when the data cache is idle, the data in the write buffer is written back to the data cache. When data is loaded speculatively using the step size in step table 150 (i.e., when field 139 in 150 is ‘2’), the processor core must send the correct data address via bus 49 for comparison with the speculated DBN1 address. If they differ, the speculatively loaded data and its subsequent execution results must be discarded, data is loaded with the correct data address on bus 49, the corresponding field 139 is set to ‘0’, and the step size is recomputed and stored into 150. If there is a write buffer, the speculated load address is also compared with the addresses in the write buffer to determine whether the loaded data has since been updated. The DBN address may be mapped to a data address for comparison with the data address on 49; alternatively, the address on 49 may be mapped to a DBN address for comparison with the DBN address produced by the system’s speculation. In addition, if the valid bit (e.g., 141) of the step size read out of step table 150 under a branch decision is ‘invalid’, a step size must again be generated under that branch-decision condition as described above and stored into the corresponding step field.
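The speculate-then-verify logic just described can be sketched in software terms as a small state machine per load instruction. The sketch below is a purely illustrative model under invented names (`StrideEntry`; state transitions simplified to 0 for learning and 2 for confirmed); it is not the claimed hardware, only the check-and-retrain behavior it performs.

```python
# Minimal software model of the stride-speculation check described above.
# Field 139 states (simplified): 0 = learning, 2 = stride confirmed.
# All names are illustrative; the embodiment describes hardware, not this API.

class StrideEntry:
    """One row of the step table: last address (138), stride (140), state (139)."""
    def __init__(self):
        self.last_addr = None
        self.stride = None
        self.state = 0          # 0: learning, 2: stride confirmed

    def guess(self):
        """Speculated next address (the pushed DBN1), if the stride is confirmed."""
        return self.last_addr + self.stride if self.state == 2 else None

    def update(self, actual_addr):
        """Compare the core's correct address (bus 49) with the guess.
        Returns True if the speculative push was correct."""
        g = self.guess()
        if g is not None and g == actual_addr:
            self.last_addr = actual_addr
            return True          # speculation correct, pushed data is usable
        # Wrong guess or still learning: discard speculative results,
        # recompute the stride from the two most recent correct addresses.
        if self.last_addr is not None:
            self.stride = actual_addr - self.last_addr
            self.state = 2
        else:
            self.state = 0
        self.last_addr = actual_addr
        return False
```

A mismatch costs one retraining step; thereafter the recomputed stride is used for speculation again.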
In the data memory hierarchy of the FIG. 18 embodiment, the lowest-level cache is set-associative; that level has a tag unit, and may also have a TLB for virtual-to-physical address translation. That level may be addressed by a memory address matched through the tag unit in 51, or directly by the cache address BN4. The data caches at all other levels are fully associative and are addressed by cache addresses DBN. The mappings between each DBN and BN4 are performed by active tables and correlation tables: the active tables map lower-level cache addresses to higher-level cache addresses, and the correlation tables map higher-level cache addresses to lower-level cache addresses. Refer to FIG. 19 for the mechanism.
FIG. 19 is a schematic diagram of the operating mechanism of the data cache hierarchy in the FIG. 18 embodiment. In FIG. 19, 200 is a level-4 cache block containing two level-3 cache blocks 201 and 202. Each level-3 cache block in turn contains two level-2 cache blocks; for example, 201 contains level-2 cache blocks 203 and 204. Each level-2 cache block in turn contains two level-1 cache blocks; for example, 203 contains level-1 cache blocks 205 and 206. Assuming the DBN1 address in field 138 of the current row of step table 150 points to level-1 cache block 205, the system uses the length of step size 140 to obtain, with the fewest mapping steps and the least delay, the next level-1 data cache address for the same data load instruction, so as to access level-1 data cache memory 162 in advance and store the read data into the corresponding DRB entry.
The following explanation uses FIGS. 18 and 19 together as an example. If the 138 address pointing to 205 is added to 140 and the sum does not exceed the boundary of 205, the sum 181 is used directly as the new level-1 data cache address to address level-1 data cache 162, and the read data is stored into the corresponding entry of DRB 163. If adding the 138 address to 140 yields a sum 181 that exceeds the boundary of 205 but does not exceed the boundary of level-2 cache block 203, the 138 address must be mapped from BN1 format (via level-1 correlation table 176 in the FIG. 18 embodiment) to BN2 format 182. Adder 172 adds address 182 to step size 140; the sum 183 addresses the entry of level-2 active table 168 corresponding to level-2 cache block 203, from which the DBN1X address of level-1 cache block 206 is read out and concatenated with the DBNY 13 on 183 to form a DBN1 that addresses level-1 data cache memory 162 and is also stored into field 138 of 150. If the entry of level-1 data track table 166 corresponding to cache block 205 holds the address of the sequentially next cache block 206, 166 may also be addressed directly with 181 (ignoring the carry-out bit in 181) to obtain the address of 206.
If the sum 181 exceeds the boundary of level-2 cache block 203, the DBN1 format of the 138 address must be mapped via level-1 correlation table 176 to DBN2 format 182, and the DBN2 format then mapped via level-2 correlation table 175 to DBN3 format 186, which is added to step size 140; the sum 174 addresses the entry of level-3 active table 167 corresponding to level-3 cache block 201, from which the DBN2 address 189 of level-2 cache block 204 is read out; 189 then addresses level-2 active table 168 to obtain the address DBN1 of level-1 cache block 207. That address may then address level-1 cache memory 162 via bus 184 to read data into DRB 163, and the address is stored into field 138 of 150. If the sum 181 exceeds the boundary of level-3 cache block 201, the DBN1 address in 138 is mapped via 176 to a DBN2-format address, then via 175 to a DBN3-format address, then via 174 to a BN4-format address; that BN4 address addresses level-4 active table 120 to obtain DBN3-format address 125; the DBN3 address addresses level-3 active table 167 to obtain DBN2 address 189; the DBN2 address addresses level-2 active table 168 to obtain the address DBN1 of level-1 cache block 207. That address may then address level-1 cache memory 162 via bus 184 to read data into DRB 163, and the address is stored into field 138 of 150.
In the embodiments of FIGS. 18 and 19, the cache blocks at the various levels of the data cache hierarchy form a tree structure. A level-4 cache block is the root of the tree, and the cache blocks at the other levels are its branches at different depths; the cache blocks at each level are in turn roots for the blocks at higher levels. Root and branches, and branches and branches, are connected into a tree by bidirectional address mappings. Starting from one level-1 branch (a level-1 cache block), any other level-1 branch under the same root (the same level-4 cache block) can be reached through these mappings. Only when the target falls outside the range of the root is matching by the tag unit in 51 required. If the target branch and the source branch belong to the same sub-root, fewer mapping levels are traversed; if they belong to different sub-roots, more mapping levels are traversed. The FIG. 18 embodiment can be improved to reduce the number of mapping levels.
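The number of mapping levels traversed depends on how far up the tree the source and target level-1 blocks first share an ancestor. Under the two-blocks-per-level structure of FIG. 19, this can be sketched as follows (the index arithmetic and function name are invented purely for illustration):

```python
# Count how many levels up the FIG. 19 tree a source and a target level-1
# block first share an ancestor, assuming every block contains exactly two
# blocks of the next higher level. 0 means the same L1 block; 3 means the
# two blocks share only the L4 root. Beyond the root, tag-unit matching
# in 51 would be required instead.

def mapping_levels(src_l1_index, dst_l1_index):
    level = 0
    while src_l1_index != dst_l1_index:
        src_l1_index //= 2      # climb one level: two children per block
        dst_l1_index //= 2
        level += 1
    return level
```

Fewer shared-ancestor levels mean fewer correlation-table and active-table lookups, which is exactly the cost the improvement described next reduces.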
Please refer to FIG. 20, an improved embodiment of the data cache hierarchy of the FIG. 18 embodiment. In FIG. 20, level-3 data cache memory 160, level-2 data cache memory 161, level-1 data cache memory 162, data read buffer 163; step table 150; level-3 data track table 164, level-2 data track table 165, level-1 data track table 166; level-3 data active table 167, level-2 data active table 168; adders 172 and 173; level-3 data correlation table 174, level-2 data correlation table 175; and selector 192 function the same as the identically numbered modules in the right half of FIG. 18. The format of level-1 data correlation table 176 is shown as 209. It stores not only the level-2 data cache block number DBN2X of a level-1 cache block, but also the corresponding level-3 data cache block number DBN3X and the level-4 cache block number DBN4X.
Its operation is similar to the FIG. 18 embodiment: the DBN1 address in field 138 of the row of step table 150 corresponding to the data load instruction, and the step size selected by the branch decision (e.g., 140), are output and added in adder 173. The system performs a boundary determination on output 181 of 173. If the boundary determination finds the result within the level-1 cache block, level-1 data cache memory 162 is addressed directly with 181. If the boundary determination finds it outside the level-1 cache block, a row 209 of level-1 correlation table 176 is addressed with the address on 138, the cache address of one level in 209 is selected according to the boundary determination, and adder 172 adds it to step size 140, producing the sum 183. If the boundary determination finds the result within the level-2 cache block, DBN2X in 209 is selected for the addition with 140, and the sum 183 is sent by the system to level-2 active table 168 for addressing; if within the level-3 cache block, DBN3X in 209 is selected, and the sum 183 is sent to level-3 active table 167 for addressing; if within the level-4 cache block, DBN4X in 209 is selected, and the sum 183 is sent to level-4 active table 120 for addressing. The remaining operations are the same as in the FIG. 18 embodiment and are not repeated. The FIG. 20 embodiment saves the reverse-mapping steps and latency from branch to root. In addition, a dedicated adder may be provided to add 140 to the address formed by concatenating BN4X in 209 with the DBNY 13 in 138; the sum is used to address the tag unit in 51, mapping the BN address to a data address for comparison with the correct data address on bus 49.
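The boundary determination that selects which entry of row 209 to use can be sketched as a same-block test at each granularity. The block sizes below are invented powers of two purely for illustration, with each block holding two blocks of the next higher level as in FIG. 19; the function name is likewise illustrative.

```python
# Illustrative sketch of the FIG. 20 boundary determination: one check per
# level decides which entry of correlation-table row 209 feeds the adder,
# and therefore which active table resolves the new address.

L1_BLOCK, L2_BLOCK, L3_BLOCK, L4_BLOCK = 8, 16, 32, 64  # bytes, invented

def select_level(addr, stride):
    """Return which level's table must resolve addr + stride (5 = tag unit)."""
    target = addr + stride                      # adder output, sum 181
    if target // L1_BLOCK == addr // L1_BLOCK:
        return 1   # still in the L1 block: address data memory 162 directly
    if target // L2_BLOCK == addr // L2_BLOCK:
        return 2   # DBN2X + stride -> level-2 active table 168
    if target // L3_BLOCK == addr // L3_BLOCK:
        return 3   # DBN3X + stride -> level-3 active table 167
    if target // L4_BLOCK == addr // L4_BLOCK:
        return 4   # DBN4X + stride -> level-4 active table 120
    return 5       # beyond the L4 block: tag-unit matching in 51
```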
Please refer to FIG. 21, an embodiment of prefetching data organized by logical relationships. Data may contain address pointers, i.e., be organized by logical relationships. This embodiment takes prefetching data organized as a binary tree as an example; prefetching data organized by other logical relationships can be deduced by analogy. 220-222 are data in memory, where 220 is a data item, 221 is the address pointer of the left branch of the binary tree, and 222 is the address pointer of the right branch. In FIG. 21, data cache memory 162, data read buffer 163, data track table 166, selector 192, instruction memory 22, IRB 39, and processor core 23 function the same as the identically numbered modules in FIG. 18. Some modules not shown in FIG. 21 function the same as the identically numbered modules in the FIG. 18 embodiment. A shifter 225, a learning engine 226, and a selector 227 are added. Comparison result 228 is taken from processor core 23. In this embodiment, the entries of data track table (DTT) 166 correspond one-to-one with the data entries of data memory (DL1) 162.
The learning engine 226 is responsible for generating the entries of data track table (DTT) 166. 230-232 are the entries in DTT 166 corresponding to data 220-222 in 162. Every entry in 166 has a ‘valid bit’; data-type entry 230 corresponds to data entry 220, and pointer entries 231 and 232 contain, in DBN format, the address pointers from 221 and 222 respectively. Data-type entries and pointer entries each have their own identifier so that the two can be distinguished. DBN-format addresses can directly address data memory 162.
Data read pointer 181 controls the reading of one row of track from data track table 166; if the DBNY value in the pointer is near the end of a row, the next row in address order is also read out according to the BN address in that row’s end track point, and sent to shifter 225. In 225, the one or two rows of track are shifted left by the amount indicated by the DBNY in data read pointer 181. Learning engine 226 receives the shifted plurality of entries, determines data-type entry 230 from the identifiers in those entries, and decides 226’s operations on pointer entries 231 and 232 according to the data type in data-type entry 230. Comparison result 228 produced by processor core 23 controls selector 227 to select one of the plurality of pointers output by 226 to place on data read pointer 181, addressing data memory (DL1) 162 to supply data to processor core 23.
For example, the data value in entry 220 of data memory 162 is ‘6’, entry 221 holds the 32-bit address ‘L’, and entry 222 holds the 32-bit address ‘R’. Correspondingly, the data type in entry 230 of data track table 166 is binary tree, and the control signal is the comparison result 228 produced by processor core 23 executing the instruction whose address is ‘YYY’; 231 holds the DBN-format address pointer ‘DBNL’ obtained by mapping the ‘L’ address pointer in 221, and 232 holds the DBN-format address pointer ‘DBNR’ obtained by mapping the ‘R’ address in 222. Learning engine 226 examines the plurality of entries from shifter 225 and selects data-type entry 230 according to the identifiers; based on the binary-tree data type in 230, 226 outputs entries 231 and 232 from shifter 225 to the two inputs of selector 227. Suppose the instruction at address ‘YYY’ compares the sought value ‘8’ with the value ‘6’ of 220 loaded into 23 from (DL1) 162, producing a comparison result 228 of ‘1’, meaning the sought value is greater than the value in current node 220. 226 observes address 28, which controls level-1 memory 22; after it reaches ‘YYY’, the comparison result 228 produced by the processor core controls selector 227. 228 at this point controls 227 to select the right-branch pointer ‘DBNR’ in entry 232 for output to data read pointer 181. If the valid bit in entry 232 is ‘valid’, the data pointed to by the right-branch pointer in 232 becomes the new current data. Selector 192 selects 181 to address 162 (DL1), and the new current data output is stored into DRB 163. 181 also addresses DTT 166, making 166 output the data track containing the new current data to shifter 225. The intra-block offset portion DBNY of the address on 181 controls shifter 225 to shift the data track left so that the data type, the DBNL address, and the DBNR address (formats as in 230, 231, 232) are aligned with the inputs of learning engine 226.
Each entry of DRB 163 corresponds to an intra-block offset address (Offset, DBNY). 162 (DL1) stores the entire data block into 163 (if the data defined by data type 230, such as 220-222, extends beyond one data block, then the portion starting at the ‘DBNR’ address and crossing into the next data block in address order is stored as well). Processor core 23 addresses DRB 163 with the Offset portion of the data address 94 produced by executing a load instruction, reading the current data and its left-branch and right-branch address pointers (formats as in 220, 221, 222). Processor core 23 executes the instruction comparing the sought value ‘8’ with the current data, producing comparison result 228.
Learning engine 226 monitors address 28, the comparison result 228 produced by processor core 23, data address 94, and the corresponding data 223 output by (DL1) 162, in order to generate data track entries to be stored into DTT 166. When the corresponding entry in 166 is ‘invalid’ (not yet established), the data cache system sends the data address 94 produced by processor core 23 to tag unit 51 (not shown) for matching and mapping into DBN address 184. 184 addresses data memory 162, and the read data is output to processor core 23 via 223. Learning engine 226 records the address on 94 and the data on 223 output by the entry of data memory 162 it addresses. 226 also compares each newly produced data address 94 with the previously recorded data on 223; if they are the same, learning engine 226 matches and maps the newly produced data address 94 and stores the resulting DBN into the data track table 166 entries corresponding to the data entries that held that same recorded 223 data, and sets those entries to ‘valid’. That is, the ‘DBNL’ obtained by matching and mapping the address pointer ‘L’ in 221 is stored into 231, and the ‘DBNR’ obtained by matching and mapping the address pointer ‘R’ in 222 is stored into 232. Alternatively, 226 may record and compare the mapped BN-format data and addresses.
226 judges a data memory 162 entry meeting the following conditions to be a ‘data’ (non-pointer) entry: the data address of the entry itself differs from the addresses of the above entries containing address pointers by only one or a few data lengths, and over a plurality of instruction loop iterations the data on 223 is never identical to a subsequent address on 94. The extent of the instruction loop can be determined from the address of a backward-jumping branch instruction in IRB 39 and its branch target instruction address. The data track table 166 entry corresponding to a ‘data’ entry in data memory 162 is a data-type entry. Learning engine 226 stores the observed rule (i.e., when address 28 is ‘YYY’, select the BN address in 231 if 228 is ‘0’, and the BN address in 232 if 228 is ‘1’) into the data track table entry (here 230) corresponding to the ‘data’ (here 220), and sets that entry to ‘valid’. The valid bit in a data-type entry may be a plurality of bits; for example, greater than a preset value is ‘valid’, and not greater than the preset value is ‘invalid’.
After the data track entries are established, the comparison result 228 produced by processor core 23 executing instructions controls selector 227 to select address pointers, moving data read pointer 181 along the binary tree. When a new data point is reached, according to its data type (e.g., 230), learning engine 226 controls reading the same group of data and its address pointers (e.g., 220-222) out of data cache 162 and storing them into DRB 163, ready to be read by the data address 94 produced by processor core 23. This avoids the delay of data address 94 addressing data memory 162 only after tag-unit matching. The access delay of data read buffer DRB 163 is a single clock cycle, generally also less than the access delay of 162.
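The traversal just described behaves like a binary search in which each step’s child node has already been staged before the compare executes. The sketch below is a software analogy only: the node fields stand in for entries 220-222, the comparison result for 228, and the staging variable for the DRB; all names are invented for illustration.

```python
# Software analogy of the FIG. 21 binary-tree walk: the comparison result
# selects the left (DBNL) or right (DBNR) pointer as the next data read
# pointer, and the chosen node group is staged ahead of the core's access.

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value      # data entry (as 220)
        self.left = left        # left-branch pointer (as 221 -> DBNL)
        self.right = right      # right-branch pointer (as 222 -> DBNR)

def search_with_prefetch(root, key, trace=None):
    """Walk the tree as the hardware would: each step stages the chosen
    child before the compare instruction reads it."""
    staged = root               # staging buffer holds the current node group
    while staged is not None:
        node = staged           # the core reads the staged copy, not memory
        if trace is not None:
            trace.append(node.value)
        if key == node.value:
            return True
        # comparison result: '1' selects DBNR, '0' selects DBNL
        staged = node.right if key > node.value else node.left
    return False
```

In the hardware the same selection is made by 227 under the control of 228, so the load that reads the next node never waits on tag matching.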
Further, the data read buffer can be organized in the manner of the FIG. 18 embodiment, i.e., the entries of 163 correspond one-to-one with the entries of IRB instruction read buffer 39. In this organization an additional field is added to each entry of data track table (DTT) 166 to record the address or tag of the instruction that reads the data in the data memory 162 entry corresponding to that entry (for example, the sequence number of the load instruction within the instruction loop, and the BNY address of the instruction). When learning engine 226 controls reading data out of 162 according to an entry in 166, the data is stored into the DRB 163 entry corresponding to the tag in that entry. When a load instruction in IRB 39 is pushed to the processor core for execution, the data in the DRB entry corresponding to that instruction’s IRB entry is also pushed to processor core 23 for use. This eliminates the load delay.
Learning engine 226 performs a form of learning. What is learned is stored in data track table 166 in the form of data types and address pointers. The data type read from the data track table is used to control 226’s own handling of the other entries read from the data track, such as moving a particular entry at 226’s input to a particular 226 output, or controlling the polarity of comparison result 228, so that selector 227, under the control of 228, selects the correct address pointer to place on data read pointer 181, addressing data memory 162 to output data (e.g., 220). The data type also controls 226 to generate and output one or more subsequent addresses (increments added to the correct pointer address, the increments being integer multiples of the data word length), addressing 162 to output the other data of the same group (e.g., 221, 222). The data type is thus the control setting for 226: for example, the IRB address or tag at which comparison result 228 is produced, the polarity of 228, and the number of subsequent addresses to generate. Learning engine 226 also compares the DBN address placed on bus 181 with the DBN 184 obtained by matching and mapping the data address 94 produced by processor core 23; if they differ, the valid value in the data-type entry of the corresponding DTT 166 is decremented by ‘1’, and the mapped DBN 184 is placed on bus 181 to address data memory 162 to read the correct data, and also to address DTT 166 to read the corresponding track entry. Learning engine 226 relearns any 166 entry whose valid value has decreased to ‘0’.
The FIG. 21 embodiment can be used in combination with the FIG. 18 embodiment. Learning engine 226 continuously monitors the data types in the data track table, and also monitors the data on data memory output 223 and the data address 94 output by processor core 23. If the data on 223 differs from the subsequent address on 94, the valid value in the data-type entry of DTT 166 corresponding to the data memory 162 entry that output the data is decremented by ‘1’. If the data on 223 is the same as the subsequent address on 94, the valid value of the data-type entry is incremented by ‘1’. For a group of data whose corresponding data-type entry has a valid value greater than a preset value, the system operates in the manner of the FIG. 21 embodiment, i.e., it assumes the data contains pointers. For a valid value not greater than the preset value, the system operates in the manner of the FIG. 18 embodiment, i.e., assuming the data contains no address pointers, it computes a DBN address by ‘step size’ and reads the data in data memory 162 into DRB 163 for use by processor core 23. Thereafter, each time the address on 181 produced in the manner of the FIG. 21 embodiment is the same as the address on 94, the valid value is incremented by ‘1’; if different, it is decremented by ‘1’. This is the reward for learning engine 226. Data-type entry 230 may further include a field recording whether this group of data is operated on in the manner of the FIG. 18 embodiment, the FIG. 21 embodiment, or otherwise.
Figure 22 is an embodiment handling function call (Call) and function return (Return) instructions. The level-one cache 22, processor core 23, track table 20, incrementer 24, selector 25, and register 26 in Figure 22 function identically to the identically numbered modules in the Figure 2 embodiment. A stack 233 and a selector 236 are newly added. When the scanner scans instructions to extract the instruction type format, it decodes whether an instruction is a call or return instruction and records this in the instruction type format of field 11 of the track table entry (see Figure 1). When the instruction type on track table output 29 in Figure 22 is a call instruction and the TAKEN signal 31 is 'branch taken', a controller (not shown) pushes the BNX in register 26, together with the BNY output by incrementer 24, onto stack 233. When the instruction type on track table output 29 is a return instruction, the controller controls selector 236 to select the output of stack 233. When 31 is 'branch taken', the BN at the top of stack 233 is popped into register 26, returning the program to execute the instruction following the call instruction.
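The push/pop behavior of stack 233 can be sketched as a small return-address stack holding (BNX, BNY) pairs. This is a hypothetical software model; the class and method names and the fixed depth are assumptions.

```python
class ReturnStack:
    """Sketch of stack 233: on a taken call, push the fall-through (BNX, BNY);
    on a taken return, pop it back into register 26."""

    def __init__(self, depth=16):
        self.entries = []
        self.depth = depth  # assumed finite hardware depth

    def on_call_taken(self, bnx, bny_incremented):
        # BNX from register 26 plus the incrementer-24 output (BNY of the call
        # plus one), i.e. the track-table address of the instruction after the
        # call instruction.
        if len(self.entries) < self.depth:
            self.entries.append((bnx, bny_incremented))

    def on_return_taken(self):
        # Selector 236 picks the stack top; it is popped into register 26 so
        # execution resumes at the instruction following the call.
        return self.entries.pop()
```

Nested calls unwind in last-in, first-out order, matching the usual call/return pairing.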
The instruction type (field 11) of indirect branch instructions may also be subdivided to provide guidance to the cache system. There is a class of indirect branch instructions that jump to the same instruction address on every execution, or whose generated instruction address on each execution is the instruction address generated on the previous execution plus a 'stride'. Such indirect branch instructions are recorded in track table entry field 11 as repeating-class indirect branch instructions, and the generated instruction address and stride are recorded in the stride table 150 of Figure 17. Alternatively, the generated BNX and BNY instruction addresses may be stored in fields 12 and 13 of the track table entry respectively (see the Figure 1 embodiment), with the stride table recording only the stride. The specific operation follows the way data addresses are generated in the Figure 17 and Figure 18 embodiments, and is not repeated here. Because the cache system of the present invention can actively provide non-branch instructions and direct branch instructions to the processor core, and indirect branch target addresses are generated based on register or memory contents, a processor core using the cache system of the present invention does not need to retain a program counter that generates instruction addresses. Program-debug hardware breakpoints can be mapped to BN-format addresses and compared with the tracker's BN, triggering an interrupt on a match. Accordingly, the processor core does not need the pipeline stages associated with instruction fetch.
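The repeating class of indirect branches described above can be modeled as a stride record of the kind the stride table 150 of Figure 17 would hold. The sketch below is an illustration under our own assumptions (the `predict`/`update` names and the re-learning rule are not from the patent); a stride of 0 covers the always-same-target case.

```python
class RepeatingIndirectEntry:
    """Hypothetical stride-table record for a repeating-class indirect branch:
    predicted next target = last generated target + stride."""

    def __init__(self, last_target, stride=0):
        self.last_target = last_target
        self.stride = stride

    def predict(self):
        # Target the cache system can prepare ahead of execution.
        return self.last_target + self.stride

    def update(self, actual_target):
        # Re-learn the stride from the target actually produced, so a change
        # in the branch's pattern is picked up after one execution.
        self.stride = actual_target - self.last_target
        self.last_target = actual_target
```
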
Please refer to Figure 23, another embodiment of the processor system of the present invention. Figure 23 is an improvement of the Figure 8 embodiment, in which the level-three active list 50, the level-three cache's TLB and tag unit 51, the level-three cache memory 52, the selector 54, the level-two track table 88, the level-two active list 40, the level-two cache memory 42, the track table 20, the level-one cache's correlation table 37, the level-one cache memory 22, the instruction read buffer 39, the tracker 47, the tracker 48, and the processor core 23 function identically to the identically numbered modules in the Figure 8 embodiment. A Track Read Buffer (TRB) 238 and selectors 237 and 239 are added.
TRB 238 stores the tracks corresponding to the instruction blocks stored in IRB 39. Processor core 23 has two front-end pipelines: FT (fall-through, sequential next) and TG (target). Tracker 0 (TR0) 48 provides the BNY increment 38 that controls IRB 39 to feed the sequential instruction stream to the FT pipeline of processor core 23; tracker 1 (TR1) 47 reads ahead along the track in the TRB to obtain the TG addresses on the track. A BN1-format TG address addresses the L1 instruction memory 22, and a BN2-format TG address addresses the L2 instruction memory 42; each reads out its TG instruction, and, depending on whether the TG that may be executed next in program order is in BN1 or BN2 format, selector 239 is controlled to select one and send it to the TG pipeline. The TAKEN signal 31 selects whether the output of the FT or the TG front-end pipeline is completed by the back-end pipeline. When a branch is taken, the TG instruction block corresponding to the branch instruction, coming from L2 or L1, is selected by selector 239 and stored into IRB 39; the track corresponding to that TG instruction block, coming from the level-two track table (TT2) 88 or the track table (TT) 20, is likewise selected by selector 237 and stored into TRB 238 for TR1 47 to read. If this TG instruction block was read out of L2 instruction memory 42 by the BN2X address on the track, it is also stored into the level-one storage block in L1 instruction memory 22 pointed to by the BN1X provided by the replacement logic. That BN1X is also stored into the entry of the AL2 active list 40 pointed to by the BN2X. A BN3-format address on a track output by the level-two track table 88 is sent over bus 89 to AL3 50 and mapped to a BN2 address (or, when the AL3 entry is invalid, it addresses L3 52, and the instruction block read out is stored into a level-two storage block of L2 42 whose block address is BN2X). That BN2 address replaces the original BN3 address on the track.
By the same principle, a BN2-format address on a track output from TT2 88 or TT 20, or on a track in TRB 238, can be mapped to BN1 format through AL2 40 (or by addressing L2 42 and storing into L1 22 to obtain a BN1 address). In this embodiment, TT2 88 stores TG addresses in BN3 or BN2 format, TT 20 stores only addresses in BN2 or BN1 format, while TRB 238 allows TG addresses in BN3, BN2, or BN1 format. The restriction on BN formats in TT2 and TT triggers instructions to be filled from lower memory levels to higher memory levels, avoiding the fills triggered by cache misses, and thus the unavoidable misses, of traditional cache mechanisms. It also guarantees that a branch target instruction resides at the same or the next memory level as the direct branch instruction. Because TR1 47 reads ahead the TG addresses on the track, the access latency of L2 42 or L1 22 can be partially or fully hidden. If an instruction segment contains dense branch instructions, the TG addresses on the corresponding track can deliberately be interleaved in BN1 and BN2 formats to hide the access latencies of 42 and 22 as far as possible. If the address read from the TRB is in BN3 format and the corresponding branch is taken, processor core 23 waits until the BN2 format mapped from that BN3 address (the mapping starts as soon as the track is output from TT2 88, so the AL3 or L3 latency can be partially or fully hidden) has been filled into the track in TRB 238, and then executes the branch target instruction. If the corresponding branch is not taken, processor core 23 does not wait but directly executes the next sequential instruction; the mapped BN2 format is filled into the track once obtained. After all BN3-format addresses on a track in TRB 238 have been replaced with BN2 format, the track is filled into the row of TT 20 indicated by the BN1X provided by the replacement logic described above. In this embodiment, the system can, according to the track output by the level-two track table 88 or the level-one track table 20, control the level-two instruction memory 42 or the level-one instruction memory 22 to provide TG instructions to processor core 23, while IRB 39 provides the sequential instructions to the processor core. In this embodiment, proceeding to the next sequential instruction block is handled as a branch: the instruction type in the end track point of the track is set to unconditional branch, so the handling is the same as the branch process described above. The method and system of this embodiment are also applicable to other multi-level track instruction cache systems, such as the embodiments of Figures 11, 12, 13, and 18.
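The BN-format restriction in TT2 and TT implies that a TG address is promoted one level (BN3 to BN2, BN2 to BN1) whenever its track moves toward a higher table. A minimal sketch of that promotion step, with plain dictionaries standing in for the active lists AL3/AL2 and callbacks standing in for the "address the lower memory, fill the block, allocate a block number" path, might look like the following (all names are our assumptions):

```python
def promote_address(addr, fmt, al3, al2, fetch_from_l3, fetch_from_l2):
    """Map a track-table TG address one level up: BN3 -> BN2 before a track
    enters TT2's BN2-only territory, BN2 -> BN1 before it enters TT."""
    if fmt == "BN3":
        # Valid AL3 entry: reuse the recorded BN2; otherwise fill L2 from L3
        # and record the newly allocated BN2 in AL3.
        bn2 = al3.get(addr)
        if bn2 is None:
            bn2 = fetch_from_l3(addr)
            al3[addr] = bn2
        return bn2, "BN2"
    if fmt == "BN2":
        # Same pattern one level up, through AL2 and L2 -> L1 fill.
        bn1 = al2.get(addr)
        if bn1 is None:
            bn1 = fetch_from_l2(addr)
            al2[addr] = bn1
        return bn1, "BN1"
    return addr, fmt  # already BN1: nothing to do
```

Because the fill happens when the format restriction is hit, not when the processor misses, the lower level is populated before the target instruction is ever demanded.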
Returning to Figure 12, both application forms of the structure in the Figure 12 embodiment admit further specific embodiments; for example, the functional modules in Figure 12 may sit at the two ends of a long-latency communication channel. Suppose memory 111 in Figure 12 is located at one end of the communication channel and the remaining modules are located at the other end. The communication channel may be between one processor core and the memory of another processor core on the same chip; between one processor lane and the memory of another processor lane on the same chip; between a processor core on one chip and memory on another chip; between the processor of one computer and the memory of another computer; between a processor core or computer and memory at the other end of a wired or wireless network; or any other long-latency communication channel.
The following uses a network channel as an example. An IPv6 address is 128 bits; assuming the memory address is 64 bits, the IPv6 address and the memory address are combined into a single 192-bit address to address memory at the far end of the network. To support this 192-bit address, only components 43, 51, and 113 in Figure 12 need to satisfy the 192-bit width, but their functions and operation remain the same; none of the remaining components needs any change on account of this 192-bit width. Specifically, the TLB/TAG unit 51 must be able to store tags supporting 192-bit addresses (for example, a 128-bit tag plus a 64-bit memory tag), and scanner 43 must be able to add the 192-bit current instruction block address provided by 51, the branch instruction's intra-block offset, and the branch offset to obtain a 192-bit branch target address. This 192-bit branch target address is matched against the contents of the tag unit TAG in 51. If there is no match, the 192-bit branch target address is sent over bus 113 to memory 111 at the other end of the channel to fetch instructions. If it matches, then, as described earlier for the Figure 12 embodiment, a BN3 or BN2 address is stored into the level-two track table 88; this is not repeated here. Other channels, such as a local area network or the link between the processor cores and memories of different computers, can be supported in the same way: simply prefix the memory address with the network address, within the connected network, of the computer, memory, or other functional unit concerned. Memory 112 in the Figure 12 embodiment may also be placed together with memory 111 at the other end of the communication channel.
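Under the stated assumption of a 128-bit IPv6 prefix and a 64-bit memory address, composing and splitting the 192-bit network memory address is plain bit concatenation, as this sketch shows:

```python
def make_network_memory_address(ipv6_addr: int, mem_addr: int) -> int:
    """Concatenate a 128-bit network (IPv6) address with a 64-bit memory
    address into the single 192-bit value matched against the TAG in 51."""
    assert 0 <= ipv6_addr < (1 << 128) and 0 <= mem_addr < (1 << 64)
    return (ipv6_addr << 64) | mem_addr

def split_network_memory_address(nma: int):
    """Inverse operation: recover the network prefix and the memory address,
    e.g. to place the network part in a packet header and the memory part in
    the packet payload."""
    return nma >> 64, nma & ((1 << 64) - 1)
```

Since the scanner's adder need only cover the memory-address portion, carries are assumed here never to propagate into the network prefix, matching the optimized adder width discussed below.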
The above specific embodiments of the application forms of the Figure 12 structure can also be applied to the structures of Figures 13 and 18. Taking Figure 18 as an example, suppose memory 111 in Figure 18 is at one end of the communication channel and the remaining modules are at the other end. Then, as in the above embodiment, as long as the widths of the TLB/TAG unit 51, scanner 43, and bus 113 can support the memory address width with the network address prefix, operation of the instruction memory at the far end of the communication channel is supported. The specific embodiment for Figure 13 is the same as the instruction-memory portion of Figure 18 above and is not repeated. In Figure 18, if memories 111 and 112 also store data, then the adder 169 that generates data addresses, and its output bus 198, must likewise support memory addresses with the network address prefix described above. Apart from the widths of modules 51, 43, and 169 and buses 113 and 198, none of the remaining modules in Figure 18 needs any change, because they all operate on cache addresses. The network memory address (network address + memory address) is mapped to a cache address by the tag unit TAG in 51. The width of the cache address depends on the organization of the cache and is independent of the network memory address.
When memory 111 and the other modules of Figure 18 sit at the two ends of a network, the address on bus 113 may be transmitted in packets; in that case the network address portion of the network memory address can be placed in the packet header, and the memory address portion in the packet payload. When memory 111 is accessible to multiple processor cores or computers, 111 should contain an arbiter to determine the order of access. In the processor core, a thread register stores the network address corresponding to each thread. The adder 169 in Figure 18, or the adder in scanner 43, may use a bit width equal to that of the network memory address, but an optimized implementation need only satisfy the memory address width. While the adder computes the memory address of the branch target or data, the thread register is addressed by the thread number currently executing, and the network address stored for that thread is read out. That network address is concatenated with the computed memory address to form the network memory address, which is sent to the tag unit TAG in 51 for matching.
Likewise, the tag unit in 51 can store multiple network memory addresses, for example at 192 bits per entry, but several optimizations are possible. One uses two tables: each entry in Table 2 stores, in addition to the memory-address tag, a row number of Table 1, and each entry in Table 1 stores a network address. The network address portion of a network memory address is first matched against the contents of Table 1 to obtain a Table 1 row number; the obtained row number, concatenated with the memory address, is then matched against Table 2. A Table 2 match yields the cache address; on a miss, the network memory address is sent over bus 113 to fetch instructions or data from memory 111 into memory 112. Another optimization uses only Table 2, whose entries store the row number (or thread number) of the thread register in addition to the memory-address tag. Here the thread register row number (or thread number) is concatenated with the memory address and matched against Table 2. On a miss, the network address read from the thread register addressed by that row number (or thread number) is concatenated with the memory address to form the network memory address, and instructions or data are fetched over bus 113 from memory 111 into memory 112. The additional cost actually required is therefore small.
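The two-table optimization can be sketched as interning the wide network address once in Table 1, so that each Table 2 entry carries only a short row index plus the memory-address tag instead of 128 extra bits. Everything below (names, dictionary-based tables, miss signaling via `None`) is an illustrative assumption:

```python
class TwoLevelTag:
    """Sketch of the two-table TAG in 51: Table 1 holds network addresses,
    Table 2 maps (table1_row, memory_address) to a cache address."""

    def __init__(self):
        self.table1 = []   # row number -> network address
        self.table2 = {}   # (row number, memory address) -> cache address

    def _intern_network(self, net_addr):
        # First-level match: find (or allocate) the Table 1 row for this
        # network address.
        if net_addr in self.table1:
            return self.table1.index(net_addr)
        self.table1.append(net_addr)
        return len(self.table1) - 1

    def lookup(self, net_addr, mem_addr):
        # Second-level match on the concatenated (row, memory address); None
        # models a miss that would go out on bus 113 to memory 111.
        row = self._intern_network(net_addr)
        return self.table2.get((row, mem_addr))

    def fill(self, net_addr, mem_addr, cache_addr):
        # Record the mapping after the missing block has been fetched into 112.
        row = self._intern_network(net_addr)
        self.table2[(row, mem_addr)] = cache_addr
```

The thread-register variant described above replaces `_intern_network` with a direct thread-number index, trading one associative search for a register file read.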
The scanner 43 in the embodiments of Figures 12, 13, and 18 computes the branch target instruction address of a branch instruction based on the address, obtained from the tag unit in 51, of the instruction block containing the branch instruction. The tag unit in 51 stores physical addresses, so the branch target instruction address computed by scanner 43 is a physical address. As long as the branch target instruction's physical address does not cross a physical page boundary, it can be matched directly against the contents of the tag unit in 51 without TLB mapping. Similarly, in the Figure 18 embodiment, the data address generated by adder 169 using the physical address in 51's tag unit as the base address is also a physical address; as long as it does not cross a physical page boundary, it can be matched directly against the contents of 51's tag unit without TLB mapping. The match yields the BN address of the lowest-level cache. In Figures 4, 5, 12, 13, and 18, only the indirect branch instruction address on bus 46 is a virtual address, which must be mapped to a physical address by the TLB in 51; scanner 43 and data address generator 169 both produce physical addresses that can be matched directly in 51's TAG. The other addresses that address the last-level cache — bus 29 in Figures 4 and 5, bus 89 in Figures 8, 11, and 12, and bus 119 in Figures 13 and 18 — are in cache address format BN, and can directly address the last-level cache memory, the active list AL, the correlation table CT, and the tag unit TAG in 51, without mapping through the TLB or the tag unit TAG in 51.
Although the embodiments of the present invention describe only structural features and/or method processes of the invention, it should be understood that the claims of the invention are not limited to those features and processes. On the contrary, the described features and processes are merely examples of implementing the claims of the invention. It should be understood that the components listed in the above embodiments are listed only for convenience of description; other components may be included, or some components may be combined or omitted. The components may be distributed across multiple systems, may be physical or virtual, and may be implemented in hardware (such as integrated circuits), in software, or in a combination of hardware and software.
Obviously, in light of the description of the preferred embodiments above, regardless of how the technology in this field develops and whatever advances, not yet easy to predict, may be made in the future, one of ordinary skill in the art may, according to the principles of the present invention, make corresponding substitutions, adjustments, and improvements to the relevant parameters and configurations, and all such substitutions, adjustments, and improvements fall within the protection scope of the appended claims.
Industrial Applicability
The systems and methods proposed by the present invention can be used in various computing and data processing systems, information and data storage systems, and communication systems. The systems and methods proposed by the present invention can hide or significantly reduce storage-system access latency as well as cache misses.
Sequence Listing Free Text

Claims (30)

1. A processor system, comprising a processor core and a cache; characterized in that:
    the cache pushes instructions and data to the processor core for the processor core to execute and process.
2. The system of claim 1, characterized in that:
    the processor core provides branch decisions to the cache system;
    the cache examines the instructions stored in it, extracting and storing the control flow information of those instructions;
    the cache pushes instructions to the processor core for execution according to the control flow information and the branch decisions.
3. The system of claim 2, characterized in that:
    the processor core provides the cache system with the base address of an indirect branch instruction;
    the cache generates the indirect branch target address from the base address and pushes the indirect branch target instruction to the processor core for execution.
4. The system of claim 1, characterized in that:
    the processor core pipeline has no instruction fetch pipeline stage;
    the processor core does not generate instruction addresses;
    the processor core does not provide instruction addresses to the cache to read instructions.
5. The system of claim 1, characterized in that:
    the cache of the system is connected to a memory;
    the cache generates memory addresses and provides them to the memory;
    the memory provides instructions to the cache according to the memory addresses.
6. The system of claim 2, characterized in that:
    only the lowest storage level of the cache performs virtual-to-physical address translation;
    only the lowest storage level of the cache maps memory addresses to cache addresses.
7. The system of claim 2, characterized in that:
    the lowest storage level of the cache is organized in a set-associative manner;
    the storage levels of the cache other than the lowest are organized in a fully associative manner.
8. The system of claim 2, characterized in that:
    a scanner is provided between each pair of adjacent storage levels in the cache;
    the scanner examines the instructions passed between the adjacent storage levels to extract control flow information.
9. The system of claim 2, characterized in that:
    a scanner is provided in the cache between the lowest storage level and the second-lowest storage level;
    the scanner examines the instructions passed between the lowest and second-lowest storage levels to extract control flow information;
    the extracted control flow information is stored for use by storage levels higher than the second-lowest.
10. The system of claim 2, characterized in that:
    the highest storage level of the cache has a first read port and a second read port;
    the highest storage level of the cache has a first tracker and a second tracker;
    according to the stored control flow information and the branch decision, the first and second trackers control the first and second read ports to provide the processor core with both the sequential instruction and the branch target instruction following a branch instruction;
    the processor core executes the branch instruction and produces a branch decision;
    the processor core uses the branch decision to determine whether the sequential instruction or the branch target instruction is executed and written back.
11. The system of claim 3, characterized in that:
    the cache stores pairs of the base address of the indirect branch instruction and the indirect branch target instruction;
    the cache can provide the stored indirect branch target instruction to the processor core according to the indirect branch instruction and the base address.
12. A cache replacement method, characterized in that the cache block to be replaced is selected on the principle of least degree of association.
13. The method of claim 12, characterized in that the cache block to be replaced is further selected on the principle of having been replaced earliest in the past.
14. The method of claim 12, characterized in that:
    the cache blocks in the cache keep association records;
    the association record records, as the degree of association, the number of instructions that take the cache block as their branch target.
15. The method of claim 12, characterized in that:
    the cache blocks in the cache keep association records;
    the association record records, as the degree of association, the number of higher-level cache blocks whose contents are identical to part or all of the contents of the cache block.
16. The method of claim 12, characterized in that:
    control flow information, in which branch target addresses are recorded, is stored in the cache;
    the cache blocks in the cache keep association records;
    the association record records the address of the cache block in the next-lower storage level;
    the association record records the addresses of the branch source cache blocks that take the cache block as their branch target;
    when the cache block is replaced, the address of the cache block recorded in the control flow information is replaced with the cache block's next-lower-storage-level address.
17. The method of claim 12, characterized in that:
    the control flow information is queried to determine the addresses of the higher-storage-level cache blocks corresponding to the contents of a lower-storage-level cache block;
    those corresponding higher-storage-level cache blocks are replaced so as to reduce the degree of association of the lower-storage-level cache block.
18. The method of claim 12, characterized in that:
    cache blocks that have no association with other cache blocks are replaced.
19. An information processing method, characterized in that:
    a cache system pushes instructions and data to a processor core for the processor core to execute.
20. The method of claim 19, characterized by comprising:
    Step A: the processor core provides branch decisions to the cache system;
    Step B: the cache examines the instructions stored in it, extracting and storing the control flow information of those instructions;
    Step C: the cache pushes instructions to the processor core for execution by the processor core according to the control flow information and the branch decisions.
21. The method of claim 19, characterized in that:
    the processor core provides the cache system with the base address of an indirect branch instruction;
    the cache generates the indirect branch target address from the base address and pushes the indirect branch target instruction to the processor core for execution.
  22. 如权利要求19所述的方法,其特征在于:The method of claim 19 wherein:
    由循迹器提供缓存器地址寻址所述缓存器向所述处理器核推送指令;Providing a buffer address by the tracker to address the buffer to push an instruction to the processor core;
    按线程存储所述循迹器中的及所述处理器核中的寄存器状态;Storing, by a thread, a register state in the tracker and in the processor core;
    按线程将所述存储的寄存器状态与所述循迹器及所述处理器核中的状态互换以进行线程切换。The stored register state is swapped by threads with states in the tracker and the processor core for thread switching.
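The thread switch of claim 22 amounts to exchanging the live tracker/core state with a per-thread saved copy. The class below is a hypothetical software sketch (the names `PushProcessor`, `switch_to`, and the 4-register file are invented assumptions); it shows only the swap discipline, not any real hardware interface:

```python
# Hypothetical sketch of claim-22 thread switching: live tracker and core
# register state is swapped with a saved per-thread copy.

class PushProcessor:
    def __init__(self):
        self.tracker_pc = 0          # tracker's cache address (live state)
        self.core_regs = [0] * 4     # core register file (live state)
        self.saved = {}              # per-thread saved state, keyed by thread id

    def switch_to(self, thread_id):
        """Swap live state with the saved state of thread_id; return the
        outgoing live state so the caller can file it under the old thread."""
        outgoing = {"tracker_pc": self.tracker_pc, "core_regs": self.core_regs}
        incoming = self.saved.get(thread_id,
                                  {"tracker_pc": 0, "core_regs": [0] * 4})
        self.tracker_pc = incoming["tracker_pc"]
        self.core_regs = incoming["core_regs"]
        return outgoing

cpu = PushProcessor()
cpu.tracker_pc, cpu.core_regs = 100, [1, 2, 3, 4]
cpu.saved[1] = cpu.switch_to(0)   # save thread 1's state, resume thread 0
print(cpu.tracker_pc)             # 0: thread 0 starts from fresh state
```

Because the tracker address is saved and restored alongside the core registers, the cache can resume pushing instructions for the incoming thread from exactly where that thread left off.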
  23. The method of claim 19, wherein:
    the cache system uses main memory as the lowest-level cache;
    the main memory is addressed by cache addresses.
  24. The method of claim 23, wherein:
    the cache is addressed by real addresses;
    the cache system performs no virtual-to-real address translation.
  25. The method of claim 19, wherein:
    the main memory is composed of non-volatile memory together with volatile memory;
    the volatile memory acts as a cache for the non-volatile memory.
  26. The method of claim 19, wherein:
    the storage blocks in each storage level of the cache are organized as a tree;
    the storage blocks in the different storage levels are linked by mapping relationships.
  27. The method of claim 26, wherein:
    the mapping relationship may be a forward mapping from a lower storage level to a higher storage level;
    the mapping relationship may be a reverse mapping from a higher storage level to a lower storage level.
  28. The method of claim 27, wherein:
    the higher-level address of a cache block is reverse-mapped to a lower-level cache address;
    a target offset is added to the lower-level cache address to obtain the lower-level cache address of the next data load;
    the target lower-level cache address is forward-mapped to the higher-level cache address of the next data load.
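The three-step address computation of claim 28 (reverse map, add offset, forward map) can be traced numerically. The block size, the mapping tables, and the function name below are invented assumptions for illustration; the point is only the round trip between address spaces:

```python
# Hypothetical sketch of the claim-28 computation: high-level address ->
# reverse map -> low-level address -> + target offset -> forward map ->
# high-level address of the next data load. Tables are invented examples.

BLOCK = 16  # assumed cache block size in address units

# forward mapping: low-level block number -> high-level block number
forward = {5: 0, 6: 1, 7: 2}
reverse = {hi: lo for lo, hi in forward.items()}   # high -> low (inverse)

def next_load_high_addr(high_addr, offset):
    hi_block, intra = divmod(high_addr, BLOCK)
    low_addr = reverse[hi_block] * BLOCK + intra    # step 1: reverse mapping
    target_low = low_addr + offset                  # step 2: add target offset
    low_block, intra2 = divmod(target_low, BLOCK)
    return forward[low_block] * BLOCK + intra2      # step 3: forward mapping

# high block 0 <-> low block 5; an offset of one block lands in low block 6,
# which forward-maps to high block 1.
print(next_load_high_addr(4, 16))   # 20, i.e. high block 1, intra-block offset 4
```

Doing the stride arithmetic in the contiguous low-level address space is what makes the scheme work: consecutive high-level blocks need not be adjacent, but their low-level images are, so a fixed data-access stride survives the round trip.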
  29. The method of claim 28, wherein:
    data is fetched from the cache using the next-data-load higher-level cache address before the processor requests it;
    the data is pushed to the processor along with the corresponding data access instruction.
  30. The method of claim 28, wherein:
    the target offset is selected according to the branch decision of a backward-jumping branch instruction that follows the relevant instruction.
PCT/CN2016/080039 2015-04-23 2016-04-22 Instruction and data push-based processor system and method WO2016169518A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/568,715 US20180088953A1 (en) 2015-04-23 2016-04-22 A processor system and method based on instruction and data push

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
CN201510201436.1 2015-04-23
CN201510201436 2015-04-23
CN201510233007.2 2015-05-06
CN201510233007.2A CN106201913A (en) 2015-04-23 2015-05-06 A kind of processor system pushed based on instruction and method
CN201510267964.7A CN106201914A (en) 2015-04-23 2015-05-20 A kind of processor system pushed based on instruction and data and method
CN201510267964.7 2015-05-20
CN201610188651.7 2016-03-21
CN201610188651.7A CN106066787A (en) 2015-04-23 2016-03-21 A kind of processor system pushed based on instruction and data and method

Publications (1)

Publication Number Publication Date
WO2016169518A1 true WO2016169518A1 (en) 2016-10-27

Family

ID=57142872

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/080039 WO2016169518A1 (en) 2015-04-23 2016-04-22 Instruction and data push-based processor system and method

Country Status (1)

Country Link
WO (1) WO2016169518A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1375767A * 2001-07-03 2002-10-23 IP-First LLC Apparatus and method for providing branch instruction and relative target instruction to buffering zone
CN101763249A * 2008-12-25 2010-06-30 STMicroelectronics (Beijing) R&D Co., Ltd. Branch checkout for reduction of non-control flow commands
CN103870394A * 2012-12-13 2014-06-18 ARM Limited Retention priority based cache replacement policy

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112527395A (en) * 2020-11-20 2021-03-19 海光信息技术股份有限公司 Data prefetching method and data processing apparatus
CN112527395B (en) * 2020-11-20 2023-03-07 海光信息技术股份有限公司 Data prefetching method and data processing apparatus

Similar Documents

Publication Publication Date Title
US9053049B2 (en) Translation management instructions for updating address translation data structures in remote processing nodes
TW201638774A (en) A system and method based on instruction and data serving
US6920531B2 (en) Method and apparatus for updating and invalidating store data
WO2016131428A1 (en) Multi-issue processor system and method
US20040088489A1 (en) Multi-port integrated cache
WO2014000624A1 (en) High-performance instruction cache system and method
EP2517100A1 (en) High-performance cache system and method
WO2013000400A1 (en) Branch processing method and system
US20150356025A1 (en) High-performance cache system and method
JP3449487B2 (en) Conversion index buffer mechanism
JP7184815B2 (en) Conversion assistance for virtual cache
WO2013071868A1 (en) Low-miss-rate and low-miss-penalty cache system and method
JP2001195303A (en) Translation lookaside buffer whose function is parallelly distributed
WO2015002481A1 (en) Apparatus and method for managing buffer having three states on the basis of flash memory
US11372647B2 (en) Pipelines for secure multithread execution
WO2007099598A1 (en) Processor having prefetch function
US6678789B2 (en) Memory device
WO2018199646A1 (en) Memory device accessed on basis of data locality and electronic system including same
WO2015024532A9 (en) System and method for caching high-performance instruction
US6076145A (en) Data supplying apparatus for independently performing hit determination and data access
WO2016169518A1 (en) Instruction and data push-based processor system and method
WO2014000626A1 (en) High-performance data cache system and method
KR101100143B1 Cache memory control device and pipeline control method
JP3970705B2 (en) Address translator, address translation method, and two-layer address translator
WO2017073957A1 (en) Electronic device and method for managing memory thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16782665

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 15568715

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16782665

Country of ref document: EP

Kind code of ref document: A1