WO2016169518A1 - Instruction and data push-based processor system and method - Google Patents

Instruction and data push-based processor system and method

Info

Publication number
WO2016169518A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
cache
level
instruction
buffer
Prior art date
Application number
PCT/CN2016/080039
Other languages
French (fr)
Chinese (zh)
Inventor
林正浩
Original Assignee
上海芯豪微电子有限公司 (Shanghai Xinhao Microelectronics Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201510233007.2A (published as CN106201913A)
Application filed by 上海芯豪微电子有限公司 (Shanghai Xinhao Microelectronics Co., Ltd.)
Priority to US15/568,715 (published as US20180088953A1)
Publication of WO2016169518A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems

Definitions

  • the invention relates to the field of computers, communications and integrated circuits.
  • the central processor in a stored-program computer generates an address and sends it to the memory; the instruction or data read at that address is sent back for execution by the central processing unit, and the execution result is sent back to the memory for storage.
  • as the execution speed of the central processor increases, memory access latency becomes a bottleneck for improving computer performance.
  • the stored-program computer uses a buffer (cache) to mask the memory access latency and alleviate this bottleneck. But the central processor fetches instructions or data from the buffer in the same way: the processor core in the central processing unit generates an address and sends it to the buffer.
  • if the information is in the buffer, the buffer sends it directly to the processor core for execution, thus avoiding the memory access delay.
  • as the capacity of the buffer grows, the buffer access latency increases, as does the latency of the access channel.
  • meanwhile the execution speed of the processor core keeps increasing, so buffer access latency is now a serious bottleneck for computer performance improvement.
  • the manner in which the processor core fetches information (instructions and data) from the memory for execution can be regarded as the processor core pulling (pull) information from the memory.
  • pulling information incurs the channel delay twice: once when the processor core sends the address to the memory, and once when the memory sends the information back to the processor core.
  • all processors in stored-program computers have modules for generating and recording instruction addresses, and the instruction pipeline must include an instruction-fetch stage.
  • fetching instructions in a modern stored-program computer usually requires several pipeline stages, which deepens the pipeline and increases the penalty of branch mispredictions.
  • generating and recording long instruction addresses also consumes more energy.
  • a computer that converts variable-length instructions into fixed-length micro-operations must convert the fixed-length micro-operation address back to the variable-length instruction address in order to address the cache, which has a cost.
  • the method and system apparatus proposed by the present invention can directly address one or more of the above or other difficulties.
  • the present invention provides a processor system comprising a push buffer and a corresponding processor core, wherein: the processor core does not generate or maintain an instruction address, and its pipeline has no instruction-fetch stage; the processor core only provides branch decisions to the push buffer, and provides a base address stored in the register file when an indirect branch instruction is executed; the push buffer extracts control flow information from the instructions it stores, and pushes instructions to the processor core for execution according to the stored control flow information and the branch decisions; upon encountering an indirect branch instruction, the push buffer provides the correct indirect branch target instruction to the processor core for execution, based on the base address received from the processor core.
  • the push buffer may provide the processor core with both the instruction that sequentially follows a branch instruction and the branch target instruction; the branch decision generated by the processor core selects one of the two for execution, thereby masking the delay of passing the branch decision from the processor core to the push buffer.
  • the push buffer may store the base address of an indirect branch instruction together with the corresponding indirect branch target address, which reduces or eliminates the delay in pushing the indirect branch target instruction and partially or completely masks the delay of sending the base address from the processor core to the push buffer.
  • the push buffer may push instructions to the processor core in advance based on the control flow information stored in it, partially or completely masking the delay of transmitting information from the push buffer to the processor core.
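The push model above can be sketched in a few lines: the buffer supplies both the fall-through and the branch-target instruction before the branch resolves, and the core contributes only the taken/not-taken decision. This is an illustrative toy model, not the patent's hardware; all class and parameter names are assumptions.

```python
class PushBuffer:
    """Toy push buffer: holds instructions plus the branch-target map
    (the extracted control flow information) and supplies BOTH possible
    next instructions before the branch decision arrives."""
    def __init__(self, instructions, branch_targets):
        self.mem = instructions        # address -> instruction
        self.targets = branch_targets  # branch address -> target address

    def push_pair(self, pc):
        # The sequential successor is always pushed; if pc is a branch,
        # the branch-target instruction is pushed alongside it.
        seq = self.mem.get(pc + 1)
        tgt_addr = self.targets.get(pc)
        tgt = self.mem.get(tgt_addr) if tgt_addr is not None else None
        return seq, tgt

def core_select(seq, tgt, taken):
    """The core contributes only the branch decision; it never sends
    an address back to the buffer, so that round trip is hidden."""
    return tgt if taken else seq
```

Because both candidates arrive before the decision, the core-to-buffer latency no longer sits on the critical path of instruction supply.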
  • the processor core of the processor system proposed by the present invention does not need to have a pipeline stage for fetching instructions, nor does it need to generate and record an instruction address.
  • the invention proposes an organization for a multi-level cache hierarchy in which the last (lowest) level cache (Last Level Cache, LLC) is set-associative and has a virtual-to-physical address translation lookaside buffer (TLB) and a tag unit (TAG). The TLB translates a virtual memory address into a physical address, which is matched against the contents of the TAG to obtain the buffer address of the LLC. Because the LLC buffer address is mapped from the physical memory address, the LLC buffer address effectively stands for the physical address. The resulting LLC buffer address can be used to address the LLC's information memory (RAM) and also to index an LLC active table.
  • the LLC active table stores the mapping between LLC cache blocks and cache blocks in the higher-level buffers: it is addressed by the LLC buffer address, and each entry holds the corresponding higher-level cache block address.
  • all buffer levels other than the LLC are fully associative; they are addressed directly by their own buffer addresses and require neither a tag unit (TAG) nor a TLB.
  • the buffer address of each level is mapped to the higher-level buffer address through an active table.
  • each such active table is similar to the LLC active table: it is addressed by the buffer address of its level, and its entries store the higher-level buffer addresses.
  • the highest-level buffer has a corresponding track table, which stores the control flow information extracted by the scanner when instructions are reviewed and stored into the highest-level buffer memory (RAM).
  • the track table is addressed by the highest-level buffer address, and its entries store the branch target addresses of branch instructions.
  • the tracker generates a highest-level buffer address that addresses the first read port of the highest-level buffer memory to output sequential instructions, which are pushed to the processor core; the same address also reads the corresponding branch target address from the corresponding track table entry, and that branch target address addresses the second read port of the highest-level buffer memory to output the branch target instruction, which is likewise pushed to the processor core.
  • the processor core executes the branch instruction to generate a branch decision, and selects one of the two instructions to execute and discards the other branch.
  • the branch determination also controls the tracker to select one of the two buffer addresses accordingly, addressing the highest level buffer to continue pushing instructions to the processor core.
  • the present invention proposes a cache replacement method that determines replaceable cache blocks based on the degree of association between cache blocks.
  • the path from a branch source to its branch target is recorded in the track table.
  • the related table records the lower-level buffer address corresponding to the contents of each cache block, the branch source paths that jump into the cache block, and the count of branch sources that jump into the cache block.
  • the association degree of a cache block may be defined as the count of branch sources jumping into it; the smaller the count, the lower the association degree, and the sooner the block may be replaced.
  • cache blocks may also be replaced in the order of earlier replacements, oldest first, so that a block that has only just been filled is not immediately replaced again.
  • when a block is replaced, the track table entries addressed by the recorded jump-in branch source paths are rewritten with the lower-level buffer address of the block's contents taken from the related table, keeping the control flow information intact. The above is based on the degree of association within the same storage level.
  • the minimum-association-degree replacement method can also be applied between different storage levels.
  • the method records, for each cache block, the number of higher-level cache blocks whose contents are the same as that cache block as its association degree; the smaller the count, the lower the association degree, and the cache block with the smallest degree is replaced first.
  • This method may also be called the Least Children method, where a child is a higher-level cache block with the same contents as the cache block.
  • the number of track table entries that take the cache block as a branch target is also recorded (the cache block and the track table can be at different storage levels). When both counts are '0', the cache block can be replaced. If the child count is not '0', the child cache blocks can be replaced first to make the cache block replaceable.
  • if the branch-target count is not '0', the replacement may wait until it becomes '0', or the lower-level buffer address containing the block's contents may be substituted into those track table entries, after which the cache block is replaced.
  • the minimum-association-degree method between storage levels can also be combined with the oldest-replaced-first method described above.
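A compact sketch of the victim-selection rule described above, combining the association count with the oldest-replaced-first tie-break. The field names are illustrative assumptions, not taken from the patent.

```python
def pick_victim(blocks):
    """Choose the replaceable cache block: smallest association degree
    first ('assoc' = count of branch sources jumping into the block);
    among equal counts, prefer the block replaced longest ago ('age' =
    cycles since it was last filled, larger = older), so that a freshly
    filled block is not evicted immediately."""
    # Tuple ordering: compare assoc first, then prefer larger age.
    return min(blocks, key=lambda b: (b['assoc'], -b['age']))['id']
```

A real implementation would also check the child count and branch-target count described above before declaring a block replaceable; this sketch shows only the ordering criterion.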
  • the present invention provides a method of saving the register states of the tracker and processor core to a memory identified by a thread number.
  • threads are switched by swapping register states between that memory and the tracker and processor core. Because the instructions of different threads in the push cache of the present invention are independent, the cache does not need to be flushed on a thread switch, and no thread will execute another thread's instructions.
  • the present invention proposes a method and processor system that can execute instructions provided by a plurality of memory levels simultaneously.
  • the invention proposes a function call and function return method and system based on a track table.
  • the invention provides a computer memory hierarchy organization method and system in which all storage levels above the hard disk, including traditional main memory, are organized as caches and managed by hardware, without the operating system allocating memory.
  • This method needs no tag-unit matching when an instruction or data is read, which reduces delay.
  • the invention proposes a fully associative caching method that preserves the relationships between data level by level, and avoids the compare-and-match operation between addresses and tags by using bidirectional address mappings between the data at different levels.
  • the cache system pushes (serves) data to the processor core in advance, according to the previous executions of the same load instruction, the retained stride information, and the recorded relationships between data.
  • the present invention proposes a method and system for extracting and recording the interrelationships between logically organized data (i.e., data that contain the address information of other data).
  • the method and system learn autonomously from the results of executing load instructions, and the extracted logical relationships between data are retained in a data track table.
  • the entries in the data track table correspond one-to-one with the data storage table entries.
  • the data track entry corresponding to a 'data' item in the data memory retains the 'data type' produced by analyzing the relationships between the data.
  • the data track entry corresponding to an 'address' item in the data memory retains the mapped 'address pointer' of that address.
  • the 'address pointer' can directly address the data memory to read data, without tag-unit matching.
  • the method and system push data to the processor core according to the previously extracted logical relationships between the data.
  • the cache system reads data in advance and pushes it to the processor core, according to the logical relationships retained in the data track table from previous executions of the same load instruction, together with the comparison results provided by the processor core when executing the related instructions.
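The "retained stride information" mentioned above can be illustrated with a minimal stride learner: the cache observes the addresses touched by previous executions of the same load instruction and pushes the predicted next datum before the load executes again. This is a hedged sketch of the general stride-prediction idea, not the patent's exact mechanism; all names are assumptions.

```python
class DataPusher:
    """Toy stride-based data pusher: per load instruction, remember the
    last data address and the stride between consecutive executions."""
    def __init__(self):
        self.last_addr = {}  # load instruction id -> last data address
        self.stride = {}     # load instruction id -> learned stride

    def observe(self, load_id, addr):
        # Learn the stride from two consecutive executions of the load.
        if load_id in self.last_addr:
            self.stride[load_id] = addr - self.last_addr[load_id]
        self.last_addr[load_id] = addr

    def predict_next(self, load_id):
        # Address whose datum can be pushed ahead of the next execution,
        # once a stride has been learned; None before that.
        if load_id in self.stride:
            return self.last_addr[load_id] + self.stride[load_id]
        return None
```

Pointer-chasing data (the logically organized data above) would instead use the recorded 'address pointer' entries rather than a stride.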
  • the memory hierarchy method and system of the present invention actively pushes most instructions and data to the processor core; most of the time the processor core only needs to provide branch decisions or comparison results, and the processor's pipeline stall signal.
  • the present invention provides a memory hierarchy and method that can access a memory hierarchy at the other end of a communication channel with a uniform memory address.
  • the present invention provides a processor system including a processor core and a buffer that pushes instructions and data to the processor core for execution and processing by the processor core.
  • the system and method of the present invention can provide a fundamental solution to the two-way delay of processor core accesses to the buffer in a processor system.
  • in a conventional system, the processor core sends a memory address to the buffer, and the buffer transmits information (instructions or data) to the processor core based on that memory address.
  • the system and method of the present invention exploit the correlation between instructions to push instructions from the buffer to the processor core, thereby avoiding the delay of the processor core transmitting the memory address to the buffer.
  • the push buffer of the present invention is not in the pipeline structure of the processor core, so instructions can be pushed in advance to mask the delay from the buffer to the processor core.
  • the system and method of the present invention also provide a multi-level cache organization in which virtual-to-physical address translation and address mapping are performed only at the lowest level cache (LLC), rather than at the highest-level cache as in a conventional virtually addressed cache, and the address mapping is cached at each level.
  • Each cache level in this multi-level organization can be addressed by a buffer address derived from the physical-address mapping of the memory, so that the cost and power consumption of a fully associative cache approach those of a direct-mapped cache.
  • the system and method of the present invention also provides a cache replacement method based on the degree of association between cache blocks, which is suitable for a buffer that utilizes an inter-instruction relationship (control information flow).
  • FIG. 1 is an embodiment of a track table based cache system of the present invention
  • FIG. 3 is another embodiment of the processor system of the present invention.
  • FIG. 5 is another embodiment of the processor system of the present invention.
  • FIG. 6 is an address format of a processor system in the embodiment of FIG. 5;
  • Figure 7 is a partial storage table format of the processor system in the embodiment of Figure 5;
  • FIG. 8 is another embodiment of the processor system of the present invention.
  • FIG. 10 is a schematic diagram of a pipeline structure of a processor core in the processor system of the present invention.
  • FIG. 11 is another embodiment of the processor system of the present invention.
  • Figure 12 is an embodiment of the processor/memory system of the present invention.
  • Figure 13 is another embodiment of the processor/memory system of the present invention.
  • Figure 14 is a format of each storage table in the embodiment of Figure 13;
  • Figure 15 is an address format of a processor system in the embodiment of Figure 13 of the present invention.
  • FIG. 16 is a format of a data track table, a data active table, and a data related table according to the present invention
  • Figure 18 is another embodiment of the processor/memory system of the present invention.
  • FIG. 19 is a schematic diagram of the action mechanism of the data cache hierarchy in the embodiment of FIG. 18 of the present invention.
  • FIG. 21 is an embodiment of prefetching data organized in a logical relationship
  • FIG. 22 is an embodiment of processing a function call and a function return instruction
  • FIG. 1 shows an example of a cache system including a track table of the present invention.
  • 10 is an embodiment of the track table of the present invention.
  • the track table 10 has one row per cache block of the level 1 buffer 22; each row is a track corresponding to a level 1 cache block, and each entry on a track corresponds to one instruction in that cache block.
  • each L1 cache block in the L1 cache contains at most 4 instructions, whose intra-block offset addresses BNY are 0, 1, 2, and 3 respectively.
  • five instruction blocks in the level 1 buffer 22, whose level 1 buffer block addresses BN1X are 'J', 'K', 'L', 'M', and 'N', are described as an example. Accordingly there are 5 corresponding tracks in the track table 10; the up to 4 entries in each track correspond to the up to 4 instructions in the corresponding level 1 cache block, and BNY also addresses the entries within a track.
  • the track table 10 and the corresponding level 1 buffer 22 can both be addressed by the level 1 buffer address BN1, formed by the level 1 buffer block address BN1X and the intra-block offset address BNY, to read a track table entry and the corresponding instruction.
  • the fields 11, 12, and 13 in FIG. 1 form the entry format of the track table 10.
  • field 11 is the instruction type; according to the type of the corresponding instruction, entries fall into two categories: non-branch instructions and branch instructions.
  • the type of a branch instruction may be further subdivided into direct and indirect branches along one dimension, or into conditional and unconditional branches along another dimension.
  • Stored in field 12 is the buffer block address, and in field 13 the intra-block offset.
  • field 12 is in the level 1 buffer BN1X format
  • field 13 is in the BNY format.
  • the buffer address may also use other formats, in which case address format information can be added to field 11 to indicate the format of the addresses in fields 12 and 13.
  • a track table entry for a non-branch instruction stores only the non-branch type in its instruction type field 11; an entry for a branch instruction has the BNX field 12 and the BNY field 13 in addition to the instruction type field 11.
  • the value 'J3' in the entry 'M2' indicates that the branch target of the instruction corresponding to the 'M2' entry has the level 1 cache address 'J3'.
  • the corresponding instruction can be identified as a branch instruction from field 11 of the entry, and fields 12 and 13 show that its branch target is the instruction at address 'J3' in the level 1 buffer; that is, the instruction at BNY '3' of the 'J' instruction block in the level 1 buffer is the branch target instruction.
  • besides the columns for BNY '0' to '3', the track table 10 also contains an additional end column 16. Each end entry has only fields 11 and 12: field 11 stores an unconditional branch type, and field 12 stores the BN1X of the instruction block that sequentially follows the corresponding instruction block, so that the next instruction block can be found directly in the L1 cache from this BN1X, together with the track corresponding to the next instruction block in the track table 10.
  • the blank entries in the track table 10 correspond to non-branch instructions, and the remaining entries correspond to branch instructions.
  • the entries also show the level 1 cache address (BN1) of the branch target (instruction) of the corresponding branch instruction.
  • for a non-branch entry, the next instruction to be executed can only be the instruction represented by the entry to its right on the same track; for the last entry in a track, the next instruction to be executed can only be the first valid instruction in the level 1 cache block pointed to by the content of the end entry of that track; for a branch instruction entry on a track, the next instruction to be executed may be either the instruction represented by the entry to its right or the instruction pointed to by the BN in that entry, as selected by the branch decision. Therefore, the track table 10 contains all the program control flow information of all the instructions stored in the level 1 cache.
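The three control-flow rules above can be modeled directly. Below is an illustrative Python toy model (not the patent's hardware) of a track table holding the 'M2 → J3' branch entry of FIG. 1, with an end column supplying the next sequential block; the hypothetical end-column values are assumptions for the example.

```python
# One track per L1 cache block; branch entries map BNY -> target BN1.
track_table = {
    'M': {2: ('branch', ('J', 3))},   # instruction M2 branches to J3
    'J': {},                          # no branch instructions in block J
}
end_column = {'M': 'N', 'J': 'K'}     # next sequential block per track

def next_bn1(bnx, bny, taken):
    """Next BN1 = (block, offset), computed from control flow
    information alone -- the processor core never produces an address,
    it only supplies the branch decision 'taken'."""
    entry = track_table.get(bnx, {}).get(bny)
    if entry and taken:               # branch entry + taken: follow target
        return entry[1]
    if bny == 3:                      # last slot: follow the end column
        return (end_column[bnx], 0)
    return (bnx, bny + 1)             # otherwise fall through in the block
```

Every possible successor of every instruction is thus recoverable from the table, which is what lets the buffer push instructions ahead of execution.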
  • FIG. 2 is an embodiment of the processor system of the present invention.
  • a level 1 cache 22, a processor core 23, a controller 27, and a track table 20 like the track table 10 of FIG. 1 are included.
  • The incrementer 24, the selector 25, and the register 26 form a tracker 47 (within the dotted line).
  • the processor core 23 controls the selector 25 in the tracker with the branch decision 31, and controls the register 26 in the tracker with the pipeline stop signal 32.
  • the selector 25 is controlled by the controller 27 and the branch decision 31 to select the output 29 of the track table 20 or the output of the incrementer 24.
  • the output of the selector 25 is registered by register 26, and the output 28 of register 26 is called the read pointer (Read Pointer, RPT); its format is BN1.
  • the data width of the incrementer 24 equals the width of BNY; it increments only the BNY of the read pointer by '1' without affecting the value of BN1X. If the increment result overflows (i.e., exceeds the capacity of a level 1 cache block) and the carry output of the incrementer 24 is '1', the system looks up the BN1X of the next level 1 cache block in the end column to replace the current block's BN1X; the following embodiments behave the same unless otherwise stated.
  • the tracker in this embodiment addresses the track table 20 with the read pointer 28 to output an entry on bus 29, and also addresses the level 1 cache 22 to read the corresponding instruction for execution by the processor core 23.
  • the controller 27 decodes field 11 of the entry output on bus 29. If the instruction type in field 11 is non-branch, the controller 27 makes the selector 25 select the output of the incrementer 24; the read pointer is incremented by '1' in the next clock cycle, and the sequential next (fall-through) instruction is read from the level 1 cache 22.
  • if the instruction type is an unconditional direct branch, the controller 27 makes the selector 25 select fields 12 and 13 on bus 29; in the next cycle the read pointer 28 points to the branch target, and the branch target instruction is read from the level 1 cache 22. If the instruction type in field 11 is a direct conditional branch, the controller 27 lets the branch decision 31 control the selector 25: if the decision is not to take the branch, in the next cycle the read pointer 28 is incremented by '1' by the incrementer 24 and the sequential instruction is read from the level 1 cache 22; if the decision is to take the branch, in the next cycle the read pointer points to the branch target and the branch target instruction is read from the level 1 cache 22. When the pipeline in the processor core 23 stalls, the pipeline stall signal 32 halts the update of register 26 in the tracker, so the cache system stops providing new instructions to the processor core 23.
  • the non-branch entries in the track table 10 can be discarded to compress the track table.
  • the compressed track table entry format adds, in addition to the original fields 11, 12, and 13, a source BNY (SBNY) field 15 that records the (source) intra-block offset address of the branch instruction itself, because compression shifts entries horizontally in the table: the order between branch entries is preserved, but they can no longer be addressed directly by BNY.
  • the compressed track table 14 stores the same control flow information as the track table 10 in the compressed entry format. Only the SBNY field 15, the BNX field 12, and the BNY field 13 are shown in the track table 14.
  • the entry '1N2' in row K indicates that the entry represents the branch instruction at address K1, whose branch target is N2.
  • the end column 16 is the rightmost column of the track table 14 and is output through an independent read port 30.
  • the BN1X of the read pointer is used to read out the SBNY field 15 values of all entries in the corresponding row.
  • each SBNY value is sent to the comparator of its column (such as comparator 18), where it is compared with the BNY part 17 of the read pointer.
  • each of these comparators outputs '0' if the SBNY value of its column is less than the BNY, and '1' otherwise.
  • for convenience of description, the cache system is assumed to provide one instruction per clock cycle.
  • FIG. 3 is another embodiment of the processor system of the present invention.
  • 20 is the track table of the level 1 cache
  • 22 is the memory RAM of the first-level buffer
  • 39 is the Instruction Read Buffer (IRB)
  • 47 is the tracker.
  • 91 is a register
  • 92 is a selector
  • 23 is a processor core.
  • The instruction read buffer IRB 39 may store part of a level 1 instruction cache block, or one or more whole level 1 instruction cache blocks, and is addressed by the read pointer 28 of the tracker 47.
  • Read pointer 28 also addresses track table 20.
  • the branch target address output by the track table on bus 29 addresses the level 1 cache 22 and is also sent to the tracker 47.
  • The IRB 39 together with the level 1 buffer memory 22 forms a dual-read-port memory.
  • the IRB 39 provides the first read port, the memory 22 provides the second read port, and the register 91 temporarily stores the data output by the second read port.
  • The output of the IRB 39 and the output of the register 91 (holding the level 1 buffer 22 output) are selected by the selector 92 under control of the branch decision 31 output by the processor core 23, and the output of the selector 92 is sent to the processor core 23 for execution.
  • Each entry in the end column 16 of table 14 is of the unconditional direct branch type.
  • the other entries in 14 are of the direct conditional branch type.
  • At the beginning the read pointer 28 points to the address 'L0'; the corresponding instruction is read from the IRB 39, and the default value of the branch decision 31 makes the selector 92 select the instruction from the IRB 39 for execution by the processor core 23.
  • At the same time the address 'L0' on the read pointer 28 addresses the track table 14; the entry '0M1' is output on bus 29, the level 1 buffer 22 is addressed by 'M1' on bus 29, and the corresponding branch target instruction is stored in the register 91.
  • the controller 27 compares the SBNY field 15 on bus 29 with the BNY field 13 of the read pointer 28 and finds them equal, so the selector 92 is controlled by the branch decision 31. Assuming 31 is 'no branch' at this time, 31 makes the selector 92 select the output of the IRB 39 in the next clock cycle.
  • the read pointer 28 steps to the address 'L1'; the corresponding instruction is read from the IRB 39 and selected by the selector 92 for execution by the processor core 23.
  • At the same time the address 'L1' on the read pointer 28 addresses the track table 14; the entry '3J0' is output on bus 29, the level 1 buffer 22 is addressed by 'J0' on bus 29, and the corresponding instruction is read into the register 91 as the branch target instruction.
  • the controller 27 compares the SBNY field 15 on bus 29 with the BNY field 13 of the read pointer 28 and finds them unequal, so the selector 92 by default selects the output of the IRB 39 for execution by the processor core 23.
  • the read pointer 28 steps to the address 'L2'; the controller 27 finds that the SBNY field 15 on bus 29 and the BNY field 13 of the read pointer 28 are still unequal, so 27 still makes the selector 92 select the output of the IRB 39 for execution by the processor core 23.
  • the read pointer 28 steps to the address 'L3'; the controller 27 now finds that the SBNY field 15 on bus 29 and the BNY field 13 of the read pointer 28 are equal, so the selector 92 is controlled by the branch decision 31.
  • assuming the branch is taken this time, the branch decision 31 makes the selector 92 select the output of the register 91, i.e., the branch target instruction at address 'J0', for execution by the processor core 23.
  • the branch decision 31 also controls the tracker 47 to load 'J0' from bus 29 into the read pointer 28, and controls the 'J' level 1 cache block to be stored into the IRB 39.
  • the read pointer 28 then steps to 'J1' and addresses the IRB 39 to output the corresponding instruction, which is selected by the selector 92 for execution by the processor core 23.
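The walkthrough above, simplified to the single compressed entry '3J0' of block 'L', can be replayed as a toy simulation (illustrative Python, not the hardware; the read pointer starts at 'L0' and the branch decision sequence drives it):

```python
def run(track_entry, decisions):
    """Replay the walkthrough: track_entry is (sbny, target); decisions
    is the branch outcome per cycle. The entry fires only when the read
    pointer's BNY equals its SBNY; a taken branch then redirects the
    pointer to the target. Returns the executed (block, bny) trace."""
    sbny, target = track_entry
    block, bny, trace = 'L', 0, []
    for taken in decisions:
        trace.append((block, bny))
        if bny == sbny and taken:
            block, bny = target   # branch taken: load target address
        else:
            bny += 1              # otherwise step the read pointer
    return trace
```

Running it with the decision pattern of the walkthrough reproduces the execution order L0, L1, L2, L3, J0.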
  • FIG. 4 is another embodiment of the processor system of the present invention.
  • 40 is the secondary active table (Active List 2, AL2)
  • 41 is the address translation buffer TLB and tag unit TAG of the level 2 cache
  • 42 is the memory RAM of the level 2 cache
  • 43 is the scanner
  • 44 is the selector
  • 20 is the track table of the level 1 cache
  • 22 is the memory RAM of the level 1 buffer
  • 27 is the controller
  • 33 is the selector
  • 39 is the instruction read buffer IRB.
  • the incrementer 24, the selector 25, together with the register 26 constitute a tracker 47
  • the incrementer 34, the selector 35, together with the register 36 constitute a tracker 48
• 23 is the processor core, which can receive two instructions and, under branch control, select one to complete execution while abandoning execution of the other branch
  • 45 is a register for temporarily storing the state of each thread of the processor.
• The scanner 43 examines the instruction blocks stored from the L2 cache memory 42 into the L1 cache memory 22, and calculates the branch target address of each direct branch instruction therein by adding the branch offset to the memory address of the branch instruction itself.
  • the calculated branch target address is selected by the selector 44 and sent to the TLB/tag unit 41 for matching.
• The level 2 cache address BN2 obtained by matching addresses the level 2 active table 40. If the instruction corresponding to that L2 cache address has already been stored in the L1 cache memory 22, the corresponding entry in 40 is valid; the BN1X block address in the entry, the branch instruction type generated by the scanner 43, and the intra-block offset BNY are then merged into one track table entry. If the corresponding entry in 40 is invalid, the L2 cache address BN2 (including the intra-block offset BNY) and the branch instruction type generated by the scanner 43 are merged into one track table entry.
• Each track table entry thus generated for an instruction block is written in instruction order into the track in the track table 20 corresponding to that instruction block in the memory 22; that is, the extraction and storage of the program flow in the instruction block is completed.
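• As a hedged illustration of this track-building step, the sketch below computes each direct branch target as the branch's own address plus its offset and records it in BN1 or BN2 format depending on the active-table lookup; the data layout, block arithmetic and names are this sketch's own simplifications:

```python
# Illustrative track-building sketch: for each direct branch in an instruction
# block, compute target = branch address + branch offset, then record the target
# in BN1 format (target block already in L1) or BN2 format (not yet in L1).
# al2 stands in for the level 2 active table 40: {bn2x: bn1x or absent}.

def build_track(block_addr, instrs, al2):
    """instrs: list of (is_branch, offset) per BNY position.
    Returns track entries (sbny, fmt, block_number, bny) for branches only."""
    track = []
    for bny, (is_branch, offset) in enumerate(instrs):
        if not is_branch:
            continue
        target = block_addr + bny + offset        # branch address + branch offset
        bn2x, t_bny = divmod(target, len(instrs)) # split into block number and BNY
        bn1x = al2.get(bn2x)                      # active-table lookup
        if bn1x is not None:                      # valid entry: use BN1 format
            track.append((bny, 'BN1', bn1x, t_bny))
        else:                                     # invalid: keep the BN2 address
            track.append((bny, 'BN2', bn2x, t_bny))
    return track
```

• In either case the intra-block offset BNY of the target is carried over unchanged, which mirrors the statement later in the description that the BNY portion does not change during address conversion.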
  • the read pointer 28 generated by the tracker 47 addresses the track table 20 to read the entry through the bus 29.
• The controller 27 decodes the branch type and address format in the output entry. If the branch type in the output entry is a direct branch and the cache address is in the BN2 format, the controller 27 addresses the level 2 active table 40 with the BN2 address. If the addressed entry in 40 is valid, the BN1X in that entry is filled into the track table 20 in place of the BN2X in the above entry, so that it becomes the BN1 format; if the entry in 40 is invalid, the level 2 cache memory 42 is addressed by the BN2 address.
• The instruction block read out is filled into a level 1 cache block provided by the level 1 cache replacement logic in the level 1 cache memory 22; the block number BN1X of that level 1 cache block is filled into the above invalid entry in 40 and the entry is set valid, and the BN1X is likewise filled into the above track table entry, replacing the BN2 address in the entry with a BN1 address.
  • the BN1 address of the write track table 20 described above can be bypassed onto the bus 29 and sent to the tracker 47 for later use. If the branch type output via the bus 29 is a direct branch and the buffer address is in the BN1 format, the controller 27 causes it to be sent directly to the tracker 47 for later use.
• For an indirect branch, the controller 27 controls the tracker to wait for the processor core 23 to calculate the indirect branch target address via the bus 46, which the selector 44 sends to the level 2 cache TLB/tag unit 41 for matching; the matched level 2 cache address BN2 then accesses the level 2 active table 40. If the corresponding entry in 40 is invalid, the level 2 cache memory 42 is addressed by the BN2 address as described above and the instruction block read out fills a cache block in the level 1 cache memory 22; the resulting BN1 address is bypassed to the tracker 47 for use.
• The Correlation Table (which may also be referred to as an association table) 37 is a component of the replacement logic of the level 1 cache 22; its structure and function are described in the embodiment of FIG. 7.
• The branch decision pipeline segment in the processor core 23 is preceded by two front-end pipelines: one receives sequential instructions from the instruction read buffer IRB 39 and is named the FT (Fall-Through) branch; the other receives branch target instructions from the level 1 cache memory 22 and is named the TG (Target) branch.
  • the number of front-end pipeline segments included in the two branches is determined by the pipeline structure of the processor. In this embodiment, two front-end pipeline segments are included in each of the two branches as an example.
  • the branch in the processor core 23 determines that the pipeline segment executes the branch instruction, selects one of the two instructions to complete execution based on the generated branch decision 31, and discards execution of the other branch.
• Taking as an example an IRB 39 that can store two instruction blocks, the instruction read buffer IRB 39 is addressed by the IPT read pointer 38 of the tracker 48.
• The level 1 cache memory 22, the correlation table 37 and the track table 20 are addressed by the RPT read pointer 28 of the tracker 47.
• The default value of the branch judgment 31 is '0', i.e., no branch, and the processor core 23 selects the instruction of the FT branch to execute. When the processor core 23 makes a judgment on the branch: if the judgment is 'no branch', the value of the branch judgment 31 is '0' and the processor core 23 selects the instruction of the FT branch to execute; if the judgment is 'branch', the value of the branch judgment 31 is '1' and the processor core 23 selects the instruction of the TG branch to execute.
• The selectors 33, 25 and 35 can all be controlled by the branch judgment 31: when 31 is '0', the three selectors select the input on the right; when 31 is '1', they select the input on the left. In addition, when the processor core 23 has not yet made a judgment on the branch, the selectors 33 and 25 are also controlled by the controller 27.
  • the operation of the processor system of the embodiment of Fig. 4 is described below in conjunction with the contents of track table 14 of Fig. 1.
• Assume the M instruction block is already in the instruction read buffer IRB 39, the branch decision 31 is '1', the selectors 25 and 35 select the left input, and both the IPT read pointer 38 and the RPT read pointer 28 point to address M1.
• The M1 instruction in the IRB 39 pointed to by IPT 38 is sent to the FT branch front-end pipeline in the processor core; at the same time, RPT 28 addresses the track table 20, and the value 'N' of the end entry 16 of row M is read from the independent read port 30 to address the level 1 cache 22 and output the N instruction block to the IRB 39.
• The entry '2J3' at BNY address '1' in row M of the track table 14 is output via the bus 29.
• At this time the branch judgment 31 is the default value '0', and the selector 35 selects the input of the incrementer 34, stepping IPT 38. The controller 27 compares the value '2' in the SBNY field 15 on the bus 29 with the value '1' in the BNY field 13 on RPT 28; since they are not equal, the selector 25 is controlled to select the output of the incrementer 24 to step RPT 28 to point to M2, at which point the SBNY on the bus 29 and the BNY on RPT 28 are equal. The controller 27 therefore controls the selector 33 and the selector 25 to select the input on the right, that is, the BN1 address J3 on the bus 29 is stored into the register 26. Thereafter, the controller 27 controls the RPT read pointer 28 to read J3 from the level 1 cache 22, and the J3 and subsequent K0 instructions are sent to the TG branch front-end pipeline of the processor core 23.
  • M2 is a branch instruction.
• The branch decision pipeline segment executes the M2 instruction to generate a branch decision. If the branch judgment 31 is '0', the processor core 23 selects the M3 and N0 instructions in the FT branch to continue executing, and the J3 and K0 instructions in the TG branch are discarded.
• The branch judgment 31 also controls the selectors 25 and 35 to select the output of the incrementer 34 to be stored in the registers 26 and 36, so that both RPT 28 and IPT 38 point to N1; IPT 38 controls the IRB 39 to output N1 and subsequent instructions to the FT branch of the processor core 23 for continuous execution. RPT 28 points to row N in the track table, reads the end entry of row N, and sends it to the level 1 cache 22 to read out the instruction block sequentially following the N instruction block and store it into the IRB 39.
• If the branch judgment 31 is '1', the processor core selects the J3 and K0 instructions in the TG branch to continue executing, and the M3 and N0 instructions in the FT branch are discarded.
• The branch judgment 31 also controls storing the K line instructions output by the level 1 cache 22 into the IRB 39, and controls the selectors 25 and 35 to select the output of the incrementer 24 to be stored in the registers 26 and 36, so that both RPT 28 and IPT 38 point to K1; IPT 38 controls the IRB 39 to output K1 and subsequent instructions to the FT branch of the processor core 23 for continuous execution.
• RPT 28 points to row K, and the end entry 'L' of row K is sent to the level 1 cache 22 to read out the L instruction block and store it into the IRB 39.
• Thus the processor core 23 can execute instructions without interruption and without pipeline stalls caused by branches.
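• The dual front-end behavior above can be caricatured in a few lines of code; the state tuples, the two-block IRB list and the function name are assumptions of this model, not the embodiment's structures:

```python
# Toy model of the FT/TG dual front end: each path presents its next
# (pointer, instruction-block) state, and branch decision 31 commits one
# while the other is abandoned. irb models the two-block instruction read
# buffer; on a taken branch the target block is filled from the L1 cache,
# evicting the oldest block in this sketch.

def commit(ft_state, tg_state, taken, irb):
    """ft_state/tg_state: (next_pointer, block_id). Returns the committed
    next pointer and updates irb in place."""
    ptr, block = tg_state if taken else ft_state
    if block not in irb:       # TG target block must be brought into the IRB
        irb.pop(0)
        irb.append(block)
    return ptr
```

• With the IRB holding blocks M and N, a taken branch to K1 commits the TG state and swaps K into the IRB, while a not-taken branch commits N1 and leaves the IRB unchanged, mirroring the two outcomes of the M2 branch above.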
  • the tracks corresponding to different threads in the track table are orthogonal, so they can coexist and do not affect each other.
• The indirect branch address 46 generated by the processor core in FIG. 4 is a virtual address; after being combined with the thread number it is selected by the selector 44, its index portion being sent to both the TLB and the level 2 tag unit in 41, while the virtual tag portion together with the thread number is sent to the TLB and mapped to a physical tag. The physical tag is matched against the tags of each way read out by the index address in the level 2 tag unit, and the resulting way number (Way Number) is combined with the index (Index) in the virtual address to form the level 2 cache block address.
• The level 2 cache address BN2 and the level 1 cache address BN1 obtained by the mapping are thus effectively mapped from the physical address rather than the virtual address. Therefore, the same virtual address in two different threads of the processor maps to different cache addresses BN, which avoids the address aliasing problem of the same virtual address in different programs of different threads addressing the same cache address.
• The same virtual address of the same program in different threads is mapped to the same physical address, so its mapped cache address is also the same, avoiding duplication of the same program in the cache. Based on this property of the cache address, multi-threaded operation can be implemented. 45 in FIG. 4 is a register bank in which the thread number and the state registers in the processor are stored per thread, such as the contents of the register 26 in the tracker 47 and the register 36 in the tracker 48 of FIG. 4, as well as the values of that thread's registers in the processor core 23; 45 is addressed by the thread number 49.
• When the processor is to switch threads, the values in the registers 26 and 36 in the trackers 47 and 48, and the values in the registers in the processor core 23, are all read out and stored into 45. The thread number on the bus 49 is then changed to the swapped-in thread number, which is transmitted by the bus 49 to 45, and the contents of the entry pointed to by that thread number are swapped into the registers 26 and 36 and the registers in the processor core 23. Once the instruction block pointed to by IPT 38 and the sequentially next instruction block are in the IRB 39, operation of the swapped-in thread can begin.
• The instructions of each thread in the track table 20 and in the caches 42 and 22 are orthogonal, so one thread cannot mistakenly execute an instruction of another thread.
  • FIG. 5 is another embodiment of the processor system of the present invention.
• The level 2 active table 40, the level 2 cache memory RAM 42, the level 2 scanner 43, the track table 20, the level 1 cache correlation table 37, the level 1 cache memory RAM 22, the instruction read buffer 39, the tracker 47, the tracker 48 and the processor core 23 have the same functions as the modules of the same numbers in the embodiment of FIG. 4; although the controller 27 and the selector 33 are omitted from FIG. 5 to make the drawing easy to read, operation from the level 2 cache downward is the same as in the embodiment of FIG. 4.
• Compared with FIG. 4, a level 3 cache is added, consisting of a level 3 active table 50, the TLB and tag unit TAG 51 of the level 3 cache, the level 3 cache memory 52, the level 3 scanner 53 and the selector 54, which replace the TLB and tag unit 41 of the level 2 cache and the selector 44 in FIG. 4.
• In the embodiment of FIG. 5, the last level cache (Last Level Cache), i.e., the level 3 cache 52, is organized in a set-associative manner, while the level 2 cache 42 and the level 1 cache 22 are both organized in a fully associative manner. Each L2 cache block in 42 holds four L1 cache blocks, and each L3 cache block in the L3 cache 52 holds four L2 cache blocks.
  • FIG. 6 is an address format of the processor system in the embodiment of FIG. 5.
• The memory address is divided into a tag (Tag) 61, an index (Index) 62, a level 2 subaddress 63, a level 1 subaddress 64, and an intra-block offset (BNY) 13.
• The address BN3 of the level 3 cache is composed of a way number 65, an index 62, a level 2 subaddress 63, a level 1 subaddress 64, and an intra-block offset (BNY) 13; the way number 65 and the index 62 together address a level 3 cache block.
• The address BN2 of the level 2 cache is composed of a level 2 cache block number 67, a level 1 subaddress 64, and an intra-block offset (BNY) 13; the level 2 cache block number 67 addresses a level 2 cache block, and 67 together with 64, collectively referred to as BN2X, addresses a level 1 instruction block within that level 2 cache block.
• The address BN1 of the level 1 cache is composed of a level 1 cache block number 68 (BN1X) and an intra-block offset (BNY) 13.
  • the intra-block offset (BNY) 13 in the above four address formats is the same, and the BNY portion does not change when the address conversion is performed.
  • the secondary block number 67 points to a secondary cache block
  • the primary subaddress 64 points to one of the four primary instruction blocks in the secondary cache block.
• The way number 65 and the index 62 point to a level 3 cache block; the level 2 subaddress 63 points to one of its four level 2 instruction blocks; and the level 1 subaddress 64 points to one of the four level 1 instruction blocks in the selected level 2 instruction block.
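• The field decomposition can be sketched as follows; the 2-bit widths of BNY 13 and the subaddresses 63/64 follow the four-sub-blocks organization above, while the index and way widths are arbitrary assumptions of this sketch:

```python
# Illustrative decomposition of the FIG. 6 address formats. The 2-bit subfields
# reflect "four L1 blocks per L2 block, four L2 blocks per L3 block"; the 6-bit
# index and the way width are assumptions, not taken from the description.

BNY_W, SUB1_W, SUB2_W = 2, 2, 2

def split_bn3(bn3, index_w=6):
    """Decompose a BN3 address into (way 65, index 62, sub2 63, sub1 64, BNY 13)."""
    bny   = bn3 & 0b11
    sub1  = (bn3 >> BNY_W) & 0b11
    sub2  = (bn3 >> (BNY_W + SUB1_W)) & 0b11
    index = (bn3 >> (BNY_W + SUB1_W + SUB2_W)) & ((1 << index_w) - 1)
    way   = bn3 >> (BNY_W + SUB1_W + SUB2_W + index_w)
    return way, index, sub2, sub1, bny

def bn3_to_bn2(bn2x_table, way, index, sub2, sub1, bny):
    """BN3 converts to BN2 by looking up the level 2 block number 67 recorded
    for (way, index, sub2); sub1 and BNY carry over unchanged, as stated above."""
    bn2x_block = bn2x_table[(way, index, sub2)]   # stand-in for field 80 lookup
    return (bn2x_block, sub1, bny)
```

• Note that the BNY field occupies the same low-order position in every format, which is why the description can state that the BNY portion does not change during address conversion.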
• FIG. 7 shows partial storage table formats of the processor system in the embodiment of FIG. 5, described below with reference to FIG. 5, FIG. 6 and FIG. 7.
  • the format of the tag unit in 51 of Figure 5 is the physical tag 86.
  • the CAM format of the TLB in 51 is the thread number 83 and the virtual tag 84
  • the RAM format is the physical tag 85.
• The thread number 83 and virtual tag 84 of the output selected by the selector 54 are mapped in the TLB to the physical tag 85; the physical tags 86 read out of the tag unit by the index address 62 in the virtual address are matched against 85 to obtain the way number 65. The way number 65 is combined with the index address 62 in the virtual address to form the level 3 cache block address.
• The AL3 level 3 active table 50 of FIG. 5 is organized in a set-associative manner, with the same number of ways and rows as the tag unit in 51 and the L3 cache 52, and is likewise addressed by the index address 62.
• Each row has a count field 79 and four BN2X fields 80; the 80 fields in the same row are addressed by the level 2 subaddress 63.
  • Each 80 field has its corresponding valid bit 81.
  • the same line of each way shares a three-level pointer 82.
• The AL2 level 2 active table 40 is organized in a fully associative manner, with the same number of rows as the L2 cache 42, and is addressed by the level 2 block address 67.
• The CT correlation table 37 is organized in a fully associative manner, with the same number of rows as the L1 cache 22, and is addressed by the level 1 block address 68.
• When a level 2 cache block is stored into the level 2 cache memory 42, its block number is stored into the entry 80 addressed by the level 2 subaddress 63 in the corresponding row of the level 3 active table 50, and the corresponding valid bit 81 is set to '1' (valid).
  • the instructions in the L2 cache block are decoded by a three-level scanner 53, wherein the branch offset in the branch instruction is added to the address of the instruction to obtain the branch target address.
• The address of the sequentially next L2 cache block is also obtained by adding the size of one L2 cache block to the memory address of the current L2 cache block.
• The branch target address or the next L2 cache block address is selected by the selector 54 and sent to the tag unit 51 for matching. If there is no match, the address is sent to the lower level memory to read the instructions, which are stored into the L3 cache memory 52. This ensures that for the instructions in the L2 cache memory 42, their branch targets and the sequentially next L2 cache block are at least already in the L3 cache memory 52 or in the process of being stored into 52.
• When a level 1 instruction block in one of the level 2 cache blocks in the level 2 cache 42 is stored into a level 1 cache block in the level 1 cache 22, the block number of that level 1 cache block in 22 is stored into the entry 76 addressed by the level 1 subaddress 64 in the corresponding row of the level 2 active table 40, and its corresponding valid bit 77 is set to '1' (valid).
  • the instructions in the level one cache block are decoded by the secondary scanner 43, wherein the branch offset in the branch instruction is added to the address of the instruction to obtain the branch target address.
  • the address of the next level one cache block in the first level cache block is also obtained by the memory address of the first level cache block plus the size of one level one cache block.
• The branch target address or the sequentially next level 1 cache block address is selected by the selector 54 and sent to the tag unit 51 for matching. If there is no match, the address is sent to the lower level memory to read the instructions, which are stored into the level 3 cache memory 52. The entries 80 and 81 in the level 3 active table 50 are then read out using the 65, 62, 63 portions of the matched level 3 cache address. If 81 is '0' (invalid), the level 3 cache memory 52 is addressed by the 65, 62, 63, 64 portions of the matched level 3 cache address, a level 2 cache block is read out and stored into the level 2 cache, and the block number 67 of that level 2 cache block together with the valid value '1' is written into the entries 80 and 81 addressed by the above level 3 cache address in the level 3 active table 50.
• The AL2 level 2 active table 40 then reads the entries 76 and 77 using the BN2X value (67 and 64) in the read-out entry 80. If 77 is '0' (invalid), the BN2 address (67, 64, 13) formed by combining that BN2X value with the BNY is stored into the entry corresponding to the branch instruction on the track being filled in the track table 20. If 77 is '1' (valid), the BN1 address (68, 13) formed by the BN1X in the entry 76 and the BNY is stored into the track table 20 entry corresponding to the branch instruction.
  • the branch type 11 decoded by the secondary scanner 43 is also stored in the entry of the track of the track table 20 together with the above BN2 or BN1 address.
• The address of the sequentially next level 1 cache block is also matched and addressed in the above manner. If the next instruction block is not already in the level 2 cache memory, the instruction block is stored from the level 3 cache 52 into the level 2 cache 42, and the resulting BN2 or BN1 address is stored into the end entry 16 at the far right of the above track. This ensures that for the instructions in the level 1 cache memory 22, their branch targets and the sequentially next level 1 cache block are at least already in the level 2 cache memory 42 or in the process of being stored into 42.
• This embodiment thus discloses a hierarchical prefetching function: each storage level ensures that the branch targets of its instructions are at least already in, or being written into, the next lower storage level. This allows the branch target instructions of the instruction being executed by the processor core to be in the level 1 or level 2 cache in most cases, masking the access latency of lower memory levels.
• While a level 1 instruction block is being filled into the level 1 cache memory 22, the corresponding row in the correlation table 37 is also established, and the track of that instruction block is created and filled into the corresponding track in the track table 20; the target entries hold the block numbers BN1X of level 1 cache blocks so as to maintain the integrity of the control information flow in the track table. Whenever a branch target BN1X is written into a track of the track table 20, the row in the correlation table 37 addressed by that BN1X has its count value 70 incremented by '1', recording that one more branch instruction takes this cache block as its target; the level 1 cache block number of the track being written is itself written into a 72 field of that row, and the corresponding 73 field is set to '1' (valid), recording the path (address) of the branch source. A row in the correlation table 37 is addressed and updated in the same manner when the end entry 16 of a track is written.
  • the branch target address format in the entry of the track table 20 may be in the BN2 or BN1 format as described above.
• When an entry is read out, the controller decodes the branch type 11 in it. If the address format is BN2, the controller uses the BN2X address (67 and 64) on the bus 29 to address the level 2 active table 40 and reads the entries 76 and 77. If 77 is '0' (invalid), the L2 cache memory 42 is addressed by the BN2X address, a level 1 instruction block is read out into a level 1 cache block in the level 1 cache memory 22, and that level 1 cache block number together with the valid value '1' is stored into the entries 76 and 77 pointed to by the above BN2X address in the level 2 active table 40. If 77 is '1' (valid), the BN1X 68 in 76 is written into the field 12 of the track table entry without changing the BNY in the field 13, thus replacing the original BN2 address with a BN1 address.
  • the BN1X address can be bypassed onto bus 29 for use by tracker 47.
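• A hedged model of this BN2-to-BN1 promotion follows; the dictionary-based tables and the allocation argument stand in for the active table 40, the caches 42/22 and the replacement logic:

```python
# Illustrative model of promoting a track table entry from BN2 to BN1 format.
# al2 models the level 2 active table 40 as {bn2x: (valid77, bn1x76)};
# l2 and l1 model the cache memories 42 and 22; next_bn1x stands in for the
# level 1 cache block provided by the replacement logic.

def to_bn1(entry, al2, l2, l1, next_bn1x):
    """entry: {'fmt': 'BN1'|'BN2', 'x': block number, 'bny': offset}.
    Returns the entry in BN1 format, filling L1 from L2 on a miss."""
    if entry['fmt'] == 'BN1':
        return entry
    valid, bn1x = al2.get(entry['x'], (False, None))
    if not valid:                       # 77 == '0': fill a level 1 cache block
        bn1x = next_bn1x
        l1[bn1x] = l2[entry['x']]       # read the L1 block out of the L2 block
        al2[entry['x']] = (True, bn1x)  # set 77 valid, record BN1X in field 76
    return {'fmt': 'BN1', 'x': bn1x, 'bny': entry['bny']}   # BNY unchanged
```

• A second lookup of the same BN2X then hits the now-valid entry and returns the same BN1X without touching the caches, which is the behavior the valid bit 77 exists to provide.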
• The tracker 47 addresses the track table 20 and the level 1 cache memory 22; the tracker 48 addresses the IRB 39. The process of providing the processor core 23 with uninterrupted instructions for execution is the same as in the embodiment of FIG. 4 and is not described again here.
• The cache replacement logic of this embodiment determines the cache blocks that can be replaced by a method combining Least Correlation (LC) and Earliest Replacement (ER).
• The count value 70 in the correlation table 37 is used to detect the correlation (also called the degree of association): the smaller the count value, the fewer the cache blocks that take this level 1 cache block as a branch target, and the more suitable it is for replacement.
  • the pointer 74 shared by each row in the associated table 37 points to a row that can be replaced (the count value 70 in the replaceable row must be lower than a preset value).
• When a level 1 cache block is replaced, the corresponding track pointed to by 74 in the track table 20 is also replaced with the branch types and branch targets extracted by the level 2 scanner 43 from the replacing cache block. In addition, for each 73 field that is '1' (valid) in that row of the correlation table 37, the track in the track table 20 addressed by the BN1X address in the corresponding 72 field is accessed, and the branch target addresses recorded there with the replaced level 1 cache block number are replaced with the BN2X in the 71 field of the row indicated by 74 in the correlation table 37, so that the instructions originally taken as branch targets in the replaced level 1 cache block are now taken as the same instructions in the level 2 cache; replacing the level 1 cache block therefore does not affect the control information flow.
• That BN2X is also used to address the level 2 active table 40, and the count value 75 in the entry of 40 is increased by the number of times BN1X was replaced by that BN2X value in the track table 20, recording the increased correlation of the level 2 cache block.
• The pointer 74 then moves in a single direction, stopping on the next row that satisfies least correlation; when the pointer passes the boundary of all rows in the correlation table 37, it moves to the other boundary (e.g., past the row with the largest address, least-correlation detection restarts from the row with the smallest address). The one-way movement of the pointer 74 ensures that the level 1 cache block replaced earliest is preferentially replaced, i.e., the ER above. Detecting the count value 70 of each row together with the one-way movement of the pointer 74 implements the LCER level 1 cache replacement strategy. This replacement method replaces a single level 1 cache block at a time.
• The replacement may also be continued in order or in reverse order until a level 1 cache block is encountered whose count value 70 in the corresponding correlation table 37 exceeds a preset value.
• This replacement method replaces a plurality of L1 cache blocks at a time. Singular or plural replacement can be used as needed, and the two methods can also be mixed: for example, singular replacement is used normally, and plural replacement is used when the lower level cache lacks replaceable cache blocks.
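• The LC detection combined with the one-way ER pointer can be sketched as follows; the row layout and threshold handling are simplifications of this model:

```python
# Minimal sketch of the LC+ER victim scan: pointer 74 moves one way over the
# correlation-table rows, wrapping at the boundary, and stops at the first row
# whose count value 70 is below the preset threshold (least correlation).

def next_victim(counts, ptr, threshold):
    """counts: per-row count value 70; ptr: current position of pointer 74.
    Returns (victim_row, new_ptr), or (None, ptr) if no row qualifies."""
    n = len(counts)
    for step in range(1, n + 1):
        row = (ptr + step) % n          # one-way movement with wrap-around (ER)
        if counts[row] < threshold:     # least correlation (LC)
            return row, row
        # rows at or above the threshold are skipped, not replaced
    return None, ptr
```

• Because the pointer never moves backwards, a block filled just after the pointer passed its row is not examined again until a full sweep completes, which is the earliest-replacement property the text describes.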
  • the replacement of the L2 cache is also based on the LCER strategy.
• When a level 1 cache block is replaced, the corresponding 77 field in the level 2 active table 40 is set to '0' and the count value 75 is increased; when a cache block is stored from the level 2 cache memory 42 into the level 1 cache memory 22, the corresponding valid bit 77 in the corresponding entry of the level 2 active table 40 is set to '1' and the level 1 cache block number BN1X is written into the corresponding 76 field. Each time a BN2X is written into a track table entry, the count value 75 corresponding to that BN2X in the level 2 active table 40 is incremented by '1'; each time a BN2X in a track table entry is replaced by a BN1X, the count value 75 corresponding to that BN2X in the level 2 active table 40 is decremented by '1'.
• The count value 75 thus records the number of times a level 2 cache block is used as a branch target; each valid bit 77 in the entry records whether a portion of the level 2 cache block has been stored into the level 1 cache; and each 76 field records the block address 68 of the corresponding level 1 cache block.
• Replacement of the level 2 cache likewise moves the shared level 2 pointer 78 in one direction, stopping on the next replaceable level 2 cache block. A replaceable level 2 cache block can be defined as one whose corresponding level 2 active table 40 entry has a count value 75 of '0' and all 77 fields '0', i.e., the level 2 cache block is unrelated to all instructions in the level 1 cache 22; the one-way movement of the pointer 78 guarantees the ER.
  • the replacement of the L3 cache is also based on the LCER strategy.
• When a level 2 cache block is stored into the level 2 cache, the corresponding valid bit 81 in the corresponding entry of the level 3 active table 50 is set to '1' and the level 2 cache block number BN2X is written in. The count value 79 in the entry of the level 3 active table 50 is not used in this embodiment.
• The level 3 cache is set-associative: each set (same index address) has a plurality of ways, and each set shares a pointer 82. The next replaceable way can be found by the pointer 82, a replaceable way being one in which all 81 fields are '0', i.e., the level 3 cache block is unrelated to the instructions in the level 2 cache 42 and can therefore be replaced.
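• The replaceability conditions described above for the level 2 and level 3 caches can be written as simple predicates; the dictionary entry layouts are illustrative:

```python
# Replaceability predicates matching the conditions stated above:
# a level 2 cache block is replaceable when its count value 75 is 0 and all
# four valid bits 77 are '0' (no part of it is in L1, and no track entry
# targets it); a level 3 way is replaceable when all its valid bits 81 are '0'.

def l2_replaceable(entry):
    """entry: {'count75': int, 'valid77': [bool, bool, bool, bool]}"""
    return entry['count75'] == 0 and not any(entry['valid77'])

def l3_replaceable(way_entry):
    """way_entry: {'valid81': [bool, bool, bool, bool]}"""
    return not any(way_entry['valid81'])
```

• Both predicates encode the same principle the description states for every level: a block is safe to evict only when no higher storage level still depends on it.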
• The above method of using pointers to ensure that a cache block that has just been filled is not immediately replaced again may also be substituted by other methods.
• The level 3 cache is organized in a set-associative manner. If no way in a set is replaceable (every way has at least one 81 field in the level 3 active table 50 set to '1'), the way whose 81 fields contain the fewest '1's can be selected and a plural replacement performed. For example, if only one 81 field of a certain way is '1', i.e., only one of the four level 2 instruction blocks that the level 3 cache block can hold is in the level 2 cache memory 42, the BN2X in the 80 field corresponding to that 81 field can be used to address the level 2 active table 40, from which the BN1X number in the first valid 76 field in address order (its 77 field being '1') is read out, and the number N of level 1 cache blocks from this one to the last valid level 1 cache block in the level 2 cache block is calculated. That BN1X number and the count N are sent to the level 1 cache replacement logic, which, starting from the level 1 cache block pointed to by BN1X, replaces the level 1 cache blocks together with the handling of the cache blocks that take them as targets; the above level 2 cache block can then be replaced. If the level 1 cache blocks contained in the level 3 cache block are not contiguous, a plurality of starting points and a plurality of corresponding N values are set and sent in sequence to the level 1 cache replacement logic as described above.
• In the embodiment of FIG. 7, the count values at each level, i.e., 79 in the level 3 active table 50, 75 in the level 2 active table 40, and 70 in the (level 1) correlation table 37, are used to record the degree of correlation of a cache block within the same storage level. The valid bits at each level are used to record the correlation of a cache block with the next higher storage level, such as 81 in the level 3 active table 50 recording the association with level 2 cache blocks, and 77 in the level 2 active table 40 recording the association with level 1 cache blocks.
• The 73 fields in the correlation table 37 record the branch sources that jump to the level 1 cache block. By replacing the BN1X address of the current cache block in each track table 20 entry pointed to by those branch source addresses with the BN2X address 71 of the cache block recorded in 37, the integrity of the control flow information is maintained, and the cache block can then be replaced.
  • Another replacement method can select a cache block replacement with a degree of association of '0'.
  • the cache system of the present invention operates based on control flow information, so the basic principle of cache replacement is that the integrity of the control flow information is not compromised.
  • FIG. 8 is another embodiment of the processor system of the present invention.
  • Figure 8 is a modification of the embodiment of Figure 5, wherein the three-level active table 50, the three-level cache TLB and tag unit 51, the third-level buffer memory 52, the selector 54, the secondary active table 40, the second-level cache memory 42, the track table 20, the level-one cache correlation table 37, the level-one buffer memory 22, the instruction read buffer 39, the tracker 47, the tracker 48, and the processor core 23 have the same functions as the modules of the same numbers in the embodiment of FIG. 5.
  • The difference is that the scanner 43 (which can generate the branch type) is connected to the bus from the tertiary buffer 52 to the secondary buffer 42 and is the only scanner in this embodiment; in addition, a secondary track table 88 is added.
  • the organization of each buffer in the embodiment of Fig. 8 is the same as that in the embodiment of Fig. 5.
  • Each track in the secondary track table 88 corresponds to a secondary cache block in the secondary buffer 42.
  • Each secondary track contains four primary tracks, each corresponding to one primary instruction block in the secondary cache block.
  • The format of a primary track in the secondary track table 88 also takes the SBNY 15, type 11, BNX 12, and BNY 13 format of Figure 1; the address can be in BN3 or BN2 format.
  • The scanner 43 scans each L2 cache block being stored from the L3 buffer memory 52 into the L2 cache memory 42 and calculates the branch target addresses of the branch instructions therein.
  • The branch target address is selected by the selector 54 and sent to the TLB/tag unit 51 to be matched into a BN3 address; the BN3 address then addresses the three-level active table 50 to detect whether the entry is valid (that is, whether the corresponding cache block has already been stored in the secondary cache memory 42). If valid, the BN2X address in the entry is combined with the BNY in the BN3 address to form a BN2 address, which is stored together with the SBNY 15 and type 11 generated by the scanner into the entry of the secondary track table 88 corresponding to the branch instruction; if invalid, the BN3 address is stored directly into the 88 entry together with SBNY 15 and type 11.
  • When a primary instruction block in a secondary cache block of the secondary buffer memory 42 is stored into a primary cache block in the primary buffer memory 22, the secondary track table 88 outputs the corresponding primary track over the bus 89 to be stored in the track table 20. If the address in an entry on the track is in the BN3 address format, the three-level active table 50 is addressed by that address. If the entry valid bit 81 is invalid, the corresponding secondary cache block is read from the tertiary buffer 52 in the foregoing manner and stored into an L2 cache block of the L2 buffer, and the L2 cache block number is combined with the first-level sub-address 64 of the BN3 address to form a BN2X address, which is stored into field 80 of the three-level active table 50. If the valid bit 81 is valid, the BN2X in the entry replaces the original BN3X address and is stored into the secondary track table 88. In either case the BN2X is also bypassed onto bus 89 for storage in the track table 20.
  • This embodiment uses the count value 79 in the three-level active table 50 in a manner similar to the use of the count value 75 in the secondary active table in the foregoing embodiment.
  • The BN2 address is also used to address the secondary active table 40. If the valid bit 77 of the addressed entry in 40 is invalid, the BN2 address is stored into the entry in the track table 20; if the valid bit 77 is valid, the BN1X address in the 40 entry is combined with the BNY in the BN2 address and stored into the entry in the track table 20. When a BN2 address is output from the track table 20 via the bus 29, it is likewise used to address the secondary active table 40; if the valid bit 77 in the entry is invalid, the secondary cache memory 42 is accessed with the BN2 address to read a level-one instruction block, which is stored into a level-one cache block in the level-one cache memory 22. The level-one cache block number BN1X is stored into field 76 of the secondary active table 40, the BN1X is stored into the track table 20, and the BN1X is also bypassed onto bus 29 for use by the tracker.
  • With this strategy, the address in a track entry of the secondary track table 88 may be in BN3 or BN2 format, and the address in a track entry of the track table 20 may be in BN2 or BN1 format. Another strategy is to fill the track table 20 only with BN1 addresses. When filling, if the valid bit 77 in 40 is invalid, a level-one instruction block is read from the L2 cache memory 42 and stored into the L1 cache memory 22, the level-one cache block number BN1X is stored into field 76 of the secondary active table 40 and the corresponding valid bit 77 is set to valid; the BN1X is stored into the track table 20 and may also be bypassed onto the bus 29 for use by the tracker. If the valid bit 77 in 40 is valid, the BN1X in field 76 of the table is directly filled into the track table 20 and bypassed onto the bus 29 for use.
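  • The fill strategies above reduce to one rule: store a branch target in the most precise block-address format the active tables currently allow. A minimal sketch, assuming the active tables behave as simple mappings (the dictionary names and values here are hypothetical, not the patent's structures):

```python
# Illustrative sketch (not the patent's RTL): a branch target is stored
# in the track table as BN1 if already in the L1 cache, else BN2 if in
# L2, else BN3. Each active table maps a block number at its level to
# the block number one level up when the entry is valid.

def resolve_target(bn3x, bny, l3_active, l2_active):
    """l3_active: BN3X -> BN2X or None; l2_active: BN2X -> BN1X or None."""
    bn2x = l3_active.get(bn3x)
    if bn2x is None:
        return ('BN3', bn3x, bny)          # only cached at level 3
    bn1x = l2_active.get(bn2x)
    if bn1x is None:
        return ('BN2', bn2x, bny)          # cached at level 2
    return ('BN1', bn1x, bny)              # already in the L1 cache

l3 = {40: 7}          # L3 block 40 resides in L2 block 7
l2 = {7: 3}           # that L2 block's sub-block resides in L1 block 3
assert resolve_target(40, 5, l3, l2) == ('BN1', 3, 5)
assert resolve_target(41, 2, l3, l2) == ('BN3', 41, 2)
```

Note that the BNY offset passes through unchanged at every level, matching the address formats described later.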
  • FIG. 9 is an embodiment of an indirect branch target address generator of the processor system of the present invention.
  • the indirect branch target address is generally obtained by adding a base address stored in the register file in the processor core to the branch offset contained in the indirect branch instruction.
  • In FIG. 9, 93 is an adder; 39 is an instruction read buffer (IRB); 95 is a plurality of registers with comparators and 96 is a plurality of registers, the two being organized as CAM-RAM in one-to-one correspondence; 98 is a selector.
  • 15, 11, 12, and 13 are contents of the entry of the track table 20 output via the bus 29.
  • a set of registers 95 and 96 is arranged for each indirect branch instruction.
  • The adder 93 and IRB 39 are shared by all indirect branch instructions.
  • The track table 20 entry for an indirect branch instruction has an SBNY field 15 and a type field 11 as defined in FIG. 1; however, field 12 is instead used to store a register file (RF) address, and field 13 is used to store the group number of registers 95, 96.
  • When the scanner 43 decodes a scanned instruction as an indirect branch instruction, field 15 and field 11 of the track table entry are generated as described above, the base-address register file number in the instruction is placed in field 12, and field 13 is set to 'invalid'. A field 13 that is 'invalid' causes the system to allocate a set of registers 95, 96 (a set consists of multiple rows of CAM-RAM), and the group number of the set of registers is stored in field 13 of the track table entry.
  • When the write address of the register file is the same as the address in field 12 of the track table entry, the bus 94, which writes the execution result transferred from the execution unit of the processor core back to the register file, is connected to the other input of the adder 93.
  • the output 46 of the adder 93 is the branch target address, which is sent to the TLB/tag unit 51 for matching.
  • The base address on the bus 94 is also stored into an available row of the 95 registers in the register group pointed to by field 13 of the track table entry; the BN1 address resulting from matching the branch target is stored via the bus 89 into the same row of the 96 registers.
  • selector 98 selects the BN1 address on bus 89 to be output via bus 99.
  • When the type of the entry on the bus 29 is an indirect branch instruction, the address on the bus 99 is used by the tracker 47; when the entry type is other, the address on the bus 29 is selected for use by the tracker 47.
  • When an indirect branch instruction is reached again, the register group number in field 13 of the track table entry on bus 29 selects the corresponding register banks 95 and 96, the register file address in field 12 selects the register file write-back bus, and the data on 94 is compared with the contents of the registers 95. If there is a match, the BN1 address in the corresponding row of the registers 96 is output via bus 97 and selected by the selector 98 for use by the tracker. If there is no match, the adder 93 calculates the indirect branch target address as described above, which is matched into a BN1 address on the bus 89, and the selector 98 selects the address output on the bus 89; a mismatch also causes the base address on bus 94 and the BN1 address on bus 89 to be stored into an unused row of the registers 95, 96. Replacement logic is responsible for allocating register sets 95, 96 to indirect-branch-type entries on bus 29 whose field 13 is 'invalid'; the allocation may use LRU or a similar policy.
  • In this way the base address of an indirect branch instruction can be mapped directly to a level-one buffer address BN1, and the steps of address calculation and address mapping are skipped.
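  • The CAM-RAM mechanism above amounts to a small cache from base-address values to BN1 targets. A hedged behavioral sketch follows; the class and its round-robin replacement pointer are illustrative assumptions, since registers 95/96 are hardware, not software objects:

```python
# Assumed model of the register pair 95/96: the last-seen base address of
# an indirect branch is kept beside the BN1 target it mapped to. When the
# branch recurs with the same base, the BN1 address is produced directly,
# skipping both the add (adder 93) and the address mapping.

class IndirectTargetCache:
    def __init__(self, rows=4):
        self.cam = [None] * rows   # models registers 95: base addresses (CAM side)
        self.ram = [None] * rows   # models registers 96: BN1 addresses (RAM side)
        self.next = 0              # trivial round-robin pointer (could be LRU)

    def lookup(self, base):
        for i, b in enumerate(self.cam):
            if b == base:
                return self.ram[i]        # hit: BN1 out, like bus 97
        return None                       # miss: adder + mapping needed

    def fill(self, base, bn1):
        self.cam[self.next] = base
        self.ram[self.next] = bn1
        self.next = (self.next + 1) % len(self.cam)

c = IndirectTargetCache()
assert c.lookup(0x1000) is None           # first execution: compute target
c.fill(0x1000, ('BN1', 3, 5))             # record base -> BN1 mapping
assert c.lookup(0x1000) == ('BN1', 3, 5)  # same base again: fast path
```

A changed base address misses and repopulates a row, mirroring the mismatch path described above.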
  • FIG. 10 is a schematic diagram of a pipeline structure of a processor core in a processor system according to the present invention.
  • 100 is a typical pipeline structure of a traditional computer or processor core, divided into I, D, E, M, W segments.
  • the I segment is the instruction fetch segment
  • D is the instruction decoding segment
  • E is the instruction execution segment
  • M is the data access segment
  • W is the register write segment.
  • 101 is the pipeline structure of the processor core in the present invention, which has one fewer segment than 100.
  • a conventional processor core generates an instruction address that is sent to a memory or buffer to read (pull) the instruction.
  • The cache system of the present invention automatically pushes instructions to the processor core, requiring only that the processor core provide a branch decision 31 to determine the program direction and a stall-pipeline signal 32 to synchronize the cache system with the processor core. Therefore, the pipeline structure of a processor core using the cache system of the present invention differs from the conventional pipeline structure in that no pipeline segment is needed for instruction fetching. In addition, a processor core using the cache system of the present invention does not need to maintain an instruction address (Program Counter, PC). As shown in Figure 9, the indirect branch target address is generated based on the base address in the register file, and no PC address is required; other instructions are likewise accessed by the BN addresses of the cache system, without the PC. Therefore, the PC need not be maintained in a processor core using the cache system of the present invention.
  • FIG. 11 is another embodiment of the processor system of the present invention.
  • Figure 11 is a modification of the embodiment of Figure 8, wherein the three-level active table 50, the three-level cache TLB and tag unit 51, the third-level buffer memory 52, the selector 54, the scanner 43, the secondary track table 88, the secondary active table 40, the secondary cache memory 42, the track table 20, the level-one cache correlation table 37, the level-one buffer memory 22, the instruction read buffer 39, the tracker 47, the tracker 48, and the processor core 23 have the same functions as the modules of the same numbers in the embodiment of FIG. 8.
  • The difference is that a secondary correlation table 103 and a module 102 are added; 102 is the indirect branch target address generator shown in the embodiment of FIG. 9.
  • the organization of the buffer in the embodiment of Fig. 11 is the same as that of the embodiment of Figs. 5 and 8.
  • The secondary correlation table 103 is similar in structure to the correlation table 37.
  • Each row corresponds to an L2 cache block and contains a count value, the L3 cache address corresponding to the L2 cache block, the source addresses of the branch source instructions that take the L2 cache block as branch target, and their valid signals (refer to the format of CT in FIG. 7); as in the correlation table, the count value is the number of branch source instructions.
  • When a track is filled into the secondary track table 88, each BN2-format branch target address in the filled track entries addresses a row in the secondary correlation table 103 (hereinafter the target row); the secondary buffer address of the track being filled into the secondary track table 88 (referred to as the source track) is filled into a source address field of the target row and its valid signal is set to 'valid', and the count of the target row is increased by '1'. The level-three buffer address corresponding to the source track is also filled into the row of the secondary correlation table 103 corresponding to the source track.
  • When the address in an entry of the secondary track table 88 is in the BN3 format, the entry of the three-level active table 50 is addressed by that BN3 address, and the count value 79 therein is increased by '1'.
  • When the format of an entry on the output 29 of the track table 20 is the BN2 format, it is used to address the secondary active table 40. If the corresponding entry is invalid, the instruction block addressed by the BN2 address (hereinafter the source BN2 address) is read from the secondary buffer memory 42 and fills a level-one cache block specified by the replacement logic in the level-one buffer 22. At this time, the source BN2 address also addresses the secondary track table 88 to output the corresponding track to the track table 20 for storage.
  • Each target BN3 address in that track is sent to the tertiary active table 50 to be mapped into a BN2 address (hereinafter the target BN2 address); at this time the count value in the three-level active table entry pointed to by the target BN3 is decreased by '1', and the count in the target row pointed to by the target BN2 address in the secondary correlation table 103 is increased by '1'. The target BN3 address is stored into the same target row; the source BN2 address is also stored into the same target row, and its corresponding valid bit is set to 'valid'.
  • When the secondary pointer 78 points to the target row in the secondary correlation table 103 corresponding to a replaceable L2 cache block, each valid BN2 source address is read from that row. Each BN2 source address addresses the secondary track table 88, and the BN2 target address (pointing to the above target row) in the corresponding entry is replaced with the BN3 target address stored in the target row of 103; the valid bit of each BN2 source address in the target row of 103 is then set to 'invalid'. The count value in the target row of 103 is decreased by the number of valid BN2 source addresses, and the entry of the three-level active table 50 addressed by the above BN3 target address has its count value 79 increased by the same amount.
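  • The bookkeeping above, demoting BN2 targets back to BN3 format and transferring the counts, can be modeled as follows. This is a minimal sketch with assumed dictionary-based tables, not the patent's actual hardware structures:

```python
# Assumed model: when an L2 block is evicted, every track entry that
# targeted it in BN2 format is demoted to the BN3 format kept in the
# target row; the row's count drops by the number of demoted sources and
# the L3 active-table count (field 79) rises by the same amount.

def evict_l2_block(target_row, track_entries, l3_counts):
    demoted = 0
    for src in target_row['sources']:
        if src['valid']:
            # rewrite the source entry: BN2 target -> stored BN3 target
            track_entries[src['addr']] = ('BN3', target_row['bn3'])
            src['valid'] = False
            demoted += 1
    target_row['count'] -= demoted
    l3_counts[target_row['bn3']] = l3_counts.get(target_row['bn3'], 0) + demoted

row = {'bn3': 40, 'count': 2,
       'sources': [{'addr': 10, 'valid': True}, {'addr': 20, 'valid': True}]}
track = {10: ('BN2', 7), 20: ('BN2', 7)}
l3 = {40: 0}
evict_l2_block(row, track, l3)
assert row['count'] == 0 and l3[40] == 2
assert track[10] == ('BN3', 40)    # control-flow info preserved at level 3
```

The count transfer keeps the total degree of association constant across levels, so no control flow information is lost by the eviction.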
  • A lock signal bit may be added to the correlation table entry corresponding to a high-level cache block. When the lock signal bit is '0', operation is the same as above; when the lock signal bit is '1', the corresponding cache block can be replaced only when its degree of association is '0', that is, when no branch instruction takes the cache block as its target (here the end entry of the preceding instruction block is also regarded as holding an unconditional branch instruction).
  • Thus a level-one cache block whose lock signal bit is '1' can be replaced only when its corresponding count value 70 is '0' and all its valid bits 73 are '0'; likewise, an L2 cache block whose lock signal bit is '1' can be replaced only when its corresponding count value and all its valid bits are '0'.
  • When a level-three cache block is to be replaced, the entry of the three-level active table 50 addressed by the BN3 address on the level-3 pointer 83 is read, and every valid BN2 address in it addresses a row in the secondary correlation table 103 and sets the lock signal there to '1'; thereafter the level-three cache block can be replaced, and after the replacement the buffer works in a non-inclusive state. Since the level-three cache block corresponding to an L2 cache block whose lock signal is '1' has been replaced, control flow information can no longer be maintained by replacing the BN2 address in an entry of the secondary track table 88 with the corresponding BN3 address; therefore such a secondary cache block cannot be replaced until its degree of association is '0'.
  • Conversely, if a high-level cache block can be replaced only when its degree of association is '0', and a cache block is set to be replaceable when, in its active table entry, the valid bits of all its higher-level sub-cache blocks (such as 81 in the three-level active table 50) are all '1' and the count value in the entry (such as 79 in 50) is '0', then the buffer is an exclusive organization. It is also possible to set the buffer replacement method so that the cache blocks at all cache levels are replaceable only when their degree of association is '0'.
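  • The lock-bit variant above can be summarized as a predicate. This is an illustrative simplification under assumed arguments; as the text notes, the exact fields consulted differ per cache level:

```python
# Illustrative predicate for the policies above. Arguments are
# assumptions: 'count' models the degree of association (e.g. field
# 70/75/79) and 'valid_bits' the higher-level copy bits (e.g. 73/77/81).

def can_replace(lock, count, valid_bits):
    if lock:
        # Locked: the lower-level copy is gone, so track-table references
        # cannot be demoted; no branch source may still target the block.
        return count == 0 and not any(valid_bits)
    # Unlocked (inclusive operation): references can first be rewritten
    # to the lower-level address, so the block is eventually replaceable.
    return True

assert can_replace(False, 3, [True])          # unlocked: demote, then evict
assert can_replace(True, 0, [False, False])   # locked but unreferenced
assert not can_replace(True, 1, [False])      # still a branch target
assert not can_replace(True, 0, [True])       # copy exists one level up
```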
  • 102 is the indirect branch target address generator of the embodiment of FIG. 9; it is controlled by the entry on the bus 29 output from the track table 20, obtains the base address 94 from the processor core 23, and generates the indirect branch target address 46, which is sent through the selector 54 to 51 for virtual-to-physical address translation and address mapping, outputting a BN1 branch target address 99 for use by the tracker 47. When the type of the entry on the bus 29 is an indirect branch, the tracker 47 selects the address 99 output by 102; when the type of the entry on the bus 29 is another instruction, the tracker 47 selects the address on the bus 29 output from the track table 20.
  • In FIG. 12 an embodiment of a processor/memory system of the present invention is shown.
  • the embodiment of Figure 12 applies the method to a memory external to the processor based on the embodiment of Figure 11, and other embodiments may be deduced by analogy.
  • Below the dashed line in Fig. 12 are the functional blocks and connections in the processor, which are identical to the embodiment of Fig. 11 except that there is no tertiary cache memory 52.
  • These are the three-level active table 50, the three-level cache TLB and tag unit 51, the selector 54, the scanner 43, the secondary track table 88, the secondary active table 40, the second-level cache memory 42, the secondary correlation table 103, the indirect branch target address generator 102, the track table 20, the level-one cache correlation table 37, the level-one buffer memory 22, the instruction read buffer 39, the tracker 47, the tracker 48, and the processor core 23, which have the same functions as the modules of the same numbers in the embodiment of FIG. 11.
  • the memory 111 and its address bus 113 are added above the dotted line in FIG. 12; the memory 112 and its address bus 114 are also added; the bus 115 sends the information block outputted by the memory 112 to the second level buffer memory 42 in the processor below the dotted line.
  • the instructions in the information are also scanned by the scanner 43 and the branch instruction information is extracted as described in the previous embodiment.
  • The memory 111 is organized as a memory and is addressed by the memory address 113 generated when no match is obtained in the TAG of 51 (the virtual memory address generated by the source 102 or 43 is mapped by the TLB in 51 to a physical address). The memory 112 is organized as a buffer and is addressed by the tertiary buffer address 114, which is generated by a match obtained in the TAG of 51 or is output by the secondary track table 88 via 89.
  • That is, the memory 112 outside the processor is actually used as the tertiary buffer memory, in place of 52 in the embodiment of FIG. 11.
  • the memory 111 is a low level memory not shown but described in Figures 4, 5, 8, and 11.
  • The embodiment of FIG. 12 differs from the embodiment of FIG. 11 only in that the last-level (level-three) cache memory (52 in FIG. 11) in the processor is moved outside the processor (112 in FIG. 12); the two embodiments are logically equivalent.
  • the organization of the buffer (including the memory 112 as a three-level buffer memory) in the embodiment of Fig. 12 is the same as that of the embodiment of Fig. 11.
  • the structure in the embodiment of Figure 12 can have several different applications.
  • the first application form is that the memory 111 is a memory having a large capacity and a large access delay; and the memory 112 is a memory having a small capacity but a small access delay. That is, the memory 112 serves as a cache of the memory 111.
  • The memory can be constructed of any suitable storage device, such as a register or register file, static memory (SRAM), dynamic memory (DRAM), flash memory (Flash Memory), hard disk (HD), solid state drive (SSD), any other suitable storage device, or a future new form of memory.
  • The scanner 43 scans the instruction blocks sent from the memory 112 to the secondary buffer memory 42 via the bus 115, calculates the virtual branch target addresses of the direct branch instructions, and sends them to the selector 54 (102 also generates the virtual branch target address of an indirect branch instruction and sends it to 54 via the bus 46).
  • The virtual address is mapped by the TLB in 51 to a physical address, and the physical address is matched against the TAG in 51. If there is no match, the physical address is sent to the memory 111 via the address bus 113 to read the corresponding instruction block, which is stored into the replaceable level-three cache block in the memory 112 indicated by the third-level buffer replacement logic; the level-three cache block number is merged with the lower address output by the selector 54 into a BN3 address and stored into the secondary track table 88. If a match is obtained, the matched way number is merged with the index address output by the selector 54 into a BN3 address for addressing the three-level active table 50; if the entry in 50 is valid, the BN2 address in it is stored into the secondary track table 88, and if the entry in 50 is 'invalid', the BN3 address itself is directly stored into 88. The rest of the operations are the same as those in the foregoing embodiments and are not described here again.
  • A specific embodiment of the first application may use a flash memory as the memory 111 and a DRAM as the memory 112.
  • Flash memory has a large capacity and low cost, but the access latency is large and the number of writes is limited.
  • DRAM has a smaller capacity and higher cost, but its access latency is small and the number of writes is unlimited. Therefore, the structure in the embodiment of Fig. 12 exploits the respective advantages of flash memory and DRAM while masking their respective disadvantages.
  • 111 and 112 are collectively used as the main memory of the computer system. There are lower storage levels, such as hard drives, outside of 111.
  • the first application is suitable for existing computer systems and can use existing operating systems.
  • The memory is managed by the storage manager in the operating system, which records which memory is in use and which memory is free; when a process needs memory it is allocated, and after the process finishes using it the memory is released. Because this storage management is done by software, its execution efficiency is relatively low.
  • The second application of the embodiment of Fig. 12 uses a nonvolatile memory (such as a hard disk, solid state drive, or flash memory) as the memory 111, and a volatile or nonvolatile memory as the memory 112. Here 111 is used as the hard disk of a computer and 112 as the main memory of a computer, but 112 is organized as a buffer, and thus the hardware of the processor can perform storage management on 112. In this system architecture, the storage manager in the operating system is used little or not at all.
  • The instructions in the memory 111 are stored into the memory 112 in blocks as previously described. In a particular embodiment, the instruction blocks may be pages of a virtual memory (Virtual Memory) system. The addresses in this embodiment use the format shown in FIG. 6.
  • The memory 111 (hard disk) address 113 is divided into a tag 61, an index 62, a secondary sub-address 63, a primary sub-address 64, and a primary intra-block offset (BNY) 13. The memory 111 (hard disk) address in this example may have a larger address space than a normal main memory address so as to address the entire hard disk, wherein 63, 64, and 13 combined form the offset address within a page, and 61 and 62 combined form the page number.
  • The address BN3 of the memory 112 (the main memory, that is, the third-level buffer in the foregoing embodiments) is composed of a way number 65, an index 62, a secondary sub-address 63, a primary sub-address 64, and an intra-block offset (BNY) 13. The way number 65 combined with the index 62 forms the block address of the main memory 112, one block being one page as above; 65, 62, and 63 combined address a secondary instruction block within a main memory instruction block (page); all the fields except the intra-block offset 13 are collectively referred to as BN3X, addressing a primary instruction block within the page.
  • The address BN2 of the secondary buffer is composed of a secondary cache block number 67, a primary sub-address 64, and an intra-block offset (BNY) 13, wherein the secondary cache block number 67 addresses a secondary cache block; the fields except the intra-block offset 13 are collectively referred to as BN2X, addressing a level-one instruction block within the secondary cache block.
  • the address BN1 of the primary buffer is composed of a primary cache block number 68 (BN1X) and an intra-block offset (BNY) 13.
  • the intra-block offset (BNY) 13 in the above four address formats is the same, and the BNY portion does not change when the address conversion is performed.
  • the secondary block number 67 points to a secondary cache block
  • the primary subaddress 64 points to one of the four primary instruction blocks in the secondary cache block.
  • The way number 65 and the index 62 in the BN3 address format point to a main memory instruction block, the secondary sub-address 63 points to one of several secondary instruction blocks in the main memory instruction block, and the primary sub-address 64 points to one of the primary instruction blocks in the selected secondary instruction block.
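  • The four address formats can be illustrated by bit-slicing, assuming example field widths (the patent fixes no widths; the 2-bit sub-addresses and 4-bit BNY here are chosen only for the sketch). Only the BNY field 13 is common to all formats and passes through every conversion unchanged:

```python
# Illustrative bit layout: memory address 113 = tag 61 | index 62 |
# secondary sub-address 63 | primary sub-address 64 | BNY 13.
# Widths below are assumptions for the sketch, not patent values.

BNY_BITS, SUB1_BITS, SUB2_BITS = 4, 2, 2

def split_mem_addr(addr, index_bits=6):
    bny  = addr & ((1 << BNY_BITS) - 1);   addr >>= BNY_BITS    # field 13
    sub1 = addr & ((1 << SUB1_BITS) - 1);  addr >>= SUB1_BITS   # field 64
    sub2 = addr & ((1 << SUB2_BITS) - 1);  addr >>= SUB2_BITS   # field 63
    index = addr & ((1 << index_bits) - 1)                      # field 62
    tag = addr >> index_bits                                    # field 61
    return tag, index, sub2, sub1, bny

tag, idx, sub2, sub1, bny = split_mem_addr(0b1011_000101_10_01_0011)
assert (sub2, sub1, bny) == (0b10, 0b01, 0b0011)
# A BN2 address keeps only {block number 67, sub-address 64, BNY 13}:
bn2 = (7 << (SUB1_BITS + BNY_BITS)) | (sub1 << BNY_BITS) | bny
assert bn2 & ((1 << BNY_BITS) - 1) == bny   # BNY unchanged across formats
```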
  • When a new thread starts, the address of its starting point (in the memory 111 address format) is passed through the selector 54 (it is assumed in this embodiment that the selector 54 has a third input for the starting address to enter) and sent to 51. The index 62 in the starting address addresses the tag unit TAG in 51, and the tag contents of each way are read out to be matched against the tag 61 in the starting address. If there is no match, the 61 and 62 fields of the starting address are sent out on the bus 113 to read the corresponding page (instruction block) from the memory 111, which is stored into the memory 112 in the set indicated by the index 62 of the starting address, in the way specified by the way number 65 supplied by the main memory (that is, the third-level buffer in the foregoing embodiments) replacement logic; at the same time the 61 and 62 fields of the starting address are also stored into the same way of the same set in the tag unit in 51.
  • The system controller then reads a secondary instruction block from the memory 112 (main memory) using the way number 65, the index 62 of the starting address, and the secondary sub-address 63, and stores it in the secondary buffer memory 42 in a secondary cache block specified by the secondary cache replacement logic via the secondary block number 67; the secondary block number 67 is stored into field 80 of the entry of the tertiary active table 50 pointed to by the above 65, 62, and 63, and the valid bit 81 in that entry is set to 'valid'.
  • the scanner 43 scans the above-mentioned two-level instruction block, extracts the branch instruction information therein, and generates a track to be stored in the secondary track table 88.
  • The system controller further reads, from the above secondary cache block 67, the primary instruction block pointed to by the primary sub-address 64, and stores it in the primary buffer memory 22 in a primary cache block specified by the primary cache replacement logic via a primary block number 68; the primary block number 68 is stored into field 76 of the entry of the secondary active table 40 pointed to by the above 67 and 64, and the valid bit 77 in that entry is set to 'valid'.
  • The system controller combines the above primary block number 68 with the intra-block offset BNY in the starting address to form a BN1 address, which is placed into the tracker 47 so that the read pointer 28 points to the start instruction of the above thread in the level-one buffer memory 22 and also points to the corresponding entry in the track table 20.
  • the push operation to the processor core thereafter is similar to the previous embodiments.
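  • The start-up sequence above (TAG fill, page fill into main memory, secondary block fill into L2, primary block fill into L1, BN1 into the tracker) can be sketched as follows. The block numbers 67 and 68 stand in for whatever the replacement logic would select, and all data structures here are illustrative assumptions:

```python
# Assumed-name sketch: a new thread's starting address is pushed down
# the hierarchy until its first instruction block sits in the L1 cache;
# thereafter the tracker's read pointer follows the track table and the
# processor core needs no PC.

def start_thread(start_addr, tag_unit, mem111, mem112, l2_cache, l1_cache):
    tag, index, sub2, sub1, bny = start_addr        # decomposed address fields
    way = tag_unit.setdefault((tag, index), len(tag_unit) % 4)  # TAG match/fill
    page = mem111[(tag, index)]                     # fetch page on a TAG miss
    mem112[(way, index)] = page                     # fill main memory (L3 role)
    l2_block = page[sub2]                           # select secondary block
    l2_cache[67] = l2_block                         # 67: block chosen by L2 logic
    l1_cache[68] = l2_block[sub1]                   # 68: block chosen by L1 logic
    return ('BN1', 68, bny)                         # loaded into read pointer 28

mem111 = {(0b1011, 5): [[['i0', 'i1'], ['i2', 'i3']]]}   # one page of nested blocks
bn1 = start_thread((0b1011, 5, 0, 1, 0), {}, mem111, {}, {}, {})
assert bn1 == ('BN1', 68, 0)
```

From this point the push operation proceeds as in the earlier embodiments, with no further software involvement.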
  • Thereafter, a new thread start address injected by the operating system, or a hard disk address generated by the scanner 43 or the indirect branch address generator 102, is selected by the selector 54 and sent to the tag unit in 51 for matching. When the match is successful, the resulting BN3 address addresses the three-level active table 50. If the entry output by 50 is valid, the secondary active table 40 is addressed by the BN2 in that entry; if the entry output by 50 is 'invalid', the memory 112 (main memory) is directly addressed by the above BN3 address to output the secondary instruction block to the secondary buffer memory 42. When a main memory cache block is replaced, the newly read instruction block overwrites the instruction block originally present in that cache block. This replacement process from hard disk to main memory is completely controlled by hardware, and essentially no software operation is required.
  • The replacement logic can use various algorithms such as LRU, NRU (not recently used), FIFO, clock, etc.
  • The translation lookaside buffer (TLB) in 51 is not required in this application of the embodiment of FIG. 12, because the hard disk address is a physical address.
  • the starting address injected by the operating system is a physical address, whereby the resulting main memory address BN3 (for addressing memory 112) of the address mapping is a mapping of physical addresses.
  • the remaining BN2 addresses, which are mappings of BN3 addresses, are also mappings of physical addresses.
  • the memory 111 (hard disk) is the virtual memory of the memory 112 (main memory), and the memory 112 (main memory) is the buffer of the memory 111 (hard disk). Therefore, there is no case where the address space of the program is larger than the address space of the main memory.
  • the same program executed at the same time has the same BN3 address, and the BN3 addresses of different programs executed at the same time must be different. Therefore, the same virtual address of different programs at the same time will be mapped to different BN addresses without confusion.
  • the processor core in the push architecture does not generate an instruction address. Therefore, the physical hard disk address can be directly used as the address of the processor. It is not necessary to generate a virtual address by a processor core as in an existing processor system, and then map to a physical address to access the memory.
  • the memory 111 and the memory 112 in the embodiment of Fig. 12 can be packaged in a package as a memory.
  • the interface between the processor and the memory in the embodiment of FIG. 12 additionally adds a cache address BN3 bus 114.
• although the boundary between the memory and the processor in the embodiment of Fig. 12 is shown as a broken line, some of the functional blocks may also be moved from one side of the boundary to the other.
• the three-level active table 50 and the TLB and tag unit TAG in 51 can be placed on the memory side above the dotted line, which remains logically equivalent to the embodiment of FIG. 12.
• the non-volatile memory 111 chip, the single or multiple memory 112 chips, and the memory chips below the dotted line in FIG. 12 can be interconnected through TSVs (through-silicon vias) and packaged in a single package.
  • Figure 13 is another embodiment of the processor/memory system of the present invention.
  • the embodiment of Figure 13 is a more general representation of the embodiment of Figures 8, 11, and 12.
  • a four-level active table 120, a four-level correlation table 121 and a four-level buffer memory 122 are added, which are addressed by the BN4 bus 123 generated by 51.
• a three-level track table 118 and a three-level correlation table 117 are also added; the count values extracted from the three-level active table 50 in the embodiments of FIG. 8, FIG. 11 and FIG. 12 are stored in 117, so that the format of each level's active table is consistent. That is, in the embodiment of Fig. 13 there is no count value in 50; the count value is stored in 117.
  • the lowest level 111 of the memory hierarchy in the embodiment of Figure 13 is a memory, addressed by memory address 113.
• the remaining memory levels other than 111 are each addressed by the corresponding BN cache address.
• the lowest-level cache, that is, the four-level buffer 122 in the figure, has a set-associative structure.
• the remaining, higher cache levels are all fully associative.
• the scanner 43 is located between the four-level buffer memory 122 and the three-level buffer memory 112.
  • TLB/TAG 51 is in the level 4 cache.
  • Each cache level higher than the level of the scanner 43 has a track table such as 118, 88, 20.
  • Each cache level except the highest cache level has active tables such as 120, 50, and 40.
  • Each cache level has a related table such as 121, 117, 103, 37.
  • the format of each storage table is shown in Figure 14.
  • Figure 14 is a diagram showing the format of each storage table in the embodiment of Figure 13.
• the format of the tag unit in 51 of the Fig. 13 embodiment is the physical tag 86.
  • the CAM format of the TLB in 51 is the thread number 83 and the virtual tag 84, and the RAM format is the physical tag 85.
• the thread number 83 and the virtual tag 84 selected by the selector 54 in FIG. 13 are mapped to the physical tag 85 in the TLB; the physical tags 86 read out of the tag unit at the index address 62 of the virtual address are matched against 85 to obtain the way number 65.
• the way number 65 and the index address 62 in the virtual address are joined together to form a four-level cache block address 123.
  • each entry in the track table contains type 11, cache block address BNX 12 and BNY13, may also contain SBNY 15 to determine the branch execution time point.
• the cache block address 12 in each level of track table may be in the BN format of that level or of a lower level; for example, 12 in the three-level track table 118 may be in BN3X or BN4X format.
• the active table entry has a buffer block number 76 for the corresponding sub-block, whose format is the cache block number of the level one higher than the current level (for example, the BN2X stored in the third-level active table 50), and a corresponding valid bit 77.
  • the function of the active table is to map the cache address of this level to a higher level cache address.
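The mapping performed by an active table can be sketched as a small table of (higher-level block number, valid) pairs, indexed by this level's cache block address and sub-address. This is an illustrative model only; all class names, sizes, and values below are assumptions, not taken from the embodiment.

```python
# Sketch of an active-table (AL) lookup, assuming each entry holds a
# higher-level cache block number (field 76) and a valid bit (field 77).
# All names and sizes here are illustrative.

class ActiveTable:
    def __init__(self, num_blocks, sub_blocks_per_block):
        # One row per cache block at this level; one (block_no, valid)
        # pair per sub-block of the next higher (faster) level.
        self.rows = [[(None, False)] * sub_blocks_per_block
                     for _ in range(num_blocks)]

    def fill(self, block_no, sub_addr, higher_block_no):
        self.rows[block_no][sub_addr] = (higher_block_no, True)

    def lookup(self, block_no, sub_addr):
        # Returns the higher-level block number, or None if the
        # sub-block has not been filled into the higher level yet.
        higher, valid = self.rows[block_no][sub_addr]
        return higher if valid else None

# Example: a three-level active table mapping a BN3 address to a BN2X block.
al3 = ActiveTable(num_blocks=16, sub_blocks_per_block=4)
al3.fill(block_no=5, sub_addr=2, higher_block_no=9)   # BN2X = 9
assert al3.lookup(5, 2) == 9      # hit: sub-block already at the higher level
assert al3.lookup(5, 3) is None   # miss: must fetch from this level
```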
• the correlation table has a count value 70, whose meaning is the number of entries in the track tables of this or higher storage levels that take the cache block as a branch target; a lower-level cache block number 71 corresponding to the cache block; and the track table entry addresses 72, with their corresponding valid bits 73, of the entries in this storage level that take the cache block as a branch target.
• the pointer 74 shared by all ways points to the cache block that has not been replaced for the longest time, as described above; if the count value 70 corresponding to that cache block is smaller than the preset replacement threshold, the cache block can be replaced.
• when replacing, the track table entries addressed by the 72 addresses whose 73 bits are 'valid' have their this-level cache block number replaced with the lower-level cache block number 71.
• the exception is the four-level correlation table 121, which has only the count value 70 and no 71, 72, 73; since there is no track table at that level, no address replacement in track table entries is needed.
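The replacement test described in the preceding bullets can be sketched as follows. The threshold value and all names are illustrative assumptions; the embodiment only states that a block whose branch-target count 70 is below a preset threshold may be replaced, scanning from the shared pointer 74.

```python
# Sketch of correlation-table-guided replacement: a cache block may be
# replaced only when its count (field 70, the number of track-table
# entries targeting it) is below a preset threshold. Names are illustrative.

REPLACE_THRESHOLD = 1  # assumed preset threshold

def find_replaceable(counts, pointer):
    """Scan from the shared pointer (74) for the first block whose
    branch-target count allows replacement; return its index."""
    n = len(counts)
    for i in range(n):
        idx = (pointer + i) % n
        if counts[idx] < REPLACE_THRESHOLD:
            return idx
    return None  # no block currently replaceable

counts = [0, 3, 1, 0]          # count value 70 per cache block (toy data)
assert find_replaceable(counts, pointer=1) == 3  # skips blocks 1 and 2
assert find_replaceable(counts, pointer=0) == 0
```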
  • the scanner 43 extracts the information of the branch address in the instruction block, generates a track entry type, and also calculates a branch target address.
• the branch target address is selected by the selector 54 and sent to 51 to match against the tag unit. If there is no match, the memory 111 is addressed by the branch target address via the bus 113, and the corresponding instruction block is read into the memory 122 at the four-level cache block selected by the four-level cache replacement logic (the four-level active table 120 and the four-level correlation table 121, etc.). If matched, the matched BN4X address 123 addresses the four-level active table 120.
• the BN3X address in the entry is combined with the BNY of the branch target address into a BN3 address and stored in the three-level track table via the bus 125.
• FIG. 15 shows the address formats of the processor system in the embodiment of FIG. 13.
• the memory address is divided into a tag 61, an index 62, a tertiary sub-address 126, a secondary sub-address 63, a primary sub-address 64, and an intra-block offset (BNY) 13.
• the address BN4 of the quaternary buffer is composed of a way number 65, an index 62, a three-level sub-address 126, a second-level sub-address 63, a first-level sub-address 64, and an intra-block offset (BNY) 13; the portion other than 13 is collectively referred to as BN4X.
• the address BN3 of the third-level buffer is composed of a third-level cache block number 128, a second-level sub-address 63, a first-level sub-address 64, and an intra-block offset (BNY) 13; the portion other than 13 is collectively referred to as BN3X.
• the address BN2 of the secondary buffer is composed of a secondary cache block number 67, a primary sub-address 64, and an intra-block offset (BNY) 13; the portion other than 13 is collectively referred to as BN2X.
  • the address BN1 of the primary buffer is composed of a primary cache block number 68 (BN1X) and an intra-block offset (BNY) 13.
  • the intra-block offset (BNY) 13 in the above four address formats is the same, and the BNY portion does not change when the address conversion is performed.
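The invariant just stated, that BNY occupies the same low bits in every format and never changes under address conversion, can be modeled with assumed field widths. Only the widths below are assumptions; the field ordering follows the formats described above.

```python
# Sketch of the FIG. 15 cache-address formats with illustrative field
# widths. The intra-block offset BNY (13) occupies the same low bits in
# every format, so format conversion never alters it.

BNY_BITS = 4      # intra-block offset 13 (width assumed)
SUB1_BITS = 2     # primary sub-address 64 (width assumed)
SUB2_BITS = 2     # secondary sub-address 63 (width assumed)

def split_bn4(addr):
    bny  = addr & ((1 << BNY_BITS) - 1)
    bn4x = addr >> BNY_BITS          # way 65 + index 62 + 126 + 63 + 64
    return bn4x, bny

def make_bn3(bn3_block_no, sub2, sub1, bny):
    # BN3 = block number 128 | sub-address 63 | sub-address 64 | BNY 13
    return (((bn3_block_no << SUB2_BITS | sub2) << SUB1_BITS | sub1)
            << BNY_BITS | bny)

bn4x, bny = split_bn4(0b1011_0110)
assert bny == 0b0110
bn3 = make_bn3(bn3_block_no=3, sub2=1, sub1=2, bny=bny)
assert bn3 & 0b1111 == 0b0110     # BNY preserved across conversion
```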
• when the corresponding track is read out of the tertiary track table 118 via the bus 119, a BN4-format address in a track entry addresses the four-level active table 120. The above BN4 address on the bus 119 addresses the memory 122, and the corresponding instruction block read out fills the memory 112 at the three-level cache block pointed to by the BN3X address given by the three-level cache replacement logic (the three-level active table 50 and the three-level correlation table 117, etc.). The BN3X address is stored in the entry of the four-level active table 120 pointed to by the BN4 address, and in the corresponding entry in the three-level track table 118.
• the BN3X address is bypassed onto the bus 119 and also stored in the corresponding entry in the secondary track table 88. If the output on the bus 119 is already a BN3X address, that BN3X address addresses the three-level active table 50. If the 50 entry is valid, the BN2X address is stored in the corresponding entry in the secondary track table 88; if the 50 entry is invalid, the memory 112 is addressed by the BN3X address on 119.
• the BN2X is also stored in the entry of the three-level active table 50 addressed by the above BN3X, and in the secondary track table 88.
• when the corresponding track is read out of the secondary track table 88 via the bus 89, a BN3-format address in a track entry addresses the tertiary active table 50. If the 50 entry is valid, the BN2X address is filled into the track entry in 88, bypassed onto the bus 89, and also stored in the corresponding entry in the primary track table 20. If the 50 entry is invalid, the above BN3 address on the bus 89 addresses the memory 112, and the corresponding instruction block read out fills the memory 42 at the secondary cache block pointed to by the BN2X address given by the L2 cache replacement logic (the secondary active table 40 and the secondary correlation table 103, etc.).
• the BN2X address is stored in the entry of the three-level active table 50 pointed to by the BN3 address, and in the corresponding entry in the secondary track table 88.
• the BN2X address is bypassed onto the bus 89 and also stored in the corresponding entry in the primary track table 20. If the output on the bus 89 is already a BN2X address, that BN2X address addresses the secondary active table 40. If the 40 entry is valid, the BN1X address is stored in the corresponding entry in the primary track table 20; if the 40 entry is invalid, the memory 42 is addressed by the BN2X address on 89.
• the instruction block read out is filled into the level 1 cache memory 22, to be pushed to the processor core 23 or IRB 39.
• when the corresponding track is read out of the primary track table 20 via the bus 29, a BN2-format address in a track entry addresses the secondary active table 40. If the 40 entry is valid, the BN1X address is filled into the track entry in 20 and bypassed onto the bus 29; if the 40 entry is invalid, the BN2 address on the bus 29 addresses the memory 42, and the corresponding instruction block read out fills the memory 22 at the first-level cache block pointed to by the BN1X address given by the first-level cache replacement logic (the level 1 correlation table 37, etc.).
• the BN1X address is stored in the entry in the secondary active table 40 pointed to by the BN2 address, and in the corresponding entry in the primary track table 20. If the output on the bus 89 is already a BN1 address, that BN1 address is stored in the register in the tracker 47, becomes the read pointer 28, and addresses the track table 20 and the level 1 cache memory 22 to push instructions to the processor core 23 or IRB 39. This ensures that for the instructions in the level 1 cache memory 22, their branch targets and the sequentially next level 1 cache blocks are at least already in the level 2 cache memory 42 or are being stored into 42. The rest of the operations are as described in the previous embodiments and will not be described again.
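The fill-down sequence in the preceding bullets repeats the same promote-or-fetch step at every level: a track entry holding a lower-level address is either promoted to the higher-level address via the active table, or the block is fetched and both levels' tables are updated. A minimal Python sketch with illustrative names and toy data structures (not the embodiment's actual tables):

```python
# Sketch of one promote-or-fetch step, e.g. turning a BN2 track entry
# into a BN1 block number. All names, formats, and data are illustrative.

def resolve(entry, active_table, lower_cache, this_cache, alloc_block):
    """entry = ('BN2', block, sub) or ('BN1', block). Returns a BN1 block."""
    if entry[0] == 'BN1':
        return entry[1]
    _, bn2_block, sub = entry
    bn1 = active_table.get((bn2_block, sub))
    if bn1 is None:                       # active-table entry 'invalid'
        bn1 = alloc_block()               # replacement logic picks a block
        this_cache[bn1] = lower_cache[bn2_block][sub]   # fill this level
        active_table[(bn2_block, sub)] = bn1            # record the mapping
    return bn1

lower = {7: {0: 'blk-A', 1: 'blk-B'}}     # toy lower-level cache contents
l1, al = {}, {}
b = resolve(('BN2', 7, 1), al, lower, l1, alloc_block=lambda: 0)
assert b == 0 and l1[0] == 'blk-B' and al[(7, 1)] == 0
assert resolve(('BN2', 7, 1), al, lower, l1, lambda: 99) == 0  # now a hit
```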
• although FIG. 13 shows an instruction-push memory/processor system that executes two branches simultaneously, its memory hierarchy can also be applied to processor cores of other architectures, such as an out-of-order multi-issue processor system whose level 1 cache or instruction read buffer is addressed by addresses generated by a processor core.
  • the method and system of the embodiment of Figure 13 can be applied to data memory hierarchies and data pushes such that the memory hierarchy also pushes data to the processor core.
• the following embodiment assumes that the data memory has the same storage hierarchy as the instruction memory, that is, there are a memory, a four-level cache, a three-level cache, a two-level cache, a one-level cache and a data read buffer, corresponding to the instruction memory levels.
• each BN address has a corresponding DBN (Data Block Number) address, distinguished from the BN address to accommodate separate instruction caches and data caches; otherwise the hierarchical address naming follows the BN convention.
  • Each storage hierarchy also requires data track table DTT, data active table DAL, data related table DCT and pointers to support the operation of data memory storage.
  • FIG. 16 is a format of the data track table, the data active table, and the data related table.
• the branch target address does not need to be stored in the data track table DTT, so only the block address DBNX 132 of the sequentially next data block and its valid bit 133 are stored; optionally, the block address 130 of the preceding data block in storage order and its valid bit 131 can be added, for use in reverse-order access to data.
  • the data track table can be completely eliminated.
• the format of the data active table DAL is the same as the active table AL format (76, 77) shown in Figure 14, where the 134 field stores the block address DBNX and the 135 field stores the corresponding valid bit.
• a row of this level's DAL is addressed by a data block address (e.g., block 2 address 67 in FIG. 15), and a group of 134, 135 in the row is addressed by a sub-address (such as sub-address 2 in FIG. 15). If the valid bit 135 is 'valid', the higher-level block address in 134 is read from the DAL to access the higher-level data store. That is, the data active table DAL maps this storage level's address to the address of the next higher storage level.
  • the data active table DAL can map the storage hierarchical address to the corresponding high-level storage hierarchical address
  • the data correlation table DCT can map the storage hierarchical address to the lower one storage hierarchical address (the DBLNX represents a low-level address in FIG. 16). ).
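The complementary mappings just described, DAL up toward the processor and DCT down toward memory, can be sketched as a pair of lookup tables. All names and values below are illustrative assumptions.

```python
# Sketch of one data storage level's mapping tables: the data active
# table (DAL, fields 134/135) maps up to the next higher level; the data
# correlation table (DCT, DBLNX field) maps down to the next lower level.

class DataLevelMaps:
    def __init__(self):
        self.dal = {}   # (block, sub_addr) -> higher-level DBNX
        self.dct = {}   # block -> lower-level DBNX (DBLNX)

    def link(self, block, sub_addr, higher_block, lower_block):
        self.dal[(block, sub_addr)] = higher_block
        self.dct[block] = lower_block

    def up(self, block, sub_addr):
        # A missing key models a cleared valid bit 135.
        return self.dal.get((block, sub_addr))

    def down(self, block):
        return self.dct.get(block)

maps = DataLevelMaps()
maps.link(block=4, sub_addr=1, higher_block=2, lower_block=12)
assert maps.up(4, 1) == 2     # DAL: level-N address -> level-(N-1) block
assert maps.down(4) == 12     # DCT: level-N address -> level-(N+1) block
assert maps.up(4, 0) is None  # unfilled sub-block
```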
  • the pointer 137 is used for buffer replacement.
• the data cache can be replaced in the same way as the instruction cache disclosed in the present invention, but there is no count value in the data cache correlation table, because no branch instruction jumps into the data cache as a target; therefore there is no need to consider replacing addresses in track tables that target the data cache block, nor to record branch source addresses.
• the level 1 data cache only needs to record the last replaced cache block with the pointer 137, which is moved by one-way traversal, or replacement can use LRU, LFU, and the like.
• the second, third, and fourth level data caches are replaced in the same manner as the instruction cache, as long as the cache block has no corresponding cache blocks at a higher level.
• each of the entries in the active table can be read by one-way traversal of this level's pointer 137; if all the address fields in an entry are 'invalid', the corresponding cache block can be replaced.
  • the L1 cache replacement method of the instruction cache disclosed by the present invention may also be implemented by LRU, LFU or the like.
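The replacement scan described above can be sketched as a one-way traversal of the per-level pointer 137 over the active-table rows, stopping at the first block whose higher-level address fields are all 'invalid'. Names and data below are illustrative.

```python
# Sketch of the data-cache replacement scan: a cache block is replaceable
# when every higher-level address field in its active-table row is
# 'invalid', i.e. no higher level still holds any of its sub-blocks.

def next_victim(dal_rows, pointer):
    """dal_rows[i] is the list of valid bits (field 135) for block i."""
    n = len(dal_rows)
    for step in range(n):
        idx = (pointer + step) % n
        if not any(dal_rows[idx]):      # all sub-block entries invalid
            return idx
    return None                          # every block still has copies above

rows = [[False, True], [False, False], [True, True]]
assert next_victim(rows, pointer=0) == 1   # block 0 has a live sub-block
assert next_victim(rows, pointer=2) == 1   # wraps past blocks 2 and 0
```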
  • the data push memory hierarchy also uses the step size table 150 to record the difference between two adjacent data access addresses of the same data access instruction.
• FIG. 17 shows the step size table format and working principle.
• 150 is a memory in which each row corresponds to a data access instruction (such as LD or ST), addressed by the instruction address of the data access instruction.
• each row contains a data address 138. The format of 138 is DBN1, the primary data cache address, consisting of DBN1X and DBNY similar to 68 and 13 in Figure 15. The 139 field is the status bit of 138.
• each row also contains a set of step sizes, one of which is selected according to the loop (branch) level of the instruction segment containing the data access instruction.
  • the straight line represents sequential instructions that are executed sequentially in the direction of the arrow, the arc represents the reverse branch, the intersection represents the branch instruction, and the triangle represents the data access instruction.
• 146 is a data access instruction, and the upper row of the step size table 150 in FIG. 17 corresponds to 146. The inner-loop step size of the data access instruction 146 is stored in the step field 140 of the 150 row corresponding to 146. When the branch of the branch instruction 140 is judged as 'no branch' and the branch of the branch instruction 142 is judged as 'execution branch', the middle-loop step size of 146 is stored in the step field 142 of that row. When the branches of the branch instructions 140 and 142 are both judged as 'no branch' and the branch of the branch instruction 143 is judged as 'execution branch', the outer-loop step size of 146 is stored in the step field 143 of that row.
• the branch judgments have priority: the reverse branch instruction immediately following the data access instruction has the highest priority, and the priorities of the other reverse branch instructions decrease in order; a higher-priority branch instruction judged as 'execution branch' masks the lower-priority branch instructions so that they do not affect the readout of the step size table 150. Forward branch instructions are not recorded in the step size table. An adder can add the data address DBN1 in a row of 150 to the step size selected by the branch judgments (such as 140) to obtain the next data address, which accesses the data memory hierarchy to acquire data in advance and push it to the processor core.
  • FIG. 18 is another embodiment of the processor/memory system of the present invention.
  • the left half of Fig. 18 is an instruction push processor system similar to the embodiment of Fig. 13, and the right half is a data push memory hierarchy.
• the processor core 23 has the same function as the module of the same number in the embodiment of Fig. 13.
• the functions of the memory 111, the four-level active table 120, the four-level correlation table 121 and the four-level buffer memory 122 are similar to those in the embodiment of FIG. 13, except that they store not only instructions but also data and data-related auxiliary information such as data cache block numbers.
  • the entry of the four-level active table 120 may store a three-level instruction cache address BN3 or a three-level data cache address DBN3.
• the selector 54 is now a three-input selector. In addition to performing the instruction scan function of the FIG. 13 embodiment, the scanner 43 also calculates the sequentially next data block address (or the reverse-order data block address) for the data blocks passing over the bus 115.
• the right half has a three-level data buffer memory 160, a secondary data buffer memory 161, a primary data buffer memory 162, a data read buffer 163, a step size table 150, a three-level data track table 164, a secondary data track table 165, and a primary data track table 166;
• the memory 111 in FIG. 18 is addressed by a memory address, the memory 122 has a set-associative cache organization, and the buffers at the other levels have fully associative cache organizations.
• the memory 111 of the embodiment of Fig. 18 can be used as the main memory of the processor/memory system, in which case 122 is the processor's last level cache (Last Level Cache), a unified cache. In another system organization, 111 is used as the hard disk of the system, 122 is the main memory organized as a cache, 112 is the processor's last-level instruction cache, and 160 is the processor's last-level data cache.
• the entries of the data read buffer (Data Read Buffer, DRB) 163 correspond one-to-one with the entries of the instruction read buffer IRB 39.
  • the task of the data storage hierarchy is to pre-fill the data to be used by the processor core into the entries in the DRB corresponding to the data access instructions in the IRB.
• the data is pushed to the processor core 23 together with the instructions (data and instructions are not necessarily pushed at the same time, because a data load instruction executed by the processor core and its corresponding data are not needed in the same pipeline stage of the processor core).
• the memory 111 is addressed by the data address via the bus 113, as described in the foregoing FIG. 13 embodiment, to read a four-level data block, which is stored in the memory 122 at the four-level cache block specified by the way number (65 in FIG. 15) given by the four-level cache replacement logic. The data address is stored in the entry of the tag unit in 51 that is likewise pointed to by 65 and 62.
• the system further reads the three-level data block out of the memory 122, addressed by the above 65, 62 together with the three-level sub-address 126 of the data address, and stores it via the bus 115 into the three-level data buffer memory 160 at the three-level cache block specified by the three-level data block number 128 given by the three-level data cache replacement logic. It stores the three-level block number 128 into the entry field pointed to by 65, 62 and 126 in the four-level active table 120 and sets the field to 'valid'; and stores the 65 and 62 (four-level block number) into the entry in the three-level correlation table 174 pointed to by the above 128.
• the scanner 43 calculates the address of the sequentially next three-level data block (that is, the data address plus the size of one three-level data block) and sends it to the tag unit in 51 for matching to obtain a BN4 address; accessing the four-level active table 120 with that BN4 address maps it to a DBN3X address, which is concatenated with the DBNY 13 of the data address to obtain the DBN3 address.
• the resulting DBN3 or BN4 address is stored in the 132 field of the entry pointed to by the above 128 in the three-level data track table 164.
• if the sequentially next three-level data block is still in the same cache block, '1' is added to the above 126 and combined with the original 65, 62 to obtain the DBN3 address of the next three-level data block, without mapping through the tag unit in 51.
• the sequentially next three-level data block may also be filled into the three-level buffer memory 160 at this time, with the corresponding entries in 120 and 174 filled as described above; but generally it is not required that the sequentially next three-level data block also be filled into 160.
• the system further reads the secondary data block out of the tertiary data buffer memory 160, addressed by the above 128 together with the secondary sub-address 63 of the data address, and stores it in the secondary data buffer memory 161 at the secondary cache block specified by the secondary data block number 67 given by the secondary data cache replacement logic. It stores the secondary block number 67 into the entry field pointed to by 128 and 63 in the tertiary data active table 167 and sets the field to 'valid'; the 128 (three-level block number) is stored in the corresponding correlation table entry.
• '1' is added to the above 63, and the tertiary active table 167 is addressed by 128 together with the incremented 63. If the entry is 'valid', the sequentially next secondary data block is already in the secondary cache; if the entry is 'invalid', the next secondary data block is read from the tertiary data buffer memory 160, addressed by 128 and the incremented 63, and stored in the secondary data buffer.
• the entry pointed to by the above 128 in the three-level track table 164 is read out via the bus 190. If the content of the entry is in the BN4 format, the BN4 address accesses the four-level active table 120 via the bus 197. If the 120 entry is 'valid', the DBN3 address stored in that entry replaces the original BN4 in the 164 entry; if the 120 entry is 'invalid', the memory 122 is accessed, addressed by the BN4 address on the bus 197.
• the three-level data block read out is stored in the memory 160, and the corresponding entries in 164, 167, 174 and 120 are filled in the manner described above. This ensures that when the contents of a three-level data block are being stored into the secondary data buffer, the sequentially next three-level data block is stored into the three-level data buffer.
• if the entry pointed to by the above 128 in the three-level track table 164 is in the DBN3 format, the DBN3 addresses the three-level active table 167 via the bus 190 as described above, so that while the secondary buffer memory 161 is being filled, the sequentially next secondary data block is also filled into 161.
• the data blocks in reverse order can also be stored in the data cache as needed, using the 130 field in the track table. It is also possible to eliminate the data track tables 164, 165, 166 entirely; in that case the system does not automatically fill in order or in reverse order across three-level or secondary data cache block boundaries. Pre-filling of the other data storage levels is done in the same way.
• the system further reads the primary data block out of the secondary data buffer memory 161, addressed by the above 67 together with the primary sub-address 64 of the data address, and stores it in the primary data buffer memory 162 at the first-level cache block specified by the primary data block number 68 given by the primary data cache replacement logic. It stores the first-level block number 68 into the entry field of the secondary data active table 168 pointed to by 67 and 64 and sets the field to 'valid'; and stores the 67 (secondary block number) into the entry of the primary correlation table 176 pointed to by the above 68.
  • the entry pointed to by the above 67 in the secondary track table 165 is read.
• if it contains a BN3 address, the three-level active table 167 is addressed by that BN3 address via the bus 185. If the 167 entry is 'valid', the BN2X address in the 167 entry is written back to 165 via the bus 189 to replace the BN3X address. If the 167 entry is 'invalid', the three-level data buffer memory 160 is addressed by the address on 185, and the secondary data block read out is stored in the secondary data buffer memory 161 at the secondary cache block pointed to by another 67 given by the cache replacement logic.
• this other 67 is also stored in the entry addressed by 185 in the three-level data active table 167, and in the secondary data track table 165 in place of the BN3X address.
• corresponding entries are also established for this secondary cache block in the secondary data active table 168 and the secondary data correlation table 175, addressed by the 67, with the 175 entry storing the BN3X address. This ensures that when the contents of a secondary data block are being stored into the primary data buffer, the sequentially next secondary data block is stored into the secondary data cache.
• the system further concatenates the above 68 with the DBNY 13 of the data address to form the primary data cache address DBN1, stores it via the bus 193 into the 138 field of the row corresponding to the data load instruction in the step size table 150, and sets the 139 status field of that row to '1'.
  • the system accesses the primary data buffer memory 162 by the above DBN1, and the read data is stored in the DRB.
  • the data can be pushed to the processor core 23 for processing with the instruction.
  • the system begins prefetching the next data to the DRB for pushing the next time the same data load instruction is executed.
• while the status field 139 is '1', the process of prefetching data for pushing is exactly as above, except that when the new 68 and 13 (DBN1) are generated, the previous DBN1 in the 138 field of the step size table 150 is first subtracted from this DBN1, and the difference is stored as a step size into the entry selected by the branch judgments, such as 140. The new DBN1 is then written into the 138 field to replace the old address, and the status field 139 is set to '2'.
• after the system pushes the second data to the processor core 23, when a branch instruction following the data load instruction determines its branch to be an 'execution branch', the system starts prefetching the next data into the DRB, to be pushed the next time the same data load instruction executes. With the status field 139 at '2', the system no longer waits for the processor core 23 to calculate the data address. Instead, the step size table 150 directly outputs the DBN1 address in the 138 field of the row corresponding to the data load instruction and the step size selected by the branch judgments (such as 140), and adds them in the adder 173. The system makes a boundary determination on the output 181 of 173.
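The 139 status protocol described above can be sketched as a small state machine for one row of the step size table: state '1' after the first address is recorded, state '2' once a step has been learned, after which addresses come from the step table rather than the processor core. All names and addresses below are illustrative.

```python
# Sketch of the per-row status-field (139) protocol of the step size table.

class StepRow:
    def __init__(self):
        self.dbn1 = None   # field 138, last data address
        self.state = 0     # field 139
        self.step = None   # selected step field (e.g. 140)

    def observe(self, dbn1):
        """Called with the core-computed data address (states 0 and 1)."""
        if self.state == 1:
            self.step = dbn1 - self.dbn1   # difference of two accesses
            self.state = 2
        else:
            self.state = 1
        self.dbn1 = dbn1

    def prefetch(self):
        """In state '2', produce the next address without the core."""
        assert self.state == 2
        self.dbn1 += self.step
        return self.dbn1

row = StepRow()
row.observe(0x100)            # first execution: record address, state 1
row.observe(0x108)            # second execution: step = 8, state 2
assert row.state == 2 and row.step == 8
assert row.prefetch() == 0x110   # pushed ahead of the third execution
```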
• if 181 does not exceed the primary data cache block boundary, the selector 192 selects 181 to access the primary data buffer memory 162, the read data is stored in the corresponding entry in the DRB for pushing, and the address on 181 is stored as DBN1 in the 138 field of the corresponding row in the step size table. If 181 is beyond the boundary of the primary data cache block but does not exceed the adjacent primary cache block boundary, the primary data track table 166 is addressed by 181, and the DBN1X address 132 of the next primary data block is read out.
• (or, for reverse-order access, the DBN1X address 130 of the preceding data block) is output via the bus 191, selected by the selector 192, and combined with the DBNY address 13 on 181 to access the memory 162; the read data is stored in the corresponding entry in the DRB for pushing.
• the above concatenated address DBN1 is stored in the 138 field of the corresponding row in the step size table 150. In both cases, the status field 139 in 150 remains unchanged at '2'. If the address 132 output by 166 is in BN2X format, the system addresses the secondary data active table 168 via 191; if the 168 entry is 'valid', the BN1X address in the 168 entry is written back onto the bus 184 to replace the BN2X address.
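The boundary determination on the adder output (181) can be sketched as follows: if the sum stays inside the current primary cache block its DBN1X part is reused; if it crosses into an adjacent block, the track-table address of the next (132) or previous (130) block is substituted while the new DBNY is kept. Field widths below are illustrative assumptions.

```python
# Sketch of the boundary determination on the step-table adder output.

BNY_BITS = 4  # intra-block offset 13 (width assumed)

def bound_check(dbn1, step, next_block, prev_block):
    new = dbn1 + step
    old_x, new_x = dbn1 >> BNY_BITS, new >> BNY_BITS
    bny = new & ((1 << BNY_BITS) - 1)
    if new_x == old_x:
        return new                              # still in the same block
    if new_x == old_x + 1:
        return (next_block << BNY_BITS) | bny   # field 132: next block
    if new_x == old_x - 1:
        return (prev_block << BNY_BITS) | bny   # field 130: previous block
    return None  # beyond adjacent blocks: needs the lower-level mapping

assert bound_check(0x25, 0x2, next_block=7, prev_block=1) == 0x27
assert bound_check(0x2E, 0x4, next_block=7, prev_block=1) == 0x72
assert bound_check(0x21, -0x4, next_block=7, prev_block=1) == 0x1D
```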
• if the 168 entry is 'invalid', the secondary data buffer memory 161 is addressed by the address on 191, and the primary data block read out is stored in the primary data buffer memory 162 at the primary cache block address given by the cache replacement logic. The 68 is also stored in the entry addressed by the 191 in the secondary data active table 168, and in the primary data track table 166 in place of the BN2X address.
• the system addresses the primary correlation table 176 with the DBN1 address in 138, mapping the DBN1 address to the DBN2 address for output via the bus 182.
• the adder 172 adds the step size 140 and the DBN2 address on 182, and its output 183 addresses the secondary data active table 168. If the entry is 'valid', the DBN1X address in the entry is concatenated with the DBNY 13 on 183 to access the primary data buffer memory 162 via the bus 184, and the read data is stored into the DRB entry to be pushed; the DBN1 address on 184 is also stored into the 138 field of the corresponding row in the step size table 150, keeping the 139 field unchanged at '2'.
• if the entry is 'invalid', the secondary data buffer memory 161 is addressed by 183, and the primary data block read out is stored in the primary data buffer memory 162 at the first-level cache block specified by the primary data block number 68 given by the primary data cache replacement logic. The system concatenates the 68 with the DBNY on 183 as the DBN1 address to access 162, stores the read data into the DRB entry to be pushed, and stores the DBN1 address into the 138 field of the corresponding row in the step size table, keeping the 139 field unchanged at '2'.
• the system addresses the secondary correlation table 175 with the DBN2 address on the bus 182 and maps the DBN2 address to a DBN3 address output via the bus 186.
• the adder 171 adds the step size 140 and the DBN3 address on 186, and its output 188 addresses the three-level data active table 167. If the entry in 167 is 'valid', the DBN2X address in the entry is concatenated with the DBNY 13 on 188, and the secondary data active table 168 is addressed via the bus 189. If the entry in 168 is 'valid', the DBNY on the bus 188 is directly concatenated with the DBN1X address in the entry to access 162;
• the read data is stored in the DRB entry to be pushed, the DBN1 address is stored in the corresponding row's 138 field in the step table, and the 139 field remains '2'. If the entry in 168 is 'invalid', the secondary data buffer memory 161 is addressed by the DBN2 address on the bus 189, and the read data block is stored by the primary data cache replacement logic in the primary cache block of the primary data buffer memory 162 pointed to by the primary data cache block number 68; the 68 is also stored in the entry in 168 addressed by the bus 189, and that entry is set to 'valid'.
• the system concatenates the 68 with the DBNY on 189 as the DBN1 address to access 162, and the read data is stored in the DRB entry to be pushed; the DBN1 address is also stored in the corresponding row's 138 field in the step table, and the 139 field remains '2'.
• the system addresses the level 3 correlation table 174 with the DBN3 address on the bus 186 and maps the DBN3 address to a BN4 address output via the bus 196.
• the adder 170 adds the step size 140 and the BN4 address on 196, and its output 197 addresses the four-level active table 120. If the entry in 120 is 'valid', the DBN3X address in the entry is concatenated with the DBNY 13 on 197, and the three-level data active table 167 is addressed via the bus 125. If the entry in 167 is 'valid', the DBNY on the bus 125 is directly concatenated with the DBN2X address in the entry,
• and the secondary data active table 168 is accessed via the bus 189 as a DBN2 address. If the entry in 168 is 'invalid', the secondary data buffer memory 161 is addressed by the DBN2 address on the bus 189, and the read data block is stored by the primary data cache replacement logic in the primary cache block of the primary data buffer memory 162 pointed to by the primary data cache block number 68; the 68 is also stored in the entry in 168 addressed by the bus 189, and that entry is set to 'valid'. Accessing the secondary data active table 168 with the DBN2 address on the bus 189 and the subsequent operations are the same as described in the previous paragraph.
• the system accesses 162 with the DBN1 address, and the read data is stored in the DRB 163 entry to be pushed; the DBN1 address is also stored in the corresponding row's 138 field in the step table, and the 139 field remains '2'.
• the system addresses the tag unit in 51 with the BN4 address on the bus 196, reads the corresponding tag 61, and sends it to the adder 169 via the bus 113. 169 adds the tag 61 and the step size 140; the sum 198 is selected by the selector 54 and sent to the tag unit in 51 for matching. If the match yields a new BN4 address, the four-level active table 120 is addressed via the bus 123 with the new BN4 address; if the entry is 'valid', the DBN3X address in the entry addresses the three-level active table 167 via the bus 125.
• finally, the system accesses 162 with the DBN1 address obtained through the level-by-level active table mappings, and the read data is stored in the DRB entry to be pushed; the DBN1 address is also stored in the 138 field of the corresponding row in the step table, and the 139 field remains '2'. If, in this process, the corresponding data block does not exist at some storage level, the system automatically reads the data block from the next lower storage level and stores it in the cache block specified by that level's cache replacement logic; the cache block address is also stored in the lower level's active table, and the lower-level cache block number is stored in that level's related table, establishing a two-way mapping.
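The level-by-level escalation just described can be condensed into a small model. This is a sketch under invented parameters: a purely binary fan-out (two child blocks per parent, as drawn in FIG. 19) and an eight-word level-one block; the real DBN widths and table contents are hardware-defined.

```python
L1_BLOCK = 8   # words per level-one cache block (assumed for illustration)
FANOUT = 2     # child blocks per parent block, as drawn in FIG. 19

def levels_to_climb(dbn1_offset, step):
    """Number of correlation-table mappings (DBN1->DBN2->DBN3->BN4) needed
    before the stepped address falls inside a block at that level;
    4 means even the level-four block is exceeded (tag match in 51)."""
    target = dbn1_offset + step
    block = L1_BLOCK
    for level in range(4):
        if 0 <= target < block:
            return level          # 0: sum 181 usable directly as the DBN1 address
        block *= FANOUT
    return 4
```

With these assumptions, a step that stays inside the level-one block needs no mapping at all, matching the fast path where the sum directly addresses 162.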
• Data stores can be handled in a similar way, or stored first in a write buffer (Write Buffer); when the data cache is idle, the data in the write buffer is written back to the data cache.
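A write buffer of the kind mentioned can be sketched as below. The draining-when-idle policy and forwarding of buffered stores to later loads are the essential behaviors; all names are invented and the cache is modeled as a plain dictionary.

```python
class WriteBuffer:
    """Minimal write-buffer sketch: queues stores, drains when the cache is idle."""

    def __init__(self):
        self.pending = []                     # (address, value) pairs in store order

    def store(self, addr, value):
        self.pending.append((addr, value))    # the store completes immediately

    def drain(self, cache, cache_idle):
        # write queued data back to the data cache while its port is idle
        while cache_idle and self.pending:
            addr, value = self.pending.pop(0)
            cache[addr] = value

    def forward(self, addr):
        # a load must also check the write buffer (newest entry wins)
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return None                           # not buffered: read the cache
```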
• the processor core is still required to send the correct data address over the bus 49 to compare with the guessed DBN1 address. If they differ, the speculatively loaded data and its subsequent execution results must be discarded, the data is loaded with the correct data address on the bus 49, the corresponding 139 field is set to '0', and the step size is recalculated and stored in 150.
• the guessed load address is also compared with the addresses in the write buffer to ensure that the loaded data is up to date.
• for the comparison, the DBN address can be mapped to a data address and compared with the data address on 49; alternatively, the upper part of the address on 49 can be mapped to a DBN address and compared with the DBN address the system guessed. Further, if the valid bit of the step size selected by the branch decision from the step size table 150 reads 'invalid', the step size is computed and stored in the corresponding step size field as described above.
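The verification step can be sketched as follows. `map_to_dbn` stands in for the tag-unit/active-table mapping of the correct address on bus 49, and the state dictionary stands in for the 139/138 fields of the step table row; both are assumptions for illustration.

```python
def verify_guess(guessed_dbn, correct_addr, map_to_dbn, step_row):
    """Compare the speculated DBN1 with the one mapped from the real address.

    Returns True when the speculation was correct; otherwise records the
    corrective actions (discard results, reload, reset state field to '0').
    """
    actual_dbn = map_to_dbn(correct_addr)
    if actual_dbn == guessed_dbn:
        return True                 # speculation correct, results are kept
    # wrong guess: the correct data must be reloaded and the step relearned
    step_row['state'] = 0           # corresponds to setting field 139 to '0'
    step_row['dbn1'] = actual_dbn   # corresponds to refilling field 138
    return False
```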
• the data memory hierarchy in the embodiment of FIG. 18 has a set-associative lowest cache level; that level has a tag unit and may also have a TLB for virtual-to-physical address translation. That level can be addressed by matching a memory address against the tag unit in 51, or directly by the buffer address BN4.
  • the rest of the data cache is fully associative and is addressed by the buffer address DBN.
  • the mapping between the DBN and the BN4 is performed by the active table and the related table.
  • the role of the active table is to map the low-level buffer address to the high-level buffer address; the role of the related table is to map the high-level buffer address to the low-level buffer address. Please refer to Figure 19 for its mechanism of action.
  • FIG. 19 is a schematic diagram of the action mechanism of the data cache hierarchy in the embodiment of FIG. 18.
• 200 is a level-four cache block containing two level-three cache blocks 201 and 202.
  • Each L3 cache block further contains two L2 cache blocks, such as 201 containing L2 cache blocks 203 and 204.
  • Each L2 cache block further contains two L1 cache blocks, such as 203 containing L1 cache blocks 205 and 206.
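The containment in FIG. 19 (200 holds 201 and 202; 201 holds 203 and 204; 203 holds 205 and 206) is a perfect binary nesting, so if the level-one blocks under one root are numbered 0-7, the enclosing block at any level is obtained by shifting. A sketch, assuming this purely binary layout:

```python
def ancestor_index(l1_index, level):
    """Index of the block enclosing a level-one block, 'level' levels up
    (1 -> its L2 block, 2 -> its L3 block, 3 -> its L4 root)."""
    return l1_index >> level
```

For example, level-one block 5 sits in L2 block 2, L3 block 1, and L4 root 0 under these numbering assumptions.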
• based on the size of the step 140, the system obtains the next data cache address for the same data load instruction with the fewest mapping steps and the least delay, reads the data from the primary data cache memory 162 in advance, and stores it in the corresponding entry in the DRB.
• if the 138 address points into 205 and its sum with 140 does not exceed the boundary of 205, the sum 181 is used as the new primary data cache address to address the primary data cache 162, and the read data is stored in the DRB.
• when the sum exceeds the boundary of 205, the adder 172 adds the 182 address and the step size 140, and the sum 183 addresses the entry corresponding to the L2 cache block 203 in the L2 active table 168, from which the DBN1X address of the L1 cache block 206 is read; concatenated with the DBNY 13 of 183 it forms the DBN1 that addresses the primary data buffer memory 162, and it is also stored in the 138 field of 150. If the address of the sequentially next cache block 206 is stored in the corresponding entry of the 205 cache block in the primary data track table 166, the address of 206 can also be obtained by directly addressing 166 with 181 (ignoring the overflow bit in 181).
• in that case the DBN1 format of the 138 address needs first to be mapped into the DBN2 format 182 via the primary correlation table 176. If the sum of 181 exceeds the boundary of the level-two cache block 203, the DBN2 format is further mapped into the DBN3 format 186 via the secondary correlation table 175; 186 is added to the step size 140, and the sum addresses the entry corresponding to the level-three cache block 201 in the three-level active table 167, from which the DBN2 address 189 of the level-two cache block 204 is read; the secondary active table 168 is then addressed by 189, yielding the address DBN1 of the level-one cache block 207.
• that address can address the first-level buffer memory 162 via the bus 184 to read the data and store it in the DRB 163, and the address is stored in the 138 field of 150. If the sum of 181 exceeds the boundary of the level-three cache block 201, the DBN1 address in 138 is mapped via 176 to a DBN2 format address, then via 175 to a DBN3 format address, then via 174 to a BN4 format address; that address addresses the four-level active table 120 to obtain the DBN3 format address 125; the three-level active table 167 is addressed by the DBN3 address to obtain the DBN2 address 189; the secondary active table 168 is addressed by the DBN2 address to obtain the address DBN1 of the level-one cache block 207. That address can again address the first-level buffer memory 162 via the bus 184 to read the data and store it in the DRB 163, and the address is stored in the 138 field of 150.
  • the cache blocks of each level in the data cache hierarchy form a tree structure.
• the level-four cache block is the root of the tree, and the cache blocks of the other levels are its branches and leaves at different levels; each cache block is in turn the root of the sub-tree formed by the higher-level cache blocks it contains.
• the branches and leaves are connected into a tree by the two-way address mapping. From one leaf (a level-one cache block), any other leaf under the same root (the same level-four cache block) can be reached through the mapping. Only when the target lies beyond the range of the root is a match in the tag unit in 51 required.
• if the target leaf and the source leaf belong to the same low-level root, fewer mapping levels need to be traversed;
• if the target leaf and the source leaf belong to different roots, more mapping levels need to be traversed.
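Under the same binary-nesting assumption as before (level-one blocks numbered 0-7 under each root), the number of mapping levels between two leaves is set by their lowest common ancestor, which the following sketch computes:

```python
def mapping_levels(src_l1, dst_l1):
    """Levels to climb until the two level-one blocks share an ancestor
    block; 4 means they lie under different level-four roots, requiring
    a tag match in 51."""
    for level in range(4):
        if src_l1 >> level == dst_l1 >> level:
            return level
    return 4
```

Leaves in the same level-one block need no climb at all; leaves under different level-four roots need the full chain plus a tag match.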
  • the embodiment of Figure 18 can be modified to reduce the mapping hierarchy.
  • FIG. 20 is a modified embodiment of the data cache hierarchy in the embodiment of FIG. 18.
• in FIG. 20, the three-level data buffer memory 160, the secondary data buffer memory 161, the primary data buffer memory 162, the data read buffer 163, the step size table 150, the three-level data track table 164, the secondary data track table 165, and the primary data track table 166 are as in FIG. 18;
• the primary data correlation table 176 format is as shown in 209: each entry stores not only the secondary data cache block number DBN2X of the level-one cache block, but also the corresponding three-level data cache block number DBN3X and four-level cache block number DBN4X.
  • the operation is similar to that of the embodiment of Fig. 18.
• the DBN1 address in the 138 field of the row corresponding to the data load instruction and the step size selected by the branch decision (e.g., 140) are added in the adder 173.
• the system makes a boundary determination on the output 181 of 173. If 181 is judged to lie within the level-one cache block, the level-one data buffer memory 162 is addressed directly by 181. If 181 is judged to lie outside the level-one cache block, then
• a row 209 of the primary correlation table 176 is addressed by the upper part of the 138 address, one of the cache addresses in 209 is selected according to the boundary determination, and it is added to the step size 140 by the adder 172, producing the sum 183. If the boundary determination is the level-two cache block, DBN2X in 209 is selected and added to 140, and the sum 183 is sent by the system to address the secondary active table 168; if the determination is the level-three cache block,
• DBN3X in 209 is selected and added to 140, and the sum 183 is sent by the system to address the three-level active table 167; if the determination is the level-four cache block, DBN4X in 209 is selected and added to 140, and 183 is sent by the system
• to address the four-level active table 120. The remaining operations are the same as in the embodiment of FIG. 18 and are not repeated.
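The boundary judgment and level selection just described can be sketched as a single routing step. Block sizes and the field names of entry 209 are invented for illustration, with the binary fan-out of FIG. 19 assumed:

```python
L1_BLOCK = 8   # words per level-one block (assumed; doubles at each level)

def route_prefetch(dbn1_offset, step, entry209):
    """Return (destination table, base address) for the stepped access.

    entry209 models a row of correlation table 176 in FIG. 20, holding the
    block numbers of all enclosing levels (field names assumed)."""
    target = dbn1_offset + step
    if 0 <= target < L1_BLOCK:
        return ('DL1', None)                 # within L1 block: use sum directly
    if 0 <= target < L1_BLOCK * 2:
        return ('AL2', entry209['DBN2X'])    # level-two active table (168)
    if 0 <= target < L1_BLOCK * 4:
        return ('AL3', entry209['DBN3X'])    # three-level active table (167)
    return ('AL4', entry209['DBN4X'])        # four-level active table (120)
```

This is why FIG. 20 skips the leaf-to-root remapping chain: the right base address is available in one lookup.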
• the embodiment of FIG. 20 thus saves the reverse-mapping steps, and their delays, from branch to root.
• an adder may additionally be provided to add the address formed by BN4X in 209 and the DBNY 13 in 138 to 140; the sum is used to address the tag unit in 51, mapping the BN address to a data address for comparison with the correct data address on the bus 49.
  • FIG. 21 is an embodiment of prefetching data organized in logical relationships.
  • Data can contain address pointers, which are organized logically.
• this example illustrates prefetching data organized as a binary tree; prefetching data organized by other logical relationships can be deduced by analogy.
  • 220-222 is the data in the memory, where 220 is the data, 221 is the address pointer of the left branch of the binary tree, and 222 is the address pointer of the right branch of the binary tree.
• the data buffer memory 162, the data read buffer 163, the data track table 166, the selector 192, the instruction memory 22, the IRB 39, and the processor core 23 have the same functions as the same-numbered modules in the earlier figures; some modules are not shown in the figure.
  • the entries in the data track table (DTT) 166 in this embodiment correspond one by one to the respective data entries of the data memory (DL1) 162.
  • the engine 226 is responsible for generating an entry for the data track table (DTT) 166.
• 230-232 are the entries in DTT 166 corresponding to the data 220-222 in 162.
• each entry in 166 has a 'valid bit'; the data type entry 230 corresponds to the data entry 220, and the pointer entries 231 and 232 contain the address pointers of 221 and 222, respectively, in DBN format.
• data type entries and pointer entries each carry an identifier to distinguish the two.
  • the DBN format can directly address the data store 162.
• the data read pointer 181 controls reading a row of track from the data track table 166. If the DBNY value in the pointer is near the end of a row, the next row in address order is also read, according to the BN address in the row's end track point, and both are sent to the shifter 225. 225 shifts the one or two rows of track left by the amount indicated by the DBNY in the data read pointer 181.
• the learning engine 226 receives the shifted entries, identifies the data type entry 230 from the identifiers in the entries, and determines its operation on the pointer entries 231 and 232 according to the data type in 230.
• the comparison result 228 generated by the processor core 23 controls the selector 227 to select among the pointers output by 226 and place the result on the data read pointer 181, which addresses the data memory (DL1) 162 to provide data to the processor core 23.
• suppose the data value in the 220 entry of the data memory 162 is '6', the 221 entry is the 32-bit address 'L', and the 222 entry is the 32-bit address 'R'.
• suppose further that the data type in the 230 entry of the data track table 166 is binary tree, and that the control signal is the comparison result 228 produced when the processor core 23 executes the instruction whose address is 'YYY'; 231 contains the DBN format address pointer 'DBNL' obtained by mapping the 'L' address pointer in 221, and 232 contains the DBN format address pointer 'DBNR' obtained by mapping the 'R' address in 222.
• the learning engine 226 examines the entries from the shifter 225 and selects the data type entry 230 by its identifier. Based on the binary tree data type in 230, 226 routes the 231 and 232 entries from the shifter 225 to the two inputs of the selector 227. Suppose the instruction at address 'YYY' compares the value '8' being searched for with the value '6' loaded from 220 in (DL1) 162, producing a comparison result 228 of '1', meaning the sought value is greater than the value in the current node 220.
• the data type in 230 causes 226 to observe the address 28 that controls the level-one memory 22; after 28 reaches 'YYY', the comparison result 228 generated by the processor core controls the selector 227. Here 228 is '1', so 227 selects the right-branch pointer 'DBNR' in the entry 232 and outputs it to the data read pointer 181. If the valid bit in the entry 232 is 'valid', the data pointed to by the right-branch pointer in 232 becomes the new current data.
• the selector 192 selects 181 to address 162 (DL1), and the new current data output is stored in the DRB 163. 181 also addresses DTT 166, causing 166 to output the corresponding data track containing the new current data to the shifter 225.
• the intra-block offset portion DBNY of the address on 181 controls the shifter 225 to shift the data track left so that the data type, DBNL address, and DBNR address (formats such as 230, 231, 232) are aligned with the inputs of the learning engine 226.
• each entry of the DRB 163 corresponds to an intra-block offset address (Offset, DBNY); 162 (DL1) stores the entire data block into 163 (if the data specified by the data type 230, such as 220-222, exceeds one data block, the read crosses from the data block starting at the 'DBNR' address to the next data block in address order).
• the processor core 23 uses the offset part of the data address (Data Address) 94, generated by executing the load instruction, to address the DRB 163 and reads the current data and its left-branch and right-branch address pointers (formats such as 220, 221, 222). The processor core 23 executes an instruction comparing the sought value '8' with the current data, producing the comparison result 228.
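The traversal loop that emerges from the description above — compare at the current node, let result 228 pick 'DBNL' or 'DBNR' from the track entries, repeat — is an ordinary binary search over pointers. A sketch with an invented three-node tree, where nodes are keyed by their DBN-like indices:

```python
def next_pointer(track_entries, cmp_228):
    """Selector 227's choice: comparison result 1 takes the right branch."""
    data_type, left, right = track_entries      # models entries 230, 231, 232
    assert data_type == 'binary_tree'
    return right if cmp_228 == 1 else left

def search(tree, tracks, root, key):
    """Follow track-table pointers until the key is found or a branch is empty."""
    node = root
    while node is not None:
        value = tree[node]                      # the current data (e.g., 220)
        if value == key:
            return node
        cmp_228 = 1 if key > value else 0       # result of the compare at 'YYY'
        node = next_pointer(tracks[node], cmp_228)
    return None
```

In the hardware this selection happens on the prefetch path, so the next node's data is already in the DRB when the core's load reaches it.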
• the learning engine 226 monitors the address 28, the comparison result 228 generated by the processor core 23, the data address 94, and the corresponding data 223 output by (DL1) 162 to generate data track (Data Track) entries to be stored in DTT 166.
• the data cache system sends the data address 94 generated by the processor core 23 to the tag unit 51 (not shown) for matching and maps it to the DBN address 184; 184 addresses the data memory 162, and the read data is output to the processor core 23 via 223.
• the learning engine 226 records the address on 94 and the data on 223 read from the data memory 162 entry it addresses. 226 also compares each newly generated data address 94 with the previously recorded data on 223.
• when a newly generated data address 94 matches recorded data on 223, the learning engine 226 stores the DBN obtained by mapping that address into the data track table 166 entry corresponding to the data entry that produced the matching 223 data,
• and sets these entries to 'valid'. That is, the address pointer 'L' in 221 is matched and the mapped 'DBNL' is stored in 231, and the address pointer 'R' in 222 is matched and the mapped 'DBNR' is stored in 232.
  • 226 can also record and compare the mapped BN format data with the address.
• 226 judges data memory 162 entries meeting the following condition to be 'data' (non-pointer) entries:
• the entry's own data address is only one or a few data-word lengths away from the above address of an entry containing an address pointer, and over several iterations of the instruction loop the data on 223 is never the same as the address on the following 94.
• the extent of the instruction loop can be determined from the address of the backward-jumping branch instruction in IRB 39 and its branch target instruction address.
  • the data track table 166 entry corresponding to the 'data' entry in the data store 162 is the data type entry.
• the learning engine 226 stores the rule obtained by monitoring (i.e., when the address on 28 is 'YYY', if 228 is '0' the BN address in 231 is selected, and if 228 is '1' the BN address in 232 is selected) into the data track table entry corresponding to the 'data' entry (here, entry 230 corresponding to 220), and sets that entry to 'valid'.
• the valid bit in a data type entry may be several bits wide, acting as a count: greater than a preset value means 'valid'; not greater than the preset value means 'invalid'.
  • the comparison result 228 generated by the processor core 23 to execute the instruction controls the selector 227 to select the address pointer to cause the data read pointer 181 to move along the binary tree.
• the learning engine 226 controls the same group of data and its address pointers (e.g., 220-222) to be read out of the data cache 162 in advance and stored in the DRB 163, to be read by the data address 94 generated by the processor core 23.
• this avoids the delay of matching in the tag unit and then addressing the data memory 162 with the data address 94.
• the access delay of the data read buffer DRB 163 is a single clock cycle, typically less than the access latency of 162.
• the data read buffer can also be organized in the manner of the embodiment of FIG. 18, that is, the entries of 163 correspond one-to-one with the entries of the instruction read buffer IRB 39.
• a field is then added to each entry in the data track table (DTT) 166 to record the address or flag of the instruction that reads the data in the data memory 162 entry corresponding to that DTT entry (for example, the position of the load instruction in the instruction read buffer).
• when the learning engine 226 controls reading the data in 162 according to an entry in 166, the data is stored in the DRB 163 entry corresponding to the flag in that entry.
• the learning engine 226 performs learning once;
• the results of the learning are stored in the data track table 166 in the form of data types and address pointers.
• the data type read from the data track table is used to control 226 itself in processing the other entries read from the track, such as routing an entry at an input of 226 to a particular output of 226, or controlling the polarity of the comparison result 228;
• the selector 227 selects the correct address pointer under the control of 228 and places it on the data read pointer 181,
• which addresses the data memory 162 to output data (e.g., 220).
• the data type also controls 226 to generate and output one or more subsequent addresses (adding an increment, an integer multiple of the data word length, to the correct pointer address) to address 162 and output the other data of the same group (e.g., 221, 222). The data type is therefore a control configuration for 226: the IRB address or flag at which the comparison result 228 is produced, the polarity of 228, and the number of subsequent addresses to generate.
• the learning engine 226 also compares the DBN address on the bus 181 with the DBN obtained by mapping the data address 94 generated by the processor core 23; if they differ,
• the valid value in the data type entry in the corresponding DTT 166 is decremented by '1', and the DBN 184 obtained by the mapping
• is placed on the bus 181 to address the data memory 162 to read the correct data; DTT 166 is also addressed to read the corresponding track entry.
• the learning engine 226 relearns 166 entries whose valid value has dropped to '0'.
  • the embodiment of Figure 21 can be used in conjunction with the embodiment of Figure 18.
• the learning engine 226 continuously monitors the data type in the data track table, the data on the data memory output 223, and the data address 94 output by the processor core 23. If the data on 223 is not the same as the address on the next 94, the valid value in the data type entry of the DTT 166 corresponding to the data memory 162 entry that output the data is decremented by '1'. If the data on 223 is the same as the address on the next 94, the valid value of the data type entry is incremented by '1'.
• for the group of data whose data type entry's valid value is greater than the preset value, the system operates in the manner of the embodiment of FIG. 21, i.e.,
• treats the data as containing address pointers.
• for data whose valid value is not greater than the preset value, i.e., data that does not contain addresses, the system operates in the manner of the embodiment of FIG. 18, reading the data memory 162 by DBN addresses generated according to the 'step size' and storing the data in the DRB 163 for use by the processor core 23.
• if the upper part of the 181 address generated in the manner of the FIG. 21 embodiment is the same as the upper part of the 94 address, the valid value is incremented by '1'; if not, it is decremented by '1'. This acts as a reward for the learning engine 226.
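The reward scheme reduces to a saturating confidence counter per data type entry; the threshold and ceiling below are invented values:

```python
THRESHOLD = 2   # 'preset value' above which the entry counts as a pointer type
MAX_CONF = 3    # saturation ceiling (assumed width of the valid field)

def update_confidence(conf, guess_matched):
    """Increment on a correct address guess, decrement otherwise; saturate."""
    if guess_matched:
        return min(conf + 1, MAX_CONF)
    return max(conf - 1, 0)

def is_pointer_entry(conf):
    """Entry is treated per FIG. 21 (pointer chasing) only above the threshold."""
    return conf > THRESHOLD
```

The same counter shape appears in conventional branch predictors; here it arbitrates between the step-size mode of FIG. 18 and the pointer mode of FIG. 21.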
  • the data type table entry 230 can further include a field in which the set of data is recorded in accordance with the FIG. 18 embodiment, or the FIG. 21 embodiment, or otherwise.
  • Figure 22 is an embodiment of a handler call (Call) and a function return (Return) instruction.
  • the selector 25 and the register 26 have the same functions as the modules of the same number in the embodiment of Fig. 2.
• the stack 233 and the selector 236 are newly added. Whether a decoded instruction is a call or return instruction is recorded, when the scanner scans the instruction and extracts the instruction type, in the instruction type field 11 of the track table entry (see FIG. 1).
• when the instruction type on the track table output 29 in FIG. 22 is a call instruction and the TAKEN signal 31 is 'branch successful', the controller (not shown) controls the BNX in the register 26 and the BNY output by the incrementer 24 to be pushed onto the stack 233.
• when the instruction type is a return instruction and the TAKEN signal 31 is 'branch successful', the controller controls the selector 236 to select the output of the stack 233,
• and the BN at the top of stack 233 is popped into the register 26, returning the program to execute the instruction following the instruction that called the function.
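The call/return mechanism is a conventional return-address stack operating on track addresses. A sketch, with the (BNX, BNY) pair in register 26 modeled as a tuple and the incremented BNY passed in explicitly (names invented):

```python
class TrackStack:
    """Models stack 233: pushes the return BN on a taken call,
    pops it back into register 26 on a taken return."""

    def __init__(self):
        self.stack = []

    def on_branch(self, insn_type, taken, reg26, next_bny):
        """Returns the new register-26 value after this branch resolves."""
        if not taken:
            return reg26                            # stack untouched
        if insn_type == 'call':
            bnx, _ = reg26
            self.stack.append((bnx, next_bny))      # return address = next BN
            return reg26                            # target comes via the track
        if insn_type == 'return':
            return self.stack.pop()                 # selector 236 picks the stack
        return reg26
```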
• the instruction type (field 11) of indirect branch instructions can also be subdivided to guide the buffer system.
• for example, an indirect branch instruction of the repeating class is recorded in the track table entry's field 11, and the generated data address and step size are recorded in the step size table 150;
• or the BNX and BNY instruction addresses are stored in the 12 and 13 fields of the track table entry respectively (see the embodiment of FIG. 1), and the step size table records only the step size. The specific operation is the same as in the embodiment of FIG. 17.
• a processor core using the cache system of the present invention does not need to retain a program counter (Program Counter) for generating instruction addresses.
• hardware breakpoints for program debugging can be mapped to BN format addresses and compared with the tracker's BN, triggering an interrupt in the same way. Accordingly, the processor core does not need the pipeline stages associated with instruction fetch.
  • FIG. 23 is another embodiment of the processor system of the present invention.
• FIG. 23 is a modification of the embodiment of FIG. 8, in which the three-level active table 50, the three-level cache's TLB and tag unit 51, the three-level buffer memory 52, the selector 54, the secondary track table 88, the secondary active table 40, the L2 cache memory 42, the track table 20, the level-one cache related table 37, the level-one buffer memory 22, the instruction read buffer 39, the tracker 47, the tracker 48, and the processor core 23 have the same functions as the same-numbered modules in the embodiment of FIG. 8.
  • a Track Read Buffer (TRB) 238 is added, as well as selectors 237, 239.
• the TRB 238 stores the track corresponding to the instruction block stored in the IRB 39.
• the processor core 23 has two front-end pipelines, FT (Fall Through, sequential) and TG (Target, branch target).
• tracker 0 (TR0) 48 provides the BNY increment 38
• to control the IRB 39 to supply the sequential instruction stream to the FT pipeline of the processor core 23; tracker 1 (TR1) 47 reads ahead along the track in the TRB for the TG addresses on the track.
• a TG address in BN1 format addresses the L1 instruction memory 22,
• and one in BN2 format addresses the instruction memory 42, to read the TG instruction; the TG instruction that may be executed next in program order is selected by the selector 239 under control of the BN1 or BN2 format and sent to the TG pipeline.
  • Taken Signal 31 selects the output of the FT or TG front-end pipeline to be executed by the back-end pipeline.
• the TG instruction block corresponding to the branch instruction, whether from L2 or L1, is selected by the selector 239 and stored in the IRB 39;
• the track from the secondary track table (TT2) 88 or the track table (TT) 20 is likewise selected by the selector 237 and stored in the TRB 238 for
• TR1 47 to read. If the TG instruction block is read from the L2 instruction memory 42 by a BN2X address on the track, it is also stored in the L1 instruction memory 22, in the primary memory block pointed to by a BN1X provided by the replacement logic.
• the BN1X is also stored in the entry of the AL2 active table 40 pointed to by the BN2X.
• a BN3 format address on the track output by the secondary track table 88 is sent via the bus 89 to
• AL3 to be mapped to a BN2 address (or, when the AL3 entry is invalid, L3 52 is addressed and the read instruction block is stored in 42).
• the BN2 address then replaces the original BN3 address on the track.
• a BN2 format address on a track output by TT2 88, TT 20, or TRB 238 can be mapped to BN1 format by AL2 40 (or the block in L2 42 is stored into L1 22 to obtain the BN1 address).
• in this embodiment TT2 88 stores TG addresses in BN3 or BN2 format;
• TT 20 stores addresses only in BN2 or BN1 format;
• TRB 238 allows TG addresses in BN3, BN2, or BN1 format.
• this restriction on the BN formats in TT2 and TT triggers instruction filling from lower memory levels to higher memory levels, which avoids the fills triggered by cache misses in the traditional cache mechanism and hence the compulsory misses; it
• guarantees that the branch target instruction is at the same or the next memory level as the direct branch instruction. Because TR1 47 reads ahead for the TG addresses on the track, it can partially or completely hide the access delay of L2 42 or L1 22. If there are dense branch instructions in an instruction segment, the TG addresses on the corresponding tracks can be deliberately staggered between BN1 and BN2 formats so that the access delays of 42 and 22 are hidden as much as possible. If an address read from the TRB is in BN3 format and the corresponding branch succeeds, the processor core 23 waits for it to be mapped from the BN3 address (the mapping begins when the track is output from TT2 88, so the AL3 or L3 delay can be partially or completely hidden); the resulting BN2 format address is
• filled into the track in 238 before the branch target instruction is executed. If the corresponding branch is unsuccessful, the processor core 23 does not wait; it directly executes the next sequential instruction, and the mapped BN2 format address is filled into the track once obtained.
• the track is filled into the row of TT 20 indicated by the BN1X provided by the above replacement logic.
  • the system can control the secondary instruction memory 42 or the primary instruction memory 22 to provide the TG command to the processor core 23 according to the track output of the secondary track table 88 or the primary track table 20, and the IRB. 39 provides sequential instructions to the processor core.
  • Proceeding to the next sequential instruction block is handled as a branch: the instruction type of the end track point of a track is set to an unconditional branch, so the processing is the same as the branch handling described above.
  • The method and system in this embodiment are also applicable to other multi-level track-based instruction cache systems, such as the embodiments of Figures 11, 12, 13, and 18.
  • The functional modules in FIG. 12 can be divided across the two ends of a communication channel with a long delay. Assume that the memory 111 in Fig. 12 is located at one end of the communication channel and the remaining modules are located at the other end.
  • The communication channel may be between one processor core and the memory of another processor core on the same chip; between one processor lane and the memory of another processor lane on the same chip; between a processor core on one chip and memory on another chip; between the processor of one computer and the memory of another computer; between a processor core or computer and memory at the other end of a wired or wireless network; or any other communication channel with a long delay.
  • An IPv6 address is 128 bits. Assuming the memory address is 64 bits, the IPv6 address and the memory address are combined into a 192-bit address that addresses the memory at the far end of the network. To support the 192-bit address, only the components 43, 51, and 113 in Figure 12 need to provide 192 bits of width, while their functions and operations remain the same; the remaining components need no change on account of this 192-bit width.
  • The TLB/TAG unit 51 must be able to store tags supporting a 192-bit address (such as a 128-bit network tag plus a 64-bit memory tag), and the scanner 43 must be able to add the 192-bit current block address provided by 51, the intra-block offset of the branch instruction, and the branch offset to obtain a 192-bit branch target address.
  • This 192-bit branch target address is matched against the contents of the tag unit TAG in 51. If there is no match, the 192-bit branch target address is sent via bus 113 to the memory 111 at the other end of the channel to fetch the instruction. If there is a match, the resulting BN3 or BN2 address is stored in the secondary track table 88 as described above for the embodiment of FIG. 12, and the details are not repeated here.
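As a rough illustration of the address composition and matching just described, the following sketch forms a 192-bit network memory address (128-bit IPv6 address concatenated with a 64-bit memory address), computes a branch target, and matches it against a tag store. All helper names, the dict-based tag store, and the 6-bit block offset are assumptions of this sketch, not the patent's design.

```python
# Sketch: 192-bit network memory address = 128-bit IPv6 address (high bits)
# concatenated with a 64-bit memory address (low bits).

IPV6_BITS = 128
MEM_BITS = 64

def make_network_memory_address(ipv6_addr: int, mem_addr: int) -> int:
    """Concatenate the network address (high bits) with the memory address."""
    assert 0 <= ipv6_addr < (1 << IPV6_BITS)
    assert 0 <= mem_addr < (1 << MEM_BITS)
    return (ipv6_addr << MEM_BITS) | mem_addr

def branch_target(block_addr_192: int, intra_block_offset: int,
                  branch_offset: int) -> int:
    """Block address + intra-block offset + branch offset (as in scanner 43)."""
    return (block_addr_192 + intra_block_offset + branch_offset) \
        % (1 << (IPV6_BITS + MEM_BITS))

def tag_lookup(tag_store: dict, target_192: int, block_bits: int = 6):
    """Match the 192-bit target against the tag unit.
    Returns a cache (BN) address on a hit, or None on a miss
    (a miss would send the address over bus 113 to the remote memory)."""
    tag = target_192 >> block_bits      # drop the intra-block offset bits
    return tag_store.get(tag)
```

Only the components carrying this address need 192 bits of width; the hit result is an ordinary cache address whose width is unrelated to the 192-bit input.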
  • The above specific embodiment, which applies the structure of Fig. 12 across a communication channel, can also be applied to the structures of Figs. 13 and 18.
  • Assume the memory 111 in FIG. 18 is located at one end of the communication channel and the remaining modules are located at the other end.
  • Operation of the instruction memory at the far end of the communication channel can be supported as long as the widths of the TLB/TAG unit 51, the scanner 43, and the bus 113 can accommodate the memory address extended with the network address prefix.
  • The instruction memory portion of the specific embodiment of FIG. 13 is the same as that of FIG. 18 described above and will not be repeated. In FIG. 18, the adder 169 that generates data addresses and its output bus 198 can likewise be widened to support the memory address with the network address prefix as described above. Apart from the modules 51, 43, and 169 and the widths of buses 113 and 198, the remaining modules in Figure 18 need not be changed, since they operate on cache addresses.
  • The network memory address (network address + memory address) is mapped to a cache address via the tag unit TAG in 51. The width of the cache address depends on the organization of the cache and is independent of the network memory address.
  • The address on the bus 113 may be transmitted as a packet: the network address portion of the network memory address is placed in the packet header, and the memory address portion is placed in the packet contents.
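One plausible encoding of such a packet is sketched below. The patent only specifies that the network address goes in the header and the memory address in the packet contents; the byte layout, field order, and sizes here are assumptions for illustration.

```python
import struct

# Hypothetical packet layout for bus 113: 128-bit network address in the
# header, 64-bit memory address in the payload (big-endian throughout).

def pack_request(net_addr: int, mem_addr: int) -> bytes:
    header = net_addr.to_bytes(16, "big")     # network address -> packet header
    payload = struct.pack(">Q", mem_addr)     # memory address -> packet contents
    return header + payload

def unpack_request(packet: bytes):
    net_addr = int.from_bytes(packet[:16], "big")
    (mem_addr,) = struct.unpack(">Q", packet[16:24])
    return net_addr, mem_addr
```

The receiving memory 111 would unpack the payload and use the memory address locally; an arbiter there orders concurrent requests, as noted in the text.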
  • An arbiter should be present in 111 to determine the order of accesses.
  • The network address corresponding to each thread is stored in a thread register in the processor core.
  • The adder 169, or the adder in the scanner 43 in Fig. 18, can use a bit width equal to that of the network memory address, but an optimized implementation only needs a bit width sufficient for the memory address.
  • The thread number of the currently executing thread reads the thread register to obtain the network address stored for that thread. The network address is concatenated with the calculated memory address to form the network memory address, which is sent to the tag unit TAG in 51 for matching.
  • The tag unit can store a number of full network memory addresses, for example 192 bits per entry, but there are several ways to optimize. One is to use two tables: each entry in Table 2 stores, in addition to the tag of the memory address, the row number of another table, Table 1, and each entry of Table 1 stores a network address. The network address portion of the network memory address is first matched against the contents of Table 1 to obtain a Table 1 row number. The obtained Table 1 row number is then combined with the memory address to match against Table 2. The matching result of Table 2 is the cache address. If there is no match, the instruction is fetched from the memory 111 via the bus 113 using the network memory address and filled into the memory 112.
  • The other way is to store in Table 2, in addition to the tag of the memory address, only the row number (or thread number) of the above thread register. The row number (or thread number) of the thread register is combined with the memory address to match against Table 2. If there is no match, the network address addressed by that row number (or thread number) in the thread register is concatenated with the memory address to form the network memory address, which is used to fetch from the memory 111 via the bus 113 into the memory 112. The actual added cost is therefore small.
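A minimal sketch of the first two-table optimization described above: Table 1 holds network addresses, and Table 2 holds a Table 1 row number plus a memory-address tag, so full 192-bit addresses are not duplicated in every tag entry. The class and its structures are simplified assumptions, not the patent's hardware.

```python
# Table 1: row -> network address.
# Table 2: (table1_row, memory tag) -> cache address (BN).

class TwoLevelTag:
    def __init__(self, block_bits: int = 6):
        self.block_bits = block_bits
        self.table1 = []      # list of network addresses; index = row number
        self.table2 = {}      # (table1_row, mem_tag) -> cache address

    def insert(self, net_addr: int, mem_addr: int, bn):
        if net_addr not in self.table1:
            self.table1.append(net_addr)
        row = self.table1.index(net_addr)
        self.table2[(row, mem_addr >> self.block_bits)] = bn

    def lookup(self, net_addr: int, mem_addr: int):
        """Return the cache address on a hit, or None on a miss
        (a miss triggers a fetch from memory 111 over bus 113)."""
        if net_addr not in self.table1:
            return None                       # unknown network address
        row = self.table1.index(net_addr)
        return self.table2.get((row, mem_addr >> self.block_bits))
```

Because many cached blocks typically share a few network addresses, Table 1 stays small while Table 2 entries shrink from 192-bit tags to a short row number plus a memory tag.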
  • In the embodiments of Figures 12, 13, and 18, the scanner 43 calculates the branch target instruction address of a branch instruction based on the instruction block address of that branch instruction obtained from the tag unit in 51.
  • Since physical addresses are stored in the tag unit in 51, the branch target instruction address calculated by the scanner 43 is a physical address.
  • As long as it does not cross a physical page boundary, the physical address of the branch target instruction can be matched directly against the contents of the tag unit in 51 without TLB mapping.
  • Likewise, the data address generated by the adder 169 based on the physical address in the tag unit of 51 is a physical address. As long as it does not cross a physical page boundary, it can be matched directly against the contents of the tag unit in 51 without TLB mapping.
  • The match result is the BN address of the last level cache.
  • Thus the scanner 43 and the data address generator 169 generate physical addresses that can be directly matched against the TAG in 51.
  • All other addresses that address the last level cache (the addresses on bus 29 in Figures 4 and 5, on buses 8 and 89 in Figures 8, 11, and 12, and on bus 119 in Figures 13 and 18) are in the cache address format BN, which can directly address the last level cache memory, the active table AL, the correlation table CT, and the tag unit TAG in 51, without mapping through the TLB or the tag unit TAG in 51.
  • The systems and methods proposed by the present invention can be used in a variety of computing and data processing systems, information and data storage systems, and communication systems.
  • The system and method proposed by the present invention can mask or significantly reduce storage system access latency and cache misses.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Advance Control (AREA)

Abstract

Provided are a processor system and a method. When applied in the processor and computer fields, the caching system can actively push instructions and data to the processor core, sparing the processor core the latency of fetching instructions and data from a cache, thereby improving processor performance.

Description

Processor system and method based on instruction and data push

Technical field
The invention relates to the fields of computers, communications, and integrated circuits.
Background art
The central processor in a stored-program computer generates an address and sends it to the memory, from which an instruction or data is read back for execution by the central processor; the result of execution is sent back to the memory for storage. As technology advances, memory capacity grows, memory access latency increases, and the channel delay of memory accesses increases, while the execution speed of the central processor keeps rising; memory access latency has therefore become a bottleneck for improving computer performance. Stored-program computers consequently use caches to mask memory access latency and relieve this bottleneck. But the central processor fetches instructions or data from the cache in exactly the same way: the processor core in the central processor generates an address and sends it to the cache, and if the address matches an address tag stored in the cache, the cache sends the corresponding information directly to the processor core for execution, thereby avoiding the delay of accessing memory. As technology advances further, cache capacity grows, cache access latency increases, and the channel delay of cache accesses increases, while the execution speed of the processor core keeps rising; cache access latency has now become a serious bottleneck for computer performance.
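For contrast with the push approach this invention introduces, the conventional pull-style lookup described above can be modeled as follows. A direct-mapped organization and the specific bit widths are chosen purely for brevity; real caches are larger and usually set-associative.

```python
# Conventional "pull": the processor core sends an address; the cache compares
# it against a stored tag and returns the data only on a match.

LINE_BITS = 4       # 16-byte lines (assumed)
INDEX_BITS = 2      # 4 sets (assumed)

def split(addr: int):
    offset = addr & ((1 << LINE_BITS) - 1)
    index = (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (LINE_BITS + INDEX_BITS)
    return tag, index, offset

def pull_lookup(cache, addr: int):
    """cache: per-set list of (tag, line_bytes) or None."""
    tag, index, offset = split(addr)
    entry = cache[index]
    if entry is not None and entry[0] == tag:
        return entry[1][offset]   # hit: data returned to the core
    return None                   # miss: the core suffers the memory delay
```

Note the round trip: the core must first produce and transmit the address before any data can come back, which is exactly the two-way delay the push model below removes.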
Technical problem
The above manner in which the processor core fetches information (including instructions and data) from memory for execution can be viewed as the processor core pulling information from the memory. Pulling information traverses the delay channel twice: once when the processor sends the address to the memory, and once when the memory sends the information back to the processor core. Moreover, to support this pull model, the processor of every stored-program computer has modules for generating and recording instruction addresses, and its pipeline necessarily contains instruction fetch stages. Instruction fetch in modern stored-program computers usually requires several pipeline stages, which deepens the pipeline and increases the penalty of branch mispredictions. Generating and recording a long instruction address also consumes considerable energy. In particular, computers that convert variable-length instructions into fixed-length micro-operations for execution must reverse-translate the address of a fixed-length micro-operation into the address of a variable-length instruction in order to address the cache, at significant cost.
The method and system proposed by the present invention directly address one or more of the above and other difficulties.
The present invention proposes a processor system comprising a push buffer and a corresponding processor core, characterized in that: the processor core neither generates nor maintains instruction addresses, and its pipeline has no instruction fetch stage; the processor core only provides branch decisions to the push buffer, and, when executing an indirect branch instruction, provides the base address stored in the register file; the push buffer extracts and stores the control flow information contained in the instructions it holds, and pushes instructions to the processor core for execution according to the control flow information and the branch decisions; when an indirect branch instruction is encountered, the push buffer provides the correct indirect branch target instruction to the processor core for execution based on the base address received from the processor core. Further, the push buffer may provide the processor core with both the sequential successor of a branch instruction and the branch target instruction, and the branch decision generated by the processor core selects one of the two for execution, thereby masking the delay of transmitting the branch decision from the processor core to the push buffer.

Further, the push buffer may store the base addresses of indirect branch instructions together with the corresponding indirect branch target addresses, reducing or eliminating the delay of pushing indirect branch target instructions and partially or completely masking the delay of sending the base address from the processor core to the push buffer. Still further, the push buffer may push instructions to the processor core ahead of time based on the control flow information it stores, partially or completely masking the delay of transmitting information from the push buffer to the processor core. The processor core of the processor system proposed by the present invention needs no instruction fetch pipeline stage, nor does it need to generate or record instruction addresses.
The present invention proposes an organization for a multi-level cache hierarchy whose last (lowest) level cache (Last Level Cache, LLC) is set-associative and has a virtual-to-physical address translation buffer (TLB) and a tag unit (TAG): a virtual memory address is translated by the TLB into a physical memory address, and the resulting physical address is matched against the TAG contents to obtain the LLC cache address. Since the LLC cache address is mapped from the physical memory address, the LLC cache address is effectively a physical address. The resulting LLC cache address can be used to address the LLC information memory (RAM) and to select entries of the LLC active table. The LLC active table stores the mapping between LLC cache blocks and cache blocks in higher-level caches; that is, the LLC active table is addressed by the LLC cache address, and its entries contain the corresponding higher-level cache block addresses. In the present invention, all cache levels other than the LLC are fully associative and are directly addressed by their own level's cache addresses, requiring no tag unit TAG or TLB.

The cache address of each level is mapped to the next higher level's cache address through an active table similar to the LLC active table: it is addressed by the current level's cache address, and its entries store higher-level cache addresses. The highest-level cache has a corresponding track table, which stores the control flow information extracted by the scanner as it scans and examines instructions being stored into the highest-level cache RAM. The track table is addressed by the highest-level cache address, and its entries store the branch target addresses of branch instructions. The tracker generates a highest-level cache address that addresses the first read port of the highest-level cache memory, whose output of sequential instructions is pushed to the processor core; the same address also selects the corresponding track table entry to read out the branch target address, and that branch target address addresses the second read port of the highest-level cache memory, whose output of branch target instructions is likewise pushed to the processor core. The processor core executes the branch instruction and produces a branch decision, selecting one of the two instructions for execution and discarding the other. The branch decision also directs the tracker to select the corresponding one of the two cache addresses and continue addressing the highest-level cache to push instructions to the processor core.
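Schematically, the LLC lookup chain just described is: virtual address → (TLB) → physical address → (TAG match) → LLC cache address, after which higher levels are reached through active tables without any further tag matching. The toy model below uses Python dicts as stand-ins for the TLB, TAG, and active table hardware; page and block sizes are assumptions.

```python
# Virtual address -> (TLB) -> physical address -> (TAG) -> LLC cache address.
# Higher cache levels are then addressed directly via active tables.

PAGE_BITS = 12   # 4 KB pages (assumed)
BLOCK_BITS = 6   # 64-byte cache blocks (assumed)

def llc_address(vaddr: int, tlb: dict, tag: dict):
    """Map a virtual address to an LLC cache address (block number, offset)."""
    vpn, page_off = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
    ppn = tlb[vpn]                              # virtual -> physical page
    paddr = (ppn << PAGE_BITS) | page_off
    block = tag.get(paddr >> BLOCK_BITS)        # physical -> LLC block number
    if block is None:
        return None                             # LLC miss
    return block, paddr & ((1 << BLOCK_BITS) - 1)

def upper_level_address(active_table: dict, llc_block: int):
    """Active table: LLC block -> higher-level cache block (if cached there)."""
    return active_table.get(llc_block)
```

Translation and tag matching thus happen exactly once, at the LLC; every level above is indexed by short cache addresses, which is why the fully associative upper levels cost roughly what a direct-mapped cache would.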
The present invention proposes a cache replacement method that determines replaceable cache blocks according to the degree of association between cache blocks. The track table records the paths by which branch sources jump to branch targets. A correlation table additionally records, for each cache block, the corresponding lower-level cache address of the block's contents, the branch source paths that jump into the block, and the number of branch sources that jump into it. The degree of association of a cache block can be defined by the count of branch sources jumping into it: the smaller the count, the lower the degree of association, and such blocks can be replaced first. Among cache blocks with the same minimum degree of association, the one whose previous replacement is oldest can be replaced, so that a block that has just been replaced is not replaced again immediately. When a cache block is replaced, the branch source paths stored in the correlation table are used to address the corresponding track table entries, and the block's address in those entries is replaced with the corresponding lower-level cache address of the block's contents from the correlation table, preserving the integrity of the control flow information. The above replacement is based on the degree of association within the same storage level.
The minimum-association replacement method can also be applied between different storage levels. Here the number of higher-level cache blocks holding the same contents as a cache block is recorded as its degree of association: the smaller the count, the lower the association, and the block with the lowest association is replaced. This method may also be called the Least Children method, where a child is a higher-level cache block with the same contents as the block in question. The number of track table entries that take the cache block as a branch target is also recorded (the cache block and the track table may be at different storage levels). When both counts are '0', the cache block can be replaced. If the child count is not '0', the block can be replaced after its child cache blocks have been replaced. If the number of track table entries taking the block as a branch target is not '0', replacement can wait until that count reaches '0', or the block's address in those track table entries can first be replaced with the lower-level cache address holding the block's contents. Minimum-association replacement between storage levels can also be combined with the earliest-replaced policy described above.
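The victim selection just described can be sketched as follows. The explicit counts and timestamps, and the dict representation, are modeling assumptions; only the policy (zero children, zero incoming branch-target references, tie-break by oldest previous replacement) comes from the text.

```python
# Pick a victim cache block: it must have no "children" (identical copies in a
# higher level) and no track table entries targeting it; among candidates,
# prefer the one whose last replacement is oldest, so a freshly filled block
# is not evicted again immediately.

def pick_victim(blocks):
    """blocks: list of dicts with 'children', 'branch_refs', 'last_replaced'."""
    replaceable = [b for b in blocks
                   if b["children"] == 0 and b["branch_refs"] == 0]
    if not replaceable:
        return None   # must first evict children or patch track table entries
    return min(replaceable, key=lambda b: b["last_replaced"])
```

When no block qualifies, the text offers two outs: replace the children first, or rewrite the referring track table entries to point at the lower-level copy of the block's contents.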
The present invention provides a method of temporarily storing the register states of the tracker and the processor core into a memory indexed by thread number. The register states in that memory and in the tracker and processor core can be exchanged by thread in order to switch threads. Because the instructions of each thread in the push cache of the present invention are independent, the cache need not be flushed when switching threads, and one thread can never execute another thread's instructions.
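A toy model of this per-thread state swap is shown below. The register names are placeholders; only the swap-by-thread-number mechanism comes from the text.

```python
# Tracker and core register state is exchanged with a memory indexed by thread
# number; no cache flush is needed, since each thread's instructions are
# independent in the push cache.

def switch_thread(live_state: dict, saved: dict,
                  old_tid: int, new_tid: int) -> dict:
    """Stash the outgoing thread's registers and restore the incoming one's."""
    saved[old_tid] = dict(live_state)        # save outgoing thread state
    return dict(saved.get(new_tid, {}))      # restore (or start fresh)
```

A switch is therefore just two register-file copies keyed by thread number, with no interaction with the cache contents at all.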
The present invention proposes a method and a processor system that can simultaneously execute instructions provided by a plurality of memory levels.
The present invention proposes a track-table-based method and system for function calls and function returns.
The present invention proposes a method and system for organizing the computer memory hierarchy in which, apart from the hard disk, all storage levels, including the traditional main memory, are organized as caches and managed by hardware, without the operating system allocating memory. In this scheme, reading an instruction or data does not require tag unit matching, which reduces latency.
The present invention proposes a fully associative caching method that preserves the interrelationships of data across levels, using bidirectional address mappings of data between levels to avoid the compare-and-match operation between addresses and tags. Before a load instruction is executed, the cache system reads the data in advance and pushes (serves) it to the processor core, based on the stride information extracted and retained when the same load instruction was executed previously, together with the interrelationships above.
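The stride-based push can be sketched as follows. A per-load stride table is implied by the text (and by the stride table of Figure 17); its exact structure here is an assumption.

```python
# Per load instruction: remember the last data address and the observed stride;
# before the load executes again, push data[last + stride] toward the core.

def observe(stride_table: dict, load_id, addr: int):
    """Record the address actually accessed by this load and update its stride."""
    last, _ = stride_table.get(load_id, (None, 0))
    stride = addr - last if last is not None else 0
    stride_table[load_id] = (addr, stride)

def predict_next(stride_table: dict, load_id):
    """Address to read ahead of time and push to the core (None if unseen)."""
    entry = stride_table.get(load_id)
    if entry is None:
        return None
    last, stride = entry
    return last + stride
```

After two executions of a load marching through an array, the table has learned the stride and every subsequent iteration's data can be pushed before the load issues.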
The present invention proposes a method and system for extracting and recording the interrelationships of logically organized data (that is, data containing the address information of related data). The method and system learn autonomously from the results of executing load instructions, extracting the logical relationships between data and retaining them in a data track table. Entries in the data track table correspond one to one with entries in the data memory. A data track entry corresponding to 'data' in the data memory retains the 'data type' produced by analyzing the relationships between data. A data track entry corresponding to an 'address' in the data memory retains the 'address pointer' obtained after address mapping. The 'address pointer' can directly address the data memory to read data, without tag unit matching. Before the logical relationships have been extracted, the method and system push data to the processor core according to the data interrelationships described above. After the logical relationships have been extracted, and before a load instruction is executed, the cache system reads data in advance and pushes it to the processor core, based on the logical relationships retained in the data track table from the previous execution of the same load instruction and on the comparison results provided by the processor core's execution of the related instructions.
The memory hierarchy method and system of the present invention actively push most instructions and data to the processor core; for most of the time the processor core only needs to provide branch decisions or comparison results, along with the processor's pipeline stall signal.
The present invention provides a memory hierarchy and method by which a uniform memory address can access a memory hierarchy located at the other end of a communication channel.
The present invention provides a processor system comprising a processor core and a cache, where the cache pushes instructions and data to the processor core for execution and processing.
The system and method of the present invention provide a fundamental solution to the two-way delay of the processor core accessing the cache in a processor system. In a conventional processor system, the processor core sends a memory address to the cache, and the cache sends information (instructions or data) to the processor core according to that address. The system and method of the present invention, which exploit the correlation between instructions, instead have the cache push instructions to the processor core, avoiding the delay of the processor core sending memory addresses to the cache. Moreover, the push cache of the present invention is not part of the processor core's pipeline, so instructions can be pushed ahead of time to mask the cache-to-core delay.
The system and method of the present invention also provide a multi-level cache organization in which virtual-to-physical address translation and address mapping are performed only at the lowest level cache (LLC), unlike conventional caches where virtual-to-physical translation is performed at the highest-level cache and address mapping at every level. In this organization every cache level can be addressed by cache addresses mapped from physical memory addresses, so a fully associative cache approaches a direct-mapped cache in cost and power consumption.
The system and method of the present invention also provide a cache replacement method based on the degree of association between cache blocks, suited to caches that exploit the relationships between instructions (control flow).
Other advantages and applications of the present invention will be apparent to those skilled in the art.
Figure 1 is an embodiment of the track-table-based cache system of the present invention;
Figure 2 is an embodiment of the processor system of the present invention;
Figure 3 is another embodiment of the processor system of the present invention;
Figure 4 is another embodiment of the processor system of the present invention;
Figure 5 is another embodiment of the processor system of the present invention;
Figure 6 is the address format of the processor system in the embodiment of Figure 5;
Figure 7 is a partial storage table format of the processor system in the embodiment of Figure 5;
Figure 8 is another embodiment of the processor system of the present invention;
Figure 9 is an embodiment of the indirect branch target address generator of the processor system of the present invention;
Figure 10 is a schematic diagram of the pipeline structure of the processor core in the processor system of the present invention;
Figure 11 is another embodiment of the processor system of the present invention;
Figure 12 is an embodiment of the processor/memory system of the present invention;
Figure 13 is another embodiment of the processor/memory system of the present invention;
Figure 14 is the format of each storage table in the embodiment of Figure 13;
Figure 15 is the address format of the processor system in the embodiment of Figure 13;
Figure 16 is the format of the data track table, data active table, and data correlation table of the present invention;
Figure 17 is the format and operating principle of the stride table of the present invention;
Figure 18 is another embodiment of the processor/memory system of the present invention;
Figure 19 is a schematic diagram of the operating mechanism of the data cache hierarchy in the embodiment of Figure 18;
Figure 20 is an improved embodiment of the data cache hierarchy in the embodiment of Figure 18;
Figure 21 is an embodiment of prefetching data organized by logical relationships;
Figure 22 is an embodiment of handling function call and function return instructions;
Figure 23 is another embodiment of the processor system of the present invention.
The preferred embodiment of the invention is shown in Figure 18.
The high-performance cache system and method proposed by the present invention are described in further detail below with reference to the accompanying drawings and specific embodiments. The advantages and features of the present invention will become clearer from the following description and claims. It should be noted that the drawings are all in greatly simplified form and use imprecise proportions, serving only to conveniently and clearly illustrate the embodiments of the present invention.
It should be noted that, in order to clearly illustrate the content of the present invention, multiple embodiments are presented to further explain different implementations, where the multiple embodiments are illustrative rather than exhaustive. In addition, for brevity of explanation, content already mentioned in an earlier embodiment is often omitted in later embodiments; content not mentioned in a later embodiment can therefore be found in the earlier embodiments.
虽然该发明可以以多种形式的修改和替换来扩展,说明书中也列出了一些具体的实施图例并进行详细阐述。应当理解的是,发明者的出发点不是将该发明限于所阐述的特定实施例,正相反,发明者的出发点在于保护所有基于由本权利声明定义的精神或范围内进行的改进、等效转换和修改。同样的元器件号码可能被用于所有附图以代表相同的或类似的部分。Although the invention may be modified in various forms of modifications and substitutions, some specific embodiments of the invention are set forth in the specification and detailed. It should be understood that the inventor's point of departure is not to limit the invention to the particular embodiments set forth, but the inventor's point of departure is to protect all improvements, equivalent transformations and modifications based on the spirit or scope defined by the claims. . The same component numbers may be used in all figures to represent the same or similar parts.
此外,在本说明书中对部分实施例进行了一定的简化,目的是为了能更清楚地表达本发明技术方案。应当理解的是,在本发明技术方案的框架下改变这些实施例的结构、时延、时钟周期差异和内部连接方式,都应属于本发明所附权利要求的保护范围。In addition, some embodiments have been simplified in the present specification in order to more clearly express the technical solutions of the present invention. It should be understood that changing the structure, delay, clock cycle difference and internal connection manner of these embodiments under the framework of the technical solution of the present invention should fall within the protection scope of the appended claims.
A data structure called a track table can be used to improve the caches in a processor system. The track table stores not only branch-target instruction information for branch instructions but also information on sequentially executed instructions. Figure 1 shows an example of a cache system of the present invention containing a track table, in which 10 is an embodiment of the track table. Track table 10 has the same number of rows and columns as level-one cache 22; each row is a track corresponding to one level-one cache block in the level-one cache, and each entry on a track corresponds to one instruction in that level-one cache block. In this example it is assumed that each level-one cache block holds at most 4 instructions, whose intra-block offset addresses BNY are 0, 1, 2, and 3, respectively. The following description uses five instruction blocks in level-one cache 22, whose level-one cache block addresses BN1X are 'J', 'K', 'L', 'M', and 'N', as an example. Track table 10 therefore has five corresponding tracks; each track holds at most 4 entries, corresponding to the at most 4 instructions in a level-one cache block, and the entries within a track are likewise addressed by BNY. In this example, track table 10 and the corresponding level-one cache 22 can both be addressed by the level-one cache address BN1, formed from the level-one cache block address BN1X and the intra-block offset address BNY, to read out a track-table entry and the corresponding instruction. Fields 11, 12, and 13 in Figure 1 form the entry format of track table 10; the entry format contains dedicated fields storing program-flow control information. Field 11 holds the instruction type; according to the type of the corresponding instruction, entries fall into two broad classes, non-branch instructions and branch instructions. Branch instructions can be further subdivided along one dimension into direct and indirect branches, and along another dimension into conditional and unconditional branches. Field 12 stores a cache block address, and field 13 stores an intra-block offset address. In Figure 1, field 12 is shown in the level-one cache BN1X format and field 13 in the BNY format. Other cache-address formats may also be used, in which case address-format information can be added to field 11 to indicate the format of fields 12 and 13. The track-table entry of a non-branch instruction contains only the instruction-type field 11 storing the non-branch type, while the entry of a branch instruction contains, in addition to instruction-type field 11, BNX field 12 and BNY field 13.
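The track-table organization just described can be sketched as a small data model. This is a minimal illustration, not part of the patent; the class and field names (`TrackEntry`, `itype`, and so on) are assumptions introduced for clarity.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrackEntry:
    # Field 11: instruction type; fields 12/13: target block address and offset.
    itype: str                    # e.g. 'nonbranch', 'cond_direct', 'uncond_direct'
    bnx: Optional[str] = None     # field 12 (BN1X format in this example)
    bny: Optional[int] = None     # field 13 (BNY)

# One track per level-one cache block: four entries (BNY 0..3) plus end column 16.
track_M = [
    TrackEntry('nonbranch'),
    TrackEntry('nonbranch'),
    TrackEntry('cond_direct', 'J', 3),    # entry 'M2': branch target 'J3'
    TrackEntry('nonbranch'),
]
end_M = TrackEntry('uncond_direct', 'N')  # end entry: next sequential block 'N'

# Addressing by BN1 = (BN1X, BNY): BN1X selects the track, BNY selects the entry.
assert track_M[2].bnx == 'J' and track_M[2].bny == 3
```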
Track table 10 in Figure 1 shows only fields 12 and 13. For example, the value 'J3' in entry 'M2' indicates that the branch target of the instruction corresponding to entry 'M2' has level-one cache address 'J3'. Thus, when entry 'M2' is read from track table 10 according to the track-table address (i.e., the level-one cache address), field 11 of the entry shows that the corresponding instruction is a branch instruction, and fields 12 and 13 show that its branch target is the instruction at address 'J3' in the level-one cache. Addressing the level-one cache then finds the instruction with BNY '3' in instruction block 'J', which is the branch target instruction. In addition to the columns with BNY '0' through '3', track table 10 contains an extra end column 16. Each end entry has only fields 11 and 12: field 11 stores an unconditional-branch type, and field 12 stores the BN1X of the instruction block sequentially following the block corresponding to that row. The next instruction block can thus be found directly in the level-one cache from this BN1X, and the track corresponding to that next block can be found in track table 10.
Blank entries in track table 10 correspond to non-branch instructions; the remaining entries correspond to branch instructions and also show the level-one cache address (BN1) of the branch target (instruction) of the corresponding branch instruction. For a non-branch entry on a track, the next instruction to execute can only be the instruction represented by the entry to its right on the same track. For the last entry on a track, the next instruction to execute can only be the first valid instruction of the level-one cache block pointed to by the content of the track's end entry. For a branch entry on a track, the next instruction to execute may be either the instruction represented by the entry to its right or the instruction pointed to by the BN in the entry, as selected by the branch decision. Track table 10 therefore contains all program control-flow information for all instructions stored in the level-one cache.
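The three next-instruction rules above can be condensed into one resolution function. The sketch below is illustrative only; the function and dictionary names are assumptions, and a single `taken` bit stands in for the branch decision.

```python
def next_bn1(track, end_bn1x, bnx, bny, taken, block_size=4):
    """Resolve the next (BN1X, BNY) per the track-table rules:
    a taken branch follows the stored target; otherwise execution falls
    through to the entry on the right, or to the first instruction of the
    end entry's block when the track end is reached."""
    entry = track[bny]
    if entry['type'] != 'nonbranch' and taken:
        return entry['target']            # branch taken: stored BN target
    if bny + 1 < block_size:
        return (bnx, bny + 1)             # fall through within the block
    return (end_bn1x, 0)                  # end of track: next sequential block

# Track 'M' from Figure 1: only 'M2' is a branch, targeting 'J3'; end entry 'N'.
track_M = [{'type': 'nonbranch'}, {'type': 'nonbranch'},
           {'type': 'cond_direct', 'target': ('J', 3)}, {'type': 'nonbranch'}]
assert next_bn1(track_M, 'N', 'M', 2, taken=True) == ('J', 3)
assert next_bn1(track_M, 'N', 'M', 2, taken=False) == ('M', 3)
assert next_bn1(track_M, 'N', 'M', 3, taken=False) == ('N', 0)
```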
Please refer to Figure 2, which is an embodiment of the processor system of the present invention. This example contains level-one cache 22, processor core 23, controller 27, and track table 20, which is the same as track table 10 in Figure 1. Incrementer 24, selector 25, and register 26 form a tracker 47 (within the dashed line). Processor core 23 controls selector 25 in the tracker with branch decision 31, and controls register 26 in the tracker with pipeline stall signal 32. Under the control of controller 27 and branch decision 31, selector 25 selects either output 29 of track table 20 or the output of incrementer 24. The output of selector 25 is registered by register 26, whose output 28 is called the read pointer (RPT); its format is BN1. Note that the data width of incrementer 24 equals the width of BNY: it increments only the BNY part of the read pointer by '1', without affecting the BN1X value. If the incremented result overflows the width of BNY (i.e., the capacity of a level-one cache block, for example when the carry output of incrementer 24 is '1'), the system looks up the BN1X of the sequentially next level-one cache block in the end column to replace the current block's BN1X; the same applies in all the following embodiments and is not described again. The tracker in this embodiment accesses track table 20 with read pointer 28 to output an entry via bus 29, and also accesses level-one cache 22 to read out the corresponding instruction for execution by processor core 23. Controller 27 decodes field 11 of the entry output on bus 29. If the instruction type in field 11 is non-branch, controller 27 directs selector 25 to select the output of incrementer 24, so the read pointer is incremented by '1' in the next clock cycle and the sequentially next (fall-through) instruction is read from level-one cache 22. If the instruction type in field 11 is an unconditional direct branch, controller 27 directs selector 25 to select fields 12 and 13 on bus 29, so in the next cycle read pointer 28 points to the branch target and the branch target instruction is read from level-one cache 22. If the instruction type in field 11 is a direct conditional branch, controller 27 lets branch decision 31 control selector 25: if the branch is judged not taken, in the next cycle read pointer 28 is incremented by '1' by incrementer 24 and the sequential instruction is read from level-one cache 22; if the branch is judged taken, in the next cycle the read pointer points to the branch target and the branch target instruction is read from level-one cache 22. When the pipeline in processor core 23 stalls, pipeline stall signal 32 suspends the updating of register 26 in the tracker, so the cache system stops supplying new instructions to processor core 23.
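The BNY-width incrementer and the end-column substitution on carry-out can be sketched as follows. This is an illustrative software model of combinational hardware; `step_read_pointer` and its arguments are names introduced here, not from the patent.

```python
def step_read_pointer(bn1x, bny, end_bn1x, block_size=4):
    """Incrementer 24 adds '1' to the BNY part of the read pointer only;
    a carry out of the BNY field swaps in the end column's BN1X, moving
    the pointer to the first instruction of the next sequential block."""
    bny += 1
    if bny == block_size:          # carry output is '1': BNY overflowed
        return (end_bn1x, 0)
    return (bn1x, bny)

assert step_read_pointer('M', 2, 'N') == ('M', 3)   # stays in block 'M'
assert step_read_pointer('M', 3, 'N') == ('N', 0)   # crosses into block 'N'
```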
Returning to Figure 1, the non-branch entries of track table 10 can be discarded to compress the track table. In addition to the original fields 11, 12, and 13, the compressed entry format adds a source BNY (SBNY) field 15 to record the (source) intra-block offset address of the branch instruction itself, because compressed entries are shifted horizontally within the table: although the order among branch entries is preserved, they can no longer be addressed directly by BNY. Compressed track table 14 stores the same control-flow information as track table 10, in the compressed entry format. Track table 14 shows only SBNY field 15, BNX field 12, and BNY field 13. For example, entry '1N2' in row K represents the instruction at address K1, whose branch target is N2. End entries 16 occupy the rightmost column of track table 14 and are output through independent read port 30. When read pointer 28 addresses track table 14, its BN1X reads out the SBNY 15 values of all entries in the corresponding row, and each SBNY value is sent to the comparator of its column (such as comparator 18) to be compared with the BNY part 17 of the read pointer. Each comparator outputs '0' if its column's SBNY value is less than the BNY, and '1' otherwise. The comparator outputs are examined from left to right to find the first '1', which controls selector 19 to output, via bus 29, the content of the entry in that '1''s column from the row selected by BN1X. For example, when the address on read pointer 28 is 'M0', 'M1', or 'M2', the outputs of the three comparators 18 and so on, from left to right, are all '011', so the entry content output on bus 29 corresponding to the first '1' is '2J3' in each case. When the Figure 2 embodiment uses a compressed track table in format 14 as its track table 20, controller 27 compares the BNY on read pointer 28 with the SBNY on track-table output bus 29. If BNY is less than SBNY, the instruction corresponding to the track-table entry accessed by read pointer 28 is still later than the instruction accessed by the same read pointer 28, and the system may continue to step. If BNY equals SBNY, the track-table entry accessed by read pointer 28 corresponds exactly to the accessed instruction, and controller 27 may then direct selector 25 to perform a branch operation according to the branch type in field 11 on bus 29. For ease of description, the cache systems in the embodiments of Figures 1 and 2 above are described as supplying one instruction per clock cycle.
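The comparator-and-select readout of the compressed table can be sketched as a priority search for the first entry whose SBNY is not less than the read pointer's BNY. This is an illustrative model under stated assumptions (entries kept as sorted `(sbny, entry)` pairs), not the hardware itself.

```python
def read_compressed_row(row, bny):
    """Comparators 18 output '1' where SBNY >= BNY; selector 19 picks the
    leftmost '1' column. row: (sbny, entry) pairs in ascending SBNY order."""
    for sbny, entry in row:
        if sbny >= bny:                # comparator output '1'
            return (sbny, entry)       # first '1' from the left wins
    return None                        # no branch at or after this BNY

row_M = [(2, ('J', 3))]                # row M of table 14 holds '2J3'
for bny in (0, 1, 2):                  # 'M0', 'M1', 'M2' all read out '2J3'
    assert read_compressed_row(row_M, bny) == (2, ('J', 3))

# Controller 27's rule: keep stepping while BNY < SBNY; when BNY == SBNY the
# entry corresponds to the current instruction and the branch is acted on.
sbny, _ = read_compressed_row(row_M, 1)
assert 1 < sbny                        # at 'M1': not yet at the branch
```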
Please refer to Figure 3, which is another embodiment of the processor system of the present invention, in which 20 is the track table of the level-one cache, 22 is the memory RAM of the level-one cache, 39 is an instruction read buffer (IRB), 47 is the tracker, 91 is a register, 92 is a selector, and 23 is the processor core. Instruction read buffer IRB 39 can store part of a level-one instruction cache block, or one or more level-one instruction cache blocks, and is addressed by read pointer 28 of tracker 47. Read pointer 28 also addresses track table 20. The branch target address output by the track table addresses level-one cache 22 via bus 29 and is also sent to tracker 47 via bus 29. IRB 39 and level-one cache memory 22 together form a dual-read-port memory: IRB 39 provides the first read port, memory 22 provides the second read port, and register 91 temporarily holds the data output by the second read port. The output of IRB 39 and the output of level-one cache 22 (held in register 91) are selected by selector 92 under the control of branch decision 31 output by processor core 23, and the instruction output by selector 92 is sent to processor core 23 for execution.
The operation of the processor system of the Figure 3 embodiment is described below in conjunction with the contents of track table 14 in Figure 1. Every entry in end column 16 of table 14 is of the unconditional direct branch type. For ease of description, in all embodiments of this disclosure the other entries in table 14 are assumed to be of the direct conditional branch type. Initially, read pointer 28 points to address 'L0' and the corresponding instruction is read from IRB 39; the default value of branch decision 31 directs selector 92 to select this instruction from IRB 39 for execution by processor core 23. At the same time, address 'L0' on read pointer 28 addresses track table 14, which outputs entry '0M1' on bus 29; level-one cache 22 is accessed with address 'M1' on bus 29, and the corresponding branch target instruction is read out and stored in register 91. Controller 27 now compares SBNY field 15 on bus 29 with BNY field 13 on read pointer 28 and finds them equal, so selector 92 is controlled by branch decision 31. Assuming 31 is 'not taken' at this point, 31 directs selector 92 to select the output of IRB 39 in the next clock cycle. In the next clock cycle, read pointer 28 steps to address 'L1'; the corresponding instruction is read from IRB 39 and selected by selector 92 for execution by processor core 23. At the same time, address 'L1' on read pointer 28 addresses track table 14, which outputs entry '3J0' on bus 29; level-one cache 22 is accessed with address 'J0' on bus 29, and the corresponding instruction is read out as the branch target instruction and stored in register 91. Controller 27 compares SBNY field 15 on bus 29 with BNY field 13 on read pointer 28 and finds them unequal, so by default it directs selector 92 to select the output of IRB 39 for execution by processor core 23. In the next clock cycle, read pointer 28 steps to address 'L2'; controller 27 finds that SBNY field 15 on bus 29 and BNY field 13 on read pointer 28 are still unequal, so it still directs selector 92 to select the output of IRB 39 for execution by processor core 23. In the next clock cycle, read pointer 28 steps to address 'L3'; controller 27 now finds SBNY field 15 on bus 29 equal to BNY field 13 on read pointer 28, so selector 92 is controlled by branch decision 31. Assuming 31 is 'taken' at this point, it directs selector 92 to select the output of register 91, namely the branch target instruction at address 'J0', for execution by processor core 23. At the same time, branch decision 31 also directs tracker 47 to select 'J0' on bus 29 onto read pointer 28, and directs the 'J' level-one cache block to be stored into IRB 39. In the next cycle, read pointer 28 steps to 'J1' and directs IRB 39 to output the corresponding instruction, which is selected by selector 92 for execution by processor core 23.
Please refer to Figure 4, which is another embodiment of the processor system of the present invention, in which 40 is a level-two active list (AL2), 41 is the address-translation buffer TLB and tag unit TAG of the level-two cache, 42 is the memory RAM of the level-two cache, 43 is a scanner, 44 is a selector, 20 is the track table of the level-one cache, 37 is the correlation table of the level-one cache, 22 is the memory RAM of the level-one cache, 27 is the controller, 33 is a selector, and 39 is the instruction read buffer IRB. Incrementer 24, selector 25, and register 26 together form tracker 47; incrementer 34, selector 35, and register 36 together form tracker 48. 23 is the processor core, which can receive two streams of instructions and, under the control of the branch decision, select one to complete execution while abandoning execution of the other; 45 is a register that temporarily stores the state of each processor thread.
Scanner 43 examines instruction blocks being stored from level-two cache memory 42 into level-one cache memory 22 and computes the branch target address of each direct branch instruction in them, by adding the branch offset contained in the branch instruction to the memory address of the branch instruction itself. The computed branch target address is selected by selector 44 and sent to TLB/tag unit 41 for matching. The level-two cache address BN2 obtained from the match is used to access level-two active list 40. If the instruction corresponding to that level-two cache address has already been stored into level-one cache memory 22, the corresponding entry in 40 is valid; in that case, the BN1X block address in the entry is merged with the branch type and intra-block offset BNY produced by scanner 43 into one track-table entry. If the instruction corresponding to that level-two cache address has not yet been stored into level-one cache memory 22, the corresponding entry in 40 is invalid; in that case, the matched level-two cache address BN2 (including intra-block offset BNY) is merged with the branch type produced by scanner 43 into one track-table entry. The track-table entries thus produced for an instruction block are written, in instruction order, into the track of track table 20 corresponding to that instruction block in memory 22, completing the extraction and storage of the program flow contained in that instruction block.
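The scanner's track-building pass can be sketched as below. This is an illustrative model under simplifying assumptions: flat integer instruction addresses, a toy `tlb_match` function standing in for TLB/tag unit 41, and a dictionary standing in for level-two active list 40; none of these names come from the patent.

```python
def scan_block(block_addr, instrs, tlb_match, al2):
    """Scanner 43 sketch: for each direct branch, target = branch instruction
    address + branch offset; the TLB/tag match yields a BN2 address, and a
    valid AL2 entry upgrades the stored address to BN1 format."""
    track = []
    for bny, (is_branch, offset) in enumerate(instrs):
        if not is_branch:
            track.append(('nonbranch',))
            continue
        target = block_addr + bny + offset           # branch addr + offset
        bn2x, tgt_bny = tlb_match(target)
        if bn2x in al2:                              # AL2 entry valid: use BN1X
            track.append(('branch', 'BN1', al2[bn2x], tgt_bny))
        else:                                        # not yet in L1: keep BN2
            track.append(('branch', 'BN2', bn2x, tgt_bny))
    return track

tlb = lambda addr: divmod(addr, 4)                   # toy translation to (X, Y)
al2 = {5: 'J'}                                       # L2 block 5 resides in L1 'J'
track = scan_block(20, [(False, 0), (True, 2), (True, 9)], tlb, al2)
assert track[0] == ('nonbranch',)
assert track[1] == ('branch', 'BN1', 'J', 3)         # 21+2=23 -> block 5 -> 'J'
assert track[2] == ('branch', 'BN2', 7, 3)           # 22+9=31 -> block 7, not in L1
```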
Read pointer 28 produced by tracker 47 addresses track table 20, and the entry read out is output via bus 29. Controller 27 decodes the branch type and address format of the output entry. If the branch type of the output entry is a direct branch and the cache address is in BN2 format, controller 27 addresses level-two active list 40 with that BN2 address. If the entry in 40 is valid, the BN1X in that entry is filled into track table 20 to replace the BN2X in the above entry, converting it to BN1 format. If the entry in 40 is invalid, that BN2 address is used to address level-two cache memory 42; the instruction block read out is filled into a level-one cache block of level-one cache memory 22 provided by the level-one cache replacement logic, the block number BN1X of that level-one cache block is filled into the above invalid entry in 40 and the entry is set valid, and, as above, the BN1X is filled into the track-table entry, replacing the BN2 address in the entry with a BN1 address. The BN1 address written into track table 20 can be bypassed onto bus 29 and sent to tracker 47 for later use. If the branch type output via bus 29 is a direct branch and the cache address is already in BN1 format, controller 27 sends it directly to tracker 47 for later use.
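The lazy BN2-to-BN1 promotion described above can be sketched as one function. The names (`promote_entry`, the dictionaries modeling AL2 and the two cache RAMs, the `replacement` list standing in for the replacement logic) are assumptions made for illustration.

```python
def promote_entry(entry, al2, l2_ram, l1_ram, replacement):
    """Controller 27 sketch: a BN2-format direct-branch entry is rewritten
    in BN1 format, filling L1 from L2 first when the AL2 entry is invalid."""
    fmt, bn2x, bny = entry
    if fmt != 'BN2':
        return entry                        # already BN1: pass through
    if bn2x not in al2:                     # AL2 entry invalid: fill L1
        bn1x = replacement.pop(0)           # block chosen by replacement logic
        l1_ram[bn1x] = l2_ram[bn2x]         # copy the instruction block
        al2[bn2x] = bn1x                    # set the AL2 entry valid
    return ('BN1', al2[bn2x], bny)          # written back over the BN2 entry

al2, l1 = {}, {}
l2 = {7: ['i0', 'i1', 'i2', 'i3']}
e = promote_entry(('BN2', 7, 3), al2, l2, l1, replacement=['K'])
assert e == ('BN1', 'K', 3) and l1['K'] == l2[7] and al2[7] == 'K'
# A later reference to the same L2 block hits the now-valid AL2 entry.
assert promote_entry(('BN2', 7, 0), al2, l2, l1, replacement=[]) == ('BN1', 'K', 0)
```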
If the branch type output via bus 29 is an indirect branch, controller 27 directs the tracker to wait for processor core 23 to compute the indirect branch target address, which is sent via bus 46 and selector 44 to level-two cache TLB/tag unit 41 for matching. The level-two cache address BN2 obtained from the match is used to access level-two active list 40; if the corresponding entry in 40 is invalid, that BN2 address is used, as above, to address level-two cache memory 42 and read the instruction block into a level-one cache block of level-one cache memory 22, and the resulting BN1 address is bypassed to tracker 47 for later use. Correlation table 37 is part of the replacement logic of level-one cache 22; its structure and function are described in the embodiment of Figure 7.
In processor core 23, there are two pipeline paths ahead of the branch-resolution pipeline stage. One receives sequential instructions from instruction read buffer IRB 39 and is named the FT (fall-through) path; the other receives branch target instructions from level-one cache memory 22 and is named the TG (target) path. The number of front-end pipeline stages in each path is determined by the processor's pipeline structure; in this embodiment, each path is assumed to contain two front-end pipeline stages. The branch-resolution pipeline stage in processor core 23 executes branch instructions and, according to the generated branch decision 31, selects one of the two paths to complete execution while abandoning execution of the other. In this embodiment, IRB 39 is assumed to store two instruction blocks; instruction read buffer IRB 39 is addressed by IPT read pointer 38 of tracker 48. Level-one instruction cache 22, correlation table 37, and track table 20 are addressed by RPT read pointer 28 of tracker 47.
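The dual-path commit step can be reduced to one selection. The sketch below is a deliberately minimal illustration of how branch decision 31 chooses between the two pre-issued paths; `resolve_branch` and the list-of-labels representation are assumptions, not the patent's hardware.

```python
def resolve_branch(ft_path, tg_path, judgment):
    """Branch decision 31 picks which pre-issued path commits: '0' keeps
    the fall-through (FT) path, '1' keeps the target (TG) path; the other
    path's in-flight instructions are abandoned."""
    keep = tg_path if judgment else ft_path
    drop = ft_path if judgment else tg_path
    return keep, drop

keep, drop = resolve_branch(['M3', 'N0'], ['J3', 'K0'], judgment=0)
assert keep == ['M3', 'N0'] and drop == ['J3', 'K0']   # not taken
keep, drop = resolve_branch(['M3', 'N0'], ['J3', 'K0'], judgment=1)
assert keep == ['J3', 'K0'] and drop == ['M3', 'N0']   # taken
```

Because both paths are already in flight when the decision arrives, neither outcome requires a refetch, which is the basis of the no-stall claim made below.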
When processor core 23 has not yet resolved a branch, branch decision 31 defaults to '0', meaning not taken, and processor core 23 selects the instructions of the FT path for execution. When processor core 23 resolves a branch, branch decision 31 is '0' if the branch is judged not taken, in which case processor core 23 selects the instructions of the FT path for execution; it is '1' if the branch is judged taken, in which case processor core 23 selects the instructions of the TG path for execution. Selectors 33, 25, and 35 can all be controlled by branch decision 31: when 31 is '0', all three selectors select their right-hand inputs; when 31 is '1', all three select their left-hand inputs. In addition, when processor core 23 has not resolved a branch, selectors 33 and 25 are also controlled by controller 27. The operation of the processor system of the Figure 4 embodiment is described below in conjunction with the contents of track table 14 in Figure 1. Initially, instruction block M is already in instruction read buffer IRB 39, branch decision 31 is '1', selectors 25 and 35 select their left-hand inputs, and both IPT read pointer 38 and RPT read pointer 28 point to address M1. The M1 instruction in IRB 39 pointed to by IPT 38 is sent into the FT front-end pipeline in the processor core. At the same time, RPT 28 points into track table 20; the value 'N' of end entry 16 of row M is read out through independent read port 30 to address level-one cache 22, and instruction block N is output and stored into IRB 39. Entry '2J3' of row M of track table 14, matched by BNY address '1', is then output via bus 29. Branch decision 31 is now at its default value '0'; selector 35 selects the input from incrementer 34, IPT pointer 38 steps, and IRB 39 is directed to output instructions M2, M3, and N0 to the FT front-end pipeline of processor core 23. Controller 27 compares the value '2' in SBNY field 15 on bus 29 with the value '1' in BNY field 13 on RPT 28; while they are unequal, it directs selector 25 to select the output of incrementer 24, stepping RPT 28 to point to M2. Now the SBNY on bus 29 equals the BNY on RPT read pointer 28, so controller 27 directs selector 33 and selector 25 to select their right-hand inputs, that is, BN1 address J3 on bus 29 is stored into register 26. Thereafter, controller 27 directs RPT read pointer 28 to read instructions J3 and K0 from level-one cache 22 and send them to the TG front-end pipeline of processor core 23.
M2 is a branch instruction. When it reaches the pipeline stage of processor core 23 that resolves branches, that stage executes M2 and produces a branch decision. If branch decision 31 is '0', processor core 23 continues executing the FT-path instructions M3 and N0 and abandons the TG-path instructions J3 and K0. Branch decision 31 then directs selectors 25 and 35 to store the output of incrementer 34 into registers 26 and 36, so that RPT 28 and IPT 38 both point to N1, and IPT 38 directs IRB 39 to issue N1 and the subsequent instructions to the FT path of processor core 23 for continued execution. Meanwhile RPT 28 points to row N of the track table, reads the end entry of row N, and sends it to level-one cache 22, so that the instruction block sequentially following the N instruction block is read out and stored into IRB 39.
If branch decision 31 is '1', the processor core selects the TG-path instructions J3 and K0 to continue execution and abandons the FT-path instructions M3 and N0. Branch decision 31 then causes the K-row instructions output by level-one cache 22 to be stored into IRB 39, and directs selectors 25 and 35 to store the output of incrementer 24 into registers 26 and 36, so that RPT 28 and IPT 38 both point to K1, and IPT 38 directs IRB 39 to issue K1 and the subsequent instructions to the FT path of processor core 23 for continued execution. RPT 28 points to row K; the 'L' in the end entry of row K is sent to level-one cache 22 to read out row L, which is stored into IRB 39. In this way processor core 23 can execute instructions without interruption, with no pipeline stalls caused by branches.
Tracks corresponding to different threads are orthogonal in the track table; they can therefore coexist without affecting each other. The indirect branch address 46 generated by the processor core in Fig. 4 is a virtual address. It is concatenated with the thread number and selected by selector 44; its index portion is sent simultaneously to the TLB and the level-two tag unit in 41, while its virtual-tag portion, together with the thread number, is sent to the TLB to be mapped to a physical tag. That physical tag is matched against the tags of each way read out of the level-two tag unit by the index address, and the way number obtained from the match, concatenated with the index of the virtual address, forms the level-two cache block address. The level-two cache address BN2, and the level-one cache address BN1 mapped from it, are therefore in effect derived from the physical address rather than from the virtual address.
Consequently, two different threads using the same virtual address in the processor actually have different cache addresses BN, which avoids the address-aliasing problem of the same virtual address in different programs of different threads selecting the same cache location. On the other hand, the same virtual address of the same program in different threads maps to the same physical address, and therefore to the same cache address, which avoids duplication of the same program in the cache. Multithreaded operation can be implemented on the basis of this property of the cache address. In Fig. 4, 45 is a register file that stores, per thread, the thread number and the processor state, for example the contents of register 26 in tracker 47 and register 36 in tracker 48, as well as the values of that thread's registers in processor core 23. Register file 45 is addressed by thread number 49. When the processor is to switch threads, the values in register 26 and register 36 of trackers 47 and 48, and the values of the registers in processor core 23, are all read out and stored into the entry of 45 pointed to by the swap-out thread number then on bus 49. The swap-in thread number is then sent to 45 over bus 49, and the contents of the entry it points to are loaded into registers 26 and 36 and into the registers of processor core 23. After IRB 39 is filled with the instruction block pointed to by IPT 38 and its sequentially next instruction block, operation on the swapped-in thread can begin. The instructions of the different threads in track table 20 and in caches 42 and 22 are orthogonal, so one thread can never erroneously execute another thread's instructions.
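The save/restore sequence around register file 45 can be sketched as below. This is a minimal illustration under assumptions: the class name, the dictionary-based state, and the example register contents are invented for clarity; only the behavior (save the outgoing thread's tracker and core registers under its thread number, then load the incoming thread's) follows the description above.

```python
# Hypothetical sketch of register file 45: per-thread saved state,
# addressed by thread number, used at thread-switch time.

class RegisterFile45:
    def __init__(self):
        self.entries = {}                 # thread number -> saved state

    def swap_out(self, thread_no, state):
        """Save the outgoing thread's tracker/core registers."""
        self.entries[thread_no] = dict(state)

    def swap_in(self, thread_no):
        """Load the incoming thread's saved registers."""
        return dict(self.entries[thread_no])

rf = RegisterFile45()
rf.swap_out(0, {"reg26": "M2", "reg36": "M2", "core_regs": [1, 2]})
rf.swap_out(1, {"reg26": "K0", "reg36": "K0", "core_regs": [7, 8]})
assert rf.swap_in(1)["reg26"] == "K0"     # incoming thread 1 resumes at K0
```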
Please refer to Fig. 5, another embodiment of the processor system of the invention. The level-two active list 40, level-two cache memory RAM 42, level-two scanner 43, track table 20, level-one correlation table 37, level-one cache memory RAM 22, instruction read buffer 39, tracker 47, tracker 48 and processor core 23 have the same functions as the identically numbered modules in the Fig. 4 embodiment. Controller 27 and selector 33 are omitted from Fig. 5 for readability, but operation at the level-two cache and below is the same as in the Fig. 4 embodiment. Fig. 5 adds a level-three cache, consisting of a level-three active list 50, a level-three TLB and tag unit TAG 51, a level-three cache memory 52, a level-three scanner 53 and a selector 54, replacing the level-two TLB and tag unit 41 and the selector 44 of Fig. 4. In the Fig. 5 embodiment the last-level cache, level-three cache 52, is organized set-associatively, while level-two cache 42 and level-one cache 22 are both fully associative. Each level-two cache block in level-two cache 42 contains four level-one cache blocks, and each level-three cache block in each way of level-three cache 52 contains four level-two cache blocks.
Please refer to Fig. 6, which shows the address formats of the processor system in the Fig. 5 embodiment. A memory address is divided into a tag 61, an index 62, a level-two sub-address (L2 sub_address) 63, a level-one sub-address (L1 sub_address) 64, and an intra-block offset (BNY) 13. The level-three cache address BN3 consists of way number 65, index 62, level-two sub-address 63, level-one sub-address 64, and intra-block offset (BNY) 13. Way number 65 concatenated with index 62 is the level-three cache block address; fields 65, 62 and 63 together address one level-two instruction block within a level-three cache block; and all fields other than intra-block offset 13 are collectively called BN3X, which addresses one level-one instruction block within a level-three cache block. The level-two cache address BN2 consists of level-two cache block number 67, level-one sub-address 64, and intra-block offset (BNY) 13; level-two cache block number 67 addresses one level-two cache block, and all fields other than intra-block offset 13 are collectively called BN2X, which addresses one level-one instruction block within a level-two cache block. The level-one cache address BN1 consists of level-one cache block number 68 (BN1X) and intra-block offset (BNY) 13.
The intra-block offset (BNY) 13 is identical in all four address formats above; the BNY portion does not change during address conversion. In the BN2 address format, level-two block number 67 points to a level-two cache block and level-one sub-address 64 points to one of the four level-one instruction blocks within that level-two cache block. Likewise, in the BN3 address format, way number 65 and index 62 point to a level-three cache block, level-two sub-address 63 points to one of its four level-two instruction blocks, and level-one sub-address 64 points to one of the four level-one instruction blocks within the selected level-two instruction block.
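The field split of a BN3 address can be made concrete with a small sketch. The bit widths here are assumptions chosen for illustration (the patent does not fix them): a 3-bit BNY and 2-bit level-one and level-two sub-addresses, with everything above them treated as the level-three block address (way number plus index).

```python
# Sketch of the Fig. 6 BN3 field split with assumed (hypothetical) widths.
# Only the BNY field is shared unchanged by all address formats.

BNY_BITS, L1_BITS, L2_BITS = 3, 2, 2   # assumed widths, not from the patent

def split_bn3(bn3):
    """Split a BN3 value into (block address, L2 sub, L1 sub, BNY)."""
    bny = bn3 & ((1 << BNY_BITS) - 1)
    l1 = (bn3 >> BNY_BITS) & ((1 << L1_BITS) - 1)
    l2 = (bn3 >> (BNY_BITS + L1_BITS)) & ((1 << L2_BITS) - 1)
    blk = bn3 >> (BNY_BITS + L1_BITS + L2_BITS)   # way number 65 + index 62
    return blk, l2, l1, bny

blk, l2, l1, bny = split_bn3(0b10_01_11_101)
assert (blk, l2, l1, bny) == (0b10, 0b01, 0b11, 0b101)
```

Converting BN3 to BN2 or BN2 to BN1 under this layout would drop the high fields and substitute a block number, leaving the BNY bits untouched, matching the statement that BNY does not change during address conversion.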
Please refer to Fig. 7, which shows partial storage table formats of the processor system in the Fig. 5 embodiment, described below in conjunction with Figs. 5, 6 and 7. The format of the tag unit in 51 of Fig. 5 is physical tag 86. The CAM format of the TLB in 51 is thread number 83 plus virtual tag 84, and its RAM format is physical tag 85. The thread number 83 and virtual tag 84 selected and output by selector 54 are mapped in the TLB to a physical tag 85; the index address 62 of the virtual address reads physical tags 86 out of the tag unit, which are matched against 85 to obtain way number 65. Way number 65 concatenated with the index address 62 of the virtual address forms the level-three cache block address.
The AL3 level-three active list 50 of Fig. 5 is organized in a multi-way set-associative manner; each way has the same number of rows as the tag units of L3 cache 52 in 51, likewise addressed by index address 62. Each row holds a count field 79 and four BN2X fields 80; the several fields 80 of a row are addressed by level-two sub-address 63. Each field 80 has a corresponding valid bit 81. The same row of all ways shares one level-three pointer 82. The AL2 level-two active list 40 is organized fully associatively, with the same number of rows as L2 cache 42, addressed by level-two block address 67. Each row holds a count field 75 and four BN1X fields 76; fields 76 are addressed by level-one sub-address 64. Each field 76 has a corresponding valid bit 77. All rows share one level-two pointer 78. The CT correlation table 37 is organized fully associatively, with the same number of rows as L1 cache 22, addressed by level-one block address 68. Each row holds a count field 70, a BN2X field 71, and several BN1X fields 72. Each field 72 has a corresponding valid bit 73. All rows share one level-one pointer 74.
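The row formats just listed can be mirrored in code to make the field roles concrete. This is purely illustrative: the class names, Python types, and the fixed count of four sub-blocks per row are assumptions for the sketch, not the hardware layout.

```python
# Hypothetical mirror of two Fig. 7 row formats (field numbers kept in the
# attribute names so they can be matched against the figure).

from dataclasses import dataclass, field

@dataclass
class AL2Row:                       # one row of level-two active list 40
    count75: int = 0                # times this L2 block is a branch target
    bn1x76: list = field(default_factory=lambda: [None] * 4)
    valid77: list = field(default_factory=lambda: [0] * 4)

@dataclass
class CTRow:                        # one row of correlation table 37
    count70: int = 0                # branches targeting this L1 block
    bn2x71: int = 0                 # L2 position of this L1 block
    bn1x72: list = field(default_factory=list)   # branch-source blocks
    valid73: list = field(default_factory=list)

row = AL2Row()
row.valid77[2], row.bn1x76[2] = 1, 0b1010   # sub-block at L1 sub-address 2
assert row.valid77 == [0, 0, 1, 0]
```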
When one level-two instruction block of a level-three cache block in level-three cache 52 is stored into a level-two cache block of level-two cache 42, the block number of that level-two cache block is stored into the entry 80 addressed by level-two sub-address 63 in the row of level-three active list 50 corresponding to that level-three cache block, and the corresponding valid bit 81 is set to '1' (valid). The instructions of that level-two cache block are decoded by level-three scanner 53; the branch offset of each branch instruction is added to that instruction's address to obtain the branch target address. The address of the sequentially next level-two cache block is likewise obtained by adding the size of one level-two cache block to the memory address of this level-two cache block. Each branch target address or sequentially next level-two block address is selected by selector 54 and sent to the tag unit in 51 for matching; if it does not match, the address is sent to lower-level memory to fetch instructions into level-three cache memory 52. This guarantees that, for the instructions in level-two cache 42, the branch targets and the sequentially next level-two cache blocks are at least already in level-three cache 52 or in the process of being stored into 52.
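The two address computations the scanner performs for every filled block can be sketched as follows. The 64-byte block size and the list format of the branches are assumptions for illustration; the arithmetic itself (target = instruction address + branch offset; next block = block address + block size) is as described above.

```python
# Sketch of the scanner's address generation for one filled cache block.

L2_BLOCK_SIZE = 64   # assumed block size, not specified by the patent

def scan_block(block_addr, branches):
    """branches: list of (instr_offset_in_block, branch_offset) pairs."""
    targets = [block_addr + off + disp for off, disp in branches]
    next_block = block_addr + L2_BLOCK_SIZE   # sequentially next block
    return targets, next_block

targets, nxt = scan_block(0x1000, [(8, 0x40), (24, -0x10)])
assert targets == [0x1048, 0x1008]
assert nxt == 0x1040
```

Each generated address would then be matched against the tag unit, and fetched from lower-level memory only on a miss, which is what gives the prefetch guarantee stated above.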
When one level-one instruction block of a level-two cache block in level-two cache 42 is stored into a level-one cache block of level-one cache 22, the block number of that level-one cache block is stored into the entry 76 addressed by level-one sub-address 64 in the row of level-two active list 40 corresponding to that level-two cache block, and the corresponding valid bit 77 is set to '1' (valid). The instructions of that level-one cache block are decoded by level-two scanner 43; the branch offset of each branch instruction is added to that instruction's address to obtain the branch target address. The address of the sequentially next level-one cache block is likewise obtained by adding the size of one level-one cache block to the memory address of this level-one cache block. Each branch target address or sequentially next level-one block address is selected by selector 54 and sent to tag unit 51 for matching. If it does not match, the address is sent to lower-level memory to fetch instructions into level-three cache memory 52. If it matches, the 65, 62 and 63 portions of the resulting level-three cache address are used to read entries 80 and 81 of level-three active list 50. If 81 is '0' (invalid), the 65, 62, 63 and 64 portions of the matched level-three cache address address level-three cache memory 52, a level-two block is read out and stored into a level-two cache block of level-two cache memory 42, and the block number 67 of that level-two cache block, together with the valid value '1', is written into the entries 80 and 81 of level-three active list 50 addressed by the above level-three cache address.
If the entry 81 read out is '1' (valid), the BN2X value (67 and 64) in the entry 80 read out addresses AL2 level-two active list 40 to read entries 76 and 77. If 77 is '0' (invalid), the BN2X value concatenated with BNY forms a BN2 address (67, 64, 13), which is stored into the entry corresponding to the branch instruction on the track currently being filled in track table 20. If 77 is '1' (valid), the BN1X of the entry concatenated with BNY forms a BN1 address (68, 13), which is stored into the entry corresponding to the branch instruction on the track being filled in track table 20. In addition, the branch type 11 decoded by level-two scanner 43 is stored into the track entry of track table 20 together with the BN2 or BN1 address. The sequentially next block address of the level-one cache block is matched and addressed in the same manner; if the sequentially next level-two instruction block is not yet in level-two cache memory, the instruction block is stored from level-three cache 52 into level-two cache 42, and the resulting BN2 or BN1 address is stored into the rightmost end entry 16 of the above track. This guarantees that, for the instructions in level-one cache 22, the branch targets and the sequentially next level-one cache blocks are at least already in level-two cache 42 or in the process of being stored into 42.
This embodiment discloses a hierarchical prefetch capability: each storage level guarantees that its branch targets are at least in, or are being written into, the next lower storage level. As a result, the branch target instructions of the instructions the processor core is executing are in most cases already in the level-one or level-two cache, hiding the access latency of the lower storage levels.
While the above level-one instruction block is filled into level-one cache memory 22 and its instructions are scanned to build the corresponding track in track table 20, the corresponding row of correlation table 37 is also created. The BN2X address (67 and 64) of the level-one cache block is filled into field 71 of that row, so that when the level-one cache block is replaced, the BN2X address can replace the block number BN1X of that level-one cache block in the track table entries that target it, preserving the integrity of the control information flow in the track table. At the same time, the BN1X of the branch target being written into the track of track table 20 addresses a row of correlation table 37; the count value 70 of that row is incremented by '1', recording that one more branch instruction targets that row; the level-one cache block number of the track being written is itself written into one of that row's fields 72, and the corresponding field 73 is set to '1' (valid), recording the path (address) of the branch source. The next sequential level-one block address stored in the track's end entry is handled similarly, addressing a row of correlation table 37 with that address.
As described above, the branch target address in a track table 20 entry may be in BN2 or BN1 format. When a track table entry is output on bus 29, the controller (27 in Fig. 4) decodes its branch type 11. If the address format is BN2, the controller addresses level-two active list 40 with the BN2X address (67 and 64) on bus 29 and reads entries 76 and 77. If 77 is '0' (invalid), level-two cache memory 42 is addressed with that BN2X address, a level-one instruction block is read out and stored into a level-one cache block of level-one cache memory 22, and that level-one cache block number, together with the valid value '1', is stored into the entries 76 and 77 of level-two active list 40 pointed to by the BN2X address. If 77 is '1' (valid), the BN1X 68 in field 76 is written into field 12 of the track table entry without changing the BNY in field 13, so the BN1 address replaces the original BN2 address. The BN1X address can also be bypassed onto bus 29 for use by tracker 47. The process by which tracker 47 addresses track table 20 and level-one cache memory 22, and tracker 48 addresses IRB 39 to supply processor core 23 with uninterrupted instructions for execution, is the same as in the Fig. 4 embodiment and is not repeated here.
The cache replacement logic of this embodiment determines which cache blocks may be replaced by combining least correlation (LC) with earliest replacement (ER), hereinafter LCER. The count value 70 in correlation table 37 is used to measure correlation (also called degree of association): the smaller the count, the fewer cache blocks target that level-one cache block, making it easier to replace. The pointer 74 shared by the rows of correlation table 37 points to a replaceable row (the count value 70 of a replaceable row must be below a preset value). When the level-one cache block pointed to by pointer 74 is replaced, the corresponding track in track table 20 pointed to by 74 is also replaced with the branch types and branch targets extracted by level-two scanner 43 from the incoming level-one cache block. In addition, for each field 73 that is '1' (valid) in the row of correlation table 37 pointed to by 74, the BN1X address in the corresponding field 72 addresses a track in track table 20, and the branch target addresses in that track that recorded the replaced level-one cache block number are replaced with the BN2X in field 71 of the row of correlation table 37 pointed to by 74. Every instruction that originally targeted an instruction in the replaced level-one cache block now targets the same instruction in level-two cache 42, so replacing the level-one cache block does not affect the control information flow. The same BN2X also addresses level-two active list 40, and the count value 75 of the entry of 40 is incremented by the number of times BN1X was replaced by the BN2X value in track table 20 as above, recording the added correlation of that level-two cache block; the valid bit 77 of that entry corresponding to the replaced level-one cache block (indicated by field 64 of the BN2X address) is set to '0' (invalid). Pointer 74 then moves in a single direction and stops at the next row satisfying least correlation; when the pointer passes the boundary of the rows of correlation table 37 it wraps to the other boundary (past the highest-address row, least-correlation detection resumes from the lowest-address row). The one-way movement of pointer 74 ensures that the level-one cache block replaced longest ago is replaced first, i.e. the ER above. Checking the count value 70 of each row, combined with the one-way movement of pointer 74, implements the LCER level-one replacement policy. This replacement scheme replaces a single level-one cache block at a time.
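The one-way, wrapping scan of pointer 74 can be sketched as below. The threshold value and table contents are illustrative; only the scan discipline (advance in one direction, wrap at the boundary, stop at the next row whose count is below the preset value) comes from the description above.

```python
# Sketch of the LCER pointer scan over the correlation-table count fields.

def advance_pointer(counts, ptr, threshold):
    """Return the index of the next replaceable row after ptr (wrapping)."""
    n = len(counts)
    for step in range(1, n + 1):
        cand = (ptr + step) % n        # single-direction move with wrap
        if counts[cand] < threshold:
            return cand
    return None                        # no replaceable row exists

counts70 = [3, 0, 5, 1, 2]             # per-row correlation counts (example)
assert advance_pointer(counts70, 0, 2) == 1
assert advance_pointer(counts70, 1, 2) == 3   # ER: skips the row just used
assert advance_pointer(counts70, 3, 1) == 1   # wraps past the boundary
```

Because the scan always resumes after the last replaced row, the most recently replaced block is visited last, which is what realizes the earliest-replacement property.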
Replacement can also proceed along program order, forward or backward. For example, when a level-one cache block is replaced, the cache block pointed to by the level-one cache block number BN1X in the end entry of its track may also be replaced: this is forward (sequential) replacement. Or, when a level-one cache block is replaced, the level-one cache block number BN1X in the field 72 of its correlation-table row corresponding to the sequentially preceding cache block may also be replaced: this is backward replacement. Replacement may even proceed both forward and backward from one level-one cache block, and may continue in either direction until a level-one cache block is encountered whose count value 70 in correlation table 37 exceeds the preset value. This scheme replaces multiple level-one cache blocks at a time. The single-block or multi-block method can be chosen as needed, and the methods can be mixed: for example, use single-block replacement normally, and multi-block replacement when the lower-level cache lacks replaceable cache blocks.
Level-two cache replacement is also based on the LCER policy. Besides setting the corresponding field 77 of level-two active list 40 to '0' and incrementing count value 75 when a level-one cache block is replaced, as above, when a cache block is stored from level-two cache memory 42 into level-one cache memory 22 the corresponding valid bit 77 of the corresponding entry of level-two active list 40 is set to '1' and the level-one cache block number BN1X is written into the corresponding field 76. Each time a BN2X obtained by matching a branch target address or the like is stored into track table 20, the count value 75 corresponding to that BN2X in level-two active list 40 is incremented by '1'; each time a BN2X in a track table entry is replaced by a BN1X, the count value 75 corresponding to that BN2X in level-two active list 40 is decremented by '1'. Count value 75 thus records how many times a level-two cache block serves as a branch target; the valid bits 77 of an entry each record whether a portion of that level-two cache block has been stored into the level-one cache; and the fields 76 record the block addresses 68 of the corresponding level-one cache blocks. Level-two replacement moves the shared level-two pointer 78 in a single direction, stopping at the next replaceable level-two cache block.
A replaceable level-two cache block may be defined as one whose entry in level-two active list 40 has count value 75 and all fields 77 equal to '0': a level-two cache block may be replaced when it is unrelated to all instructions in level-one cache 22, and the one-way movement of pointer 78 guarantees the ER.
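The replaceability condition just defined is simple enough to state directly in code. This is a sketch of the test only; the function name and list representation of the valid bits are assumptions.

```python
# Sketch of the level-two replaceability test: the active-list entry must
# have count 75 == 0 (no track entry targets this L2 block) and every
# valid bit 77 == 0 (no part of it resides in the level-one cache).

def l2_replaceable(count75, valid77):
    return count75 == 0 and all(v == 0 for v in valid77)

assert l2_replaceable(0, [0, 0, 0, 0]) is True
assert l2_replaceable(0, [0, 1, 0, 0]) is False   # a part is in L1
assert l2_replaceable(2, [0, 0, 0, 0]) is False   # still a branch target
```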
Level-three cache replacement is likewise based on the LCER policy. When a cache block is stored from level-three cache memory 52 into level-two cache memory 42, the corresponding valid bit 81 of the corresponding entry of level-three active list 50 is set to '1' and the level-two cache block number BN2X is written into the corresponding field 80. The count value 79 in level-three active list 50 entries is not used in this embodiment. The level-three cache is organized set-associatively: each set (one index address) has several ways, and the ways of a set share one pointer 82. Pointer 82 likewise seeks the next replaceable way; here a replaceable way may be one in which all fields 81 are '0', meaning the level-three cache block is unrelated to the instructions in level-two cache 42 and may therefore be replaced. The above method of using a pointer to ensure that a just-replaced cache block is not replaced again immediately may also be substituted by other methods.
In this embodiment the third-level cache is organized set-associatively. If no way in a set is replaceable (every way of the third-level active table 50 has at least one field 81 set to '1'), the way whose fields 81 contain the fewest '1's may be selected and a compound replacement performed on its first-level cache blocks. For example, if only one field 81 of a way is '1', then only one of the four second-level instruction blocks that the third-level cache block can hold resides in the second-level cache memory 42. The BN2X in the field 80 corresponding to that field 81 is then output to address the second-level active table 40, from which the BN1X number in the first valid field 76 in address order (whose field 77 is '1') is read, and the number N of first-level cache blocks from that first-level cache block up to the last valid first-level cache block in the second-level cache block is computed. That BN1X number and the count N are sent to the first-level cache replacement logic, which replaces N first-level cache blocks starting from the block pointed to by BN1X, together with any cache blocks that target those blocks; the second-level cache block can then be replaced. Afterward all fields 81 of that way in the third-level active table 50 are '0', and the corresponding third-level cache block may be replaced. If the first-level cache blocks contained in the third-level cache block are not contiguous, a plurality of starting points and a plurality of corresponding N values are set as described above and sent to the first-level cache replacement logic for replacement in turn.
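The compound-replacement selection described above can be sketched as follows (hypothetical Python; the helper names and list encodings are illustrative, not from the disclosure): `pick_least_valid_way` chooses the way with the fewest field-81 bits set, and `contiguous_runs` derives the starting points and N values handed to the first-level cache replacement logic.

```python
def pick_least_valid_way(valid_bits_81):
    """Index of the way whose field-81 bits contain the fewest '1's."""
    return min(range(len(valid_bits_81)), key=lambda w: sum(valid_bits_81[w]))

def contiguous_runs(bn1x_valid):
    """Given per-sub-block validity (field 77) in address order, return
    (start, N) pairs, one per contiguous run of valid first-level cache
    blocks; several pairs model the non-contiguous case in the text."""
    runs, start = [], None
    for i, v in enumerate(list(bn1x_valid) + [0]):  # sentinel closes last run
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i - start))
            start = None
    return runs
```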
In the embodiment of FIG. 7, the count values at each level, namely 79 in the third-level active table 50, 75 in the second-level active table 40, and 70 in the (first-level) correlation table 37, record the degree of correlation of a cache block within the same storage level. The valid bits at each level that has a higher storage level above it record the degree of correlation of a cache block with that higher level; for example, field 81 in the third-level active table 50 records correlation with second-level cache blocks, and field 77 in the second-level active table 40 records correlation with first-level cache blocks. Field 73 in the correlation table 37 records the branch source addresses that jump to a first-level cache block. Accordingly, the BN2X address 71 of the cache block held in table 37 can be written in place of the BN1X address of that cache block in each entry pointed to by those branch source addresses in the track table 20, preserving the integrity of the control-flow information; the cache block can then be replaced. As an alternative replacement policy, a cache block whose degree of correlation is '0' may be selected for replacement. In essence, the cache system of the present invention operates on control-flow information, so the basic principle of cache replacement is that the integrity of the control-flow information must not be compromised.
Please refer to FIG. 8, which is another embodiment of the processor system of the present invention. FIG. 8 is an improvement on the embodiment of FIG. 5, in which the third-level active table 50, the third-level cache TLB and tag unit 51, the third-level cache memory 52, the selector 54, the second-level active table 40, the second-level cache memory 42, the track table 20, the first-level cache correlation table 37, the first-level cache memory 22, the instruction read buffer 39, the trackers 47 and 48, and the processor core 23 function the same as the identically numbered modules in the embodiment of FIG. 4. The second-level scanner 43 (which can generate branch types) is connected to the bus from the third-level cache 52 to the second-level cache 42; it is the only scanner in this embodiment. In addition, a second-level track table 88 is added. The caches in the embodiment of FIG. 8 are organized in the same manner as in the embodiment of FIG. 5.
Each track in the second-level track table 88 corresponds to one second-level cache block in the second-level cache 42. Each second-level track contains four first-level tracks, each corresponding to one first-level instruction block within the second-level cache block. The first-level tracks in the second-level track table 88 also use the format of SBNY 15, type 11, BNX 12, and BNY 13 of FIG. 1; the address format may be BN3 or BN2. The scanner 43 examines each second-level cache block sent from the third-level cache memory 52 to the second-level cache memory 42 for storage, and computes the branch target address of each branch instruction in it. The branch target address is selected by the selector 54 and sent to the TLB/tag unit 51 to be matched into a BN3 address; the BN3 address then addresses the third-level active table 50 to check whether the entry is valid (i.e., whether the corresponding cache block is already stored in the second-level cache memory 42). If valid, the BN2X address in the entry is concatenated with the BNY of the BN3 address to form a BN2 address, which together with the SBNY 15 and type 11 produced by the scanner is stored in the entry of the second-level track table 88 corresponding to that branch instruction; if invalid, the BN3 address together with SBNY 15 and type 11 is stored directly in the entry of table 88.
When one first-level instruction block within a second-level cache block of the second-level cache memory 42 is stored into a first-level cache block of the first-level cache memory 22, the second-level track table 88 outputs the corresponding first-level track over bus 89 for storage into the track table 20. If an entry on that track holds an address in BN3 format, the third-level active table 50 is addressed with it. If the entry's valid bit 81 is invalid, the second-level cache block is moved from the third-level cache 52 into a second-level cache block of the second-level cache 42 as described above, and that second-level cache block number is concatenated with the second-level sub-address 64 of the BN3 address to form a BN2X address, which is stored into field 80 of the third-level active table 50. If the entry is valid, the BN2X in the entry is stored into the second-level track table 88 in place of the original BN3X address; the BN2X is also bypassed onto bus 89 for storage into the track table 20. This embodiment uses the count value 79 in the third-level active table 50. Similar to the use of the count value 75 in the second-level active table in the embodiment of FIG. 6, when a BN3 address is written into the second-level track table 88, the count value 79 in the corresponding entry of the third-level active table 50 is incremented; when a BN3 address output by the second-level track table 88 is mapped to a BN2 address in the third-level active table 50, the corresponding count value 79 is decremented. During third-level cache replacement, not only the value of each valid bit 81 but also the count value 79 must be checked.
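The count-79 bookkeeping described above can be sketched as follows (hypothetical Python; the class mirrors one entry of the third-level active table 50 but is not the disclosed hardware):

```python
class L3ActiveEntry:
    """One entry of the third-level active table: valid bit 81,
    BN2X field 80, and count value 79."""
    def __init__(self):
        self.valid_81 = False
        self.bn2x_80 = None
        self.count_79 = 0

def on_bn3_written_to_table_88(entry):
    # A BN3 address was written into the second-level track table 88.
    entry.count_79 += 1

def map_bn3_to_bn2(entry):
    """Map a BN3 address read out of table 88 to its BN2 address,
    releasing one reference counted by field 79."""
    assert entry.valid_81 and entry.bn2x_80 is not None
    entry.count_79 -= 1
    return entry.bn2x_80
```

A third-level block is then withheld from replacement while either a set valid bit 81 or a nonzero count 79 indicates that it is still referenced.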
The BN2 address on bus 89 is also used to address the second-level active table 40. If the valid bit 77 of the entry in table 40 is invalid, the BN2 address is stored into the entry of the track table 20; if valid bit 77 is valid, the BN1X address in the table-40 entry is concatenated with the BNY of the BN2 address and stored into the entry of the track table 20. When a BN2 address is output from the track table 20 over bus 29, it is used to address the second-level active table 40. If the valid bit 77 of the entry is invalid, the BN2 address is used to access the second-level cache memory 42; a first-level cache block is read out and stored into a first-level cache block of the first-level cache memory 22, the first-level cache block number BN1X is stored into field 76 of the second-level active table 40, and the BN1X is stored into the track table 20; the BN1X may also be bypassed onto bus 29 for use by the tracker. In this embodiment, the addresses of track entries in the second-level track table 88 may be in BN3 or BN2 format, and the addresses of track entries in the track table 20 may be in BN2 or BN1 format. Under an alternative policy, only BN1 addresses are filled into the track table 20: if the address on bus 89 is in BN2 format and the valid bit 77 of the addressed entry of the second-level active table 40 is invalid, the BN2 address is used to access the second-level cache memory 42, a first-level cache block is read out and stored into a first-level cache block of the first-level cache memory 22, the first-level cache block number BN1X is stored into field 76 of the second-level active table 40 with the corresponding field 77 set valid, and the BN1X is stored into the track table 20 (and may also be bypassed onto bus 29 for the tracker); if bit 77 in table 40 is valid, the BN1X in field 76 of the entry directly fills the track table 20 and is bypassed onto bus 29 for use.
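The alternative BN1-only policy above can be sketched as follows (hypothetical Python; the dictionary stands in for the second-level active table 40 and `load_block` for the hardware that copies a block into the first-level cache):

```python
def resolve_to_bn1(bn2x, bny, l2_active_table, load_block):
    """Return the BN1 address (BN1X, BNY) for a BN2 address, loading the
    block into the first-level cache on a first miss."""
    valid_77, bn1x = l2_active_table[bn2x]
    if not valid_77:
        bn1x = load_block(bn2x)               # read block from L2 into L1
        l2_active_table[bn2x] = (True, bn1x)  # field 76 = BN1X, field 77 valid
    return (bn1x, bny)
```

On every later resolution of the same BN2X, field 77 is already valid, so the BN1X in field 76 is used directly, matching the last case in the text.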
Please refer to FIG. 9, which is an embodiment of the indirect branch target address generator of the processor system of the present invention. An indirect branch target address is generally obtained by adding a base address stored in the register file within the processor core to the branch offset contained in the indirect branch instruction. In FIG. 9, 93 is an adder, 39 is the IRB, 95 is a plurality of registers with comparators, and 96 is a plurality of registers; 95 and 96 stand in a CAM-RAM relationship, in one-to-one correspondence. 98 is a selector. In addition, 15, 11, 12, and 13 are the contents of the entries output by the track table 20 over bus 29. The system allocates one group of registers 95 and 96 for each indirect branch instruction, while the adder 93 and the IRB 39 are shared by all indirect branch instructions. In the track-table-20 entry of an indirect branch instruction, field 15 (SBNY) and field 11 (type) are defined as in FIG. 1; field 12, however, is instead used to store a register file (RF) address, and field 13 to store the group number of registers 95, 96. When the scanner 43 decodes a scanned instruction as an indirect branch instruction, fields 15 and 11 of the track table entry are produced as described above, the base-address register file number in the instruction is placed in field 12, and field 13 is set 'invalid'. When an entry corresponding to an indirect branch instruction is output from the track table 20 over the bus for the first time, its 'invalid' field 13 causes the system to allocate it a group of registers 95, 96 (a group contains a plurality of CAM-RAM rows), and the group number of that register group is stored into field 13 of the track table entry. Field 15 of the track table entry addresses the IRB 39, from which the branch offset of the indirect branch instruction is read and sent to one input of the adder 93. The register file is addressed with field 12 of the track table entry to read the base address; or, as shown in FIG. 9, the write address of the register file is monitored, and when that write address equals the address in field 12 of the track table entry, bus 94, which writes the execution result from the execution unit in the processor core back to the register file, is connected to the other input of the adder 93. The output 46 of the adder 93 is the branch target address, which is sent to the TLB/tag unit 51 for matching. At the same time, the base address on bus 94 is stored into an available row of the registers 95 in the register group pointed to by field 13 of the track table entry; the BN1 address resulting from matching the branch target instruction is stored over bus 89 into the same row of the registers 96 of the group pointed to by field 13.
When field 13 is 'invalid', or when it is 'valid' but the base address on bus 94 does not match the contents of registers 95, the selector 98 selects the BN1 address on bus 89 for output over bus 99. When the type of the entry on bus 29 is an indirect branch instruction, the address on bus 99 is used by the tracker 47; when the entry type is any other type, the address on bus 29 is selected for the tracker 47. The next time the same indirect branch instruction is executed, the register group number in field 13 of the track table entry on bus 29 selects the corresponding register groups 95 and 96, and the register file address in field 12 selects the data on bus 94 being written back to that register file entry for comparison with the contents of registers 95. On a match, the BN1 address in the corresponding row of registers 96 is output over bus 97 and selected by the selector 98 for use by the tracker; on a mismatch, the adder 93 computes the indirect branch target address as described above, the address is matched into a BN1 address placed on bus 89, and the selector 98 selects the address output on bus 89. A mismatch also causes the base address on bus 94 and the BN1 address on bus 89 to be stored into an unused row of registers 95, 96. Replacement logic is responsible for allocating register groups 95, 96 to entries of indirect branch type on bus 29 whose field 13 is 'invalid'; the policy may be LRU or similar. In this way, this embodiment can map the base address of an indirect branch instruction directly to a first-level cache address BN1, eliminating the steps of address calculation and address mapping.
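One register group 95/96 thus behaves like a small CAM-RAM keyed on the base address; a hypothetical Python sketch (class and method names are illustrative, not from the disclosure):

```python
class IndirectBranchCache:
    """Registers 95 (keys, with comparators) and 96 (values) of one group."""
    def __init__(self, rows=4):
        self.keys_95 = [None] * rows
        self.vals_96 = [None] * rows
        self.next_row = 0  # simple rotating fill; the text permits LRU etc.

    def lookup(self, base_address):
        """Return the cached BN1 address on a match (output on bus 97),
        else None (fall back to adder 93 and TLB/tag matching)."""
        for key, val in zip(self.keys_95, self.vals_96):
            if key == base_address:
                return val
        return None

    def fill(self, base_address, bn1):
        """On a mismatch, store the base address and its BN1 address."""
        row = self.next_row % len(self.keys_95)
        self.keys_95[row], self.vals_96[row] = base_address, bn1
        self.next_row += 1
```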
Please refer to FIG. 10, which is a schematic diagram of the pipeline structure of the processor core in the processor system of the present invention. 100 is the typical pipeline of a conventional computer or processor core, divided into I, D, E, M, and W stages, where I is the instruction fetch stage, D the instruction decode stage, E the execution stage, M the data access stage, and W the register write-back stage. 101 is the pipeline of the processor core in the present invention, which has one stage fewer than 100: the I stage. A conventional processor core generates an instruction address that is sent to a memory or cache in order to read (pull) instructions. The cache system of the present invention pushes instructions to the processor core automatically; the processor core need only provide a branch decision 31 to determine the program flow and a pipeline stall signal 32 to synchronize the cache system with the processor core. The pipeline of a processor core using the cache system of the present invention therefore differs from a conventional pipeline in that no instruction fetch stage is needed. Furthermore, a processor core using the cache system of the present invention does not need to maintain an instruction address (Program Counter, PC). As described with FIG. 9, the indirect branch target address is generated from the base address in the register file, so no PC address is needed. All other instructions are likewise accessed through the BN addresses of the cache system, without a PC. Hence no PC need be maintained in a processor core using the cache system of the present invention.
Please refer to FIG. 11, which is another embodiment of the processor system of the present invention. FIG. 11 is an improvement on the embodiment of FIG. 8, in which the third-level active table 50, the third-level cache TLB and tag unit 51, the third-level cache memory 52, the selector 54, the scanner 43, the second-level track table 88, the second-level active table 40, the second-level cache memory 42, the track table 20, the first-level cache correlation table 37, the first-level cache memory 22, the instruction read buffer 39, the trackers 47 and 48, and the processor core 23 function the same as the identically numbered modules in the embodiment of FIG. 8. A second-level correlation table 103 is added, as well as 102, the indirect branch target address generator shown in the embodiment of FIG. 9. The caches in the embodiment of FIG. 11 are organized in the same manner as in the embodiments of FIG. 5 and FIG. 8.
The second-level correlation table 103 is similar in structure to the correlation table 37. For each second-level cache block it holds a count value, the third-level cache address corresponding to that second-level cache block, and the source addresses (with their valid signals) of the branch source instructions whose branch targets lie in this second-level cache block (compare the CT format in FIG. 7). As in the correlation table, the count value is the number of branch source instructions. When the scanner 43 generates the track corresponding to a second-level cache block and fills it into the second-level track table 88, the BN2-format branch target address in each filled track entry addresses a row of the second-level correlation table 103 (hereinafter the target row): the second-level cache address of the track being filled into the second-level track table 88 (hereinafter the source track) is filled into a source-address field of the target row, its valid signal is set 'valid', and the target row's count is incremented by '1'. The third-level cache address corresponding to the source track is also filled into the row of the second-level correlation table 103 corresponding to the source track. In addition, when the address in an entry filled into the second-level track table 88 is in BN3 format, the entry of the third-level active table 50 addressed by that BN3 address has its count value 79 incremented by '1'.
When the address format of an entry on output 29 of the track table 20 is BN2, it is used to address the second-level active table 40. If the corresponding entry is invalid, the instruction block must be read from the second-level cache memory 42 with that BN2 address (hereinafter the source BN2 address) and filled into the first-level cache block of the first-level cache 22 designated by the replacement logic. At this time the source BN2 address addresses the second-level track table 88, which outputs the corresponding track for storage into the track table 20. When output 89 of table 88 carries a BN3-format address (hereinafter the target BN3 address), that target BN3 address is sent to the third-level active table 50 to be mapped to a BN2 address (hereinafter the target BN2 address); at this time the count value in the third-level active table entry pointed to by the target BN3 is decremented by '1', while the count in the target row of the second-level correlation table 103 pointed to by the target BN2 address is incremented by '1'. The target BN3 address is stored into the same target row, and the source BN2 address is also stored into the same target row with its corresponding valid bit set 'valid'.
When a second-level cache block is to be replaced, the second-level pointer 78 points to the target row in the second-level correlation table 103 corresponding to that replaceable second-level cache block. Each valid BN2 source address is read from that row, and with each such BN2 source address the second-level track table 88 is addressed; the BN2 target address in the corresponding entry (which points to the said target row) is replaced with the BN3 target address held in the target row of table 103, and the valid bits of the BN2 source addresses in the target row of table 103 are set 'invalid'. At this time the count in the target row of table 103 is decreased by the number of valid BN2 source addresses, and the entry of the third-level active table 50 addressed by the BN3 target address has its count value 79 increased by the same amount by which the count in table 103 was decreased.
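The bookkeeping when a second-level cache block is replaced can be sketched as follows (hypothetical Python; plain dictionaries stand in for the row of table 103, the entry of table 50, and the track table 88):

```python
def replace_l2_block(target_row, l3_entry, l2_track_table_88):
    """Rewrite every valid BN2 source entry in table 88 back to the saved
    BN3 target, invalidate those sources, and move the reference count
    from the row of table 103 to count 79 of the third-level active table."""
    moved = 0
    for i, (src, valid) in enumerate(target_row['sources']):
        if valid:
            l2_track_table_88[src] = target_row['bn3']  # BN2 -> BN3
            target_row['sources'][i] = (src, False)
            moved += 1
    target_row['count'] -= moved
    l3_entry['count_79'] += moved
```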
The cache replacement methods described above are all based on an inclusive cache, i.e., the contents of a higher cache level must also reside in the lower cache levels. The least-correlated cache replacement method can also be applied to a non-inclusive (non-exclusive) cache. A lock signal bit may be added to the correlation-table entry corresponding to a higher-level cache block. When the lock bit is '0', operation is as described above; when the lock bit is '1', the corresponding cache block may be replaced only when its degree of correlation is '0', i.e., when no branch instruction targets the cache block (here the end entry of the sequentially preceding instruction block is also regarded as holding an unconditional branch instruction). In the correlation table 37, this means a first-level cache block whose lock bit is '1' may be replaced only when its corresponding count value 70 is '0' and all its valid bits 73 are '0'. In the second-level correlation table 103, a second-level cache block whose lock bit is '1' may be replaced only when its corresponding count value and all its valid bits are '0'.
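The lock-bit eligibility rule for the correlation table 37 can be sketched as follows (hypothetical Python; `ordinary_policy` stands for the unlocked behaviour described earlier and is an assumption of this sketch):

```python
def l1_block_replaceable(lock_bit, count_70, valid_bits_73, ordinary_policy):
    """A locked first-level block may be replaced only at correlation
    degree '0': count value 70 is 0 and every valid bit 73 is 0."""
    if lock_bit:
        return count_70 == 0 and all(b == 0 for b in valid_bits_73)
    return ordinary_policy()
```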
For example, when the third-level cache is to replace the third-level cache block of one way of a set, the BN3 address on the third-level pointer 83 can address the entry in the third-level active table 50, and all valid BN2 addresses in it can address the rows of the second-level correlation table 103 and set their lock signals to '1'. The third-level cache block can then be replaced, after which the cache operates in a non-inclusive state. For a second-level cache block whose lock signal is set to '1', the corresponding third-level cache block has already been replaced, so the integrity of the control-flow information can no longer be preserved by replacing the BN2 address in an entry of the second-level track table 88 with the corresponding BN3 address; such a second-level cache block may be replaced only once its degree of correlation reaches '0'.
If every higher-level cache is assumed to have a lock signal of '1', so that a higher-level cache block may be replaced only when its degree of correlation is '0', and a third-level cache block is made replaceable when, in the active-table entry corresponding to that cache block, the valid bits of all higher-level sub-cache blocks (e.g., 81 in the third-level active table 50) are '1' and the count value in the entry (e.g., 79 in table 50) is '0', then the cache is organized exclusively. The replacement policy may also be set so that cache blocks at all cache levels are replaced when their degree of correlation is '0'.
In FIG. 11, 102 is the indirect branch target address generator of the embodiment of FIG. 9. Controlled by the entries on bus 29 output by the track table 20, it obtains the base address 94 from the processor core 23 and generates the indirect branch target address 46, which is sent via the selector 54 to unit 51 for virtual-to-physical address translation and address mapping, outputting the BN1 branch target address 99 for use by the tracker 47. When the type of the entry on bus 29 is an indirect branch instruction, the tracker 47 selects the address 99 output by 102; when the entry type is any other instruction, the tracker 47 selects the address on bus 29 output by the track table 20. As the embodiment of FIG. 11 shows, all instructions are pushed by the cache system to the processor core 23; the processor core 23 provides the cache system only with the branch decision 31 and the base address 94 for indirect branches. The indirect branch target address generator 102 can also be applied to the embodiments of FIG. 4, FIG. 5, and FIG. 8 so that all instructions in them are pushed by the cache system to the processor.
The methods of the embodiments of FIG. 4, FIG. 5, FIG. 8, and FIG. 11 can further be applied to controlling the addressing of memory. Please refer to FIG. 12, which is an embodiment of the processor/memory system of the present invention. The embodiment of FIG. 12 applies the method, on the basis of the embodiment of FIG. 11, to memory outside the processor; the other embodiments may be extended by analogy. Below the dashed line in FIG. 12 are the functional blocks and connections within the processor, which are exactly the same as in the embodiment of FIG. 11 except that there is no third-level cache memory 52: the third-level active table 50, the third-level cache TLB and tag unit 51, the selector 54, the scanner 43, the second-level track table 88, the second-level active table 40, the second-level cache memory 42, the second-level correlation table 103, the indirect branch target address generator 102, the track table 20, the first-level cache correlation table 37, the first-level cache memory 22, the instruction read buffer 39, the trackers 47 and 48, and the processor core 23 function the same as the identically numbered modules in the embodiment of FIG. 11. Above the dashed line in FIG. 12, a memory 111 with its address bus 113 and a memory 112 with its address bus 114 are added; bus 115 sends the information blocks output by memory 112 to the second-level cache memory 42 in the processor below the dashed line for storage, and the instructions in that information are also scanned by the scanner 43, with branch instruction information extracted as in the previous embodiments. Memory 111 is organized as memory and is addressed by the memory address 113 that failed to match in the TAG of unit 51 (its source being the physical address obtained by mapping, through the TLB in 51, the virtual memory address generated by 102 or 43). Memory 112 is organized as a cache and is addressed by the third-level cache address 114, produced either by a match in the TAG of 51 or output by the second-level track table 88 over 89. In effect, the memory 112 outside the processor serves as the third-level cache memory in place of 52 in the embodiment of FIG. 11, while memory 111 is the lower-level memory described but not shown in FIGS. 4, 5, 8, and 11. Thus, compared with the embodiment of FIG. 11, the embodiment of FIG. 12 merely moves the last-level (third-level) cache memory (52 in FIG. 11) outside the processor (112 in FIG. 12); the two embodiments are logically equivalent. The caches in the embodiment of FIG. 12 (including the memory 112 serving as the third-level cache memory) are organized in the same manner as in the embodiment of FIG. 11.
The structure of the embodiment of FIG. 12 admits several different applications. In the first application, the memory 111 is a large-capacity memory with a large access latency, while the memory 112 is a smaller memory with a smaller access latency; that is, the memory 112 serves as a cache of the memory 111. Either memory may be built from any suitable storage device, such as a register or register file, static memory (SRAM), dynamic memory (DRAM), flash memory, hard disk (HD), solid-state drive (SSD), or any other suitable storage device including future forms of memory. The operation of this application is the same as in the embodiment of FIG. 11. That is, the scanner 43 scans the instruction blocks sent from the memory 112 to the second-level cache memory 42 via the bus 115 and computes the virtual branch target addresses of the direct branch instructions in them; each virtual branch target address is sent to the selector 54 (102 also generates the virtual branch target addresses of indirect branch instructions, which are sent to 54 via the bus 46). The address selected by 54 is mapped by the TLB in 51 to a physical address, which is then matched against the TAG in 51. If there is no match, the physical address is sent via the address bus 113 to the memory 111, and the corresponding instruction block is read out and stored into the memory 112 in the replaceable level-three cache block indicated by the aforementioned level-three cache replacement logic; that level-three cache block number is merged with the low-order address output by the selector 54 into a BN3 address, which is stored in the secondary track table 88. If there is a match, then as described in the previous embodiments, the way number obtained from the match is joined with the index address output by the selector 54, and so on, into a BN3 address used to address the level-three track table 50, and the BN2 address read out is stored in the secondary track table 88; if the entry in 50 is 'invalid', the BN3 address itself is stored in 88. The remaining operations are the same as in that embodiment and are not repeated here.
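The match-or-fetch flow just described can be illustrated with a minimal software model. The 4-way organization, the field values, and the round-robin stand-in for the replacement logic are illustrative assumptions only; the patent does not fix them.

```python
# Sketch of the TAG match in 51 and the formation of a BN3 address.
# 4 ways and round-robin victim selection are assumptions for illustration.

WAYS = 4

class Level3Tag:
    def __init__(self, sets):
        # tag_array[set][way] holds a stored physical tag, or None if empty
        self.tag_array = [[None] * WAYS for _ in range(sets)]
        self.next_victim = [0] * sets   # stand-in for the replacement logic

    def lookup(self, phys_tag, index):
        """Return (hit, way) for a physical address split into tag/index."""
        for way in range(WAYS):
            if self.tag_array[index][way] == phys_tag:
                return True, way
        return False, None

    def fill(self, phys_tag, index):
        """On a miss: pick a replaceable way, install the tag, return the way."""
        way = self.next_victim[index]
        self.next_victim[index] = (way + 1) % WAYS
        self.tag_array[index][way] = phys_tag
        return way

def make_bn3(way, index, low_bits):
    # BN3 = way number (65) joined with index (62) and the low-order address
    return (way, index, low_bits)
```

On a miss, `fill` models fetching the block from the memory 111 into the way chosen by the replacement logic; in both cases the resulting way number and index form the BN3 address stored into the secondary track table 88.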
A specific embodiment of the first application uses flash memory as the memory 111 and DRAM as the memory 112. Flash memory offers large capacity at low cost, but has a large access latency and a limited number of write cycles. DRAM offers smaller capacity at higher cost, but has a small access latency and an unlimited number of write cycles. The structure of the embodiment of FIG. 12 therefore exploits the respective strengths of flash and DRAM while masking their respective weaknesses. In this first application, 111 and 112 together serve as the main memory of the computer system; below 111 there are still lower storage levels such as a hard disk. The first application is suitable for existing computer systems and can use existing operating systems. In existing computers, memory is managed by the storage manager in the operating system, which records which memory is in use and which is free, allocates memory to a process when needed, and releases it after the process is done. Because this storage management is performed in software, its execution efficiency is relatively low.
The second application of the embodiment of FIG. 12 uses a non-volatile memory (such as a hard disk, solid-state drive, or flash memory) as the memory 111, and a volatile or non-volatile memory as the memory 112. In this second application, 111 serves as the hard disk of the computer, while 112 serves as the computer's main memory; but because 112 is organized as a cache, its storage management can be performed by the processor's hardware. In such a system structure, the storage manager in the operating system is used little or not at all for instructions. The instructions in the memory 111 are stored into the memory 112 block by block as described above; in one specific embodiment, each such instruction block may be a page of virtual memory, in which case each tag in the tag unit TAG in 51 may represent one page.
Assume the addresses in this specific embodiment have the format shown in FIG. 6. The memory 111 (hard disk) address 113 is divided into a tag 61, an index 62, a secondary sub-address 63, a primary sub-address 64, and an intra-block offset (BNY) 13. The memory 111 (hard disk) address in this example may have a larger address space than an ordinary main-memory address, so that it can address the entire hard disk; 63, 64, and 13 joined together form the offset within a page, and 61 and 62 joined together form the page number. The address BN3 of the memory 112 (the main memory, i.e., the level-three cache of the preceding embodiments) consists of a way number 65, the index 62, the secondary sub-address 63, the primary sub-address 64, and the intra-block offset (BNY) 13. The way number 65 joined with the index 62 forms the block address of the main memory 112, one block being one page as above; 65, 62, and 63 together address one second-level instruction block within a main-memory instruction block (page); and the fields other than the intra-block offset 13 are collectively called BN3X, which addresses one first-level instruction block within a main-memory instruction block (page). The address BN2 of the second-level cache consists of a second-level cache block number 67, the primary sub-address 64, and the intra-block offset (BNY) 13; the second-level cache block number 67 addresses one second-level cache block, and the fields other than the intra-block offset 13 are collectively called BN2X, which addresses one first-level instruction block within a second-level cache block. The address BN1 of the first-level cache consists of a first-level cache block number 68 (BN1X) and the intra-block offset (BNY) 13. The intra-block offset (BNY) 13 is the same in all four address formats, and the BNY portion does not change during address conversion. In the BN2 address format, the second-level block number 67 points to a second-level cache block, and the primary sub-address 64 points to one of the four first-level instruction blocks in that second-level cache block. Likewise, in the BN3 address format, the way number 65 and the index 62 point to a main-memory instruction block, the secondary sub-address 63 points to one of the several second-level instruction blocks in that main-memory instruction block, and the primary sub-address 64 points to one of the several first-level instruction blocks in the selected second-level instruction block.
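The field layout above can be sketched in software. The field widths used here (4-bit tag, 2 bits for each remaining field) are arbitrary assumptions chosen to keep the example small; the point is that the joined fields form each BN address and that the BNY 13 passes through every conversion unchanged.

```python
# Illustrative bit-level decomposition of the memory (hard disk) address
# into the fields 61, 62, 63, 64 and 13, with assumed field widths.

def split_memory_address(addr, tag_bits=4, index_bits=2, sub2_bits=2,
                         sub1_bits=2, bny_bits=2):
    """Split an address into tag 61, index 62, secondary sub-address 63,
    primary sub-address 64 and intra-block offset BNY 13."""
    bny  = addr & ((1 << bny_bits) - 1);   addr >>= bny_bits
    sub1 = addr & ((1 << sub1_bits) - 1);  addr >>= sub1_bits
    sub2 = addr & ((1 << sub2_bits) - 1);  addr >>= sub2_bits
    index = addr & ((1 << index_bits) - 1); addr >>= index_bits
    tag = addr & ((1 << tag_bits) - 1)
    return {"tag": tag, "index": index, "sub2": sub2, "sub1": sub1, "bny": bny}

def bn3_from_fields(way, f):
    # BN3 = way 65 | index 62 | sub2 63 | sub1 64 | BNY 13
    return (way, f["index"], f["sub2"], f["sub1"], f["bny"])

def bn2_from(block2, f):
    # BN2 = second-level cache block number 67 | sub1 64 | BNY 13;
    # note the BNY field is carried over unchanged
    return (block2, f["sub1"], f["bny"])
```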
When the operating system directs the processor of FIG. 12 to begin executing a new thread, the starting address of the new thread (in the memory 111 address format) is sent through the selector 54 (assuming, in this specific embodiment, that the selector 54 has a third input through which the starting address enters) to 51. The index 62 of the starting address addresses the tag unit TAG in 51, and the tag contents read out from each way are matched against the tag 61 of the starting address. If there is no match, fields 61 and 62 of the starting address address the memory 111 via the bus 113; the corresponding page (instruction block) is read out and stored into the memory 112, in the set indicated by the index 62 of the starting address, in the way specified by the way number 65 given by the main-memory (i.e., the level-three cache of the preceding embodiments) replacement logic; at the same time, fields 61 and 62 of the starting address are stored into the same way and the same set of the tag unit in 51.
Thereafter, or when the field 61 of the starting address matches the tag content in the tag unit, the system controller uses the above way number 65, the index 62 of the starting address, and the secondary sub-address 63 to read one second-level instruction block from the memory 112 (main memory) and store it into the second-level cache memory 42, in the second-level cache block specified by the second-level block number 67 given by the second-level cache replacement logic; that second-level block number 67 is stored into the entry 80 of the level-three active table 50 pointed to by the above 65, 62, and 63, and the valid bit 81 in that entry is set to 'valid'. The scanner 43 scans the second-level instruction block, extracts the branch instruction information in it, and generates a track that is stored in the secondary track table 88. The system controller then further joins the above second-level block number 67 with the primary sub-address 64 of the starting address to read one first-level instruction block from 42 and store it into the first-level cache memory 22, in the first-level cache block specified by the first-level block number 68 given by the first-level cache replacement logic. The corresponding track in the secondary track table 88 is also stored into the track table 20, and during this process the BN3-format addresses on the track are replaced by BN2 as described above; the first-level block number 68 is also stored into the entry 76 of the second-level active table 40 pointed to by the above 67 and 64, and the valid bit 77 in that entry is set to 'valid'. Finally, the system controller joins the above first-level block number 68 with the first-level intra-block offset BNY 13 of the starting address to form a BN1 address, which is placed into the register 26 in the tracker 47, so that the read pointer 28 points to the starting instruction of the thread in the first-level cache memory 22 and also to the corresponding entry in the track table 20. The subsequent push operations toward the processor core are similar to those of the preceding embodiments. In summary, the starting address of a new thread injected by the operating system, or a hard-disk address generated by the scanner 43 or the indirect branch address generator 102, is selected by the selector 54 and sent to the tag unit in 51 for matching. When the match succeeds, the resulting BN3 address addresses the level-three active table 50. If the entry output by 50 is 'valid', the BN2 in that entry addresses the second-level active table 40. If the entry output by 50 is 'invalid', the above BN3 address directly addresses the memory 112 (main memory), which outputs a second-level instruction block to the second-level cache memory 42. When the hard-disk address fails to match in the tag unit in 51, the memory 111 (hard disk) is addressed via the bus 113, and the corresponding instruction block (page) is read out and stored into the memory 112 (main memory), in the main-memory cache block designated by the cache replacement logic, overwriting the instruction block previously held in that cache block. This replacement from hard disk to main memory is controlled entirely by hardware and requires essentially no software operation. The replacement logic may use any of various algorithms such as LRU, NRU (not recently used), FIFO, or clock.
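As a concrete instance of one of the algorithms named above, the following is a generic textbook sketch of the clock policy selecting a replaceable block; it is not the patent's own replacement logic, only an illustration of one option it names.

```python
# Clock (second-chance) victim selection over a set of cache blocks.
# One reference bit per block; the hand sweeps, clearing bits until it
# finds an unreferenced block, which becomes the replacement victim.

class ClockReplacer:
    def __init__(self, nblocks):
        self.used = [False] * nblocks   # reference bit per cache block
        self.hand = 0

    def touch(self, block):
        self.used[block] = True         # block was accessed

    def pick_victim(self):
        while True:
            if not self.used[self.hand]:
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.used)
                return victim
            # give the block a second chance and move on
            self.used[self.hand] = False
            self.hand = (self.hand + 1) % len(self.used)
```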
If the address space of the above hard-disk address is greater than or equal to the address space of the memory 111, then in the embodiment of FIG. 12 no translation lookaside buffer TLB is needed in 51, and the hard-disk address is a physical address. The starting address injected by the operating system is then a physical address, and the main-memory address BN3 obtained by mapping this address (used to address the memory 112) is a mapping of a physical address; the remaining BN2 and BN1 addresses are mappings of the BN3 address, and therefore also mappings of physical addresses. The memory 111 (hard disk) is the virtual memory of the memory 112 (main memory), and the memory 112 (main memory) is a cache of the memory 111 (hard disk). There is therefore no situation in which a program's address space is larger than the main-memory address space. Multiple instances of the same program executing at the same time share the same BN3 addresses, while different programs executing at the same time necessarily have different BN3 addresses; thus the same virtual address in different programs at the same time is mapped to different BN addresses without confusion. In the push architecture, the processor core does not generate instruction addresses, so the physical hard-disk address can be used directly as the processor's address. There is no need, as in existing processor systems, for the processor core to generate a virtual address that is then mapped to a physical address to access memory.
The memory 111 and the memory 112 of the embodiment of FIG. 12 may be packaged together in one package as the memory. In the embodiment of FIG. 12, the interface between the processor and the memory adds, beyond the existing memory address bus 113 and instruction bus 115, the cache-address BN3 bus 114. Although the boundary between memory and processor in the embodiment of FIG. 12 is shown as the dotted line, some functional blocks may also be moved from one side of the boundary to the other. For example, the level-three active table 50 and the TLB and tag unit TAG in 51 may be placed on the memory side above the dotted line; the result is still logically equivalent to the embodiments of FIG. 12 and FIG. 11. In addition, one or more non-volatile memory 111 chips, one or more memory 112 chips, and the chip containing the portion below the dotted line in FIG. 12 (to which external interfaces may be added) can be interconnected through TSV vias and packaged in a single package as a complete computer of miniature physical size.
Refer to FIG. 13, which shows another embodiment of the processor/memory system of the present invention. The embodiment of FIG. 13 is a more general expression of the embodiments of FIGS. 8, 11, and 12. The memory 111, the level-three cache memory 112, the level-three active table 50, the level-three-cache TLB and tag unit 51, the selector 54, the scanner 43, the secondary track table 88, the secondary active table 40, the second-level cache memory 42, the secondary correlation table 103, the indirect branch target address generator 102, the track table 20, the primary correlation table 37, the first-level cache memory 22, the instruction read buffer 39, the tracker 47, the tracker 48, and the processor core 23 have the same functions as the identically numbered modules in the embodiment of FIG. 12. Newly added are a level-four active table 120, a level-four correlation table 121, and a level-four cache memory 122, addressed by the BN4 bus 123 generated by 51. Also newly added are a level-three track table 118 and a level-three correlation table 117; the latter stores the count values that in the embodiments of FIGS. 8, 11, and 12 were held in the level-three active table 50, so that the active tables of all levels have a consistent format. That is, in the embodiment of FIG. 13 there are no count values in 50; the count values are kept in 117.
In the embodiment of FIG. 13, the lowest level 111 of the memory hierarchy is a memory addressed by the memory address 113. All the other levels are caches of 111 at different levels, each addressed by the corresponding BN cache address. The lowest cache level, i.e., the level-four cache 122 in the figure, has a set-associative organization; all the higher cache levels are fully associative. The scanner 43 is located between the level-four cache memory 122 and the level-three cache memory 112. The TLB/TAG 51 belongs to the level-four cache. Every cache level above the level of the scanner 43 has a track table, namely 118, 88, and 20. Every cache level except the highest has an active table, namely 120, 50, and 40. Every cache level has a correlation table, namely 121, 117, 103, and 37. The formats of the various tables are shown in FIG. 14.
FIG. 14 shows the formats of the tables in the embodiment of FIG. 13. In the embodiment of FIG. 13, the format of the tag unit in 51 is the physical tag 86; the CAM portion of the TLB in 51 holds the thread number 83 and the virtual tag 84, and its RAM portion holds the physical tag 85. The thread number 83 and virtual tag 84 selected for output by the selector 54 in FIG. 13 are mapped to the physical tag 85 in the TLB; the index address 62 of the virtual address reads out the physical tags 86 in the tag unit, which are matched against 85 to obtain the way number 65. The way number 65 joined with the index address 62 of the virtual address forms the level-four cache block address 123. Alternatively, as described above, 51 may omit the TLB, and a physical address selected by the selector 54 is matched directly against the physical tags 86 in the TAG. In FIG. 14, each track table entry contains a type 11 and a cache block address BNX 12 and BNY 13, and may also contain SBNY 15 to determine the branch execution time point. The cache block address 12 in the track table of each level may be in the BN format of that level or of the level below; for example, 12 in the level-three track table 118 may be in BN3X or BN4X format. An active table entry contains the cache block number 76 of the corresponding sub-block, in the format of the cache block number of the level one above (for example, the level-three active table 50 stores BN2X), together with the corresponding valid bit 77. The function of an active table is to map a cache address of its level to a cache address one level higher. A correlation table entry contains a count value 70, which is the number of entries in the track tables of that storage level or higher that take the cache block as a branch target; the cache block number 71 of the corresponding block one level lower; and the track table entry addresses 72, with their corresponding valid bits 73, of the entries in that storage level that take the cache block as a branch target. The pointer 74 shared by the ways points, as described above, to the cache block that has gone unreplaced the longest; if the count value 70 of that cache block is below a preset replacement threshold, the block may be replaced. On replacement, the track table entries addressed by those 72 addresses whose 73 is 'valid' have their this-level cache block number replaced by the lower-level cache block number 71. The exception is the level-four correlation table 121, which contains only the count value 70 and none of 71, 72, and 73, because that level has no track table and no such address replacement within track table entries is needed.
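The replacement check and track-table fix-up just described can be modeled minimally as follows. The dict-based tables, the field names, and the threshold value are illustrative assumptions, not the hardware structures themselves.

```python
# A block is replaceable when its correlation-table count (field 70) is
# below a preset threshold; on replacement, every recorded track-table
# entry whose valid bit (73) is set is rewritten with the lower-level
# block number (71), so no track entry is left pointing at the evicted block.

THRESHOLD = 2   # assumed value for illustration

def try_replace(ct_entry, track_table):
    """ct_entry: {'count': int, 'lower_bn': str,
                  'sources': [(track_table_addr, valid_bit), ...]}"""
    if ct_entry["count"] >= THRESHOLD:
        return False            # still a frequent branch target; keep block
    for tt_addr, valid in ct_entry["sources"]:
        if valid:               # rewrite this-level BN with lower-level BN 71
            track_table[tt_addr] = ct_entry["lower_bn"]
    return True
```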
When an instruction block is transferred from the memory 122 (the level-four cache memory) over the bus to the level-three cache memory 112, the scanner 43 extracts the branch information in the instruction block, generates the track entry types, and also computes the branch target addresses. Each branch target address is selected by the selector 54 and sent to 51 to be matched against the tag unit. If there is no match, the branch target address addresses the memory 111 via the bus 113, and the corresponding instruction block is read out and stored into the memory 122, in a level-four cache block selected by the level-four cache replacement logic (the level-four active table 120, the level-four correlation table 121, and so on). If there is a match, the resulting BN4X address 123 addresses the level-four active table 120; if the entry in 120 is valid, the BN3X address in the entry is joined with the BNY of the branch target address to form a BN3 address, which is stored via the bus 125 into the entry of the level-three track table 118 corresponding to the branch instruction; if the entry in 120 is invalid, the BN4X address is joined directly with the above BNY to form a BN4 address that is stored into the entry in 118.
Refer to FIG. 15, which shows the address formats of the processor system in the embodiment of FIG. 13. A memory address is divided into a tag 61, an index 62, a tertiary sub-address 126, a secondary sub-address 63, a primary sub-address 64, and an intra-block offset (BNY) 13. The address BN4 of the level-four cache consists of a way number 65, the index 62, the tertiary sub-address 126, the secondary sub-address 63, the primary sub-address 64, and the intra-block offset (BNY) 13; the fields other than BNY 13 are collectively called BN4X. The address BN3 of the level-three cache consists of a level-three cache block number 128, the secondary sub-address 63, the primary sub-address 64, and the intra-block offset (BNY) 13; the fields other than the intra-block offset 13 are collectively called BN3X. The address BN2 of the level-two cache consists of a level-two cache block number 67, the primary sub-address 64, and the intra-block offset (BNY) 13; the fields other than the intra-block offset 13 are collectively called BN2X, which addresses one first-level instruction block within a level-two cache block. The address BN1 of the level-one cache consists of a level-one cache block number 68 (BN1X) and the intra-block offset (BNY) 13. The intra-block offset (BNY) 13 is the same in all four BN address formats, and the BNY portion does not change during address conversion.
When a second-level instruction block is filled from the level-three cache memory 112 into the level-two cache memory 42, the corresponding track is read out of the level-three track table 118 over the bus 119. A BN4-format address in a track entry addresses the level-four active table 120: if the entry in 120 is valid, its BN3X address is filled into the track entry in 118 and also bypassed onto the bus 119 to be stored into the corresponding entry of the secondary track table 88; if the entry in 120 is invalid, the BN4 address on the bus 119 addresses the memory 122, and the corresponding instruction block is read out and filled into the memory 112 at the level-three cache block pointed to by the BN3X address given by the level-three cache replacement logic (the level-three active table 50, the level-three correlation table 117, and so on). That BN3X address is stored into the entry of the level-four active table 120 pointed to by the above BN4 address, stored into the corresponding entry of the level-three track table 118, and also bypassed onto the bus 119 to be stored into the corresponding entry of the secondary track table 88. If the address output on the bus 119 is already a BN3X address, that BN3X address addresses the level-three active table 50: if the entry in 50 is valid, its BN2X address is stored into the corresponding entry of the secondary track table 88; if the entry in 50 is invalid, the BN3X address on 119 addresses the memory 112, and the corresponding second-level cache block is read out and stored into the level-two cache memory 42 at the level-two cache block pointed to by the BN2X address given by the level-two cache replacement logic (the secondary active table 40, the secondary correlation table 103, and so on); that BN2X is also stored into the entry of the level-three active table 50 addressed by the above BN3X, and into the secondary track table 88.
When a first-level instruction block is filled from the level-two cache memory 42 into the level-one cache memory 22, the corresponding track is read out of the secondary track table 88 over the bus 89. A BN3-format address in a track entry addresses the level-three active table 50: if the entry in 50 is valid, its BN2X address is filled into the track entry in 88 and also bypassed onto the bus 89 to be stored into the corresponding entry of the primary track table 20; if the entry in 50 is invalid, the BN3 address on the bus 89 addresses the memory 112, and the corresponding instruction block is read out and filled into the memory 42 at the level-two cache block pointed to by the BN2X address given by the level-two cache replacement logic (the secondary active table 40, the secondary correlation table 103, and so on). That BN2X address is stored into the entry of the level-three active table 50 pointed to by the above BN3 address, stored into the corresponding entry of the secondary track table 88, and also bypassed onto the bus 89 to be stored into the corresponding entry of the primary track table 20. If the address output on the bus 89 is already a BN2X address, that BN2X address addresses the secondary active table 40: if the entry in 40 is valid, its BN1X address is stored into the corresponding entry of the primary track table 20; if the entry in 40 is invalid, the BN2X address on 89 addresses the memory 42, and the corresponding first-level cache block is read out and stored into the level-one cache memory 22 at the level-one cache block pointed to by the BN1X address given by the level-one cache replacement logic (the primary correlation table 37, and so on); that BN1X is also stored into the entry of the secondary active table 40 addressed by the above BN2X, and into the primary track table 20.
When an instruction block is pushed from the level-one cache memory 22 to the processor core 23 or the IRB 39, the corresponding track is read out of the primary track table 20 over the bus 29. A BN2-format address in a track entry addresses the secondary active table 40: if the entry in 40 is valid, its BN1X address is filled into the track entry in 20 and bypassed onto the bus 29; if the entry in 40 is invalid, the BN2 address on the bus 29 addresses the memory 42, and the corresponding instruction block is read out and filled into the memory 22 at the level-one cache block pointed to by the BN1X address given by the level-one cache replacement logic (the primary correlation table 37, and so on). That BN1X address is stored into the entry of the secondary active table 40 pointed to by the above BN2 address, and into the corresponding entry of the primary track table 20. If the address output on the bus 29 is already a BN1 address, that BN1 address is stored into the register in the tracker 47 and becomes the read pointer 28, which addresses the track table 20 and the level-one cache memory 22 and pushes instructions to the processor core 23 or the IRB 39. This guarantees that, for any instruction in the level-one cache memory 22, its branch target and the sequentially next level-one cache block are at least already in the level-two cache memory 42 or in the process of being stored into 42. The remaining operations are as described in the preceding embodiments and are not repeated here.
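The per-level step repeated in the last three paragraphs — consult the active table, promote the address on a valid entry, otherwise fetch the block and record the mapping — can be condensed into one hedged sketch. The dictionary tables and callbacks stand in for the hardware arrays and buses; they are assumptions for illustration, not the patent's structures.

```python
# Generic fill step: a track-table entry holds a BN of some level; the
# active table of that level maps its block number to a block number one
# level higher (closer to the core). On a 'valid' entry the mapped BN is
# returned; on an 'invalid' entry the block is fetched into a replaceable
# block one level up and the mapping is recorded for reuse.

def resolve(entry_bn, active_table, fetch_block, alloc_block):
    """entry_bn: (level, block). active_table: dict block -> upper block.
    Returns the BN one level higher (smaller level number)."""
    level, block = entry_bn
    mapped = active_table.get(block)
    if mapped is not None:              # active-table entry 'valid'
        return (level - 1, mapped)
    new_block = alloc_block()           # replacement logic picks a block
    fetch_block(block, new_block)       # read block into the upper level
    active_table[block] = new_block     # record mapping (entry now 'valid')
    return (level - 1, new_block)
```

Applying `resolve` repeatedly models the cascade BN4 → BN3 → BN2 → BN1 that ends with the BN1 address loaded into the tracker 47.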
Although the embodiment of FIG. 13 is shown as an instruction-push memory/processor system that executes both legs of a branch simultaneously, its memory hierarchy can also serve processor cores of other organizations, such as an out-of-order multi-issue processor system in which the processor core generates addresses to access the level-one cache or the instruction read buffer. The method and system of the embodiment of FIG. 13 can also be applied to a data memory hierarchy and to data pushing, so that the memory hierarchy likewise pushes data to the processor core. For ease of explanation, the following embodiment assumes that the data memory has the same storage hierarchy as the instruction memory, i.e., a memory, a level-four cache, a level-three cache, a level-two cache, a level-one cache, and a data read buffer, corresponding to the levels of the instruction memory. The address formats of the data memory hierarchy are therefore the same as in the embodiment of FIG. 15, except that the memory address is now a data address rather than an instruction address, and each BN address may be a DBN (Data Block Number) address to distinguish it from a BN address, accommodating separate instruction and data caches. If at some storage level a single memory serves as a unified cache (storing both instructions and data), the addresses of that level are still named BN.
Each storage level likewise needs a data track table DTT, a data active table DAL, a data correlation table DCT, and pointers to support the operation of the data memory. Please refer to FIG. 16, which shows the formats of the data track table, the data active table, and the data correlation table. No branch target addresses need to be stored in the data track table DTT; it only stores the block address DBNX 132 of the sequentially next data block and its valid bit 133. Optionally, the block address 130 of the sequentially previous data block and its valid bit 131 may be added, for use when data is accessed in reverse order. Alternatively, the data track table may be omitted entirely. The format of the data active table DAL is the same as the active table AL formats 76, 77 shown in FIG. 14, in which field 134 stores the data block address DBNX and field 135 stores the corresponding valid bit. A data block address (such as block-2 address 67 in FIG. 15) addresses one row of the DAL of this level, and a sub-address (such as sub-2 address 64 in FIG. 15) addresses one group of 134, 135 within that row. If valid bit 135 is 'valid', the next-higher-level block address in 134 is read out of the DAL to access the next-higher-level data memory. That is, the data active table DAL maps a storage-level address to the address of the next higher storage level, while the data correlation table DCT stores only the corresponding next-lower storage-level address 136. Thus the DAL maps a storage-level address to the corresponding next-higher storage-level address, and the DCT maps a storage-level address to a next-lower storage-level address (DBLNX in FIG. 16 denotes a lower-level address). Pointer 137 is used for cache replacement. The data cache may use the replacement scheme disclosed in this invention for the instruction cache, but the correlation tables of the data cache contain no count values, because no branch instruction jumps into the data cache; replacement therefore need not consider replacing addresses in the track table that target a data cache block, nor record branch source addresses. The level-one cache only needs pointer 137 to record the last replaced cache block, with pointer 137 traversing in one direction, or replacement may use LRU, LFU, or similar policies. The level-two, level-three, and level-four caches are replaced as in the instruction cache: a cache block may be replaced as long as it has no corresponding cache block at a higher level. Pointer 137 of each level may traverse in one direction, reading out the entries of the active table; if all address fields in an entry are 'invalid', the corresponding cache block may be replaced. The level-one replacement scheme of the instruction cache disclosed in this invention may also use LRU, LFU, or similar policies.
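The two-way linkage between a DAL entry (mapping a block upward) and the higher level's DCT entry (mapping the same block downward) can be sketched in software terms. This is an illustrative model only, not the hardware; the class and method names are invented for the sketch.

```python
class DataActiveTable:
    """Illustrative model of one storage level's DAL, paired with the
    next higher level's DCT. The DAL maps this level's block address to
    the block address at the next higher (faster) level; the DCT at the
    higher level records this level's block address (field 136)."""

    def __init__(self, num_blocks, groups_per_row):
        # DAL: per (row, sub-address) group -> (higher-level DBNX 134, valid bit 135)
        self.dal = [[(None, False)] * groups_per_row for _ in range(num_blocks)]

    def map_up(self, block_addr, sub_addr):
        """Field 134/135 lookup: return the higher-level block address,
        or None when valid bit 135 is clear (block absent above)."""
        higher_dbnx, valid = self.dal[block_addr][sub_addr]
        return higher_dbnx if valid else None

    def link(self, block_addr, sub_addr, higher_dbnx, higher_level_dct):
        """Establish the bidirectional mapping when a block is filled
        upward: the DAL entry gains the higher-level block number, and
        the higher level's DCT records this level's block address."""
        self.dal[block_addr][sub_addr] = (higher_dbnx, True)
        higher_level_dct[higher_dbnx] = (block_addr, sub_addr)
```

A lookup that returns None corresponds to an 'invalid' valid bit 135, i.e. the block must first be fetched from this level and filled upward.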
The data-push memory hierarchy also uses a stride table 150 to record the stride, i.e., the difference between two consecutive data access addresses of the same data access instruction. Please refer to FIG. 17, which shows the stride table format and its operation. 150 is a memory in which each row corresponds to one data access instruction (such as LD or ST) and is addressed by the instruction address of that data access instruction. Each row contains a data address 138; in the following embodiments the format of 138 is DBN1, i.e., a level-one data cache address consisting of DBN1X and DBNY, similar to 68 and 13 in FIG. 15. Field 139 holds the status bits of 138. There are also several groups of strides, one of which is 140 with its corresponding valid bit 141; 142 and 143 are the strides of the other groups. Each stride group, such as 140 with its valid bit 141, is selected according to the branch-loop level of the data access instruction within the instruction segment. Please refer to the lower part of FIG. 17: the straight line represents sequential instructions executed in the direction of the arrow, the arcs represent backward branches, the crosses represent branch instructions, and the triangle represents a data access instruction. Here 146 is a data access instruction, and the row of the stride table 150 shown in the upper part of FIG. 17 corresponds to 146. When the backward branch corresponding to stride field 140 is resolved as 'taken', the inner-loop stride of data access instruction 146 is stored into stride field 140 of the row of 150 corresponding to 146. When that branch is resolved as 'not taken' and the branch corresponding to stride field 142 is resolved as 'taken', the middle-loop stride of 146 is stored into stride field 142 of that row. When the branches corresponding to 140 and 142 are both resolved as 'not taken' and the branch corresponding to stride field 143 is resolved as 'taken', the outer-loop stride of 146 is stored into stride field 143 of that row. Branch resolution is thus prioritized: the backward branch instruction immediately following the data access instruction has the highest priority, the priorities of the other backward branch instructions decrease in order, and a higher-priority branch instruction resolved as 'taken' masks the lower-priority branch instructions so that they do not affect the readout of the stride table 150. Forward branch instructions are not recorded in the stride table. An adder may add the data address DBN1 in field 138 of a row of 150 to the stride selected by branch resolution (such as 140) to obtain the next data address, with which the data storage hierarchy is accessed so that data is fetched in advance and pushed to the processor core.
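The prioritized stride selection and next-address computation described above can be illustrated as follows. This is a sketch only: the 'nearest backward branch wins' priority rule and the add-stride-to-DBN1 step come from the text, while the function names are invented.

```python
def select_stride(strides, taken_flags):
    """strides: stride fields in priority order (inner 140, middle 142,
    outer 143, ...); taken_flags: the resolutions of the corresponding
    backward branches in the same order. The first branch resolved as
    'taken' masks all lower-priority ones, as described for table 150."""
    for stride, taken in zip(strides, taken_flags):
        if taken:
            return stride
    return None  # no backward branch taken: the loop was exited

def next_data_address(dbn1, strides, taken_flags):
    """Adder stage: add the selected stride to the DBN1 address held in
    field 138 to obtain the next prefetch address."""
    stride = select_stride(strides, taken_flags)
    return None if stride is None else dbn1 + stride
```

For example, with strides 4/64/1024 for the inner/middle/outer loops, a taken inner branch always selects the stride 4 regardless of the outer branches' outcomes.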
Please refer to FIG. 18, which shows another embodiment of the processor/memory system of the invention. The left half of FIG. 18 is an instruction-push processor system similar to the FIG. 13 embodiment, and the right half is a data-push memory hierarchy. The level-three track table 118, level-three correlation table 117, level-three cache memory 112, level-three active table 50, the TLB and tag unit 51 of the level-three cache, scanner 43, level-two track table 88, level-two active table 40, level-two cache memory 42, level-two correlation table 103, indirect branch target address generator 102, track table 20, level-one correlation table 37, level-one cache memory 22, instruction read buffer 39, tracker 47, tracker 48, and processor core 23 have the same functions as the identically numbered modules in the FIG. 13 embodiment. The functions of memory 111, level-four active table 120, level-four correlation table 121, and level-four cache memory 122 are similar to those in the FIG. 13 embodiment, except that they store not only instructions but also data and data-related auxiliary information such as data cache block numbers. An entry of the level-four active table 120 may store either a level-three instruction cache address BN3 or a level-three data cache address DBN3. Selector 54 is now a three-input selector. In addition to the instruction-scanning function of the FIG. 13 embodiment, scanner 43 also calculates, for data blocks passing over bus 115, the address of the sequentially next data block (or the previous data block in reverse order). The right half contains a level-three data cache memory 160, a level-two data cache memory 161, a level-one data cache memory 162, a data read buffer 163, the stride table 150, a level-three data track table 164, a level-two data track table 165, a level-one data track table 166, a level-three data active table 167, a level-two data active table 168, adders 169, 170, 171, 172, 173, a level-three data correlation table 174, a level-two data correlation table 175, a level-one data correlation table 176, and a selector 192.
In FIG. 18, memory 111 is addressed by memory addresses, memory 122 has a set-associative cache organization, and the caches at all other levels are fully associative. As in the FIG. 13 embodiment, memory 111 of FIG. 18 may serve as the main memory of the processor/memory system, in which case 122 is the last-level cache of the processor and is a unified cache. Alternatively, the system may be organized with 111 as the hard disk, in which case 122 is the main memory organized as a cache, 112 is the last-level instruction cache of the processor, and 160 is the last-level data cache of the processor. The instruction pushing of the left half of the FIG. 18 embodiment is identical to that of the FIG. 13 embodiment and is not repeated here. The data-push process of the right half is described below. The entries of the data read buffer (DRB) 163 correspond one-to-one with the entries of the IRB instruction read buffer 39. When a data load instruction in the IRB is pushed by the IPT pointer 38 into processor core 23 for execution, the data in its corresponding DRB entry is also read out by 38 and pushed over bus 196 to processor core 23 for processing. The task of the data storage hierarchy is therefore to fill, in advance, the data that the processor core will need into the DRB entries corresponding to the data access instructions in the IRB, so that the data is pushed to processor core 23 along with the instructions (the data and the instruction are not necessarily pushed at the same time, because the pipeline stage in which the processor core executes a data load instruction is usually not the stage in which the corresponding data enters the processor core).
When a level-one instruction block is stored into IRB 39, its corresponding entries in DRB 163 are cleared. When the decoder (the instruction decoder in processor core 23, or a dedicated instruction decoder attached to IRB 39) decodes an instruction being sent to processor core 23 as a data load instruction, the system allocates a row of the stride table 150 for its exclusive use. Status field 139 of that row is set to '0'. Based on this status of '0', the system has processor core 23 execute the data load instruction; the data address it generates is output over bus 94, bypasses 102, and is sent via bus 46 and selector 54 to 51 for matching. If there is no match, then as in the foregoing FIG. 13 embodiment the data address accesses memory 111 over bus 113, a level-four data block is read out and stored into the level-four cache block of memory 122 pointed to by the way number supplied by the level-four cache replacement logic (65 in FIG. 15) concatenated with index 62 of the data address. The data address is also stored into the entry of the tag unit in 51 likewise pointed to by 65 and 62.
The system further reads a level-three data block out of memory 122 using the above 65, 62 together with the level-three sub-address 126 of the data address, and stores it over bus 115 into the level-three cache block of the level-three data cache memory 160 specified by the level-three data block number 128 supplied by the level-three data cache replacement logic. It stores that level-three block number 128 into the entry field of the level-four active table 120 pointed to by 65, 62, and 126 and sets the field to 'valid'. At the same time, the 65 and 62 (the level-four block number) are stored into the entry of the level-three correlation table 174 pointed to by the above 128. In addition, scanner 43 calculates the address of the data block sequentially following the above level-three data block (i.e., the data address plus the size of one level-three data block) and sends it to the tag unit in 51 for matching to obtain a BN4 address; that BN4 address accesses the level-four active table 120 and is mapped to a DBN3X address, which is concatenated with the DBNY 13 of the data address to obtain a DBN3 address. The resulting DBN3 or BN4 address is stored into field 132 of the entry of the level-three data track table 164 pointed to by the above 128. If the sequentially next level-three data block lies within the same cache block, '1' is added to the above 126 and the result is concatenated with the original 65, 62 to obtain the DBN3 address of the sequentially next level-three data block directly, without mapping through the tag unit in 51. Optionally, this sequentially next level-three data block may also be filled into the level-three cache memory 160, with the corresponding entries in 120 and 174 filled as above; generally the sequential successor of that next level-three data block need not also be filled into 160.
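The successor-address arithmetic performed by scanner 43 can be expressed directly. The block size and sub-address width below are illustrative assumptions, not values fixed by the specification.

```python
L3_BLOCK_SIZE = 256   # assumed bytes per level-three data block
SUB3_COUNT = 4        # assumed level-three sub-blocks per level-four block

def next_l3_block_address(data_address):
    """Scanner-43 style computation: the data address plus one
    level-three block size, aligned to a block boundary."""
    return (data_address // L3_BLOCK_SIZE + 1) * L3_BLOCK_SIZE

def next_within_l4(sub3_index):
    """Shortcut from the text: when the successor stays inside the same
    level-four cache block, sub-address 126 is simply incremented instead
    of re-mapping through the tag unit in 51. Returns None on overflow,
    i.e. when the tag-unit path is required."""
    nxt = sub3_index + 1
    return nxt if nxt < SUB3_COUNT else None
```

The None case corresponds to the successor crossing the level-four block boundary, where the full tag-unit match must be used.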
The system further reads a level-two data block out of the level-three data cache memory 160 using the above 128 together with the level-two sub-address 63 of the data address, and stores it into the level-two cache block of the level-two data cache memory 161 specified by the level-two data block number 67 supplied by the level-two data cache replacement logic. It stores that level-two block number 67 into the entry field of the level-three data active table 167 pointed to by 128, 63 and sets the field to 'valid'. At the same time, the 128 (the level-three block number) is stored into the entry of the level-two correlation table 175 pointed to by the above 67. Optionally, '1' may be added to the above 63 at this time, and the level-three active table 167 addressed with 128 concatenated with the incremented 63. If the entry is 'valid', the sequentially next level-two cache block is already in the level-two cache. If the entry is 'invalid', a level-two data block is read out of the level-three data cache memory 160 at the address formed by 128 and the incremented 63, and stored into the level-two data cache block of 161 pointed to by another level-two block number 67 supplied by the level-two cache replacement logic; that other 67 is stored into the entry of 167 addressed by 128 and the incremented 63, and the entry is set to 'valid'.
If the address of the sequentially next level-two data block crosses the boundary of the level-three cache block, the entry of the level-three track table 164 pointed to by the above 128 is read out over bus 190. If the entry content is in BN4 format, that BN4 address accesses the level-four active table 120 over bus 197. If the entry of 120 is 'valid', the DBN3 address in the entry is stored into the entry of 164 pointed to by 128, replacing the original BN4; if the entry of 120 is 'invalid', the BN4 address on bus 197 accesses memory 122 to read out the sequentially next level-three data block and store it into memory 160, and the corresponding entries in 164, 167, 174, and 120 are filled in the manner described above. This guarantees that when the content of a level-three data block is stored into the level-two data cache, its sequentially next level-three data block is stored into the level-three data cache. Optionally, when the entry of the level-three track table 164 pointed to by the above 128 is in DBN3 format, that DBN3 addresses the level-three active table 167 over bus 190 as described above, so that the sequentially next level-two data block following the level-two data block being filled into the level-two cache memory 161 is also filled into 161. The previous data block in reverse order may of course also be stored into the data cache as needed, using field 130 of the track table. The data track tables 164, 165, 166 may also be omitted entirely; in that case the system has no capability of automatically filling sequential or reverse-order level-two data blocks that cross a level-three data cache block boundary. Pre-filling at the other data storage levels proceeds in the same way.
The system further reads a level-one data block out of the level-two data cache memory 161 using the above 67 concatenated with the level-one sub-address 64 of the data address, and stores it into the level-one cache block of the level-one data cache memory 162 specified by the level-one data block number 68 supplied by the level-one data cache replacement logic. It stores that level-one block number 68 into the entry field of the level-two data active table 168 pointed to by 67, 64 and sets the field to 'valid'. At the same time, the 67 (the level-two block number) is stored into the entry of the level-one correlation table 176 pointed to by the above 68. Optionally, the entry of the level-two track table 165 pointed to by the above 67 is read out at this time. If the entry content is in BN3X format, that BN3 address addresses the level-three active table 167 over bus 185; if the entry of 167 is 'valid', the BN2X address in the 167 entry is written back to 165 over bus 189 to replace the BN3X address. If the 167 entry is 'invalid', the address on 185 addresses the level-three data cache memory 160 to read out a level-two data block and store it into the level-two cache block of 161 pointed to by another level-two cache block address 67 supplied by the cache replacement logic. That other 67 is also stored into the entry of the level-three data active table 167 addressed by 185, and is stored into the level-two data track table 165 to replace the BN3X address. With that 67 address, corresponding entries are also created for the above level-two cache block in the level-two data active table 168 and the level-two data correlation table 175, the 175 entry storing the above BN3X address. This guarantees that when the content of a level-two data block is stored into the level-one data cache, its sequentially next level-two data block is stored into the level-two data cache.
The system further stores the above 68, together with the DBNY 13 of the data address, as a level-one data cache address DBN1 over bus 193 into field 138 of the row of the stride table 150 corresponding to the above data load instruction, and sets status field 139 of that row to '1'. Based on this status of '1', the system accesses the level-one data cache memory 162 with the above DBN1 and stores the read data into the entry of DRB 163 corresponding to the above data load instruction, so that the data can be pushed to processor core 23 along with the instruction for processing. After the data is pushed to processor core 23, the system begins prefetching the next data item into the DRB, to be pushed the next time the same data load instruction executes. Since status field 139 is now '1', the process of prefetching data for pushing is exactly as described above, except that when the new 68 and 13 (DBN1) are produced, this DBN1 is first subtracted from the previous DBN1 stored in field 138 of that row of the stride table 150, and the difference is stored as the stride into the entry selected by branch resolution at that time, such as 140. The new DBN1 is then written into field 138 to replace the old address, and status field 139 is set to '2'.
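The progression of status field 139 through '0', '1', and '2' can be modelled as a small state machine. Only the states and transitions are taken from the text; the class and method names are illustrative, and the stride-group selection by branch resolution is collapsed into a single stride for brevity.

```python
class StrideRow:
    """One row of stride table 150 for one data load instruction.

    status 0: no address seen yet - wait for the core's computed address.
    status 1: one address recorded in field 138 - the next computed
              address yields the stride by subtraction.
    status 2: stride known - further prefetch addresses are generated by
              adding the stride, without waiting for the processor core.
    """

    def __init__(self):
        self.status = 0
        self.dbn1 = None      # field 138
        self.stride = None    # selected stride field, e.g. 140

    def observe_address(self, dbn1):
        """Called with a data address computed by the processor core."""
        if self.status == 0:
            self.dbn1 = dbn1
            self.status = 1
        elif self.status == 1:
            self.stride = dbn1 - self.dbn1  # difference of two accesses
            self.dbn1 = dbn1
            self.status = 2

    def predict_next(self):
        """In status 2 the next address comes from the adder, not the core."""
        if self.status != 2:
            return None
        self.dbn1 += self.stride
        return self.dbn1
```

Once a row reaches status 2, `observe_address` is no longer needed on the common path; the row generates addresses autonomously, which is what lets the hierarchy push data ahead of execution.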
After this second data item is pushed to processor core 23, when a branch instruction following the data load instruction is resolved as 'taken', the system begins prefetching the next data item into the DRB, to be pushed the next time the same data load instruction executes. Since status field 139 is now '2', the system no longer waits for processor core 23 to calculate the data address. Instead, the DBN1 address in field 138 of the row of the stride table 150 corresponding to the data load instruction, and the stride selected by branch resolution (such as 140), are output directly and added in adder 173. The system performs a boundary check on output 181 of 173. If 181 does not cross the boundary of the level-one data cache block, selector 192 selects 181 to access the level-one data cache memory 162, and the read data is stored into the corresponding DRB entry awaiting push; the address on 181 is stored as DBN1 into field 138 of the corresponding row of the stride table. If 181 crosses the boundary of the level-one data cache block but not the boundary of the adjacent level-one cache block, 181 addresses the level-one data track table 166; the DBN1X address 132 of the sequentially next level-one data block (or the DBN1X address 130 of the previous data block in reverse order) is read out over bus 191, selected by selector 192, and concatenated with the DBNY address 13 on 181 to access memory 162, and the read data is stored into the corresponding DRB entry awaiting push. The concatenated address DBN1 is stored into field 138 of the corresponding row of the stride table 150. In both cases status field 139 in 150 remains '2'. If the address 132 output by 166 is in BN2X format, the system addresses the level-two data active table 168 with that BN2X over 191; if the 168 entry is 'valid', the BN1X address in the 168 entry is written back to 166 over bus 184 to replace the BN2X address. If the 168 entry is 'invalid', the address on 191 addresses the level-two data cache memory 161 to read a level-one data block and store it into the level-one cache block of the level-one data cache memory 162 pointed to by the level-one cache block address 68 supplied by the cache replacement logic. That 68 is also stored into the entry of the level-two data active table 168 addressed by 191, and is stored into the level-one data track table 166 to replace the BN2X address.
If 181 crosses the above boundary but not a level-two cache block boundary, the system addresses the level-one correlation table 176 with DBN1 address 138, mapping the DBN1 address to a DBN2 address output over bus 182. Adder 172 adds stride 140 to the DBN2 address on 182, and its output 183 addresses the level-two data active table 168. If the entry there is 'valid', the DBN1X address in the entry is concatenated with the DBNY 13 on 183 and accesses the level-one data cache memory 162 over bus 184; the read data is stored into the DRB entry awaiting push, and the DBN1 address on 184 is stored into field 138 of the corresponding row of the stride table 150, with field 139 kept at '2'. If the entry of the level-two data active table 168 is 'invalid', 183 addresses the level-two data cache memory 161 to read a level-one data block and store it into the level-one cache block of 162 specified by the level-one data block number 68 supplied by the level-one data cache replacement logic. The system concatenates that 68 with the DBNY on 183 into a DBN1 address to access 162; the read data is stored into the DRB entry awaiting push, and the DBN1 address is stored into field 138 of the corresponding row of the stride table, with field 139 kept at '2'.
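The cascaded boundary check on adder output 181 amounts to classifying how far the stride carries the address, which then selects the mapping path (direct access, track table, or correlation-table descent). A simplified sketch with assumed power-of-two block sizes; the real logic operates on DBN address fields rather than byte addresses.

```python
# Assumed byte sizes of the cache blocks at each level (powers of two).
L1_BLOCK, L2_BLOCK, L3_BLOCK, L4_BLOCK = 64, 256, 1024, 4096

def classify_boundary(old_addr, new_addr):
    """Decide which mapping path the prefetch must take, mirroring the
    successive checks on output 181 of adder 173."""
    if old_addr // L1_BLOCK == new_addr // L1_BLOCK:
        return "same L1 block"       # use 181 directly on memory 162
    if abs(new_addr // L1_BLOCK - old_addr // L1_BLOCK) == 1:
        return "adjacent L1 block"   # use track table 166 (field 132/130)
    if old_addr // L2_BLOCK == new_addr // L2_BLOCK:
        return "within L2 block"     # map down via table 176, add at L2
    if old_addr // L3_BLOCK == new_addr // L3_BLOCK:
        return "within L3 block"     # map down via table 175, add at L3
    if old_addr // L4_BLOCK == new_addr // L4_BLOCK:
        return "within L4 block"     # map down via table 174, add at L4
    return "beyond L4"               # go to tag unit 51 / memory 111
```

The farther the stride reaches, the lower the level at which the addition is performed, which is why each correlation table maps the address down exactly one level before the corresponding adder is used.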
If 181 crosses a level-two cache block boundary but not a level-three cache block boundary, the system addresses the level-two correlation table 175 with the DBN2 address on the above bus 182, mapping the DBN2 address to a DBN3 address output over bus 186. Adder 171 adds stride 140 to the DBN3 address on 186, and its output 188 addresses the level-three data active table 167. If the entry in 167 is 'valid', the DBN2X address in that entry is concatenated with the DBNY 13 on 188 and addresses the level-two data active table 168 over bus 189. If the entry in 168 is 'valid', the DBN1X address in that entry is directly concatenated with the DBNY 13 on bus 188 as a DBN1 address to access the level-one data cache memory 162 over bus 184; the read data is stored into the DRB entry awaiting push, and the DBN1 address is stored into field 138 of the corresponding row of the stride table, with field 139 kept at '2'. If the entry in 168 is 'invalid', the DBN2 address on bus 189 addresses the level-two data cache memory 161 to read a level-one data block and store it into the level-one cache block of the level-one data cache memory 162 pointed to by the level-one data cache block number 68 supplied by the level-one data cache replacement logic; that 68 is also stored into the entry of 168 addressed by bus 189, and the entry is set to 'valid'. The system concatenates that 68 with the DBNY on 189 into a DBN1 address to access 162; the read data is stored into the DRB entry awaiting push, and the DBN1 address is stored into field 138 of the corresponding row of the stride table, with field 139 kept at '2'.
If 181 crosses a level-three cache block boundary but not a level-four cache block boundary, the system addresses the level-three correlation table 174 with the DBN3 address on the above bus 186, mapping the DBN3 address to a BN4 address output over bus 196. Adder 170 adds stride 140 to the BN4 address on 196, and its output 197 addresses the level-four active table 120. If the entry in 120 is 'valid', the DBN3X address in that entry is concatenated with the DBNY 13 on 197 and addresses the level-three data active table 167 over bus 125. If the entry in 167 is 'valid', the DBN2X address in that entry is directly concatenated with the DBNY 13 on bus 125 as a DBN2 address to access the level-two data active table 168 over bus 189. If the entry in 167 is 'invalid', the DBN2 address on bus 189 addresses the level-two data cache memory 161, a level-one data block is read out and stored into the level-one cache block of the level-one data cache memory 162 pointed to by the level-one data cache block number 68 supplied by the level-one data cache replacement logic; that 68 is also stored into the entry of 168 addressed by bus 189, and the entry is set to 'valid'. Accessing the level-two data active table 168 with the DBN2 address on bus 189 and the subsequent operations are the same as described in the preceding paragraph. Finally, the system accesses 162 with the DBN1 address; the read data is stored into the entry of DRB 163 awaiting push, and the DBN1 address is stored into field 138 of the corresponding row of the stride table, with field 139 kept at '2'.
If 181 exceeds the level-4 cache block boundary, the system addresses the tag unit in 51 with the BN4 address on bus 196 described above, reads out the corresponding tag 61, and sends it via bus 113 to adder 169. 169 adds tag 61 to step size 140; the sum 198 is selected by selector 54 and sent to the tag unit in 51 for matching. If the match yields a new BN4 address, the level-4 active table 120 is addressed via bus 123 with that new BN4 address. If the entry in 120 is ‘valid’, the DBN3X address in the entry addresses the level-3 active table 167 via bus 125; the subsequent operations are the same as addressing 167 via bus 125 in the preceding paragraph. If the entry in 120 is ‘invalid’, memory 122 is addressed with the new BN4 address on bus 123, and the level-3 data block read out is filled into level-3 data cache memory 160, operating as described above. If there is no match in the tag unit, the address on bus 198 is placed on bus 113 to address memory 111, and the level-4 data block read out is stored into level-4 cache memory 122; this process has been described earlier in this embodiment and is not repeated. Finally the system accesses 162 with the DBN1 address obtained through the active-table mappings at each level, stores the read data into the DRB entry awaiting push, stores that DBN1 address into field 138 of the corresponding row of the step table, and keeps field 139 unchanged at ‘2’. If during this process the corresponding data block does not yet exist at some storage level, the system automatically reads the block from the next lower storage level and stores it into the cache block designated by the cache replacement logic at this level; the address of that cache block is also stored into the lower level’s active table, and the lower level’s cache block number is stored into this level’s correlation table, establishing a bidirectional mapping.
The above describes the push process for data loads. Data stores may use a similar method, or a conventional method such as a write buffer: when the data cache is idle, the data in the write buffer is written back to the data cache. When data is loaded speculatively using the step size in step table 150 (i.e., when field 139 in 150 is ‘2’), the processor core must send the correct data address via bus 49 for comparison with the speculated DBN1 address. If they differ, the speculatively loaded data and its subsequent execution results must be discarded, data is loaded with the correct data address on bus 49, the corresponding field 139 is set to ‘0’, and the step size is recomputed and stored into 150. If there is a write buffer, the speculated load address is also compared with the addresses in the write buffer to determine whether the loaded data has since been updated. The DBN address may be mapped to a data address for comparison with the data address on 49; alternatively, the address on 49 may be mapped to a DBN address for comparison with the DBN address produced by the system’s speculation. In addition, if the valid bit (e.g., 141) of the step size read out of step table 150 under a branch decision is ‘invalid’, a step size must again be generated under that branch-decision condition as described above and stored into the corresponding step field.
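The speculate-then-verify logic just described can be sketched in software terms as a small state machine per load instruction. The sketch below is a purely illustrative model under invented names (`StrideEntry`; state transitions simplified to 0 for learning and 2 for confirmed); it is not the claimed hardware, only the check-and-retrain behavior it performs.

```python
# Minimal software model of the stride-speculation check described above.
# Field 139 states (simplified): 0 = learning, 2 = stride confirmed.
# All names are illustrative; the embodiment describes hardware, not this API.

class StrideEntry:
    """One row of the step table: last address (138), stride (140), state (139)."""
    def __init__(self):
        self.last_addr = None
        self.stride = None
        self.state = 0          # 0: learning, 2: stride confirmed

    def guess(self):
        """Speculated next address (the pushed DBN1), if the stride is confirmed."""
        return self.last_addr + self.stride if self.state == 2 else None

    def update(self, actual_addr):
        """Compare the core's correct address (bus 49) with the guess.
        Returns True if the speculative push was correct."""
        g = self.guess()
        if g is not None and g == actual_addr:
            self.last_addr = actual_addr
            return True          # speculation correct, pushed data is usable
        # Wrong guess or still learning: discard speculative results,
        # recompute the stride from the two most recent correct addresses.
        if self.last_addr is not None:
            self.stride = actual_addr - self.last_addr
            self.state = 2
        else:
            self.state = 0
        self.last_addr = actual_addr
        return False
```

A mismatch costs one retraining step; thereafter the recomputed stride is used for speculation again.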
In the data memory hierarchy of the FIG. 18 embodiment, the lowest-level cache is set-associative; that level has a tag unit, and may also have a TLB for virtual-to-physical address translation. That level may be addressed by a memory address matched through the tag unit in 51, or directly by the cache address BN4. The data caches at all other levels are fully associative and are addressed by cache addresses DBN. The mappings between each DBN and BN4 are performed by active tables and correlation tables: the active tables map lower-level cache addresses to higher-level cache addresses, and the correlation tables map higher-level cache addresses to lower-level cache addresses. Refer to FIG. 19 for the mechanism.
FIG. 19 is a schematic diagram of the operating mechanism of the data cache hierarchy in the FIG. 18 embodiment. In FIG. 19, 200 is a level-4 cache block containing two level-3 cache blocks 201 and 202. Each level-3 cache block in turn contains two level-2 cache blocks; for example, 201 contains level-2 cache blocks 203 and 204. Each level-2 cache block in turn contains two level-1 cache blocks; for example, 203 contains level-1 cache blocks 205 and 206. Assuming the DBN1 address in field 138 of the current row of step table 150 points to level-1 cache block 205, the system uses the length of step size 140 to obtain, with the fewest mapping steps and the least delay, the next level-1 data cache address for the same data load instruction, so as to access level-1 data cache memory 162 in advance and store the read data into the corresponding DRB entry.
The following explanation uses FIGS. 18 and 19 together as an example. If the 138 address pointing to 205 is added to 140 and the sum does not exceed the boundary of 205, the sum 181 is used directly as the new level-1 data cache address to address level-1 data cache 162, and the read data is stored into the corresponding entry of DRB 163. If adding the 138 address to 140 yields a sum 181 that exceeds the boundary of 205 but does not exceed the boundary of level-2 cache block 203, the 138 address must be mapped from BN1 format (via level-1 correlation table 176 in the FIG. 18 embodiment) to BN2 format 182. Adder 172 adds address 182 to step size 140; the sum 183 addresses the entry of level-2 active table 168 corresponding to level-2 cache block 203, from which the DBN1X address of level-1 cache block 206 is read out and concatenated with the DBNY 13 on 183 to form a DBN1 that addresses level-1 data cache memory 162 and is also stored into field 138 of 150. If the entry of level-1 data track table 166 corresponding to cache block 205 holds the address of the sequentially next cache block 206, 166 may also be addressed directly with 181 (ignoring the carry-out bit in 181) to obtain the address of 206.
If the sum 181 exceeds the boundary of level-2 cache block 203, the DBN1 format of the 138 address must be mapped via level-1 correlation table 176 to DBN2 format 182, and the DBN2 format then mapped via level-2 correlation table 175 to DBN3 format 186, which is added to step size 140; the sum 174 addresses the entry of level-3 active table 167 corresponding to level-3 cache block 201, from which the DBN2 address 189 of level-2 cache block 204 is read out; 189 then addresses level-2 active table 168 to obtain the address DBN1 of level-1 cache block 207. That address may then address level-1 cache memory 162 via bus 184 to read data into DRB 163, and the address is stored into field 138 of 150. If the sum 181 exceeds the boundary of level-3 cache block 201, the DBN1 address in 138 is mapped via 176 to a DBN2-format address, then via 175 to a DBN3-format address, then via 174 to a BN4-format address; that BN4 address addresses level-4 active table 120 to obtain DBN3-format address 125; the DBN3 address addresses level-3 active table 167 to obtain DBN2 address 189; the DBN2 address addresses level-2 active table 168 to obtain the address DBN1 of level-1 cache block 207. That address may then address level-1 cache memory 162 via bus 184 to read data into DRB 163, and the address is stored into field 138 of 150.
In the embodiments of FIGS. 18 and 19, the cache blocks at the various levels of the data cache hierarchy form a tree structure. A level-4 cache block is the root of the tree, and the cache blocks at the other levels are its branches at different depths; the cache blocks at each level are in turn roots for the blocks at higher levels. Root and branches, and branches and branches, are connected into a tree by bidirectional address mappings. Starting from one level-1 branch (a level-1 cache block), any other level-1 branch under the same root (the same level-4 cache block) can be reached through these mappings. Only when the target falls outside the range of the root is matching by the tag unit in 51 required. If the target branch and the source branch belong to the same sub-root, fewer mapping levels are traversed; if they belong to different sub-roots, more mapping levels are traversed. The FIG. 18 embodiment can be improved to reduce the number of mapping levels.
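The number of mapping levels traversed depends on how far up the tree the source and target level-1 blocks first share an ancestor. Under the two-blocks-per-level structure of FIG. 19, this can be sketched as follows (the index arithmetic and function name are invented purely for illustration):

```python
# Count how many levels up the FIG. 19 tree a source and a target level-1
# block first share an ancestor, assuming every block contains exactly two
# blocks of the next higher level. 0 means the same L1 block; 3 means the
# two blocks share only the L4 root. Beyond the root, tag-unit matching
# in 51 would be required instead.

def mapping_levels(src_l1_index, dst_l1_index):
    level = 0
    while src_l1_index != dst_l1_index:
        src_l1_index //= 2      # climb one level: two children per block
        dst_l1_index //= 2
        level += 1
    return level
```

Fewer shared-ancestor levels mean fewer correlation-table and active-table lookups, which is exactly the cost the improvement described next reduces.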
Please refer to FIG. 20, an improved embodiment of the data cache hierarchy of the FIG. 18 embodiment. In FIG. 20, level-3 data cache memory 160, level-2 data cache memory 161, level-1 data cache memory 162, data read buffer 163; step table 150; level-3 data track table 164, level-2 data track table 165, level-1 data track table 166; level-3 data active table 167, level-2 data active table 168; adders 172 and 173; level-3 data correlation table 174, level-2 data correlation table 175; and selector 192 function the same as the identically numbered modules in the right half of FIG. 18. The format of level-1 data correlation table 176 is shown as 209. It stores not only the level-2 data cache block number DBN2X of a level-1 cache block, but also the corresponding level-3 data cache block number DBN3X and the level-4 cache block number DBN4X.
Its operation is similar to the FIG. 18 embodiment: the DBN1 address in field 138 of the row of step table 150 corresponding to the data load instruction, and the step size selected by the branch decision (e.g., 140), are output and added in adder 173. The system performs a boundary determination on output 181 of 173. If the boundary determination finds the result within the level-1 cache block, level-1 data cache memory 162 is addressed directly with 181. If the boundary determination finds it outside the level-1 cache block, a row 209 of level-1 correlation table 176 is addressed with the address on 138, the cache address of one level in 209 is selected according to the boundary determination, and adder 172 adds it to step size 140, producing the sum 183. If the boundary determination finds the result within the level-2 cache block, DBN2X in 209 is selected for the addition with 140, and the sum 183 is sent by the system to level-2 active table 168 for addressing; if within the level-3 cache block, DBN3X in 209 is selected, and the sum 183 is sent to level-3 active table 167 for addressing; if within the level-4 cache block, DBN4X in 209 is selected, and the sum 183 is sent to level-4 active table 120 for addressing. The remaining operations are the same as in the FIG. 18 embodiment and are not repeated. The FIG. 20 embodiment saves the reverse-mapping steps and latency from branch to root. In addition, a dedicated adder may be provided to add 140 to the address formed by concatenating BN4X in 209 with the DBNY 13 in 138; the sum is used to address the tag unit in 51, mapping the BN address to a data address for comparison with the correct data address on bus 49.
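The boundary determination that selects which entry of row 209 to use can be sketched as a same-block test at each granularity. The block sizes below are invented powers of two purely for illustration, with each block holding two blocks of the next higher level as in FIG. 19; the function name is likewise illustrative.

```python
# Illustrative sketch of the FIG. 20 boundary determination: one check per
# level decides which entry of correlation-table row 209 feeds the adder,
# and therefore which active table resolves the new address.

L1_BLOCK, L2_BLOCK, L3_BLOCK, L4_BLOCK = 8, 16, 32, 64  # bytes, invented

def select_level(addr, stride):
    """Return which level's table must resolve addr + stride (5 = tag unit)."""
    target = addr + stride                      # adder output, sum 181
    if target // L1_BLOCK == addr // L1_BLOCK:
        return 1   # still in the L1 block: address data memory 162 directly
    if target // L2_BLOCK == addr // L2_BLOCK:
        return 2   # DBN2X + stride -> level-2 active table 168
    if target // L3_BLOCK == addr // L3_BLOCK:
        return 3   # DBN3X + stride -> level-3 active table 167
    if target // L4_BLOCK == addr // L4_BLOCK:
        return 4   # DBN4X + stride -> level-4 active table 120
    return 5       # beyond the L4 block: tag-unit matching in 51
```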
Please refer to FIG. 21, an embodiment of prefetching data organized by logical relationships. Data may contain address pointers, i.e., be organized by logical relationships. This embodiment takes prefetching data organized as a binary tree as an example; prefetching data organized by other logical relationships can be deduced by analogy. 220-222 are data in memory, where 220 is a data item, 221 is the address pointer of the left branch of the binary tree, and 222 is the address pointer of the right branch. In FIG. 21, data cache memory 162, data read buffer 163, data track table 166, selector 192, instruction memory 22, IRB 39, and processor core 23 function the same as the identically numbered modules in FIG. 18. Some modules not shown in FIG. 21 function the same as the identically numbered modules in the FIG. 18 embodiment. A shifter 225, a learning engine 226, and a selector 227 are added. Comparison result 228 is taken from processor core 23. In this embodiment, the entries of data track table (DTT) 166 correspond one-to-one with the data entries of data memory (DL1) 162.
The learning engine 226 is responsible for generating the entries of data track table (DTT) 166. 230-232 are the entries in DTT 166 corresponding to data 220-222 in 162. Every entry in 166 has a ‘valid bit’; data-type entry 230 corresponds to data entry 220, and pointer entries 231 and 232 contain, in DBN format, the address pointers from 221 and 222 respectively. Data-type entries and pointer entries each have their own identifier so that the two can be distinguished. DBN-format addresses can directly address data memory 162.
Data read pointer 181 controls the reading of one row of track from data track table 166; if the DBNY value in the pointer is near the end of a row, the next row in address order is also read out according to the BN address in that row’s end track point, and sent to shifter 225. In 225, the one or two rows of track are shifted left by the amount indicated by the DBNY in data read pointer 181. Learning engine 226 receives the shifted plurality of entries, determines data-type entry 230 from the identifiers in those entries, and decides 226’s operations on pointer entries 231 and 232 according to the data type in data-type entry 230. Comparison result 228 produced by processor core 23 controls selector 227 to select one of the plurality of pointers output by 226 to place on data read pointer 181, addressing data memory (DL1) 162 to supply data to processor core 23.
For example, the data value in entry 220 of data memory 162 is ‘6’, entry 221 holds the 32-bit address ‘L’, and entry 222 holds the 32-bit address ‘R’. Correspondingly, the data type in entry 230 of data track table 166 is binary tree, and the control signal is the comparison result 228 produced by processor core 23 executing the instruction whose address is ‘YYY’; 231 holds the DBN-format address pointer ‘DBNL’ obtained by mapping the ‘L’ address pointer in 221, and 232 holds the DBN-format address pointer ‘DBNR’ obtained by mapping the ‘R’ address in 222. Learning engine 226 examines the plurality of entries from shifter 225 and selects data-type entry 230 according to the identifiers; based on the binary-tree data type in 230, 226 outputs entries 231 and 232 from shifter 225 to the two inputs of selector 227. Suppose the instruction at address ‘YYY’ compares the sought value ‘8’ with the value ‘6’ of 220 loaded into 23 from (DL1) 162, producing a comparison result 228 of ‘1’, meaning the sought value is greater than the value in current node 220. 226 observes address 28, which controls level-1 memory 22; after it reaches ‘YYY’, the comparison result 228 produced by the processor core controls selector 227. 228 at this point controls 227 to select the right-branch pointer ‘DBNR’ in entry 232 for output to data read pointer 181. If the valid bit in entry 232 is ‘valid’, the data pointed to by the right-branch pointer in 232 becomes the new current data. Selector 192 selects 181 to address 162 (DL1), and the new current data output is stored into DRB 163. 181 also addresses DTT 166, making 166 output the data track containing the new current data to shifter 225. The intra-block offset portion DBNY of the address on 181 controls shifter 225 to shift the data track left so that the data type, the DBNL address, and the DBNR address (formats as in 230, 231, 232) are aligned with the inputs of learning engine 226.
Each entry of DRB 163 corresponds to an intra-block offset address (Offset, DBNY). 162 (DL1) stores the entire data block into 163 (if the data defined by data type 230, such as 220-222, extends beyond one data block, then the portion starting at the ‘DBNR’ address and crossing into the next data block in address order is stored as well). Processor core 23 addresses DRB 163 with the Offset portion of the data address 94 produced by executing a load instruction, reading the current data and its left-branch and right-branch address pointers (formats as in 220, 221, 222). Processor core 23 executes the instruction comparing the sought value ‘8’ with the current data, producing comparison result 228.
Learning engine 226 monitors address 28, the comparison result 228 produced by processor core 23, data address 94, and the corresponding data 223 output by (DL1) 162, in order to generate data track entries to be stored into DTT 166. When the corresponding entry in 166 is ‘invalid’ (not yet established), the data cache system sends the data address 94 produced by processor core 23 to tag unit 51 (not shown) for matching and mapping into DBN address 184. 184 addresses data memory 162, and the read data is output to processor core 23 via 223. Learning engine 226 records the address on 94 and the data on 223 output by the entry of data memory 162 it addresses. 226 also compares each newly produced data address 94 with the previously recorded data on 223; if they are the same, learning engine 226 matches and maps the newly produced data address 94 and stores the resulting DBN into the data track table 166 entries corresponding to the data entries that held that same recorded 223 data, and sets those entries to ‘valid’. That is, the ‘DBNL’ obtained by matching and mapping the address pointer ‘L’ in 221 is stored into 231, and the ‘DBNR’ obtained by matching and mapping the address pointer ‘R’ in 222 is stored into 232. Alternatively, 226 may record and compare the mapped BN-format data and addresses.
226 judges a data memory 162 entry meeting the following conditions to be a ‘data’ (non-pointer) entry: the data address of the entry itself differs from the addresses of the above entries containing address pointers by only one or a few data lengths, and over a plurality of instruction loop iterations the data on 223 is never identical to a subsequent address on 94. The extent of the instruction loop can be determined from the address of a backward-jumping branch instruction in IRB 39 and its branch target instruction address. The data track table 166 entry corresponding to a ‘data’ entry in data memory 162 is a data-type entry. Learning engine 226 stores the observed rule (i.e., when address 28 is ‘YYY’, select the BN address in 231 if 228 is ‘0’, and the BN address in 232 if 228 is ‘1’) into the data track table entry (here 230) corresponding to the ‘data’ (here 220), and sets that entry to ‘valid’. The valid bit in a data-type entry may be a plurality of bits; for example, greater than a preset value is ‘valid’, and not greater than the preset value is ‘invalid’.
After the data track entries are established, the comparison result 228 produced by processor core 23 executing instructions controls selector 227 to select address pointers, moving data read pointer 181 along the binary tree. When a new data point is reached, according to its data type (e.g., 230), learning engine 226 controls reading the same group of data and its address pointers (e.g., 220-222) out of data cache 162 and storing them into DRB 163, ready to be read by the data address 94 produced by processor core 23. This avoids the delay of data address 94 addressing data memory 162 only after tag-unit matching. The access delay of data read buffer DRB 163 is a single clock cycle, generally also less than the access delay of 162.
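The traversal just described behaves like a binary search in which each step’s child node has already been staged before the compare executes. The sketch below is a software analogy only: the node fields stand in for entries 220-222, the comparison result for 228, and the staging variable for the DRB; all names are invented for illustration.

```python
# Software analogy of the FIG. 21 binary-tree walk: the comparison result
# selects the left (DBNL) or right (DBNR) pointer as the next data read
# pointer, and the chosen node group is staged ahead of the core's access.

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value      # data entry (as 220)
        self.left = left        # left-branch pointer (as 221 -> DBNL)
        self.right = right      # right-branch pointer (as 222 -> DBNR)

def search_with_prefetch(root, key, trace=None):
    """Walk the tree as the hardware would: each step stages the chosen
    child before the compare instruction reads it."""
    staged = root               # staging buffer holds the current node group
    while staged is not None:
        node = staged           # the core reads the staged copy, not memory
        if trace is not None:
            trace.append(node.value)
        if key == node.value:
            return True
        # comparison result: '1' selects DBNR, '0' selects DBNL
        staged = node.right if key > node.value else node.left
    return False
```

In the hardware the same selection is made by 227 under the control of 228, so the load that reads the next node never waits on tag matching.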
Further, the data read buffer can be organized in the manner of the FIG. 18 embodiment, i.e., the entries of 163 correspond one-to-one with the entries of IRB instruction read buffer 39. In this organization an additional field is added to each entry of data track table (DTT) 166 to record the address or tag of the instruction that reads the data in the data memory 162 entry corresponding to that entry (for example, the sequence number of the load instruction within the instruction loop, and the BNY address of the instruction). When learning engine 226 controls reading data out of 162 according to an entry in 166, the data is stored into the DRB 163 entry corresponding to the tag in that entry. When a load instruction in IRB 39 is pushed to the processor core for execution, the data in the DRB entry corresponding to that instruction’s IRB entry is also pushed to processor core 23 for use. This eliminates the load delay.
Learning engine 226 performs a form of learning. What is learned is stored in data track table 166 in the form of data types and address pointers. The data type read from the data track table is used to control 226’s own handling of the other entries read from the data track, such as moving a particular entry at 226’s input to a particular 226 output, or controlling the polarity of comparison result 228, so that selector 227, under the control of 228, selects the correct address pointer to place on data read pointer 181, addressing data memory 162 to output data (e.g., 220). The data type also controls 226 to generate and output one or more subsequent addresses (increments added to the correct pointer address, the increments being integer multiples of the data word length), addressing 162 to output the other data of the same group (e.g., 221, 222). The data type is thus the control setting for 226: for example, the IRB address or tag at which comparison result 228 is produced, the polarity of 228, and the number of subsequent addresses to generate. Learning engine 226 also compares the DBN address placed on bus 181 with the DBN 184 obtained by matching and mapping the data address 94 produced by processor core 23; if they differ, the valid value in the data-type entry of the corresponding DTT 166 is decremented by ‘1’, and the mapped DBN 184 is placed on bus 181 to address data memory 162 to read the correct data, and also to address DTT 166 to read the corresponding track entry. Learning engine 226 relearns any 166 entry whose valid value has decreased to ‘0’.
The FIG. 21 embodiment can be used in combination with the FIG. 18 embodiment. Learning engine 226 continuously monitors the data types in the data track table, and also monitors the data on data memory output 223 and the data address 94 output by processor core 23. If the data on 223 differs from the subsequent address on 94, the valid value in the data-type entry of DTT 166 corresponding to the data memory 162 entry that output the data is decremented by ‘1’. If the data on 223 is the same as the subsequent address on 94, the valid value of the data-type entry is incremented by ‘1’. For a group of data whose corresponding data-type entry has a valid value greater than a preset value, the system operates in the manner of the FIG. 21 embodiment, i.e., it assumes the data contains pointers. For a valid value not greater than the preset value, the system operates in the manner of the FIG. 18 embodiment, i.e., assuming the data contains no address pointers, it computes a DBN address by ‘step size’ and reads the data in data memory 162 into DRB 163 for use by processor core 23. Thereafter, each time the address on 181 produced in the manner of the FIG. 21 embodiment is the same as the address on 94, the valid value is incremented by ‘1’; if different, it is decremented by ‘1’. This is the reward for learning engine 226. Data-type entry 230 may further include a field recording whether this group of data is operated on in the manner of the FIG. 18 embodiment, the FIG. 21 embodiment, or otherwise.
Figure 22 is an embodiment handling function call (Call) and function return (Return) instructions. The level-one cache 22, processor core 23, track table 20, incrementer 24, selector 25, and register 26 in Figure 22 function identically to the identically numbered modules in the Figure 2 embodiment. A stack 233 and a selector 236 are newly added. When the scanner scans instructions to extract the instruction type format, it decodes whether an instruction is a call or return instruction and records this in the instruction type format of field 11 of the track table entry (see Figure 1). When the instruction type on track table output 29 in Figure 22 is a call instruction and the TAKEN signal 31 is 'branch taken', a controller (not shown) pushes the BNX in register 26, together with the BNY output by incrementer 24, onto stack 233. When the instruction type on track table output 29 is a return instruction, the controller controls selector 236 to select the output of stack 233. When 31 is 'branch taken', the BN at the top of stack 233 is popped into register 26, returning the program to execute the instruction following the call instruction.
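The push/pop behavior of stack 233 can be sketched as a small return-address stack holding (BNX, BNY) pairs. This is a hypothetical software model; the class and method names and the fixed depth are assumptions.

```python
class ReturnStack:
    """Sketch of stack 233: on a taken call, push the fall-through (BNX, BNY);
    on a taken return, pop it back into register 26."""

    def __init__(self, depth=16):
        self.entries = []
        self.depth = depth  # assumed finite hardware depth

    def on_call_taken(self, bnx, bny_incremented):
        # BNX from register 26 plus the incrementer-24 output (BNY of the call
        # plus one), i.e. the track-table address of the instruction after the
        # call instruction.
        if len(self.entries) < self.depth:
            self.entries.append((bnx, bny_incremented))

    def on_return_taken(self):
        # Selector 236 picks the stack top; it is popped into register 26 so
        # execution resumes at the instruction following the call.
        return self.entries.pop()
```

Nested calls unwind in last-in, first-out order, matching the usual call/return pairing.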
The instruction type (field 11) of indirect branch instructions may also be subdivided to provide guidance to the cache system. There is a class of indirect branch instructions that jump to the same instruction address on every execution, or whose generated instruction address on each execution is the instruction address generated on the previous execution plus a 'stride'. Such indirect branch instructions are recorded in track table entry field 11 as repeating-class indirect branch instructions, and the generated instruction address and stride are recorded in the stride table 150 of Figure 17. Alternatively, the generated BNX and BNY instruction addresses may be stored in fields 12 and 13 of the track table entry respectively (see the Figure 1 embodiment), with the stride table recording only the stride. The specific operation follows the way data addresses are generated in the Figure 17 and Figure 18 embodiments, and is not repeated here. Because the cache system of the present invention can actively provide non-branch instructions and direct branch instructions to the processor core, and indirect branch target addresses are generated based on register or memory contents, a processor core using the cache system of the present invention does not need to retain a program counter that generates instruction addresses. Program-debug hardware breakpoints can be mapped to BN-format addresses and compared with the tracker's BN, triggering an interrupt on a match. Accordingly, the processor core does not need the pipeline stages associated with instruction fetch.
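The repeating class of indirect branches described above can be modeled as a stride record of the kind the stride table 150 of Figure 17 would hold. The sketch below is an illustration under our own assumptions (the `predict`/`update` names and the re-learning rule are not from the patent); a stride of 0 covers the always-same-target case.

```python
class RepeatingIndirectEntry:
    """Hypothetical stride-table record for a repeating-class indirect branch:
    predicted next target = last generated target + stride."""

    def __init__(self, last_target, stride=0):
        self.last_target = last_target
        self.stride = stride

    def predict(self):
        # Target the cache system can prepare ahead of execution.
        return self.last_target + self.stride

    def update(self, actual_target):
        # Re-learn the stride from the target actually produced, so a change
        # in the branch's pattern is picked up after one execution.
        self.stride = actual_target - self.last_target
        self.last_target = actual_target
```
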
Please refer to Figure 23, another embodiment of the processor system of the present invention. Figure 23 is an improvement of the Figure 8 embodiment, in which the level-three active list 50, the level-three cache's TLB and tag unit 51, the level-three cache memory 52, the selector 54, the level-two track table 88, the level-two active list 40, the level-two cache memory 42, the track table 20, the level-one cache's correlation table 37, the level-one cache memory 22, the instruction read buffer 39, the tracker 47, the tracker 48, and the processor core 23 function identically to the identically numbered modules in the Figure 8 embodiment. A Track Read Buffer (TRB) 238 and selectors 237 and 239 are added.
TRB 238 stores the tracks corresponding to the instruction blocks stored in IRB 39. Processor core 23 has two front-end pipelines: FT (fall-through, sequential next) and TG (target). Tracker 0 (TR0) 48 provides the BNY increment 38 that controls IRB 39 to feed the sequential instruction stream to the FT pipeline of processor core 23; tracker 1 (TR1) 47 reads ahead along the track in the TRB to obtain the TG addresses on the track. A BN1-format TG address addresses the L1 instruction memory 22, and a BN2-format TG address addresses the L2 instruction memory 42; each reads out its TG instruction, and, depending on whether the TG that may be executed next in program order is in BN1 or BN2 format, selector 239 is controlled to select one and send it to the TG pipeline. The TAKEN signal 31 selects whether the output of the FT or the TG front-end pipeline is completed by the back-end pipeline. When a branch is taken, the TG instruction block corresponding to the branch instruction, coming from L2 or L1, is selected by selector 239 and stored into IRB 39; the track corresponding to that TG instruction block, coming from the level-two track table (TT2) 88 or the track table (TT) 20, is likewise selected by selector 237 and stored into TRB 238 for TR1 47 to read. If this TG instruction block was read out of L2 instruction memory 42 by the BN2X address on the track, it is also stored into the level-one storage block in L1 instruction memory 22 pointed to by the BN1X provided by the replacement logic. That BN1X is also stored into the entry of the AL2 active list 40 pointed to by the BN2X. A BN3-format address on a track output by the level-two track table 88 is sent over bus 89 to AL3 50 and mapped to a BN2 address (or, when the AL3 entry is invalid, it addresses L3 52, and the instruction block read out is stored into a level-two storage block of L2 42 whose block address is BN2X). That BN2 address replaces the original BN3 address on the track.
By the same principle, a BN2-format address on a track output from TT2 88 or TT 20, or on a track in TRB 238, can be mapped to BN1 format through AL2 40 (or by addressing L2 42 and storing into L1 22 to obtain a BN1 address). In this embodiment, TT2 88 stores TG addresses in BN3 or BN2 format, TT 20 stores only addresses in BN2 or BN1 format, while TRB 238 allows TG addresses in BN3, BN2, or BN1 format. The restriction on BN formats in TT2 and TT triggers instructions to be filled from lower memory levels to higher memory levels, avoiding the fills triggered by cache misses, and thus the unavoidable misses, of traditional cache mechanisms. It also guarantees that a branch target instruction resides at the same or the next memory level as the direct branch instruction. Because TR1 47 reads ahead the TG addresses on the track, the access latency of L2 42 or L1 22 can be partially or fully hidden. If an instruction segment contains dense branch instructions, the TG addresses on the corresponding track can deliberately be interleaved in BN1 and BN2 formats to hide the access latencies of 42 and 22 as far as possible. If the address read from the TRB is in BN3 format and the corresponding branch is taken, processor core 23 waits until the BN2 format mapped from that BN3 address (the mapping starts as soon as the track is output from TT2 88, so the AL3 or L3 latency can be partially or fully hidden) has been filled into the track in TRB 238, and then executes the branch target instruction. If the corresponding branch is not taken, processor core 23 does not wait but directly executes the next sequential instruction; the mapped BN2 format is filled into the track once obtained. After all BN3-format addresses on a track in TRB 238 have been replaced with BN2 format, the track is filled into the row of TT 20 indicated by the BN1X provided by the replacement logic described above. In this embodiment, the system can, according to the track output by the level-two track table 88 or the level-one track table 20, control the level-two instruction memory 42 or the level-one instruction memory 22 to provide TG instructions to processor core 23, while IRB 39 provides the sequential instructions to the processor core. In this embodiment, proceeding to the next sequential instruction block is handled as a branch: the instruction type in the end track point of the track is set to unconditional branch, so the handling is the same as the branch process described above. The method and system of this embodiment are also applicable to other multi-level track instruction cache systems, such as the embodiments of Figures 11, 12, 13, and 18.
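The BN-format restriction in TT2 and TT implies that a TG address is promoted one level (BN3 to BN2, BN2 to BN1) whenever its track moves toward a higher table. A minimal sketch of that promotion step, with plain dictionaries standing in for the active lists AL3/AL2 and callbacks standing in for the "address the lower memory, fill the block, allocate a block number" path, might look like the following (all names are our assumptions):

```python
def promote_address(addr, fmt, al3, al2, fetch_from_l3, fetch_from_l2):
    """Map a track-table TG address one level up: BN3 -> BN2 before a track
    enters TT2's BN2-only territory, BN2 -> BN1 before it enters TT."""
    if fmt == "BN3":
        # Valid AL3 entry: reuse the recorded BN2; otherwise fill L2 from L3
        # and record the newly allocated BN2 in AL3.
        bn2 = al3.get(addr)
        if bn2 is None:
            bn2 = fetch_from_l3(addr)
            al3[addr] = bn2
        return bn2, "BN2"
    if fmt == "BN2":
        # Same pattern one level up, through AL2 and L2 -> L1 fill.
        bn1 = al2.get(addr)
        if bn1 is None:
            bn1 = fetch_from_l2(addr)
            al2[addr] = bn1
        return bn1, "BN1"
    return addr, fmt  # already BN1: nothing to do
```

Because the fill happens when the format restriction is hit, not when the processor misses, the lower level is populated before the target instruction is ever demanded.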
Returning to Figure 12, both application forms of the structure in the Figure 12 embodiment admit further specific embodiments; for example, the functional modules in Figure 12 may sit at the two ends of a long-latency communication channel. Suppose memory 111 in Figure 12 is located at one end of the communication channel and the remaining modules are located at the other end. The communication channel may be between one processor core and the memory of another processor core on the same chip; between one processor lane and the memory of another processor lane on the same chip; between a processor core on one chip and memory on another chip; between the processor of one computer and the memory of another computer; between a processor core or computer and memory at the other end of a wired or wireless network; or any other long-latency communication channel.
The following uses a network channel as an example. An IPv6 address is 128 bits; assuming the memory address is 64 bits, the IPv6 address and the memory address are combined into a single 192-bit address to address memory at the far end of the network. To support this 192-bit address, only components 43, 51, and 113 in Figure 12 need to satisfy the 192-bit width, but their functions and operation remain the same; none of the remaining components needs any change on account of this 192-bit width. Specifically, the TLB/TAG unit 51 must be able to store tags supporting 192-bit addresses (for example, a 128-bit tag plus a 64-bit memory tag), and scanner 43 must be able to add the 192-bit current instruction block address provided by 51, the branch instruction's intra-block offset, and the branch offset to obtain a 192-bit branch target address. This 192-bit branch target address is matched against the contents of the tag unit TAG in 51. If there is no match, the 192-bit branch target address is sent over bus 113 to memory 111 at the other end of the channel to fetch instructions. If it matches, then, as described earlier for the Figure 12 embodiment, a BN3 or BN2 address is stored into the level-two track table 88; this is not repeated here. Other channels, such as a local area network or the link between the processor cores and memories of different computers, can be supported in the same way: simply prefix the memory address with the network address, within the connected network, of the computer, memory, or other functional unit concerned. Memory 112 in the Figure 12 embodiment may also be placed together with memory 111 at the other end of the communication channel.
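Under the stated assumption of a 128-bit IPv6 prefix and a 64-bit memory address, composing and splitting the 192-bit network memory address is plain bit concatenation, as this sketch shows:

```python
def make_network_memory_address(ipv6_addr: int, mem_addr: int) -> int:
    """Concatenate a 128-bit network (IPv6) address with a 64-bit memory
    address into the single 192-bit value matched against the TAG in 51."""
    assert 0 <= ipv6_addr < (1 << 128) and 0 <= mem_addr < (1 << 64)
    return (ipv6_addr << 64) | mem_addr

def split_network_memory_address(nma: int):
    """Inverse operation: recover the network prefix and the memory address,
    e.g. to place the network part in a packet header and the memory part in
    the packet payload."""
    return nma >> 64, nma & ((1 << 64) - 1)
```

Since the scanner's adder need only cover the memory-address portion, carries are assumed here never to propagate into the network prefix, matching the optimized adder width discussed below.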
The above specific embodiments of the application forms of the Figure 12 structure can also be applied to the structures of Figures 13 and 18. Taking Figure 18 as an example, suppose memory 111 in Figure 18 is at one end of the communication channel and the remaining modules are at the other end. Then, as in the above embodiment, as long as the widths of the TLB/TAG unit 51, scanner 43, and bus 113 can support the memory address width with the network address prefix, operation of the instruction memory at the far end of the communication channel is supported. The specific embodiment for Figure 13 is the same as the instruction-memory portion of Figure 18 above and is not repeated. In Figure 18, if memories 111 and 112 also store data, then the adder 169 that generates data addresses, and its output bus 198, must likewise support memory addresses with the network address prefix described above. Apart from the widths of modules 51, 43, and 169 and buses 113 and 198, none of the remaining modules in Figure 18 needs any change, because they all operate on cache addresses. The network memory address (network address + memory address) is mapped to a cache address by the tag unit TAG in 51. The width of the cache address depends on the organization of the cache and is independent of the network memory address.
When memory 111 and the other modules of Figure 18 sit at the two ends of a network, the address on bus 113 may be transmitted in packets; in that case the network address portion of the network memory address can be placed in the packet header, and the memory address portion in the packet payload. When memory 111 is accessible to multiple processor cores or computers, 111 should contain an arbiter to determine the order of access. In the processor core, a thread register stores the network address corresponding to each thread. The adder 169 in Figure 18, or the adder in scanner 43, may use a bit width equal to that of the network memory address, but an optimized implementation need only satisfy the memory address width. While the adder computes the memory address of the branch target or data, the thread register is addressed by the thread number currently executing, and the network address stored for that thread is read out. That network address is concatenated with the computed memory address to form the network memory address, which is sent to the tag unit TAG in 51 for matching.
Likewise, the tag unit in 51 can store multiple network memory addresses, for example at 192 bits per entry, but several optimizations are possible. One uses two tables: each entry in Table 2 stores, in addition to the memory-address tag, a row number of Table 1, and each entry in Table 1 stores a network address. The network address portion of a network memory address is first matched against the contents of Table 1 to obtain a Table 1 row number; the obtained row number, concatenated with the memory address, is then matched against Table 2. A Table 2 match yields the cache address; on a miss, the network memory address is sent over bus 113 to fetch instructions or data from memory 111 into memory 112. Another optimization uses only Table 2, whose entries store the row number (or thread number) of the thread register in addition to the memory-address tag. Here the thread register row number (or thread number) is concatenated with the memory address and matched against Table 2. On a miss, the network address read from the thread register addressed by that row number (or thread number) is concatenated with the memory address to form the network memory address, and instructions or data are fetched over bus 113 from memory 111 into memory 112. The additional cost actually required is therefore small.
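The two-table optimization can be sketched as interning the wide network address once in Table 1, so that each Table 2 entry carries only a short row index plus the memory-address tag instead of 128 extra bits. Everything below (names, dictionary-based tables, miss signaling via `None`) is an illustrative assumption:

```python
class TwoLevelTag:
    """Sketch of the two-table TAG in 51: Table 1 holds network addresses,
    Table 2 maps (table1_row, memory_address) to a cache address."""

    def __init__(self):
        self.table1 = []   # row number -> network address
        self.table2 = {}   # (row number, memory address) -> cache address

    def _intern_network(self, net_addr):
        # First-level match: find (or allocate) the Table 1 row for this
        # network address.
        if net_addr in self.table1:
            return self.table1.index(net_addr)
        self.table1.append(net_addr)
        return len(self.table1) - 1

    def lookup(self, net_addr, mem_addr):
        # Second-level match on the concatenated (row, memory address); None
        # models a miss that would go out on bus 113 to memory 111.
        row = self._intern_network(net_addr)
        return self.table2.get((row, mem_addr))

    def fill(self, net_addr, mem_addr, cache_addr):
        # Record the mapping after the missing block has been fetched into 112.
        row = self._intern_network(net_addr)
        self.table2[(row, mem_addr)] = cache_addr
```

The thread-register variant described above replaces `_intern_network` with a direct thread-number index, trading one associative search for a register file read.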
The scanner 43 in the embodiments of Figures 12, 13, and 18 computes the branch target instruction address of a branch instruction based on the address, obtained from the tag unit in 51, of the instruction block containing the branch instruction. The tag unit in 51 stores physical addresses, so the branch target instruction address computed by scanner 43 is a physical address. As long as the branch target instruction's physical address does not cross a physical page boundary, it can be matched directly against the contents of the tag unit in 51 without TLB mapping. Similarly, in the Figure 18 embodiment, the data address generated by adder 169 using the physical address in 51's tag unit as the base address is also a physical address; as long as it does not cross a physical page boundary, it can be matched directly against the contents of 51's tag unit without TLB mapping. The match yields the BN address of the lowest-level cache. In Figures 4, 5, 12, 13, and 18, only the indirect branch instruction address on bus 46 is a virtual address, which must be mapped to a physical address by the TLB in 51; scanner 43 and data address generator 169 both produce physical addresses that can be matched directly in 51's TAG. The other addresses that address the last-level cache — bus 29 in Figures 4 and 5, bus 89 in Figures 8, 11, and 12, and bus 119 in Figures 13 and 18 — are in cache address format BN, and can directly address the last-level cache memory, the active list AL, the correlation table CT, and the tag unit TAG in 51, without mapping through the TLB or the tag unit TAG in 51.
Although the embodiments of the present invention describe only structural features and/or method processes of the invention, it should be understood that the claims of the invention are not limited to those features and processes. On the contrary, the described features and processes are merely examples of implementing the claims of the invention. It should be understood that the components listed in the above embodiments are listed only for convenience of description; other components may be included, or some components may be combined or omitted. The components may be distributed across multiple systems, may be physical or virtual, and may be implemented in hardware (such as integrated circuits), in software, or in a combination of hardware and software.
Obviously, in light of the description of the preferred embodiments above, regardless of how the technology in this field develops and whatever advances, not yet easy to predict, may be made in the future, one of ordinary skill in the art may, according to the principles of the present invention, make corresponding substitutions, adjustments, and improvements to the relevant parameters and configurations, and all such substitutions, adjustments, and improvements fall within the protection scope of the appended claims.
Industrial Applicability
The systems and methods proposed by the present invention can be used in various computing and data processing systems, information and data storage systems, and communication systems. The systems and methods proposed by the present invention can hide or significantly reduce storage-system access latency as well as cache misses.
Sequence Listing Free Text

Claims (30)

1. A processor system, comprising a processor core and a cache; characterized in that:
    the cache pushes instructions and data to the processor core for the processor core to execute and process.
2. The system of claim 1, characterized in that:
    the processor core provides branch decisions to the cache system;
    the cache examines the instructions stored in it, extracting and storing the control flow information of those instructions;
    the cache pushes instructions to the processor core for execution according to the control flow information and the branch decisions.
3. The system of claim 2, characterized in that:
    the processor core provides the cache system with the base address of an indirect branch instruction;
    the cache generates the indirect branch target address from the base address and pushes the indirect branch target instruction to the processor core for execution.
4. The system of claim 1, characterized in that:
    the processor core pipeline has no instruction fetch pipeline stage;
    the processor core does not generate instruction addresses;
    the processor core does not provide instruction addresses to the cache to read instructions.
5. The system of claim 1, characterized in that:
    the cache of the system is connected to a memory;
    the cache generates memory addresses and provides them to the memory;
    the memory provides instructions to the cache according to the memory addresses.
6. The system of claim 2, characterized in that:
    only the lowest storage level of the cache performs virtual-to-physical address translation;
    only the lowest storage level of the cache maps memory addresses to cache addresses.
7. The system of claim 2, characterized in that:
    the lowest storage level of the cache is organized in a set-associative manner;
    the storage levels of the cache other than the lowest are organized in a fully associative manner.
8. The system of claim 2, characterized in that:
    a scanner is provided between each pair of adjacent storage levels in the cache;
    the scanner examines the instructions passed between the adjacent storage levels to extract control flow information.
9. The system of claim 2, characterized in that:
    a scanner is provided in the cache between the lowest storage level and the second-lowest storage level;
    the scanner examines the instructions passed between the lowest and second-lowest storage levels to extract control flow information;
    the extracted control flow information is stored for use by storage levels higher than the second-lowest.
10. The system of claim 2, characterized in that:
    the highest storage level of the cache has a first read port and a second read port;
    the highest storage level of the cache has a first tracker and a second tracker;
    according to the stored control flow information and the branch decision, the first and second trackers control the first and second read ports to provide the processor core with both the sequential instruction and the branch target instruction following a branch instruction;
    the processor core executes the branch instruction and produces a branch decision;
    the processor core uses the branch decision to determine whether the sequential instruction or the branch target instruction is executed and written back.
11. The system of claim 3, characterized in that:
    the cache stores pairs of the base address of the indirect branch instruction and the indirect branch target instruction;
    the cache can provide the stored indirect branch target instruction to the processor core according to the indirect branch instruction and the base address.
12. A cache replacement method, characterized in that the cache block to be replaced is selected on the principle of least degree of association.
13. The method of claim 12, characterized in that the cache block to be replaced is further selected on the principle of having been replaced earliest in the past.
14. The method of claim 12, characterized in that:
    the cache blocks in the cache keep association records;
    the association record records, as the degree of association, the number of instructions that take the cache block as their branch target.
15. The method of claim 12, characterized in that:
    the cache blocks in the cache keep association records;
    the association record records, as the degree of association, the number of higher-level cache blocks whose contents are identical to part or all of the contents of the cache block.
16. The method of claim 12, characterized in that:
    control flow information, in which branch target addresses are recorded, is stored in the cache;
    the cache blocks in the cache keep association records;
    the association record records the address of the cache block in the next-lower storage level;
    the association record records the addresses of the branch source cache blocks that take the cache block as their branch target;
    when the cache block is replaced, the address of the cache block recorded in the control flow information is replaced with the cache block's next-lower-storage-level address.
17. The method of claim 12, characterized in that:
    the control flow information is queried to determine the addresses of the higher-storage-level cache blocks corresponding to the contents of a lower-storage-level cache block;
    those corresponding higher-storage-level cache blocks are replaced so as to reduce the degree of association of the lower-storage-level cache block.
18. The method of claim 12, characterized in that:
    cache blocks that have no association with other cache blocks are replaced.
19. An information processing method, characterized in that:
    a cache system pushes instructions and data to a processor core for the processor core to execute.
20. The method of claim 19, characterized by comprising:
    Step A: the processor core provides branch decisions to the cache system;
    Step B: the cache examines the instructions stored in it, extracting and storing the control flow information of those instructions;
    Step C: the cache pushes instructions to the processor core for execution by the processor core according to the control flow information and the branch decisions.
21. The method of claim 19, characterized in that:
    the processor core provides the cache system with the base address of an indirect branch instruction;
    the cache generates the indirect branch target address from the base address and pushes the indirect branch target instruction to the processor core for execution.
  22. 如权利要求19所述的方法,其特征在于:The method of claim 19 wherein:
    由循迹器提供缓存器地址寻址所述缓存器向所述处理器核推送指令;Providing a buffer address by the tracker to address the buffer to push an instruction to the processor core;
    按线程存储所述循迹器中的及所述处理器核中的寄存器状态;Storing, by a thread, a register state in the tracker and in the processor core;
    按线程将所述存储的寄存器状态与所述循迹器及所述处理器核中的状态互换以进行线程切换。The stored register state is swapped by threads with states in the tracker and the processor core for thread switching.
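The thread switch of claim 22 amounts to exchanging the live tracker/core state with a per-thread saved copy. The class below is a hypothetical software sketch (the names `PushProcessor`, `switch_to`, and the 4-register file are invented assumptions); it shows only the swap discipline, not any real hardware interface:

```python
# Hypothetical sketch of claim-22 thread switching: live tracker and core
# register state is swapped with a saved per-thread copy.

class PushProcessor:
    def __init__(self):
        self.tracker_pc = 0          # tracker's cache address (live state)
        self.core_regs = [0] * 4     # core register file (live state)
        self.saved = {}              # per-thread saved state, keyed by thread id

    def switch_to(self, thread_id):
        """Swap live state with the saved state of thread_id; return the
        outgoing live state so the caller can file it under the old thread."""
        outgoing = {"tracker_pc": self.tracker_pc, "core_regs": self.core_regs}
        incoming = self.saved.get(thread_id,
                                  {"tracker_pc": 0, "core_regs": [0] * 4})
        self.tracker_pc = incoming["tracker_pc"]
        self.core_regs = incoming["core_regs"]
        return outgoing

cpu = PushProcessor()
cpu.tracker_pc, cpu.core_regs = 100, [1, 2, 3, 4]
cpu.saved[1] = cpu.switch_to(0)   # save thread 1's state, resume thread 0
print(cpu.tracker_pc)             # 0: thread 0 starts from fresh state
```

Because the tracker address is saved and restored alongside the core registers, the cache can resume pushing instructions for the incoming thread from exactly where that thread left off.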
  23. The method of claim 19, wherein:
    the cache system uses main memory as the lowest-level cache;
    the main memory is addressed by cache addresses.
  24. The method of claim 23, wherein:
    the cache is addressed by real addresses;
    the cache system performs no virtual-to-real address translation.
  25. The method of claim 19, wherein:
    the main memory is composed of non-volatile memory together with volatile memory;
    the volatile memory acts as a cache for the non-volatile memory.
  26. The method of claim 19, wherein:
    the storage blocks in each storage level of the cache are organized as a tree;
    the storage blocks in the different storage levels are linked by mapping relationships.
  27. The method of claim 26, wherein:
    the mapping relationship may be a forward mapping from a lower storage level to a higher storage level;
    the mapping relationship may be a reverse mapping from a higher storage level to a lower storage level.
  28. The method of claim 27, wherein:
    the higher-level address of a cache block is reverse-mapped to a lower-level cache address;
    a target offset is added to the lower-level cache address to obtain the lower-level cache address of the next data load;
    the target lower-level cache address is forward-mapped to the higher-level cache address of the next data load.
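The three-step address computation of claim 28 (reverse map, add offset, forward map) can be traced numerically. The block size, the mapping tables, and the function name below are invented assumptions for illustration; the point is only the round trip between address spaces:

```python
# Hypothetical sketch of the claim-28 computation: high-level address ->
# reverse map -> low-level address -> + target offset -> forward map ->
# high-level address of the next data load. Tables are invented examples.

BLOCK = 16  # assumed cache block size in address units

# forward mapping: low-level block number -> high-level block number
forward = {5: 0, 6: 1, 7: 2}
reverse = {hi: lo for lo, hi in forward.items()}   # high -> low (inverse)

def next_load_high_addr(high_addr, offset):
    hi_block, intra = divmod(high_addr, BLOCK)
    low_addr = reverse[hi_block] * BLOCK + intra    # step 1: reverse mapping
    target_low = low_addr + offset                  # step 2: add target offset
    low_block, intra2 = divmod(target_low, BLOCK)
    return forward[low_block] * BLOCK + intra2      # step 3: forward mapping

# high block 0 <-> low block 5; an offset of one block lands in low block 6,
# which forward-maps to high block 1.
print(next_load_high_addr(4, 16))   # 20, i.e. high block 1, intra-block offset 4
```

Doing the stride arithmetic in the contiguous low-level address space is what makes the scheme work: consecutive high-level blocks need not be adjacent, but their low-level images are, so a fixed data-access stride survives the round trip.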
  29. The method of claim 28, wherein:
    data is fetched from the cache using the next-data-load higher-level cache address before the processor requests it;
    the data is pushed to the processor along with the corresponding data access instruction.
  30. The method of claim 28, wherein:
    the target offset is selected according to the branch decision of a backward-jumping branch instruction that follows the relevant instruction.
PCT/CN2016/080039 2015-04-23 2016-04-22 Instruction and data push-based processor system and method WO2016169518A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/568,715 US20180088953A1 (en) 2015-04-23 2016-04-22 A processor system and method based on instruction and data push

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
CN201510201436.1 2015-04-23
CN201510201436 2015-04-23
CN201510233007.2 2015-05-06
CN201510233007.2A CN106201913A (en) 2015-04-23 2015-05-06 A kind of processor system pushed based on instruction and method
CN201510267964.7A CN106201914A (en) 2015-04-23 2015-05-20 A kind of processor system pushed based on instruction and data and method
CN201510267964.7 2015-05-20
CN201610188651.7 2016-03-21
CN201610188651.7A CN106066787A (en) 2015-04-23 2016-03-21 A kind of processor system pushed based on instruction and data and method

Publications (1)

Publication Number Publication Date
WO2016169518A1 true WO2016169518A1 (en) 2016-10-27

Family

ID=57142872

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/080039 WO2016169518A1 (en) 2015-04-23 2016-04-22 Instruction and data push-based processor system and method

Country Status (1)

Country Link
WO (1) WO2016169518A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1375767A * 2001-07-03 2002-10-23 IP-First LLC Apparatus and method for providing branch instruction and relative target instruction to buffering zone
CN101763249A * 2008-12-25 2010-06-30 STMicroelectronics (Beijing) R&D Co., Ltd. Branch checkout for reduction of non-control flow commands
CN103870394A * 2012-12-13 2014-06-18 ARM Limited Retention priority based cache replacement policy

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112527395A (en) * 2020-11-20 2021-03-19 海光信息技术股份有限公司 Data prefetching method and data processing apparatus
CN112527395B (en) * 2020-11-20 2023-03-07 海光信息技术股份有限公司 Data prefetching method and data processing apparatus

Similar Documents

Publication Publication Date Title
US9053049B2 (en) Translation management instructions for updating address translation data structures in remote processing nodes
TW201638774A (en) A system and method based on instruction and data serving
US6920531B2 (en) Method and apparatus for updating and invalidating store data
WO2016131428A1 (en) Multi-issue processor system and method
US20040088489A1 (en) Multi-port integrated cache
WO2014000624A1 (en) High-performance instruction cache system and method
EP2517100A1 (en) High-performance cache system and method
WO2013000400A1 (en) Branch processing method and system
US20150356025A1 (en) High-performance cache system and method
JP3449487B2 (en) Conversion index buffer mechanism
JP7184815B2 (en) Conversion assistance for virtual cache
WO2013071868A1 (en) Low-miss-rate and low-miss-penalty cache system and method
JP2001195303A (en) Translation lookaside buffer whose function is parallelly distributed
WO2015002481A1 (en) Apparatus and method for managing buffer having three states on the basis of flash memory
US11372647B2 (en) Pipelines for secure multithread execution
WO2007099598A1 (en) Processor having prefetch function
US6678789B2 (en) Memory device
WO2018199646A1 (en) Memory device accessed on basis of data locality and electronic system including same
WO2015024532A9 (en) System and method for caching high-performance instruction
US6076145A (en) Data supplying apparatus for independently performing hit determination and data access
WO2016169518A1 (en) Instruction and data push-based processor system and method
WO2014000626A1 (en) High-performance data cache system and method
KR101100143B1 Cache memory control device and pipeline control method
JP3970705B2 (en) Address translator, address translation method, and two-layer address translator
WO2017073957A1 (en) Electronic device and method for managing memory thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16782665

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 15568715

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16782665

Country of ref document: EP

Kind code of ref document: A1