WO2014206217A1 - Management method for instruction cache, and processor - Google Patents


Info

Publication number
WO2014206217A1
WO2014206217A1 (PCT/CN2014/080059)
Authority
WO
WIPO (PCT)
Prior art keywords
cache
instruction
hardware thread
instruction cache
private
Prior art date
Application number
PCT/CN2014/080059
Other languages
French (fr)
Chinese (zh)
Inventor
Guo Xubin (郭旭斌)
Hou Rui (侯锐)
Feng Yujing (冯煜晶)
Su Dongfeng (苏东锋)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2014206217A1 publication Critical patent/WO2014206217A1/en


Classifications

    • G06F12/0875 — Addressing of a memory level requiring associative addressing means (caches), with dedicated cache, e.g. instruction or stack
    • G06F12/0842 — Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
    • G06F9/3802 — Instruction prefetching
    • G06F9/3851 — Instruction issuing from multiple instruction streams, e.g. multistreaming

Definitions

  • This application claims priority to Chinese patent application No. 201310269557.0, filed with the Chinese Patent Office on June 28, 2013 and entitled "A Method of Cache Management and Processor", the entire contents of which are incorporated herein by reference.
  • The present invention relates to the field of computers, and in particular to a method and a processor for managing an instruction cache.
  • A CPU (Central Processing Unit) cache (Cache Memory) is a small, fast memory located between the CPU and main memory.
  • Its capacity is much smaller than that of main memory, but it bridges the gap between the CPU's operating speed and the memory's read/write speed, raising the effective CPU read speed.
  • In a multi-threaded processor, multiple hardware threads fetch instructions from the same I-Cache (instruction cache). When an instruction to be fetched is not in the I-Cache, a miss request is sent to the next-level cache, and the processor switches to other hardware threads, which continue fetching from the I-Cache; this reduces the stalls caused by I-Cache misses and improves pipeline efficiency. However, because the shared I-Cache resources allocated to each hardware thread are insufficient, the I-Cache miss rate rises and miss requests to the next-level cache become frequent. When instructions retrieved from the next-level cache are backfilled and the number of threads grows, the cache line containing the backfilled instruction may be filled into the I-Cache without being used immediately, while the cache line it replaces may be about to be used again.
  • Embodiments of the present invention provide a method and a processor for managing an instruction cache, which can expand the instruction cache capacity available to each hardware thread, reduce the instruction cache miss rate, and improve system performance.
  • A first aspect provides a processor comprising: a program counter, a register file, an instruction prefetch component, an instruction decode component, an instruction issue component, an address generation unit, an arithmetic logic unit, a shared floating point unit, a data cache, and an internal bus; it further comprises:
  • a shared instruction cache, configured to store the shared instructions of all hardware threads, including a tag storage array and a data storage array, the tag storage array being configured to store tags, the data storage array including stored instructions and hardware thread identifiers, where a hardware thread identifier identifies the hardware thread corresponding to a cache line in the shared instruction cache;
  • a private instruction cache, configured to store instruction cache lines replaced out of the shared instruction cache, where each private instruction cache corresponds to one hardware thread;
  • a miss buffer, configured to save, when a fetched instruction is not present in the shared instruction cache, the cache line retrieved from the next-level cache of the shared instruction cache in the miss buffer of the requesting hardware thread;
  • tag comparison logic, configured to compare, when a hardware thread fetches an instruction, the tags in the private instruction cache corresponding to that hardware thread against the physical address translated by the translation lookaside buffer; the private instruction cache is coupled to the tag comparison logic so that the hardware thread accesses its private instruction cache while accessing the shared instruction cache.
  • the processor is a multi-threaded processor;
  • the private instruction cache has a fully associative structure: the fully associative structure maps any instruction block in the private instruction cache to any instruction block in main memory;
  • the shared instruction cache, the private instruction cache, and the miss buffer are static memory chips or dynamic memory chips.
  • A second aspect provides a method for managing an instruction cache, including:
  • when a hardware thread of the processor fetches an instruction from the instruction cache, simultaneously accessing the shared instruction cache in the instruction cache and the private instruction cache corresponding to the hardware thread; determining whether the instruction exists in the shared instruction cache and in the private instruction cache corresponding to the hardware thread, and obtaining the instruction from the shared instruction cache or from the private instruction cache according to the result of the determination.
  • The shared instruction cache includes a tag storage array and a data storage array; the tag storage array is configured to store tags, and the data storage array includes stored instructions and hardware thread identifiers.
  • A hardware thread identifier identifies the hardware thread corresponding to a cache line in the shared instruction cache.
  • The private instruction cache has a fully associative structure, in which any instruction block in main memory can map to any instruction block in the private instruction cache; each private instruction cache corresponds to one hardware thread.
  • Determining whether the instruction exists in the shared instruction cache and in the private instruction cache corresponding to the hardware thread, and obtaining the instruction according to the result, includes:
  • if the instruction exists in the shared instruction cache, obtaining the instruction from the shared instruction cache; if the instruction exists in the private instruction cache corresponding to the hardware thread but not in the shared instruction cache, obtaining the instruction from the private instruction cache corresponding to the hardware thread.
  • The method further includes:
  • the hardware thread obtaining the instruction from the next-level cache, storing the cache line containing the instruction in the miss buffer corresponding to the hardware thread, and backfilling the cache line into the shared instruction cache when the hardware thread next fetches an instruction;
  • each miss buffer corresponds to one hardware thread.
  • In a fourth possible implementation, when the cache line is backfilled into the shared instruction cache and the shared instruction cache has no idle resources, the cache line replaces a first cache line in the shared instruction cache and is backfilled into the shared instruction cache; according to the hardware thread identifier of the first hardware thread that fetched the first cache line, the first cache line is stored in the private instruction cache corresponding to the first hardware thread;
  • the first cache line is determined by a least-recently-used algorithm.
  • In a fifth possible implementation, when the replaced first cache line is to be stored in the private instruction cache corresponding to the first hardware thread and that private instruction cache has no idle resources, the first cache line replaces a second cache line in the private instruction cache corresponding to the first hardware thread and is backfilled into that private instruction cache;
  • the second cache line is determined by the least-recently-used algorithm.
  • Embodiments of the present invention provide a method and a processor for managing an instruction cache.
  • The processor includes a program counter, a register file, an instruction prefetch component, an instruction decode component, an instruction issue component, an address generation unit, an arithmetic logic unit, a shared floating point unit, a data cache, and an internal bus, and further includes a shared instruction cache, private instruction caches, miss buffers, and tag comparison logic.
  • The shared instruction cache stores the shared instructions of all hardware threads and includes a tag storage array and a data storage array.
  • The data storage array includes stored instructions and hardware thread identifiers, and a hardware thread identifier identifies the hardware thread corresponding to a cache line in the shared instruction cache.
  • A private instruction cache stores instruction cache lines replaced out of the shared instruction cache, and each private instruction cache corresponds to one hardware thread.
  • The tag comparison logic compares, when a hardware thread fetches an instruction, the tags in that thread's private instruction cache with the physical address translated by the translation lookaside buffer; the private instruction cache is coupled to the tag comparison logic so that the hardware thread accesses its private instruction cache while accessing the shared instruction cache.
  • When fetching, it is determined whether the instruction exists in the shared instruction cache and in the private instruction cache corresponding to the hardware thread, and the instruction is obtained from the shared instruction cache or the private instruction cache according to the result, which expands the instruction cache capacity of each hardware thread, reduces the instruction cache miss rate, and improves system performance.
  • FIG. 1 is a schematic structural diagram of a processor according to an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of a method for managing an instruction cache according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of simultaneous access to a shared instruction cache and a private instruction cache according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a method for retrieving a cache line according to a cache miss request according to an embodiment of the present invention.
  • When the instruction cache capacity of the L1 Cache allocated to each hardware thread is too small, L1 misses occur and the L1 miss rate rises, increasing the traffic between the L1 Cache and the L2 Cache; fetching from the L2 Cache, or from main memory, increases processor power consumption.
  • An embodiment of the present invention provides a processor 01, as shown in FIG. 1, including a program counter 011, a register file 012, an instruction prefetch component 013, an instruction decode component 014, an instruction issue component 015, an address generation unit 016, an arithmetic logic unit 017, a shared floating point unit 018, a data cache 019, and an internal bus; it further includes:
  • a shared instruction cache 020, configured to store the shared instructions of all hardware threads, including a tag storage array (Tag Array) 0201 and a data storage array (Data Array) 0202;
  • the tag storage array 0201 is used to store tags;
  • the data storage array 0202 includes the stored instructions 02021 and hardware thread identifiers (Thread ID) 02022, used to identify the hardware thread corresponding to each cache line in the shared instruction cache 020;
  • a private instruction cache 021, used to store instruction cache lines replaced out of the shared instruction cache 020, each private instruction cache 021 corresponding to one hardware thread;
  • a miss buffer 022, configured to hold, when a fetched instruction is not present in the shared instruction cache 020, the cache line retrieved from the next-level cache; when the corresponding hardware thread next fetches an instruction, the cache line in the miss buffer 022 is backfilled into the shared instruction cache; each miss buffer 022 corresponds to one hardware thread;
  • tag comparison logic, which, when a hardware thread fetches an instruction, compares the tags in the private instruction cache corresponding to the hardware thread with the PA (Physical Address) translated by the TLB (Translation Lookaside Buffer);
  • the private instruction cache 021 is coupled to the tag comparison logic so that the hardware thread accesses the private instruction cache 021 while accessing the shared instruction cache 020.
  • TLB: Translation Lookaside Buffer, a page table buffer that stores a number of page table entries (virtual-to-physical address translations); the virtual address of a fetched instruction can be translated into a physical address through the TLB, so that the hardware thread accesses the private instruction cache while accessing the shared instruction cache.
  • PC: Program Counter.
  • GRF: General Register File; each logical processor (hardware thread) in a processor core corresponds to one GRF, and the number of GRFs equals the number of PCs.
  • ALU: Arithmetic Logic Unit.
  • CPU: Central Processing Unit.
  • D-Cache: data cache.
  • Bus: internal bus.
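  • The translate-then-compare path above can be sketched as a small function. The TLB is modelled here as a dict of virtual-page to physical-page mappings and the private Cache as a set of line tags; these structures, the names, and the 4 KB page / 64-byte line geometry are illustrative assumptions, not the patent's own definitions.

```python
def private_cache_lookup(vaddr, tlb, private_tags, page_bits=12, line_bits=6):
    """Translate a virtual fetch address through a toy TLB, then compare
    the resulting physical address against the tags held in the thread's
    private Cache. Returns (private_hit, paddr)."""
    vpn = vaddr >> page_bits
    offset = vaddr & ((1 << page_bits) - 1)
    if vpn not in tlb:
        return False, None                 # no translation: cannot compare tags
    paddr = (tlb[vpn] << page_bits) | offset
    line_tag = paddr >> line_bits          # drop byte-in-line bits (64 B lines)
    return line_tag in private_tags, paddr
```

For example, with a TLB entry mapping virtual page 0x1 to physical page 0x80, fetching virtual address 0x1040 produces physical address 0x80040 and hits only if tag 0x2001 is in the private Cache.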
  • the processor 01 is a multi-threaded processor;
  • the private instruction cache 021 has a fully associative structure;
  • the fully associative structure maps any instruction cache line in the private instruction cache to any instruction block in main memory;
  • the shared instruction cache 020, the private instruction cache 021, and the miss buffer 022 are static memory chips or dynamic memory chips.
  • Thread ID: hardware thread identifier, added to the I-Cache Data Array to indicate which hardware thread's cache miss request brought each Cache Line in.
  • When a hardware thread misses in the L1 I-Cache, that is, the instruction the hardware thread wants is not in the I-Cache, L1 sends a Cache Miss request to L1's next-level cache, the L2 Cache; if the L2 Cache hits, the cache line (Cache Line) containing the instruction is returned to the hardware thread.
  • When the hardware thread receives the returned Cache Line, the Cache Line is not filled into the L1 Cache directly; instead it is stored in the Miss Buffer corresponding to the hardware thread, and is filled into the L1 Cache only when the hardware thread next fetches.
  • When backfilling causes a replacement, the replaced Cache Line is not discarded directly; using the Thread ID of the hardware thread corresponding to the replaced Cache Line,
  • the replaced Cache Line is filled into the private instruction cache of that hardware thread.
  • The replacement may be caused by the absence of idle resources in the L1 Cache, and the Cache Line to replace may be chosen by the LRU (Least Recently Used) algorithm.
  • LRU: Least Recently Used.
  • The LRU algorithm replaces the longest-unused instruction line out of the cache when an instruction cache miss occurs; in other words, the cache preferentially retains the most frequently used instructions.
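  • The LRU rule above can be sketched with an ordered map: hits move a line to the most-recently-used end, and a miss on a full cache evicts the oldest entry. The function name and dict-based cache model are illustrative, not from the patent.

```python
from collections import OrderedDict

def lru_touch_or_insert(cache, capacity, line_addr):
    """Touch or insert a line under LRU. `cache` is an OrderedDict
    ordered oldest-first; returns the evicted line address, or None."""
    if line_addr in cache:
        cache.move_to_end(line_addr)           # hit: now most recently used
        return None
    victim = None
    if len(cache) >= capacity:
        victim, _ = cache.popitem(last=False)  # evict least recently used
    cache[line_addr] = True
    return victim
```

For example, filling a two-line cache with lines A and B, then touching A again, makes B the victim when a third line arrives.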
  • When a hardware thread fetches an instruction, it can access the I-Cache and the private Cache corresponding to the hardware thread simultaneously:
  • if the fetched instruction exists in the I-Cache, it is obtained from the I-Cache;
  • if the instruction is not in the I-Cache but exists in the private Cache corresponding to the hardware thread, it is obtained from that private Cache;
  • if the instruction exists in both, it is obtained from the I-Cache;
  • if the instruction exists in neither, the hardware thread sends a Cache Miss request to the next-level cache of the I-Cache to obtain the fetched instruction.
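  • The four cases above amount to a fixed lookup priority, sketched below with the caches modelled as plain dicts of address to instruction; the function and return labels are illustrative stand-ins.

```python
def fetch(addr, i_cache, private_cache):
    """Lookup priority: the shared I-Cache wins whenever it hits
    (including when both caches hold the line), the thread's private
    Cache covers I-Cache misses, and anything else becomes a Cache Miss
    request to the next-level cache."""
    if addr in i_cache:
        return "I-Cache", i_cache[addr]
    if addr in private_cache:
        return "private Cache", private_cache[addr]
    return "Cache Miss", None          # forward to next-level cache
```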
  • While the shared instruction cache is accessed, the private Cache corresponding to the hardware thread also participates in the tag comparison:
  • the tags read from the private Cache are compared with the PA (Physical Address) output by the TLB (Translation Lookaside Buffer) to generate a private Cache Miss signal and the private Cache data output;
  • when the private Cache Miss signal indicates a hit, the instruction is present and is output.
  • An embodiment of the present invention thus provides a processor including a program counter, a register file, an instruction prefetch component, an instruction decode component, an instruction issue component, an address generation unit, an arithmetic logic unit, a shared floating point unit, a data cache, and an internal bus, and further including a shared instruction cache, private instruction caches, miss buffers, and tag comparison logic.
  • A hardware thread identifier is added to the data storage array of the shared instruction cache to record which hardware thread's cache miss request retrieved each cache line; when a line is replaced out of the shared instruction cache, the replaced cache line is stored, according to its hardware thread identifier, in the private instruction cache of the corresponding hardware thread.
  • When the hardware thread receives the cache line returned by a cache miss request, it does not fill the cache line back into the shared instruction cache immediately; it holds the line in the miss buffer and backfills it into the shared instruction cache only when the hardware thread next fetches, which reduces the chance that a cache line about to be accessed is replaced out of the instruction cache. In addition, the added private instruction caches increase the cache capacity of each hardware thread and improve system performance.
  • A further embodiment of the present invention provides a method for managing an instruction cache, as shown in FIG. 2, including:
  • When a hardware thread of the processor fetches an instruction from the instruction cache, the processor simultaneously accesses the shared instruction cache in the instruction cache and the private instruction cache corresponding to the hardware thread.
  • The processor (Central Processing Unit) can be a multi-threaded processor.
  • A physical core can host multiple hardware threads, also called logical cores or logical processors, but a hardware thread is not a physical core; Windows treats each hardware thread as a schedulable logical processor, and each logical processor can run the code of a software thread.
  • The instruction cache can consist of the shared instruction cache (I-Cache) in the processor's L1 Cache together with the private instruction caches of the hardware threads.
  • The L1 Cache includes a data cache (D-Cache) and an instruction cache (I-Cache).
  • A fully associative private Cache can be provided for each hardware thread, i.e., each private Cache corresponds to one hardware thread.
  • The fully associative structure maps any instruction block in the private instruction cache to any instruction block in main memory.
  • Tag comparison logic can be added;
  • the private Cache of the hardware thread is connected to the tag comparison logic, so that when a hardware thread fetches an instruction, the I-Cache and the private Cache are accessed simultaneously.
  • The processor determines whether the instruction exists in the shared instruction cache and in the private instruction cache corresponding to the hardware thread, and then proceeds to step 103, 104, 105, or 106.
  • When the hardware thread accesses the I-Cache and its private Cache, it determines whether the fetched instruction exists in the I-Cache and in the private Cache corresponding to the hardware thread.
  • For example, 32 hardware threads share a 64 KB I-Cache, i.e., the shared instruction cache capacity is 64 KB.
  • Each hardware thread has a 32-way fully associative private Cache that can store 32 replaced Cache Lines of 64 bytes each, so each private Cache has a capacity of 2 KB.
  • While accessing the shared I-Cache, the hardware thread compares the 32 tags read from its private Cache with the PA (Physical Address) output by the TLB, generating a private Cache Miss signal and the private Cache data output; if one of the 32 tags matches the PA, the private Cache Miss signal indicates that the fetched instruction exists in the hardware thread's private Cache, and the private Cache data output is a valid instruction, as shown in FIG. 3.
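  • The capacity arithmetic behind the example above is straightforward; the "effective" total below (shared plus all private caches) is my framing of the sizes, not a figure stated in the patent.

```python
# 32 hardware threads share a 64 KB I-Cache; each thread also gets a
# 32-way fully associative private Cache of 32 replaced 64-byte lines.
LINE_BYTES = 64
PRIVATE_WAYS = 32
THREADS = 32
SHARED_KB = 64

private_kb_per_thread = PRIVATE_WAYS * LINE_BYTES // 1024   # 2 KB each
total_private_kb = THREADS * private_kb_per_thread          # 64 KB in total
effective_kb = SHARED_KB + total_private_kb                 # 128 KB reachable overall
```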
  • PA: Physical Address.
  • TLB: page table buffer, which stores page table entries (virtual-to-physical address translations); the virtual address of the fetched instruction can be translated into a physical address through the TLB and compared with the tags of the private instruction cache, so that the hardware thread accesses the private instruction cache while accessing the shared instruction cache.
  • If the fetched instruction exists in the I-Cache, the processor obtains the instruction from the shared instruction cache;
  • if the instruction exists in both the shared instruction cache and the private instruction cache, the processor likewise obtains it from the I-Cache;
  • if the I-Cache misses, that is, does not hold the fetched instruction, and the private Cache corresponding to the hardware thread does hold it, the processor obtains the instruction from the private instruction cache corresponding to the hardware thread.
  • In this way, by letting the private Cache corresponding to the hardware thread participate in the tag comparison, the Cache capacity available to each hardware thread is expanded and the hit rate of the hardware thread's instruction cache is increased.
  • If the instruction exists in neither cache, the processor, through the hardware thread, sends a cache miss request to the next-level cache of the shared instruction cache;
  • that is, the hardware thread issues a Cache Miss request to the next-level cache of the I-Cache.
  • For example, the hardware thread sends a Cache Miss request to the L2 Cache, the next level below the L1 Cache, to obtain the fetched instruction from the L2 Cache.
  • The processor obtains the instruction from the next-level cache through the hardware thread and stores the cache line containing the instruction in the miss buffer corresponding to the hardware thread; when the hardware thread next fetches an instruction, the cache line is backfilled into the shared instruction cache.
  • If the instruction exists in the L2 Cache, it is obtained from the L2 Cache, and the Cache Line containing it is not backfilled into the L1 Cache directly; instead, the Cache Line is saved in the Miss Buffer corresponding to the hardware thread and filled into the L1 Cache only when the hardware thread next fetches.
  • The Miss Buffer corresponds one-to-one with the hardware threads, i.e., each hardware thread has its own Miss Buffer.
  • Each hardware thread uses a Miss Buffer to hold the Cache Line returned by its Cache Miss request, because a replacement occurs when the Cache Line is backfilled into the L1 Cache,
  • and the replaced Cache Line might itself be a Cache Line about to be accessed.
  • The Miss Buffer therefore optimizes the backfill timing of the Cache Line and reduces the chance that a Cache Line about to be accessed is replaced.
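  • The Miss Buffer behaviour above — park the returned line, backfill only on the owner's next fetch — can be sketched as a tiny class. The class and method names are illustrative, not taken from the patent.

```python
class MissBuffer:
    """Per-thread Miss Buffer: a Cache Line returned by a Cache Miss
    request is parked here and moved into the shared I-Cache only when
    the owning thread fetches again, so a line about to be accessed is
    not replaced prematurely."""

    def __init__(self):
        self.pending = {}                  # line address -> instruction data

    def on_miss_return(self, line_addr, data):
        self.pending[line_addr] = data     # park; do not backfill yet

    def on_thread_fetch(self, i_cache):
        i_cache.update(self.pending)       # backfill on the next fetch
        self.pending.clear()
```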
  • If the instruction also misses in the next-level cache, the processor sends a miss request to main memory through the hardware thread, obtains the instruction from main memory, and stores the cache line containing the instruction in the miss buffer corresponding to the hardware thread; when the hardware thread next fetches, the cache line is backfilled into the shared instruction cache.
  • That is, the hardware thread issues a Cache Miss request to main memory; if the fetched instruction exists in main memory, it is obtained, and the Cache Line containing it is stored in the Miss Buffer corresponding to the hardware thread until the hardware thread next fetches, at which point the Cache Line is filled into the L1 Cache.
  • If an L3 Cache exists, the hardware thread first sends the Cache Miss request to the L3 Cache; if the fetched instruction is in the L3 Cache it is obtained there, and if not, a Cache Miss request is issued to main memory to obtain the fetched instruction.
  • The unit of exchange between the CPU and the Cache is a word.
  • When the CPU reads a word from main memory,
  • the memory address of the word is sent to the Cache and to main memory at the same time; the Cache control logic (for the L1 Cache, L2 Cache, or L3 Cache)
  • determines from the tag part of the address whether the word is present. If it hits, the CPU obtains the word; if not, the word is read from main memory and delivered to the CPU using a main-memory read cycle. Even though the CPU currently reads only one word,
  • the Cache controller also copies the complete Cache line containing that word from main memory into the Cache. This operation of transferring a whole line into the Cache is called Cache line filling.
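  • The tag test described above implies splitting an address into tag, set index, and byte offset. The sketch below assumes an illustrative geometry of 64-byte lines and 256 sets; the patent itself does not fix these parameters.

```python
def split_address(addr, line_bytes=64, num_sets=256):
    """Decompose a physical address into (tag, set index, byte offset).
    The controller compares the tag; on a miss it copies the whole line
    containing the word, not just the requested word."""
    offset_bits = line_bytes.bit_length() - 1   # 6 bits for 64-byte lines
    index_bits = num_sets.bit_length() - 1      # 8 bits for 256 sets
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset
```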
  • When the cache line is backfilled into the shared instruction cache and the shared instruction cache has no idle resources, the cache line replaces a first cache line in the shared instruction cache and is backfilled into the shared instruction cache; the hardware thread identifier of the first hardware thread that fetched the first cache line is obtained, and the first cache line is stored in the private instruction cache corresponding to the first hardware thread.
  • The first cache line is determined by an LRU (Least Recently Used) algorithm.
  • Thread ID: hardware thread identifier, added to the I-Cache Data Array to indicate which hardware thread's Cache Miss request brought each Cache Line back.
  • The replaced Cache Line is not discarded directly; according to its Thread ID, the replaced Cache Line is filled into the private Cache of the hardware thread identified by that Thread ID, because the replaced Cache Line may be accessed again soon, as shown in FIG. 4.
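  • The replacement path above can be sketched as follows: when the shared I-Cache is full, the victim line is evicted and its stored Thread ID routes it into the owning thread's private Cache instead of being discarded. The dict-based structures and the oldest-first victim choice (a stand-in for LRU) are illustrative assumptions.

```python
def backfill_with_replacement(i_cache, capacity, new_addr, new_data,
                              owner_tid, private_caches):
    """Backfill a line into the shared I-Cache; on eviction, route the
    victim into the private Cache of the thread named by the victim's
    Thread ID (stored alongside the data, as in the Data Array)."""
    if len(i_cache) >= capacity:
        victim_addr = next(iter(i_cache))           # oldest entry as LRU stand-in
        victim_data, victim_tid = i_cache.pop(victim_addr)
        private_caches[victim_tid][victim_addr] = victim_data
    i_cache[new_addr] = (new_data, owner_tid)       # line carries its Thread ID
```

For example, backfilling into a full one-line I-Cache whose resident line belongs to thread 3 moves that line into thread 3's private Cache.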
  • If the private instruction cache corresponding to the first hardware thread has no idle resources, the first cache line replaces a second cache line in that private instruction cache and is backfilled into the private instruction cache corresponding to the first hardware thread.
  • The second cache line is determined by the LRU algorithm.
  • The LRU algorithm replaces the longest-unused instruction line out of the cache when an instruction cache miss occurs; in other words, the cache preferentially retains the most frequently used instructions.
  • In this way, the instruction cache capacity allocated to each hardware thread is effectively expanded, the hit rate of each hardware thread's instruction cache is increased, and the traffic between the I-Cache and the next-level Cache is reduced.
  • The added Miss Buffer optimizes the backfill timing of Cache Lines and reduces the probability that a Cache Line about to be accessed is replaced; the added tag comparison logic lets the shared instruction cache and the private instruction cache be accessed simultaneously on an I-Cache access, increasing the instruction cache hit rate.
  • An embodiment of the present invention provides a method for managing an instruction cache.
  • When a hardware thread of a processor fetches an instruction from the instruction cache,
  • the shared instruction cache in the instruction cache and the private instruction cache corresponding to the hardware thread are accessed simultaneously, and it is determined whether the instruction exists in the shared instruction cache and in the private instruction cache corresponding to the hardware thread.
  • The hardware thread obtains the instruction from the shared instruction cache or from its private instruction cache according to the result; if the instruction exists in neither cache, the hardware thread
  • sends a cache miss request to the next-level cache of the shared instruction cache and stores the cache line containing the instruction in the miss buffer corresponding to the hardware thread.
  • When the hardware thread next fetches, the cache line is backfilled into the shared instruction cache; if the shared instruction cache has no free resources,
  • the cache line replaces a first cache line in the shared instruction cache and is backfilled into the shared instruction cache, and, according to the hardware thread identifier of the first hardware thread that fetched the first cache line,
  • the first cache line is stored in the private instruction cache corresponding to the first hardware thread. This expands the instruction cache capacity of each hardware thread, reduces the instruction cache miss rate, and improves system performance.
  • the disclosed processor and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division into units is only a division by logical function.
  • in actual implementation there may be other division manners; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • each functional unit may be integrated into one processing unit, each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above units may be implemented in the form of hardware, or in the form of hardware plus software functional units.
  • all or part of the steps of the foregoing method embodiments may be implemented by program instructions instructing relevant hardware.
  • the foregoing program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the foregoing method embodiments;
  • the foregoing storage medium includes:
  • any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

A management method for an instruction cache, and a processor, which relate to the field of computers and can expand the instruction cache capacity of hardware threads, reduce the miss rate of an instruction cache, and improve system performance. A hardware thread identifier in a shared instruction cache of a processor is used for identifying the hardware thread corresponding to a cache line in the shared instruction cache. A private instruction cache is used for storing instruction cache lines which are replaced out of the shared instruction cache. A miss buffer is also included. When acquiring an instruction from an instruction cache, a hardware thread of the processor simultaneously accesses the shared instruction cache and the private instruction cache corresponding to the hardware thread, determines whether the shared instruction cache and the private instruction cache corresponding to the hardware thread contain the instruction, and acquires the instruction from the shared instruction cache or the private instruction cache corresponding to the hardware thread according to the determination result. The management method is used for managing an instruction cache of a processor.

Description

A Management Method for an Instruction Cache, and a Processor. This application claims priority to Chinese Patent Application No. 201310269557.0, filed with the Chinese Patent Office on June 28, 2013 and entitled "A Management Method for an Instruction Cache, and a Processor", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of computers, and in particular, to a management method for an instruction cache and a processor.
Background
A CPU (Central Processing Unit) cache (Cache Memory) is temporary storage located between the CPU and main memory. Its capacity is much smaller than that of main memory, but it resolves the mismatch between the CPU's computation speed and the memory's read/write speed, thereby speeding up CPU reads.
In a multi-threaded processor, multiple hardware threads fetch instructions from the same I-Cache (instruction cache). When the instruction to be fetched is not in the I-Cache, a miss request is sent to the next-level Cache while the processor switches to another hardware thread to continue fetching from the I-Cache, which reduces pipeline stalls caused by I-Cache misses and improves pipeline efficiency. However, when the shared I-Cache resources allocated to each hardware thread are insufficient, the I-Cache miss rate increases and miss requests from the I-Cache to the next-level Cache become frequent. Moreover, when a fetched instruction is retrieved from the next-level Cache and backfilled while the number of threads grows, the Cache line containing the filled instruction may be placed into the I-Cache without being used immediately, whereas the replaced Cache line may well be used again.
In addition, when the Thread scheduling policy is adjusted according to Cache hit behavior, the scheduler tries, for a period of time, to preferentially schedule threads whose memory-access instructions have a high hit rate in the Cache, but the problem of each hardware thread's share of the I-Cache being insufficient is not alleviated.
Summary of the Invention
Embodiments of the present invention provide a management method for an instruction cache and a processor, which can expand the instruction cache capacity of hardware threads, reduce the instruction cache miss rate, and improve system performance.
To achieve the above objective, the embodiments of the present invention adopt the following technical solutions:
According to a first aspect, a processor is provided, including a program counter, a register file, an instruction prefetch unit, an instruction decode unit, an instruction issue unit, an address generation unit, an arithmetic logic unit, a shared floating-point unit, a data cache, and an internal bus, and further including:
a shared instruction cache, configured to store shared instructions of all hardware threads and including a tag storage array and a data storage array, where the tag storage array is configured to store tags, the data storage array includes stored instructions and hardware thread identifiers, and a hardware thread identifier is used to identify the hardware thread corresponding to a cache line in the shared instruction cache;
a private instruction cache, configured to store instruction cache lines replaced out of the shared instruction cache, where the private instruction caches are in one-to-one correspondence with the hardware threads; and
a miss buffer, configured to: when a fetched instruction is not present in the shared instruction cache, hold the cache line retrieved from the next-level cache of the shared instruction cache in the miss buffer of the hardware thread, and, when the hardware thread corresponding to the fetched instruction fetches, backfill the cache line in the miss buffer into the shared instruction cache, where the miss buffers are in one-to-one correspondence with the hardware threads.
With reference to the first aspect, in a first possible implementation of the first aspect, the processor further includes: tag comparison logic, configured to compare, when a hardware thread fetches, tags in the private instruction cache corresponding to the hardware thread with the physical address translated by a translation look-aside buffer, where the private instruction cache is connected to the tag comparison logic so that the hardware thread accesses the private instruction cache while accessing the shared instruction cache.
With reference to the first possible implementation of the first aspect, in a second possible implementation, the processor is a multi-threaded processor, and the private instruction cache has a fully associative structure, in which any instruction cache block in main memory can map to any instruction cache block in the private instruction cache.
With reference to the second possible implementation of the first aspect, in a third possible implementation, the shared instruction cache, the private instruction cache, and the miss buffer are static memory chips or dynamic memory chips.
According to a second aspect, a management method for an instruction cache is provided, including:
when a hardware thread of a processor fetches an instruction from an instruction cache, simultaneously accessing a shared instruction cache in the instruction cache and a private instruction cache corresponding to the hardware thread; and determining whether the instruction exists in the shared instruction cache and in the private instruction cache corresponding to the hardware thread, and obtaining the instruction from the shared instruction cache or the private instruction cache corresponding to the hardware thread according to the determination result.
With reference to the second aspect, in a first possible implementation of the second aspect, the shared instruction cache includes a tag storage array and a data storage array, where the tag storage array is configured to store tags, the data storage array includes stored instructions and hardware thread identifiers, and a hardware thread identifier is used to identify the hardware thread corresponding to a cache line in the shared instruction cache; the private instruction cache has a fully associative structure, in which any instruction cache block in main memory can map to any instruction cache block in the private instruction cache; and the private instruction caches are in one-to-one correspondence with the hardware threads.
With reference to the first possible implementation of the second aspect, in a second possible implementation, the determining whether the instruction exists in the shared instruction cache and in the private instruction cache corresponding to the hardware thread, and obtaining the instruction from the shared instruction cache or the private instruction cache corresponding to the hardware thread according to the determination result includes:
if the instruction exists in both the shared instruction cache and the private instruction cache corresponding to the hardware thread, obtaining the instruction from the shared instruction cache;
if the instruction exists in the shared instruction cache but not in the private instruction cache corresponding to the hardware thread, obtaining the instruction from the shared instruction cache; and
if the instruction exists in the private instruction cache corresponding to the hardware thread but not in the shared instruction cache, obtaining the instruction from the private instruction cache corresponding to the hardware thread.
With reference to the second possible implementation of the second aspect, in a third possible implementation, the method further includes:
if the instruction exists in neither the shared instruction cache nor the private instruction cache, sending, by the hardware thread, a cache miss request to the next-level cache of the shared instruction cache;
if the instruction exists in the next-level cache, obtaining, by the hardware thread, the instruction from the next-level cache, storing the cache line containing the instruction in the miss buffer corresponding to the hardware thread, and backfilling the cache line into the shared instruction cache when the hardware thread fetches; and
if the instruction does not exist in the next-level cache, sending, by the hardware thread, the miss request to main memory, obtaining the instruction from main memory, storing the cache line containing the instruction in the miss buffer corresponding to the hardware thread, and backfilling the cache line into the shared instruction cache when the hardware thread fetches;
where the miss buffers are in one-to-one correspondence with the hardware threads.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation, when the cache line is backfilled into the shared instruction cache, if the shared instruction cache has no free resource, the cache line replaces a first cache line in the shared instruction cache and is backfilled into the shared instruction cache, and at the same time, according to the hardware thread identifier of the first hardware thread that fetched the first cache line, the first cache line is stored in the private instruction cache corresponding to the first hardware thread;
where the first cache line is determined by a least recently used (LRU) algorithm.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation, when the replaced first cache line is stored in the private instruction cache corresponding to the first hardware thread, if the private instruction cache corresponding to the first hardware thread has no free resource, the first cache line replaces a second cache line in the private instruction cache corresponding to the first hardware thread and is backfilled into the private instruction cache corresponding to the first hardware thread;
where the second cache line is determined by the least recently used algorithm.
Embodiments of the present invention provide a management method for an instruction cache and a processor. The processor includes a program counter, a register file, an instruction prefetch unit, an instruction decode unit, an instruction issue unit, an address generation unit, an arithmetic logic unit, a shared floating-point unit, a data cache, and an internal bus, and further includes a shared instruction cache, private instruction caches, miss buffers, and tag comparison logic. The shared instruction cache is configured to store shared instructions of all hardware threads and includes a tag storage array and a data storage array; the data storage array includes stored instructions and hardware thread identifiers, where a hardware thread identifier identifies the hardware thread corresponding to a cache line in the shared instruction cache. A private instruction cache stores instruction cache lines replaced out of the shared instruction cache, with the private instruction caches in one-to-one correspondence with the hardware threads. The tag comparison logic compares, when a hardware thread fetches, tags in the private instruction cache corresponding to that hardware thread with the physical address translated by the translation look-aside buffer; the private instruction cache is connected to the tag comparison logic so that the hardware thread accesses the private instruction cache while accessing the shared instruction cache. When a hardware thread of the processor fetches an instruction from the instruction cache, it simultaneously accesses the shared instruction cache and the private instruction cache corresponding to the hardware thread, determines whether the instruction exists in either of them, and obtains the instruction from the shared instruction cache or the private instruction cache according to the determination result. This expands the instruction cache capacity of hardware threads, reduces the instruction cache miss rate, and improves system performance.
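The deferred-backfill behavior of the miss buffer summarized above can be illustrated with a minimal sketch. The sketch is not part of the original disclosure; the class and method names are chosen for illustration only.

```python
class MissBuffer:
    """Illustrative per-thread miss buffer: a cache line returned for a
    cache miss request is parked here and only backfilled into the
    shared instruction cache when the owning hardware thread is next
    scheduled to fetch."""

    def __init__(self):
        self.pending = []  # cache lines waiting to be backfilled

    def store(self, cache_line):
        # Called when the next-level cache returns a line for this thread.
        self.pending.append(cache_line)

    def drain_on_fetch(self, shared_cache):
        # Called when this hardware thread gets its fetch slot.
        shared_cache.extend(self.pending)
        self.pending.clear()
```

Deferring the backfill this way reduces the chance that a line another thread is about to access is displaced from the shared instruction cache before it is used.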
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings described below show merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these accompanying drawings without creative effort.
FIG. 1 is a schematic structural diagram of a processor according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a management method for an instruction cache according to an embodiment of the present invention; FIG. 3 is a logical schematic diagram of simultaneously accessing a shared instruction cache and a private instruction cache according to an embodiment of the present invention;
FIG. 4 is a logical schematic diagram of retrieving a cache line in response to a cache miss request according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are merely some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention. In modern multi-threaded processor design, as the number of hardware threads grows, the shared resources available to each hardware thread become insufficient; this is especially true for the L1 (Level 1) Cache, an important shared resource in the Cache hierarchy. When the instruction cache capacity of the L1 Cache allocated to each hardware thread is too small, misses occur in L1 and the L1 miss rate rises, so communication between the L1 Cache and the L2 Cache increases as instructions are fetched from the L2 Cache, or even from main memory, and processor power consumption grows.
An embodiment of the present invention provides a processor 01. As shown in FIG. 1, the processor includes a program counter 011, a register file 012, an instruction prefetch unit 013, an instruction decode unit 014, an instruction issue unit 015, an address generation unit 016, an arithmetic logic unit 017, a shared floating-point unit 018, a data cache 019, and an internal bus, and further includes:
a shared instruction cache (I-Cache) 020, private instruction caches 021, miss buffers (Miss Buffer) 022, and tag (Tag) comparison logic 023.
The shared instruction cache 020 is configured to store shared instructions of all hardware threads and includes a tag storage array (Tag Array) 0201 and a data storage array (Data Array) 0202. The tag storage array 0201 is configured to store tags; the data storage array 0202 includes stored instructions 02021 and hardware thread identifiers (Thread ID) 02022, where a hardware thread identifier 02022 is used to identify the hardware thread corresponding to a cache line in the shared instruction cache 020.
A private instruction cache 021 is configured to store instruction cache lines replaced out of the shared instruction cache 020; the private instruction caches 021 are in one-to-one correspondence with the hardware threads.
A miss buffer 022 is configured to: when a fetched instruction is not present in the shared instruction cache 020, hold the cache line retrieved from the next-level cache of the shared instruction cache 020 in the miss buffer of the hardware thread, and, when the hardware thread corresponding to the fetched instruction fetches, backfill the cache line in the miss buffer 022 into the shared instruction cache; the miss buffers 022 are in one-to-one correspondence with the hardware threads.
The tag comparison logic is configured to compare, when a hardware thread fetches, tags in the private instruction cache corresponding to that hardware thread with the PA (Physical Address) translated by the TLB (Translation Look-aside Buffer). The private instruction cache 021 is connected to the tag comparison logic so that the hardware thread accesses the private instruction cache 021 while accessing the shared instruction cache 020.
The TLB, also called the page table buffer, stores page table entries (a translation table from virtual addresses to physical addresses). The virtual address of a fetched instruction can be translated into a physical address through the TLB; the physical address is then compared with the tags in the private instruction cache, and if the physical address matches a tag in the private instruction cache, the hardware thread accesses the private instruction cache at the same time as it accesses the shared instruction cache.
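The lookup just described can be sketched as follows. This is a simplified software model, not the original hardware logic; the TLB is modeled as a plain virtual-to-physical address map.

```python
def private_cache_lookup(tlb, private_tags, virtual_addr):
    """Sketch of the tag comparison step: the TLB (here a
    virtual-to-physical address map) translates the fetch address,
    and the result is compared against the tags of the fully
    associative private instruction cache, in parallel with the
    shared-cache lookup."""
    physical_addr = tlb[virtual_addr]    # TLB translation
    hit = physical_addr in private_tags  # fully associative tag compare
    return hit, physical_addr
```

A match produces a private-cache hit signal and the instruction output; a mismatch means the private cache cannot serve this fetch.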
For example, there are 16 PCs (Program Counter), PC0-PC15; the number of logical processor cores (hardware threads) in a processor core is the same as the number of PCs.
GRF (General Register File): each logical processor core in a processor core corresponds to one GRF, and the number of GRFs equals the number of PCs.
The Fetch unit (instruction prefetch unit) is used to fetch instructions; the Decoder (instruction decode unit) decodes instructions; Issue is the instruction issue unit, used to issue instructions; the AGU (Address Generator Unit) is the module that performs all address calculations, generating the addresses used to control memory accesses. The ALU (Arithmetic Logic Unit) is an execution unit of the CPU (Central Processing Unit) and can be built from AND gates and OR gates. The Shared Floating-Point Unit is the circuit unit in the processor dedicated to floating-point arithmetic; the data cache (D-Cache) is used to store data; and the internal bus connects the components in the processor.
For example, the processor 01 is a multi-threaded processor, and the private instruction cache 021 has a fully associative structure, in which any instruction cache block in main memory can map to any instruction cache block in the private instruction cache.
For example, the shared instruction cache 020, the private instruction caches 021, and the miss buffers 022 are static memory chips or dynamic memory chips.
For example, a Thread ID (hardware thread identifier) can be added to the I-Cache Data Array (instruction cache data storage array); the Thread ID indicates which hardware thread issued the Cache Miss request that retrieved the Cache Line.
For example, when a hardware thread's access to the L1 I-Cache misses, that is, when the instruction the hardware thread wants is not in the I-Cache, L1 sends a Cache Miss request to L1's next-level cache, the L2 Cache. If the L2 Cache hits, that is, the instruction the hardware thread wants exists in the L2 Cache, the hardware thread backfills the cache line (Cache Line) containing the instruction from the L2 Cache into the L1 Cache. Alternatively, on receiving the returned Cache Line, the hardware thread does not fill it into the L1 Cache directly but keeps it in the Miss Buffer corresponding to that hardware thread until it is that hardware thread's turn to fetch, at which point the Cache Line is filled into the L1 Cache.
In this way, when backfilling the cache line containing the instruction from the L2 Cache into the L1 Cache causes a replacement, the replaced Cache Line is not discarded directly; instead, according to the Thread ID of the hardware thread corresponding to the replaced Cache Line, the replaced Cache Line is filled into the private instruction cache of that hardware thread.
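A minimal sketch of this backfill-with-replacement step follows. It is not part of the original disclosure; the data layout and the victim choice are simplified for illustration.

```python
def backfill_shared_cache(shared_cache, private_caches, new_line, capacity):
    """Backfill 'new_line' (a dict with 'tag' and 'thread_id') into the
    shared I-Cache. If the cache is full, the replaced line is not
    discarded: the Thread ID stored with it selects the private
    instruction cache that receives it."""
    if len(shared_cache) >= capacity:
        victim = shared_cache.pop(0)  # stand-in for the LRU victim choice
        # The Thread ID stored alongside the line routes the victim
        # to the private cache of the thread that originally fetched it.
        private_caches[victim["thread_id"]].append(victim)
    shared_cache.append(new_line)
```

Because the victim survives in its owner's private cache, a line that is evicted but soon re-used can still be served without going back to the next-level cache.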
For example, a replacement may occur because there is no free resource in the L1 Cache, and the Cache Line to be replaced can be selected by the LRU (Least Recently Used) algorithm.
In the LRU algorithm, once an instruction cache miss occurs, the line that has gone unused for the longest time is replaced out of the cache; in other words, the cache preferentially retains the most recently used instructions.
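The replacement policy can be sketched with Python's `OrderedDict`. This is an illustrative model of LRU behavior, not the hardware implementation.

```python
from collections import OrderedDict

class LRUCacheModel:
    """Illustrative LRU model: a hit marks the line most recently used;
    a miss with no free slot evicts the line unused for the longest
    time, so the most recently used instructions are retained."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # tag -> line data, oldest first

    def access(self, tag):
        """Return (hit, evicted_tag)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True, None
        evicted = None
        if len(self.lines) >= self.capacity:
            evicted, _ = self.lines.popitem(last=False)  # evict LRU line
        self.lines[tag] = "line:" + tag
        return False, evicted
```

For instance, with capacity 3, accessing A, B, C, then A again, then D evicts B: B is the line that has gone unused the longest.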
For example, when a hardware thread fetches, it can access the I-Cache and its private Cache simultaneously.
If the fetched instruction exists in the I-Cache but not in the hardware thread's private Cache, the instruction is obtained from the I-Cache;
if the fetched instruction does not exist in the I-Cache but exists in the hardware thread's private Cache, the instruction is obtained from that private Cache;
if the fetched instruction exists in both the I-Cache and the hardware thread's private Cache, the instruction is obtained from the I-Cache;
if the fetched instruction exists in neither the I-Cache nor the hardware thread's private Cache, the hardware thread sends a Cache Miss request to the next-level cache of the I-Cache to obtain the instruction.
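The four cases above reduce to a simple priority rule, which can be sketched as follows (an illustrative summary, not the original hardware logic):

```python
def select_instruction_source(shared_hit, private_hit):
    """Decision table for the simultaneous lookup: the shared I-Cache
    serves the instruction whenever it hits, the private Cache serves
    it only on a shared-cache miss, and a miss in both escalates to
    the next-level cache."""
    if shared_hit:
        return "shared I-Cache"   # shared hit, alone or together with private
    if private_hit:
        return "private Cache"    # only the private cache hits
    return "Cache Miss request"   # miss in both: go to the next-level cache
```

The shared I-Cache takes priority on a double hit, so the private cache only ever adds capacity; it never changes which copy a hitting thread reads.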
For example, according to the hardware thread scheduling policy, when the next cycle switches to another thread for instruction fetch, the private Cache corresponding to the new hardware thread is connected to the Tag comparison logic while the shared instruction cache is accessed. The Tags read out of that private Cache are compared with the PA (Physical Address) output by the TLB (Translation Look-aside Buffer), producing a private Cache Miss signal and a private Cache data output. When the fetched instruction is present in the private Cache corresponding to the new hardware thread, the private Cache Miss signal indicates a hit and the instruction is output.
Accordingly, an embodiment of the present invention provides a processor comprising a program counter, a register file, an instruction prefetch unit, an instruction decode unit, an instruction issue unit, an address generation unit, an arithmetic logic unit, a shared floating-point unit, a data cache and an internal bus, and further comprising a shared instruction cache, private instruction caches, miss buffers and Tag comparison logic. A hardware thread identifier is added to the data storage array of the shared instruction cache to record which hardware thread issued the cache miss request that retrieved a given cache line. When a replacement occurs in the shared instruction cache, the evicted cache line is stored, according to its hardware thread identifier, into the private instruction cache of the corresponding hardware thread. The miss buffer is used so that when a hardware thread receives a cache line returned for a cache miss request, the cache line is not backfilled into the shared instruction cache immediately; instead, it is held in the miss buffer until it is that hardware thread's turn to fetch, at which point the cache line is backfilled into the shared instruction cache. This reduces the probability that a cache line about to be accessed is evicted from the instruction cache. In addition, the added private instruction caches enlarge the cache capacity available to each hardware thread, improving system performance.
A further embodiment of the present invention provides a method for managing an instruction cache, as shown in Fig. 2, comprising:
101. When a hardware thread of the processor fetches an instruction from the instruction cache, the processor simultaneously accesses the shared instruction cache in the instruction cache and the private instruction cache corresponding to the hardware thread.
Illustratively, the processor (CPU, Central Processing Unit) may be a multi-threaded processor. One physical core may have multiple hardware threads, also called logical cores or logical processors, but a hardware thread does not represent a physical core; Windows, for example, treats each hardware thread as a schedulable logical processor, and each logical processor can run the code of a software thread. The instruction cache may comprise the shared instruction cache (I-Cache) in the processor's L1 Cache together with the private instruction caches of the hardware threads, where the L1 Cache comprises a data cache (D-Cache) and an instruction cache (I-Cache).
Specifically, a fully associative private Cache may be provided for each hardware thread, i.e., the private Caches correspond one-to-one with the hardware threads. In a fully associative structure, any instruction cache block in main memory can be mapped to any block in the private instruction cache.
In addition, Tag comparison logic may be added. When a hardware thread fetches an instruction, the private Cache of the hardware thread is actively connected to the Tag comparison logic, so that when a hardware thread fetches, it accesses the I-Cache and the private Cache corresponding to that hardware thread simultaneously.

102. The processor determines whether the instruction is present in the shared instruction cache and in the private instruction cache corresponding to the hardware thread, and then proceeds to step 103, 104, 105 or 106.
Illustratively, when the hardware thread accesses the I-Cache and its corresponding private Cache simultaneously, it determines at the same time whether the fetched instruction is present in the I-Cache and in that private Cache.
Suppose the multi-threaded processor has 32 hardware threads sharing one 64 KB I-Cache, i.e., the shared instruction cache capacity is 64 KB. Each hardware thread has a 32-way fully associative private Cache that can store 32 evicted Cache Lines; each Cache Line holds 64 bytes, so each private Cache has a capacity of 2 KB.
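The capacity figures in this example can be checked with a line of arithmetic — purely illustrative:

```python
ways = 32           # 32-way fully associative private cache (32 lines)
line_bytes = 64     # one Cache Line = 64 bytes
threads = 32        # hardware threads sharing one 64 KB I-Cache

private_cache_bytes = ways * line_bytes        # capacity of one private cache
total_private_bytes = private_cache_bytes * threads

print(private_cache_bytes)  # 2048 bytes = 2 KB per thread
print(total_private_bytes)  # 65536 bytes = 64 KB of added private capacity
```

The added private capacity across all 32 threads thus matches the size of the shared 64 KB I-Cache itself.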
When 32-way Tag comparison logic is added, a hardware thread, while accessing the shared I-Cache, compares the 32 Tags read from its private Cache with the PA (Physical Address) output by the TLB, producing a private Cache Miss signal and a private Cache data output. If one of the 32 Tags matches the PA, the private Cache Miss signal indicates that the fetched instruction is present in the hardware thread's private Cache, and the private Cache data is a valid instruction, as shown in Fig. 3.
The TLB, also called a page-table buffer, stores page-table entries (virtual-to-physical address translations). The virtual address of the fetched instruction can be translated into a physical address by the TLB and compared against the tags in the private instruction cache; if the physical address matches a tag in the private instruction cache, the instruction is present there. In this way, the hardware thread accesses the private instruction cache at the same time as it accesses the shared instruction cache.
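The translate-then-compare step can be sketched as follows. This is a hedged model under simplifying assumptions (a 4 KB page, full physical-address tags, and a TLB reduced to a dictionary); the function name and parameters are invented for the sketch.

```python
def probe_private_cache(virtual_address, tlb, tag_array, page_shift=12):
    """Translate a fetch address through a TLB model and compare the
    resulting physical address against every tag of a fully associative
    private cache, as the added Tag comparison logic does in parallel.

    tlb: dict mapping virtual page number -> physical page number
    tag_array: list of physical-address tags, one per way
    Returns (miss, way): miss is True when no tag matches."""
    vpn = virtual_address >> page_shift
    offset = virtual_address & ((1 << page_shift) - 1)
    physical_address = (tlb[vpn] << page_shift) | offset

    for way, tag in enumerate(tag_array):
        if tag == physical_address:    # one comparator per way, in parallel
            return (False, way)        # private Cache Miss deasserted: hit
    return (True, None)                # private cache miss
```

In hardware all way comparators evaluate in the same cycle; the Python loop stands in for that parallel comparison.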
103. If the instruction is present in both the shared instruction cache and the private instruction cache corresponding to the hardware thread, the processor obtains the instruction from the shared instruction cache.
Illustratively, when the I-Cache and the private Cache are accessed simultaneously and the fetched instruction is present in both, the instruction is obtained from the I-Cache.
104. If the instruction is present in the shared instruction cache but not in the private instruction cache corresponding to the hardware thread, the processor obtains the instruction from the shared instruction cache.
Illustratively, if the fetched instruction is present in the I-Cache but not in the hardware thread's private Cache, the instruction is obtained from the I-Cache.
105. If the instruction is present in the private instruction cache corresponding to the hardware thread but not in the shared instruction cache, the processor obtains the instruction from that private instruction cache.

Illustratively, if the I-Cache misses, i.e., the fetched instruction is not present there, but the instruction is present in the hardware thread's private Cache, the instruction is obtained from that private Cache. In this way, by actively selecting the hardware thread's private Cache to participate in the Tag comparison, the Cache capacity available to each hardware thread is enlarged and the hit rate of the hardware thread's instruction cache is increased.
106. If the instruction is present in neither the shared instruction cache nor the private instruction cache, the processor sends, through the hardware thread, a cache miss request to the next-level cache below the shared instruction cache.
Illustratively, if the fetched instruction is present in neither the I-Cache nor the hardware thread's private Cache, the hardware thread issues a Cache Miss to the next-level cache below the I-Cache.
For example, if neither the L1 Cache nor the hardware thread's private Cache contains the fetched instruction, the hardware thread issues a Cache Miss to the L2 Cache, the next-level cache below the L1 Cache, so as to obtain the fetched instruction from the L2 Cache.
107. If the instruction is present in the next-level cache, the processor obtains the instruction from the next-level cache through the hardware thread and stores the cache line containing the instruction in the miss buffer corresponding to the hardware thread; when the hardware thread fetches, the cache line is backfilled into the shared instruction cache.
Illustratively, when the fetched instruction is present in the L2 Cache, the instruction is obtained from the L2 Cache, and the Cache Line containing the instruction is not backfilled into the L1 Cache directly; instead, the Cache Line is held in the Miss Buffer corresponding to the hardware thread until it is that hardware thread's turn to fetch, at which point the Cache Line is filled into the L1 Cache.
The Miss Buffers correspond one-to-one with the hardware threads, i.e., each hardware thread has its own Miss Buffer and uses it to buffer the Cache Line returned for its Cache Miss request. This is because a replacement could occur if the Cache Line were backfilled into the L1 Cache immediately, and the evicted Cache Line might be one that is about to be accessed. The Miss Buffer thus optimizes the backfill timing of the Cache Line and reduces the probability that a Cache Line about to be accessed is evicted from the cache.
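The deferred-backfill behavior of a per-thread Miss Buffer can be sketched as below — a hedged model only, with the class and method names invented for this illustration:

```python
class MissBuffer:
    """Per-thread miss buffer: holds cache lines returned for this
    thread's Cache Miss requests and backfills them into the shared
    cache only when the thread is next scheduled to fetch."""

    def __init__(self):
        self.pending = []  # (address, line) pairs awaiting backfill

    def on_miss_return(self, address, line):
        # Do not backfill immediately: the replacement it would trigger
        # might evict a line another thread is about to use.
        self.pending.append((address, line))

    def on_thread_fetch(self, shared_cache):
        # Now it is this thread's turn: drain the buffer into the cache.
        for address, line in self.pending:
            shared_cache[address] = line
        self.pending.clear()
```

The key property is visible in the model: between `on_miss_return` and `on_thread_fetch`, the shared cache is untouched, so no eviction happens while other threads are fetching.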
108. If the instruction is not present in the next-level cache, the processor sends the miss request to main memory through the hardware thread, obtains the instruction from main memory, and stores the cache line containing the instruction in the miss buffer corresponding to the hardware thread; when the hardware thread fetches, the cache line is backfilled into the shared instruction cache.

Illustratively, if the fetched instruction is not present in the L2 Cache either, the hardware thread issues a Cache Miss request to main memory so as to obtain the instruction from main memory. If the instruction is present in main memory, it is obtained, and the Cache Line containing it is held in the Miss Buffer corresponding to the hardware thread until it is that hardware thread's turn to fetch, at which point the Cache Line is filled into the L1 Cache.
Alternatively, when the instruction is not present in the L2 Cache, the hardware thread may issue a Cache Miss request to the L3 Cache; if the instruction is present in the L3 Cache, it is obtained there, and if not, a Cache Miss request is issued to main memory to obtain the instruction.
The unit of exchange between the CPU and the Cache is the word. When the CPU reads a word from main memory, the memory address of the word is sent to the Cache and to main memory at the same time; the Cache control logic of the L1, L2 or L3 Cache determines, from the Tag portion of the address, whether the word is present. On a hit, the CPU obtains the word; on a miss, the word must be read out of main memory in a main-memory read cycle and delivered to the CPU. Even if the CPU currently reads only one word, the Cache controller copies the complete Cache line containing that word from main memory into the Cache; this operation of transferring a line of data into the Cache is called a Cache line fill.
In addition, when the cache line is backfilled into the shared instruction cache, if there are no free resources in the shared instruction cache, the cache line replaces a first cache line in the shared instruction cache and is backfilled into the shared instruction cache, and at the same time, according to the hardware thread identifier of the first hardware thread that fetched the first cache line, the first cache line is stored in the private instruction cache corresponding to the first hardware thread. The first cache line is determined by the LRU (Least Recently Used) algorithm.
Illustratively, a Thread ID (hardware thread identifier) may be added to the I-Cache Data Array (instruction cache data storage array) to indicate which hardware thread's Cache Miss request retrieved a given Cache Line. In this way, after a fully associative private Cache has been provided for each hardware thread, when a replacement occurs in the I-Cache, the evicted Cache Line is not discarded directly; instead, it can be filled, according to the Thread ID, into the private Cache of the hardware thread identified by that Thread ID, because the evicted Cache Line may well be accessed again soon, as shown in Fig. 4.
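The routing of an evicted line back to its owner can be sketched by storing the requester's Thread ID alongside each fill. A hedged illustration — the data structures and names are invented here, and FIFO order stands in for the LRU victim choice to keep the sketch short:

```python
def backfill_shared(shared_cache, private_caches, address, line, thread_id,
                    capacity):
    """Backfill one line into the shared cache; if a victim must be
    evicted, route it into the private cache of the hardware thread
    whose Cache Miss request originally fetched it (its Thread ID)."""
    if len(shared_cache) >= capacity:
        # Victim chosen by insertion order here, as a stand-in for LRU.
        victim_addr, (victim_line, victim_tid) = next(iter(shared_cache.items()))
        del shared_cache[victim_addr]
        # Not discarded: filled into the owning thread's private cache.
        private_caches[victim_tid][victim_addr] = victim_line
    # Each entry stores the Thread ID alongside the data, as in the
    # extended I-Cache Data Array.
    shared_cache[address] = (line, thread_id)
```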
When the evicted first cache line is stored into the private instruction cache corresponding to the first hardware thread, if there are no free resources in that private instruction cache, the first cache line replaces a second cache line in the private instruction cache corresponding to the first hardware thread and is backfilled into that private instruction cache. The second cache line is determined by the LRU algorithm.
As before, the LRU algorithm evicts the cache line that has gone unused for the longest time whenever an instruction cache miss occurs; in other words, the cache preferentially retains the instructions that have been used most recently.
In this way, adding the private Caches effectively enlarges the instruction Cache capacity available to each hardware thread, increases the hit rate of each hardware thread's instruction Cache, and reduces traffic between the I-Cache and the next-level Cache. At the same time, the added Miss Buffers optimize the backfill timing of Cache Lines and reduce the probability that a Cache Line about to be accessed is evicted, and the added Tag comparison logic allows the shared instruction cache and the private instruction cache to be accessed simultaneously when the I-Cache is accessed, increasing the hit rate of the instruction cache.
An embodiment of the present invention provides a method for managing an instruction cache. When a hardware thread of the processor fetches an instruction from the instruction cache, the shared instruction cache in the instruction cache and the private instruction cache corresponding to the hardware thread are accessed simultaneously; it is determined whether the instruction is present in the shared instruction cache and in the private instruction cache corresponding to the hardware thread, and the instruction is obtained from the shared instruction cache or the private instruction cache according to the determination result. If the instruction is present in neither the shared instruction cache nor the private instruction cache, a cache miss request is sent through the hardware thread to the next-level cache below the shared instruction cache, the cache line containing the instruction is stored in the miss buffer corresponding to the hardware thread, and when the hardware thread fetches, the cache line is backfilled into the shared instruction cache. When the cache line is backfilled into the shared instruction cache, if there are no free resources in the shared instruction cache, the cache line replaces a first cache line in the shared instruction cache and is backfilled into the shared instruction cache, while, according to the hardware thread identifier of the first hardware thread that fetched the first cache line, the first cache line is stored in the private instruction cache corresponding to the first hardware thread. This enlarges the instruction cache capacity available to each hardware thread, reduces the instruction cache miss rate, and improves system performance.
In the several embodiments provided in this application, it should be understood that the disclosed processor and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in actual implementation — for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or in other forms.
In addition, in the devices and systems of the embodiments of the present invention, the functional units may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The above units may be implemented in the form of hardware, or in the form of hardware plus software functional units.
All or part of the steps of the foregoing method embodiments may be performed by hardware controlled by program instructions. The foregoing program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the foregoing method embodiments. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The foregoing describes only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A processor, comprising a program counter, a register file, an instruction prefetch unit, an instruction decode unit, an instruction issue unit, an address generation unit, an arithmetic logic unit, a shared floating-point unit, a data cache and an internal bus, characterized by further comprising:
a shared instruction cache, configured to store shared instructions of all hardware threads and comprising a tag storage array and a data storage array, wherein the tag storage array is configured to store tags, the data storage array comprises stored instructions and hardware thread identifiers, and a hardware thread identifier is used to identify the hardware thread corresponding to a cache line in the shared instruction cache;
a private instruction cache, configured to store instruction cache lines evicted from the shared instruction cache, wherein the private instruction caches correspond one-to-one with the hardware threads; and
a miss buffer, configured to, when a fetched instruction is not present in the shared instruction cache, hold the cache line retrieved from the next-level cache below the shared instruction cache in the miss buffer of the hardware thread, and, when the hardware thread corresponding to the fetched instruction fetches, backfill the cache line in the miss buffer into the shared instruction cache, wherein the miss buffers correspond one-to-one with the hardware threads.
2. The processor according to claim 1, characterized by further comprising:
tag comparison logic, configured to, when the hardware thread fetches an instruction, compare the tags in the private instruction cache corresponding to the hardware thread with the physical address translated by a translation look-aside buffer, wherein the private instruction cache is connected to the tag comparison logic so that the hardware thread accesses the private instruction cache while accessing the shared instruction cache.
3. The processor according to claim 2, characterized in that the processor is a multi-threaded processor, the private instruction cache has a fully associative structure, and in the fully associative structure any instruction cache block in main memory can be mapped to any block in the private instruction cache.
4. The processor according to claim 3, characterized in that the shared instruction cache, the private instruction cache and the miss buffer are static memory chips or dynamic memory chips.
5. A method for managing an instruction cache, characterized by comprising:
when a hardware thread of a processor fetches an instruction from the instruction cache, simultaneously accessing a shared instruction cache in the instruction cache and a private instruction cache corresponding to the hardware thread; and
determining whether the instruction is present in the shared instruction cache and the private instruction cache corresponding to the hardware thread, and obtaining the instruction from the shared instruction cache or the private instruction cache corresponding to the hardware thread according to the determination result.
6. The method according to claim 5, characterized in that the shared instruction cache comprises a tag storage array and a data storage array, the tag storage array is configured to store tags, the data storage array comprises stored instructions and hardware thread identifiers, and a hardware thread identifier is used to identify the hardware thread corresponding to a cache line in the shared instruction cache; and
the private instruction cache has a fully associative structure, in which any instruction cache block in main memory can be mapped to any block in the private instruction cache, and the private instruction caches correspond one-to-one with the hardware threads.
7. The method according to claim 6, characterized in that determining whether the instruction is present in the shared instruction cache and the private instruction cache corresponding to the hardware thread, and obtaining the instruction from the shared instruction cache or the private instruction cache corresponding to the hardware thread according to the determination result, comprises:
if the instruction is present in both the shared instruction cache and the private instruction cache corresponding to the hardware thread, obtaining the instruction from the shared instruction cache;
if the instruction is present in the shared instruction cache and not present in the private instruction cache corresponding to the hardware thread, obtaining the instruction from the shared instruction cache; and
if the instruction is present in the private instruction cache corresponding to the hardware thread and not present in the shared instruction cache, obtaining the instruction from the private instruction cache corresponding to the hardware thread.
8. The method according to claim 7, characterized in that the method further comprises: if the instruction is present in neither the shared instruction cache nor the private instruction cache, sending a cache miss request through the hardware thread to the next-level cache below the shared instruction cache; if the instruction is present in the next-level cache, obtaining the instruction from the next-level cache through the hardware thread, storing the cache line containing the instruction in the miss buffer corresponding to the hardware thread, and, when the hardware thread fetches, backfilling the cache line into the shared instruction cache; and
if the instruction is not present in the next-level cache, sending the miss request through the hardware thread to main memory, obtaining the instruction from main memory, storing the cache line containing the instruction in the miss buffer corresponding to the hardware thread, and, when the hardware thread fetches, backfilling the cache line into the shared instruction cache;
wherein the miss buffers correspond one-to-one with the hardware threads.
9. The method according to claim 8, characterized in that, when the cache line is backfilled into the shared instruction cache, if there are no free resources in the shared instruction cache, the cache line replaces a first cache line in the shared instruction cache and is backfilled into the shared instruction cache, and, according to the hardware thread identifier of the first hardware thread that fetched the first cache line, the first cache line is stored in the private instruction cache corresponding to the first hardware thread;
其中, 所述第一緩存行是通过最近最少使用到算法确定的。 Wherein, the first cache line is determined through a least recently used algorithm.
10. The method according to claim 9, wherein, when the replaced first cache line is stored in the private instruction cache corresponding to the first hardware thread, if no free resource exists in the private instruction cache corresponding to the first hardware thread, replacing a second cache line in that private instruction cache with the first cache line, thereby backfilling the first cache line into the private instruction cache corresponding to the first hardware thread;
wherein the second cache line is determined by the least recently used algorithm.
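Claims 8–10 together describe an eviction chain: a line fetched on a miss is backfilled into the shared instruction cache; if the shared cache is full, its LRU victim is demoted into the private instruction cache of the hardware thread that originally fetched that victim; and a full private cache in turn evicts its own LRU line. The following is a minimal Python sketch of that chain, not the patented implementation: the names `LRUCache`, `backfill`, and `owner_of` are illustrative, and the per-thread miss cache of claim 8 is omitted for brevity.

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache with least-recently-used eviction (claims 9-10)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # tag -> cache line contents, LRU-first order

    def lookup(self, tag):
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return self.lines[tag]
        return None

    def insert(self, tag, line):
        """Insert a line; return the evicted (tag, line) pair, or None."""
        victim = None
        if tag not in self.lines and len(self.lines) >= self.capacity:
            victim = self.lines.popitem(last=False)  # LRU entry is evicted
        self.lines[tag] = line
        self.lines.move_to_end(tag)
        return victim

def backfill(shared, privates, owner_of, thread_id, tag, line):
    """Backfill a fetched line into the shared instruction cache (claim 8).

    If the shared cache has no free resource, its LRU victim (claim 9) is
    moved to the private cache of the hardware thread that fetched it,
    where a further LRU eviction may occur (claim 10).
    """
    owner_of[tag] = thread_id                 # remember the fetching hardware thread
    victim = shared.insert(tag, line)
    if victim is not None:
        v_tag, v_line = victim
        first_thread = owner_of[v_tag]        # hardware thread ID of the original fetcher
        privates[first_thread].insert(v_tag, v_line)  # may LRU-evict in turn
```

With a two-line shared cache and one-line private caches, backfilling a third line evicts the LRU shared line into the private cache of the thread that fetched it, so the line remains reachable for that thread.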
PCT/CN2014/080059 2013-06-28 2014-06-17 Management method for instruction cache, and processor WO2014206217A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310269557.0A CN104252425B (en) 2013-06-28 2013-06-28 Management method for instruction cache, and processor
CN201310269557.0 2013-06-28

Publications (1)

Publication Number Publication Date
WO2014206217A1 true WO2014206217A1 (en) 2014-12-31

Family

ID=52141028

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/080059 WO2014206217A1 (en) 2013-06-28 2014-06-17 Management method for instruction cache, and processor

Country Status (2)

Country Link
CN (1) CN104252425B (en)
WO (1) WO2014206217A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809078B (en) * 2015-04-14 2019-05-14 苏州中晟宏芯信息科技有限公司 Based on the shared cache hardware resource access method for exiting yielding mechanism
WO2017006235A1 (en) * 2015-07-09 2017-01-12 Centipede Semi Ltd. Processor with efficient memory access
CN106484310B (en) * 2015-08-31 2020-01-10 华为数字技术(成都)有限公司 Storage array operation method and device
CN109308190B (en) * 2018-07-09 2023-03-14 北京中科睿芯科技集团有限公司 Shared line buffer system based on 3D stack memory architecture and shared line buffer
US11099999B2 (en) * 2019-04-19 2021-08-24 Chengdu Haiguang Integrated Circuit Design Co., Ltd. Cache management method, cache controller, processor and storage medium
CN110990062B (en) * 2019-11-27 2023-03-28 上海高性能集成电路设计中心 Instruction prefetching filtering method
CN111078592A (en) * 2019-12-27 2020-04-28 无锡中感微电子股份有限公司 Multi-level instruction cache of low-power-consumption system on chip
WO2022150996A1 (en) * 2021-01-13 2022-07-21 王志平 Method for implementing processor cache structure
CN114116533B (en) * 2021-11-29 2023-03-10 海光信息技术股份有限公司 Method for storing data by using shared memory
CN115098169B (en) * 2022-06-24 2024-03-05 海光信息技术股份有限公司 Method and device for fetching instruction based on capacity sharing

Citations (4)

Publication number Priority date Publication date Assignee Title
US20020174285A1 (en) * 1998-12-03 2002-11-21 Marc Tremblay Shared instruction cache for multiple processors
CN101510191A (en) * 2009-03-26 2009-08-19 浙江大学 Multi-core system structure with buffer window and implementing method thereof
US20110320720A1 (en) * 2010-06-23 2011-12-29 International Business Machines Corporation Cache Line Replacement In A Symmetric Multiprocessing Computer
CN103020003A (en) * 2012-12-31 2013-04-03 哈尔滨工业大学 Multi-core program determinacy replay-facing memory competition recording device and control method thereof

Cited By (11)

Publication number Priority date Publication date Assignee Title
WO2018229701A1 (en) * 2017-06-16 2018-12-20 International Business Machines Corporation Translation support for a virtual cache
GB2577023A (en) * 2017-06-16 2020-03-11 Ibm Translation support for a virtual cache
US10606762B2 (en) 2017-06-16 2020-03-31 International Business Machines Corporation Sharing virtual and real translations in a virtual cache
US10698836B2 (en) 2017-06-16 2020-06-30 International Business Machines Corporation Translation support for a virtual cache
US10713168B2 (en) 2017-06-16 2020-07-14 International Business Machines Corporation Cache structure using a logical directory
GB2577023B (en) * 2017-06-16 2020-08-05 Ibm Translation support for a virtual cache
US10810134B2 (en) 2017-06-16 2020-10-20 International Business Machines Corporation Sharing virtual and real translations in a virtual cache
US10831664B2 (en) 2017-06-16 2020-11-10 International Business Machines Corporation Cache structure using a logical directory
US10831674B2 (en) 2017-06-16 2020-11-10 International Business Machines Corporation Translation support for a virtual cache
US11403222B2 (en) 2017-06-16 2022-08-02 International Business Machines Corporation Cache structure using a logical directory
US11775445B2 (en) 2017-06-16 2023-10-03 International Business Machines Corporation Translation support for a virtual cache

Also Published As

Publication number Publication date
CN104252425A (en) 2014-12-31
CN104252425B (en) 2017-07-28

Similar Documents

Publication Publication Date Title
WO2014206217A1 (en) Management method for instruction cache, and processor
EP3238074B1 (en) Cache accessed using virtual addresses
US7290116B1 (en) Level 2 cache index hashing to avoid hot spots
US11016763B2 (en) Implementing a micro-operation cache with compaction
US6427188B1 (en) Method and system for early tag accesses for lower-level caches in parallel with first-level cache
KR101456860B1 (en) Method and system to reduce the power consumption of a memory device
WO2014206218A1 (en) Method and processor for accessing data cache
US10884751B2 (en) Method and apparatus for virtualizing the micro-op cache
US9727482B2 (en) Address range priority mechanism
US8521944B2 (en) Performing memory accesses using memory context information
US20170185515A1 (en) Cpu remote snoop filtering mechanism for field programmable gate array
US8335908B2 (en) Data processing apparatus for storing address translations
US9547593B2 (en) Systems and methods for reconfiguring cache memory
JP2008502069A (en) Memory cache controller and method for performing coherency operations therefor
KR20150079408A (en) Processor for data forwarding, operation method thereof and system including the same
WO2016191016A1 (en) Managing sectored cache
US20040221117A1 (en) Logic and method for reading data from cache
WO2014105167A1 (en) Apparatus and method for page walk extension for enhanced security checks
JPWO2004031943A1 (en) Data processor
US8271732B2 (en) System and method to reduce power consumption by partially disabling cache memory
JP2023179708A (en) Prefetch kill and retrieval in instruction cache
US10013352B2 (en) Partner-aware virtual microsectoring for sectored cache architectures
US9037804B2 (en) Efficient support of sparse data structure access
US9639467B2 (en) Environment-aware cache flushing mechanism
US10942851B2 (en) System, apparatus and method for dynamic automatic sub-cacheline granularity memory access control

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14816879

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14816879

Country of ref document: EP

Kind code of ref document: A1