CN107479860B - Processor chip and instruction cache prefetching method - Google Patents

Processor chip and instruction cache prefetching method

Info

Publication number
CN107479860B
CN107479860B (application CN201610397009.XA; published as CN107479860A)
Authority
CN
China
Prior art keywords
cache
instruction
access address
cache line
offset information
Prior art date
Legal status
Active
Application number
CN201610397009.XA
Other languages
Chinese (zh)
Other versions
CN107479860A (en)
Inventor
沈亦翀
方磊
罗会斌
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201610397009.XA
Priority to PCT/CN2017/087091 (WO2017211240A1)
Publication of CN107479860A
Application granted
Publication of CN107479860B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3802 Instruction prefetching
    • G06F 9/3814 Implementation provisions of instruction buffers, e.g. prefetch buffer; banks

Abstract

The embodiments of the invention disclose a processor chip and a prefetching method for an instruction cache, which can improve the prefetching accuracy of the instruction cache. The processor chip comprises a processor core (CPU core) and a Cache memory (Cache), wherein the Cache comprises a first-level instruction Cache (L1I-Cache) and a Cache controller; the L1I-Cache comprises at least one Cache unit (Cache line), and each Cache line comprises a tag field, data, a flag bit, and extension bits for storing offset information of an access address. The CPU core is configured to acquire an access address of a first instruction and access the L1I-Cache according to the access address of the first instruction. The Cache controller is configured to, when a first Cache line corresponding to the access address of the first instruction is hit in the L1I-Cache, read the offset information of the access address in the extension bits of the first Cache line, and calculate an access address of a second instruction from the offset information and the access address of the first instruction. The CPU core is further configured to prefetch the second instruction according to the access address of the second instruction.

Description

Processor chip and instruction cache prefetching method
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a processor chip and a prefetch method for an instruction cache.
Background
In a modern computer architecture, a Cache is a storage unit located near the top of the computer's memory hierarchy. It serves mainly as a bridge between lower-level storage and the Central Processing Unit (CPU), reducing the latency the CPU would incur in fetching data directly from lower-level storage (such as main memory or disk). The Cache is composed of several independent Cache modules, including an instruction Cache (I-Cache), a data Cache (D-Cache), and a Translation Lookaside Buffer (TLB). At present, most mainstream CPUs are equipped with a first-level Cache (L1 Cache) and a second-level Cache (L2 Cache), and a few high-end processors integrate a third-level Cache. The first-level Cache can be divided into a first-level instruction Cache (L1I-Cache) and a first-level data Cache (L1D-Cache).
The CPU of a modern computer obtains the instructions to be executed by directly accessing its L1I-Cache via an access address. When the instruction corresponding to the access address is present in the L1I-Cache, this is called a hit (Cache hit); when it cannot be found there, it is called a miss (Cache miss). On a miss, the CPU searches the next level of the memory hierarchy, but accessing lower-level storage directly causes a significant increase in access latency and stalls CPU instruction execution, degrading the performance of the computer. Cache prefetch techniques were therefore proposed; the main idea is to fetch instructions that may be accessed into the L1I-Cache before the CPU requests them, thereby avoiding the CPU stalls caused by misses.
To mitigate the instruction fetch latency caused by CPU instructions missing in the L1I-Cache, a typical solution inserts a functional unit called a stream buffer, which is effectively a First-In-First-Out (FIFO) queue, between the L1I-Cache and the L2I-Cache. When instructions are accessed sequentially and miss in the L1I-Cache but hit in the stream buffer, they are fetched directly from the stream buffer, greatly reducing the latency of fetching instructions from a lower-level memory (e.g., the L2I-Cache).
In the above solution, because the stream buffer is a FIFO queue with only one comparator at its head (the first entry of the FIFO queue), when an L1I-Cache miss occurs and the comparator does not find the corresponding instruction in the head entry, the entire stream buffer is reset and prefetching starts over, even if the required instruction was already in the stream buffer (just not at the head). This results in low utilization of the stream buffer, and the prefetched instructions are likely never to be accessed, so the prefetch accuracy of the instruction cache is low.
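For illustration only, the single-comparator behaviour described above can be modelled with the following minimal C sketch; the names and sizes (SB_DEPTH, sb_lookup) are assumptions for this sketch, not code from the patent:

```c
#include <stdint.h>

/* Illustrative model of the prior-art stream buffer's limitation:
   an 8-entry FIFO with a single comparator at the head. */
#define SB_DEPTH 8

struct stream_buffer {
    uint64_t addr[SB_DEPTH];   /* addresses of the prefetched cache lines */
    unsigned head;             /* index of the head entry                 */
    unsigned count;            /* number of valid entries                 */
};

/* Called on an L1I-Cache miss: only the head entry is compared. */
int sb_lookup(struct stream_buffer *sb, uint64_t miss_addr)
{
    if (sb->count > 0 && sb->addr[sb->head] == miss_addr) {
        sb->head = (sb->head + 1) % SB_DEPTH;   /* pop the head and use it  */
        sb->count--;
        return 1;                               /* hit in the stream buffer */
    }
    sb->head = 0;      /* any other miss resets the whole buffer, even if   */
    sb->count = 0;     /* miss_addr is sitting in a non-head entry          */
    return 0;
}
```

The reset in the fall-through path is exactly what wastes the non-head entries and lowers the prefetch accuracy.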
Disclosure of Invention
The embodiment of the invention provides a processor chip and a prefetching method of an instruction cache, which can improve the prefetching accuracy of the instruction cache.
The first aspect of the embodiments of the present invention provides a processor chip, where the processor chip includes a processor core (CPU core) and a Cache memory (Cache), and the Cache includes a first-level instruction Cache L1I-Cache and a Cache controller; to improve the processing performance of the CPU core, the Cache may further include a second-level instruction Cache L2I-Cache. The Cache may be implemented with high-speed static memory chips or integrated into the CPU chip, and stores the instructions or operands the CPU accesses frequently. The L1I-Cache stores instructions frequently accessed by the CPU and includes at least one Cache unit (Cache line); every Cache line has the same data structure, which may include a tag field, data, and flag bits.
On this basis, the embodiment of the present invention extends the Cache line data structure: unlike a conventional Cache line, each new Cache line further includes extension bits for storing offset information of an access address, where the offset information is the address variation between the access addresses contained in two adjacent access instructions issued by the CPU core to the L1I-Cache. This forms the new Cache line data structure.
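As an illustration, a minimal C sketch of the extended Cache line data structure might look as follows; the field names and the 64-byte line size are assumptions made for this sketch, not values fixed by the embodiment:

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64   /* assumed line size; not specified by the patent */

/* Conventional Cache line fields plus the extension bits described above. */
struct cache_line {
    uint64_t tag;                     /* tag field                                 */
    uint8_t  data[CACHE_LINE_BYTES];  /* cached instruction bytes                  */
    bool     valid;                   /* flag bit: line holds valid contents       */
    bool     extra_valid;             /* extension bits currently hold an offset   */
    int64_t  extra_offset;            /* extension bits: offset between the access
                                         address of this line and the access
                                         address of the next fetched instruction   */
};
```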
Optionally, before the CPU Core obtains the access address of the first instruction, the Cache controller calculates offset information between the access address of the first instruction and the access address of the second instruction, and writes the offset information into an extension bit of the Cache line corresponding to the access address of the first instruction. Thus, the L1I-Cache is configured in advance, and the continuity of instruction prefetching is improved.
When the CPU core calls data, an access address of a first instruction needs to be obtained, and the L1I-Cache is accessed according to the access address of the first instruction.
The CPU core accesses the L1I-Cache through the access address of the first instruction, and if the access address of the first instruction exists in the L1I-Cache, the first Cache line corresponding to the access address of the first instruction is hit in the L1I-Cache. At this time, the Cache controller reads the offset information of the access address in the extension bit of the first Cache line, and calculates the access address of the second instruction according to the offset information of the access address and the access address of the first instruction.
Optionally, if the access address of the first instruction does not exist in the L1I-Cache, the first Cache line corresponding to the access address of the first instruction is not hit in the L1I-Cache; in this case, the Cache controller may locate the first instruction according to its access address and prefetch the first instruction into the L1I-Cache. In this way, the first instruction corresponding to the missed Cache line is prefetched into the L1I-Cache, improving the access rate after instruction prefetching.
After calculating the access address of the second instruction, the CPU core may search the location of the second instruction through the access address of the second instruction, and perform prefetching on the second instruction.
Because each Cache line also includes extension bits for storing the offset information of an access address, where the offset information is the address variation between the access addresses contained in two adjacent access instructions issued by the processor core (CPU core) to the L1I-Cache, the Cache controller can, upon determining that the first Cache line corresponding to the access address of the first instruction is hit in the L1I-Cache, read the offset information in the extension bits of the first Cache line and calculate the access address of the second instruction from the offset information and the access address of the first instruction; the CPU core then prefetches the second instruction according to the access address calculated by the Cache controller. Instructions prefetched by means of the offset information of the access address have a high probability of being accessed, so the prefetch accuracy of the instruction cache is improved.
Optionally, the Cache may further include a second-level instruction Cache L2I-Cache. When the storage space in the L1I-Cache is full, the CPU Core performs an eviction operation on the Cache space in the L1I-Cache and evicts a second Cache line from the L1I-Cache to the L2I-Cache. The Cache controller sets an evicted weight for the second Cache line, which marks the priority with which the second Cache line is evicted from the L2I-Cache. Setting the evicted weight improves the utilization of the evicted Cache line.
Optionally, inheritance and sharing of the prefetching experience can be realized for the evicted Cache line: when it is determined that the access address of an instruction executed by the CPU Core corresponds to the second Cache line evicted to the L2I-Cache, the second Cache line is prefetched back into the L1I-Cache. The information stored in the second Cache line evicted to the L2I-Cache is thus used directly, saving configuration resources and improving the utilization of the Cache line.
A second aspect of the embodiments of the present invention provides a method for prefetching an instruction Cache, applied to a processor chip, where the processor chip includes a processor core (CPU core) and a Cache memory (Cache), the Cache includes a first-level instruction Cache L1I-Cache and a Cache controller, the L1I-Cache includes at least one Cache unit (Cache line), and each Cache line includes a tag field, data, and a flag bit, where each Cache line further includes extension bits for storing offset information of an access address, the offset information being the address variation between the access addresses contained in two adjacent access instructions issued by the CPU core to the L1I-Cache; the method comprises the following steps:
when the CPU core calls data, an access address of a first instruction needs to be obtained, and the L1I-Cache is accessed according to the access address of the first instruction.
The CPU core accesses the L1I-Cache through the access address of the first instruction, and if the access address of the first instruction exists in the L1I-Cache, the first Cache line corresponding to the access address of the first instruction is hit in the L1I-Cache. At this time, the Cache controller reads the offset information of the access address in the extension bit of the first Cache line, and calculates the access address of the second instruction according to the offset information of the access address and the access address of the first instruction.
After calculating the access address of the second instruction, the CPU core may search the location of the second instruction through the access address of the second instruction, and perform prefetching on the second instruction.
Because each Cache line also includes extension bits for storing the offset information of an access address, where the offset information is the address variation between the access addresses contained in two adjacent access instructions issued by the processor core (CPU core) to the L1I-Cache, the Cache controller can, upon determining that the first Cache line corresponding to the access address of the first instruction is hit in the L1I-Cache, read the offset information in the extension bits of the first Cache line and calculate the access address of the second instruction from the offset information and the access address of the first instruction; the CPU core then prefetches the second instruction according to the access address calculated by the Cache controller. Instructions prefetched by means of the offset information of the access address have a high probability of being accessed, so the prefetch accuracy of the instruction cache is improved.
Optionally, before the CPU Core obtains the access address of the first instruction, the Cache controller calculates offset information between the access address of the first instruction and the access address of the second instruction, and writes the offset information into an extension bit of the Cache line corresponding to the access address of the first instruction. Thus, the L1I-Cache is configured in advance, and the continuity of instruction prefetching is improved.
Optionally, if the access address of the first instruction does not exist in the L1I-Cache, the first Cache line corresponding to the access address of the first instruction is not hit in the L1I-Cache; in this case, the Cache controller may locate the first instruction according to its access address and prefetch the first instruction into the L1I-Cache. In this way, the first instruction corresponding to the missed Cache line is prefetched into the L1I-Cache, improving the access rate after instruction prefetching.
Optionally, the Cache may further include a second-level instruction Cache L2I-Cache. When the storage space in the L1I-Cache is full, the CPU Core performs an eviction operation on the Cache space in the L1I-Cache and evicts a second Cache line from the L1I-Cache to the L2I-Cache. The Cache controller sets an evicted weight for the second Cache line, which marks the priority with which the second Cache line is evicted from the L2I-Cache. Setting the evicted weight improves the utilization of the evicted Cache line.
Optionally, inheritance and sharing of the prefetching experience can be realized for the evicted Cache line: when it is determined that the access address of an instruction executed by the CPU Core corresponds to the second Cache line evicted to the L2I-Cache, the second Cache line is prefetched back into the L1I-Cache. The information stored in the second Cache line evicted to the L2I-Cache is thus used directly, saving configuration resources and improving the utilization of the Cache line.
A third aspect of the embodiments of the present invention provides a data structure of a Cache unit (Cache line), where the Cache line includes a tag field, data, and a flag bit, and further includes extension bits for storing offset information of an access address, the offset information being the address variation between the access addresses contained in two adjacent access instructions issued by the processor core (CPU core) to the first-level instruction Cache L1I-Cache.
A fourth aspect of the embodiments of the present invention provides a storage medium, where a program code is stored in the storage medium, and when the program code is executed by a processor chip, the method for prefetching an instruction cache according to the second aspect or any implementation manner of the second aspect is performed. The storage medium includes, but is not limited to, a flash memory (flash memory), a Hard Disk Drive (HDD), or a Solid State Drive (SSD).
Drawings
FIG. 1 is a schematic diagram of an organization of an embodiment of a processor chip provided by the present invention;
FIG. 2 is a flowchart illustrating an embodiment of a method for prefetching an instruction cache according to the present invention;
FIG. 3 is a schematic structural diagram of a cache line embodiment of a cache unit provided in the present invention;
FIG. 4 is a flowchart illustrating an embodiment of a method for configuring offset information of an access address according to the present invention.
Detailed Description
The terms "first," "second," and the like in the description and in the claims, and in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical solutions in the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An effective instruction cache prefetching technique should have two basic attributes: high prediction accuracy and sufficiently small software and hardware overhead. High prediction accuracy improves the hit rate of the I-Cache and thus the CPU's instruction execution performance, while a sufficiently small hardware and software overhead limits the impact of prefetching on Cache performance. Typical computer cache prefetching techniques can be roughly divided by implementation means into two classes: software-based prefetching and hardware-based prefetching. Typical prefetching mechanisms include OBL (One Block Lookahead), adaptive sequential prefetching, the stream buffer, history-based prefetching techniques, and the like.
The stream buffer is a typical instruction prefetching technique: a stream buffer inserted between the L1I-Cache and the L2I-Cache temporarily stores prefetched instruction data, so the prefetched data can be held effectively while Cache pollution is avoided, and can be accessed by the L1 Cache with low latency when needed.
History-based prefetching is another typical instruction prefetching mechanism, which calculates instructions that are likely to be executed by using a certain analysis prediction algorithm based on instruction execution history information (including instruction execution history, branch prediction history, or cache miss history), and prefetches the instructions.
The prior art mainly has two problems. First, for instruction sequences with complex access patterns, the existing methods cannot guarantee effective analysis, learning, and accurate prediction, and the software and hardware overhead required to reach a given prefetch accuracy depends on the characteristics of the input instruction sequence. Second, prefetching mechanisms that predict from historical information generally require a non-negligible initialization time for analysis and learning, during which low-accuracy prefetch operations may be generated that harm CPU instruction execution performance.
What distinguishes the invention from the prior art is the following: the invention extends the bit width of a Cache unit (Cache line) of the I-Cache, adds to the Cache line the offset information of the next instruction's address relative to this instruction's address, and prefetches the corresponding instruction when the Cache line hits. This hardware mechanism differs from the various prefetching mechanisms mentioned above; it can adapt to instruction sequences with various access patterns, reduce the training overhead of prefetching, and effectively improve the prefetch accuracy.
As shown in FIG. 1, which is a schematic diagram of the organization of a processor chip 200 according to an embodiment of the present invention, the processor chip 200 includes a processor core 202 and a Cache 204; the Cache includes a first-level instruction Cache L1I-Cache 2042 and a Cache controller 2044, and, to improve the processing performance of the processor core 202, the Cache 204 may further include a second-level instruction Cache L2I-Cache 2046. The processor chip 200 may also include a bus 208 and a communication interface 206.
The CPU core 202, the Cache 204, and the communication interface 206 may be connected to each other through the bus 208, or may communicate by other means such as wireless transmission.
The Cache 204 in the embodiment of the present invention may be a random-access memory (RAM); specifically, it may be a static random-access memory (SRAM). In practice, a Cache is basically implemented with RAM; an SRAM is a memory with static access, which retains the data stored in it without a refresh circuit and therefore offers high performance. The Cache 204 may be implemented with high-speed static memory chips or integrated into the CPU chip, and stores the instructions or operands the CPU accesses frequently. According to the order in which data is read and the tightness of the coupling with the CPU, the CPU cache can be divided into a first-level cache and a second-level cache, and some high-end CPUs also provide a third-level cache. Generally, the first-level cache can be divided into a first-level data cache and a first-level instruction cache, which store data and the instructions to be decoded in real time, respectively; both can be accessed by the CPU at the same time, reducing the conflicts caused by contention for the Cache and improving processor efficiency.
The L1I-Cache 2042 in the embodiment of the present invention includes at least one Cache line (Cache units 1 to n shown in FIG. 1), and every Cache line has the same data structure, which may include a tag field, data, and flag bits. On this basis, the Cache line data structure shown in FIG. 3 extends the conventional Cache line data structure by adding extension bits (Extra bits) for storing offset information of an access address. The offset information of the access address is the address variation between the access addresses contained in two adjacent access instructions issued by the processor core 202 to the L1I-Cache 2042.
Before the CPU Core 202 acquires the access address of the first instruction, the Cache controller 2044 in the embodiment of the present invention calculates the offset information between the access address of the first instruction and the access address of the second instruction, and writes the offset information into the extension bits of the Cache line corresponding to the access address of the first instruction. For example, referring to the flow of calculating the offset information of an access address shown in FIG. 4, the CPU Core 202 accesses a Cache line in the L1I-Cache 2042, and the Cache controller 2044 checks whether the address register is empty: if it is empty, this Cache line is the first one in the access sequence, and the Cache controller 2044 writes the current access address cur_addr into the address register and waits for the next Cache line access; if it is not empty, the Cache controller 2044 assigns the value of the address register (i.e., the cur_addr of the previous access) to last_addr, and writes the current access address into the address register as the new cur_addr.
It can be understood that an empty address register means that no Cache line in the L1I-Cache 2042 has been accessed yet; therefore, after the first access, the access address of the first instruction is written into the address register, and the register is used again once the access address of the next instruction arrives and the L1I-Cache 2042 is accessed anew.
For example, after the access address (cur_addr) of the first instruction has been written into the address register by the first access, the address register is no longer empty. When a Cache line in the L1I-Cache 2042 is accessed again, the access address of the first instruction stored in the register is read out (since the current access is no longer the first instruction's access, the stored address of the first instruction can now be understood as last_addr). The Cache controller 2044 then updates the address register with the access address of the second instruction (since this is now the current access address, the access address of the second instruction can be understood as the new cur_addr). Here, the access address of the second instruction is the access address obtained when the CPU Core 202 accesses the L1I-Cache 2042 for the second time after the first access.
The Cache controller 2044 calculates the address change Δ = cur_addr - last_addr from last_addr and the current access address (i.e., the new cur_addr), and writes Δ into the extension bits (Extra bits) of the Cache line accessed last time (i.e., the line corresponding to the access address of the first instruction in this scenario).
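A hedged C sketch of this recording flow is given below, reusing struct cache_line from the earlier sketch; the address register, the find_line() helper, and all names are illustrative assumptions, not part of the patent:

```c
#include <stdint.h>

/* Assumed lookup helper: returns the L1 I-Cache line matching an
   address, or NULL if the line is not resident. */
struct cache_line *find_line(uint64_t addr);

static uint64_t addr_reg;       /* the address register in FIG. 4        */
static int addr_reg_valid;      /* zero while the register is "empty"    */

/* Invoked by the Cache controller on every L1 I-Cache access. */
void record_offset(uint64_t cur_addr)
{
    if (!addr_reg_valid) {              /* first access of the sequence */
        addr_reg = cur_addr;
        addr_reg_valid = 1;
        return;
    }
    uint64_t last_addr = addr_reg;      /* previous access address             */
    addr_reg = cur_addr;                /* register now holds the new cur_addr */

    struct cache_line *prev = find_line(last_addr);
    if (prev != NULL) {
        /* Δ = cur_addr - last_addr goes into the extension bits of the
           line accessed last time. */
        prev->extra_offset = (int64_t)(cur_addr - last_addr);
        prev->extra_valid = 1;
    }
}
```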
In the embodiment of the present invention, when the CPU core 202 calls data, the instruction operating on the data is read and decoded immediately; every instruction has an access address, and the location where the instruction is stored can be found through that access address. Therefore, when calling data, the CPU core 202 needs to obtain the access address of the first instruction and access the L1I-Cache 2042 according to it; by executing the first instruction, the corresponding data can be called.
The CPU core 202 accesses the L1I-Cache 2042 via the access address of the first instruction. If the access address of the first instruction exists in the L1I-Cache 2042, the first Cache line corresponding to that access address is hit in the L1I-Cache 2042. The Cache controller 2044 then reads the offset information of the access address in the extension bits of the first Cache line, and calculates the access address of the second instruction from the offset information and the access address of the first instruction.
If the access address of the first instruction does not exist in the L1I-Cache 2042, the first Cache line corresponding to the access address of the first instruction will not be hit in the L1I-Cache 2042, and at this time, the Cache controller 2044 may search for the position of the first instruction according to the access address of the first instruction, and prefetch the first instruction into the L1I-Cache 2042.
After the access address of the second instruction has been calculated, the CPU core 202 may locate the second instruction through that access address and prefetch it; for example, when the second instruction is found in the L2I-Cache through the calculated access address, the second instruction is prefetched from the L2I-Cache into the L1I-Cache.
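Putting the hit and miss paths together, a minimal C sketch of this access-time behaviour might read as follows; lookup_l1(), fetch_into_l1(), and issue_prefetch() are assumed helpers (not functions defined by the patent), and struct cache_line is reused from the earlier sketch:

```c
#include <stdint.h>

struct cache_line *lookup_l1(uint64_t addr);  /* NULL on an L1 miss          */
void fetch_into_l1(uint64_t addr);            /* pull a line in from below   */
void issue_prefetch(uint64_t addr);           /* queue a prefetch request    */

void on_instruction_fetch(uint64_t first_addr)
{
    struct cache_line *line = lookup_l1(first_addr);

    if (line == NULL) {
        /* Miss: locate the first instruction and fetch it into the L1. */
        fetch_into_l1(first_addr);
        return;
    }
    /* Hit: if the extension bits hold recorded offset information,
       compute the access address of the second instruction and
       prefetch it, e.g. from the L2 I-Cache. */
    if (line->extra_valid) {
        uint64_t second_addr = first_addr + (uint64_t)line->extra_offset;
        issue_prefetch(second_addr);
    }
}
```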
The Cache line provided by the embodiment of the present invention also includes extension bits for storing the offset information of an access address, where the offset information is the address variation between the access addresses contained in two adjacent access instructions issued by the CPU core 202 to the L1I-Cache 2042. Therefore, upon determining that the first Cache line corresponding to the access address of the first instruction is hit in the L1I-Cache 2042, the Cache controller 2044 may read the offset information in the extension bits of the first Cache line and calculate the access address of the second instruction from the offset information and the access address of the first instruction; the CPU core 202 then prefetches the second instruction according to the access address calculated by the Cache controller 2044. Instructions prefetched by means of the offset information of the access address have a high probability of being accessed, so the prefetch accuracy of the instruction cache is improved.
The Cache 204 may also include a second-level instruction Cache L2I-Cache 2046. When the storage space in the L1I-Cache 2042 is full, the CPU Core 202 performs an eviction operation on the Cache space in the L1I-Cache 2042; for example, a second Cache line in the L1I-Cache 2042 (for convenience of explanation, the Cache line to be evicted is named the second Cache line, likewise below) is evicted into the L2I-Cache 2046. The Cache controller 2044 sets an evicted weight for the second Cache line, which marks the priority with which the second Cache line is evicted from the L2I-Cache 2046.
In the eviction phase described above, when the CPU core 202 performs an eviction operation on the L1I-Cache 2042, the Cache controller 2044 examines the extension bits (Extra bits) of the Cache line to be evicted: when the extension bits are null, no processing is required; when the extension bits are not null, the evicted weight of this Cache line in the L2I-Cache 2046 is set to the lowest, so that it is kept in the L2I-Cache 2046 as long as possible. Common eviction policies include the Least Recently Used (LRU) replacement algorithm, the Least Frequently Used (LFU) replacement algorithm, Adaptive Replacement Cache (ARC), and the like.
Taking LRU as an example: if the Cache uses an LRU eviction algorithm, in the embodiment of the present invention an evicted-weight value, the "age bits", is recorded for each Cache line stored in the L2I-Cache; every time one Cache line is accessed, the "age bits" of all other Cache lines in the L2I-Cache are incremented by 1. The Cache line with the largest "age bits" value then has the highest priority for eviction from the L2I-Cache (i.e., it is evicted first). When a Cache line carrying extension bits (Extra bits) is evicted from the L1I-Cache, its "age bits" value in the L2I-Cache need only be set to 0 to give it the lowest eviction priority (i.e., it is evicted last). Thus, the eviction priority of a Cache line in the L2I-Cache can be controlled by setting the size of its "age bits" value.
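As a sketch only, this age-bits bookkeeping could be modelled as below; the flat index, the L2_LINES capacity, and the function names are assumptions, and the LFU and ARC variants discussed next differ only in which bookkeeping value is written:

```c
/* Illustrative age-bits handling for an LRU-managed L2 I-Cache. */
#define L2_LINES 4096   /* assumed capacity */

static unsigned age_bits[L2_LINES];

void on_l2_access(unsigned idx)
{
    for (unsigned i = 0; i < L2_LINES; i++)
        if (i != idx)
            age_bits[i]++;          /* every other line ages by 1 */
    age_bits[idx] = 0;
}

unsigned pick_l2_victim(void)
{
    unsigned victim = 0;            /* largest age_bits == evicted first */
    for (unsigned i = 1; i < L2_LINES; i++)
        if (age_bits[i] > age_bits[victim])
            victim = i;
    return victim;
}

/* When a line carrying non-empty extension bits arrives from the L1,
   clearing its age keeps it in the L2 I-Cache as long as possible. */
void on_arrival_from_l1(unsigned idx, int has_extra_bits)
{
    if (has_extra_bits)
        age_bits[idx] = 0;
}
```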
Taking LFU as an example: if the Cache uses an LFU eviction algorithm, in the embodiment of the present invention each Cache line stored in the L2I-Cache is given a counter, and whenever a Cache line is accessed, its counter is incremented by 1. The Cache line with the smallest counter value then has the highest priority for eviction from the L2I-Cache (i.e., it is evicted first). When a Cache line carrying extension bits (Extra bits) is evicted from the L1I-Cache, its counter in the L2I-Cache is set to the maximum of all counters currently in the L2I-Cache, giving it the lowest eviction priority (i.e., it is evicted last). Thus, the eviction priority of a Cache line in the L2I-Cache can be controlled by setting its counter value.
Taking ARC as an example: the ARC eviction algorithm is a compromise between LRU and LFU. The Cache contains two tables, T1 and T2; T1 stores Cache lines managed by LRU and T2 stores Cache lines managed by LFU, and together they constitute the whole Cache. The ARC algorithm adaptively adjusts each eviction choice based on the history of Cache lines evicted from T1 or T2. A Cache line located in the middle of the L2I-Cache, i.e., away from the to-be-evicted regions of both T1 and T2, has the lowest priority for eviction (i.e., it is evicted last). When a Cache line carrying extension bits (Extra bits) is evicted from the L1I-Cache, it need only be inserted in the middle of the L2I-Cache, away from the to-be-evicted regions of both T1 and T2, so that it has the lowest eviction priority in the L2I-Cache (i.e., it is evicted last). Thus, the eviction priority of a Cache line in the L2I-Cache can be controlled by setting the line's location within the L2I-Cache region.
Inheritance and sharing of the prefetching experience can be realized for evicted Cache lines: when it is determined that the access address of an instruction executed by the CPU core 202 corresponds to the second Cache line evicted to the L2I-Cache 2046, the second Cache line is prefetched from the L2I-Cache 2046 back into the L1I-Cache 2042.
For example, when the access address of an instruction executed by the CPU core 202 corresponds to the second Cache line evicted to the L2I-Cache 2046, that Cache line may be prefetched from the L2I-Cache, so that the Cache line generated by the CPU core 202, with its stored extension bits (Extra bits), is obtained directly, realizing the inheritance of the prefetching "experience".
Based on the processor chip 200 provided in FIG. 1, the processor chip 200 may further integrate a plurality of processor cores (CPU cores), each CPU core being configured with its own independently accessible L1I-Cache, while all CPU cores share one L2I-Cache 2046, realizing resource sharing of the L2I-Cache 2046. When the processor chip has a plurality of CPU cores, an evicted Cache line can support multi-core sharing of the prefetching experience: for example, when the access address of an instruction executed by another processor core (e.g., CPU core 2) also corresponds to the second Cache line evicted to the L2I-Cache 2046, that Cache line can likewise be prefetched from the L2I-Cache 2046, realizing the sharing of the multi-core prefetching "experience".
An embodiment of the present invention further provides a method for prefetching an instruction cache, which is performed by the processor chip 200 in FIG. 1 when running; its flowchart is shown in FIG. 2.
402. The CPU core obtains the access address of the first instruction and accesses the L1I-Cache according to the access address of the first instruction.
When the CPU core calls data, it reads the instruction operating on that data for immediate decoding; each instruction corresponds to an access address, and the location where the instruction is stored can be found through that access address. Therefore, when calling data, the CPU core needs to acquire the access address of a first instruction and access the L1I-Cache according to it; by executing the first instruction, the corresponding data can be called.
To improve the continuity of instruction prefetching, the initialized Cache may be configured in advance. Optionally, before the CPU core acquires the access address of the first instruction, the method further includes:
and the Cache controller calculates offset information between the access address of the first instruction and the access address of the second instruction, and writes the offset information into an expansion bit of the Cache line corresponding to the access address of the first instruction.
For example, referring to the flow of calculating the offset information of an access address shown in FIG. 4, the CPU Core accesses a Cache line in the L1I-Cache, and the Cache controller checks whether the address register is empty: if it is empty, this Cache line is the first one in the access sequence, and the Cache controller writes the current access address cur_addr into the address register and waits for the next Cache line access; if it is not empty, the Cache controller assigns the value of the address register (i.e., the cur_addr of the previous access) to last_addr, and writes the current access address into the address register as the new cur_addr. For the related description, refer to the apparatus part; it is not repeated here.
404. And the Cache controller reads the offset information of the access address in the extension bit of the first Cache line when determining that the first Cache line corresponding to the access address of the first instruction in the L1I-Cache is hit, and calculates the access address of the second instruction according to the offset information of the access address and the access address of the first instruction.
The CPU core accesses the L1I-Cache through the access address of the first instruction, and if the access address of the first instruction exists in the L1I-Cache, the first Cache line corresponding to the access address of the first instruction is hit in the L1I-Cache. At this time, the Cache controller reads the offset information of the access address in the extension bit of the first Cache line, and calculates the access address of the second instruction according to the offset information of the access address and the access address of the first instruction.
If the access address of the first instruction does not exist in the L1I-Cache, the first Cache line corresponding to the access address of the first instruction cannot be hit in the L1I-Cache, and at this time, the Cache controller can search the position of the first instruction according to the access address of the first instruction and pre-fetch the first instruction into the L1I-Cache.
406. And the CPU core executes the pre-fetching of the second instruction according to the access address of the second instruction calculated by the Cache controller.
After the access address of the second instruction has been calculated, the CPU core may locate the second instruction through that access address and prefetch it; for example, when the second instruction is found in the L2I-Cache through the calculated access address, the second instruction is prefetched from the L2I-Cache into the L1I-Cache.
For example, after the Cache controller calculates the offset information between the access address of a first instruction and the access address of a second instruction and writes it into the extension bits of the Cache line corresponding to the access address of the first instruction, the address variations Δn written into the extension bits (Extra bits) of the Cache lines are as shown in Table 1 below:
TABLE 1

Extra bits | address    | hit/miss | prefetch addr
Δ0         | A          | yes      | A + Δ0
Δ1         | B = A + Δ0 | yes      | B + Δ1
Δ2         | C = B + Δ1 | no       | none
Δ3         | D          | yes      | D + Δ3
Δ4         | E = D + Δ3 | yes      | E + Δ4
Table 1 lists the address variations Δn written into the extension bits (Extra bits) of five Cache lines, namely Δ0, Δ1, Δ2, Δ3, and Δ4.
Taking the address variations Δn in Table 1 as an example: the CPU core needs to access the Cache line at address A; the Cache line is hit in the L1I-Cache, the address variation Δ0 stored in its Extra bits field is read, the address of the next Cache line is calculated as A + Δ0 = B, and the instruction corresponding to access address B is prefetched into the L1I-Cache.
The CPU core needs to access the Cache line at address B; the Cache line is hit in the L1I-Cache, the address variation Δ1 stored in its Extra bits field is read, the address of the next Cache line is calculated as B + Δ1 = C, and the instruction corresponding to access address C is prefetched into the L1I-Cache.
The CPU core needs to access the Cache line at address C; the Cache line misses in the L1I-Cache, so no prefetch is performed.
The CPU core needs to access the Cache line at address D; the Cache line is hit in the L1I-Cache, the address variation Δ3 stored in its Extra bits field is read, the address of the next Cache line is calculated as D + Δ3 = E, and the instruction corresponding to access address E is prefetched into the L1I-Cache.
The CPU core needs to access the Cache line at address E; the Cache line is hit in the L1I-Cache, the address variation Δ4 stored in its Extra bits field is read, the address of the next Cache line is calculated as E + Δ4, and the instruction corresponding to access address E + Δ4 is prefetched into the L1I-Cache.
During the prefetch phase, an appropriate prefetch depth (prefetch_depth) may be set. For example, when the current Cache line is accessed, the access address of the instruction likely to be executed on the next data access is calculated through the offset information in the Cache line; if the Cache line corresponding to that access address is hit in the L1I-Cache, the instruction is prefetched into the L1I-Cache through its access address. According to the prefetch depth, prefetching then continues from the prefetched instruction: the offset information of the Cache line holding the just-prefetched instruction is read in turn, and the access address of the next likely instruction is prefetched (i.e., cyclic prefetching). The choice of prefetch depth may differ by workload type, as different workloads suit different depths, so tests may be needed for typical applications and some default values set. Note that the prefetch depth must be set properly: too small a depth weakens the prefetch effect and reduces locality; too large a depth may cause Cache pollution or cause prefetched data to be overwritten.
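A minimal C sketch of such depth-limited, chained prefetching follows, reusing the assumed lookup_l1() and issue_prefetch() helpers and struct cache_line from the earlier sketches; the loop structure and names are illustrative only:

```c
#include <stdint.h>

/* Follow the recorded offsets for at most prefetch_depth steps,
   issuing one prefetch per step; the chain stops at the first line
   without valid extension bits. */
void chained_prefetch(uint64_t start_addr, unsigned prefetch_depth)
{
    uint64_t addr = start_addr;

    for (unsigned d = 0; d < prefetch_depth; d++) {
        struct cache_line *line = lookup_l1(addr);
        if (line == NULL || !line->extra_valid)
            break;                              /* chain ends: no recorded offset */
        addr += (uint64_t)line->extra_offset;   /* next predicted access address  */
        issue_prefetch(addr);                   /* pull it toward the L1 I-Cache  */
    }
}
```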
The Cache line provided by the embodiment of the present invention also includes extension bits for storing the offset information of an access address, where the offset information is the address variation between the access addresses contained in two adjacent access instructions issued by the processor core to the L1I-Cache. Therefore, upon determining that the first Cache line corresponding to the access address of the first instruction is hit in the L1I-Cache, the Cache controller may read the offset information in the extension bits of the first Cache line and calculate the access address of the second instruction from the offset information and the access address of the first instruction; the CPU core then prefetches the second instruction according to the access address calculated by the Cache controller. Instructions prefetched by means of the offset information of the access address have a high probability of being accessed, so the prefetch accuracy of the instruction cache is improved.
The Cache may also include a second-level instruction Cache L2I-Cache. When the storage space in the L1I-Cache is full, the CPU Core performs an eviction operation on the Cache space in the L1I-Cache; for example, a second Cache line in the L1I-Cache (for convenience of explanation, the Cache line to be evicted is named the second Cache line, likewise below) is evicted into the L2I-Cache. The Cache controller sets an evicted weight for the second Cache line, which marks the priority with which the second Cache line is evicted from the L2I-Cache. For the related description of setting the evicted weight in the eviction phase, refer to the apparatus part; it is not detailed here.
When it is determined that the access address of an instruction executed by the CPU Core corresponds to the second Cache line evicted to the L2I-Cache, the second Cache line is prefetched from the L2I-Cache into the L1I-Cache. For the related description, refer to the apparatus part; it is not repeated here.
When a plurality of processor cores exist in the processor chip, an evicted Cache line can support multi-core sharing of the prefetching experience: for example, when the access address of an instruction executed by another processor core (e.g., CPU Core 2) also corresponds to the second Cache line evicted to the L2I-Cache, that Cache line can likewise be prefetched from the L2I-Cache, realizing the sharing of the multi-core prefetching "experience". For the related description, refer to the apparatus part; it is not repeated here.
An embodiment of the present invention further provides a computer storage medium, where the computer storage medium may store a program, and when the program is executed, the computer storage medium includes some or all of the steps of the instruction cache prefetching method described in the foregoing method embodiment.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A processor chip, comprising a processor core (CPU core) and a Cache memory (Cache), wherein the Cache comprises a first-level instruction Cache L1I-Cache and a Cache controller, the L1I-Cache comprises at least one Cache unit (Cache line), and each Cache line comprises a tag field, data, and a flag bit, wherein
each Cache line further comprises an extension bit used for storing offset information of an access address, wherein the offset information of the access address is the address variation of the access address contained in two adjacent access instructions sent to the L1I-Cache by the CPU core;
the CPU core is used for acquiring an access address of a first instruction and accessing the L1I-Cache according to the access address of the first instruction;
the Cache controller is configured to read offset information of an access address in an extension bit of a first Cache line when the first Cache line corresponding to the access address of the first instruction in the L1I-Cache is hit, and calculate an access address of a second instruction according to the offset information of the access address and the access address of the first instruction;
and the CPU core is further used for executing the pre-fetching of the second instruction according to the access address of the second instruction calculated by the Cache controller.
2. The chip of claim 1,
the Cache controller is further configured to, when determining that a first Cache line corresponding to an access address of the first instruction in the L1I-Cache is not hit, find a position of the first instruction according to the access address of the first instruction, and prefetch the first instruction into the L1I-Cache.
3. The chip according to claim 1 or 2,
and the Cache controller is further configured to calculate offset information between the access address of the first instruction and the access address of the second instruction before the access address of the first instruction is acquired, and write the offset information into an extension bit of the Cache line corresponding to the access address of the first instruction.
4. The chip of any of claims 1 to 2, wherein the Cache further comprises a level two instruction Cache L2I-Cache,
the Cache controller is further configured to set an evicted weight for a second Cache line in the L1I-Cache when the second Cache line is evicted to the L2I-Cache, where the evicted weight is used to mark a priority of eviction of the second Cache line in the L2I-Cache.
5. The chip of claim 4,
the Cache controller is further configured to, when it is determined that an access address of the CPU Core execution instruction corresponds to the second Cache line, prefetch the second Cache line, which is evicted to the L2I-Cache, to the L1I-Cache.
6. A prefetching method for an instruction Cache, applied to a processor chip, wherein the processor chip comprises a processor core (CPU core) and a Cache memory (Cache), the Cache comprises a first-level instruction Cache L1I-Cache and a Cache controller, the L1I-Cache comprises at least one Cache unit (Cache line), and each Cache line comprises a tag field, data, and a flag bit, wherein
each Cache line further comprises an extension bit used for storing offset information of an access address, wherein the offset information of the access address is the address variation of the access address contained in two adjacent access instructions sent to the L1I-Cache by the CPU core; the method comprises the following steps:
the CPU core acquires an access address of a first instruction and accesses the L1I-Cache according to the access address of the first instruction;
when determining that a first Cache line in the L1I-Cache corresponding to the access address of the first instruction is hit, the Cache controller reads the offset information of the access address in an extension bit of the first Cache line, and calculates an access address of a second instruction from the offset information and the access address of the first instruction; and
the CPU core prefetches the second instruction according to the access address of the second instruction calculated by the Cache controller.
7. The method of claim 6, further comprising:
when the Cache controller determines that a first Cache line corresponding to the access address of the first instruction in the L1I-Cache is not hit, locating the first instruction according to the access address of the first instruction, and prefetching the first instruction into the L1I-Cache.
8. The method of claim 6 or 7, wherein, before the CPU core acquires the access address of the first instruction, the method further comprises:
the Cache controller calculating offset information between the access address of the first instruction and the access address of the second instruction, and writing the offset information into an extension bit of the Cache line corresponding to the access address of the first instruction.
9. The method of claim 6 or 7, wherein the Cache further comprises a second-level instruction Cache (L2I-Cache), and the method further comprises:
when the Cache controller evicts a second Cache line in the L1I-Cache to the L2I-Cache, setting an evicted weight for the second Cache line, the evicted weight being used to mark the eviction priority of the second Cache line in the L2I-Cache.
10. The method of claim 9, further comprising:
when the Cache controller determines that an access address of an instruction executed by the CPU core corresponds to the second Cache line, prefetching the second Cache line, which has been evicted to the L2I-Cache, back into the L1I-Cache.
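Tying the method claims together, a small driver (assuming the CacheLine type and next_fetch_addr from the sketches above; the concrete addresses are arbitrary) walks the hit path of claim 6: a line trained with a +64-byte offset yields the next prefetch address.

#include <stdio.h>

int main(void)
{
    /* A line whose extension bits were trained to a +64-byte delta. */
    CacheLine line = { .tag = 0x40, .valid = true, .offset = 64 };

    uint32_t addr1 = 0x1000;                        /* first instruction */
    uint32_t addr2 = next_fetch_addr(&line, addr1); /* second instruction */

    printf("prefetch second instruction at 0x%x\n", (unsigned)addr2); /* prints 0x1040 */
    return 0;
}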
CN201610397009.XA 2016-06-07 2016-06-07 Processor chip and instruction cache prefetching method Active CN107479860B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610397009.XA CN107479860B (en) 2016-06-07 2016-06-07 Processor chip and instruction cache prefetching method
PCT/CN2017/087091 WO2017211240A1 (en) 2016-06-07 2017-06-02 Processor chip and method for prefetching instruction cache

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610397009.XA CN107479860B (en) 2016-06-07 2016-06-07 Processor chip and instruction cache prefetching method

Publications (2)

Publication Number Publication Date
CN107479860A CN107479860A (en) 2017-12-15
CN107479860B true CN107479860B (en) 2020-10-09

Family

ID=60578348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610397009.XA Active CN107479860B (en) 2016-06-07 2016-06-07 Processor chip and instruction cache prefetching method

Country Status (2)

Country Link
CN (1) CN107479860B (en)
WO (1) WO2017211240A1 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765034B (en) 2018-07-27 2022-06-14 华为技术有限公司 Data prefetching method and terminal equipment
CN111143242B (en) * 2018-11-02 2022-05-10 华为技术有限公司 Cache prefetching method and device
CN111209043B (en) * 2018-11-21 2022-07-12 华夏芯(北京)通用处理器技术有限公司 Method for realizing instruction prefetching in front-end pipeline by using look-ahead pointer method
WO2020168522A1 (en) * 2019-02-21 2020-08-27 华为技术有限公司 System on chip, routing method for access command and terminal
CN110825442B (en) * 2019-04-30 2021-08-06 成都海光微电子技术有限公司 Instruction prefetching method and processor
CN112148665B (en) * 2019-06-28 2024-01-09 深圳市中兴微电子技术有限公司 Cache allocation method and device
CN110941449A (en) * 2019-11-15 2020-03-31 新华三半导体技术有限公司 Cache block processing method and device and processor chip
CN111475203B (en) * 2020-04-03 2023-03-14 小华半导体有限公司 Instruction reading method for processor and corresponding processor
CN111538679B (en) * 2020-05-12 2023-06-06 中国电子科技集团公司第十四研究所 Processor data prefetching method based on embedded DMA
CN113973502B (en) * 2020-05-25 2023-11-17 华为技术有限公司 Cache collision processing method and device
CN112148366A (en) * 2020-09-14 2020-12-29 上海华虹集成电路有限责任公司 FLASH acceleration method for reducing power consumption and improving performance of chip
CN112527395B (en) * 2020-11-20 2023-03-07 海光信息技术股份有限公司 Data prefetching method and data processing apparatus
CN112612728B (en) * 2020-12-17 2022-11-11 海光信息技术股份有限公司 Cache management method, device and equipment
CN116661695B (en) * 2023-06-02 2024-03-15 灵动微电子(苏州)有限公司 Bus acceleration method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6101577A (en) * 1997-09-15 2000-08-08 Advanced Micro Devices, Inc. Pipelined instruction cache and branch prediction mechanism therefor
CN101495962A (en) * 2006-08-02 2009-07-29 高通股份有限公司 Method and apparatus for prefetching non-sequential instruction addresses
CN101901129A (en) * 2004-12-02 2010-12-01 英特尔公司 Method and apparatus with high performance way access physical memory from CPU or processing unit
CN102460403A (en) * 2009-06-11 2012-05-16 飞思卡尔半导体公司 Processor and method for dynamic and selective alteration of address translation

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060200655A1 (en) * 2005-03-04 2006-09-07 Smith Rodney W Forward looking branch target address caching
CN101894010B (en) * 2009-08-24 2013-04-17 威盛电子股份有限公司 Micro-processor and operation method suitable for the same
WO2011055406A1 (en) * 2009-11-04 2011-05-12 Hitachi, Ltd. Data storage system and method therefor
CN103186474B (en) * 2011-12-28 2016-09-07 瑞昱半导体股份有限公司 The method that the cache of processor is purged and this processor
CN103279324B (en) * 2013-05-29 2015-10-21 华为技术有限公司 A kind of method and apparatus instruction in internal memory being prefetched in advance high-speed cache

Also Published As

Publication number Publication date
WO2017211240A1 (en) 2017-12-14
CN107479860A (en) 2017-12-15

Similar Documents

Publication Publication Date Title
CN107479860B (en) Processor chip and instruction cache prefetching method
US9141553B2 (en) High-performance cache system and method
US8041897B2 (en) Cache management within a data processing apparatus
JP3618385B2 (en) Method and system for buffering data
US6105111A (en) Method and apparatus for providing a cache management technique
US9582282B2 (en) Prefetching using a prefetch lookup table identifying previously accessed cache lines
US9798668B2 (en) Multi-mode set associative cache memory dynamically configurable to selectively select one or a plurality of its sets depending upon the mode
US6782453B2 (en) Storing data in memory
US20180300258A1 (en) Access rank aware cache replacement policy
US9672161B2 (en) Configuring a cache management mechanism based on future accesses in a cache
US20160357664A1 (en) Multi-mode set associative cache memory dynamically configurable to selectively allocate into all or a subset of its ways depending on the mode
US20110320720A1 (en) Cache Line Replacement In A Symmetric Multiprocessing Computer
US11163573B2 (en) Hierarchical metadata predictor with periodic updates
WO2010004497A1 (en) Cache management systems and methods
US10275358B2 (en) High-performance instruction cache system and method
US20210124586A1 (en) Apparatus and method for handling incorrect branch direction predictions
US20170046278A1 (en) Method and apparatus for updating replacement policy information for a fully associative buffer cache
CN108874691B (en) Data prefetching method and memory controller
CN104424132B (en) High performance instruction cache system and method
JP2023500590A (en) Cache access measurement deskew
JP2013041414A (en) Storage control system and method, and replacement system and method
CN110889147B (en) Method for resisting Cache side channel attack by using filling Cache
Bhat et al. Cache Hierarchy In Modern Processors And Its Impact On Computing
CN111198827B (en) Page table prefetching method and device
KR101697515B1 (en) Cache replacement algorithm for last-level caches by exploiting tag-distance correlation of cache lines and embedded systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant