CN118171612A - Method, device, storage medium and program product for optimizing instruction cache - Google Patents


Info

Publication number
CN118171612A
Authority
CN
China
Prior art keywords
thread
instruction
thread bundle
bundles
bundle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410598217.0A
Other languages
Chinese (zh)
Inventor
Name not to be published, at the inventor's request
Current Assignee
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Beijing Bilin Technology Development Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202410598217.0A priority Critical patent/CN118171612A/en
Publication of CN118171612A publication Critical patent/CN118171612A/en
Pending legal-status Critical Current


Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

An embodiment of the present application provides a method, device, storage medium and program product for optimizing an instruction cache, relating to the technical field of chip design verification. The method includes: determining the priority of each of a plurality of thread bundles based on each bundle's historical instruction fetch count, where the historical instruction fetch count is inversely related to the priority; and then determining, based on those priorities, at least one thread bundle that is granted arbitration permission. Thread bundles whose execution progress lags therefore obtain arbitration permission first and fetch their corresponding target instructions, so the execution progress of the thread bundles gradually converges, improving the locality and hit rate of the instruction cache and thereby its performance. Because the execution progress of the thread bundles is better synchronized, other resources such as the data cache can be fully utilized and repeated external accesses are reduced; at the same time, the problem of a single thread bundle executing too slowly while the other thread bundles wait is avoided, further improving hardware resource utilization.

Description

Method, device, storage medium and program product for optimizing instruction cache
Technical Field
Embodiments of the present application relate to the technical field of chip design verification, and in particular to a method, device, storage medium and program product for optimizing an instruction cache.
Background
To improve the performance of the overall chip system, a cache component is placed between the processor and external memory to bridge the gap between the processor's operating frequency and the speed at which data and instructions can be accessed. Multi-core processor architectures typically adopt a multi-level cache structure, in which the first-level cache includes an instruction cache (I-Cache) and a data cache (D-Cache), and the second-level cache is designed as a cache shared by multiple cores.
An instruction Cache (I-Cache) is a memory that is close to a processor, from which instructions are frequently fetched and executed by the processor when executing programs. When a processor executes programs of multiple thread bundles in parallel, the synchronicity of execution progress of the multiple thread bundles will affect the performance of the instruction cache.
In the related art, an instruction scheduling module performs fair arbitration, fetches the instructions of each thread bundle from the instruction cache in turn, and sends them to the execution unit for execution, so as to control how synchronized the thread bundles' execution progress is. However, when execution progress is controlled by an instruction scheduling module located downstream of the instruction cache, the control effect is poor, which in turn degrades instruction cache performance.
Disclosure of Invention
Embodiments of the present application provide a method, device, storage medium and program product for optimizing an instruction cache, used to synchronize the execution progress of thread bundles and thereby improve instruction cache performance.
In one aspect, an embodiment of the present application provides a method for optimizing instruction cache, including:
acquiring instruction fetching requests of a plurality of thread bundles;
determining the priority of each of the plurality of thread bundles based on each bundle's historical instruction fetch count, wherein the historical instruction fetch count is inversely related to the priority;
determining, based on the respective priorities of the plurality of thread bundles, at least one thread bundle granted arbitration permission;
and executing the instruction fetch request of the at least one thread bundle to obtain the corresponding target instruction.
In one aspect, an embodiment of the present application provides an apparatus for optimizing instruction cache, including:
the acquisition module is used for acquiring instruction fetching requests of a plurality of thread bundles;
the processing module is used for determining the priority of each of the plurality of thread bundles based on the historical instruction fetching number of each of the plurality of thread bundles, wherein the historical instruction fetching number is inversely related to the priority;
an arbitration module, configured to determine, based on the priorities of the plurality of thread bundles, at least one thread bundle granted arbitration permission;
and the execution module is used for executing the instruction fetching request of the at least one thread bundle and acquiring a corresponding target instruction.
Optionally, the arbitration module is specifically configured to:
if the historical instruction fetch counts of the plurality of thread bundles include a count that has reached the count upper limit, setting the priority of the thread bundle corresponding to that count to a blocking priority;
and excluding the thread bundles with the blocking priority from the plurality of thread bundles, and determining at least one thread bundle granted arbitration permission from the remaining thread bundles according to their respective priorities.
Optionally, the historical instruction fetching number of each of the plurality of thread bundles is recorded by a dynamic synchronization matrix, wherein the dynamic synchronization matrix includes: a plurality of recording areas, each recording area corresponding to a thread bundle; the location of the record mark in each record area is used to characterize the historical instruction fetch count of the corresponding thread bundle.
Optionally, the processing module is further configured to:
if the record mark in a recording area has moved to the highest bit of that recording area, the historical instruction fetch count of the corresponding thread bundle has reached the count upper limit.
Optionally, the processing module is further configured to:
when none of the record marks in the plurality of recording areas is located at the lowest bit, moving each of the record marks one bit toward the lowest bit.
Optionally, the execution module is specifically configured to:
when multiple instruction fetch requests with the same instruction address exist among the instruction fetch requests of the at least one thread bundle, merging those requests into one target request;
and executing the target request to obtain the corresponding target instruction from the instruction cache or the upper-level cache.
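The merging of same-address fetch requests described above can be illustrated with a minimal Python sketch. This is not part of the patent text; the function and parameter names are purely illustrative, and addresses are treated as plain integers:

```python
from collections import defaultdict

def merge_fetch_requests(requests):
    """Merge fetch requests with the same instruction address into one target request.

    requests: iterable of (bundle_id, instruction_address) pairs from the
    thread bundles that hold arbitration permission this beat.
    Returns a list of (instruction_address, [bundle_ids]) target requests,
    so each address is read from the instruction cache (or the upper-level
    cache on a miss) only once.
    """
    merged = defaultdict(list)
    for bundle_id, addr in requests:
        merged[addr].append(bundle_id)
    return list(merged.items())
```

For example, if thread bundles 0 and 1 both request address 0x100 while bundle 2 requests 0x200, the two 0x100 requests collapse into a single target request served to both bundles.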
Optionally, the execution module is specifically configured to:
for the instruction fetch request of each of the at least one thread bundle, performing the following operation:
executing the thread bundle's instruction fetch request and obtaining the corresponding target instruction from the instruction cache or the upper-level cache.
In one aspect, an embodiment of the present application provides a computer device, including a memory, a processor chip, and a computer program stored in the memory and capable of running on the processor chip, where the processor chip implements the steps of the method for optimizing instruction cache when the processor chip executes the computer program.
In one aspect, embodiments of the present application provide a computer readable storage medium storing a computer program executable by a computer device, which when run on the computer device, causes the computer device to perform the steps of the above-described method of optimizing instruction caching.
In one aspect, embodiments of the present application provide a computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which, when executed by a computer device, cause the computer device to perform the steps of the above-described method of optimizing instruction caching.
In an embodiment of the present application, the priority of each of a plurality of thread bundles is determined based on each bundle's historical instruction fetch count, where the count is inversely related to the priority. At least one thread bundle granted arbitration permission is then determined based on those priorities, so thread bundles whose execution progress lags obtain arbitration permission first and fetch their corresponding target instructions. The execution progress of the thread bundles thus gradually converges, improving the locality and hit rate of the instruction cache and thereby its performance.
Second, because the execution progress of the thread bundles is better synchronized, other resources such as the data cache can be fully utilized and repeated external accesses are reduced. In addition, the problem of a single thread bundle executing too slowly while the other thread bundles wait for a long time is avoided, further improving hardware resource utilization.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it will be apparent that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of a processor chip according to an embodiment of the present application;
Fig. 2 is a first flowchart of a method for optimizing an instruction cache according to an embodiment of the present application;
Fig. 3 is a first flowchart of a counting method based on a dynamic synchronization matrix according to an embodiment of the present application;
Fig. 4 is a second flowchart of a counting method based on a dynamic synchronization matrix according to an embodiment of the present application;
Fig. 5 is a third flowchart of a counting method based on a dynamic synchronization matrix according to an embodiment of the present application;
Fig. 6 is a fourth flowchart of a counting method based on a dynamic synchronization matrix according to an embodiment of the present application;
Fig. 7 is a fifth flowchart of a counting method based on a dynamic synchronization matrix according to an embodiment of the present application;
Fig. 8 is a second flowchart of a method for optimizing an instruction cache according to an embodiment of the present application;
Fig. 9 is a sixth flowchart of a counting method based on a dynamic synchronization matrix according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of an apparatus for optimizing an instruction cache according to an embodiment of the present application;
Fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantageous effects of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, which is a block diagram of a processor chip to which an embodiment of the present application is applied, the processor chip 100 includes at least: a plurality of computing cores 101, an instruction cache 102, and an upper level cache 103.
The instruction cache 102 serves multiple compute cores 101 simultaneously. Multiple compute cores 101 may execute programs of multiple thread bundles in parallel, a thread bundle referring to: multiple threads that are bound together and executed simultaneously.
The program for each thread bundle runs on one of the compute cores 101. Of course, one computing core 101 may execute programs of a plurality of thread bundles in parallel, and the present application is not limited in this regard.
In some cases, the instruction cache 102 may achieve optimal performance when the programs of the plurality of thread bundles are the same and the execution progress of the plurality of thread bundles is the same.
The instruction cache 102 is a temporary memory that is smaller than the upper-level cache 103 but exchanges data with the processor faster. The instructions in the instruction cache 102 are fetched from the upper-level cache 103 and are those the computing core 101 is about to access in the short term. Thus, when the computing core 101 needs an instruction, it can fetch it directly from the instruction cache 102, bypassing the upper-level cache 103 and increasing the read speed.
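The lookup path just described can be sketched in a few lines of Python. This is not part of the patent; the dictionaries standing in for the two cache levels and the function name are illustrative assumptions:

```python
def fetch_instruction(addr, i_cache, upper_cache):
    """Return the instruction at addr, filling the instruction cache on a miss.

    i_cache and upper_cache model the instruction cache 102 and the
    upper-level cache 103 as address -> instruction mappings.
    """
    if addr in i_cache:            # hit: serve directly from the instruction cache
        return i_cache[addr]
    instr = upper_cache[addr]      # miss: go to the slower upper-level cache
    i_cache[addr] = instr          # fill the instruction cache so the next access hits
    return instr
```

The better synchronized the thread bundles' execution progress, the more often their fetches land on addresses already filled by this fast path.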
The upper-level cache 103 may be the second-level cache in a multi-level cache structure, or may be a video memory.
The processor chip 100 of the present application may include other structures besides the above-described structures, and the present application is not particularly limited thereto.
The processor chip 100 may be: central processing units (Central Processing Unit, CPU for short), graphics processing units (Graphics Processing Unit, GPU for short), general-purpose graphics processing units (General-purpose computing on graphics processing units, GPGPU for short), domain-specific architecture (Domain Specific Architecture, DSA for short), etc.
In practical applications, when a processor executes the programs of multiple thread bundles in parallel, how synchronized the execution progress of those thread bundles is affects the performance of the instruction cache. The more synchronized the execution progress of the thread bundles, the better the access locality of the instruction cache, and hence the better its performance. The more dispersed the execution progress of the thread bundles, the worse the access locality of the instruction cache, and the worse its performance.
Under the related technology, the instruction scheduling module is used for fairly arbitrating, sequentially taking the instructions of each thread bundle from the instruction cache, and transmitting the instructions to the execution unit for execution so as to control the synchronicity of the execution progress of the thread bundles.
However, the execution times differ when the execution units actually execute the instructions of each thread bundle. For example, some instructions have long execution times because certain operations incur long delays, while other instructions execute quickly, i.e., their execution time is shorter. Because the instructions of the thread bundles take different amounts of time to execute, the execution progress of the thread bundles gradually diverges. That is, when the execution progress of the thread bundles is controlled by an instruction scheduling module located downstream of the instruction cache, the control effect is poor, which in turn affects instruction cache performance.
In view of this, the present application provides a flow of an optimization method of instruction cache based on the architecture diagram of the processor chip shown in fig. 1, and as shown in fig. 2, the flow of the method is executed by the processor chip, and includes the following steps:
In step 201, instruction fetch requests for a plurality of thread bundles are obtained.
Specifically, the thread bundle refers to: multiple threads that are bound together and executed simultaneously. A plurality of thread bundles executing in parallel on a processor chip issue instruction fetch requests to an instruction cache. The plurality of thread bundles may run on one compute core in the processor chip or may run on multiple compute cores in the processor chip.
Step 202, determining priorities of the plurality of thread bundles based on respective historical instruction fetching numbers of the plurality of thread bundles, wherein the historical instruction fetching numbers are inversely related to the priorities.
Specifically, at the instruction cache entry, the historical instruction fetch count of each of the plurality of thread bundles is queried; this count represents the thread bundle's execution progress. The larger a thread bundle's historical instruction fetch count, the faster its execution progress, and accordingly the lower the priority assigned to it. The smaller the count, the slower its execution progress, and accordingly the higher the priority assigned to it.
A plurality of priorities are set in advance based on the inverse relation between historical instruction fetch count and priority, with each priority corresponding to a fetch-count range and each range containing one or more historical instruction fetch counts. After the historical instruction fetch count of a thread bundle is determined, the matching range is selected from the plurality of ranges based on that count, and the priority corresponding to the matching range is taken as the thread bundle's priority.
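This range-to-priority mapping can be sketched as follows. The sketch is not part of the patent; the concrete thresholds are illustrative (they mirror the example priority rule used with Fig. 6 later in this description), and a smaller priority number here means a higher priority:

```python
# (priority, inclusive fetch-count range); fewer fetches -> higher priority.
PRIORITY_RANGES = [
    (1, range(0, 4)),  # first (highest) priority: 0-3 historical fetches
    (2, range(4, 7)),  # second priority: 4-6 historical fetches
    (3, range(7, 8)),  # lowest priority: 7 fetches (the count upper limit)
]

def priority_of(fetch_count):
    """Return the priority whose fetch-count range contains fetch_count."""
    for prio, counts in PRIORITY_RANGES:
        if fetch_count in counts:
            return prio
    # counts beyond the table saturate at the lowest priority
    return PRIORITY_RANGES[-1][0]
```

A thread bundle that has fetched few instructions thus lands in a high-priority range, implementing the inverse relation between fetch count and priority.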
In some embodiments, the historical instruction fetch count for each of the plurality of thread bundles is recorded by a dynamic synchronization matrix, wherein the dynamic synchronization matrix comprises: a plurality of recording areas, each recording area corresponding to a thread bundle; the location of the record mark in each record area is used to characterize the historical instruction fetch count of the corresponding thread bundle.
Specifically, each recording area in the dynamic synchronization matrix may be one or more rows in the dynamic synchronization matrix, one or more columns in the dynamic synchronization matrix, or an area obtained by other division manners. The recording mark may be a symbol such as a character, a number, or the like. In the reset phase, the numerical value of each position in the dynamic synchronization matrix is set to 0. Along with the gradual increase of the historical instruction fetching number of the thread bundle, the record mark in the record area corresponding to the thread bundle gradually moves from the lowest position to the highest position, and the number of the bits moved each time can be set according to actual conditions.
For example, as shown in fig. 3, a dynamic synchronization matrix is set to record the historical instruction fetch counts of 4 thread bundles. Each row of the matrix corresponds to one thread bundle and contains 8 bits (from right to left, bit 0 to bit 7; bit 0 is the lowest bit and bit 7 is the highest bit). The 4 thread bundles are thread bundle 0, thread bundle 1, thread bundle 2 and thread bundle 3, and the record mark is 1.
In the reset phase, the numerical value of each position in the dynamic synchronization matrix is set to 0.
When thread bundle 2 and thread bundle 3 obtain arbitration permissions, bit 0 of each corresponding row of thread bundle 2 and thread bundle 3 in the dynamic synchronization matrix is set to 1. That is, the record mark (1) is placed at the lowest bit of the row corresponding to each of the thread bundle 2 and the thread bundle 3.
Next, when thread bundle 0 and thread bundle 3 obtain arbitration grants, bit 0 of the row corresponding to thread bundle 0 in the dynamic synchronization matrix is set to 1. In the row corresponding to thread bundle 3, record mark (1) is shifted left by 1 bit.
In some embodiments, if the recording mark in the recording area moves to the highest bit in the recording area, the historical instruction fetch number of the thread bundle corresponding to the recording area reaches the count upper limit.
Specifically, the count upper limit value is associated with the window size of the dynamic synchronization matrix, and the count upper limit value may be set according to the attribute of hardware for implementing the dynamic synchronization matrix.
When the record mark has moved to the highest bit of its recording area, it moves no further: even if the corresponding thread bundle subsequently obtains arbitration permission, the mark stays at the highest bit.
For example, as shown in fig. 4, a dynamic synchronization matrix is set to record the historical instruction fetch counts of 4 thread bundles. Each row of the matrix corresponds to one thread bundle and contains 8 bits (from right to left, bit 0 to bit 7; bit 0 is the lowest bit and bit 7 is the highest bit). The 4 thread bundles are thread bundle 0, thread bundle 1, thread bundle 2 and thread bundle 3, and the record mark is 1.
In the dynamic synchronization matrix, the record mark (1) of the thread bundle 3 has been shifted to the highest position, i.e. the historical instruction fetch number of the thread bundle 3 reaches the count upper limit.
Assuming that thread bundle 3 has obtained arbitration grant, record tag (1) of thread bundle 3 is no longer moved.
In the embodiment of the application, a plurality of thread bundles are synchronized in one window, so that the locality and hit rate of the instruction cache can be effectively improved, and the performance of the instruction cache is improved.
In some embodiments, when none of the recording marks in the plurality of recording areas is located at the lowest order, the recording marks in the plurality of recording areas are respectively shifted by one bit in the direction of the lowest order.
Specifically, once every recording area contains a record mark (i.e., all have started counting) and none of the marks is located at the lowest bit, the lowest bits of the recording areas no longer contribute to subsequent counting. Meanwhile, the size of each recording area is limited: a recording area cannot continue counting once its record mark has moved to the highest bit. To keep the recording areas consistent in counting and to dynamically extend their effective size, the present application proposes that when none of the record marks in the plurality of recording areas is located at the lowest bit, each record mark is moved one bit toward the lowest bit. Thus, a recording area whose mark had reached the highest bit can continue counting, and a recording area whose mark had not reached the highest bit gains one more bit for counting.
In practical applications, the marks may be moved one bit toward the lowest bit as soon as none of them is located at the lowest bit, or the move may be performed only when, in addition, some record mark has reached the highest bit; the present application does not specifically limit the timing of this move.
For example, as shown in fig. 5, a dynamic synchronization matrix is set to record the historical instruction fetch counts of 4 thread bundles. Each row of the matrix corresponds to one thread bundle and contains 8 bits (from right to left, bit 0 to bit 7; bit 0 is the lowest bit and bit 7 is the highest bit). The 4 thread bundles are thread bundle 0, thread bundle 1, thread bundle 2 and thread bundle 3, and the record mark is 1.
In the dynamic synchronization matrix, the record mark (1) of the thread bundle 3 is moved to the highest position, namely the historical instruction fetching number of the thread bundle 3 reaches the counting upper limit value. The recording marks (1) of the thread bundle 2 and the thread bundle 1 are moved to the 1 st bit. The record mark (1) of thread bundle 0 is located at bit 0.
Assuming that thread bundle 0 obtains arbitration permission, in the row corresponding to thread bundle 0, record flag (1) is shifted left by 1 bit, i.e., record flag (1) is shifted to 1 st bit.
At this time, when the lowest bits of the 4 recording areas are all 0, the recording marks (1) of the 4 recording areas are all shifted to the right by one bit, so that the recording mark (1) of the thread bundle 3 is shifted to the 6 th bit (no longer at the highest bit); the record marks (1) of thread bundle 0, thread bundle 1, and thread bundle 2 are moved to bit 0.
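The update and normalization steps walked through above can be sketched as a small Python model of the dynamic synchronization matrix. This is not part of the patent; the class and attribute names are illustrative, and each recording area is modeled by the bit position of its record mark rather than by an explicit bit vector:

```python
NUM_BITS = 8  # bits per recording area: bit 0 (lowest) .. bit 7 (highest)

class DynamicSyncMatrix:
    def __init__(self, num_bundles):
        # Record-mark position per thread bundle; -1 means the recording
        # area is still all zeros (no mark placed yet, as after reset).
        self.pos = [-1] * num_bundles

    def on_grant(self, bundle):
        """Update the matrix when a thread bundle obtains arbitration permission."""
        if self.pos[bundle] < 0:
            self.pos[bundle] = 0       # place the record mark at the lowest bit
        elif self.pos[bundle] < NUM_BITS - 1:
            self.pos[bundle] += 1      # shift the mark one bit toward the highest bit
        # at the highest bit the mark saturates and no longer moves
        self._normalize()

    def _normalize(self):
        # Once every bundle has a mark and none sits at the lowest bit,
        # shift every mark one bit toward the lowest bit, dynamically
        # extending the counting window.
        if all(p > 0 for p in self.pos):
            self.pos = [p - 1 for p in self.pos]
```

Replaying the Fig. 5 example: with marks at bits 0, 1, 1 and 7 for thread bundles 0 to 3, granting thread bundle 0 moves its mark to bit 1; since no mark then occupies bit 0, all marks shift down, leaving bundles 0 to 2 at bit 0 and bundle 3 at bit 6.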
In the embodiment of the application, when the record marks in the plurality of record areas are not positioned at the lowest position, the record marks in the plurality of record areas are respectively moved to the direction of the lowest position by one bit, thereby realizing the dynamic expansion of the record areas and saving the resource consumption.
In step 203, at least one thread bundle granted arbitration permission is determined from the plurality of thread bundles based on their respective priorities.
Specifically, the higher the priority of the thread bundle, the higher the probability of obtaining arbitration permissions; accordingly, the lower the priority of the thread bundle, the lower the probability of obtaining arbitration grants. The priorities corresponding to the plurality of thread bundles in the present application may be completely different or partially the same.
The number of thread bundles allowed to obtain arbitration permission per beat is determined by the hardware resources (i.e., cache resources). When multiple thread bundles may obtain arbitration permission in one beat, higher-priority thread bundles obtain it first; after they do, if hardware resources remain, lower-priority thread bundles may also obtain arbitration permission.
In some embodiments, for a thread bundle whose historical instruction fetch count reaches the count upper limit, the present application proposes at least the following two arbitration modes:
In the first mode, if the historical instruction fetch counts of the plurality of thread bundles include a count that has reached the count upper limit, the priority of the thread bundle corresponding to that count is set to the blocking priority. The thread bundles with the blocking priority are then excluded from the plurality of thread bundles, and at least one thread bundle granted arbitration permission is determined from the remaining thread bundles according to their respective priorities.
In particular, there may be one or more thread bundles at the blocking priority. The instruction cache blocks a thread bundle at the blocking priority, i.e., it is not allowed to obtain arbitration permission, while the remaining thread bundles obtain arbitration permission according to their priorities.
In the second mode, for a thread bundle whose historical instruction fetch count has reached the count upper limit, its priority is still determined based on the inverse relation between historical instruction fetch count and priority. At least one thread bundle granted arbitration permission is then determined from the plurality of thread bundles according to their respective priorities.
Specifically, the instruction cache still allows a thread bundle whose historical instruction fetch count reaches the count upper limit to obtain an arbitration grant, but the thread bundle has the lowest priority to obtain an arbitration grant.
In some embodiments, two synchronization control modes may be set: a strong synchronization mode and a weak synchronization mode. When the synchronization control mode is set to the strong synchronization mode, the first mode is used to arbitrate thread bundles whose historical instruction fetch count has reached the count upper limit; when it is set to the weak synchronization mode, the second mode is used.
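The two modes can be sketched together in a short Python function. This is not part of the patent; the function name, the per-beat slot count, and the use of the record-mark bit position as the priority key are illustrative assumptions:

```python
COUNT_LIMIT = 7  # record mark at the highest bit (bit 7) of an 8-bit recording area

def arbitrate(mark_pos, slots, strong_sync=True):
    """Return up to `slots` thread-bundle indices granted arbitration this beat.

    mark_pos: record-mark bit position per bundle (more fetches = higher position).
    strong_sync: True -> first mode (bundles at the count upper limit get the
    blocking priority and are excluded); False -> second mode (such bundles
    may still be granted, but only with the lowest priority).
    """
    candidates = list(range(len(mark_pos)))
    if strong_sync:
        # first mode: blocking priority excludes saturated bundles entirely
        candidates = [b for b in candidates if mark_pos[b] < COUNT_LIMIT]
    # inverse relation: fewer historical fetches -> higher priority -> granted first
    candidates.sort(key=lambda b: mark_pos[b])
    return candidates[:slots]
```

With the Fig. 6 state (marks at bits 0, 1, 1 and 7), the strong synchronization mode blocks thread bundle 3 and grants thread bundle 0 first, while the weak mode would still let bundle 3 through once all other bundles have been served.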
For example, referring to fig. 6, a dynamic synchronization matrix is provided for recording the historical instruction fetch counts of 4 thread bundles. Each row of the dynamic synchronization matrix corresponds to one thread bundle and includes 8 bits (bit 0 to bit 7 from right to left, where bit 0 is the lowest bit and bit 7 is the highest bit). The 4 thread bundles are thread bundle 0, thread bundle 1, thread bundle 2, and thread bundle 3, and the record mark is 1.
The synchronization control mode is set to the strong synchronization mode, and the priority rule is set as follows: when the record mark (1) is located at bit 0 to bit 3, it corresponds to the first priority; when the record mark (1) is located at bit 4 to bit 6, it corresponds to the second priority; when the record mark (1) is located at bit 7, it corresponds to the blocking priority.
In this dynamic synchronization matrix, the record mark (1) of thread bundle 3 is located at bit 7, so the priority of thread bundle 3 is the blocking priority.
The record marks (1) of thread bundle 2 and thread bundle 1 are located at bit 1, and the record mark (1) of thread bundle 0 is located at bit 0, so the priorities of thread bundle 0, thread bundle 1, and thread bundle 2 are all the first priority.
The instruction cache blocks thread bundle 3 and then arbitrates among thread bundle 0, thread bundle 1, and thread bundle 2 according to their respective priorities.
Assume that thread bundle 0 obtains arbitration permission; in the row corresponding to thread bundle 0, the record mark (1) is shifted left by 1 bit, i.e., the record mark (1) moves to bit 1.
Referring to fig. 7, a dynamic synchronization matrix is provided for recording the historical instruction fetch counts of 4 thread bundles. Each row of the dynamic synchronization matrix corresponds to one thread bundle and includes 8 bits (bit 0 to bit 7 from right to left, where bit 0 is the lowest bit and bit 7 is the highest bit). The 4 thread bundles are thread bundle 0, thread bundle 1, thread bundle 2, and thread bundle 3, and the record mark is 1.
The synchronization control mode is set to the weak synchronization mode, and the priority rule is set as follows: when the record mark (1) is located at bit 0 to bit 3, it corresponds to the first priority; when the record mark (1) is located at bit 4 to bit 7, it corresponds to the second priority.
In this dynamic synchronization matrix, the record mark (1) of thread bundle 3 is located at bit 7, so the priority of thread bundle 3 is the second priority.
The record marks (1) of thread bundle 2 and thread bundle 1 are located at bit 1, and the record mark (1) of thread bundle 0 is located at bit 0, so the priorities of thread bundle 0, thread bundle 1, and thread bundle 2 are all the first priority.
Arbitration is performed according to the respective priorities of thread bundle 0, thread bundle 1, thread bundle 2, and thread bundle 3. Because the priorities of thread bundle 0, thread bundle 1, and thread bundle 2 are higher than that of thread bundle 3, they have a higher probability of obtaining arbitration permission.
Assume that thread bundle 3 obtains arbitration permission. Because the record mark (1) corresponding to thread bundle 3 is already at the highest bit, the record mark (1) is not moved further.
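The record-mark update rule illustrated in figs. 6 and 7 — shift left by one bit when a thread bundle's fetch is granted, and stop moving once the mark reaches the highest bit — can be sketched as follows. The 8-bit row width matches the example matrix; the function name is an assumption for illustration.

```python
WIDTH = 8  # bits per row, as in the 4x8 example matrix

def on_grant(row: int) -> int:
    """Update one row (a one-hot value) of the dynamic synchronization matrix
    after its thread bundle obtains arbitration permission: shift the record
    mark left by one bit, except when it is already at the highest bit, in
    which case it saturates and stays in place."""
    if row & (1 << (WIDTH - 1)):  # mark already at bit 7: do not move
        return row
    return row << 1
```

For example, a thread bundle whose mark is at bit 0 moves to bit 1 after one grant, while a thread bundle whose mark is at bit 7 keeps its mark at bit 7.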
In the embodiment of the application, the blocking priority is assigned to thread bundles whose historical instruction fetch count has reached the count upper limit, blocking them from arbitration. This effectively confines the progress of the plurality of thread bundles within one window, thereby improving the locality and hit rate of the instruction cache and further improving its performance.
Step 204: execute the instruction fetch request of the at least one thread bundle to obtain the corresponding target instruction.
Specifically, the instruction fetch requests of the at least one thread bundle are placed in a request processing queue. The instruction fetch requests in the queue are executed in order to obtain the corresponding target instructions, and each target instruction is returned to the corresponding thread bundle for execution.
In the embodiment of the application, the respective priorities of the plurality of thread bundles are determined based on their respective historical instruction fetch counts, where the historical instruction fetch count is inversely related to the priority. At least one thread bundle with arbitration permission is then determined from the plurality of thread bundles based on these priorities, so that thread bundles with slow execution progress preferentially obtain arbitration permission and their corresponding target instructions. The execution progress of the plurality of thread bundles is thus gradually unified, which improves the locality and hit rate of the instruction cache and thereby its performance. Further, because execution progress across thread bundles is better synchronized, other resources such as the data cache can be fully utilized and repeated external accesses are reduced. In addition, the situation where a single thread bundle executes too slowly while the other thread bundles wait for a long time is avoided, further improving the utilization of hardware resources.
It should be noted that the inter-thread-bundle progress control performed by the upstream instruction cache, described in steps 201 to 204 above, may be combined with the inter-thread-bundle progress control of the downstream instruction scheduling module. Through this joint upstream and downstream control, execution divergence among thread bundles is minimized, memory locality is improved, and the best performance is obtained from the instruction cache.
In some embodiments, the following operation is performed for each instruction fetch request of the at least one thread bundle: the instruction fetch request of the thread bundle is executed, and the corresponding target instruction is obtained from the instruction cache or the upper-level cache.
Specifically, for each thread bundle that obtains arbitration permission, the instruction cache is queried based on the instruction address in the thread bundle's instruction fetch request. When the instruction cache contains the corresponding target instruction, the target instruction is returned to the thread bundle. When the instruction cache does not contain the corresponding target instruction, the target instruction is obtained from the upper-level cache and then returned to the thread bundle for execution.
Meanwhile, in the dynamic synchronization matrix, if the record mark corresponding to the thread bundle is not at the highest bit, it is shifted one bit to the left; if it is already at the highest bit, it is not moved.
In some embodiments, when the instruction fetch requests of the at least one thread bundle include multiple instruction fetch requests with the same instruction address, the multiple instruction fetch requests are merged into one target request. The target request is executed, and the corresponding target instruction is obtained from the instruction cache or the upper-level cache.
Specifically, when multiple instruction fetch requests share the same instruction address, they are essentially fetching the same instruction. The multiple instruction fetch requests are therefore merged into one target request, which carries indication information characterizing the instruction fetch requests that were merged into it.
Next, the instruction cache is queried based on the instruction address in the target request. When the instruction cache contains the corresponding target instruction, the target instruction is obtained from the instruction cache; otherwise, it is obtained from the upper-level cache.
After the target instruction is obtained, the instruction fetch requests that were merged into the target request, and the thread bundles that sent them, are determined from the indication information carried by the target request. The target instruction is then returned to each of the corresponding thread bundles.
Meanwhile, in the dynamic synchronization matrix, for each of these thread bundles, if the record mark corresponding to the thread bundle is not at the highest bit, it is shifted one bit to the left; if it is already at the highest bit, it is not moved.
In the embodiment of the application, by controlling the priorities of the thread bundles, thread bundles with slow execution progress preferentially obtain arbitration permission and their corresponding target instructions, so the execution progress of the thread bundles gradually converges. As thread bundle synchronicity improves, the probability of request merging increases, i.e., multiple instruction fetch requests with the same instruction address become more likely. When such requests exist, they are merged into one target request, the target request is executed, and the corresponding target instruction is obtained from the instruction cache or the upper-level cache. This reduces the number of requests the instruction cache must process and improves its performance.
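The merging step described above can be sketched as follows. Only the grouping by instruction address comes from the description; the request representation, field names, and the use of the requester list as the "indication information" are assumptions for illustration.

```python
from collections import defaultdict

def merge_requests(requests):
    """Merge instruction fetch requests that share an instruction address
    into one target request per distinct address. Each target request
    carries indication information (here: the list of requesting thread
    bundle ids) so the fetched instruction can later be returned to every
    merged requester.

    requests: iterable of (warp_id, instruction_address) pairs.
    """
    by_addr = defaultdict(list)
    for warp_id, addr in requests:
        by_addr[addr].append(warp_id)
    # one target request per distinct instruction address
    return [{"addr": addr, "warps": warps} for addr, warps in by_addr.items()]
```

For instance, two thread bundles fetching from the same address produce a single target request that lists both of them, halving the work the instruction cache must perform for that address.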
To better explain the embodiments of the present application, the optimization method for the instruction cache is described below in conjunction with a specific implementation scenario; the method may be executed by the instruction cache.
Referring to fig. 8, the entry of the instruction cache includes: a dynamic synchronization matrix, polling arbitration module 1, polling arbitration module 2, an arbitration output module, a request merging module, and a request processing queue. The synchronization control modes supported by the instruction cache include the strong synchronization mode and the weak synchronization mode.
A dynamic synchronization matrix is provided for recording the historical instruction fetch counts of 4 thread bundles. Each row of the dynamic synchronization matrix corresponds to one thread bundle and includes 8 bits (bit 0 to bit 7 from right to left, where bit 0 is the lowest bit and bit 7 is the highest bit). The 4 thread bundles are thread bundle 0, thread bundle 1, thread bundle 2, and thread bundle 3, and the record mark is 1. As shown in fig. 9, in this dynamic synchronization matrix the record mark (1) of thread bundle 3 is located at bit 7, the record marks (1) of thread bundle 2 and thread bundle 1 are located at bit 1, and the record mark (1) of thread bundle 0 is located at bit 0.
Assume that the synchronization control mode is configured as the strong synchronization mode, with the priority rule: when the record mark (1) is located at bit 0 to bit 3, it corresponds to the first priority; when the record mark (1) is located at bit 4 to bit 6, it corresponds to the second priority; when the record mark (1) is located at bit 7, it corresponds to the blocking priority.
The 4 thread bundles each send an instruction fetch request to the instruction cache, and the requests are masked at the entry of the instruction cache. Specifically, polling arbitration module 1 obtains the first priority mask from the dynamic synchronization matrix, that is, all data from bit 0 to bit 3 of the matrix. As can be seen from fig. 9, the record marks (1) of thread bundle 0, thread bundle 1, and thread bundle 2 all fall within the first priority mask, so the priorities of thread bundle 0, thread bundle 1, and thread bundle 2 are all the first priority.
Polling arbitration module 1 arbitrates among thread bundle 0, thread bundle 1, and thread bundle 2. Assuming that thread bundle 0 and thread bundle 1 obtain arbitration permission, the record marks (1) corresponding to thread bundle 0 and thread bundle 1 are shifted left by 1 bit in the dynamic synchronization matrix, as shown in fig. 9.
Polling arbitration module 2 obtains the second priority mask from the dynamic synchronization matrix, that is, all data from bit 4 to bit 7 of the matrix. As can be seen from fig. 9, the record mark (1) of thread bundle 3 falls within the second priority mask and sits at the highest bit (bit 7), so the priority of thread bundle 3 is set to the blocking priority. Polling arbitration module 2 blocks thread bundle 3, i.e., it does not output the instruction fetch request of thread bundle 3.
Polling arbitration module 1 outputs the first instruction fetch request of thread bundle 0 and the second instruction fetch request of thread bundle 1, through the arbitration output module, to the request merging module. Because the instruction addresses in the first and second instruction fetch requests are the same, the request merging module merges them into one target request and places the target request in the request processing queue.
The target request is fetched from the request processing queue, and the instruction cache is queried based on the instruction address in the target request. When the instruction cache contains the corresponding target instruction, the target instruction is obtained from the instruction cache; otherwise, it is obtained from the upper-level cache.
Finally, the target instruction is returned to thread bundle 0 and thread bundle 1 for execution.
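The mask-based eligibility test in this scenario — the first priority mask covering bits 0 to 3, the second covering bits 4 to 7, with the highest bit blocked in the strong synchronization mode — can be sketched as follows. This is a simplified model that only determines which thread bundles are eligible, not the round-robin order within a priority level; all names are assumptions for illustration.

```python
def eligible_strong(matrix):
    """matrix: list of one-hot row values, one per thread bundle (index =
    thread bundle id). Returns the ids eligible to win arbitration under
    the strong synchronization mode: first-priority bundles (mark at bits
    0-3) are considered before second-priority ones (mark at bits 4-6);
    a bundle whose mark sits at bit 7 holds the blocking priority and is
    not eligible at all."""
    first, second = [], []
    for warp_id, row in enumerate(matrix):
        bit = row.bit_length() - 1   # position of the record mark
        if bit <= 3:
            first.append(warp_id)    # falls within the first priority mask
        elif bit <= 6:
            second.append(warp_id)   # falls within the second priority mask
        # bit == 7: blocking priority, excluded
    return first if first else second
```

Applied to the fig. 9 layout (thread bundle 3 at bit 7, bundles 1 and 2 at bit 1, bundle 0 at bit 0), only thread bundles 0, 1, and 2 remain eligible, matching the scenario above.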
In the embodiment of the application, by performing inter-thread-bundle progress control in the upstream instruction cache, thread bundles with slow execution progress preferentially obtain arbitration permission and their corresponding target instructions, so the execution progress of the plurality of thread bundles is gradually unified, improving the locality and hit rate of the instruction cache and thereby its performance. Further, because execution progress across thread bundles is better synchronized, other resources such as the data cache can be fully utilized and repeated external accesses are reduced.
Based on the same technical concept, an embodiment of the present application provides a structural schematic diagram of an instruction cache optimizing apparatus, as shown in fig. 10, the instruction cache optimizing apparatus 1000 includes:
an obtaining module 1001, configured to obtain instruction fetching requests of a plurality of thread bundles;
a processing module 1002, configured to determine respective priorities of the plurality of thread bundles based on respective historical instruction fetch numbers of the plurality of thread bundles, where the historical instruction fetch numbers are inversely related to the priorities;
An arbitration module 1003, configured to determine at least one thread bundle of arbitration permissions from the plurality of thread bundles based on priorities of the plurality of thread bundles;
The execution module 1004 is configured to execute the instruction fetching request of the at least one thread bundle, and obtain a corresponding target instruction.
Optionally, the arbitration module 1003 is specifically configured to:
If the historical instruction fetching number reaching the counting upper limit value is contained in the historical instruction fetching number of each of the plurality of thread bundles, setting the priority of the thread bundle corresponding to the historical instruction fetching number reaching the counting upper limit value as blocking priority;
and eliminating the thread bundles corresponding to the blocking priority from the thread bundles, and determining at least one thread bundle with arbitration permission from the remaining thread bundles according to the respective priorities of the remaining thread bundles.
Optionally, the historical instruction fetching number of each of the plurality of thread bundles is recorded by a dynamic synchronization matrix, wherein the dynamic synchronization matrix includes: a plurality of recording areas, each recording area corresponding to a thread bundle; the location of the record mark in each record area is used to characterize the historical instruction fetch count of the corresponding thread bundle.
Optionally, the processing module 1002 is further configured to:
If the recording mark in the recording area moves to the highest position in the recording area, the historical instruction fetching number of the thread bundle corresponding to the recording area reaches the counting upper limit value.
Optionally, the processing module 1002 is further configured to:
when none of the recording marks in the plurality of recording areas is located at the lowest position, the recording marks in the plurality of recording areas are respectively moved by one bit in the direction of the lowest position.
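This normalization step — shifting every record mark one bit toward the lowest position once no mark remains at the lowest bit — keeps the matrix tracking relative rather than absolute progress, freeing headroom at the high end. A minimal sketch (function name assumed for illustration):

```python
def normalize(matrix):
    """matrix: list of one-hot row values, one per thread bundle. If no
    record mark in any row sits at the lowest bit, shift every row one
    bit toward the lowest position; otherwise leave the matrix unchanged."""
    if any(row & 1 for row in matrix):
        return matrix             # some bundle is still at the lowest bit
    return [row >> 1 for row in matrix]
```

For example, once the slowest thread bundle advances past bit 0, all rows shift down together, so the window of relative progress between the fastest and slowest bundles is preserved.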
Optionally, the execution module 1004 is specifically configured to:
When a plurality of instruction fetching requests with the same instruction addresses exist in the instruction fetching requests of the at least one thread bundle, merging the plurality of instruction fetching requests into a target request;
And executing the target request, and acquiring a corresponding target instruction from the instruction cache or the upper level cache.
Optionally, the execution module 1004 is specifically configured to:
For the instruction fetching request of the at least one thread bundle, respectively executing the following operations:
Executing an instruction fetching request of a thread bundle, and acquiring a corresponding target instruction from an instruction cache or an upper level cache.
And determining the priority of each of the plurality of thread bundles based on the historical instruction fetching number of each of the plurality of thread bundles, wherein the historical instruction fetching number is inversely related to the priority. And then determining at least one thread bundle with arbitration permission from the plurality of thread bundles based on the priorities of the plurality of thread bundles, so that the thread bundles with slow execution progress can obtain the arbitration permission preferentially and obtain corresponding target instructions, the execution progress of the plurality of thread bundles is gradually unified, the locality and hit rate of the instruction cache are improved, and the performance of the instruction cache is improved.
And secondly, because the execution progress among the thread bundles is more synchronous, other resources such as data caches can be fully utilized, and external repeated access is reduced. In addition, because the execution progress among the thread bundles is more synchronous, the problems that the execution of a single thread bundle is too slow and other thread bundles wait for a long time are avoided, and the utilization rate of hardware resources is further improved.
Based on the same technical concept, the embodiment of the present application provides a computer device, as shown in fig. 11, including at least one processor chip 100 and a memory 1101 connected to the at least one processor chip 100, where a specific connection medium between the processor chip 100 and the memory 1101 is not limited in the embodiment of the present application, and in fig. 11, the processor chip 100 and the memory 1101 are connected by a bus as an example. The buses may be divided into address buses, data buses, control buses, etc.
In the embodiment of the present application, the memory 1101 stores instructions executable by the at least one processor chip 100, and the at least one processor chip 100 can perform the steps of the above-described method for optimizing the instruction cache by executing the instructions stored in the memory 1101.
The processor chip 100 is a control center of a computer device, and various interfaces and lines can be used to connect various parts of the computer device, and by executing or executing instructions stored in the memory 1101 and invoking data stored in the memory 1101, the optimization of the instruction cache can be realized. Alternatively, the processor chip 100 may include one or more processing units, and the processor chip 100 may integrate an application processor and a modem processor, wherein the application processor primarily processes an operating system, a user interface, an application program, and the like, and the modem processor primarily processes wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor chip 100. In some embodiments, the processor chip 100 and the memory 1101 may be implemented on the same chip, and in some embodiments they may be implemented separately on separate chips.
The processor chip 100 may be a general-purpose processor such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware processor for execution, or executed by a combination of hardware and software modules in the processor.
The memory 1101 is a non-volatile computer-readable storage medium and can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 1101 may include at least one type of storage medium, for example flash memory, a hard disk, a multimedia card, card memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, an optical disc, and the like. The memory 1101 may also be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer device. The memory 1101 in the embodiments of the present application may likewise be circuitry or any other device capable of implementing a storage function, for storing program instructions and/or data.
Based on the same inventive concept, an embodiment of the present application provides a computer readable storage medium storing a computer program executable by a computer device, which when run on the computer device causes the computer device to perform the steps of the above-described method of optimizing instruction caching.
Based on the same inventive concept, embodiments of the present application provide a computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which, when executed by a computer device, cause the computer device to perform the steps of the above-described method of optimizing an instruction cache.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, or as a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer device or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer device or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer device or other programmable apparatus to produce a computer device implemented process such that the instructions which execute on the computer device or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. An optimization method for instruction cache, comprising the steps of:
acquiring instruction fetching requests of a plurality of thread bundles;
determining respective priorities of the plurality of thread bundles based on respective historical instruction fetch numbers of the plurality of thread bundles, wherein the historical instruction fetch numbers are inversely related to the priorities;
Determining at least one thread bundle of arbitration permissions from the plurality of thread bundles based on the respective priorities of the plurality of thread bundles;
and executing the instruction fetching request of the at least one thread bundle to acquire a corresponding target instruction.
2. The method of claim 1, wherein the determining at least one thread bundle of arbitration grants from the plurality of thread bundles based on the priorities of the respective plurality of thread bundles comprises:
If the historical instruction fetching number reaching the counting upper limit value is contained in the historical instruction fetching number of each of the plurality of thread bundles, setting the priority of the thread bundle corresponding to the historical instruction fetching number reaching the counting upper limit value as a blocking priority;
and eliminating the thread bundles corresponding to the blocking priority from the thread bundles, and determining at least one thread bundle with arbitration permission from the remaining thread bundles according to the respective priorities of the remaining thread bundles.
3. The method of claim 1, wherein the historical instruction fetch count for each of the plurality of thread bundles is recorded by a dynamic synchronization matrix, wherein the dynamic synchronization matrix comprises: a plurality of recording areas, each recording area corresponding to a thread bundle; the location of the record mark in each record area is used to characterize the historical instruction fetch count of the corresponding thread bundle.
4. A method as recited in claim 3, further comprising:
If the recording mark in the recording area moves to the highest position in the recording area, the historical instruction fetching number of the thread bundle corresponding to the recording area reaches the counting upper limit value.
5. A method as recited in claim 3, further comprising:
when none of the recording marks in the plurality of recording areas is located at the lowest position, the recording marks in the plurality of recording areas are respectively moved by one bit in the direction of the lowest position.
6. The method according to any one of claims 1 to 5, wherein said executing an instruction fetch request of said at least one thread bundle, obtaining a corresponding target instruction, comprises:
When a plurality of instruction fetching requests with the same instruction addresses exist in the instruction fetching requests of the at least one thread bundle, merging the plurality of instruction fetching requests into a target request;
And executing the target request, and acquiring a corresponding target instruction from the instruction cache or the upper level cache.
7. The method according to any one of claims 1 to 5, wherein executing the instruction fetch request of the at least one thread bundle to obtain a corresponding target instruction comprises:
for each instruction fetch request of the at least one thread bundle, performing the following operation:
executing the instruction fetch request of the thread bundle, and obtaining the corresponding target instruction from the instruction cache or an upper-level cache.
8. A computer device comprising a memory, a processor chip and a computer program stored on the memory and executable on the processor chip, characterized in that the processor chip implements the steps of the method according to any one of claims 1-7 when executing the computer program.
9. A computer readable storage medium, characterized in that it stores a computer program executable by a computer device, which computer program, when run on the computer device, causes the computer device to perform the steps of the method according to any one of claims 1-7.
10. A computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which, when executed by a computer device, cause the computer device to carry out the steps of the method according to any one of claims 1-7.
CN202410598217.0A 2024-05-14 2024-05-14 Method, device, storage medium and program product for optimizing instruction cache Pending CN118171612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410598217.0A CN118171612A (en) 2024-05-14 2024-05-14 Method, device, storage medium and program product for optimizing instruction cache


Publications (1)

Publication Number Publication Date
CN118171612A true CN118171612A (en) 2024-06-11

Family

ID=91359122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410598217.0A Pending CN118171612A (en) 2024-05-14 2024-05-14 Method, device, storage medium and program product for optimizing instruction cache

Country Status (1)

Country Link
CN (1) CN118171612A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102566974A (en) * 2012-01-14 2012-07-11 哈尔滨工程大学 Instruction acquisition control method based on simultaneous multithreading
CN108279927A (en) * 2017-12-26 2018-07-13 芯原微电子(上海)有限公司 The multichannel command control method and system, controller of adjustable instruction priority
CN114237878A (en) * 2021-12-06 2022-03-25 海光信息技术股份有限公司 Instruction control method, circuit, device and related equipment
CN116107634A (en) * 2023-02-23 2023-05-12 海光信息技术股份有限公司 Instruction control method and device and related equipment
CN116560809A (en) * 2022-01-28 2023-08-08 腾讯科技(深圳)有限公司 Data processing method and device, equipment and medium
KR102602844B1 (en) * 2023-04-12 2023-11-15 고려대학교 산학협력단 A memory controller and method for controlling external memory based on warp executions of graphics processing units


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sun Caixia et al.: "A fetch policy for simultaneous multithreading processors based on multiple fetch priorities", Acta Electronica Sinica, vol. 34, no. 5, 31 May 2006 (2006-05-31), pages 790-795 *

Similar Documents

Publication Publication Date Title
US7949855B1 (en) Scheduler in multi-threaded processor prioritizing instructions passing qualification rule
US8458721B2 (en) System and method for implementing hierarchical queue-based locks using flat combining
CN1126028C (en) Computer processor with replay system
EP0496439B1 (en) Computer system with multi-buffer data cache and method therefor
US9304920B2 (en) System and method for providing cache-aware lightweight producer consumer queues
US10019283B2 (en) Predicting a context portion to move between a context buffer and registers based on context portions previously used by at least one other thread
CN108549574B (en) Thread scheduling management method and device, computer equipment and storage medium
US10268519B2 (en) Scheduling method and processing device for thread groups execution in a computing system
CN109308220B (en) Shared resource allocation method and device
CN110333827B (en) Data loading device and data loading method
Usui et al. Squash: Simple qos-aware high-performance memory scheduler for heterogeneous systems with hardware accelerators
US10152329B2 (en) Pre-scheduled replays of divergent operations
US9442759B2 (en) Concurrent execution of independent streams in multi-channel time slice groups
CN112204523A (en) Multi-kernel wavefront scheduler
US9507725B2 (en) Store forwarding for data caches
CN106649143B (en) Cache access method and device and electronic equipment
US8803900B2 (en) Synchronization with semaphores in a multi-engine GPU
CN110647358B (en) Synchronous multithread processor
CN118171612A (en) Method, device, storage medium and program product for optimizing instruction cache
CN105378652A (en) Method and apparatus for allocating thread shared resource
CN115905040A (en) Counter processing method, graphic processor, device and storage medium
US8452920B1 (en) System and method for controlling a dynamic random access memory
CN113867801A (en) Instruction cache, instruction cache group and request merging method thereof
CN110647357B (en) Synchronous multithread processor
CN114063923A (en) Data reading method and device, processor and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination