CN111159062A - Cache data scheduling method and device, CPU chip and server - Google Patents


Info

Publication number
CN111159062A
CN111159062A (application CN201911305828.7A)
Authority
CN
China
Prior art keywords
instruction
data
target data
target
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911305828.7A
Other languages
Chinese (zh)
Other versions
CN111159062B (en)
Inventor
陈立勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haiguang Information Technology Co Ltd filed Critical Haiguang Information Technology Co Ltd
Priority to CN201911305828.7A priority Critical patent/CN111159062B/en
Publication of CN111159062A publication Critical patent/CN111159062A/en
Application granted granted Critical
Publication of CN111159062B publication Critical patent/CN111159062B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0862Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023Free address space management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement
    • G06F2212/1024Latency reduction
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The embodiment of the invention discloses a cache data scheduling method and device, a CPU chip, and a server, relating to the field of computer technology, which can effectively improve instruction execution speed. The method comprises the following steps: before executing a data load instruction, searching a cache for target data according to the storage address of the target data, the data load instruction being used to load the target data from memory into the processor; and, if the target data is not in the cache, requesting the target data from memory according to its storage address in memory. The invention is suitable for cache data scheduling.

Description

Cache data scheduling method and device, CPU chip and server
Technical Field
The invention relates to the technical field of computers, in particular to a cache data scheduling method and device, a CPU chip and a server.
Background
When a CPU executes a load/store instruction, it sends the effective address of the data to the CPU's memory access unit. If the data required by the load/store instruction is not in the cache (i.e., a CACHE MISS occurs), the CPU can only wait while the memory access unit loads the data at the corresponding address from the next level of memory into the data cache; only then can instruction execution continue. During this stage the CPU must wait for many cycles, severely reducing instruction execution speed.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for scheduling cache data, a CPU chip, and a server, which can effectively improve instruction execution speed.
In a first aspect, an embodiment of the present invention provides a method for scheduling cache data, including:
before executing a data loading instruction, searching target data in a cache according to a storage address of the target data; the data loading instruction is used for loading the target data from the memory into the processor;
and if the target data is not in the cache, requesting the target data from the memory according to the storage address of the target data in the memory.
Optionally, before executing the data load instruction, searching for the target data in the cache according to the storage address of the target data includes:
determining an execution order of the data load instructions in all instructions of a program compilation result;
selecting an instruction as a target instruction in an instruction set of which the execution order is prior to the data loading instruction, and attaching a pre-fetching operator to the target instruction;
when the target instruction is executed, searching the target data in the cache according to the storage address of the target data and the pre-fetching operator.
Optionally, the selecting an instruction as a target instruction in an instruction set whose execution order precedes that of the data load instruction includes:
and in an instruction set of which the execution order precedes that of the data loading instruction, selecting a storage address generation instruction of the target data or selecting an instruction of which the execution order follows the storage address generation instruction as the target instruction.
Optionally, the scheduling method further includes:
and adjusting the execution order of the instructions in the program compiling result so that at least one program instruction is separated between the target instruction and the data loading instruction.
Optionally, before searching for the target data in the cache according to the storage address of the target data, the method further includes:
and generating a storage address of the target data according to the numerical value stored in the preset register.
Optionally, after the request for the target data from the memory, the method further includes:
and executing the data loading instruction, wherein the fetching operation of the data loading instruction is at least one clock cycle later than the operation of requesting the target data from the memory.
In a second aspect, an embodiment of the present invention further provides a scheduling apparatus for buffering data, including:
the searching unit is used for searching the target data in the cache according to the storage address of the target data before executing the data loading instruction; the data loading instruction is used for loading the target data from the memory into the processor;
and the request unit is used for requesting the target data from the memory according to the storage address of the target data in the memory if the target data is not found in the cache.
Optionally, the searching unit includes:
the determining module is used for determining the execution order of the data loading instruction in all instructions of a program compiling result;
the selection module is used for selecting an instruction as a target instruction in an instruction set of which the execution order is prior to the data loading instruction, and attaching a pre-fetching operator to the target instruction;
and the searching module is used for searching the target data in the cache according to the storage address of the target data and the pre-fetching operator when the target instruction is executed.
Optionally, the selecting module is specifically configured to:
and in an instruction set of which the execution order precedes that of the data loading instruction, selecting a storage address generation instruction of the target data or selecting an instruction of which the execution order follows the storage address generation instruction as the target instruction.
Optionally, the apparatus further comprises:
and the adjusting unit is used for adjusting the execution order of the instructions in the program compiling result so as to enable at least one program instruction to be separated between the target instruction and the data loading instruction.
Optionally, the apparatus further includes a generating unit, configured to generate the storage address of the target data according to a numerical value stored in a preset register before searching the target data in the cache according to the storage address of the target data.
Optionally, the apparatus further includes an execution unit, configured to execute the data load instruction after the target data is requested to the memory, where a fetch operation of the data load instruction is at least one clock cycle later than the operation of requesting the target data to the memory.
In a third aspect, an embodiment of the present invention further provides a CPU chip, including: at least one processor core, a cache; the processor core to:
before executing a data loading instruction, searching the target data in the cache according to the storage address of the target data; the data loading instruction is used for loading the target data from the memory into the processor;
and if the target data is not in the cache, requesting the target data from the memory according to the storage address of the target data in the memory.
Optionally, the processor core is specifically configured to:
determining an execution order of the data load instructions in all instructions of a program compilation result;
selecting an instruction as a target instruction in an instruction set of which the execution order is prior to the data loading instruction, and attaching a pre-fetching operator to the target instruction;
when the target instruction is executed, searching the target data in a cache according to the storage address of the target data and the pre-fetching operator.
Optionally, the processor core is specifically configured to:
and in an instruction set of which the execution order precedes that of the data loading instruction, selecting a storage address generation instruction of the target data or selecting an instruction of which the execution order follows the storage address generation instruction as the target instruction.
Optionally, the processor core is further configured to:
and adjusting the execution order of the instructions in the program compiling result so that at least one program instruction is separated between the target instruction and the data loading instruction.
Optionally, the processor core is further configured to:
and generating the storage address of the target data according to the numerical value stored in a preset register before searching the target data in the cache according to the storage address of the target data.
Optionally, the processor core is further configured to:
and executing the data loading instruction after the target data is requested to the memory, wherein the fetching operation of the data loading instruction is at least one clock cycle later than the operation of requesting the target data to the memory.
In a fourth aspect, an embodiment of the present invention further provides a server, including: a housing, a processor, a memory, a circuit board, and a power supply circuit, wherein the circuit board is arranged in a space enclosed by the housing, and the processor and the memory are arranged on the circuit board; the power supply circuit is used for supplying power to each circuit or device of the server; the memory is used for storing executable program code; and the processor, by reading the executable program code stored in the memory, runs a program corresponding to the executable program code, so as to execute any cache data scheduling method provided by the embodiments of the present invention.
According to the scheduling method and device for cache data, the CPU chip and the server provided by the embodiment of the invention, before the data loading instruction is executed, the target data can be searched in the cache according to the storage address of the target data, and if the target data is not in the cache, the target data is requested to the memory according to the storage address of the target data in the memory. Therefore, corresponding data can be fetched from the cache in advance before the data loading instruction is executed, so that even if the data is not in the cache, the data is read from the memory into the cache before the data loading instruction is executed, the data is loaded and utilized by the processor, the processor does not need to wait for a plurality of clock cycles, and the execution efficiency of the program instruction is effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a method for scheduling cache data according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a scheduling apparatus for buffering data according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a CPU chip according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides a method for scheduling cache data, including:
s11, before executing the data loading instruction, searching the target data in the cache according to the storage address of the target data; the data loading instruction is used for loading the target data from the memory into the processor;
the data load instruction may refer to an instruction to read data from the memory into the processor. When a pipeline processor executes an instruction, it can be generally broken down into the following operations: fetching an instruction, decoding the instruction, executing the instruction, accessing a memory, and writing back an execution result. The time period required for a processor to execute an instruction may be referred to as an instruction cycle, and an instruction cycle may include one or more clock cycles. The processor may perform one of the above operations on one instruction every clock cycle therein. Meanwhile, after one operation of one instruction is completed, the processor can execute the operation on the next instruction, so that an instruction pipeline processing structure is formed. For example, after the instruction detection 1 is decoded, on one hand, detection 1 may be executed, and on the other hand, detection 2 may be decoded at the same time.
In an embodiment of the present invention, a memory access unit exists in the processor, and the memory access unit is dedicated to data interaction with the memory, and in this step, before the data load instruction is executed, the storage address of the target data may be sent to the memory access unit, so that the memory access unit may request data from the cache or the memory in advance before the data load instruction is executed.
And S12, if the target data is not in the cache, requesting the target data from the memory according to the storage address of the target data in the memory.
In this step, if the target data is not in the cache, the data may be requested from the next level of memory, and since the data is requested from the memory before the data load instruction is executed, there is more time to read the data from the memory to the cache before the data load instruction is executed, which effectively improves the program execution speed of the processor.
According to the scheduling method of the cache data provided by the embodiment of the invention, before the data loading instruction is executed, the target data can be searched in the cache according to the storage address of the target data, and if the target data is not in the cache, the target data is requested to the memory according to the storage address of the target data in the memory. Therefore, corresponding data can be fetched from the cache in advance before the data loading instruction is executed, so that even if the data is not in the cache, more time is provided for reading the data from the memory to the cache before the data loading instruction is executed, the data is loaded and utilized by the processor, the processor does not need to wait for a plurality of clock cycles, and the execution efficiency of the program instruction is effectively improved.
Optionally, in an embodiment of the present invention, before executing the data load instruction, in step S11, searching for the target data in the cache according to the storage address of the target data may specifically include:
determining an execution order of the data load instructions in all instructions of a program compilation result;
selecting an instruction as a target instruction in an instruction set of which the execution order is prior to the data loading instruction, and attaching a pre-fetching operator to the target instruction;
when the target instruction is executed, searching the target data in the cache according to the storage address of the target data and the pre-fetching operator.
For example, in one embodiment of the present invention, the program compiles into instructions INST1, INST2, INST3, and INST4, executed in that order. If INST3 is a data load instruction and the data address used by INST3 is calculated in INST2, then a prefetch operator may be appended to INST2 so that the data is requested from the cache in advance or, if the data is not present in the cache, brought into the cache from memory in advance. If the data address used by INST3 is calculated in INST1, a prefetch operator may be appended to INST1 or INST2.
In the embodiment of the present invention, as a result of program compilation, there may be a plurality of instructions whose execution order precedes that of the data load instruction, and the specific selection of which instruction is used as the target instruction may be set as required.
Optionally, in an embodiment of the present invention, in an instruction set whose execution order precedes that of the data load instruction, selecting an instruction as a target instruction may include:
and in an instruction set of which the execution order precedes that of the data loading instruction, selecting a storage address generation instruction of the target data or selecting an instruction of which the execution order follows the storage address generation instruction as the target instruction.
That is, since data prefetching needs to be performed according to the memory address where the target data is located, the prefetching operation can be performed when the storage address of the target data is generated, so that the prefetched data can be obtained as early as possible, and the prefetching operation can be performed after the storage address of the target data is generated and before the target data is loaded. Optionally, the prefetch operation may be implemented by adding a prefetch operator to the original instruction, or by adding a prefetch operation instruction alone, which is not limited in the embodiment of the present invention.
Further, after the target instruction is selected, in an embodiment of the present invention the execution order of the instructions in the program compilation result may be adjusted so that at least one program instruction separates the target instruction from the data load instruction. For example, an instruction unrelated to the data load can be inserted between the target instruction and the data load instruction, so that the prefetch operation runs more clock cycles ahead, further ensuring smooth execution of the data load instruction. As another example, in one embodiment of the invention the instruction execution order is: INST1, INST4, INST3, INST2. If INST3 is the data load instruction, INST1 is the target instruction, and INST2 is unrelated to the data load, the instruction execution order in the program compilation result can be readjusted by inserting INST2 between INST1 and INST3; the adjusted execution order may be, for example: INST1, INST4, INST2, INST3.
Since the target data needs to be searched for in the cache based on the storage address of the target data in step S11, at least the storage address of the target data is already known when searching for the target data. The method for obtaining the storage address of the target data may be various, and various known addressing methods may be adopted, which is not limited in the embodiment of the present invention.
For example, in an embodiment of the present invention, before searching the cache for the target data according to its storage address in step S11, the cache data scheduling method may further comprise: determining the storage address of the target data according to the value stored in a preset register. For example, if instruction 1 adds two immediate values, and instruction 2 takes the result of instruction 1 as an address and fetches the data stored at that address, then the storage address of the target data is generated when instruction 1 is executed.
Further, in step S12, if the target data is not found in the cache, after the request for the target data from the memory according to the storage address of the target data in the memory, the method for scheduling cache data according to the embodiment of the present invention may further include: and executing the data loading instruction, wherein the fetching operation of the data loading instruction is at least one clock cycle later than the operation of requesting the target data from the memory.
That is, when the target data needs to be requested from the memory, the operation time of at least one clock cycle for the data requesting operation is prepared in advance, so that the data loading instruction can smoothly load the target data from the memory into the processor.
The data prefetching scheduling method provided by the embodiment of the present invention is described in detail below through a specific embodiment.
A portion of the high level language program instructions in an embodiment of the invention include:
high level language (e.g., C language) fragments:
int a; // a is a global variable with an address of 0x12345678
{
int c = a;
}
The pipeline of assembler instructions for the high-level language program may be as shown in table 1.
TABLE 1
(Table 1: pipeline of the assembler instructions; reproduced as an image in the original publication and not shown here.)
where i is the instruction number, INSN ASM is the instruction code, and n, n+1, etc. are clock cycles.
i: lui $r2, 0x1234 // $r2 = 0x1234,0000
i+1: ori.p $r2, $r2, 0x5678 // $r2 = 0x1234,5678; $r2 now holds the address of variable a, and the CPU sends address 0x12345678 to the memory access unit for data prefetch in cycle (n+3).
i+2: lw $r2, ($r2) // $r2 receives the value of variable a.
The i-th instruction shifts the hexadecimal immediate 0x1234 into the upper half-word (appending four hexadecimal zeros) and places the result in register $r2, i.e., it places 0x1234 x 2^16 in $r2. The (i+1)-th instruction performs a logical OR of the hexadecimal number 0x5678 with the number in register $r2, yielding 0x12345678, and places 0x12345678 in $r2. At the same time, its prefetch function is executed, prefetching the data at address 0x12345678, i.e., querying whether the cache holds the data from the memory location with address 0x12345678. The (i+2)-th instruction takes the value in register $r2 as an address, reads the data at that address, and stores the data in $r2.
As can be seen from the pipeline in Table 1, because the suggestive address-forming instruction ori.p is used, when the CPU executes the (i+1)-th instruction it sends the value of register $r2 as an address to the memory access unit for a data prefetch (mem prefetch) in cycle (n+3); if a data CACHE MISS occurs, the memory access unit can request the data from the next level of memory in advance.
When the CPU executes the (i+2)-th instruction, it issues the value of register $r2 as an address to the memory access unit in cycle (n+5); since the data prefetch was issued two cycles earlier by the (i+1)-th instruction, the (i+2)-th instruction completes execution two clock cycles earlier even if a data CACHE MISS occurs.
In this way, the high-level language compiler generates an address-forming instruction with a suggestive (hint) property, guiding the CPU, while executing that instruction, to also send the address immediate to the memory access unit in advance, completing the prefetch of the memory location the address denotes.
According to the cache data scheduling method provided by the embodiment of the invention, the use position of the data address can be found in advance through the high-level language compiler, the corresponding address forming instruction with the suggestive property is generated, and the data request is sent to the memory access unit in advance, so that even if the data CACHE MISS occurs, the waiting clock period can be saved when the data is actually used, the effect of improving the hit rate of the data cache is achieved, and the program execution speed of the processor is effectively improved.
In the prior art, the (i +1) th instruction is not prefetched, but is directly executed in the (i +2) th instruction, and as can be seen from the pipeline in table 2, when the CPU executes the (i +2) th instruction, the CPU sends the register value of $ r2 as an address to the memory access unit in the (n +5) th cycle, and at this time, once the data CACHE MISS, CPU occurs, a wait cycle is inserted into the stream until the memory access unit retrieves the data in the corresponding address from the next stage of memory.
TABLE 2
(Table 2: pipeline without prefetch; reproduced as an image in the original publication and not shown here.)
Correspondingly, as shown in fig. 2, an embodiment of the present invention further provides a scheduling apparatus for buffering data, including:
a search unit 31, configured to search the target data in the cache according to a storage address of the target data before executing the data load instruction; the data loading instruction is used for loading the target data from the memory into the processor;
a requesting unit 32, configured to request the memory for the target data according to a storage address of the target data in the memory if the target data is not found in the cache.
The scheduling apparatus for cache data provided in the embodiments of the present invention can search the target data in the cache according to the storage address of the target data before executing the data load instruction, and request the target data from the memory according to the storage address of the target data in the memory if the target data is not in the cache. Therefore, corresponding data can be fetched from the cache in advance before the data loading instruction is executed, so that even if the data is not in the cache, the data is read from the memory into the cache before the data loading instruction is executed, the data is loaded and utilized by the processor, the processor does not need to wait for a plurality of clock cycles, and the execution efficiency of the program instruction is effectively improved.
Optionally, the searching unit 31 may include:
the determining module is used for determining the execution order of the data loading instruction in all instructions of a program compiling result;
the selection module is used for selecting an instruction as a target instruction in an instruction set of which the execution order is prior to the data loading instruction, and attaching a pre-fetching operator to the target instruction;
and the searching module is used for searching the target data in the cache according to the storage address of the target data and the pre-fetching operator when the target instruction is executed.
Optionally, the selection module may be specifically configured to:
and in an instruction set of which the execution order precedes that of the data loading instruction, selecting a storage address generation instruction of the target data or selecting an instruction of which the execution order follows the storage address generation instruction as the target instruction.
Optionally, the scheduling apparatus may further include:
and the adjusting unit is used for adjusting the execution order of the instructions in the program compiling result so as to enable at least one program instruction to be separated between the target instruction and the data loading instruction.
Optionally, the scheduling apparatus may further include:
the generating unit is used for determining the storage address of the target data according to the numerical value stored in the preset register before searching the target data in the cache according to the storage address of the target data.
Optionally, the scheduling apparatus may further include:
and the execution unit is used for executing the data loading instruction after the target data is requested from the memory, wherein the fetch operation of the data loading instruction is at least one clock cycle later than the operation of requesting the target data from the memory.
Correspondingly, as shown in fig. 3, an embodiment of the present invention further provides a CPU chip 5, including: at least one processor core 51, a cache 52;
a processor core 51 to:
search for the target data in the cache, according to the storage address of the target data, before executing a data loading instruction; the data loading instruction is used for loading the target data from the memory into the processor;
and, if the target data is not in the cache, request the target data from the memory according to the storage address of the target data in the memory.
The processor core 51 according to the embodiment of the present invention can search for the target data in the cache, according to the storage address of the target data, before the data loading instruction is executed, and, if the target data is not in the cache, request the target data from the memory according to the storage address of the target data in the memory. The corresponding data can therefore be fetched from the cache in advance of the data loading instruction; even if the data is not yet in the cache, more time is available to read it from the memory into the cache before the data loading instruction executes, so that the processor can load and use the data without waiting many clock cycles, which effectively improves the execution efficiency of the program instructions.
Optionally, the processor core 51 may be specifically configured to:
determine the execution order of the data loading instruction among all instructions of a program compiling result;
select an instruction, from an instruction set whose execution order precedes the data loading instruction, as a target instruction, and attach a pre-fetching operator to the target instruction;
and, when the target instruction is executed, search for the target data in the cache according to the storage address of the target data and the pre-fetching operator.
Optionally, the processor core 51 may be specifically configured to:
in an instruction set whose execution order precedes the data loading instruction, select the storage address generation instruction of the target data, or an instruction whose execution order follows the storage address generation instruction, as the target instruction.
Optionally, the processor core 51 may be further configured to:
adjust the execution order of the instructions in the program compiling result so that at least one program instruction separates the target instruction from the data loading instruction.
Optionally, the processor core 51 may be further configured to:
determine the storage address of the target data according to the value stored in a preset register, before searching for the target data in the cache according to that storage address.
Optionally, the processor core 51 may be further configured to:
execute the data loading instruction after the target data is requested from the memory, wherein the fetch operation of the data loading instruction is at least one clock cycle later than the operation of requesting the target data from the memory.
Accordingly, as shown in fig. 4, a server provided in an embodiment of the present invention may include: a shell 61, a processor 62, a memory 63, a circuit board 64, and a power circuit 65, wherein the circuit board 64 is arranged inside the space enclosed by the shell 61, and the processor 62 and the memory 63 are arranged on the circuit board 64; the power circuit 65 is used for supplying power to each circuit or device of the server; the memory 63 is used for storing executable program code; and the processor 62 runs a program corresponding to the executable program code, by reading the executable program code stored in the memory 63, so as to execute any one of the cache data scheduling methods provided in the foregoing embodiments.
It is noted that, herein, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. An element preceded by the phrase "comprises a ..." does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in this specification are described in a related manner; identical or similar parts among the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus embodiment is described relatively simply because it is substantially similar to the method embodiment; for the relevant points, reference may be made to the corresponding description of the method embodiment.
For convenience of description, the above apparatuses are described as being divided, by function, into various units/modules. Of course, when the present invention is implemented, the functions of the units/modules may be implemented in one or more pieces of software and/or hardware.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (19)

1. A method for scheduling cache data, comprising:
before executing a data loading instruction, searching for target data in a cache according to a storage address of the target data, wherein the data loading instruction is used for loading the target data from a memory into a processor;
and, if the target data is not in the cache, requesting the target data from the memory according to the storage address of the target data in the memory.
2. The method of claim 1, wherein searching for the target data in the cache according to the storage address of the target data before executing the data loading instruction comprises:
determining the execution order of the data loading instruction among all instructions of a program compiling result;
selecting an instruction, from an instruction set whose execution order precedes the data loading instruction, as a target instruction, and attaching a pre-fetching operator to the target instruction;
and, when the target instruction is executed, searching for the target data in the cache according to the storage address of the target data and the pre-fetching operator.
3. The method of claim 2, wherein selecting an instruction, from an instruction set whose execution order precedes the data loading instruction, as a target instruction comprises:
in an instruction set whose execution order precedes the data loading instruction, selecting the storage address generation instruction of the target data, or an instruction whose execution order follows the storage address generation instruction, as the target instruction.
4. The scheduling method of claim 2 or 3, further comprising:
adjusting the execution order of the instructions in the program compiling result so that at least one program instruction separates the target instruction from the data loading instruction.
5. The scheduling method according to any one of claims 1 to 3, wherein, before searching for the target data in the cache according to the storage address of the target data, the method further comprises:
determining the storage address of the target data according to the value stored in a preset register.
6. The scheduling method according to any one of claims 1 to 3, wherein, after the target data is requested from the memory, the method further comprises:
executing the data loading instruction, wherein the fetch operation of the data loading instruction is at least one clock cycle later than the operation of requesting the target data from the memory.
7. A scheduling apparatus for cache data, comprising:
a searching unit, used for searching for target data in a cache, according to a storage address of the target data, before a data loading instruction is executed, wherein the data loading instruction is used for loading the target data from a memory into a processor;
and a request unit, used for requesting the target data from the memory, according to the storage address of the target data in the memory, if the target data is not found in the cache.
8. The scheduling apparatus of claim 7, wherein the searching unit comprises:
a determining module, used for determining the execution order of the data loading instruction among all instructions of a program compiling result;
a selection module, used for selecting an instruction, from an instruction set whose execution order precedes the data loading instruction, as a target instruction, and attaching a pre-fetching operator to the target instruction;
and a searching module, used for searching for the target data in the cache, according to the storage address of the target data and the pre-fetching operator, when the target instruction is executed.
9. The scheduling device of claim 8, wherein the selection module is specifically configured to:
in an instruction set whose execution order precedes the data loading instruction, select the storage address generation instruction of the target data, or an instruction whose execution order follows the storage address generation instruction, as the target instruction.
10. The scheduling apparatus according to claim 8 or 9, further comprising:
an adjusting unit, used for adjusting the execution order of the instructions in the program compiling result so that at least one program instruction separates the target instruction from the data loading instruction.
11. The scheduling apparatus according to any one of claims 7 to 10, further comprising a generating unit, used for determining the storage address of the target data according to the value stored in a preset register before the target data is searched for in the cache according to that storage address.
12. The scheduling apparatus of any one of claims 7 to 10, further comprising an execution unit, used for executing the data loading instruction after the target data is requested from the memory, wherein the fetch operation of the data loading instruction is at least one clock cycle later than the operation of requesting the target data from the memory.
13. A CPU chip, comprising: at least one processor core, a cache;
the processor core to:
before executing a data loading instruction, search for the target data in the cache according to the storage address of the target data, wherein the data loading instruction is used for loading the target data from the memory into the processor;
and, if the target data is not in the cache, request the target data from the memory according to the storage address of the target data in the memory.
14. The CPU chip of claim 13, wherein the processor core is specifically configured to:
determine the execution order of the data loading instruction among all instructions of a program compiling result;
select an instruction, from an instruction set whose execution order precedes the data loading instruction, as a target instruction, and attach a pre-fetching operator to the target instruction;
and, when the target instruction is executed, search for the target data in the cache according to the storage address of the target data and the pre-fetching operator.
15. The CPU chip of claim 14, wherein the processor core is specifically configured to:
in an instruction set whose execution order precedes the data loading instruction, select the storage address generation instruction of the target data, or an instruction whose execution order follows the storage address generation instruction, as the target instruction.
16. The CPU chip of claim 14 or 15, wherein the processor core is further configured to:
adjust the execution order of the instructions in the program compiling result so that at least one program instruction separates the target instruction from the data loading instruction.
17. The CPU chip of any of claims 13 to 15, wherein the processor core is further configured to:
determine the storage address of the target data according to the value stored in a preset register, before searching for the target data in the cache according to that storage address.
18. The CPU chip of any of claims 13 to 15, wherein the processor core is further configured to:
execute the data loading instruction after the target data is requested from the memory, wherein the fetch operation of the data loading instruction is at least one clock cycle later than the operation of requesting the target data from the memory.
19. A server, comprising: a shell, a processor, a memory, a circuit board, and a power circuit, wherein the circuit board is arranged inside a space enclosed by the shell, and the processor and the memory are arranged on the circuit board; the power circuit is used for supplying power to each circuit or device of the server; the memory is used for storing executable program code; and the processor runs a program corresponding to the executable program code, by reading the executable program code stored in the memory, so as to execute the cache data scheduling method according to any one of claims 1 to 6.
CN201911305828.7A 2019-12-20 2019-12-20 Cache data scheduling method and device, CPU chip and server Active CN111159062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911305828.7A CN111159062B (en) 2019-12-20 2019-12-20 Cache data scheduling method and device, CPU chip and server

Publications (2)

Publication Number Publication Date
CN111159062A true CN111159062A (en) 2020-05-15
CN111159062B CN111159062B (en) 2023-07-07

Family

ID=70557574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911305828.7A Active CN111159062B (en) 2019-12-20 2019-12-20 Cache data scheduling method and device, CPU chip and server

Country Status (1)

Country Link
CN (1) CN111159062B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111831451A (en) * 2020-07-21 2020-10-27 平安科技(深圳)有限公司 Cloud host memory allocation method, cloud host, cloud device and storage medium
CN112199400A (en) * 2020-10-28 2021-01-08 支付宝(杭州)信息技术有限公司 Method and apparatus for data processing
CN112444731A (en) * 2020-10-30 2021-03-05 海光信息技术股份有限公司 Chip testing method and device, processor chip and server
CN116303125A (en) * 2023-05-16 2023-06-23 太初(无锡)电子科技有限公司 Request scheduling method, cache, device, computer equipment and storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
US20130318523A1 (en) * 2012-05-25 2013-11-28 Verizon Patent And Licensing Inc. Hypervisor-based stack pre-fetch cache
CN105677580A (en) * 2015-12-30 2016-06-15 杭州华为数字技术有限公司 Method and device for accessing cache
CN110442382A (en) * 2019-07-31 2019-11-12 西安芯海微电子科技有限公司 Prefetch buffer control method, device, chip and computer readable storage medium

Cited By (7)

Publication number Priority date Publication date Assignee Title
CN111831451A (en) * 2020-07-21 2020-10-27 平安科技(深圳)有限公司 Cloud host memory allocation method, cloud host, cloud device and storage medium
WO2021120843A1 (en) * 2020-07-21 2021-06-24 平安科技(深圳)有限公司 Cloud host memory allocation method, cloud host, device, and storage medium
CN112199400A (en) * 2020-10-28 2021-01-08 支付宝(杭州)信息技术有限公司 Method and apparatus for data processing
CN112444731A (en) * 2020-10-30 2021-03-05 海光信息技术股份有限公司 Chip testing method and device, processor chip and server
CN112444731B (en) * 2020-10-30 2023-04-11 海光信息技术股份有限公司 Chip testing method and device, processor chip and server
CN116303125A (en) * 2023-05-16 2023-06-23 太初(无锡)电子科技有限公司 Request scheduling method, cache, device, computer equipment and storage medium
CN116303125B (en) * 2023-05-16 2023-09-29 太初(无锡)电子科技有限公司 Request scheduling method, cache, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111159062B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN111159062B (en) Cache data scheduling method and device, CPU chip and server
JP5357017B2 (en) Fast and inexpensive store-load contention scheduling and transfer mechanism
JP3830651B2 (en) Microprocessor circuit, system, and method embodying a load target buffer for prediction of one or both of loop and stride
JP4934267B2 (en) Compiler device
US8140768B2 (en) Jump starting prefetch streams across page boundaries
TWI437490B (en) Microprocessor and method for reducing tablewalk time
US6401192B1 (en) Apparatus for software initiated prefetch and method therefor
TWI515567B (en) Translation address cache for a microprocessor
US8006041B2 (en) Prefetch processing apparatus, prefetch processing method, storage medium storing prefetch processing program
JPH07281895A (en) Branch cache
EP1039382B1 (en) Memory access optimizing method
JP2007207246A (en) Self prefetching l2 cache mechanism for instruction line
WO2005088455A2 (en) Cache memory prefetcher
CN108874691B (en) Data prefetching method and memory controller
US20090204791A1 (en) Compound Instruction Group Formation and Execution
CN108874690B (en) Data prefetching implementation method and processor
US6785796B1 (en) Method and apparatus for software prefetching using non-faulting loads
CN112099851A (en) Instruction execution method and device, processor and electronic equipment
JP5238797B2 (en) Compiler device
CN114924797A (en) Method for prefetching instruction, information processing apparatus, device, and storage medium
JP2000207224A (en) Software prefetching method
US20050144420A1 (en) Data processing apparatus and compiler apparatus
JP3739556B2 (en) Information processing device
KR20080073822A (en) Apparatus for controlling program command prefetch and method thereof
JP2008191824A (en) Prefetch method and unit for cache mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 300000 North 2-204 industrial incubation-3-8, No. 18, Haitai West Road, Huayuan Industrial Zone, Binhai New Area, Tianjin

Applicant after: Haiguang Information Technology Co.,Ltd.

Address before: 300000 North 2-204 industrial incubation-3-8, No. 18, Haitai West Road, Huayuan Industrial Zone, Binhai New Area, Tianjin

Applicant before: HAIGUANG INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant