CN115827504A - Data access method for multi-core graphic processor, graphic processor and medium - Google Patents


Info

Publication number
CN115827504A
CN115827504A · Application CN202310072594.6A
Authority
CN
China
Prior art keywords
data
target
instruction
target data
cache
Prior art date
Legal status
Granted
Application number
CN202310072594.6A
Other languages
Chinese (zh)
Other versions
CN115827504B (en)
Inventor
阙恒
朱康挺
孙鹏
谢嵘
Current Assignee
Li Computing Technology Shanghai Co ltd
Nanjing Lisuan Technology Co ltd
Original Assignee
Li Computing Technology Shanghai Co ltd
Nanjing Lisuan Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Li Computing Technology Shanghai Co ltd and Nanjing Lisuan Technology Co ltd
Priority to CN202310072594.6A
Publication of CN115827504A
Application granted
Publication of CN115827504B
Legal status: Active

Classifications

    • Y — General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02 — Technologies or applications for mitigation or adaptation against climate change
    • Y02D — Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A data access method for a multi-core graphics processor, a graphics processor and a medium. The data access method includes: acquiring a target data read instruction; on detecting that the target data read instruction carries a data invalid flag bit, invalidating the data in the cache that corresponds to the target buffer view indicated by the instruction; and reading the target data corresponding to the target buffer view from memory and storing it in the cache. With this scheme, coherent memory access performance can be improved.

Description

Data access method for multi-core graphic processor, graphic processor and medium
Technical Field
The present invention relates to the field of graphics processor technologies, and in particular, to a data access method for a multi-core graphics processor, a graphics processor, and a medium.
Background
A Graphics Processing Unit (GPU) accesses memory frequently during graphics rendering and similar workloads, and a large share of its power consumption and latency comes from these memory accesses. To reduce the number of memory accesses, a cache is placed inside the GPU to hold loaded memory data. When the GPU needs memory data, it first looks in the cache; if the required data is resident there, it is returned directly. On a memory write, the corresponding data in the cache is updated, and the updated data is flushed to memory when the cache line is evicted or the cache is flushed.
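The lookup-update-flush cycle described above can be sketched as follows. This is a minimal illustrative model, not the patent's implementation; the `Cache` class and its method names are assumptions.

```python
# Hypothetical sketch of the GPU cache path described in the text:
# read hits serve from the cache, misses fetch and keep the line resident,
# writes update the cached copy, and a flush pushes updates to memory.
class Cache:
    def __init__(self, memory):
        self.memory = memory      # backing store: address -> data
        self.lines = {}           # resident lines: address -> data

    def load(self, addr):
        if addr in self.lines:    # hit: return resident data directly
            return self.lines[addr]
        data = self.memory[addr]  # miss: fetch from memory ...
        self.lines[addr] = data   # ... and keep it resident
        return data

    def store(self, addr, data):
        self.lines[addr] = data   # a write updates the cached copy

    def flush(self):
        self.memory.update(self.lines)  # push updated lines to memory

memory = {0x100: "old"}
cache = Cache(memory)
print(cache.load(0x100))  # miss, fetched from memory -> old
cache.store(0x100, "new")
cache.flush()
print(memory[0x100])      # -> new
```

Note that until `flush()` runs, the write is visible only inside this cache; this gap is exactly the coherence problem the patent addresses.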
A conventional GPU usually comprises multiple groups of cores, and each core group may have its own cache. When multiple cores access the same memory address, a coherence problem arises, namely: data written to memory by core 1 must be readable by core 2. In other words, the up-to-date data cached inside core 1 must be flushed to memory, and core 2 must read the latest data from memory.
A common way to solve the memory coherence problem is to set the data access attribute of the GPU core's internal cache for a specific memory segment (i.e. a buffer view) to globally coherent, that is, to disable data caching for that buffer view.
However, disabling data caching for the buffer view forces every instruction request that accesses it into non-cached mode, resulting in poor coherent memory access performance.
Disclosure of Invention
The embodiment of the invention solves the technical problem of low coherent memory access performance.
To solve the foregoing technical problem, an embodiment of the present invention provides a data access method for a multi-core graphics processor, including: acquiring a target data read instruction; on detecting that the target data read instruction carries a data invalid flag bit, invalidating the data in the cache that corresponds to the target buffer view indicated by the instruction; and reading the target data corresponding to the target buffer view from memory and storing it in the cache.
Optionally, before the target data read instruction is acquired, the method further includes: receiving a write operation instruction and a flush instruction; writing the data indicated by the write operation instruction into the cache; and writing the data indicated by the write operation instruction into the address field described by the destination buffer view according to the flush instruction.
Optionally, after writing the data indicated by the write operation instruction into the address field described by the destination buffer view according to the flush instruction, the method further includes: inserting a memory barrier; and setting the data invalid flag bit for the first data read instruction that points to the destination buffer view after the memory barrier is inserted.
Optionally, the data invalid flag bit is not set for the Nth data read instruction pointing to the destination buffer view after the memory barrier is inserted, where N is an integer and N ≥ 2.
Optionally, if it is detected that the target data reading instruction does not have the data invalid flag bit, searching the target data from the cache; if the target data is found in the cache, reading the target data from the cache; and if the target data is not found in the cache, reading the target data from the memory and storing the target data in the cache.
An embodiment of the present invention further provides a graphics processor, including: a scheduling execution unit, configured to send a target data read instruction; and a memory access control unit, configured to acquire the target data read instruction, invalidate the data in the cache that corresponds to the target buffer view indicated by the instruction on detecting that the instruction carries a data invalid flag bit, read the target data corresponding to the target buffer view from memory, and store it in the cache.
Optionally, the graphics processor further includes a compiler, adapted to set the data invalid flag bit for the first data read instruction pointing to the target buffer view after a memory barrier is inserted, and not to set the data invalid flag bit for the Nth data read instruction pointing to the target buffer view after the memory barrier is inserted, where N ≥ 2.
Optionally, the memory access control unit is further configured to determine that data corresponding to the destination buffer view is valid when it is detected that the destination data read instruction does not have the data invalid flag bit.
Optionally, the memory access control unit is further configured to: on detecting that the data corresponding to the target buffer view is valid, search the cache for the target data; if the target data is present in the cache, read it from the cache; and if it is not, read the target data from memory and store it in the cache.
An embodiment of the present invention further provides a computer-readable storage medium, which is a non-volatile storage medium or a non-transitory storage medium, and stores a computer program thereon, where the computer program is executed by a processor to perform any of the steps of the data access method described in the foregoing.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
When the target data read instruction carries a data invalid flag bit, the data in the cache that corresponds to the target buffer view indicated by the instruction is invalidated, the target data corresponding to that buffer view is read from memory, and the target data is stored in the cache. Therefore, the technical scheme of the embodiment of the invention solves the memory coherence problem without disabling the cache function of the target buffer view, and can improve coherent memory access performance.
Drawings
FIG. 1 is a flow chart of a method for processing instructions of a graphics processor according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for instruction compilation according to an embodiment of the present invention;
FIG. 3 is a block diagram of a graphics processor according to an embodiment of the present invention;
FIG. 4 is a block diagram of an instruction encoding apparatus according to an embodiment of the present invention.
Detailed Description
As described above, in the prior art the data caching function of a specific buffer view is usually disabled to solve the memory coherence problem. However, disabling data caching for the buffer view forces every instruction request that accesses it into non-cached mode, resulting in poor coherent memory access performance.
In the embodiment of the invention, the memory coherence problem is solved without disabling the cache function of the target buffer view, and coherent memory access performance can be improved.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
An embodiment of the present invention provides an instruction processing method of a graphics processor, which is described in detail below with reference to fig. 1 through specific steps.
Step 101, a target data reading instruction is obtained.
In a specific implementation, the memory access control unit may receive a destination data read instruction sent by the schedule execution unit.
In the embodiment of the present invention, before receiving the target data read instruction, the memory access control unit may receive a write operation instruction and a FLUSH instruction sent by the scheduling execution unit. It writes data into the cache according to the received write operation instruction, and writes the data into the memory address field described by the destination buffer view (buffer view) according to the received FLUSH instruction.
In a specific implementation, the write operation instruction may be a general term for an instruction capable of implementing a write operation, such as an ST instruction.
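The write path just described (ST lands in the cache, FLUSH pushes it to the buffer view's memory range) can be modeled as below. The class and method names, and the representation of a buffer view as an address range, are illustrative assumptions.

```python
# Hypothetical model of the write path: an ST writes into the cache,
# and a FLUSH pushes the buffered data to the memory range described
# by the destination buffer view. Names are illustrative only.
class AccessControlUnit:
    def __init__(self, memory):
        self.memory = memory          # address -> data
        self.cache = {}               # address -> data

    def st(self, addr, data):
        self.cache[addr] = data       # the write lands in the cache first

    def flush(self, buffer_view):
        # push cached data belonging to the view's address range to memory
        lo, hi = buffer_view
        for addr, data in self.cache.items():
            if lo <= addr < hi:
                self.memory[addr] = data
        return "ack"                  # a CHECK would wait for this signal

memory = {}
acu = AccessControlUnit(memory)
acu.st(0x800, 42)
assert 0x800 not in memory            # not yet visible to other cores
assert acu.flush((0x800, 0x900)) == "ack"
assert memory[0x800] == 42            # now globally visible
```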
Step 102, detecting that the target data reading instruction has a data invalidation flag bit, and invalidating data corresponding to the target buffer view indicated by the target data reading instruction in the cache.
In the embodiment of the present invention, a compiler may set a corresponding flag bit for the destination data read instruction. Specifically, the compiler may set the data invalid flag bit for the first data read instruction after a memory barrier is inserted.
In one embodiment, the destination data read instruction points to the destination buffer view, unless otherwise specified.
In the embodiment of the present invention, during the process of compiling the target data reading instruction, the compiler may set a data invalid flag bit for the first target data reading instruction pointing to a certain target buffer view after inserting the memory barrier.
For example, suppose the data invalid flag bit is ".inv". Instruction 1 is the first destination data read instruction pointing to destination buffer view u8 after the memory barrier is inserted; instruction 2 is the second destination data read instruction pointing to u8 after the memory barrier is inserted. Instruction 1 and instruction 2 are as follows:
instruction 1: LD.inv r16.xyzw, [r0.xy], u8;
instruction 2: LD r16.xyzw, [r0.xy], u8;
It can be seen that, compared with instruction 2, the compiler adds the .inv flag to instruction 1 to indicate that instruction 1 carries the data invalid flag bit.
In a specific implementation, the compiler may also choose a suitable free position in the destination data read instruction to define a bit field, whose value indicates whether the data invalid flag bit is present. When the bit field takes a first value, the destination data read instruction carries the data invalid flag bit; when it takes a second value, the instruction does not.
In one embodiment of the present invention, the bit field is named the "inv field", where inv is an abbreviation of invalidate. The bit field may be 1 bit long, or 2 bits, or another length. When the "inv field" of a destination data read instruction is detected to hold the first value, the instruction is judged to carry the data invalid flag bit; when it holds the second value, the instruction is judged not to carry it.
If the bit field is 1 bit long, a value of 1 may indicate that the destination data read instruction carries the data invalid flag bit and a value of 0 that it does not; alternatively, 1 may indicate that the flag is absent and 0 that it is present.
It can be understood that the length of the bit field may be set according to a specific application scenario requirement, and the relationship between the value of the bit field and the validity of the flag bit may also be set according to a specific application scenario, which is not described in detail in the embodiments of the present invention.
In a specific implementation, after a memory barrier is inserted, the compiler sets the data invalid flag bit only for the first destination data read instruction pointing to a given destination buffer view (i.e. adds the ".inv" flag to that instruction); the second and subsequent destination data read instructions pointing to that buffer view are not given the flag until the next memory barrier is inserted.
That is, between the current memory barrier and the next one, only the first destination data read instruction pointing to the destination buffer view carries the data invalid flag bit; all other destination data read instructions pointing to the same buffer view do not.
For example, suppose the destination buffer view is u8. Between the current memory barrier and the next one, the first destination data read instruction pointing to u8 carries the data invalid flag bit, and every later destination data read instruction pointing to u8 does not.
It is to be understood that there may be multiple destination buffer views, and the compiler sets the data invalid flag bit on the first destination data read instruction pointing to each of them.
For example, suppose the destination buffer views are u0 to u8. The compiler sets the data invalid flag bit on the first destination data read instruction pointing to u0, on the first one pointing to u1, and so on, up to the first one pointing to u8.
In the embodiment of the present invention, the compiler likewise does not set the data invalid flag bit on the Nth data read instruction after the memory barrier is inserted, until the next memory barrier is inserted, where N is an integer and N ≥ 2.
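The marking rule above amounts to a small compiler pass: reset the per-view "seen" state at every barrier and flag only the first read of each view. This sketch is an assumption about how such a pass could look, not the patent's actual compiler.

```python
# Hypothetical compiler pass: after each memory barrier, mark only the
# first read of every destination buffer view with the ".inv" flag.
def mark_invalidations(instructions):
    seen_since_barrier = set()   # views already read since the last barrier
    marked = []
    for op, view in instructions:
        if op == "BARRIER":
            seen_since_barrier.clear()        # a barrier resets the state
            marked.append((op, view))
        elif op == "LD" and view not in seen_since_barrier:
            seen_since_barrier.add(view)
            marked.append(("LD.inv", view))   # first read of the view: flag it
        else:
            marked.append((op, view))         # later reads: no flag
    return marked

program = [("BARRIER", None), ("LD", "u8"), ("LD", "u8"),
           ("BARRIER", None), ("LD", "u8")]
print(mark_invalidations(program))
# [('BARRIER', None), ('LD.inv', 'u8'), ('LD', 'u8'),
#  ('BARRIER', None), ('LD.inv', 'u8')]
```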
In the embodiment of the present invention, if the memory access control unit detects that the target data read instruction has the data invalidation flag bit, the memory access control unit may invalidate the data corresponding to the target buffer view stored in the cache.
Step 103, reading the target data corresponding to the destination buffer view from memory and storing the target data in the cache.
In the embodiment of the present invention, the memory access control unit may read the destination data described by the destination buffer view from memory and store the read data in the cache.
In a specific implementation, if the memory access control unit detects that the target data read instruction does not carry the data invalid flag bit, it may determine that the data corresponding to the destination buffer view is valid. It then searches the cache for the destination data described by the buffer view. If the destination data is found in the cache, i.e. it is present there, it is read from the cache; if it is not found, the destination data described by the buffer view is read from memory and the data just read is stored in the cache.
It is understood that the destination data being present in the cache is a cache hit, and its absence a cache miss.
Therefore, on receiving a destination data read instruction that carries the data invalid flag bit, the memory access control unit invalidates the cached data corresponding to the destination buffer view indicated by the instruction, reads the destination data described by the buffer view from memory, and stores it in the cache. On receiving a destination data read instruction without the flag, it searches the cache, returns the data if found, and reads it from memory otherwise. The memory coherence problem is thus solved without disabling the cache function of the destination buffer view, which effectively improves coherent memory access performance.
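The two load paths can be condensed into one function: an `.inv` read drops the view's cached data and refills from memory, while a plain read uses the normal hit/miss path. The function name and the representation of a view as a list of addresses are illustrative assumptions.

```python
# Hypothetical sketch of the load path in the memory access control unit.
def handle_load(cache, memory, view_addrs, addr, inv):
    if inv:
        for a in view_addrs:          # invalidate the whole buffer view
            cache.pop(a, None)
        cache[addr] = memory[addr]    # refill the requested data from memory
        return cache[addr]
    if addr in cache:                 # hit: serve from the cache
        return cache[addr]
    cache[addr] = memory[addr]        # miss: fetch and keep resident
    return cache[addr]

memory = {0: "fresh"}
cache = {0: "stale"}                  # outdated copy left by an earlier read
print(handle_load(cache, memory, [0], 0, inv=True))   # -> fresh
print(handle_load(cache, memory, [0], 0, inv=False))  # -> fresh (cache hit)
```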
The data processing method provided in the above-described embodiment of the present invention is explained below by specific examples.
The application author may insert a memory barrier after the producer (the write operation on the coherent buffer) to ensure that write data from all GPU cores reaches memory. In the specific implementation, the GPU flushes the cached data and feeds back a completion signal via the FLUSH instruction and the CHECK instruction. See the following instructions:
ST [r0.xy] r12.xy,u8
FLUSH
CHECK
the first memory write data command ST is the producer. u8 is a buffer view describing memory attributes, the view is labeled as 8, general registers r0 and r1 store memory addresses (addresses), and r12 and r13 are data to be written into the memory, and the data correspond to applications. The ST command is followed by a FLUSH command and a CHECK command, which are respectively used for sending data to the memory and returning a data write-in confirmation signal.
With the technical scheme provided by the embodiment of the invention, the stale cached data residing in a core is invalidated, and the latest data is then fetched again from external memory. Specifically, the flag bit of the first destination data read instruction pointing to u8 is set as the data invalid flag bit, and no invalid flag bit is set on later destination data read instructions pointing to u8, as shown in the following instructions:
LD.inv r16.xyzw,[r0.xy],u8
LD r20.xyzw,[r2.xy],u8
LD r24.xyzw,[r4.xy],u8
the corresponding flag bits of the second LD instruction and the third LD instruction are invalid flag bits, so that the inv field can be considered to be absent.
Therefore, for u8, the cache is always in a working state, and the LD instruction behind the LD.inv can normally access the data corresponding to u8, so that the use performance of the cache is greatly improved.
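Putting the producer and consumer sides together, the whole sequence can be simulated in a few lines. All structures below are illustrative stand-ins, with the buffer view modeled as a single key; the comments map each step to the instructions above.

```python
# Hypothetical two-core simulation of the sequence above: core 1 produces
# (ST + FLUSH + CHECK), core 2 consumes with LD.inv followed by a plain LD.
memory = {"u8": "old"}
core1_cache = {}
core2_cache = {"u8": "old"}          # stale copy from before the update

# producer on core 1: ST writes the cache, FLUSH pushes it to memory
core1_cache["u8"] = "new"            # ST [r0.xy] r12.xy, u8
memory["u8"] = core1_cache["u8"]     # FLUSH (CHECK confirms completion)

# consumer on core 2: LD.inv drops the stale line, then refills from memory
core2_cache.pop("u8", None)
core2_cache["u8"] = memory["u8"]     # LD.inv r16.xyzw, [r0.xy], u8
first = core2_cache["u8"]

# later plain LDs hit the now up-to-date cache, with no memory traffic
second = core2_cache["u8"]           # LD r20.xyzw, [r2.xy], u8

print(first, second)                 # new new
```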
Referring to fig. 2, an instruction compiling method according to an embodiment of the present invention is shown, and the following detailed description is provided through specific steps.
Step 201, a target data reading instruction is obtained.
In a specific implementation, the compiler may obtain the target data reading instruction, and set a flag bit of the target data reading instruction accordingly.
Step 202, detecting that the destination data read instruction is the first data read instruction pointing to the destination buffer view after a memory barrier is inserted, and setting the data invalid flag bit for the destination data read instruction.
In an embodiment of the present invention, the compiler may set the data invalid flag bit for the first data read instruction after the memory barrier is inserted.
In the embodiment of the present invention, while compiling the destination data read instructions, the compiler may set the data invalid flag bit for the first destination data read instruction pointing to a given destination buffer view after the memory barrier is inserted. For the Nth (N ≥ 2) data read instruction after that first one, the compiler does not need to set the data invalid flag bit.
For example, suppose the data invalid flag bit is ".inv". Instruction 1 is the first destination data read instruction pointing to destination buffer view u8 after the memory barrier is inserted; instruction 2 is the second destination data read instruction pointing to u8 after the memory barrier is inserted. Instruction 1 and instruction 2 are as follows:
instruction 1: LD.inv r16.xyzw, [r0.xy], u8;
instruction 2: LD r16.xyzw, [r0.xy], u8;
It can be seen that, compared with instruction 2, the compiler adds the .inv flag to instruction 1 to indicate that instruction 1 carries the data invalid flag bit.
In a specific implementation, the compiler may also choose a suitable free position in the destination data read instruction to define a bit field, whose value indicates whether the data invalid flag bit is present. When the bit field takes a first value, the destination data read instruction carries the data invalid flag bit; when it takes a second value, the instruction does not.
In one embodiment of the present invention, the bit field is named "inv", an abbreviation of invalidate. The bit field may be 1 bit long, or 2 bits, or another length. When the "inv field" of a destination data read instruction is detected to hold the first value, the instruction is judged to carry the data invalid flag bit; when it holds the second value, the instruction is judged not to carry it.
If the bit field is 1 bit long, a value of 1 may indicate that the destination data read instruction carries the data invalid flag bit and a value of 0 that it does not; alternatively, 1 may indicate that the flag is absent and 0 that it is present.
It can be understood that the length of the bit field may be set according to the specific application scenario requirements, and the relationship between the value of the bit field and the validity of the flag bit may also be set according to the specific application scenario, which is not described in detail in the embodiments of the present invention.
In a specific implementation, after a memory barrier is inserted, the compiler may set the data invalid flag bit only for the first destination data read instruction pointing to a given destination buffer view (i.e. add the ".inv" flag to that instruction); the second and subsequent destination data read instructions pointing to that buffer view are no longer given the flag until the next memory barrier is inserted.
That is, between the current memory barrier and the next one, only the first destination data read instruction pointing to the destination buffer view carries the data invalid flag bit; all other destination data read instructions pointing to the same buffer view do not.
For example, suppose the destination buffer view is u8. Between the current memory barrier and the next one, the first destination data read instruction pointing to u8 carries the data invalid flag bit, and every later destination data read instruction pointing to u8 does not.
It is to be understood that there may be multiple destination buffer views, and the compiler sets the data invalid flag bit on the first destination data read instruction pointing to each of them.
For example, suppose the destination buffer views are u0 to u8. The compiler sets the data invalid flag bit on the first destination data read instruction pointing to u0, on the first one pointing to u1, and so on, up to the first one pointing to u8.
In other words, after a memory barrier is inserted the compiler sets the data invalid flag bit on the first data read instruction, and does not set it on later destination data read instructions pointing to the same destination buffer view until the next memory barrier is inserted.
In the embodiment of the present invention, the compiler sets the data invalid flag bit on the destination data read instruction; when the memory access control unit processes an instruction carrying this flag, it invalidates the cached data corresponding to the destination buffer view, so that subsequent destination data read instructions can safely access the cache.
Referring to fig. 3, an embodiment of the present invention further provides a graphics processor 30, including: a schedule execution unit 301 and a memory access control unit 302, wherein:
a scheduling execution unit 301, configured to send a destination data reading instruction;
a memory access control unit 302, configured to acquire the target data read instruction; on detecting that the target data read instruction carries a data invalid flag bit, invalidate the data in the cache that corresponds to the target buffer view indicated by the instruction; and read the target data corresponding to the target buffer view from memory and store it in the cache.
In a specific implementation, the memory access control unit 302 may be further configured to detect that the target data read instruction does not carry the data invalid flag bit and search the cache for the target data; if the target data is present in the cache, read it from the cache; and if it is not, read the target data from memory and store it in the cache.
In a specific implementation, the specific execution processes of the scheduling execution unit 301 and the memory access control unit 302 may correspond to the embodiment of the instruction processing method, which is not described herein again.
Referring to fig. 4, an embodiment of the present invention further provides an instruction compiling apparatus 40, including: an acquisition unit 401 and a setting unit 402, wherein:
an acquisition unit 401 configured to acquire a target data reading instruction;
a setting unit 402, configured to detect that the destination data read instruction is a data read instruction that points to a destination buffer view first after being inserted into a memory barrier, and set a data invalid flag bit for the destination data read instruction.
In a specific implementation, the specific execution processes of the obtaining unit 401 and the setting unit 402 may refer to the foregoing steps 201 to 202, which are not described herein again.
In an embodiment of the present invention, the instruction compiling apparatus 40 may correspond to the compiler provided in the above embodiment.
The embodiment of the present invention further provides a computer-readable storage medium, which is a non-volatile storage medium or a non-transitory storage medium and stores a computer program; when the computer program is executed by a processor, the steps of the data access method provided in steps 101 to 103 above are performed.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by instructing the relevant hardware through a program, which may be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, and the like.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A data access method for a multi-core graphics processor, comprising:
acquiring a target data read instruction;
upon detecting that the target data read instruction has a data invalid flag bit, invalidating, in a cache, the data corresponding to a target buffer view indicated by the target data read instruction;
and reading the target data corresponding to the target buffer view from a memory, and storing the target data in the cache.
2. The data access method for a multi-core graphics processor of claim 1, further comprising, before acquiring the target data read instruction:
receiving a write operation instruction and a refresh instruction;
writing the data indicated by the write operation instruction into the cache;
and, according to the refresh instruction, writing the data indicated by the write operation instruction into an address field described by the target buffer view.
3. The data access method for a multi-core graphics processor of claim 2, further comprising, after writing the data indicated by the write operation instruction into the address field described by the target buffer view according to the refresh instruction:
inserting a memory barrier;
and setting the data invalid flag bit for the first data read instruction pointing to the target buffer view after the memory barrier is inserted.
4. The data access method for a multi-core graphics processor of claim 3, further comprising:
not setting the data invalid flag bit for the Nth data read instruction pointing to the target buffer view after the memory barrier is inserted, where N is greater than or equal to 2.
5. The data access method for a multi-core graphics processor of claim 4, further comprising:
upon detecting that the target data read instruction does not have the data invalid flag bit, searching the cache for the target data;
if the target data is found in the cache, reading the target data from the cache;
and if the target data is not found in the cache, reading the target data from the memory and storing the target data in the cache.
6. A graphics processor, comprising:
a scheduling execution unit configured to send a target data read instruction;
and a memory access control unit configured to acquire the target data read instruction; upon detecting that the target data read instruction has a data invalid flag bit, invalidate, in a cache, the data corresponding to a target buffer view indicated by the target data read instruction; and read the target data corresponding to the target buffer view from a memory and store the target data in the cache.
7. The graphics processor of claim 6, further comprising: a compiler configured to set the data invalid flag bit for the first data read instruction pointing to the target buffer view after a memory barrier is inserted, and not to set the data invalid flag bit for the Nth data read instruction pointing to the target buffer view after the memory barrier is inserted, where N is greater than or equal to 2.
8. The graphics processor of claim 7, wherein the memory access control unit is further configured to confirm that the data corresponding to the target buffer view is valid upon detecting that the target data read instruction does not have the data invalid flag bit.
9. The graphics processor of claim 8, wherein the memory access control unit is further configured to: upon detecting that the data corresponding to the target buffer view is valid, search the cache for the target data; if the target data exists in the cache, read the target data from the cache; and if the target data does not exist in the cache, read the target data from the memory and store the target data in the cache.
10. A computer-readable storage medium, which is a non-volatile storage medium or a non-transitory storage medium, on which a computer program is stored, which when executed by a processor performs the steps of the data access method according to any one of claims 1 to 5.
CN202310072594.6A 2023-01-31 2023-01-31 Data access method for multi-core graphic processor, graphic processor and medium Active CN115827504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310072594.6A CN115827504B (en) 2023-01-31 2023-01-31 Data access method for multi-core graphic processor, graphic processor and medium

Publications (2)

Publication Number Publication Date
CN115827504A true CN115827504A (en) 2023-03-21
CN115827504B CN115827504B (en) 2023-07-11

Family

ID=85520828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310072594.6A Active CN115827504B (en) 2023-01-31 2023-01-31 Data access method for multi-core graphic processor, graphic processor and medium

Country Status (1)

Country Link
CN (1) CN115827504B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110093661A1 (en) * 2008-06-17 2011-04-21 Nxp B.V. Multiprocessor system with mixed software hardware controlled cache management
CN105677580A (en) * 2015-12-30 2016-06-15 杭州华为数字技术有限公司 Method and device for accessing cache
CN106155577A (en) * 2015-04-23 2016-11-23 华为技术有限公司 The access method of exented memory, equipment and system

Also Published As

Publication number Publication date
CN115827504B (en) 2023-07-11

Similar Documents

Publication Publication Date Title
US7493452B2 (en) Method to efficiently prefetch and batch compiler-assisted software cache accesses
US4626988A (en) Instruction fetch look-aside buffer with loop mode control
JP4028875B2 (en) System and method for managing memory
US8285940B2 (en) Method and apparatus for high speed cache flushing in a non-volatile memory
US7552283B2 (en) Efficient memory hierarchy management
US7650471B2 (en) Head of queue cache for communication interfaces
JP5526626B2 (en) Arithmetic processing device and address conversion method
US8499123B1 (en) Multi-stage pipeline for cache access
US8195881B2 (en) System, method and processor for accessing data after a translation lookaside buffer miss
CN106126441B (en) Method for caching and caching data items
US20100217937A1 (en) Data processing apparatus and method
US8127085B2 (en) Method and apparatus for pipeline inclusion and instruction restarts in a micro-op cache of a processor
US20100011165A1 (en) Cache management systems and methods
US6687807B1 (en) Method for apparatus for prefetching linked data structures
CN114116016B (en) Instruction prefetching method and device based on processor
US20040243764A1 (en) Tag array access reduction in a cache memory
JP4666511B2 (en) Memory caching in data processing
US20090292857A1 (en) Cache memory unit
CN108874691B (en) Data prefetching method and memory controller
CN108874690B (en) Data prefetching implementation method and processor
CN115827504B (en) Data access method for multi-core graphic processor, graphic processor and medium
CN112711383B (en) Non-volatile storage reading acceleration method for power chip
CN115098410A (en) Processor, data processing method for processor and electronic equipment
US7502892B2 (en) Decoupling request for ownership tag reads from data read operations
US9053030B2 (en) Cache memory and control method thereof with cache hit rate

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant