CN113612863B - Method, system, device and storage medium for optimizing address translation in a GPU - Google Patents

Method, system, device and storage medium for optimizing address translation in a GPU

Info

Publication number
CN113612863B
CN113612863B (application CN202110785306.2A)
Authority
CN
China
Prior art keywords
tlb
private
address translation
core
directory
Prior art date
Legal status
Active
Application number
CN202110785306.2A
Other languages
Chinese (zh)
Other versions
CN113612863A
Inventor
杜亚娟
杨玉琦
Current Assignee
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority: CN202110785306.2A
Publication of CN113612863A
Application granted
Publication of CN113612863B


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 61/00: Network arrangements, protocols or services for addressing or naming
    • H04L 61/09: Mapping addresses
    • H04L 61/25: Mapping addresses of the same type
    • H04L 61/2596: Translation of addresses of the same type other than IP, e.g. translation from MAC to MAC addresses

Abstract

The invention relates to a method, a system, a device and a storage medium for optimizing address translation in a GPU, wherein the method comprises the following steps: establishing a directory for each core of the GPU, wherein the directory records the sharing condition between the private TLBs of the GPU cores; if the current address translation request misses in the local core private TLB, determining the hitting remote core private TLB by searching the directory of the local core private TLB, and obtaining the physical address required by the current address translation request from the response of the hitting remote core private TLB, so as to complete the address translation. The invention can effectively reduce the probability of accessing the shared TLB, reduce access latency and improve address translation performance.

Description

Method, system, device and storage medium for optimizing address translation in a GPU
Technical Field
The present application relates to the field of memory management, and in particular to a method, a system, a device, and a storage medium for optimizing address translation in a GPU.
Background
At present, with the wide application of artificial intelligence and machine learning, there is an urgent need for rapid processing of massive data, and the advantage of the GPU (Graphics Processing Unit) in processing massive data in parallel has opened the door for such applications. Since data is accessed by physical address, the virtual address of the data must first be translated into a physical address; this process is called address translation, and its performance greatly affects the performance of the entire program. Currently, performance optimization of address translation in the GPU is far less mature than in the CPU, so optimizing address translation for GPU application programs is necessary.
As shown in fig. 1, in the address translation process in a conventional GPU, a core issues an address translation request and first looks up its private TLB (Translation Lookaside Buffer); if the request misses there, it looks up the shared TLB, but the latency of accessing the shared TLB is about ten times that of accessing the private TLB, which obviously degrades performance. At present, most research solutions at home and abroad consider the following three aspects. First, enlarging the capacity of the TLB in hardware; but because of price and latency constraints, TLB capacity is limited and cannot grow indefinitely, so this approach has great limitations. Second, improving coverage: some methods compress multiple translation entries of the page table into one TLB entry and fill it into the TLB, but no matter what compression method is used, coverage is also limited and entries cannot be compressed indefinitely. Third, improving the access latency of the TLB; but because the shared TLB holds far more translation entries than a private TLB, latency optimization of the shared TLB has so far been unsatisfactory. The inventors therefore believe that the way address translation is optimized in GPUs needs further improvement.
Disclosure of Invention
In view of this, the present application provides a method, a system, a device and a storage medium for optimizing address translation in a GPU, so as to solve the technical problem of long address translation latency in conventional GPUs.
In order to solve the above problem, in a first aspect, the present invention provides a method for optimizing address translation in a GPU, where the method includes:
establishing a directory for each core of the GPU, wherein the directory records the sharing condition between private TLBs of each core of the GPU;
if the current address translation request misses in the local core private TLB, determining the hitting remote core private TLB by searching the directory of the local core private TLB, and obtaining the physical address required by the current address translation request from the response of the hitting remote core private TLB, so as to complete the address translation.
Optionally, establishing a directory for each core of the GPU specifically comprises: configuring a directory for the private TLB of each core, wherein the directory records the mapping information of each entry in the corresponding private TLB and of each entry in the remote core private TLBs closely linked with it, and the mapping information comprises the virtual address of the entry and its storage location information.
Optionally, the storage location information comprises bitmap information and shared information for the entry, wherein the bitmap information records the positions and number of the cores storing the physical address mapped to the virtual address of the corresponding entry, and the shared information indicates whether that physical address is also stored in the shared TLB.
Optionally, before completing the address translation according to the response of the remote core private TLB, the method further includes:
interconnecting each private TLB through an on-chip interconnection network technology;
determining a hit remote core private TLB by searching a directory of the local core private TLB, acquiring a physical address required by a current address translation request according to a response of the hit remote core private TLB, and completing address translation, wherein the address translation includes:
the current address translation request contains a target virtual address; by searching the directory of the local core private TLB for an entry whose virtual address matches the target virtual address, the private TLB of the core storing the physical address mapped to the target virtual address is determined from the bitmap information recorded in the matched entry, and is taken as the hitting remote core private TLB;
sending the current address translation request to the hitting remote core private TLB over the on-chip interconnection network between the private TLBs;
and acquiring the request response to obtain the physical address mapped to the target virtual address, completing the address translation.
Optionally, the method further includes:
if the current address translation request misses in the local core private TLB and no hitting remote core private TLB is found by searching the directory of the local core private TLB, the current address translation request is sent to the shared TLB for lookup; if it also misses there, the request is sent to the page table walk buffer to look up the page table in memory.
Optionally, the method further includes:
when a target entry of the shared TLB needs to be placed into the private TLB to be filled, counting the number of entries shared and stored with the private TLB to be filled according to bitmap information of each entry recorded in a directory of the private TLB to be filled;
and judging whether the number of the shared and stored items reaches a preset threshold value, if so, filling the target items of the shared TLB into the private TLB to be filled, and updating the corresponding directory.
Optionally, the method further includes:
screening a plurality of items to be deleted from a target private TLB based on a preset MRU strategy;
acquiring bitmap information of a current item to be deleted by searching a directory of a target private TLB, and counting the number of cores shared and stored by physical addresses mapped corresponding to the current item to be deleted and taking the number as a sharing degree;
judging whether the sharing degree of the current item to be deleted reaches a preset sharing threshold value or not;
if not, taking the current entry to be deleted as an evicted entry, and deleting the entry from the target private TLB;
if yes, judging the next entry to be deleted, and continuing the cyclic eviction under the MRU policy until all entries to be deleted have been judged without an eviction victim being selected.
In a second aspect, the present application provides a system for optimizing address translation in a GPU, the system comprising:
the directory building module is used for building a directory for each core of the GPU, and the directory records the sharing condition between the private TLBs of each core of the GPU;
and the address translation module is used for determining a hit remote core private TLB by searching a directory of the local core private TLB if the current address translation request is missed in the local core private TLB, acquiring a physical address required by the current address translation request according to the response of the hit remote core private TLB, and completing address translation.
In a third aspect, the present application provides a computer device, which adopts the following technical solutions:
a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above method for optimizing address translation in a GPU.
In a fourth aspect, the present application provides a computer-readable storage medium, which adopts the following technical solutions:
a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above method for optimizing address translation in a GPU.
The beneficial effects of the above embodiments are as follows: a directory is established for each core of the GPU, and the sharing condition among the private TLBs of the GPU cores is recorded in the directory, thereby linking private TLBs with similar access patterns; if the current address translation request misses in the local core private TLB, the hitting remote core private TLB can be determined by searching the directory, and the physical address required by the request can be obtained from its response to complete the address translation, which effectively reduces the probability of accessing the shared TLB, reduces access latency, and improves address translation performance.
Drawings
FIG. 1 is a schematic diagram of an application scenario of an address translation optimization system in a GPU according to the present invention;
FIG. 2 is a flowchart of a method for optimizing address translation in a GPU according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating another embodiment of a method for optimizing address translation in a GPU according to the present invention;
FIG. 4 is a diagram illustrating mapping information for an entry of a directory record of a private TLB provided by the present invention;
FIG. 5 is a schematic diagram of access between the private TLBs over the on-chip interconnection network provided by the present invention;
FIG. 6 is a flowchart of an embodiment of step S202 of the method for optimizing address translation in a GPU provided by the present invention;
FIG. 7 is a flowchart of a method for optimizing address translation in a GPU according to another embodiment of the present invention;
FIG. 8 is a flowchart illustrating a method for optimizing address translation in a GPU according to another embodiment of the present invention;
FIG. 9 is a schematic block diagram of an embodiment of a system for optimizing address translation in a GPU according to the present invention;
FIG. 10 is a schematic block diagram of an embodiment of a computer device provided by the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and, together with the description, serve to explain the principles of the invention; they are not intended to limit its scope.
In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.
The invention provides a method, a system, a device and a storage medium for optimizing address translation in a GPU, which are explained in detail below.
Fig. 1 is a schematic view of a scenario of the address translation optimization system in a GPU according to an embodiment of the present disclosure. The system may include a server 100 that integrates the address translation optimization system in the GPU, such as the server shown in fig. 1.
In the embodiment of the present application, the server 100 is mainly used for:
establishing a directory for each core of the GPU, wherein the directory records the sharing condition between private TLBs of each core of the GPU;
if the current address translation request is missed in the local core private TLB, the hit remote core private TLB is determined by searching the directory, and the physical address required by the current address translation request is acquired according to the response of the hit remote core private TLB, so that the address translation is completed.
In this embodiment, the server 100 may be an independent server, or a server network or server cluster composed of servers; for example, the server 100 described in this embodiment includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud server composed of multiple servers, where a cloud server is built from a large number of computers or web servers based on cloud computing.
It will be appreciated that the terminal 200 used in the embodiments of the present application may be a device that includes both receiving and transmitting hardware, i.e. a device having receiving and transmitting hardware capable of performing two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device having a single line display or a multi-line display or a cellular or other communication device without a multi-line display. The specific terminal 200 may be a desktop computer, a portable computer, a network server, a Personal Digital Assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, and the like, and the embodiment does not limit the type of the terminal 200.
Those skilled in the art will understand that the application environment shown in fig. 1 is only one application scenario related to the present application, and does not constitute a limitation on the application scenario of the present application, and that other application environments may further include more or fewer terminals than those shown in fig. 1, for example, only 2 terminals are shown in fig. 1, and it can be understood that the address translation optimization system in the GPU may further include one or more other terminals, which is not limited herein.
In addition, referring to fig. 1, the address translation optimization system in the GPU may further include a memory 200 for storing data, such as directory information of each private TLB.
It should be noted that the scene diagram of the address translation optimization system in the GPU shown in fig. 1 is merely an example, and the address translation optimization system in the GPU and the scene described in the embodiment of the present application are for more clearly illustrating the technical solution of the embodiment of the present application, and do not form a limitation on the technical solution provided in the embodiment of the present application.
First, an embodiment of the present application provides a method for optimizing address translation in a GPU, where the method includes:
establishing a directory for each core of the GPU, wherein the directory records the sharing condition between private TLBs of each core of the GPU; if the current address translation request misses in the local core private TLB, the hitting remote core private TLB is determined by searching the directory, and the physical address required by the current address translation request is obtained from the response of the hitting remote core private TLB, so as to complete the address translation.
Referring to fig. 2, a flowchart of an embodiment of a method for optimizing address translation in a GPU is provided in the present application, where the method for optimizing address translation in a GPU includes the following steps:
s201, establishing a directory for each core of the GPU, wherein the directory records the sharing condition between private TLBs of each core of the GPU;
s202, if the current address translation request is not hit in the local core private TLB, the hit remote core private TLB is determined by searching the directory of the local core private TLB, and the physical address required by the current address translation request is obtained according to the response of the hit remote core private TLB, so that the address translation is completed.
In this embodiment, a directory may be provided in hardware alongside the private TLB of each core. The information recorded by a directory differs according to its core's private TLB: the directory of one core's private TLB mainly records information about the local core private TLB and about the remote core private TLBs most strongly correlated with it, thereby linking private TLBs with high access similarity and improving access efficiency.
This embodiment establishes a directory for each core of the GPU and records the sharing condition among the private TLBs of the GPU cores in the directory, thereby linking private TLBs with similar access patterns; if the current address translation request misses in the local core private TLB, the hitting remote core private TLB can be determined by searching the directory, and the physical address required by the request can be obtained from its response to complete the address translation, which effectively reduces the probability of accessing the shared TLB, reduces access latency, and improves address translation performance.
Optionally, referring to fig. 3, the method for optimizing address translation in a GPU provided by the present application further includes:
if the current address translation request misses in the local core private TLB and no hitting remote core private TLB is found by searching the directory of the local core private TLB, the current address translation request is sent to the shared TLB for lookup; if it also misses there, the request is sent to the page table walk buffer to look up the page table in memory.
Optionally, step S201 provided in the present application specifically includes:
a directory is configured for the private TLB of each core, wherein the directory records the mapping information of each entry in the corresponding private TLB and of each entry in the remote core private TLBs closely linked with it, and the mapping information comprises the virtual address of the entry and its storage location information.
Optionally, the storage location information comprises bitmap information and shared information for the entry, wherein the bitmap information records the positions and number of the cores storing the physical address mapped to the virtual address of the corresponding entry, and the shared information indicates whether that physical address is also stored in the shared TLB.
Specifically, fig. 4 shows the mapping information of one directory entry: VPN denotes the virtual address of the corresponding private TLB entry; L1-bitmap records the core positions, i.e. the private TLBs, that store the physical address mapped to the current VPN, and the number of set bits gives the number of sharing cores; L2-bit is a single bit indicating whether the physical address mapped to the current VPN is also stored in the shared TLB.
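The directory entry layout of fig. 4 can be sketched as a small data structure. This is a hedged illustration only: the core count, field names, and helper methods below are assumptions for the sketch, not part of the patent.

```python
from dataclasses import dataclass

NUM_CORES = 8  # hypothetical core count, chosen only for illustration


@dataclass
class DirectoryEntry:
    """One record in a private TLB's directory (cf. fig. 4)."""
    vpn: int        # virtual page number of the TLB entry
    l1_bitmap: int  # bit i set => core i's private TLB holds this translation
    l2_bit: bool    # True => the shared (L2) TLB also holds this translation

    def sharer_cores(self):
        """Cores whose private TLBs store the mapping for this VPN."""
        return [i for i in range(NUM_CORES) if (self.l1_bitmap >> i) & 1]

    def sharing_degree(self):
        """Number of sharing cores: the popcount of the L1 bitmap."""
        return bin(self.l1_bitmap).count("1")
```

With this layout, the number of set bits in `l1_bitmap` directly gives the sharing degree used later by the fill and eviction policies.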
Optionally, before completing the address translation according to the response of the remote core private TLB in step S202 provided by the present application, the optimization method of this embodiment further includes:
and the private TLBs are interconnected through an on-chip interconnection network technology.
In this embodiment, referring to fig. 5, all the private TLBs are connected through an on-chip interconnection network, so that each private TLB can issue an address translation request to another private TLB; specifically, a routing node in the on-chip network selects an appropriate route to send a request or a response to the routing node of the corresponding private TLB.
Optionally, referring to fig. 6, the present application provides a flowchart of a method of an embodiment of step S202, where the step S202 includes the following steps:
s601, the current address translation request comprises a target virtual address, and the private TLB of a core, which is stored by physical address information mapped corresponding to the target virtual address and is used as a hit remote core private TLB, is determined according to bitmap information recorded by a matched entry by searching a virtual address of an entry successfully matched with the target virtual address in a directory of the local core private TLB;
s602, based on the on-chip interconnection network among the private TLBs, sending the current address translation request to the hit remote core private TLB;
s603, obtaining the request response, obtaining the physical address mapped corresponding to the target virtual address, and completing address conversion.
Specifically, when the current address translation request misses in the local core private TLB, the directory of the local core private TLB is searched, according to the target virtual address carried in the request, for a matching VPN, which determines the hitting remote core private TLBs and their number; further, referring to fig. 4, the current address translation request is sent over the on-chip interconnection network to the remote core private TLB to be accessed, and a response carrying the physical address mapped to the target virtual address is awaited and received over the on-chip interconnection network, so that the current address translation request completes address translation at the local core.
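The lookup order described above (local private TLB, then directory-guided remote private TLB, then shared TLB, then page-table walk) can be sketched as follows. Modelling each TLB and the page table as a Python dict, and trying the recorded sharer cores in order, are simplifying assumptions made only for this sketch.

```python
def translate(vpn, local_core, private_tlbs, directories, shared_tlb, page_table):
    """Return the physical address for vpn, following the lookup order of
    the embodiment: local private TLB -> directory-guided remote private
    TLB (reached over the on-chip network) -> shared TLB -> page table."""
    if vpn in private_tlbs[local_core]:                 # local private TLB hit
        return private_tlbs[local_core][vpn]
    for core in directories[local_core].get(vpn, []):   # directory lookup
        if vpn in private_tlbs[core]:                   # remote private TLB hit
            return private_tlbs[core][vpn]              # response via the NoC
    if vpn in shared_tlb:                               # shared TLB hit
        return shared_tlb[vpn]
    return page_table[vpn]                              # page-table walk
```

The point of the sketch is that the shared TLB is only consulted after both the local private TLB and the directory have failed, which is what reduces shared-TLB traffic.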
Optionally, referring to fig. 7, the method for optimizing address translation in a GPU provided by the present application further includes:
s701, when a target entry of the shared TLB needs to be placed into the private TLB to be filled, counting the number of entries shared and stored with the private TLB to be filled according to bitmap information of each entry recorded in a directory of the private TLB to be filled;
s702, judging whether the number of the shared and stored items reaches a preset threshold value, if so, filling the target items of the shared TLB into the private TLB to be filled, and updating the corresponding directory.
Specifically, when an entry of the shared TLB needs to be filled into a private TLB, the number of entries co-stored with each core is counted according to the directory of that private TLB; if a preset threshold is reached, the entry is loaded into the private TLB to be filled, and the new entry information is added to its directory. In this embodiment, the preset threshold is between 2 and 5, specifically 2, 3, 4, or 5.
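One possible reading of the fill policy in steps S701-S702 can be sketched as below. Representing the directory as a VPN-to-bitmap dict, and counting co-stored entries by testing the target core's bit, are interpretive assumptions; the patent leaves the exact counting rule open.

```python
FILL_THRESHOLD = 2  # the embodiment suggests a value between 2 and 5


def fill_from_shared(vpn, phys_addr, target_core, directory, private_tlb):
    """Fill a shared-TLB entry into the target private TLB only when the
    directory already records enough entries co-stored with that TLB;
    returns True if the fill happened."""
    co_stored = sum(1 for bitmap in directory.values()
                    if (bitmap >> target_core) & 1)
    if co_stored >= FILL_THRESHOLD:
        private_tlb[vpn] = phys_addr
        # update the directory: mark the target core as a sharer of this VPN
        directory[vpn] = directory.get(vpn, 0) | (1 << target_core)
        return True
    return False
```

The threshold keeps rarely-shared translations out of the private TLB, so fills only happen when the directory suggests the entry is likely to be reused.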
Optionally, referring to fig. 8, the method for optimizing address translation in a GPU provided by the present application further includes:
s801, screening a plurality of items to be deleted from a target private TLB based on a preset MRU strategy;
s802, obtaining bitmap information of the current item to be deleted by searching a directory of a target private TLB, and counting the number of cores shared and stored by physical addresses mapped corresponding to the current item to be deleted and taking the number as a sharing degree;
s803, judging whether the sharing degree of the current item to be deleted reaches a preset sharing threshold value;
s804, if not, taking the current item to be deleted as an evicted item, and deleting the evicted item from the target private TLB;
and S805, if yes, judging the next item to be deleted, and returning to execute the MRU strategy for circular eviction until an eviction item is not screened after all items to be deleted are judged.
The entry eviction policy of the private TLB in this embodiment builds on the original MRU eviction policy: before an entry is evicted, the directory is searched to check whether the entry's sharing degree reaches the sharing threshold, further optimizing the eviction policy. In this embodiment, the sharing threshold may be 2, 3, or 4.
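The sharing-aware eviction loop of steps S801-S805 can be sketched as follows. Representing the MRU-screened candidates as a pre-ordered list and the directory as a VPN-to-bitmap dict are assumptions made only for this sketch.

```python
SHARING_THRESHOLD = 2  # the embodiment suggests 2, 3 or 4


def pick_eviction_victim(candidates, directory):
    """Walk the MRU-ordered eviction candidates and evict the first entry
    whose sharing degree (popcount of its bitmap) is below the threshold;
    widely shared entries are retained, since evicting them would hurt
    directory-guided remote lookups."""
    for vpn in candidates:                  # ordered by the base MRU policy
        sharing_degree = bin(directory.get(vpn, 0)).count("1")
        if sharing_degree < SHARING_THRESHOLD:
            return vpn                      # this entry becomes the victim
    return None                             # every candidate is widely shared
```

Returning `None` corresponds to the case in S805 where all candidates have been judged without an eviction victim being selected.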
In this embodiment, a directory is provided for each private TLB, and the sharing condition among cores is recorded in the directory. When the local core private TLB misses, the corresponding directory is searched to determine the remote core private TLB that stores the physical address required by the address translation request, which effectively reduces the probability of accessing the shared TLB and reduces access latency. The private TLBs of all cores are interconnected through the on-chip interconnection network, and an address translation request that misses locally is sent over this network to the hitting remote core private TLB, whose response completes the address translation. In addition, based on the sharing degree recorded in the directory, the original MRU eviction policy is extended to take the sharing degree of address translation entries into account, further optimizing eviction. The method can therefore improve the overall hit rate of the private TLBs, achieve the goal of improving address translation performance, and ultimately improve GPU performance.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
The embodiment also provides an address translation optimization system in the GPU, which corresponds to the address translation optimization method in the GPU in the above embodiment one to one. As shown in fig. 9, the system for optimizing address translation in a GPU includes a directory building module 901 and an address translation module 902. The detailed description of each functional module is as follows:
a directory building module 901, configured to build a directory for each core of the GPU, where the directory records a sharing condition between private TLBs of each core of the GPU;
the address translation module 902 is configured to, if the current address translation request misses in the local core private TLB, determine the hitting remote core private TLB by looking up the directory of the local core private TLB, and obtain the physical address required by the current address translation request from the response of the hitting remote core private TLB, thereby completing the address translation.
For the specific limitations of the address translation optimization system in the GPU, reference may be made to the above limitations on the address translation optimization method in the GPU, which are not described herein again. The modules in the address translation optimization system in the GPU may be implemented in whole or in part by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
This embodiment also provides a computer device, which may be a server, and whose internal structure may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store the directory of each private TLB and the like. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for optimizing address translation in a GPU. In one embodiment, the processor, when executing the computer program, implements the following steps:
establishing a directory for each core of the GPU, wherein the directory records the sharing status among the private TLBs of the cores of the GPU;
if the current address translation request misses in the local core private TLB, determining the hit remote core private TLB by looking up the directory, and obtaining the physical address required by the current address translation request from the response of the hit remote core private TLB, thereby completing the address translation.
The present embodiments also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
establishing a directory for each core of the GPU, wherein the directory records the sharing status among the private TLBs of the cores of the GPU;
if the current address translation request misses in the local core private TLB, determining the hit remote core private TLB by looking up the directory, and obtaining the physical address required by the current address translation request from the response of the hit remote core private TLB, thereby completing the address translation.
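The steps above, together with the shared-TLB and page-table-walk fallbacks described elsewhere in the specification, can be sketched as a single lookup chain. The flat-dictionary data structures below are simplifying assumptions for illustration, not the patent's implementation:

```python
# Illustrative four-level lookup chain for one address translation request:
# local private TLB -> directory-selected remote private TLB -> shared TLB ->
# page table walk (modeled here as a final dictionary lookup).

def translate(vpage, local_tlb, directory, remote_tlbs, shared_tlb, page_table):
    if vpage in local_tlb:                    # level 1: local core private TLB
        return local_tlb[vpage]
    for core in directory.get(vpage, ()):     # level 2: directory of local TLB
        if vpage in remote_tlbs[core]:        # forward to the hit remote TLB
            return remote_tlbs[core][vpage]
    if vpage in shared_tlb:                   # level 3: shared TLB
        return shared_tlb[vpage]
    return page_table.get(vpage)              # level 4: page table walk
```

Each level is consulted only when every earlier level misses, which is the latency-ordering the method relies on: a directory hit avoids both the shared-TLB access and the page table walk.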
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above.
Any reference to memory, storage, database, or other medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above description covers only preferred embodiments of the present invention, but the scope of the present invention is not limited thereto; any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed by the present invention also fall within the scope of the present invention.

Claims (9)

1. A method for optimizing address translation in a GPU (Graphics Processing Unit), comprising:
establishing a directory for each core of the GPU, wherein the directory records the sharing status among the private TLBs of the cores of the GPU;
if the current address translation request misses in the local core private TLB, determining the hit remote core private TLB by looking up the directory of the local core private TLB, and obtaining the physical address required by the current address translation request from the response of the hit remote core private TLB to complete the address translation;
wherein establishing a directory for each core of the GPU specifically comprises: configuring a directory for the private TLB of each core, wherein the directory records mapping information of each entry in the corresponding private TLB and mapping information of each entry in the remote core private TLBs closely linked with the corresponding private TLB, the mapping information comprising the virtual address and storage location information of the entry.
2. The method of claim 1, wherein the storage location information comprises bitmap information and sharing information of the entry; the bitmap information comprises the location and number of the core in which the physical address mapped from the virtual address of the corresponding entry is stored, and the sharing information indicates whether the physical address mapped from the corresponding entry is stored in the shared TLB.
3. The method of optimizing address translation in a GPU of claim 2, wherein before completing the address translation according to the response of the remote core private TLB, the method further comprises:
interconnecting the private TLBs through an on-chip interconnection network;
wherein determining the hit remote core private TLB by looking up the directory of the local core private TLB, obtaining the physical address required by the current address translation request from the response of the hit remote core private TLB, and completing the address translation comprises:
the current address translation request comprising a target virtual address; searching the directory of the local core private TLB for an entry whose virtual address matches the target virtual address, and determining, according to the bitmap information recorded in the matched entry, the private TLB of the core in which the physical address information mapped from the target virtual address is stored, as the hit remote core private TLB;
sending the current address translation request to the hit remote core private TLB based on the on-chip interconnection network among the private TLBs;
and obtaining the response to the request, acquiring the physical address mapped from the target virtual address, and completing the address translation.
4. The method of address translation optimization in a GPU as claimed in claim 1, wherein the method further comprises:
if the current address translation request misses in the local core private TLB and no hit remote core private TLB can be found through the directory of the local core private TLB, sending the current address translation request to the shared TLB for lookup therein; and if it also misses in the shared TLB, sending the current address translation request to the page table walk buffer to look up the page table in the memory.
5. The method of address translation optimization in a GPU as claimed in claim 2, wherein the method further comprises:
when a target entry of the shared TLB needs to be filled into a private TLB, counting, according to the bitmap information of each entry recorded in the directory of the private TLB to be filled, the number of entries that are shared with and stored in the private TLB to be filled;
and judging whether the number of shared entries reaches a preset threshold, and if so, filling the target entry of the shared TLB into the private TLB to be filled and updating the corresponding directory.
6. The method of address translation optimization in a GPU as claimed in claim 1, wherein the method further comprises:
screening a plurality of entries to be deleted from a target private TLB based on a preset MRU policy;
obtaining the bitmap information of the current entry to be deleted by looking up the directory of the target private TLB, and counting the number of cores in which the physical address mapped from the current entry to be deleted is shared and stored, as its sharing degree;
judging whether the sharing degree of the current entry to be deleted reaches a preset sharing threshold;
if not, taking the current entry to be deleted as the eviction entry and deleting it from the target private TLB;
if so, judging the next entry to be deleted, and when no eviction entry has been screened out after all entries to be deleted have been judged, returning to execute the MRU policy for another round of eviction.
7. A system for optimizing address translation in a GPU, the system comprising:
the directory building module is used for building a directory for each core of the GPU, and the directory records the sharing condition between private TLBs of each core of the GPU;
the address translation module is configured to, if the current address translation request misses in the local core private TLB, determine the hit remote core private TLB by looking up the directory of the local core private TLB, obtain the physical address required by the current address translation request from the response of the hit remote core private TLB, and complete the address translation;
wherein establishing a directory for each core of the GPU specifically comprises: configuring a directory for the private TLB of each core, wherein the directory records mapping information of each entry in the corresponding private TLB and mapping information of each entry in the remote core private TLBs closely linked with the corresponding private TLB, the mapping information comprising the virtual address and storage location information of the entry.
8. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method for optimizing address translation in a GPU as claimed in any one of claims 1 to 6.
9. A computer-readable storage medium in which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method for optimizing address translation in a GPU as claimed in any one of claims 1 to 6.
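The fill policy of claim 5 and the eviction policy of claim 6 can be sketched as follows. The directory is modeled as a dictionary from virtual page to the set of core ids holding that entry; the two threshold values and the candidate order are illustrative assumptions, since the claims specify only the counting and threshold checks:

```python
# Illustrative sketch of the sharing-aware fill (claim 5) and eviction
# (claim 6) policies; thresholds are assumed values, not from the claims.

FILL_THRESHOLD = 2     # min. entries already co-resident before filling
SHARE_THRESHOLD = 2    # entries shared by this many cores or more are kept

def should_fill(directory, target_vpage, fill_core):
    """Fill a shared-TLB entry into a private TLB only if enough other
    entries are already shared with and stored in that private TLB."""
    shared = sum(1 for vpage, cores in directory.items()
                 if vpage != target_vpage and fill_core in cores)
    return shared >= FILL_THRESHOLD

def pick_eviction(candidates, directory):
    """Among MRU-selected candidates, evict the first entry whose sharing
    degree (number of cores holding it) is below the threshold; return None
    when every candidate is heavily shared, so the MRU pass repeats."""
    for vpage in candidates:
        if len(directory.get(vpage, ())) < SHARE_THRESHOLD:
            return vpage
    return None
```

The intent mirrored here is that entries with a high sharing degree are likely to serve other cores' directory lookups, so they survive eviction even when the MRU policy nominates them.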
CN202110785306.2A 2021-07-12 2021-07-12 Method, system, equipment and storage medium for optimizing address conversion in GPU Active CN113612863B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110785306.2A CN113612863B (en) 2021-07-12 2021-07-12 Method, system, equipment and storage medium for optimizing address conversion in GPU


Publications (2)

Publication Number Publication Date
CN113612863A CN113612863A (en) 2021-11-05
CN113612863B true CN113612863B (en) 2022-07-26

Family

ID=78304416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110785306.2A Active CN113612863B (en) 2021-07-12 2021-07-12 Method, system, equipment and storage medium for optimizing address conversion in GPU

Country Status (1)

Country Link
CN (1) CN113612863B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115774620B * 2022-12-23 2023-09-05 Moore Threads Intelligent Technology (Beijing) Co., Ltd. Method, device and computing equipment for realizing memory space mutual access in GPU interconnection architecture

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111142941A * 2019-11-27 2020-05-12 Hexin Interconnect Technology (Qingdao) Co., Ltd. Non-blocking cache miss processing method and device
CN112753024A * 2018-09-25 2021-05-04 ATI Technologies ULC External memory based translation look-aside buffer
CN112965921A * 2021-02-07 2021-06-15 National Defense Technology Innovation Institute, Academy of Military Sciences of the Chinese People's Liberation Army TLB management method and system in multitask GPU

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7073044B2 (en) * 2001-03-30 2006-07-04 Intel Corporation Method and apparatus for sharing TLB entries
US8285969B2 (en) * 2009-09-02 2012-10-09 International Business Machines Corporation Reducing broadcasts in multiprocessors
US9081706B2 (en) * 2012-05-10 2015-07-14 Oracle International Corporation Using a shared last-level TLB to reduce address-translation latency
US9411745B2 (en) * 2013-10-04 2016-08-09 Qualcomm Incorporated Multi-core heterogeneous system translation lookaside buffer coherency
US10282308B2 (en) * 2016-06-23 2019-05-07 Advanced Micro Devices, Inc. Method and apparatus for reducing TLB shootdown overheads in accelerator-based systems
US10417140B2 (en) * 2017-02-24 2019-09-17 Advanced Micro Devices, Inc. Streaming translation lookaside buffer
CN109032964A (en) * 2018-07-02 2018-12-18 京东方科技集团股份有限公司 Buffer replacing method and its device, heterogeneous multi-core system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes; Rachata Ausavarungnirun et al.; 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO); IEEE; 20190411; pp. 136-144 *
Research on virtual-to-physical address translation architecture for multi-GPU systems; Wei Jinhui, Li Chen, Lu Jianzhuang; Computer Engineering & Science; 20210215; vol. 43, no. 2, pp. 229-232 *


Similar Documents

Publication Publication Date Title
US20050080986A1 Priority-based flash memory control apparatus for XIP in serial flash memory, memory management method using the same, and flash memory chip thereof
RU2438165C2 (en) Apparatus and method for reducing displacements in multilevel cache hierarchy
US20020138648A1 (en) Hash compensation architecture and method for network address lookup
CN110008738B (en) Caching method, device, medium and computing equipment for block chain contract data
CN114860785B (en) Cache data processing system, method, computer device and storage medium
CN113934655B (en) Method and apparatus for solving ambiguity problem of cache memory address
CN113612863B (en) Method, system, equipment and storage medium for optimizing address conversion in GPU
CN114527938A (en) Data reading method, system, medium and device based on solid state disk
US11256630B2 (en) Cache address mapping method and related device
CN113157606A (en) Buffer implementation method and device and data processing equipment
CN109254930A (en) Data access method and device
CN103902469B (en) A kind of method and system of data pre-fetching
CN115344201A (en) Data storage method, data query method and device
WO2021008552A1 (en) Data reading method and apparatus, and computer-readable storage medium
CN115858409A (en) Data prefetching method, computing node and storage system
CN112748989A (en) Virtual machine memory management method, system, terminal and medium based on remote memory
CN115080459A (en) Cache management method and device and computer readable storage medium
CN108459970B (en) Method and device for inquiring cache information
WO2024082702A1 (en) Data processing method and apparatus, and chip and computer-readable storage medium
CN116010298B (en) NAND type flash memory address mapping method and device, electronic equipment and storage medium
CN116257521B (en) KV storage method based on FPGA
CN114238165B (en) Data processing method, data processing apparatus, and storage medium
CN116048428B (en) Data request processing method, device, storage equipment and readable storage medium
CN206991079U (en) A kind of MCU storage systems
CN112395244B (en) Access device and method for processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant