CN115422098A - GPU (Graphics Processing Unit) memory access adaptive optimization method and device based on an extended page table - Google Patents


Info

Publication number
CN115422098A
Authority
CN
China
Prior art keywords
gpu, address, memory, access, memory access
Prior art date
Legal status
Granted
Application number
CN202210792723.4A
Other languages
Chinese (zh)
Other versions
CN115422098B (en)
Inventor
李松林
孟平凡
时昊
刘杨
李然月
Current Assignee
Moore Threads Technology Co Ltd
Original Assignee
Moore Threads Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Moore Threads Technology Co Ltd filed Critical Moore Threads Technology Co Ltd
Priority to CN202210792723.4A priority Critical patent/CN115422098B/en
Publication of CN115422098A publication Critical patent/CN115422098A/en
Application granted granted Critical
Publication of CN115422098B publication Critical patent/CN115422098B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10 Address translation
    • G06F12/1009 Address translation using page tables, e.g. page table structures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a GPU memory access adaptive optimization method and device based on an extended page table. The method is applied to a GPU and comprises the following steps: a logic operation unit receives an instruction, carrying task type information, sent by a GPU application for completing a task, executes the corresponding operation, and initiates a virtual address request to the memory management unit when accessing memory; the memory management unit translates the virtual address into a physical address through the extended page table, sends the physical address to the address conversion unit, finds the task type identifier corresponding to the GPU application, and sends it to the mapping scheme configuration register; the mapping scheme configuration register finds the corresponding optimal memory access mapping mode and sends it to the address conversion unit; and the address conversion unit maps the physical address into a new address according to the optimal memory access mapping mode and sends the new address to the memory subsystem for access. The method and device ensure the correctness of memory access and achieve targeted memory access optimization per application at GPU run time.

Description

GPU (Graphics Processing Unit) memory access adaptive optimization method and device based on an extended page table
This application is a divisional application of Chinese patent application No. 202210135055.8, entitled "GPU memory access adaptive optimization method and device based on an extended page table", filed with the China Patent Office on February 15, 2022.
Technical Field
The invention relates to the technical field of memory access optimization, and in particular to a GPU memory access adaptive optimization method and device based on an extended page table.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Modern computer memory systems have two main optimization goals: reducing memory access latency and increasing memory access bandwidth. The optimization target of a GPU memory system emphasizes access bandwidth and tolerates access latency relatively well. For this reason, a modern GPU memory system is divided into multiple memory channels that can be accessed in parallel, and different memory addresses are mapped to different channels by an address-to-channel mapping rule to provide parallel access capability and increase GPU bandwidth, as shown in fig. 1. In a series of consecutive memory accesses, the upper bits of the addresses generally change infrequently while the lower bits change frequently, so the lower address bits are usually used to select the memory channel. In practice, however, the address sequence of a specific GPU application may concentrate accesses on the same memory channel and cause channel access conflicts. As shown in fig. 2, when the memory has 8 channels and the access stride is an integer multiple of 8, all requests land on one channel while the other 7 channels sit idle, so the effective bandwidth is 1/8 of the maximum bandwidth and the actual memory access bandwidth drops sharply.
To mitigate the channel access conflicts that arise when addresses map to channels in a regular pattern, several bits (channel bits) are selected to determine the memory channel, and other bits of the memory address are XORed with the channel bits to obtain a new value used to determine the channel. The XOR operation gives the memory channel mapped by each access request a pseudo-random character, so better channel load balance is achieved in a statistical sense.
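A minimal sketch (ours, not the patent's hardware) contrasting plain low-bit channel selection with XOR folding for the stride-8 pattern of fig. 2; the bit choices are illustrative:

```python
NUM_CHANNELS = 8  # 3 channel bits

def channel_plain(addr: int) -> int:
    # use address bits [2:0] directly as the channel index
    return addr & (NUM_CHANNELS - 1)

def channel_xor(addr: int) -> int:
    # fold the next 3 higher address bits into the channel bits by XOR,
    # giving the mapping a pseudo-random character
    return (addr & (NUM_CHANNELS - 1)) ^ ((addr >> 3) & (NUM_CHANNELS - 1))

requests = [0, 8, 16, 24, 32, 40, 48]  # the stride-8 addresses from fig. 2
print([channel_plain(a) for a in requests])  # [0, 0, 0, 0, 0, 0, 0]: one busy channel
print([channel_xor(a) for a in requests])    # [0, 1, 2, 3, 4, 5, 6]: spread out
```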
The memory address interleaving (pseudo-random interleaving) method is described in detail below.
Table 1 and Table 2 show the access conflict situation under a specific memory arrangement. Each column in the table corresponds to one memory channel; the columns can be read and written in parallel, and reading or writing one unit in a column takes one clock cycle. Table 1 shows the situation without pseudo-random interleaving: the access requests [0, 8, 16, 24, 32, 40, 48] shown in fig. 2 all land in channel 1, so 7 read/write cycles are needed to complete the accesses. Table 2 shows the requests scattered to different channels by pseudo-random interleaving, where they can be read and written in parallel. The busiest column is the 4th, which must complete three operations (40, 48, 8), so three read/write cycles suffice to complete all accesses.
Table 1 shows the memory arrangement without pseudo-random interleaving
0 1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63
Table 2 shows the memory allocation situation where the requests are spread to different channels by pseudo-random interleaving
43 37 9 34 44 28 23 30
38 55 53 40 24 21 63 15
1 36 16 49 5 32 60 19
57 59 56 48 3 0 12 13
35 6 27 8 29 25 50 2
51 41 7 46 14 62 47 45
10 33 52 58 39 22 11 31
42 61 17 4 20 18 54 26
The main current method for resolving memory channel access conflicts is to map the GPU's original access pattern, via the common memory address interleaving (pseudo-random interleaving) described above, into an access pattern spread more uniformly across the memory channels, improving channel utilization and hence the throughput of the GPU memory system.
However, as the task types executed by modern GPUs grow in number and complexity, a fixed memory address interleaving mode cannot satisfy every task type. Pseudo-random interleaving keeps the memory access requests over a period of time reasonably balanced across channels in most cases, but different interleaving modes map the memory space differently: if multiple interleaving modes coexist, their mapped spaces overlap and memory access errors result. Moreover, once the GPU has started, the interleaving mode cannot be modified.
Disclosure of Invention
The embodiment of the invention provides a GPU memory access adaptive optimization method based on an extended page table, which addresses the technical problem that the prior art cannot be extended to a scheme that adaptively selects mappings for multiple applications, because memory address overlap causes errors when multiple mappings coexist or when different mapping modes are selected for different applications. The method is applied to a GPU and comprises the following steps:
a logic operation unit receives an instruction, carrying task type information, sent by a GPU application for completing a task, executes the corresponding operation according to the instruction, and initiates a virtual address request to a memory management unit when accessing a memory;
the memory management unit converts a virtual address into a physical address through an extended page table according to the virtual address request, sends the physical address to an address conversion unit, finds the task type identifier corresponding to the GPU application based on task type information, and sends the task type identifier to a mapping scheme configuration register, wherein the task type identifier is marked in the extended page table of the memory management unit;
the mapping scheme configuration register finds the corresponding optimal memory access mapping mode according to the task type identifier and sends the optimal memory access mapping mode to an address conversion unit;
and the address conversion unit maps the physical address into a new mapped address according to the optimal memory access mapping mode, sends the new mapped address to the memory subsystem, and accesses the memory subsystem based on the new mapped address.
The embodiment of the invention also provides a GPU memory access adaptive optimization device based on the extended page table, which addresses the technical problem that the prior art cannot be extended to a scheme that adaptively selects mappings for multiple applications, because memory address overlap causes errors when multiple mappings coexist or when different mapping modes are selected for different applications. The device is applied to a GPU and comprises:
the logic operation unit is used for receiving an instruction, carrying task type information, sent by the GPU application for completing a task, executing the corresponding operation according to the instruction, and initiating a virtual address request to the memory management unit when accessing the memory;
the memory management unit is used for converting a virtual address into a physical address through an extended page table according to the virtual address request, sending the physical address to the address conversion unit, finding the task type identifier corresponding to the GPU application based on task type information, and sending the task type identifier to a mapping scheme configuration register, wherein the task type identifier is marked in the extended page table of the memory management unit;
the mapping scheme configuration register is used for finding the corresponding optimal memory access mapping mode according to the task type identifier and sending the optimal memory access mapping mode to the address conversion unit;
and the address conversion unit is used for mapping the physical address into a new mapped address according to the optimal memory access mapping mode, sending the new mapped address to the memory subsystem, and accessing the memory subsystem based on the new mapped address.
The embodiment of the invention also provides a computer device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the above GPU memory access adaptive optimization method based on the extended page table.
The embodiment of the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above GPU memory access adaptive optimization method based on the extended page table.
The embodiment of the invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the above GPU memory access adaptive optimization method based on the extended page table.
In the embodiment of the invention, in contrast to the prior art, in which memory address overlap causes errors when multiple mappings coexist or when different mapping modes are selected for different applications, so that the prior art cannot be extended to adaptive per-application selection: the logic operation unit receives an instruction, carrying task type information, sent by a GPU application for completing a task, executes the corresponding operation according to the instruction, and initiates a virtual address request to the memory management unit when accessing memory; the memory management unit translates the virtual address into a physical address through the extended page table according to the virtual address request, sends the physical address to the address conversion unit, finds the task type identifier corresponding to the GPU application based on the task type information, and sends the identifier to the mapping scheme configuration register, the identifier being marked in the extended page table of the memory management unit; the mapping scheme configuration register finds the corresponding optimal memory access mapping mode according to the task type identifier and sends it to the address conversion unit; and the address conversion unit maps the physical address into a new mapped address according to the optimal memory access mapping mode, sends it to the memory subsystem, and the memory subsystem is accessed based on the new mapped address. By extending the page table content and adding the task type mark, the invention ensures that each application obtains a customized memory access mapping mode, guaranteeing memory access correctness while achieving targeted memory access optimization per application at GPU run time.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from them without creative effort. In the drawings:
FIG. 1 is a schematic diagram of a memory channel;
FIG. 2 is a diagram illustrating a memory channel with channel access conflicts;
FIG. 3 is a first schematic diagram illustrating a GPU access adaptive optimization method based on an extended page table according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a first operation flow of selecting an optimal memory access mapping mode according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a second operation flow of selecting an optimal memory access mapping mode according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the operation flow of determining the information entropy corresponding to differently mapped GPU access requests according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a second method for adaptive optimization of GPU memory access based on an extended page table according to an embodiment of the present invention;
FIG. 8 is a third schematic diagram of a GPU access adaptive optimization method based on an extended page table according to an embodiment of the present invention;
FIG. 9 is a first schematic structural diagram of a device for adaptive optimization of memory access of a GPU based on an extended page table according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a GPU access adaptive optimization apparatus based on an extended page table according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
Interpretation of terms:
memory address: the memory management unit is generally responsible for the conversion of virtual addresses to physical addresses.
According to the technical scheme, the data acquisition, storage, use, processing and the like meet relevant regulations of national laws and regulations.
To address the limitations of a single address mapping scheme and the inability of different address mapping schemes to coexist in the prior art, the invention provides a GPU memory access adaptive optimization method based on an extended page table. It achieves targeted memory access optimization per application at GPU run time and a software-configurable mapping mode while ensuring memory access correctness. It also adds support for the coexistence of different address mapping schemes and provides adaptive selection of the memory access optimization method.
In the embodiment of the present invention, fig. 3 is a schematic diagram of a GPU memory access adaptive optimization method based on an extended page table. As shown in fig. 3, the method is applied to a GPU and includes:
Step 301: a logic operation unit receives an instruction, carrying task type information, sent by a GPU application for completing a task, executes the corresponding operation according to the instruction, and initiates a virtual address request to a memory management unit when accessing a memory;
step 302: the memory management unit converts a virtual address into a physical address through an extended page table according to the virtual address request, sends the physical address to an address conversion unit, finds the task type identifier corresponding to the GPU application based on task type information, and sends the task type identifier to a mapping scheme configuration register, wherein the task type identifier is marked in the extended page table of the memory management unit;
step 303: the mapping scheme configuration register finds the corresponding optimal memory access mapping mode according to the task type identifier and sends the optimal memory access mapping mode to an address conversion unit;
step 304: and the address conversion unit maps the physical address into a new mapped address according to the optimal memory access mapping mode, sends the new mapped address to the memory subsystem, and accesses the memory subsystem based on the new mapped address.
Specifically, regarding the page table (containing all page table entries): in order to run multiple programs on the same GPU without memory address conflicts, GPU software uniformly uses virtual addresses starting from 0 when accessing memory and translates them into actual physical addresses through a page table. Specifically, physical memory is divided into pages in units of 4 KB, and a 4 KB page at a virtual address is mapped by the page table to a 4 KB address space at a physical address. In addition, as the memory space of modern computers grows, a page table in 4 KB units produces an excessive number of page table entries (an entry is one record holding the information of one page: physical address, task type, and other required attributes), so pages divided in larger units, i.e., large pages, also exist. Large pages likewise need a corresponding page table, mapping large pages of virtual addresses to large pages of physical addresses.
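As a rough illustration (an assumption about layout, not the patent's exact entry format), an extended page table entry can be modeled as the usual translation fields plus the task type identifier:

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # 4 KB pages, as in the description above

@dataclass
class ExtendedPTE:
    ppn: int        # physical page number
    task_type: int  # task type identifier added by the page table extension

page_table: dict[int, ExtendedPTE] = {}  # virtual page number -> entry

def translate(vaddr: int) -> tuple[int, int]:
    """Translate a virtual address and also return the page's task type."""
    pte = page_table[vaddr // PAGE_SIZE]
    paddr = pte.ppn * PAGE_SIZE + vaddr % PAGE_SIZE
    return paddr, pte.task_type
```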
Specifically, when the GPU application starts, it sends the instructions required to complete its task to the GPU, carrying the task type information. The GPU Core (i.e., the logic operation unit) executes operations according to the instructions and initiates a virtual address request to the memory management unit when it needs to access memory. The memory management unit translates the virtual address into a physical address, finds the task type identifier corresponding to the memory access request (stored in the corresponding page table entry of the memory management unit) according to the task type information, and passes the identifier to the mapping scheme configuration register (an on-chip register of the GPU), which locates the corresponding address mapping scheme and passes it to the address conversion module. After receiving the physical address and the mapping scheme, the address conversion module maps the physical address into a new mapped address according to the rules, sends it to the memory subsystem, and the memory subsystem is then accessed based on the new mapped address.
When a GPU program starts, a task type identifier is assigned to it according to its application type (for example, type value 0 for scientific computing applications, type value 1 for graphics rendering applications, and so on; a type value 2 may also be assigned to another graphics rendering application whose access pattern differs markedly from the former), and the identifier is recorded in the page table entries assigned to the program. When the program initiates a memory access request, the virtual address is translated into a physical address through the page table entry, the task type value is taken out and passed into the address pseudo-random interleaving module, and the application's original memory access pattern is mapped into the optimal memory access pattern optimized for that application.
Specifically, the memory access mapping mode corresponds to a memory address mapping rule. For example, accesses inside a cache use set-and-way addressing, and the set and way are likewise computed from the actual memory address.
In the embodiment of the present invention, as shown in fig. 4, the optimal access mapping manner corresponding to the GPU application is determined as follows:
step 401: acquiring a GPU access request of GPU application;
step 402: performing different memory address mapping on the GPU access request by using different memory access mapping modes to obtain GPU access requests after different mappings;
step 403: determining information entropies corresponding to the GPU access requests after different mappings are performed;
step 404: and taking the access mapping mode corresponding to the maximum information entropy as the optimal access mapping mode of the GPU application.
Specifically, the memory access mapping modes include, but are not limited to, mappings based on an XOR control vector; another option is to multiply the address by a prime number and take it modulo the channel count, with different primes giving different mappings.
For example, for an application whose original accesses concentrate on channels 1 and 2, part of the requests can be spread to channels 3 and 4 by perturbing the high-order bits used to select the channel. Likewise, if accesses concentrate on channels 1 and 3, the low-order bit can be perturbed so that part of the requests falls on channels 2 and 4.
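A minimal sketch of steps 401 to 404 (all names are ours, not the patent's): replay a captured access trace under every candidate XOR control vector and keep the vector whose mapped trace has the highest channel-distribution entropy. The `entropy_fn` argument scores a list of (time, channel) pairs, for instance the window-averaged entropy sketched further below:

```python
def apply_xor_vector(vec: int, addr: int, num_channels: int = 8) -> int:
    """XOR each upper 3-bit group enabled by `vec` into the channel bits."""
    ch = addr & (num_channels - 1)
    upper, group = addr >> 3, 0
    while upper:
        if (vec >> group) & 1:
            ch ^= upper & (num_channels - 1)
        upper >>= 3
        group += 1
    return ch

def best_mapping(trace, candidate_vectors, entropy_fn, num_channels=8):
    """trace: list of (time, address) pairs; returns the best control vector."""
    return max(
        candidate_vectors,
        key=lambda v: entropy_fn(
            [(t, apply_xor_vector(v, a, num_channels)) for t, a in trace]
        ),
    )
```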
Specifically, as shown in fig. 5, step 403 of determining the information entropies corresponding to the differently mapped GPU access requests includes:
Step 501: determining a time period according to the length of the maximum waiting queue of the GPU;
Step 502: determining a plurality of statistical time periods based on the time period;
Step 503: determining the distribution information entropy of the GPU access requests over the different access channels in each statistical time period;
Step 504: determining the average of the distribution information entropies over the plurality of statistical time periods according to the distribution information entropy in each statistical time period, and taking this average as the information entropy corresponding to the mapped GPU access requests.
Specifically, as shown in fig. 6, a corresponding optimization method must be found for each GPU application. For an actual application A (or a group of applications of a particular type), the application is run on a GPU simulator (development phase) or GPU chip (subsequent optimization) and the GPU's memory access sequence (access request records) is captured. Then, for all candidate memory access mapping modes (embodied as different XOR control vectors), the uniformity of the distribution over the channels of the memory access requests mapped by each mode is evaluated based on information entropy, and the mapping with the most uniform distribution (i.e., the highest information entropy) is selected, giving the optimal memory access mapping mode for application A (or the group of applications of that type).
The information entropy is described below.
The information entropy E is a quantity for measuring information carried by a group of messages, and is expressed by the following formula:
E = sum(-p_i × log2(p_i)).
That is, E is the sum over every possible message i of -p_i × log2(p_i), where p_i is the probability of message i.
For example, for a 1-bit transmission, if the transmitted data is 0 or 1 each with probability 1/2, then E = 1/2 + 1/2 = 1; each transmission carries 1 bit of information. Conversely, if the transmitted data is always 0 or always 1, then E = 0, i.e., the transmission carries no information. In general, the more uncertain the transmitted data, the more information a set of received deterministic results can carry.
The following describes the access distribution uniformity evaluation based on the information entropy.
Each recorded memory access request is (t, a), where t is the time the request was initiated and a is its address (only the part used to select the channel is kept). A time period T is determined from the length of the GPU's maximum waiting queue (an internal GPU configuration: the GPU allows several memory requests to be issued back to back before any is answered, but their number is limited). The distribution of GPU memory access requests over the different channels is then counted per period, over the intervals 0 to 1T, 1T to 2T, ..., (N-1)T to NT, the entropy of each distribution is computed, and finally the information entropies of all periods in the access sequence are averaged as the uniformity evaluation of that sequence. Here NT, the total length of time considered, is not directly tied to the actual application run time but cannot exceed it.
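A minimal sketch of this window-averaged uniformity measure, assuming the trace has already been reduced to (time, channel) pairs (names are ours):

```python
import math
from collections import Counter

def entropy_of_trace(trace, T=1000):
    """trace: list of (time, channel) pairs; T: period from the max queue length."""
    windows = {}
    for t, ch in trace:
        windows.setdefault(t // T, []).append(ch)  # bucket requests per period
    entropies = []
    for channels in windows.values():
        n = len(channels)
        # entropy of the per-channel request distribution inside this window
        e = -sum((c / n) * math.log2(c / n) for c in Counter(channels).values())
        entropies.append(e)
    return sum(entropies) / len(entropies) if entropies else 0.0
```

A function of this shape is what the selection sketch above would consume as `entropy_fn`.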
For each mapping mode f, the sequence of original requests (t, a) is mapped to a new sequence of (t, a_f). The quality of a memory access mapping mode is evaluated by computing the information entropy of the new sequence it produces (the higher the entropy of the mapped sequence, the better the mapping mode).
For the information entropy of a memory address sequence as used in the invention, the sequence formed by one memory address bit over a fixed time window (e.g., bit0 across successive requests: bit0_t0, bit0_t1, bit0_t2, ...; a memory address usually comprises many bits) is collected, the probabilities of that bit being 0 and 1 within the window are computed, and the above formula is applied to obtain the information entropy of the bit sequence.
For example, if 4 memory addresses are recorded within a time window (0000, 0001, 0011, 0111), the sequence of bit0 is (0,1,1,1), the sequence of bit1 is (0,0,1,1), the sequence of bit2 is (0,0,0,1), and the sequence of bit3 is (0,0,0,0). The information entropy of bit0 is therefore E_0 = -p_0 × log2(p_0) - p_1 × log2(p_1) = -(0.25 × log2(0.25) + 0.75 × log2(0.75)) = 0.811, and similarly E_1 = 1, E_2 = 0.811, E_3 = 0.
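A few lines of Python reproduce the worked example (illustrative only):

```python
import math

def bit_entropy(bits):
    p1 = sum(bits) / len(bits)
    return -sum(p * math.log2(p) for p in (p1, 1 - p1) if p > 0)

addrs = [0b0000, 0b0001, 0b0011, 0b0111]
for i in range(4):
    seq = [(a >> i) & 1 for a in addrs]  # sequence of bit i across the addresses
    print(f"bit{i}: {seq} -> E = {bit_entropy(seq):.3f}")
# bit0: [0, 1, 1, 1] -> E = 0.811    bit1: [0, 0, 1, 1] -> E = 1.000
# bit2: [0, 0, 0, 1] -> E = 0.811    bit3: [0, 0, 0, 0] -> E = 0.000
```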
In the embodiment of the present invention, as shown in fig. 7, the method further includes:
step 701: and the mapping scheme configuration register is configured with an optimal access mapping mode corresponding to the GPU application.
Specifically, when the mapping scheme configuration is updated or a particular user needs customized optimization, GPU optimization software (the software path) can write the obtained optimal mapping scheme into the on-chip register while the user is using the GPU; that is, the value of the mapping scheme configuration register (i.e., the XOR control vector) is modified, achieving dynamic update and customization of the GPU mapping scheme.
Through a software-accessible register configuration channel, the invention enables the memory mapping scheme to be updated by software after the product is released.
Specifically, when the mapping scheme configuration is updated or a particular user needs customized optimization, the optimal memory access mapping mode corresponding to the GPU application can also be written dynamically into the mapping scheme configuration register by the GPU driver or firmware (the hardware path; whether it lives in the driver or the firmware depends on the specific GPU implementation), achieving dynamic update and customization of the GPU mapping scheme.
Example 1 configuration method of a single pseudo-random mapping scheme.
When configuring a specific pseudo-random mapping scheme, the corresponding XOR control vector a (determined by the optimal memory access mapping mode) is written into the on-chip mapping scheme configuration register through firmware or the driver; the address conversion module then converts an original physical address into the mapped address actually used to access the memory subsystem, according to the XOR control vector written into that register. The specific address translation method using the control vector is as follows:
the definition is for a memory address B, which is composed of a plurality of bits, and the plurality of bits constituting the address are defined as B _1, B _2, … B _ n from low to high respectively.
Pseudo-random mapping XORs a number of selected bits b_i of the memory address into the bit b_c (the channel bit) used for selecting the memory channel, obtaining a new channel bit b_c' for selecting the memory channel:
b_c′=b_c^b_1^b_2^b_3…;
however, the location of b _ i in the memory address is different for different applications. For example, for one mapping scheme, there may be
b_c′=b_c^b_1^b_4^b_6…;
For another scheme, it may be
b_c′=b_c^b_2^b_3^b_7…;
To enable dynamic configuration, the bits participating in the XOR are controlled by an XOR control vector a, whose length equals the total number of bits that may participate in the XOR. During pseudo-random mapping, the vector a is first ANDed with all the bits that may participate in the XOR, and the result is then XORed with the channel bit (b_c). Thus, when an element a_i = 0 in a, the corresponding a_i & b_i is also 0, the XOR result is unchanged, and the pseudo-random interleaving result is unaffected. The set of bits that may participate in the XOR can be chosen arbitrarily: by random assignment, by trying all possible bits to find the optimal combination, or by selecting bits that actually vary in the real access pattern (for example, in some applications' access patterns certain bits never change, so their participation in the XOR has no effect).
b_c' = b_c ^ (b_1 & a_1) ^ (b_2 & a_2) ^ ...
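A minimal sketch of this translation step (bit positions and names are our assumptions; a real design would handle several channel bits):

```python
def remap_channel_bit(addr: int, channel_bit_pos: int,
                      candidate_bit_positions: list[int],
                      control_vector: int) -> int:
    """Return addr with its channel bit replaced by the XOR-folded bit
    b_c' = b_c ^ (b_1 & a_1) ^ (b_2 & a_2) ^ ..."""
    b_c = (addr >> channel_bit_pos) & 1
    for i, pos in enumerate(candidate_bit_positions):
        a_i = (control_vector >> i) & 1  # enable bit from the control vector
        b_i = (addr >> pos) & 1          # candidate address bit
        b_c ^= b_i & a_i                 # a_i = 0 leaves the result unchanged
    # write the new channel bit back into the address
    return (addr & ~(1 << channel_bit_pos)) | (b_c << channel_bit_pos)
```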
Example 2 configuration method of multiple pseudo-random mapping schemes.
For multiple pseudo-random mapping schemes, the GPU register that stores the XOR control vector is extended to hold N groups of XOR control vectors (each corresponding to an optimal memory access mapping mode), and a task type identifier is added to the page table entry of each large page. Each page table entry finds its corresponding XOR control vector in the register according to the task type identifier recorded in the page table (the correspondence between identifiers and vectors is stored in the register) and performs address pseudo-random interleaving. Because paging isolates the memory spaces of different pages, each page can freely apply a memory address mapping scheme different from other pages within its own memory space, without mutual interference. As shown in Table 3, the three fonts correspond to three different mapping schemes: bold is no mapping, the standard font is pure random mapping, and italic is column-exchange mapping. Page 1 executes task A and selects mapping mode 1; pages 2 and 4 execute task B (or tasks B and D of the same type) and select mapping mode 2; page 3 executes task C and selects mapping mode 3. Each obtains a dedicated optimization for the type of task it executes. Meanwhile, random interleaving is performed only within the basic block corresponding to each large page, so different mapping modes can never produce overlapping storage across memory blocks, guaranteeing data safety.
TABLE 3 different memory address mapping schemes
[Table 3 is reproduced as an image in the original publication; the three mapping schemes are distinguished by font as described above.]
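A sketch of how the per-page selection could look, reusing `remap_channel_bit` and `ExtendedPTE` from the sketches above (register contents and bit positions are illustrative assumptions):

```python
xor_vector_registers = {0: 0b0000, 1: 0b1011, 2: 0b0110}  # task type -> vector

def remap_for_page(paddr: int, pte: "ExtendedPTE") -> int:
    vec = xor_vector_registers[pte.task_type]
    # interleave only within the page's basic block (offset bits of a 4 KB
    # page), so different schemes never overlap across memory blocks
    return remap_channel_bit(paddr, channel_bit_pos=2,
                             candidate_bit_positions=[6, 7, 8, 9],
                             control_vector=vec)
```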
In the embodiment of the present invention, as shown in fig. 8, the method further includes:
step 801: and the on-chip performance register acquires a memory access record of the GPU application, and updates the task type identifier of the GPU application according to the memory access record of the GPU application.
Specifically, an application selects its XOR control vector via the task type identifier assigned in the page table. Common tasks are optimized in a targeted way: the driver can assign them pre-optimized task type identifiers, which then select the corresponding XOR control vectors and deliver the performance optimization. For uncommon task types, or tasks that have not been specifically optimized, the method can also adaptively select a more reasonable memory mapping scheme at the next start (embodied as selecting a more appropriate task type identifier) based on the feedback collected by the on-chip performance counters and the application's demands and performance; that is, the method also supports dynamically learning and adjusting the memory mapping mode from the chip's real-time workload as observed by the on-chip performance counters. Concretely, this function uses the performance counters already present in the GPU chip (statistics on bandwidth, latency, power consumption, etc.) to analyze the application's actual bandwidth demand dynamically and records it in an on-chip non-volatile memory unit or in the computer's hard disk storage; at the next access, the driver reads this information back and selects an appropriate mapping scheme for the application (embodied as assigning an appropriate task type identifier) according to the record.
This differs from the configuration method for multiple pseudo-random mapping schemes, which assigns a type identifier to each application manually; GPU applications are numerous and cannot be optimized one by one by hand, so this method can adjust automatically to an application's scenario. For example, for a compute-intensive application whose memory access requests the on-chip performance counters find to be sparse, one may choose no mapping, or a mapping onto fewer channels, reducing the number of active channels and the overall power consumption. As another example, some applications have already homogenized their channel accesses in advance; applying pseudo-random mapping on top would destroy the carefully optimized access pattern and degrade performance. In that case, by measuring the balance of the GPU's original memory access requests, a no-mapping scheme is selected, preserving the benefit of the original application-level optimization. These two examples are merely specific applications; practical uses include, but are not limited to, these two cases.
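A sketch of that adaptive decision; the counter names and thresholds are invented for illustration, not taken from the patent:

```python
def choose_task_type(counters: dict) -> int:
    # sparse traffic: skip mapping (or map to fewer channels) to cut power
    if counters["requests_per_cycle"] < 0.1:
        return 0  # identifier bound to the "no mapping" control vector
    # traffic already uniform (e.g. an application pre-homogenized by hand):
    # leave it alone so the original optimization is not destroyed
    if counters["channel_entropy"] > 2.9:  # near log2(8) = 3 for 8 channels
        return 0
    return 1  # otherwise use the pseudo-random interleaving scheme
```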
In addition, a switch for pseudo-random interleaving and for dynamic selection of the mapping scheme can be implemented through a configuration register, enabling even more flexible GPU memory-channel mapping.
The embodiment of the invention also provides a GPU memory access adaptive optimization device based on the extended page table, described in the following embodiments. Because the principle by which the device solves the problem is similar to that of the GPU memory access adaptive optimization method based on the extended page table, the implementation of the device may refer to the implementation of the method; repeated parts are not described again.
Fig. 9 is a schematic structural diagram of a GPU access adaptive optimization device based on an extended page table in an embodiment of the present invention, as shown in fig. 9, the device is applied to a GPU, and includes:
the device comprises a logic operation unit (namely GPU Core) and a memory management unit, wherein the logic operation unit is used for receiving an instruction, carrying task type information, sent by the GPU application for completing a task, executing the corresponding operation according to the instruction, and initiating a virtual address request to the memory management unit when accessing the memory;
the memory management unit is used for converting a virtual address into a physical address through an extended page table according to the virtual address request, sending the physical address to the address conversion unit, finding the task type identifier corresponding to the GPU application based on task type information, and sending the task type identifier to a mapping scheme configuration register, wherein the task type identifier is marked in the extended page table of the memory management unit;
the mapping scheme configuration register is used for finding the corresponding optimal memory access mapping mode according to the task type identifier and sending the optimal memory access mapping mode to the address conversion unit;
and the address conversion unit is used for mapping the physical address into a new mapped address according to the optimal memory access mapping mode, sending the new mapped address to the memory subsystem, and accessing the memory subsystem based on the new mapped address.
In an embodiment of the present invention, the mapping scheme configuration register is further configured to: and configuring the optimal access mapping mode corresponding to the GPU application.
In the embodiment of the present invention, the optimal access mapping mode corresponding to the GPU application is determined as follows:
acquiring a GPU access request of GPU application;
carrying out different memory address mapping on the GPU access request by using different memory access mapping modes to obtain GPU access requests after different mappings;
determining information entropies corresponding to the GPU access requests after different mappings are performed;
and taking the access mapping mode corresponding to the maximum information entropy as the optimal access mapping mode of the GPU application.
In the embodiment of the present invention, determining information entropies corresponding to GPU access requests after different mappings includes:
determining a time period according to the length of the GPU maximum waiting queue;
determining a plurality of statistical time periods based on the time period;
determining distribution information entropies of GPU access requests in each statistical time period on different access channels;
and determining the average of the distribution information entropies over the plurality of statistical time periods according to the distribution information entropy in each statistical time period, and taking this average as the information entropy corresponding to the mapped GPU access requests.
In the embodiment of the invention, the optimal access mapping mode corresponding to the GPU application is configured as follows:
writing the XOR control vector corresponding to the optimal memory access mapping mode into a mapping scheme configuration register, wherein the length of the XOR control vector is the same as the total number of bits selected for XOR in the memory address bits.
In the embodiment of the invention, the optimal access mapping mode corresponding to the GPU application is configured according to the following modes:
and marking the task type identifier corresponding to the GPU application in a corresponding page table of a memory management unit while writing the XOR control vector corresponding to the optimal memory access mapping mode into a mapping scheme configuration register, wherein the task type identifier and the XOR control vector have a corresponding relation.
In the embodiment of the present invention, as shown in fig. 9, the method further includes:
and the on-chip performance register is used for acquiring the memory access record of the GPU application and updating the task type identifier of the GPU application according to the memory access record of the GPU application.
In the embodiment of the invention, in contrast to the prior-art situation in which memory address overlap causes errors when multiple mappings coexist or when different mapping modes are selected for different applications, so that the prior art cannot be extended to adaptive per-application selection, the device receives, through the logic operation unit, an instruction, carrying task type information, sent by a GPU application for completing a task, executes the corresponding operation according to the instruction, and initiates a virtual address request to the memory management unit when accessing memory; the memory management unit translates the virtual address into a physical address through the extended page table according to the virtual address request, sends the physical address to the address conversion unit, finds the task type identifier corresponding to the GPU application based on the task type information, and sends the identifier to the mapping scheme configuration register, the identifier being marked in the extended page table of the memory management unit; the mapping scheme configuration register finds the corresponding optimal memory access mapping mode according to the task type identifier and sends it to the address conversion unit; and the address conversion unit maps the physical address into a new mapped address according to the optimal memory access mapping mode, sends it to the memory subsystem, and the memory subsystem is accessed based on the new mapped address. The invention can thereby obtain the following beneficial effects:
1. By capturing the GPU memory access sequence and computing the information entropy under different mapping schemes, the merit of a memory address mapping scheme is evaluated quantitatively. An optimal mapping mode can thus be found without full-system simulation, and a customized pseudo-random interleaving scheme can be provided for each GPU application, reducing design space exploration complexity and improving GPU performance on specific tasks in a targeted way while preserving the GPU's ability to execute diverse tasks.
2. By utilizing the XOR control vector and the on-chip configurable register, the dynamic configuration of the memory address mapping scheme is realized under the condition of keeping the hardware resources unchanged, and the flexibility of the GPU in different application scenes is improved.
3. The register configuration channel accessible by software realizes the updating of the memory mapping scheme through the software after the product is released under the condition of keeping the hardware resources unchanged, and improves the flexibility of the GPU in different application scenes.
4. Statistics from the on-chip performance counters are supported; after software collects them, an appropriate task type identifier is automatically assigned to the application, achieving adaptive optimization of the memory mapping scheme.
5. The selectable memory access mapping mode is realized through an on-chip address conversion unit and a register supporting the storage of various GPU memory access mapping schemes.
6. By expanding the content of the page table entry and adding the task type mark, each application can obtain a customized memory mapping scheme, and the targeted memory access optimization of the application is realized when the GPU runs;
7. by utilizing the page table to divide the memory space, the conflict problem when various memory access mapping schemes coexist is solved, and the memory access accuracy is ensured.
The method is highly general and can be applied to any storage structure with parallel access capability in the GPU memory hierarchy (such as an on-chip cache). In fact, although the above scheme maps physical memory addresses, the same mapping can be applied to the virtual address space, so a flexible address mapping scheme can also be implemented from software. The method can be applied to any module in the GPU memory hierarchy that gains performance when a series of memory access requests is spread evenly over multiple independent units: in a cache, spreading requests across different sets improves cache utilization; in DRAM, spreading requests across different banks increases overall access bandwidth; and so on.
Specifically, as shown in fig. 10, before the memory management unit maps the virtual address to the physical address, the process ID (a unique number the GPU assigns internally to each process) and the task type can be configured through registers or similar means, according to methods including but not limited to the process ID or a virtual address range (for example, one scheme applied to 0-4 GB and another to 4-8 GB), and the task type mark bound to the process ID is used to apply different address translation policies in a targeted way; such variations under a specific design belong to modifications of the present invention and are likewise protected by it.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (20)

1. A GPU memory access adaptive optimization method is applied to a GPU and is characterized by comprising the following steps:
a logic operation unit receives an instruction, carrying task type information, sent by a GPU application for completing a task, executes the corresponding operation according to the instruction, and initiates a virtual address request to a memory management unit or an address conversion unit when accessing a memory; wherein the virtual address request comprises a virtual address;
the mapping scheme configuration register searches an optimal memory access mapping mode corresponding to the GPU application according to the task type identifier corresponding to the task type information, and sends the optimal memory access mapping mode to an address conversion unit;
and the address conversion unit maps the virtual address or the physical address corresponding to the virtual address into a mapped address according to the optimal memory access mapping mode, wherein the mapped address is used for accessing a memory subsystem.
2. The GPU memory access adaptive optimization method of claim 1, wherein, when the virtual address request is initiated to the memory management unit when accessing the memory, the address conversion unit maps the physical address corresponding to the virtual address to the mapped address according to the optimal memory access mapping manner, and the optimization method further comprises:
the memory management unit converts the virtual address into a physical address, sends the physical address to an address conversion unit, finds a task type identifier corresponding to the GPU application based on the task type information, and sends the task type identifier to a mapping scheme configuration register;
and the address conversion unit sends the mapped address to the memory subsystem and accesses the memory subsystem based on the mapped address.
3. The GPU memory access adaptive optimization method of claim 2, wherein the memory management unit converting the virtual address to a physical address comprises: and the memory management unit converts the virtual address into a physical address through an extended page table according to the virtual address request, wherein the task type identifier is marked in the extended page table of the memory management unit.
4. The GPU memory access adaptive optimization method of claim 1, wherein, when the virtual address request is initiated to the address conversion unit when accessing the memory, the address conversion unit maps the virtual address to the mapped address according to the optimal memory access mapping manner, and the optimization method further comprises:
the address conversion unit sends the mapped address to a memory management unit; wherein the mapped address is a mapped virtual address;
and the memory management unit converts the mapped address into a physical address and accesses the memory subsystem based on the physical address.
5. The GPU memory access adaptive optimization method of claim 1, wherein, in the case where the virtual address request is initiated to the address conversion unit upon memory access, the address conversion unit maps the virtual address to the mapped address according to the optimal memory access mapping mode, and the optimization method further comprises:
the address conversion unit accesses the memory subsystem according to the mapped address.
6. The GPU memory access adaptive optimization method of any one of claims 1 to 5, characterized in that the optimal memory access mapping mode corresponding to the GPU application is determined as follows:
acquiring GPU memory access requests of the GPU application;
performing different memory address mappings on the GPU memory access requests using different memory access mapping modes, to obtain differently mapped GPU memory access requests;
determining an information entropy corresponding to each of the differently mapped GPU memory access requests; and
taking the memory access mapping mode corresponding to the maximum information entropy as the optimal memory access mapping mode of the GPU application.
7. The GPU memory access adaptive optimization method of claim 6, wherein determining the information entropy corresponding to each of the differently mapped GPU memory access requests comprises:
determining a time period according to the maximum wait-queue length of the GPU;
determining a plurality of statistical time periods based on the time period;
for any one of the mapped GPU memory access requests, determining, within each statistical time period, a distribution information entropy of that request over different access channels, wherein different access channels correspond to different memory subsystems; and
averaging the distribution information entropies over the plurality of statistical time periods, and taking the resulting average as the information entropy corresponding to that mapped GPU memory access request.
8. The GPU memory access adaptive optimization method of any one of claims 1 to 7, characterized in that the optimal memory access mapping mode corresponding to the GPU application is configured as follows:
writing an XOR control vector corresponding to the optimal memory access mapping mode into the mapping scheme configuration register, wherein the length of the XOR control vector equals the total number of memory address bits selected for the XOR operation.
9. The GPU memory access adaptive optimization method of claim 8, wherein the optimal memory access mapping mode corresponding to the GPU application is further configured as follows:
while the XOR control vector corresponding to the optimal memory access mapping mode is written into the mapping scheme configuration register, marking the task type identifier corresponding to the GPU application in a corresponding page table of the memory management unit, wherein the task type identifier and the XOR control vector have a correspondence relationship.
10. The GPU memory access adaptive optimization method of any one of claims 1 to 9, further comprising:
an on-chip performance register acquires a memory access record of the GPU application and updates the task type identifier of the GPU application according to the memory access record.
11. A GPU memory access adaptive optimization device, applied to a GPU, characterized by comprising:
a logic operation unit, configured to receive an instruction that is sent by a GPU application, contains task type information, and is required for completing a task, execute a corresponding operation according to the instruction, and initiate a virtual address request to a memory management unit or an address conversion unit when accessing a memory, wherein the virtual address request comprises a virtual address;
a mapping scheme configuration register, configured to search for an optimal memory access mapping mode corresponding to the GPU application according to a task type identifier corresponding to the task type information, and to send the optimal memory access mapping mode to the address conversion unit; and
the address conversion unit, configured to map the virtual address, or a physical address corresponding to the virtual address, to a mapped address according to the optimal memory access mapping mode, wherein the mapped address is used for accessing a memory subsystem.
12. The GPU memory access adaptive optimization device of claim 11, wherein, in the case where the virtual address request is initiated to the memory management unit upon memory access, the address conversion unit is configured to map the physical address corresponding to the virtual address to the mapped address according to the optimal memory access mapping mode;
the memory management unit is configured to convert the virtual address into the physical address, send the physical address to the address conversion unit, find the task type identifier corresponding to the GPU application based on the task type information, and send the task type identifier to the mapping scheme configuration register; and
the address conversion unit is configured to send the mapped address to the memory subsystem and to access the memory subsystem based on the mapped address.
13. The GPU memory access adaptive optimization device of claim 12, wherein converting the virtual address into the physical address comprises: converting the virtual address into the physical address through an extended page table according to the virtual address request, wherein the task type identifier is marked in the extended page table of the memory management unit.
14. The GPU memory access adaptive optimization device of claim 11, wherein, in the case where the virtual address request is initiated to the address conversion unit upon memory access, the address conversion unit is configured to map the virtual address to the mapped address according to the optimal memory access mapping mode;
the address conversion unit is configured to send the mapped address to the memory management unit, wherein the mapped address is a mapped virtual address; and
the memory management unit is configured to convert the mapped address into a physical address and to access the memory subsystem based on the physical address.
15. The GPU memory access adaptive optimization device of claim 11, wherein, in the case where the virtual address request is initiated to the address conversion unit upon memory access, the address conversion unit is configured to map the virtual address to the mapped address according to the optimal memory access mapping mode; and
the address conversion unit is configured to access the memory subsystem according to the mapped address.
16. The GPU memory access adaptive optimization device of any one of claims 11 to 15, wherein the optimal memory access mapping mode corresponding to the GPU application is determined as follows:
acquiring GPU memory access requests of the GPU application;
performing different memory address mappings on the GPU memory access requests using different memory access mapping modes, to obtain differently mapped GPU memory access requests;
determining an information entropy corresponding to each of the differently mapped GPU memory access requests; and
taking the memory access mapping mode corresponding to the maximum information entropy as the optimal memory access mapping mode of the GPU application.
17. The GPU memory access adaptive optimization device of claim 16, wherein determining the information entropy corresponding to each of the differently mapped GPU memory access requests comprises:
determining a time period according to the maximum wait-queue length of the GPU;
determining a plurality of statistical time periods based on the time period;
for any one of the mapped GPU memory access requests, determining, within each statistical time period, a distribution information entropy of that request over different access channels, wherein different access channels correspond to different memory subsystems; and
averaging the distribution information entropies over the plurality of statistical time periods, and taking the resulting average as the information entropy corresponding to that mapped GPU memory access request.
18. The GPU memory access adaptive optimization device of any one of claims 11 to 17, characterized in that the optimal memory access mapping mode corresponding to the GPU application is configured as follows:
writing an XOR control vector corresponding to the optimal memory access mapping mode into the mapping scheme configuration register, wherein the length of the XOR control vector equals the total number of memory address bits selected for the XOR operation.
19. The GPU memory access adaptive optimization device of claim 18, wherein the optimal memory access mapping mode corresponding to the GPU application is further configured as follows:
while the XOR control vector corresponding to the optimal memory access mapping mode is written into the mapping scheme configuration register, marking the task type identifier corresponding to the GPU application in a corresponding page table of the memory management unit, wherein the task type identifier and the XOR control vector have a correspondence relationship.
20. The GPU memory access adaptive optimization device of any one of claims 11 to 19, further comprising:
an on-chip performance register, configured to acquire a memory access record of the GPU application and to update the task type identifier of the GPU application according to the memory access record.
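
For illustration, claims 1 to 3 and claim 8 describe a runtime path in which the extended page table carries a task type identifier, the mapping scheme configuration register holds XOR control vectors per task type, and the address conversion unit XOR-folds selected physical address bits to form the mapped address. The following is a minimal Python sketch of that path under stated assumptions: a hypothetical single-level page table, 4 KB pages, one control vector per channel-select bit, and illustrative names (xor_fold, remap_address, PAGE_TABLE, MAPPING_SCHEMES, access are not the patent's structures).

    PAGE_SHIFT = 12  # assumption: 4 KB pages

    def xor_fold(bits: int) -> int:
        # Parity (XOR) of all set bits in `bits`.
        parity = 0
        while bits:
            parity ^= bits & 1
            bits >>= 1
        return parity

    def remap_address(phys_addr: int, xor_vectors: list[int], channel_lsb: int) -> int:
        # Claim 8: each channel-select bit is recomputed as the XOR of the
        # address bits marked by one control vector. Each vector below also
        # includes the bit it replaces, which keeps the mapping invertible.
        mapped = phys_addr
        for i, vector in enumerate(xor_vectors):
            bit = xor_fold(phys_addr & vector)
            mapped = (mapped & ~(1 << (channel_lsb + i))) | (bit << (channel_lsb + i))
        return mapped

    # Hypothetical stand-ins: the extended page table entry (claim 3) carries
    # the frame number and the task type identifier; the mapping scheme
    # configuration register (claim 1) maps that identifier to XOR vectors.
    PAGE_TABLE = {0x42: {"frame": 0x1A2B3, "task_type_id": "graphics"}}
    MAPPING_SCHEMES = {"graphics": [(1 << 8) | (1 << 16), (1 << 9) | (1 << 17)]}

    def access(virtual_addr: int, channel_lsb: int = 8) -> int:
        # Translate via the page table, look up the scheme by task type,
        # remap, and return the address used to access the memory subsystem.
        entry = PAGE_TABLE[virtual_addr >> PAGE_SHIFT]
        offset = virtual_addr & ((1 << PAGE_SHIFT) - 1)
        phys = (entry["frame"] << PAGE_SHIFT) | offset
        vectors = MAPPING_SCHEMES[entry["task_type_id"]]
        return remap_address(phys, vectors, channel_lsb)

    # e.g. access(0x42ABC) translates to physical 0x1A2B3ABC, then XORs bit 16
    # into bit 8 and bit 17 into bit 9 before the memory subsystem is accessed.

Because each remapped bit is an XOR that includes the original bit, distinct physical addresses never collide after remapping, which is the property the claims rely on for memory access correctness while redistributing traffic.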
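Claims 6 and 7 select, from a set of candidate mapping modes, the one whose mapped request stream spreads most evenly over the access channels, scored as the average Shannon entropy of the per-channel request distribution across statistical time periods. A minimal sketch under stated assumptions: requests are plain physical addresses, the channel index sits in fixed address bits, and channel_of, score_mapping, pick_optimal are illustrative names, not the patent's interface.

    import math
    from collections import Counter

    def channel_of(addr: int, num_channels: int, channel_lsb: int = 8) -> int:
        # Assumption: the channel index is taken from fixed address bits.
        return (addr >> channel_lsb) % num_channels

    def distribution_entropy(hits: Counter) -> float:
        # Shannon entropy (bits) of the per-channel counts: log2(num_channels)
        # for a perfectly even spread, 0 when one channel takes everything.
        total = sum(hits.values())
        entropy = 0.0
        for count in hits.values():
            p = count / total
            entropy -= p * math.log2(p)
        return entropy

    def score_mapping(requests, mapping, num_channels, window):
        # Claim 7: split the (non-empty) request trace into statistical time
        # periods sized by `window` (derived from the GPU's maximum wait-queue
        # length) and average the per-period distribution entropy.
        entropies = []
        for start in range(0, len(requests), window):
            batch = requests[start:start + window]
            hits = Counter(channel_of(mapping(a), num_channels) for a in batch)
            entropies.append(distribution_entropy(hits))
        return sum(entropies) / len(entropies)

    def pick_optimal(requests, candidates, num_channels, window):
        # Claim 6: the candidate with the maximum average entropy wins.
        # `candidates` maps a task type identifier to its mapping function.
        return max(candidates, key=lambda tid: score_mapping(
            requests, candidates[tid], num_channels, window))

For intuition, with four channels an evenly alternating stream scores log2 4 = 2 bits per period while a stream pinned to one channel scores 0, so maximizing the entropy average directly favors the mode that best balances the memory subsystems.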
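Claim 10 (mirrored by claim 20) closes the loop: an on-chip performance register records the application's actual accesses so the task type identifier can be re-bound when another scheme would now score better. A speculative sketch of that loop, reusing pick_optimal from the sketch above; PerfRegister and the task_type_id field are assumptions, not the patent's structures.

    class PerfRegister:
        # Stand-in for the on-chip performance register: it simply accumulates
        # the physical addresses the GPU application actually touched.
        def __init__(self):
            self.record: list[int] = []

        def log(self, phys_addr: int) -> None:
            self.record.append(phys_addr)

    def refresh_task_type(perf, candidates, num_channels, window, page_entry) -> None:
        # Replay the recorded accesses through every candidate scheme and
        # re-point the page's task type identifier at the current winner.
        page_entry["task_type_id"] = pick_optimal(
            perf.record, candidates, num_channels, window)
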
CN202210792723.4A 2022-02-15 2022-02-15 GPU access self-adaptive optimization method and device based on extended page table Active CN115422098B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210792723.4A CN115422098B (en) 2022-02-15 2022-02-15 GPU access self-adaptive optimization method and device based on extended page table

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210135055.8A CN114185818B (en) 2022-02-15 2022-02-15 GPU (graphics processing Unit) memory access self-adaptive optimization method and device based on extended page table
CN202210792723.4A CN115422098B (en) 2022-02-15 2022-02-15 GPU access self-adaptive optimization method and device based on extended page table

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202210135055.8A Division CN114185818B (en) 2022-02-15 2022-02-15 GPU (graphics processing Unit) memory access self-adaptive optimization method and device based on extended page table

Publications (2)

Publication Number Publication Date
CN115422098A true CN115422098A (en) 2022-12-02
CN115422098B CN115422098B (en) 2023-08-29

Family

ID=80545924

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202210135055.8A Active CN114185818B (en) 2022-02-15 2022-02-15 GPU (graphics processing Unit) memory access self-adaptive optimization method and device based on extended page table
CN202210792723.4A Active CN115422098B (en) 2022-02-15 2022-02-15 GPU access self-adaptive optimization method and device based on extended page table

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202210135055.8A Active CN114185818B (en) 2022-02-15 2022-02-15 GPU (graphics processing Unit) memory access self-adaptive optimization method and device based on extended page table

Country Status (1)

Country Link
CN (2) CN114185818B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115190102B (en) * 2022-07-22 2024-04-16 Beijing Xiangdixian Computing Technology Co., Ltd. Information broadcasting method, information broadcasting device, electronic unit, SOC (system on chip) and electronic equipment
CN115202892B (en) * 2022-09-15 2022-12-23 Guangdong-Hong Kong-Macao Greater Bay Area Digital Economy Research Institute (Futian) Memory expansion system and memory expansion method of cryptographic coprocessor
CN115456862B (en) * 2022-11-09 2023-03-24 Shenliu Micro Intelligent Technology (Shenzhen) Co., Ltd. Memory access processing method and device for image processor
CN116339916B (en) * 2023-03-17 2023-11-10 Moore Threads Intelligent Technology (Beijing) Co., Ltd. Memory virtualization method and device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6026475A (en) * 1997-11-26 2000-02-15 Digital Equipment Corporation Method for dynamically remapping a virtual address to a physical address to maintain an even distribution of cache page addresses in a virtual address space
US9501222B2 (en) * 2014-05-09 2016-11-22 Micron Technology, Inc. Protection zones in virtualized physical addresses for reconfigurable memory systems using a memory abstraction
CN109710544B (en) * 2017-10-26 2021-02-09 Huawei Technologies Co., Ltd. Memory access method, computer system and processing device
CN113836054B (en) * 2021-08-30 2023-08-22 National Defense Technology Innovation Institute, PLA Academy of Military Science Memory page management method and memory page conversion method for GPU

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6353879B1 * 1998-07-03 2002-03-05 Arm Limited Memory address translation in a data processing system
US20040109365A1 * 2002-12-10 2004-06-10 Isic Corporation Methods and apparatus for data storage and retrieval
CN103440208A * 2013-08-12 2013-12-11 Huawei Technologies Co., Ltd. Data storage method and device
WO2019164827A1 * 2018-02-22 2019-08-29 Pensando Systems Inc. Programmable computer io device interface
CN109144901A * 2018-10-10 2019-01-04 Gu Jin Formulate virtual address conversion
US20200310979A1 * 2019-03-28 2020-10-01 Intel Corporation System, Apparatus And Method For Application Specific Address Mapping
CN113767371A * 2019-05-10 2021-12-07 International Business Machines Corporation Address generation for high performance vector processing
CN112540939A * 2019-09-23 2021-03-23 Alibaba Group Holding Ltd. Storage management device, storage management method, processor and computer system
US20210334630A1 * 2020-04-28 2021-10-28 Nvidia Corporation Model predictive control techniques for autonomous systems
CN111858396A * 2020-07-27 2020-10-30 Fuzhou University Memory self-adaptive address mapping method and system
CN113886105A * 2021-09-30 2022-01-04 Beijing ByteDance Network Technology Co., Ltd. Cross-process calling method and device, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHENG, Xianyi et al., "A Survey of Research on System Security Isolation Technology", Chinese Journal of Computers, vol. 40, no. 5, pages 1057-1079 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117724992A * 2024-02-09 2024-03-19 Shenzhen Kunyun Information Technology Co., Ltd. Method for accessing memory, data storage architecture and computing device

Also Published As

Publication number Publication date
CN114185818B (en) 2022-08-02
CN115422098B (en) 2023-08-29
CN114185818A (en) 2022-03-15

Similar Documents

Publication Publication Date Title
CN114185818B (en) GPU (graphics processing Unit) memory access self-adaptive optimization method and device based on extended page table
CN108231109B (en) Method, device and system for refreshing Dynamic Random Access Memory (DRAM)
EP2645259B1 (en) Method, device and system for caching data in multi-node system
US9489409B2 (en) Rollover strategies in a N-bit dictionary compressed column store
CN109388590B (en) Dynamic cache block management method and device for improving multichannel DMA (direct memory access) access performance
US9740438B2 (en) Allocating memory on multiple types of main memory technologies from software application layer
JP2008021314A (en) Multilevel memory architecture with data prioritization
CN105095116A (en) Cache replacing method, cache controller and processor
EP3688596B1 (en) Computer program product, system, and method to manage access to storage resources from multiple applications
KR20220041937A (en) Page table hooks to memory types
KR20220060548A (en) Accessing stored metadata to identify the memory device where the data is stored
US8671261B2 (en) Lightweight random memory allocation
US10152434B2 (en) Efficient arbitration for memory accesses
CN107025181B (en) Method for accessing memory unit, method and system for allocating memory request and memory controller
CN112506823B (en) FPGA data reading and writing method, device, equipment and readable storage medium
KR20220049026A (en) Memory system for data binding to memory namespaces
CN110851383A (en) Method and equipment for managing storage system
CN114942831A (en) Processor, chip, electronic device and data processing method
WO2024036985A1 (en) Storage system, computational storage processor and solid-state drive thereof, and data reading method and data writing method therefor
JP2023508676A (en) Memory manipulation considering wear leveling
CN116235153A (en) Hardware-software cooperative address mapping scheme for efficient in-memory processing systems
CN115080455A (en) Computer chip, computer board card, and storage space distribution method and device
US20200117505A1 (en) Memory processor-based multiprocessing architecture and operation method thereof
CN116795735B (en) Solid state disk space allocation method, device, medium and system
CN117215485A (en) ZNS SSD management method, data writing method, storage device and controller

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant