CN115237605A - Data transmission method between CPU and GPU and computer equipment - Google Patents
Data transmission method between CPU and GPU and computer equipment
- Publication number
- CN115237605A (application CN202211134216.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- data set
- gpu
- cpu
- attribute
- Prior art date
- Legal status
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/0223—User address space allocation, e.g. contiguous or non contiguous base addressing
- G06F12/023—Free address space management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/06—Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
- G06F12/0646—Configuration or reconfiguration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0877—Cache access modes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/466—Transaction processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application relates to the technical field of data processing and discloses a data transmission method between a CPU and a GPU, and a computer device. The method comprises the following steps: acquiring a first data set to be transmitted by the CPU, the first data set comprising a plurality of class data with the same class name; performing attribute merging on the class data in the first data set to obtain a second data set; based on the order in which attribute values are arranged in the second data set, sequentially establishing an address mapping relationship between the CPU memory address of each corresponding attribute value in the first data set and a GPU memory address; and transmitting the first data set to the GPU for storage based on the address mapping relationship. The method solves the problem that, in existing CPU-GPU data transmission, the storage layout of the data is left unchanged after transfer, which makes GPU data reads inefficient.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular to a data transmission method between a CPU and a GPU, and a computer device.
Background
For a complex neural network, computation on a Central Processing Unit (CPU) alone is inefficient. Because neural networks are highly parallel, using a Graphics Processing Unit (GPU) suited to parallel computation to handle the parallel workload can effectively improve computation efficiency. With the continuous development of artificial intelligence, hardware requirements on the GPU, which excels at large-scale parallel operations, keep rising; in a normal workflow the GPU still completes its computation tasks under instruction control from the CPU, so data is frequently transferred between the CPU and the GPU.
In addition, object-oriented programming is a mainstream program design method with advantages such as high readability, easy extension, and convenient modeling, so programmers often use it to design GPU programs, and the data must be transmitted from the CPU to the GPU in order to exploit the GPU's parallel computing capability for large-scale parallel computation. A typical GPU program proceeds as follows: first, memory space is allocated for the data on the GPU; then the mapping between the data's CPU memory addresses and the allocated GPU addresses is computed; the data is copied from CPU memory to GPU memory; thread bundles in each GPU compute unit submit memory access transactions to GPU memory to obtain the data and perform the computation; finally, the result is transferred from GPU memory back to CPU memory.
In existing CPU-GPU data transmission, the storage layout of the data is the same in the CPU and the GPU: the data is transmitted and stored in the form of class data, so its storage structure on the GPU matches that on the CPU. However, because the CPU and the GPU access memory data in different patterns, this transmission and storage scheme is unfavorable for GPU reads, severely limits the GPU's effective memory bandwidth, and wastes a great deal of GPU cache.
Disclosure of Invention
In view of the above technical problems, the present application provides a data transmission method between a CPU and a GPU, and a computer device, solving the problem that in existing CPU-GPU data transmission the post-transfer data storage layout is left unchanged, which makes GPU data reads inefficient.
To solve the above technical problems, the technical solution adopted by the present application is as follows:
a method for data transmission between a CPU and a GPU comprises the following steps:
acquiring a first data set required to be transmitted by a CPU, wherein the first data set comprises a plurality of class data with the same class name;
carrying out attribute combination on class data in the first data set to obtain a second data set;
based on the attribute value arrangement sequence in the second data set, sequentially establishing an address mapping relation between the memory address of the corresponding attribute value in the first data set in the CPU and the memory address of the GPU;
and transmitting the first data set to the GPU for storage based on the address mapping relation.
Further, performing attribute merging on the class data in the first data set to obtain the second data set comprises:
acquiring an attribute list of the class data, wherein the attribute list comprises a plurality of attribute names of the class data;
sequentially extracting the attribute names in the attribute list, and sequentially extracting the attribute values of the corresponding attribute names in the first data set based on the attribute names;
and arranging the extracted attribute values in sequence to obtain the second data set.
Further, the GPU stores the attribute values in the first data set in the storage space of the GPU sequentially according to the arrangement order of the attribute values in the second data set.
Further, before sequentially establishing an address mapping relationship between the memory address of the corresponding attribute value in the first data set in the CPU and the memory address of the GPU based on the attribute value arrangement order in the second data set, the method further includes:
calculating the size of a memory required to be occupied by the first data set;
based on the memory size, the GPU allocates memory space for storing the first data set.
Further, the address mapping relationship is stored, indexed by the class name of the class data in the first data set.
Further, after acquiring the first data set that the CPU needs to transmit, the method further includes:
acquiring a class name of class data in a first data set;
searching whether the class name has a corresponding stored address mapping relation;
and if the class name has a corresponding stored address mapping relation, transmitting the first data set to the GPU for storage based on the stored address mapping relation.
Further, if the class name does not have a corresponding stored address mapping relationship, the method goes to a step of performing attribute merging on the class data in the first data set to obtain a second data set.
A computer device comprising a CPU, a GPU and an address management module, the address management module comprising:
the data reading unit is used for acquiring a first data set required to be transmitted by the CPU, and the first data set comprises a plurality of class data with the same class name;
the attribute merging unit is used for performing attribute merging on the class data in the first data set to obtain a second data set;
the address mapping unit is used for sequentially establishing the address mapping relation between the memory address of the corresponding attribute value in the first data set in the CPU and the memory address of the GPU based on the attribute value arrangement sequence in the second data set;
and the data transmission unit is used for transmitting the first data set to the GPU for storage based on the address mapping relation.
Compared with the prior art, the beneficial effects of the present application are as follows:
By changing the structure of the data transmitted from the CPU to the GPU, the present application synchronously changes the storage structure of object-oriented program data on the GPU, which reduces GPU memory access transactions, improves GPU memory access efficiency and memory bandwidth, reduces waste of the GPU's L2 cache, and improves L2 cache utilization.
In addition, the address mapping relationship for data transmission between the CPU and the GPU can be stored: after the CPU transmits data to the GPU for the first time, the CPU-GPU memory address mapping of the data is recorded, and when the same kind of data is encountered later, the step of computing the address mapping is omitted from the transmission process, saving CPU computing resources and speeding up data transmission.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. Wherein:
Fig. 1 is a schematic flow chart of the data transmission method between a CPU and a GPU.
Fig. 2 is a schematic flow chart of performing attribute merging on class data in a first data set to obtain a second data set.
Fig. 3 is a schematic flow chart of the GPU allocating memory space.
Fig. 4 is a schematic flow chart of searching whether data has a corresponding stored address mapping relationship.
Fig. 5 is a diagram of a conventional hardware architecture of a CPU and a GPU.
Fig. 6 is a schematic diagram of the storage structure of data on a conventional GPU.
Fig. 7 is a schematic diagram of reading data on a conventional GPU.
Fig. 8 is a schematic diagram of the storage structure of data on a GPU according to the present application.
Fig. 9 is a schematic diagram of reading data on a GPU according to the present application.
Fig. 10 is a schematic block diagram of the structure of the computer device.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.
It should be understood that "system", "apparatus", "unit" and/or "module" as used in this specification are terms for distinguishing components, elements, parts, portions, or assemblies at different levels; other words may be substituted if they accomplish the same purpose.
As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; the steps and elements do not form an exclusive list, and a method or apparatus may also include other steps or elements.
Flow charts are used in this specification to illustrate operations performed by a system according to embodiments of this specification. It should be understood that the operations need not be performed exactly in the order shown; steps may instead be processed in reverse order or simultaneously, other operations may be added to the flows, and one or more steps may be removed from them.
Referring to fig. 1, in some embodiments, a method for data transmission between a CPU and a GPU includes:
s101, acquiring a first data set required to be transmitted by a CPU, wherein the first data set comprises a plurality of class data with the same class name;
s102, carrying out attribute combination on class data in the first data set to obtain a second data set;
s103, sequentially establishing address mapping relations between memory addresses of corresponding attribute values in the first data set in the CPU and GPU memory addresses based on the attribute value arrangement sequence in the second data set;
and S104, transmitting the first data set to the GPU for storage based on the address mapping relation.
In this embodiment, recall that in the prior art the general flow of a GPU program comprises four steps (a host-side code sketch follows the list):
1. Memory address calculation: the GPU first allocates memory space, and the mapping relationship between CPU memory addresses and GPU memory addresses is determined;
2. Data transmission: the CPU transmits the data to the GPU over the transmission bus according to the address mapping relationship;
3. Data acquisition: the GPU compute cores access GPU memory via thread bundles to obtain the data;
4. Computation and result return: the GPU compute cores compute on the data obtained by the thread bundles and transmit the result back to the CPU.
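To make the four steps concrete, the following is a minimal host-side sketch of this flow in CUDA C++; the kernel name compute, the float payload, and the sizes are illustrative assumptions, not taken from the patent:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void compute(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;               // placeholder computation
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float* h_in  = (float*)malloc(bytes);
    float* h_out = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;     // sample input

    // Step 1: memory address calculation -- the GPU allocates space; the
    // returned device pointers fix the CPU-to-GPU address mapping.
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    // Step 2: data transmission -- CPU memory to GPU global memory.
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // Step 3: data acquisition -- thread bundles fetch data and compute.
    compute<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    // Step 4: result return -- GPU global memory back to CPU memory.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
    return 0;
}
```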
Referring to fig. 5, the prior-art hardware architecture diagram of the CPU and the GPU shows that when the CPU transfers data to the GPU, the data essentially travels directly from CPU memory to GPU memory over the transfer bus (note: the GPU contains many types of memory; the destination here is the GPU's global memory).
First, the CPU sends a memory allocation instruction to the GPU to allocate memory for the data to be transmitted, and then the mapping between the data's CPU memory addresses and GPU memory addresses is computed; this mapping can be used in the subsequent data transmission. Using the address mapping computed in the previous step, the CPU then transmits the data over the data transmission bus and stores it in GPU memory.
After the data has been transferred from CPU memory to GPU memory, the GPU compute cores may request access to global memory to obtain the data for the next stage of computation. As is known, the large-scale computation capability of a GPU is delivered by multiple streaming multiprocessor (SM) hardware units; an SM is the integrated data-processing unit of the GPU, and a single SM can support hundreds of threads executing concurrently.
The thread bundle is the most basic execution unit of the SM, and each thread bundle consists of 32 consecutive threads. All threads in the same thread bundle execute in Single Instruction Multiple Thread (SIMT) fashion, that is, all threads in the bundle execute the same instruction at the same time, each thread computing on its own private data.
For a memory access instruction (one that accesses GPU memory), each thread in the bundle has its own memory address, and each thread submits a memory access transaction against its address. Global memory is the slowest and largest memory in the GPU, and any SM can access it.
In GPU programs designed with the currently common object-oriented programming method, data sharing the same attributes is usually defined as a class, the name of which is the class name, and each class has many attributes. For example, one might define an image class whose class name is image and whose attributes are length, width, and height.
Assume a class named S, where S has 4 attributes a, b, c, and d (each attribute occupying one byte of storage space), and a batch of data consists of 32 instances of S. Under the existing CPU-GPU data transmission method, the storage structure of this batch on the GPU hardware is shown in fig. 6: memory addresses are in byte units, each cell is 1 byte, each S occupies 4 bytes, and the 32 instances of S occupy 128 bytes.
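In C++ terms this is an array-of-structures (AoS) layout. A minimal sketch of the example (the attribute names a, b, c, d, the one-byte sizes, and the count of 32 come from the description; the struct syntax is our illustration):

```cuda
// Array-of-structures (AoS): the layout of fig. 6.
struct S {
    char a;      // 1 byte
    char b;      // 1 byte
    char c;      // 1 byte
    char d;      // 1 byte
};               // sizeof(S) == 4 bytes

S batch[32];     // 32 instances -> 128 contiguous bytes:
                 // a0 b0 c0 d0 | a1 b1 c1 d1 | ... | a31 b31 c31 d31
```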
Combining this with the GPU memory access principle described above: the basic unit of GPU memory access is the thread bundle (that is, when one thread in a bundle needs to access memory, the other threads in the bundle access memory at the same time, whether they need to or not), and each such access is called a memory access transaction. Assuming a bundle contains 32 threads and each thread reads 1 byte, one memory access transaction covers 32 bytes.
As shown in fig. 7, suppose a thread bundle needs attribute a of all the data for a computation. The bundle can access only 32 bytes of GPU memory at a time, yet the a values are scattered across addresses 0 to 128, which means the bundle must access memory 4 times to obtain the complete data.
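A sketch of this access pattern, assuming the struct S from the previous sketch (the kernel and its names are illustrative):

```cuda
// Each of the 32 threads in the bundle reads one 'a' at a stride of
// 4 bytes, so the 32 addresses span 128 bytes and the bundle must
// issue 4 separate 32-byte memory access transactions.
__global__ void read_a_aos(const S* data, char* out) {
    int i = threadIdx.x;      // 0..31: one thread bundle
    out[i] = data[i].a;       // address = base + 4*i -> strided, uncoalesced
}
```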
In this embodiment, when the CPU transmits data to the GPU, the storage layout of the data is changed by attribute merging. Still taking the class-S data as an example, the improved storage structure is shown in fig. 8, where the attributes of the 32 instances of S are merged group by group as a, b, c, and d. Referring to fig. 9, with the new storage structure the thread bundle needs only 1 memory access transaction to obtain the complete data. The pre-improvement structure required 4 memory access transactions where the improved one requires only 1, saving GPU memory access resources and accelerating the whole computation.
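The merged layout is the classic structure-of-arrays (SoA) form. A sketch under the same assumptions as above:

```cuda
// Structure-of-arrays (SoA): the layout of fig. 8.
struct S_merged {
    char a[32];   // a0..a31 contiguous at bytes  0..31
    char b[32];   // b0..b31 contiguous at bytes 32..63
    char c[32];   // c0..c31 contiguous at bytes 64..95
    char d[32];   // d0..d31 contiguous at bytes 96..127
};

// The 32 threads now read 32 consecutive bytes, so one coalesced
// 32-byte transaction replaces the 4 transactions of the AoS layout.
__global__ void read_a_soa(const S_merged* data, char* out) {
    int i = threadIdx.x;      // 0..31: one thread bundle
    out[i] = data->a[i];      // address = base + i -> fully coalesced
}
```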
In addition, the GPU has a caching mechanism. The GPU's L2 cache (second-level cache) is a storage device that is faster to access but smaller than global memory. When an SM needs to read data, it first looks for the data in the L2 cache; if found, the data is read directly, and if not, it must be obtained from global memory. The L2 cache works on the spatial locality principle: when an SM accesses a datum in global memory, that datum and its neighbors are loaded into the L2 cache, since data adjacent to a requested datum is likely to be requested as well.
Because obtaining data from the L2 cache is far faster than obtaining it from memory, the GPU consults the L2 cache first: if the target data is cached, it is obtained immediately; if not, memory is accessed, and the fetched data is then placed into the cache for possible use next time.
Therefore, with data stored in GPU memory under the existing CPU-GPU transmission method, every memory access transaction of the thread bundle drags unneeded data into the L2 cache: the attributes b, c, and d are loaded into the L2 cache along with attribute a, and this useless data occupies a large share of the L2 cache, causing heavy cache waste and low cache utilization.
With the data transmission method of this embodiment, the data in GPU memory is stored according to the attribute-merged structure, so the unused b, c, and d attributes need not be loaded into the L2 cache, greatly reducing L2 cache waste and improving cache utilization.
In summary, the data transmission method between the CPU and the GPU of this embodiment performs attribute merging on a batch of data belonging to the same class, changing the storage structure of the data on the GPU, which improves the GPU's global memory bandwidth, reduces GPU L2 cache waste, and improves cache utilization.
Referring to fig. 2, preferably, performing attribute merging on the class data in the first data set to obtain the second data set comprises the following steps (a code sketch follows the list):
s201, acquiring an attribute list of class data, wherein the attribute list comprises a plurality of attribute names of the class data;
s202, sequentially extracting attribute names in the attribute list, and sequentially extracting attribute values of corresponding attribute names in the first data set based on the attribute names;
s203, arranging the extracted attribute values in sequence to obtain a second data set.
Preferably, the GPU stores the attribute values of the first data set in GPU memory space sequentially, according to the order in which the attribute values are arranged in the second data set.
Referring to fig. 3, preferably, before the address mapping relationships between the memory addresses in the CPU of the corresponding attribute values in the first data set and GPU memory addresses are sequentially established based on the attribute value arrangement order in the second data set, the method further comprises the following steps (a code sketch follows the list):
s301, calculating the size of a memory occupied by the first data set;
s302, based on the memory size, the GPU allocates a memory space for storing the first data set.
In some embodiments, the address mapping relationship is stored and is indexed by the class name of the class data in the first data set.
Referring to fig. 4, preferably, after acquiring the first data set that the CPU needs to transmit, the method further includes:
s401, acquiring a class name of class data in a first data set;
s402, searching whether the class name has a corresponding stored address mapping relation;
and S403, if the class name has the corresponding stored address mapping relation, transmitting the first data set to the GPU for storage based on the stored address mapping relation.
S404, if no stored address mapping relationship corresponding to the class name exists, proceeding to the step of performing attribute merging on the class data in the first data set to obtain the second data set.
In this embodiment, during the first transmission of data from the CPU to the GPU, the address mapping between the data's CPU memory and GPU memory is retained. When data of the same class is transmitted later, the address mapping need not be computed again: data transmission proceeds directly according to the existing mapping. Recording the CPU-to-GPU memory address mapping thus avoids recomputing the mapping every time the CPU transmits data to the GPU and saves computing resources.
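One way to realize the class-name-indexed lookup of S401 to S404 is sketched below; the Mapping struct, the per-attribute offsets, and the use of cudaMemcpy2D for the strided per-attribute copy of the struct S data are our assumptions, not the patent's prescribed implementation:

```cuda
#include <cuda_runtime.h>
#include <string>
#include <unordered_map>

// A stored address mapping: the GPU base address plus the byte offset at
// which each merged attribute run of the class is stored.
struct Mapping {
    char*  d_base;           // base address returned by cudaMalloc
    size_t count;            // number of instances in the data set
    size_t attr_offsets[4];  // GPU offsets of the a, b, c, d runs
};

std::unordered_map<std::string, Mapping> mapping_cache;  // keyed by class name

// S401-S403: if a stored mapping exists for the class name, transmit
// directly according to it; each attribute run is gathered from the AoS
// host data with a pitched copy, so no mapping recomputation is needed.
bool transmit_if_cached(const std::string& class_name, const S* h_data) {
    auto it = mapping_cache.find(class_name);
    if (it == mapping_cache.end())
        return false;  // S404: caller performs attribute merging and
                       // establishes (then stores) a new mapping
    const Mapping& m = it->second;
    const char* src = reinterpret_cast<const char*>(h_data);
    for (int attr = 0; attr < 4; ++attr)  // a, b, c, d
        cudaMemcpy2D(m.d_base + m.attr_offsets[attr], /*dpitch=*/1,
                     src + attr, /*spitch=*/sizeof(S),
                     /*width=*/1, /*height=*/m.count,
                     cudaMemcpyHostToDevice);
    return true;
}
```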
In addition, because the address mapping relationship is explicit, the data can be transmitted in parallel by multiple processes, accelerating transmission. Computation starts once all threads have obtained their corresponding data, and after computation finishes the GPU transmits the result back to the CPU.
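The patent mentions parallel transmission by multiple processes; a single-host analogue is to issue the per-attribute copies on independent CUDA streams, as sketched below (this substitution, and all names, including the Mapping struct from the previous sketch, are ours):

```cuda
#include <cuda_runtime.h>

// Issue the four attribute runs as concurrent async copies; for true
// overlap the host buffer should be pinned (cudaMallocHost).
void transmit_parallel(const Mapping& m, const char* h_merged) {
    cudaStream_t streams[4];
    for (int i = 0; i < 4; ++i) cudaStreamCreate(&streams[i]);
    const size_t run = m.count;  // bytes per attribute run (1 byte/instance)
    for (int attr = 0; attr < 4; ++attr)
        cudaMemcpyAsync(m.d_base + m.attr_offsets[attr],
                        h_merged + attr * run, run,
                        cudaMemcpyHostToDevice, streams[attr]);
    for (int i = 0; i < 4; ++i) {   // wait for completion, then release
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```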
Referring to fig. 10, in some embodiments, there is also disclosed a computer device comprising a CPU, a GPU and an address management module, the address management module comprising:
the data reading unit is used for acquiring a first data set required to be transmitted by the CPU, and the first data set comprises a plurality of class data with the same class name;
the attribute merging unit is used for performing attribute merging on the class data in the first data set to obtain a second data set;
the address mapping unit is used for sequentially establishing the address mapping relation between the memory address of the corresponding attribute value in the first data set in the CPU and the memory address of the GPU based on the attribute value arrangement sequence in the second data set;
and the data transmission unit is used for transmitting the first data set to the GPU for storage based on the address mapping relation.
In this embodiment, the address management module changes the data structure when the CPU transmits data to the GPU, so the storage structure of the data delivered to the GPU changes accordingly, which greatly improves GPU memory bandwidth and reduces GPU cache waste.
The above are embodiments of the present application. The embodiments and the specific parameters therein are intended only to clearly illustrate the verification process of the application and not to limit its scope of patent protection, which is defined by the claims; all equivalent structural changes made using the contents of the specification and drawings of the present application likewise fall within its scope of protection.
Claims (8)
- 1. A method for data transmission between a CPU and a GPU, comprising: acquiring a first data set to be transmitted by the CPU, wherein the first data set comprises a plurality of class data with the same class name; performing attribute merging on the class data in the first data set to obtain a second data set; based on the attribute value arrangement order in the second data set, sequentially establishing an address mapping relationship between the memory address in the CPU of the corresponding attribute value in the first data set and a GPU memory address; and transmitting the first data set to the GPU for storage based on the address mapping relationship.
- 2. The method according to claim 1, wherein performing attribute merging on the class data in the first data set to obtain a second data set comprises: acquiring an attribute list of the class data, wherein the attribute list comprises a plurality of attribute names of the class data; sequentially extracting the attribute names in the attribute list, and sequentially extracting the attribute values of the corresponding attribute names in the first data set based on the attribute names; and arranging the extracted attribute values in sequence to obtain the second data set.
- 3. The method according to claim 2, wherein the GPU stores the attribute values of the first data set in GPU memory space according to the attribute value arrangement order in the second data set.
- 4. The method according to claim 1, wherein before sequentially establishing, based on the attribute value arrangement order in the second data set, the address mapping relationship between the memory address in the CPU of the corresponding attribute value in the first data set and a GPU memory address, the method further comprises: calculating the memory size required by the first data set; and, based on the memory size, the GPU allocating memory space for storing the first data set.
- 5. The method according to claim 1, further comprising: storing the address mapping relationship, the address mapping relationship being indexed by the class name of the class data in the first data set.
- 6. The method according to claim 5, further comprising, after acquiring the first data set to be transmitted by the CPU: acquiring the class name of the class data in the first data set; searching whether a stored address mapping relationship corresponding to the class name exists; and if a stored address mapping relationship corresponding to the class name exists, transmitting the first data set to the GPU for storage based on the stored address mapping relationship.
- 7. The method according to claim 6, wherein if no stored address mapping relationship corresponding to the class name exists, attribute merging is performed on the class data in the first data set to obtain the second data set.
- 8. A computer device, comprising a CPU, a GPU, and an address management module, the address management module comprising: a data reading unit, configured to acquire a first data set to be transmitted by the CPU, the first data set comprising a plurality of class data with the same class name; an attribute merging unit, configured to perform attribute merging on the class data in the first data set to obtain a second data set; an address mapping unit, configured to sequentially establish, based on the attribute value arrangement order in the second data set, an address mapping relationship between the memory address in the CPU of the corresponding attribute value in the first data set and a GPU memory address; and a data transmission unit, configured to transmit the first data set to the GPU for storage based on the address mapping relationship.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211134216.8A CN115237605B (en) | 2022-09-19 | 2022-09-19 | Data transmission method between CPU and GPU and computer equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211134216.8A CN115237605B (en) | 2022-09-19 | 2022-09-19 | Data transmission method between CPU and GPU and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115237605A true CN115237605A (en) | 2022-10-25 |
CN115237605B CN115237605B (en) | 2023-03-28 |
Family
ID=83681551
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211134216.8A Active CN115237605B (en) | 2022-09-19 | 2022-09-19 | Data transmission method between CPU and GPU and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115237605B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110004866A1 (en) * | 2009-07-01 | 2011-01-06 | Frost Gary R | Combining classes referenced by immutable classes into a single synthetic class |
CN103019949A (en) * | 2012-12-27 | 2013-04-03 | 华为技术有限公司 | Allocation method and device of write combine attribute memory space |
US20140129806A1 (en) * | 2012-11-08 | 2014-05-08 | Advanced Micro Devices, Inc. | Load/store picker |
CN104156268A (en) * | 2014-07-08 | 2014-11-19 | 四川大学 | Method for load distribution and thread structure optimization of MapReduce on GPU |
CN109902059A (en) * | 2019-02-28 | 2019-06-18 | 苏州浪潮智能科技有限公司 | A kind of data transmission method between CPU and GPU |
CN109919166A (en) * | 2017-12-12 | 2019-06-21 | 杭州海康威视数字技术股份有限公司 | The method and apparatus for obtaining the classification information of attribute |
CN109992385A (en) * | 2019-03-19 | 2019-07-09 | 四川大学 | A kind of inside GPU energy consumption optimization method of task based access control balance dispatching |
CN114265849A (en) * | 2022-02-28 | 2022-04-01 | 杭州广立微电子股份有限公司 | Data aggregation method and system |
CN114546491A (en) * | 2021-11-04 | 2022-05-27 | 北京壁仞科技开发有限公司 | Data operation method, data operation device and data processor |
- 2022-09-19: CN CN202211134216.8A patent/CN115237605B/en active (Active)
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110004866A1 (en) * | 2009-07-01 | 2011-01-06 | Frost Gary R | Combining classes referenced by immutable classes into a single synthetic class |
US20140129806A1 (en) * | 2012-11-08 | 2014-05-08 | Advanced Micro Devices, Inc. | Load/store picker |
CN103019949A (en) * | 2012-12-27 | 2013-04-03 | 华为技术有限公司 | Allocation method and device of write combine attribute memory space |
CN104156268A (en) * | 2014-07-08 | 2014-11-19 | 四川大学 | Method for load distribution and thread structure optimization of MapReduce on GPU |
CN109919166A (en) * | 2017-12-12 | 2019-06-21 | 杭州海康威视数字技术股份有限公司 | The method and apparatus for obtaining the classification information of attribute |
CN109902059A (en) * | 2019-02-28 | 2019-06-18 | 苏州浪潮智能科技有限公司 | A kind of data transmission method between CPU and GPU |
CN109992385A (en) * | 2019-03-19 | 2019-07-09 | 四川大学 | A kind of inside GPU energy consumption optimization method of task based access control balance dispatching |
CN114546491A (en) * | 2021-11-04 | 2022-05-27 | 北京壁仞科技开发有限公司 | Data operation method, data operation device and data processor |
CN114265849A (en) * | 2022-02-28 | 2022-04-01 | 杭州广立微电子股份有限公司 | Data aggregation method and system |
Non-Patent Citations (2)
Title |
---|
MORITZ KREUTZER et al.: "Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems", 2015 IEEE International Parallel and Distributed Processing Symposium *
LI Tao et al.: "Research on GPU Task Parallel Computation Mode Based on Thread Pool", Chinese Journal of Computers *
Also Published As
Publication number | Publication date |
---|---|
CN115237605B (en) | 2023-03-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106991011B (en) | CPU multithreading and GPU (graphics processing unit) multi-granularity parallel and cooperative optimization based method | |
US6816947B1 (en) | System and method for memory arbitration | |
DE102013017511B4 (en) | EFFICIENT STORAGE VIRTUALIZATION IN MULTI-THREAD PROCESSING UNITS | |
EP3639144B1 (en) | Memory management in non-volatile memory | |
US8214577B2 (en) | Method of memory management for server-side scripting language runtime system | |
US9086920B2 (en) | Device for managing data buffers in a memory space divided into a plurality of memory elements | |
EP2761435A1 (en) | Cache and/or socket sensitive multi-processor cores breadth-first traversal | |
US8495302B2 (en) | Selecting a target number of pages for allocation to a partition | |
US11940915B2 (en) | Cache allocation method and device, storage medium, and electronic device | |
US20140115291A1 (en) | Numa optimization for garbage collection of multi-threaded applications | |
WO2024036985A1 (en) | Storage system, computational storage processor and solid-state drive thereof, and data reading method and data writing method therefor | |
US11928061B2 (en) | Cache management method and apparatus | |
EP4209914A1 (en) | Reconfigurable cache architecture and methods for cache coherency | |
CN115237605B (en) | Data transmission method between CPU and GPU and computer equipment | |
CN116302504B (en) | Thread block processing system, method and related equipment | |
US9405470B2 (en) | Data processing system and data processing method | |
CN116225693A (en) | Metadata management method, device, computer equipment and storage medium | |
US7406554B1 (en) | Queue circuit and method for memory arbitration employing same | |
CN115794368A (en) | Service system, memory management method and device | |
KR20220142059A (en) | In-memory Decoding Cache and Its Management Scheme for Accelerating Deep Learning Batching Process | |
JPH02162439A (en) | Free list control system for shared memory | |
US7073004B2 (en) | Method and data processing system for microprocessor communication in a cluster-based multi-processor network | |
WO2023241655A1 (en) | Data processing method, apparatus, electronic device, and computer-readable storage medium | |
US20240211302A1 (en) | Dynamic provisioning of portions of a data processing array for spatial and temporal sharing | |
CN117931721A (en) | Many-core chip and data acquisition method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||