CN116243845A - CUDA-based data processing method, computing device and storage medium - Google Patents

Info

Publication number
CN116243845A
CN116243845A (application CN202111485569.8A)
Authority
CN
China
Prior art keywords
task
storage space
processed
input data
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111485569.8A
Other languages
Chinese (zh)
Inventor
林晓露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Jingtai Technology Co Ltd
Original Assignee
Shenzhen Jingtai Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Jingtai Technology Co Ltd filed Critical Shenzhen Jingtai Technology Co Ltd
Priority to CN202111485569.8A
Priority to PCT/CN2021/142846 (published as WO2023103125A1)
Publication of CN116243845A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G06F3/061 Improving I/O performance
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0643 Management of files
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources to service a request
    • G06F9/5011 Allocation of resources, the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, the resource being the memory
    • G06F9/5027 Allocation of resources, the resource being a machine, e.g. CPUs, Servers, Terminals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a CUDA-based data processing method, a computing device and a storage medium. The method comprises the following steps: acquiring input data of at least one task to be processed, wherein the input data comprises data of at least two different data types; determining the total storage space size required by the input data of the at least one task to be processed; according to the total storage space size, applying for a first storage space in a memory of the computing device and a second storage space in a video memory of the computing device; serializing the input data of the at least one task to be processed and storing it in the first storage space; copying the serialized input data from the first storage space to the second storage space for storage; and deserializing the input data stored in the second storage space and performing task calculation on the at least one task to be processed using the input data obtained by the deserialization. The invention can reduce the number of data copies and improve data access efficiency and computing efficiency.

Description

CUDA-based data processing method, computing device and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a CUDA-based data processing method, a computing device, and a storage medium.
Background
In the development of HPC (High Performance Computing) software based on CUDA (Compute Unified Device Architecture), data cannot be loaded directly into the video memory. A memory space of the corresponding size must first be requested in the memory for the input data, and a video memory space of the same size must be requested in the video memory, where the input data are stored for access during graphics card computation.
The general calculation flow is: first, the input data are copied from the memory to the video memory; the graphics card then performs the calculation and stores the output result in the video memory; finally, the result is copied from the video memory back to the memory.
At present, when memory and video memory space are requested for input data, separate requests must be made for each data type, and data copying between the memory and the video memory is likewise performed separately for each data type.
This mode of operation has three main drawbacks:
1. Multiple requests for memory and video memory. Requesting video memory is a relatively time-consuming operation, and under the traditional development mode each data type must request resources of the corresponding type separately.
2. More copies between the memory and the video memory. Data copying between the memory and the video memory is also a time-consuming overhead, and drawback 1 forces the number of copy operations to increase.
3. Unfavorable for coalesced video memory access. Separate requests easily lead to scattered data storage; during graphics card computation the bandwidth advantage of video memory access cannot be fully exploited, data must be fetched from multiple places, and video memory access efficiency is low.
Disclosure of Invention
In order to solve or at least partially solve the problems in the related art, the application provides a CUDA-based data processing method, a computing device and a storage medium, which can reduce the number of data copies and improve data access efficiency and computing efficiency.
A first aspect of the present application provides a CUDA-based data processing method, the method being applied to a computing device, the method comprising:
acquiring input data of at least one task to be processed, wherein the input data comprises data of at least two different data types;
determining the total storage space size required by the input data of the at least one task to be processed;
applying for a first storage space to a memory of the computing device and applying for a second storage space to a video memory of the computing device according to the total storage space size;
the input data of the at least one task to be processed are stored in the first storage space after being serialized;
copying the serialized input data from the first storage space to the second storage space for storage;
and performing deserialization on the input data stored in the second storage space, and performing task calculation on the at least one task to be processed by utilizing the input data obtained by the deserialization.
Preferably, when the at least one task to be processed includes a plurality of tasks to be processed, the determining a total storage space size required for input data of the at least one task to be processed includes:
determining the task type of each task to be processed according to a preset task calculation algorithm;
when the plurality of tasks to be processed are all of the same task type, calculating the size of a storage space required by input data of a target task to be processed in the plurality of tasks to be processed;
and calculating the total storage space size required by the input data of the plurality of tasks to be processed according to the task number of the plurality of tasks to be processed and the storage space size required by the input data of the target tasks to be processed.
Preferably, the method further comprises:
and when the task types of the plurality of tasks to be processed are different, calculating the size of the storage space required by the input data of each task to be processed, and summing to obtain the total size of the storage space required by the input data of the plurality of tasks to be processed.
Preferably, the first storage space and the second storage space are equal in size and are both greater than or equal to the total storage space.
Preferably, the method further comprises:
after performing task calculation on the at least one task to be processed, respectively obtaining corresponding output data;
according to the output data of the at least one task to be processed, applying for a third storage space to a memory of the computing device and applying for a fourth storage space to a video memory of the computing device;
storing output data of the at least one task to be processed in the fourth storage space;
and copying the output data of the at least one task to be processed from the fourth storage space to the third storage space for storage.
Preferably, the determining the total storage space required by the input data of the at least one task to be processed includes:
determining the data type of the output data of each task to be processed according to a preset task calculation algorithm;
and calculating the total storage space size required by the input data and the output data of the at least one task to be processed.
Preferably, the method further comprises:
after performing task calculation on the at least one task to be processed, respectively obtaining corresponding output data;
storing output data of the at least one task to be processed in the second storage space;
and copying the output data of the at least one task to be processed from the second storage space to the first storage space for storage.
Preferably, the serializing the input data of the at least one task to be processed and storing the serialized input data in the first storage space includes:
determining a first address of the first storage space and a basic pointer for indicating the first address;
obtaining a storage address of each input data through transformation and offset of the basic pointer according to the data type and the data size of the input data of the at least one task to be processed;
and storing each input data in the first storage space according to the storage address of each input data.
A second aspect of the present application provides a computing device comprising:
the data acquisition module is used for acquiring input data of at least one task to be processed, wherein the input data comprises data of at least two different data types;
the space determining module is used for determining the total storage space size required by the input data of the at least one task to be processed;
the space application module is used for applying a first storage space to the memory of the computing equipment and applying a second storage space to the video memory of the computing equipment according to the total storage space size;
the data storage module is used for serializing the input data of the at least one task to be processed and storing the serialized input data in the first storage space;
the data copying module is used for copying the serialized input data from the first storage space to the second storage space for storage;
and the task calculation module is used for performing deserialization on the input data stored in the second storage space and performing task calculation on the at least one task to be processed by utilizing the input data obtained by the deserialization.
A third aspect of the present application provides a computing device comprising:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method as described above.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon executable code which, when executed by a processor of a computing device, causes the processor to perform a method as described above.
According to the technical scheme, after input data of different data types of a task to be processed are obtained, the total storage space size required by the input data can be obtained first, and a continuous storage space is applied for each of the memory and the video memory of the computing device according to the total storage space size; after serializing the input data, storing the input data in a storage space of a memory, and copying the input data into the storage space of a video memory for storage; further, the input data can be used for task calculation after being deserialized in the video memory. The method and the device can fuse input data of different data types, store the input data on a continuous memory and video memory, not only reduce the application times of the memory space of the memory and the video memory, improve the cache utilization rate, reduce the cost brought by the application of the memory and the video memory, but also reduce the data copying times, and further improve the data access efficiency and the task computing efficiency.
Drawings
FIG. 1 is a schematic flow chart of a CUDA-based data processing method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of memory arrangement in a data serialization process according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of memory arrangement in another data serialization process according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of another CUDA-based data processing method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a computing device according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of another computing device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the application provides a CUDA-based data processing method, which can be applied to computing equipment such as a computer. As shown in fig. 1, the method may include the steps of:
s110, acquiring input data of at least one task to be processed, wherein the input data comprises data of at least two different data types.
In the embodiment of the application, one task to be processed may include one or more input data, and data types of different input data may be different.
For example, in order to calculate the energy of one chemical bond, it is necessary to know which two atoms (IDs, integer) constitute the chemical bond, the 3-D coordinates of these two atoms (floating point), the elastic coefficient k of the chemical bond (floating point), and the bond length balance value b0 (floating point). The input data of this task to be processed are therefore the IDs of the two atoms {integer, integer}, the coordinates of each atom {floating point, floating point, floating point}, k {floating point} and b0 {floating point}. That is, the input data of the task to be processed include both integer and floating point data.
It will be appreciated that the number of data types included in the input data depends on the task to be processed; there may be 2, 3, 4, 5 or more data types, which is not limited in this embodiment of the present application.
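To make the bond-energy example concrete, the following host-side C++ sketch (names and layout are illustrative, not taken from the patent) shows how the storage required by one task's input data could be totalled field by field using sizeof(type), as the description suggests:

```cpp
#include <cstddef>

// Hypothetical layout for the bond-energy example: two atom IDs
// (integer), two 3-D coordinates (floating point), the elastic
// coefficient k and the bond length balance value b0.
struct BondEnergyInput {
    int atom_id[2];     // IDs of the two atoms forming the bond
    double coord[2][3]; // 3-D coordinates of each atom
    double k;           // elastic coefficient of the bond
    double b0;          // bond length balance value
};

// Storage required by the input data of one bond-energy task,
// computed field by field via sizeof(type).
std::size_t bond_input_bytes() {
    return 2 * sizeof(int)        // two atom IDs
         + 2 * 3 * sizeof(double) // two 3-D coordinates
         + sizeof(double)         // k
         + sizeof(double);        // b0
}
```

On a typical platform with 4-byte int and 8-byte double this comes to 72 bytes per task; the figure is a property of this sketch's layout, not a value stated in the patent.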
S120, determining the total storage space required by the input data of the at least one task to be processed.
In the embodiment of the application, when there is only one task to be processed, the storage space size required by the input data of each data type of that task can be determined first, and the total storage space size required by all the input data of the task can then be obtained by summation. Input data of different data types occupy storage space of different sizes; the size of each data type can be obtained through sizeof(type). When there are multiple tasks to be processed, the total storage space size required by the input data of all the tasks can be calculated according to their task types.
In an alternative embodiment, the task type of each task to be processed may be determined according to a preset task calculation algorithm; when the plurality of tasks to be processed are all of the same task type, calculating the size of a storage space required by input data of a target task to be processed in the plurality of tasks to be processed; and calculating the total storage space size required by the input data of the plurality of tasks to be processed according to the task number of the plurality of tasks to be processed and the storage space size required by the input data of the target tasks to be processed.
Wherein the calculation algorithm for each task to be processed may be set in advance, such as calculating the energy of chemical bonds, calculating the angle between two chemical bonds, calculating the distance between atoms, and so on. When the calculation algorithms of the plurality of tasks to be processed are the same, if the calculation algorithms are all used for calculating the energy of a single chemical bond, the tasks to be processed can be classified into the same task type. The input data of the tasks to be processed of the same task type have the same data type, so that the storage space required by the input data of the tasks to be processed of the same task type is the same. When the calculation algorithms of two tasks to be processed are different, such as one calculation of the energy of the chemical bond and one calculation of the angle between the chemical bonds, the two tasks to be processed belong to different task types.
When the task types of all the tasks to be processed are the same, any one of them can be taken as the target task to be processed, the storage space size n required by the input data of the target task can be calculated, and the total storage space size m × n required by the input data of all the tasks can then be obtained from the number m of tasks to be processed. Thus, for multiple tasks of the same type, the total required storage space can be calculated quickly in batch.
In an alternative embodiment, when the task types of the plurality of tasks to be processed are different, the storage space size required by the input data of each task to be processed is calculated, and the sum is performed to obtain the total storage space size required by the input data of the plurality of tasks to be processed.
The types of data included in the input data of the tasks to be processed of different task types are generally different, so that the required storage space is also different. The size of the storage space required by the input data of each task to be processed can be calculated respectively, and then the size of the storage space required by each task to be processed is added to obtain the total size of the storage space.
It will be appreciated that when some of the tasks to be processed share a task type and others differ, the method of the former embodiment may be used for the tasks of the same type and the method of the latter embodiment for the remaining tasks, with a final summation giving the total storage space size.
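The two cases above can be sketched in host-side C++ as follows; the Task descriptor and function names are illustrative assumptions, not identifiers from the patent:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical per-task descriptor: its task type and the storage
// its input data requires.
struct Task {
    int type;
    std::size_t input_bytes;
};

// Same task type: compute the requirement n of one target task and
// multiply by the task count m, giving m * n.
std::size_t total_bytes_same_type(std::size_t m, std::size_t n) {
    return m * n;
}

// Different task types: sum each task's individual requirement.
std::size_t total_bytes_mixed(const std::vector<Task>& tasks) {
    return std::accumulate(tasks.begin(), tasks.end(), std::size_t{0},
        [](std::size_t acc, const Task& t) { return acc + t.input_bytes; });
}
```

For example, ten bond-energy tasks of 72 bytes each would need 720 bytes, while one 72-byte task plus one 40-byte task of a different type would need 112 bytes.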
S130, applying for a first storage space to a memory of the computing device and applying for a second storage space to a video memory of the computing device according to the total storage space.
In this embodiment of the application, according to the total storage space size required by all the input data, one continuous storage space may be requested in each of the memory and the video memory of the computing device for storing the input data. Compared with the prior art, in which separate requests must be made for each data type, merging the memory and video memory requests reduces the overhead these requests incur. In addition, storing all input data in one continuous space improves cache utilization compared with storing them separately.
The first storage space applied in the memory and the second storage space applied in the video memory can be equal in size and are both larger than or equal to the total storage space required by the input data.
And S140, serializing the input data of the at least one task to be processed and storing the serialized input data in the first storage space.
In an alternative embodiment, a first address of the first memory space and a base pointer for indicating the first address may be determined; according to the data type and the data size of the input data of the at least one task to be processed, obtaining the storage address of each input data through transformation and offset of the basic pointer; and storing each input data in the first storage space according to the storage address of each input data.
Wherein a pointer may be utilized to represent a memory address. Because the input data with different data types are required to be stored in the same storage space, the input data with different data types can be converted into the minimum data unit, and then the data with different types can be stored through pointer transformation and offset.
Specifically, the first address of the first storage space may be represented by a basic pointer, and since the memory is a continuous addressing space in bytes, the basic pointer may be a char pointer. According to the data type and the size of the input data to be stored, for example, the char type occupies 1 byte, the int type occupies 4 bytes, the float type occupies 4 bytes, the double type occupies 8 bytes and the like, the starting position of each input data is obtained through conversion of the first address pointer and offset calculation, and the starting position is stored in the corresponding data pointer, so that the data serialization process is completed.
As shown in fig. 2, through the above operation, one int type data item (data_1) and one double type data item (data_2) are stored together in one continuous memory space. When each data item is stored, it is arranged linearly: the data size may be stored first, followed by the specific content of the data. When there are many tasks to be processed, a task ID may be added before the data size in order to distinguish the input data of different tasks. By means of data serialization, different types of data can be fused and stored in one continuous storage space.
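The Fig. 2 arrangement can be sketched in host-side C++ as below. The base pointer is a char* because the buffer is byte-addressed; this sketch uses std::memcpy at the computed offsets rather than casting the offset pointer directly, to avoid unaligned access (an implementation choice of this sketch, not something mandated by the patent):

```cpp
#include <cstring>
#include <vector>

// Serialize one int (data_1) and one double (data_2) into a single
// contiguous byte buffer, as in the Fig. 2 arrangement.
std::vector<char> serialize(int data_1, double data_2) {
    std::vector<char> buf(sizeof(int) + sizeof(double));
    char* base = buf.data();                  // base pointer = first address
    std::memcpy(base, &data_1, sizeof(int));  // data_1 at offset 0
    std::memcpy(base + sizeof(int), &data_2,  // data_2 offset by sizeof(int)
                sizeof(double));
    return buf;
}

// Deserialize by applying the same offsets to the base pointer.
void deserialize(const std::vector<char>& buf, int& data_1, double& data_2) {
    const char* base = buf.data();
    std::memcpy(&data_1, base, sizeof(int));
    std::memcpy(&data_2, base + sizeof(int), sizeof(double));
}
```

A round trip through serialize and deserialize recovers both values from the one contiguous buffer.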
And S150, copying the serialized input data from the first storage space to the second storage space for storage.
When the input data are copied from the memory to the video memory, the input data stored in the continuous storage space in the memory may be copied to the video memory in a single operation and stored in the block of continuous storage space requested in the video memory. Compared with the traditional approach of separate storage and multiple copies, only one copy operation is needed, which improves data operation efficiency.
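A minimal host-side illustration of the single-copy idea follows: one copy operation moves the entire serialized buffer at once. In the CUDA flow described above the analogous step would be a single host-to-device transfer of the contiguous region (e.g. one cudaMemcpy call) instead of one transfer per data type; the function name here is an assumption of this sketch:

```cpp
#include <cstring>
#include <vector>

// Copy the whole serialized first storage space into the second
// storage space in one operation, rather than once per data type.
std::vector<char> copy_once(const std::vector<char>& first_space) {
    std::vector<char> second_space(first_space.size());
    std::memcpy(second_space.data(), first_space.data(), first_space.size());
    return second_space;
}
```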
S160, performing deserialization on the input data stored in the second storage space, and performing task calculation on the at least one task to be processed by using the input data obtained by the deserialization.
In the embodiment of the application, when the computing device enters the task computing stage, only the first address of the second storage space is required to be acquired, and each input data is read through data deserialization to perform task computing.
Using continuous video memory and placing the input data of adjacent tasks together enables coalesced video memory access, so data need not be fetched from multiple places; this makes full use of the video memory bandwidth and improves video memory access efficiency. Storing the input data of multiple tasks in one continuous space also enables batch (high-throughput) calculation and improves task computing efficiency.
Owing to the CPU and GPU operating mechanisms of the computing device, more content is preloaded into the cache for continuous memory and video memory. Owing to the operating mechanism of NVIDIA graphics cards, placing data together improves access efficiency.
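The batch-calculation idea can be sketched as follows. The inputs of adjacent tasks sit back to back in one buffer, so task i's data is located purely by offset arithmetic from the first address; on the GPU each thread would handle one task, while this host-side loop (with a placeholder per-task computation, each task holding a single double here for simplicity) stands in for the kernel:

```cpp
#include <cstring>
#include <vector>

// Batch computation over a contiguous buffer holding n tasks, each
// with one double of input data laid out back to back.
std::vector<double> batch_square(const std::vector<char>& buf, std::size_t n) {
    std::vector<double> out(n);
    const char* base = buf.data(); // first address of the storage space
    for (std::size_t i = 0; i < n; ++i) {
        double x;
        // Task i's input found by offset from the base pointer.
        std::memcpy(&x, base + i * sizeof(double), sizeof(double));
        out[i] = x * x; // placeholder per-task computation
    }
    return out;
}
```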
In the embodiment of the application, the output data obtained by task calculation may be stored in the continuous storage space already requested in the memory and the video memory, or in newly requested storage space in the memory and the video memory.
In an alternative embodiment, output data obtained by task calculation may be stored separately, and the implementation manner may include: after performing task calculation on the at least one task to be processed, respectively obtaining corresponding output data; according to the output data of the at least one task to be processed, applying for a third storage space to a memory of the computing device and applying for a fourth storage space to a video memory of the computing device; storing output data of the at least one task to be processed in a fourth storage space; and copying the output data of the at least one task to be processed from the fourth storage space to the third storage space for storage.
After the output data of all tasks are obtained by calculation, the total storage space size of all the output data can be calculated, and according to this size a continuous storage space is requested in each of the memory and the video memory; the output data are first stored in the video memory and then copied to the memory.
In an alternative embodiment, the input data and the output data may be stored together, in which case the storage space required by the output data must be taken into account when applying for memory and video memory space. Here, determining the total storage space size required by the input data of the at least one task to be processed in step S120 may be implemented as follows: determining the data type of the output data of each task to be processed according to a preset task calculation algorithm; and calculating the total storage space size required by the input data and the output data of the at least one task to be processed.
For example, suppose the task to be processed is to calculate the energy of one chemical bond, and the input data include the IDs (integers) of the two atoms forming the bond, their coordinates (floating-point numbers), the elasticity coefficient k of the bond (floating-point number), and the equilibrium bond length b0 (floating-point number). First, the spatial distance b between the two atoms is calculated from their coordinates; then the energy of the bond is obtained according to the preset task calculation algorithm k×(b−b0)×(b−b0), i.e., the output data is the energy (a floating-point number).
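The bond-energy example can be written out directly (a sketch following the formula stated above; the function and parameter names are illustrative, not from the patent):

```cpp
#include <cmath>

// Harmonic bond energy as described in the example: b is the spatial
// distance between the two atoms, k the elasticity coefficient, and
// b0 the equilibrium bond length.
double bond_energy(const double a[3], const double b_atom[3],
                   double k, double b0) {
    double dx = a[0] - b_atom[0];
    double dy = a[1] - b_atom[1];
    double dz = a[2] - b_atom[2];
    double b = std::sqrt(dx * dx + dy * dy + dz * dz);  // spatial distance
    return k * (b - b0) * (b - b0);                     // k x (b-b0) x (b-b0)
}
```

For instance, with atoms at (0, 0, 0) and (0, 0, 2), k = 100 and b0 = 1.5, the distance b is 2 and the energy is 100 × 0.5 × 0.5 = 25.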
The total storage space required by the input data and the output data of all the tasks to be processed may be calculated in the same way as the total storage space required by the input data alone, described above; the details are not repeated here.
Further, after task calculation is performed on the at least one task to be processed, the corresponding output data are obtained; the output data of the at least one task to be processed are stored in the second storage space; and the output data of the at least one task to be processed are copied from the second storage space to the first storage space for storage.
As shown in fig. 3, when the task to be processed is to calculate the energy of one chemical bond, the input data, including the ids (int type) of the two atoms, the coordinates (double type) of the two atoms, and the force field parameters k (double type) and b0 (double type), may be stored in a contiguous storage space, followed by the output data, the energy value (double type). The storage layout shown in fig. 3 is merely an example; in practical applications, the order of the input data may be changed. When the data of multiple tasks to be processed need to be stored, the input data of each task to be processed may be stored first, and a block of space reserved at the end to store all the output data together.
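Such a layout for one task can be sketched as follows (the field order is assumed from the fig. 3 description: two int ids, six coordinate doubles, k, b0, then a slot reserved for the energy; the constant and function names are illustrative). The storage address of each field is obtained by offsetting a base pointer into the contiguous buffer and casting it to the field's type — essentially the serialization by pointer transformation and offset described later in this application:

```cpp
#include <cstddef>
#include <cstdlib>

// Byte offsets of each field within one task record, following the field
// order of fig. 3. The two ints occupy 8 bytes in total, so the doubles
// that follow them remain 8-byte aligned.
constexpr size_t kIdsOff     = 0;                              // 2 x int
constexpr size_t kCoordsOff  = kIdsOff + 2 * sizeof(int);      // 6 x double
constexpr size_t kKOff       = kCoordsOff + 6 * sizeof(double);
constexpr size_t kB0Off      = kKOff + sizeof(double);
constexpr size_t kEnergyOff  = kB0Off + sizeof(double);        // output slot
constexpr size_t kRecordSize = kEnergyOff + sizeof(double);

// Serialize one task into the contiguous buffer: every storage address is
// the base pointer plus a computed offset, cast to the field's data type.
void store_task(char* base, size_t task_idx, const int ids[2],
                const double coords[6], double k, double b0) {
    char* rec = base + task_idx * kRecordSize;
    int* id_p = reinterpret_cast<int*>(rec + kIdsOff);
    id_p[0] = ids[0];
    id_p[1] = ids[1];
    double* c = reinterpret_cast<double*>(rec + kCoordsOff);
    for (int i = 0; i < 6; ++i) c[i] = coords[i];
    *reinterpret_cast<double*>(rec + kKOff)  = k;
    *reinterpret_cast<double*>(rec + kB0Off) = b0;
    // rec + kEnergyOff is left empty for the computed energy value.
}
```

Because every offset is known from the data types alone, the same arithmetic can deserialize the record on the GPU side without any per-field metadata.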
By storing the input data and the output data together, the number of storage space applications can be further reduced, lowering the application overhead.
For example, as shown in fig. 4, the input data of all the tasks to be processed are stored on a disk in the form of a file. The input file is parsed to obtain the total number of tasks to be processed and the preset task calculation algorithm (in general, the task calculation algorithm is specified by the user). The memory and video memory storage space required by the tasks is calculated, and according to that size a contiguous storage space is applied for in the memory and in the video memory respectively. The input data are serialized and stored in the storage space of the memory, then copied from the memory to the storage space of the video memory in one pass. When the GPU starts task calculation, the input data are first deserialized in the video memory; the output data are obtained after the GPU performs the task calculation. The output data are stored in the video memory (either in the storage space already applied for there, or in another separately applied-for storage space), then copied from the video memory into the memory for storage (again, either in the storage space already applied for, or in a separately applied-for storage space), converted into the output format specified by the user, and written from the memory to the disk, yielding the output file.
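The flow of fig. 4 can be condensed into the following host-side sketch. It is an illustration under stated assumptions, not the patent's implementation: the "video memory" buffers are simulated with ordinary allocations, the `cudaMemcpy` transfers are replaced by `memcpy`, the kernel launch is replaced by a loop that applies the bond-energy formula to each deserialized record, and the flat record layout (k, b0, then the two atoms' coordinates as eight doubles per task) is assumed for brevity:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Per-task record: k, b0, x1, y1, z1, x2, y2, z2 (8 doubles).
constexpr size_t kDoublesPerTask = 8;

// host_in holds the already-serialized input of num_tasks tasks
// (first storage space); the return value is the output in memory.
std::vector<double> run_pipeline(const std::vector<double>& host_in,
                                 size_t num_tasks) {
    size_t n = num_tasks * kDoublesPerTask;
    // Second storage space: in real code, cudaMalloc + cudaMemcpy H2D.
    double* dev_in = static_cast<double*>(std::malloc(n * sizeof(double)));
    std::memcpy(dev_in, host_in.data(), n * sizeof(double));
    double* dev_out =
        static_cast<double*>(std::malloc(num_tasks * sizeof(double)));
    // Stand-in for the kernel: each "thread" i deserializes record i
    // and computes k * (b - b0) * (b - b0).
    for (size_t i = 0; i < num_tasks; ++i) {
        const double* r = dev_in + i * kDoublesPerTask;
        double k = r[0], b0 = r[1];
        double dx = r[2] - r[5], dy = r[3] - r[6], dz = r[4] - r[7];
        double b = std::sqrt(dx * dx + dy * dy + dz * dz);
        dev_out[i] = k * (b - b0) * (b - b0);
    }
    // Copy the outputs back to memory: in real code, cudaMemcpy D2H.
    std::vector<double> host_out(num_tasks);
    std::memcpy(host_out.data(), dev_out, num_tasks * sizeof(double));
    std::free(dev_in);
    std::free(dev_out);
    return host_out;
}
```

Note that the host-to-device and device-to-host transfers each happen exactly once, regardless of the number of tasks — the point of serializing into one contiguous block.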
With the method provided by the embodiments of the present application, input data of different data types can be fused and stored in contiguous memory and video memory. This reduces the number of memory and video memory space applications, improves cache utilization, lowers the overhead of memory and video memory applications, and reduces the number of data copies, thereby further improving data access efficiency and task calculation efficiency.
The embodiments of the present application provide a computing device that can be used to execute the CUDA-based data processing method provided by the foregoing embodiments. As shown in fig. 5, the computing device may include:
a data obtaining module 510, configured to obtain input data of at least one task to be processed, where the input data includes data of at least two different data types;
a space determining module 520, configured to determine a total storage space required for the input data of the at least one task to be processed;
a space application module 530, configured to apply for a first storage space to a memory of the computing device and apply for a second storage space to a video memory of the computing device according to the total storage space size;
the data storage module 540 is configured to serialize the input data of the at least one task to be processed and store the serialized input data in the first storage space;
a data copying module 550, configured to copy the serialized input data from the first storage space to the second storage space for storage;
the task calculation module 560 is configured to deserialize the input data stored in the second storage space, and perform task calculation on the at least one task to be processed by using the input data obtained by deserializing.
Optionally, when the at least one task to be processed includes a plurality of tasks to be processed, the space determination module 520 may include:
the task determination submodule is used for determining the task type of each task to be processed according to a preset task calculation algorithm;
the first calculation sub-module is used for calculating the storage space size required by the input data of a target task to be processed among the plurality of tasks to be processed when the plurality of tasks to be processed are all of the same task type;
and the second calculation sub-module is used for calculating the total storage space size required by the input data of the plurality of tasks to be processed according to the number of the plurality of tasks to be processed and the storage space size required by the input data of the target task to be processed.
Optionally, the space determination module 520 may further include:
and the third calculation sub-module is used for calculating the size of the storage space required by the input data of each task to be processed when the task types of the plurality of tasks to be processed are different, and summing the storage space to obtain the total size of the storage space required by the input data of the plurality of tasks to be processed.
Optionally, the first storage space and the second storage space may be equal in size, each being greater than or equal to the total storage space size.
Optionally, the computing device shown in fig. 5 may further include:
the first data generating module is configured to obtain corresponding output data after the task calculating module 560 performs task calculation on the at least one task to be processed;
the first data processing module is used for applying a third storage space to the memory of the computing equipment and applying a fourth storage space to the video memory of the computing equipment according to the output data of the at least one task to be processed; storing output data of the at least one task to be processed in a fourth storage space; and copying the output data of the at least one task to be processed from the fourth storage space to the third storage space for storage.
Optionally, the space determination module 520 includes:
the data determining submodule is used for determining the data type of the output data of each task to be processed according to a preset task computing algorithm;
and the fourth calculation sub-module is used for calculating the total storage space size required by the input data and the output data of the at least one task to be processed.
Accordingly, the computing device shown in fig. 5 may further include:
the second data generating module is configured to obtain corresponding output data after the task calculating module 560 performs task calculation on the at least one task to be processed;
the second data processing module is used for storing the output data of the at least one task to be processed in a second storage space; and copying the output data of the at least one task to be processed from the second storage space to the first storage space for storage.
Optionally, the data storage module 540 includes:
an address determination sub-module, used for determining a first address of the first storage space and a base pointer indicating the first address;
an address acquisition sub-module, used for obtaining the storage address of each input data by transforming and offsetting the base pointer according to the data type and the data size of the input data of the at least one task to be processed;
and the data storage sub-module is used for storing each input data in the first storage space according to the storage address of each input data.
The computing device provided by the embodiments of the present application can fuse input data of different data types and store them in contiguous memory and video memory, reducing the number of memory and video memory space applications, improving cache utilization, lowering the overhead of memory and video memory applications, and reducing the number of data copies, thereby further improving data access efficiency and task calculation efficiency.
The embodiments of the present application also provide a computing device that can be used to execute the CUDA-based data processing method provided by the foregoing embodiments. As shown in fig. 6, the computing device 600 may include: a memory 610 and a processor 620.
The processor 620 may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Memory 610 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 620 or other modules of the computer. The persistent storage may be a readable and writable storage device, i.e., a non-volatile storage device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a diskette or optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory. The system memory may store instructions and data needed by some or all of the processors at runtime. Furthermore, memory 610 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), magnetic disks, and/or optical disks. In some implementations, memory 610 may include a readable and/or writable removable storage device, such as a compact disc (CD), a digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., an SD card, a mini SD card, a micro-SD card, etc.), a magnetic floppy disk, and the like. The computer-readable storage medium does not contain carrier waves or transient electronic signals transmitted wirelessly or over wires.
The memory 610 has stored thereon executable code that, when processed by the processor 620, causes the processor 620 to perform some or all of the steps of the methods described above.
Because the present application is based on the CUDA platform, computing device 600 includes, in addition to memory 610 and processor 620, a heterogeneous computing unit (not shown). The heterogeneous computing unit may be a GPU (Graphics Processing Unit), an FPGA, or the like. In one embodiment, the processor 620 may include the heterogeneous computing unit; for example, the processor 620 may include both a CPU and a GPU.
Accordingly, computing device 600 also contains video memory for storing the data needed for GPU calculation. In one embodiment, the memory 610 may include both a memory structure and a video memory structure.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing part or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a computer-readable storage medium (or non-transitory machine-readable storage medium or machine-readable storage medium) having stored thereon executable code (or a computer program or computer instruction code) that, when executed by a processor of a computing device (or server, etc.), causes the processor to perform some or all of the steps of the above-described methods according to the present application.
The embodiments of the present application have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments and their practical application or improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A CUDA-based data processing method, the method being applied to a computing device, the method comprising:
acquiring input data of at least one task to be processed, wherein the input data comprises data of at least two different data types;
determining the total storage space size required by the input data of the at least one task to be processed;
applying for a first storage space to a memory of the computing device and applying for a second storage space to a video memory of the computing device according to the total storage space size;
the input data of the at least one task to be processed are stored in the first storage space after being serialized;
copying the serialized input data from the first storage space to the second storage space for storage;
and performing deserialization on the input data stored in the second storage space, and performing task calculation on the at least one task to be processed by utilizing the input data obtained by the deserialization.
2. The method of claim 1, wherein when the at least one task to be processed comprises a plurality of tasks to be processed, the determining the total storage space size required by the input data of the at least one task to be processed comprises:
determining the task type of each task to be processed according to a preset task calculation algorithm;
when the plurality of tasks to be processed are all of the same task type, calculating the size of a storage space required by input data of a target task to be processed in the plurality of tasks to be processed;
and calculating the total storage space size required by the input data of the plurality of tasks to be processed according to the number of the plurality of tasks to be processed and the storage space size required by the input data of the target task to be processed.
3. The method according to claim 2, wherein the method further comprises:
and when the task types of the plurality of tasks to be processed are different, calculating the size of the storage space required by the input data of each task to be processed, and summing to obtain the total size of the storage space required by the input data of the plurality of tasks to be processed.
4. The method of claim 1, wherein the first storage space and the second storage space are equal in size, each being greater than or equal to the total storage space size.
5. The method according to claim 1, wherein the method further comprises:
after performing task calculation on the at least one task to be processed, respectively obtaining corresponding output data;
according to the output data of the at least one task to be processed, applying for a third storage space to a memory of the computing device and applying for a fourth storage space to a video memory of the computing device;
storing output data of the at least one task to be processed in the fourth storage space;
and copying the output data of the at least one task to be processed from the fourth storage space to the third storage space for storage.
6. The method of claim 1, wherein the determining the total storage space size required by the input data of the at least one task to be processed comprises:
determining the data type of the output data of each task to be processed according to a preset task calculation algorithm;
and calculating the total storage space size required by the input data and the output data of the at least one task to be processed.
7. The method of claim 6, wherein the method further comprises:
after performing task calculation on the at least one task to be processed, respectively obtaining corresponding output data;
storing output data of the at least one task to be processed in the second storage space;
and copying the output data of the at least one task to be processed from the second storage space to the first storage space for storage.
8. The method according to any one of claims 1-7, wherein the serializing the input data of the at least one task to be processed and storing the serialized input data in the first storage space comprises:
determining a first address of the first storage space and a base pointer indicating the first address;
obtaining a storage address of each input data by transforming and offsetting the base pointer according to the data type and the data size of the input data of the at least one task to be processed;
and storing each input data in the first storage space according to the storage address of each input data.
9. A computing device, comprising:
the data acquisition module is used for acquiring input data of at least one task to be processed, wherein the input data comprises data of at least two different data types;
the space determining module is used for determining the total storage space size required by the input data of the at least one task to be processed;
the space application module is used for applying a first storage space to the memory of the computing equipment and applying a second storage space to the video memory of the computing equipment according to the total storage space size;
the data storage module is used for serializing the input data of the at least one task to be processed and storing the serialized input data in the first storage space;
the data copying module is used for copying the serialized input data from the first storage space to the second storage space for storage;
and the task calculation module is used for performing deserialization on the input data stored in the second storage space and performing task calculation on the at least one task to be processed by utilizing the input data obtained by the deserialization.
10. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1-8.
11. A computer readable storage medium having stored thereon executable code which when executed by a processor of a computing device causes the processor to perform the method of any of claims 1-8.
CN202111485569.8A 2021-12-07 2021-12-07 CUDA-based data processing method, computing device and storage medium Pending CN116243845A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111485569.8A CN116243845A (en) 2021-12-07 2021-12-07 CUDA-based data processing method, computing device and storage medium
PCT/CN2021/142846 WO2023103125A1 (en) 2021-12-07 2021-12-30 Cuda-based data processing method, computing device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111485569.8A CN116243845A (en) 2021-12-07 2021-12-07 CUDA-based data processing method, computing device and storage medium

Publications (1)

Publication Number Publication Date
CN116243845A true CN116243845A (en) 2023-06-09

Family

ID=86628240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111485569.8A Pending CN116243845A (en) 2021-12-07 2021-12-07 CUDA-based data processing method, computing device and storage medium

Country Status (2)

Country Link
CN (1) CN116243845A (en)
WO (1) WO2023103125A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9019286B2 (en) * 2012-09-04 2015-04-28 Massimo J. Becker Remote GPU programming and execution method
CN104199927B (en) * 2014-09-03 2016-11-30 腾讯科技(深圳)有限公司 Data processing method and data processing equipment
CN105183562B (en) * 2015-09-09 2018-09-11 合肥芯碁微电子装备有限公司 A method of rasterizing data are carried out based on CUDA technologies to take out rank
CN107818118B (en) * 2016-09-14 2019-04-30 北京百度网讯科技有限公司 Date storage method and device
CN109213745B (en) * 2018-08-27 2022-04-22 郑州云海信息技术有限公司 Distributed file storage method, device, processor and storage medium

Also Published As

Publication number Publication date
WO2023103125A1 (en) 2023-06-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination