WO2023207295A1 - Data processing method, data processing unit, system and related devices - Google Patents

Data processing method, data processing unit, system and related devices (Download PDF)

Info

Publication number
WO2023207295A1
WO2023207295A1 PCT/CN2023/078189 CN2023078189W
Authority
WO
WIPO (PCT)
Prior art keywords
model training
data
dpu
training
operations
Prior art date
Application number
PCT/CN2023/078189
Other languages
English (en)
French (fr)
Inventor
罗先强
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202210934646.1A (published as CN117011117A)
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023207295A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Definitions

  • This application relates to the field of data processing technology, and in particular, to a data processing method, data processing unit, system and related equipment.
  • A model training processor can usually be configured in a computing node, and the model training processor is used to process the model training workload on the computing node, thereby offloading the computing burden of the central processing unit (CPU) in the computing node.
  • For example, the computing node can be configured with a graphics processing unit (GPU), where the GPU can be used to train artificial intelligence (AI) models provided by the CPU, etc.
  • However, the model training workload on the computing node may have strict requirements on processing latency, which makes it difficult for the model training processor's training efficiency to meet the needs of some application scenarios.
  • In a first aspect, embodiments of the present application provide a data processing method.
  • The data processing method is executed by a DPU.
  • The DPU is coupled to the CPU and the model training processor through a system bus, respectively; or the DPU and the model training processor are different chips on one training card, and the CPU can perform data communication with the training card.
  • During data processing, the DPU obtains image data, which includes multiple encoded images, and the DPU then performs the operations in the image processing operation set on the image data to obtain model training data. The operations for processing the image data include the operations in the image processing operation set, performed by the DPU, and the operations in the training operation set, performed by components other than the DPU; the image processing operation set includes at least an image decoding operation, and the training operation set includes at least a model training operation.
  • The DPU then outputs the model training data to the model training processor, where it is used by the model training processor to perform the operations in the training operation set; or the DPU outputs the model training data to the CPU, where it is used by the CPU and the model training processor to perform the operations in the training operation set.
  • Because the model training processor is typically good at the model training operations in the training operation set but not at operations such as image decoding, using the DPU to decode the image data (and perform other operations) is faster than using the model training processor to do so. The DPU therefore improves the efficiency of processing the model training workload and accelerates AI model training.
  • In addition, using the DPU to perform image decoding and other operations on the image data prevents the temporary data generated while training the AI model from occupying the limited memory of the model training processor. The model training processor then has sufficient memory to train the AI model, which also improves training efficiency.
  • a model training processor (such as a GPU) usually needs to perform multiple operations on image data, and these multiple operations form a set.
  • This solution divides this set into multiple subsets and offloads one of the subsets to the DPU for execution; the remaining subsets are executed by the model training processor, or jointly by the model training processor and the CPU.
  • This saves the resources of the model training processor, allowing it to focus its resources on model training instead of spending a large amount of resources on operations such as image decoding; a minimal sketch of this split appears below.
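  • The following is a minimal sketch of this subset split. The operation names, their order, and the split point are illustrative assumptions, not taken from the application.

```python
# Hypothetical full set of operations performed on the image data.
FULL_OPERATION_SET = [
    "image_decode",   # the image processing operation set includes at least this
    "center_crop",
    "resize",
    "augment",
    "normalize",
    "model_train",    # the training operation set includes at least this
]

def split_operations(ops: list[str], dpu_op_count: int):
    """Split the full operation set into a DPU subset and a training subset."""
    image_processing_ops = ops[:dpu_op_count]  # offloaded to the DPU
    training_ops = ops[dpu_op_count:]          # model training processor (and CPU)
    return image_processing_ops, training_ops

dpu_ops, train_ops = split_operations(FULL_OPERATION_SET, dpu_op_count=5)
print("DPU:", dpu_ops)                         # decode through normalize
print("Model training processor:", train_ops)  # model_train
```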
  • In addition, the DPU is often located closer to the user: user data first passes through the DPU (for example, a DPU with integrated network functions) and reaches the model training processor only after being processed by the DPU. This reduces the computational burden of the model training processor, leaving it more computing power with which to accelerate AI model training.
  • In some embodiments, the operations offloaded to the DPU are operations that the DPU performs better than the model training processor: when the DPU and the model training processor perform the same operation, the DPU is more efficient, consumes less power, or takes less time than the model training processor.
  • the DPU includes a network interface, so when the DPU obtains image data, it may specifically obtain image data for training the AI model through a wired network or a wireless network based on the network interface. In this way, access to remote storage devices can be supported to obtain image data, thereby enabling accelerated training of AI models in cloud scenarios.
  • For example, the wired network may be Ethernet or an InfiniBand network.
  • the DPU can be connected to a storage device, so that the DPU can obtain image data from the storage device based on the connection. In this way, accelerated training of AI models can be implemented locally.
  • The storage device may be, for example, one or more of a hard disk drive (HDD), a flash media drive, a shingled magnetic recording (SMR) drive, a storage array, and a storage server; in practical applications it may also be implemented in other ways.
  • the storage device may be a volatile memory, for example, or may be a non-volatile memory.
  • The communication protocol between the storage device and the DPU may include one or more of the small computer system interface (SCSI) protocol, the serial attached SCSI (SAS) protocol, the peripheral component interconnect express (PCIe) protocol, the universal serial bus (USB) protocol, and the non-volatile memory express (NVMe) protocol.
  • The DPU can implement data communication with the storage device based on any one or more of these communication protocols.
  • the DPU, CPU and model training processor are located on the same server.
  • the DPU can be located on a different server from the model training processor, so that the AI model accelerated training service can be provided to the remote model training processor.
  • In some embodiments, the model training processor is one or more of a graphics processing unit (GPU), a neural network processing unit (NPU), and a tensor processing unit (TPU).
  • The system bus used to couple the DPU, CPU, and model training processor, or the system bus used to couple the CPU and the training card, may include one or more of a peripheral component interconnect express (PCIe) bus, a compute express link (CXL) bus, and a non-volatile memory express (NVMe) bus.
  • In some embodiments, the image processing operation set also includes an image data transformation operation.
  • The image data transformation operation may be, for example, one or more of a center cropping operation, a resizing operation, a data enhancement operation, and a normalization operation, or other types of operations. When the DPU performs the operations in the image processing operation set, it can first perform the image decoding operation on the image data to obtain matrix data, and then perform the image data transformation operation on the matrix data to obtain the model training data output to the CPU or to the model training processor; a sketch of such a pipeline appears below.
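  • The following is a minimal sketch of this decode-then-transform pipeline, using Pillow and NumPy purely for illustration; a real DPU would run these steps on its own offload engines, and the 224-pixel target size is an assumption.

```python
import io

import numpy as np
from PIL import Image

def decode(encoded_image: bytes) -> np.ndarray:
    """Image decoding operation: encoded bytes -> HxWx3 matrix data."""
    return np.asarray(Image.open(io.BytesIO(encoded_image)).convert("RGB"))

def center_crop(m: np.ndarray, size: int) -> np.ndarray:
    """Center cropping operation: keep a size x size window around the center."""
    h, w = m.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return m[top:top + size, left:left + size]

def resize(m: np.ndarray, size: int) -> np.ndarray:
    """Resizing operation: scale the matrix data to size x size."""
    return np.asarray(Image.fromarray(m).resize((size, size)))

def image_processing_ops(encoded_image: bytes) -> np.ndarray:
    """Run the image processing operation set, yielding model training data."""
    m = decode(encoded_image)
    m = resize(center_crop(m, min(m.shape[:2])), 224)
    return m.astype(np.float32) / 255.0  # simple rescaling; normalization could follow
```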
  • In some embodiments, the training operation set also includes an image data transformation operation.
  • The model training data is used by the CPU to perform the image data transformation operation and obtain temporary data.
  • The temporary data is used by the model training processor to perform the model training operation. That is, after the DPU performs the image decoding operation on the image data, the resulting model training data is output to the CPU; the CPU performs the image data transformation operation on the model training data and outputs the resulting temporary data to the model training processor; the model training processor then uses this temporary data to train the AI model. In this way, the computing power of the DPU and the CPU can be combined to further accelerate the training of AI models.
  • In some embodiments, the model training processor can output the AI model to the DPU, so that the DPU can send the AI model to a local or remote storage device.
  • The AI model can be stored in the storage device in file format or in key-value (KV) format. In this way, the AI model can be saved locally or in the cloud.
  • In some embodiments, the DPU may output the model training data to other DPUs, where each of the other DPUs is coupled with another CPU and another model training processor through a system bus, or the other DPU and the other model training processor are different chips on one training card.
  • The other DPUs can then output the model training data to the other model training processors, where the model training data is used by the other model training processors to perform the operations in the training operation set.
  • In this way, not only can the DPU be used to accelerate the training of AI models, but the model training data processed by the DPU can also be shared with other DPUs, so that the other DPUs can accelerate model training on the other model training processors based on the shared model training data.
  • In a second aspect, embodiments of the present application also provide a data processing method.
  • The data processing method is executed by a DPU.
  • The DPU is coupled to a CPU and multiple model training processors through a system bus, or the DPU and the multiple model training processors are different chips on one training card.
  • When performing data processing, the DPU obtains image data, which includes multiple encoded images; the DPU then performs the operations in the image processing operation set on the image data to obtain model training data. The operations for processing the image data include the operations in the image processing operation set and the operations in the training operation set; the image processing operation set includes at least an image decoding operation, and the training operation set includes at least a model training operation.
  • The DPU writes the model training data into a shared cache that is accessed by the multiple model training processors.
  • The model training data in the shared cache is used by the multiple model training processors to perform the operations in the training operation set, or by the CPU together with the multiple model training processors.
  • Because the DPU can decode image data (and perform other operations) faster than the model training processor can, the DPU can be used to process the model training workload more efficiently and thereby accelerate AI model training.
  • In addition, using the DPU to perform image decoding and other operations on the image data prevents the temporary data generated while training AI models from occupying the limited memory of the multiple model training processors, leaving the multiple model training processors sufficient memory to train AI models and further improving training efficiency.
  • In a third aspect, embodiments of the present application also provide a data processing method.
  • The data processing method is executed by a target DPU.
  • The target DPU is coupled to the CPU and the model training processor through a system bus, or the target DPU and the model training processor are different chips on one training card.
  • The target DPU obtains image data, which includes multiple encoded images; the target DPU then performs the operations in the image processing operation set on the image data to obtain model training data. The operations for processing the image data include the operations in the image processing operation set and the operations in the training operation set; the image processing operation set includes at least an image decoding operation, and the training operation set includes at least a model training operation.
  • The target DPU writes the model training data into a shared cache pool built from the caches of multiple DPUs, where the multiple DPUs include the target DPU.
  • The model training data in the shared cache pool is used by the model training processor to perform the operations in the training operation set, or by the CPU together with the model training processor; a sketch of such a cache pool appears below.
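  • The following is a minimal sketch of a shared cache pool aggregated from the caches of multiple DPUs. The class names, the dict-backed caches, and the placement policy are illustrative assumptions.

```python
class DpuCache:
    """Stand-in for the cache inside one DPU."""
    def __init__(self, dpu_id: str):
        self.dpu_id = dpu_id
        self.entries: dict[str, object] = {}

class SharedCachePool:
    """Pool built from the caches of multiple DPUs, including the target DPU."""
    def __init__(self, caches: list[DpuCache]):
        self.caches = caches

    def put(self, key: str, model_training_data) -> None:
        # Illustrative policy: place the entry on the least-loaded DPU cache.
        cache = min(self.caches, key=lambda c: len(c.entries))
        cache.entries[key] = model_training_data

    def get(self, key: str):
        for cache in self.caches:
            if key in cache.entries:
                return cache.entries[key]
        raise KeyError(key)

pool = SharedCachePool([DpuCache("target-dpu"), DpuCache("dpu-1")])
pool.put("batch-0", [[0.1, 0.2]])  # written by the target DPU
print(pool.get("batch-0"))         # read on behalf of a model training processor
```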
  • In this way, one or more DPUs can be used to process the model training workload more efficiently and to accelerate AI model training.
  • In addition, using one or more DPUs to perform image decoding and other operations on the image data prevents the temporary data generated while training the AI model from occupying the limited memory of the model training processor, leaving the model training processor sufficient memory to train the AI model and improving its training efficiency.
  • In a fourth aspect, embodiments of the present application provide a first data processing unit (DPU).
  • The first DPU is coupled to a first CPU and a first model training processor through a system bus, respectively, or the first DPU and the first model training processor are different chips on one training card.
  • The first DPU includes: a communication interface for acquiring image data, which includes multiple encoded images; a processing chip for performing the operations in the image processing operation set on the image data to obtain model training data, where the operations for processing the image data include the operations in the image processing operation set and the operations in the training operation set, the image processing operation set includes at least an image decoding operation, and the training operation set includes at least a model training operation; and an output interface circuit for outputting the model training data, where the model training data is used by the first model training processor to perform the operations in the training operation set, or by the first CPU together with the first model training processor.
  • In some embodiments, the image processing operation set further includes an image data transformation operation, and the processing chip is configured to: perform the image decoding operation on the image data to obtain matrix data; and perform the image data transformation operation on the matrix data to obtain the model training data.
  • In some embodiments, the training operation set further includes an image data transformation operation; the model training data is used by the first CPU to perform the image data transformation operation and obtain temporary data, and the temporary data is used by the first model training processor to perform the model training operation.
  • In some embodiments, the communication interface is used to: obtain the artificial intelligence (AI) model output by the first model training processor; and send the AI model to a local or remote storage device, where the AI model is stored in the storage device in file format or in key-value (KV) format.
  • In some embodiments, the output interface circuit is also used to output the model training data to a second DPU. The second DPU is coupled to a second CPU and a second model training processor through a system bus, respectively, or the second DPU and the second model training processor are different chips on the same training card. The second DPU is used to: receive the model training data; and output the model training data to the second model training processor, where it is used by the second model training processor to perform the operations in the training operation set.
  • Since the DPU provided in the fourth aspect corresponds to the data processing method provided in the first aspect, the technical effects of the fourth aspect and of each of its embodiments can be found in the corresponding embodiments of the first aspect and are not described in detail here.
  • In a fifth aspect, embodiments of the present application provide a data processing unit (DPU).
  • The DPU is coupled to a central processing unit (CPU) and multiple model training processors through a system bus, or the DPU and the multiple model training processors are different chips on one training card.
  • The DPU includes: a communication interface for acquiring image data, the image data including a plurality of encoded images; a processing chip for performing the operations in the image processing operation set on the image data to obtain model training data, where the operations for processing the image data include the operations in the image processing operation set and the operations in the training operation set, the image processing operation set includes at least an image decoding operation, and the training operation set includes at least a model training operation; and a data read-write interface for writing the model training data into a shared cache accessed by the multiple model training processors, where the model training data in the shared cache is used by the multiple model training processors to perform the operations in the training operation set, or by the CPU together with the multiple model training processors.
  • In a sixth aspect, embodiments of the present application provide a target data processing unit (DPU).
  • The target DPU is coupled to a central processing unit (CPU) and a model training processor through a system bus, respectively, or the target DPU and the model training processor are different chips on one training card. The target DPU includes: a communication interface for acquiring image data, the image data including a plurality of encoded images; a processing chip for performing the operations in the image processing operation set on the image data to obtain model training data, where the operations for processing the image data include the operations in the image processing operation set and the operations in the training operation set, the image processing operation set includes at least an image decoding operation, and the training operation set includes at least a model training operation; and a data read-write interface for writing the model training data into a shared cache pool built from the caches of multiple DPUs, the multiple DPUs including the target DPU.
  • The model training data in the shared cache pool is used by the model training processor to perform the operations in the training operation set, or by the CPU together with the model training processor.
  • The technical effects of the implementations of the sixth aspect can be found in the technical effects of the corresponding implementations of the third aspect and are not described in detail here.
  • embodiments of the present application provide a DPU, which is used to execute the data processing method executed by the DPU in any implementation manner of the first to third aspects.
  • embodiments of the present application provide a data processing system, which includes the DPU, CPU, and model training processor described in any implementation of the first to third aspects.
  • Embodiments of the present application also provide a chip system.
  • The chip system includes a power supply circuit and a processing circuit.
  • The power supply circuit is used to supply power to the processing circuit.
  • The processing circuit is used to perform the data processing method executed by the DPU in any implementation of the first to third aspects.
  • the power supply circuit may be located in the same chip as the processing circuit, or the power supply circuit may be located in another chip other than the chip where the processing circuit is located.
  • the power supply circuit includes but is not limited to at least one of the following: a power supply subsystem, a power management chip, a power management processor, or a power management control circuit.
  • Embodiments of the present application further provide a computer-readable storage medium in which programs or instructions are stored.
  • When the programs or instructions are run on a computer, the data processing method described in any implementation of the first to third aspects is executed.
  • Embodiments of the present application also provide a computer program product containing instructions that, when run on a computer, cause the computer to execute the data processing method described in any implementation of the first to third aspects.
  • Figure 1a is a schematic structural diagram of an exemplary data processing system provided by an embodiment of the present application.
  • Figure 1b is a schematic structural diagram of another exemplary data processing system provided by an embodiment of the present application.
  • Figure 2 is a schematic flow chart of a data processing method provided by an embodiment of the present application.
  • Figure 3 is a schematic flow chart of another data processing method provided by an embodiment of the present application.
  • Figure 4 is a schematic structural diagram of another data processing system provided by an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of another data processing system provided by an embodiment of the present application.
  • Figure 6 is a schematic flow chart of another data processing method provided by an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of another data processing system provided by an embodiment of the present application.
  • Figure 8 is a schematic structural diagram of yet another data processing system provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a DPU provided by an embodiment of the present application.
  • Figure 10 is a schematic structural diagram of another DPU provided by an embodiment of the present application.
  • Figure 11 is a schematic diagram of the hardware structure of a DPU provided by an embodiment of the present application.
  • Figure 12 is a schematic structural diagram of a chip system provided by an embodiment of the present application.
  • the data processing system 100 may include a data processing unit (DPU) 101, a CPU 102, and a model training processor 103.
  • The DPU 101, the CPU 102, and the model training processor 103 can be coupled through a system bus.
  • The system bus is used to connect components inside the computer and may be any one or more of a peripheral component interconnect express (PCIe) bus, a compute express link (CXL) bus, a non-volatile memory express (NVMe) bus, or other possible buses; this application does not impose restrictions on this.
  • In general, a DPU is a board (card) that can be inserted into a motherboard (such as a server motherboard) through a PCIe interface or another interface slot.
  • the DPU can be integrated on the same board as the model training processor (such as GPU), or it can be an independent board.
  • The DPU has network interfaces, such as a 100 GbE Ethernet interface or an InfiniBand interface.
  • The DPU can also access the local storage media of the computing node, for example accessing a solid state drive (SSD) through the PCIe protocol.
  • The DPU 101 and the CPU 102 can be deployed on the same computing node as the model training processor 103, such as computing node 1 shown in Figure 1a.
  • the computing node 1 may be, for example, a terminal or a server, or may be other devices with computing capabilities.
  • the DPU 101, the CPU 102 and the model training processor 103 can also be deployed on different computing nodes.
  • the DPU 101 is deployed on one computing node, and the CPU 102 and the model training processor 103 are deployed on another computing node.
  • The model training processor 103 may be any one or more of a GPU, a neural network processing unit (NPU), a tensor processing unit (TPU), or other types of processors.
  • In some embodiments, the data processing system 100 may also include a storage device 104, and the computing node 1 communicates with the storage device 104 through the DPU 101, as shown in Figure 1a. Specifically, the computing node 1 accesses the storage device 104 through the DPU 101, such as reading the image data stored in the storage device 104 for training the model through the DPU 101, or sending the trained model to the storage device 104 through the DPU 101 for storage, etc.
  • the DPU 101 can be connected to the storage device 104, such as establishing a connection through an interface or a bus.
  • The communication protocol between the storage device 104 and the DPU 101 may include one or more of the small computer system interface (SCSI) protocol, the serial attached small computer system interface (SAS) protocol, the PCIe protocol, the universal serial bus (USB) protocol (such as USB 3.0 or USB 2.0), the NVMe protocol, or another applicable communication protocol.
  • the DPU 101 may include a network interface, so that the DPU 101 may communicate with the storage device 104 based on a wired network or a wireless network through the network interface.
  • The wired network may be, for example, Ethernet or an InfiniBand network.
  • The storage device 104 may be, for example, one or more of a hard disk drive (HDD), a flash media drive, a shingled magnetic recording (SMR) drive, a storage array, and a storage server.
  • The storage device 104 may be a non-volatile memory (NVM), such as a read-only memory (ROM), a flash memory, or a storage class memory (SCM); alternatively, the storage device 104 may be a volatile memory, such as a random-access memory (RAM).
  • The CPU 102 can instruct the model training processor 103 to train one or more AI models on the computing node 1, such as instructing the model training processor 103 to train a target recognition model, a face recognition model, or a face detection model.
  • During the training of an AI model, the model training processor 103 accesses the image data serving as training samples stored in the storage device 104, stores the accessed image data in its local memory, and then performs operations on the image data in local memory such as decoding, image data transformation (such as data enhancement), and AI model training, saving the temporary data generated by these operations in local memory.
  • Because the memory of the GPU is limited, it is difficult to store the temporary data generated by the GPU processing all of the image data. This not only forces a smaller batch size for iteratively training the AI model, but also means that across multiple rounds of iterative training the GPU must repeatedly read and decode the same training data set in the storage device 104, which increases the time the GPU takes to iteratively train the AI model and makes GPU training of the AI model inefficient; a rough numerical illustration follows.
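  • The following back-of-envelope calculation illustrates this memory pressure. The dataset size and image dimensions are illustrative assumptions, not figures from the application.

```python
num_images = 1_200_000   # assumed size of the training data set
h, w, c = 224, 224, 3    # assumed shape of one decoded image
bytes_per_value = 4      # float32

per_image = h * w * c * bytes_per_value         # ~0.57 MiB per decoded image
full_set_gib = num_images * per_image / 2**30   # ~673 GiB for the decoded set
batch_256_mib = 256 * per_image / 2**20         # ~147 MiB per 256-image batch

print(f"decoded set: {full_set_gib:.0f} GiB; one batch: {batch_256_mib:.0f} MiB")
# The decoded set far exceeds typical GPU memory, hence the smaller batch
# sizes and the repeated re-reading and re-decoding across training rounds.
```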
  • To this end, the embodiment of the present application provides a data processing method that uses the DPU 101 to cooperate with the model training processor 103 in processing the workload, thereby improving the training efficiency of AI models on the computing node 1.
  • Specifically, the DPU 101 can obtain the image data required for training the AI model (i.e., the training samples of the AI model) stored in the storage device 104, where the image data includes multiple encoded images.
  • The operations for processing the image data include a variety of operations, which can be classified into an image processing operation set performed by the DPU and a training operation set performed by the model training processor 103 (or by the model training processor 103 together with the CPU 102).
  • The image processing operation set includes at least an image decoding operation.
  • The training operation set includes at least a model training operation.
  • the DPU 101 can perform operations in the image processing operation set on the acquired image data, obtain model training data, and output the model training data.
  • the model training processor 103 can continue to perform operations in the training operation set on the model training data to implement training of the AI model.
  • the CPU 102 and the model training processor 103 sequentially execute different operations in the training operation set on the model training data to implement training of the AI model.
  • Since the model training processor 103 is usually good at executing the model training operations in the training operation set but relatively poor at operations such as image decoding, using the DPU 101 to decode the image data (and perform other operations) is faster than using the model training processor 103 to do so. Using the DPU 101 therefore improves the efficiency of processing the model training workload and accelerates AI model training on computing node 1.
  • Moreover, using the DPU 101 to perform image decoding and other operations on the image data prevents the temporary data generated while training the AI model from occupying the limited memory of the model training processor 103, leaving the model training processor 103 sufficient memory to train the AI model. This also improves training efficiency and thereby improves the overall efficiency of processing the model training workload on computing node 1.
  • Taking the model training processor 103 being a GPU as an example, the DPU 101 can read the image data required for training the AI model from the storage device 104 and, in the memory of the DPU 101, sequentially perform decoding, resizing, data enhancement, and similar operations on the image data (these being the operations in the image processing operation set above) to obtain model training data; the DPU 101 then outputs the model training data to the GPU.
  • The GPU can store the model training data in its local memory and use it to train the AI model. In this way, while the AI model training workload is processed, the memory occupied by the temporary data generated by the decoding, resizing, and data enhancement operations on the image data is the memory of the DPU.
  • The GPU therefore has enough memory to store the model training data output by the DPU 101 (that is, the data obtained through the data enhancement operation), so the GPU can iteratively train the AI model with a larger batch size, accelerating AI model training.
  • Moreover, during multiple rounds of iterative training, the GPU can read the model training data directly from the memory of the DPU 101, so there is no need to repeatedly read the image data serving as training samples from the storage device 104 and repeatedly decode, resize, and data-enhance it. This not only reduces resource consumption but also further speeds up the processing of the AI model training workload.
  • Furthermore, while the GPU trains the AI model, the DPU 101 can in parallel perform decoding, resizing, data enhancement, and similar operations on other image data, so that after completing one round of AI model training, the GPU can promptly obtain the next batch of training samples and continue training the AI model. This further speeds up AI model training and frees GPU training from its input/output constraints (IO bound) and computing power constraints; a sketch of such pipelining appears below.
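  • The following is a minimal sketch of this pipelining: a DPU-side thread preprocesses the next batch while the trainer consumes the current one. The names, the stand-in work, and the queue-based handoff are illustrative assumptions.

```python
import queue
import threading

def dpu_preprocess(batches, out: queue.Queue) -> None:
    """DPU side: decode/resize/augment each batch and hand it to the trainer."""
    for batch in batches:
        model_training_data = [f"decoded({img})" for img in batch]  # stand-in work
        out.put(model_training_data)  # blocks if the trainer falls behind
    out.put(None)                     # end-of-data marker

def gpu_train(in_q: queue.Queue) -> None:
    """Model training processor side: consume batches as they become ready."""
    while (batch := in_q.get()) is not None:
        print("training on", batch)  # stand-in for one training round

handoff = queue.Queue(maxsize=2)     # bounded: at most two batches in flight
producer = threading.Thread(
    target=dpu_preprocess, args=([["img0", "img1"], ["img2", "img3"]], handoff)
)
producer.start()
gpu_train(handoff)
producer.join()
```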
  • the system architecture shown in Figure 1a is only an example and is not intended to limit its specific implementation to this example.
  • In other possible implementations, as shown in Figure 1b, the DPU 101 and the model training processor 103 are different chips on a training card 200; the CPU can perform data communication with the training card 200, and on the training card 200 the DPU 101 and the model training processor 103 can be connected through an on-chip bus.
  • Alternatively, the computing node 1 in Figure 1a may also include a larger number of, or more types of, model training processors.
  • the DPU 101 can be connected to multiple computing nodes at the same time, thereby enabling accelerated model training business processing for multiple computing nodes, which is not limited in this embodiment.
  • the data processing system 100 including the above-mentioned DPU 101, CPU 102, model training processor 103, and storage device 104 can be suitable for centralized storage application scenarios or distributed storage application scenarios.
  • one or more computing nodes may form a central node, and all data processing services of the entire data processing system 100 are centrally deployed on this central node.
  • a disk-control separation architecture can be adopted between the computing node 1 and the storage device 104 , that is, the computing node 1 and the storage device 104 can be deployed independently; or an integrated disk-control architecture can be adopted between the computing node 1 and the storage device 104 , that is, The computing node 1 may have a slot, and the storage device 104 is placed in the computing node 1 through the slot and deployed integrated with the computing node 1 .
  • the data in the data processing system 100 can be distributed and stored on multiple independent storage devices 104 .
  • Alternatively, the computing node 1 can be deployed integrated with the storage device 104, so that the computing node has both computing and storage capabilities, and a virtual machine may or may not be created on the computing node 1.
  • a storage-computation separation architecture may be adopted between the computing node 1 and the storage device 104, that is, the computing node 1 and the storage device 104 are independently deployed and communicate through a network.
  • the storage device 104 may include one or more different storage media, which is not limited in this embodiment.
  • Figure 2 is a schematic flowchart of a data processing method in an embodiment of the present application.
  • This method can be applied to the data processing system 100 shown in Figure 1a or Figure 1b. In practical applications, this method can also be applied to other applicable data processing systems.
  • the following is an example of an application to the data processing system 100 shown in Figure 1a.
  • the method may specifically include:
  • S201: The DPU 101 accesses the storage device 104 to obtain the image data used for training the AI model, where the image data includes multiple encoded images.
  • Specifically, the storage device 104 can store image data serving as AI model training samples. The image data is used to train one or more AI models on the computing node 1, such as target recognition models and face detection models, so that when processing a model training task, the DPU 101 can access the storage device 104 to obtain the image data stored on it for training the AI model.
  • the DPU 101 can access the storage device 104 under the control of the CPU 102 .
  • For example, the CPU 102 may expose a client to the outside; the client may be, for example, a web browser, or an application running on the user side for interacting with the user, so that the CPU 102 receives through the client the user's instruction information for training the AI model. The CPU 102 can then determine, based on the instruction information, the storage location in the storage device 104 of the image data used to train the AI model, and generate a training instruction including that storage location, so that the CPU 102 can issue the training instruction to the DPU 101 (a sketch of such a training instruction appears below).
  • the DPU 101 responds to the training instruction and accesses the image data in the storage device 104 according to the storage location in the training instruction.
  • the DPU 101 can also be triggered to access the storage device 104 through other methods, which is not limited in this embodiment.
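  • The following is a minimal sketch of a training instruction carrying the storage location, as described above. The field names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class TrainingInstruction:
    model_name: str    # AI model to train, e.g. a face detection model
    storage_path: str  # storage location of the image data in storage device 104
    batch_size: int    # batch size for iterative training

# Issued by the CPU 102; the DPU 101 responds by reading the image data
# from instruction.storage_path.
instruction = TrainingInstruction("face_detection", "/datasets/faces/", 256)
```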
  • S202: The DPU 101 performs the operations in the image processing operation set on the accessed image data to obtain model training data, where the operations for processing the image data include the operations in the image processing operation set and the operations in the training operation set.
  • The image processing operation set includes at least an image decoding operation.
  • The training operation set includes at least a model training operation.
  • S203: The DPU 101 outputs the model training data to the model training processor 103.
  • It can be understood that the image data acquired by the DPU 101 consists of multiple encoded images. Therefore, in the process of using the image data to train the AI model, the operations performed on the image data can include an image decoding operation and a model training operation, among others. The image decoding operation is used to decode the multiple encoded images and usually yields data in matrix form; in some application scenarios, the model training processor 103 can directly use the data in matrix form as the input of the AI model to train it (i.e., to perform the model training operation).
  • In other application scenarios, the operations performed on the image data may also include an image data transformation operation.
  • The image data transformation operation may be, for example, one or more of a center cropping operation, a resizing operation, a data enhancement operation, and a normalization operation.
  • the center cropping operation refers to cropping image data (or matrix data obtained after decoding the image data) into image data of a preset size to meet the size requirements for image data in the model training business.
  • the resizing operation refers to scaling the image data (or matrix data obtained after decoding the image data) to adjust the size of the image data.
  • Data enhancement operation refers to changing the size, pixel value, or perspective of the image data (or the matrix data obtained after decoding the image data) (such as image flipping, rotation, translation, scaling, etc.).
  • The normalization operation refers to normalizing the data of each color channel in the image data (or the matrix data obtained after decoding the image data), for example subtracting the average pixel value of a color channel from the value of each pixel on that channel and dividing by the variance of the pixel values of that channel; a worked sketch of per-channel normalization follows.
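  • The following is a worked sketch of per-channel normalization with NumPy. Note that while the text above says "divided by the variance," implementations commonly divide by the standard deviation, which is what is shown here.

```python
import numpy as np

# Matrix data obtained after decoding one image (values are illustrative).
image = np.random.randint(0, 256, size=(224, 224, 3)).astype(np.float32)

mean = image.mean(axis=(0, 1))  # average pixel value of each color channel
std = image.std(axis=(0, 1))    # spread of pixel values of each channel

normalized = (image - mean) / std    # subtract channel mean, divide per channel
print(normalized.mean(axis=(0, 1)))  # ~0 for every channel after normalization
```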
  • In practical applications, the above-mentioned multiple operations performed on the image data may be divided in advance into operations performed by the DPU 101 and operations performed by other processors (such as the model training processor 103).
  • The operations performed by the DPU 101 among the multiple operations are classified into the image processing operation set, and the operations performed by the other processors are classified into the training operation set.
  • In this way, after acquiring the image data, the DPU 101 can perform the operations in the image processing operation set on it and generate corresponding data, hereinafter referred to as model training data for ease of differentiation.
  • For example, suppose the multiple operations performed on the image data include an image decoding operation and a model training operation.
  • The DPU 101 can perform the image decoding operation on the image data, and the resulting matrix data is the model training data.
  • The model training processor 103 can then perform the model training operation on the model training data.
  • That is, the model training data can be used as the input of the AI model to train the AI model.
  • In another example, the multiple operations performed on the image data include, in addition to the image decoding operation and the model training operation, one or more image data transformation operations, such as the above-mentioned center cropping operation, resizing operation, data enhancement operation, and normalization operation.
  • At this time, the number of operations that the DPU 101 needs to perform can be determined based on the processing capability of the DPU 101. For example, assume the operations performed on the image data include an image decoding operation, a resizing operation, a data enhancement operation, and a model training operation. When the DPU 101 has higher computing power and more memory resources, its processing capability is stronger; the image decoding, resizing, and data enhancement operations can then all be classified into the image processing operation set performed by the DPU 101, and the model training operation into the training operation set executed by the model training processor 103, thereby reducing the computational load of the model training processor 103.
  • In this case the image processing operation set includes multiple types of operations, and by executing them the DPU 101 obtains the model training data output to the model training processor 103.
  • Conversely, when the computing power of the DPU 101 is low or its memory resources are small, its processing capability is weaker.
  • In that case the image decoding operation and the resizing operation can be classified into the image processing operation set, and the data enhancement operation and the model training operation into the training operation set, so that the computational burden of the model training processor 103 is reduced as far as the processing capability of the DPU 101 allows.
  • Alternatively, the number of operations that the DPU 101 needs to perform can be determined based on the current load of the model training processor 103 (the model training processor 103 may, for example, be training multiple AI models at the same time). For example, when the load of the model training processor 103 is large, the DPU 101 can perform the image decoding, resizing, and data enhancement operations while the model training processor 103 performs the model training operation, reducing the amount of computation the model training processor 103 requires during AI model training and avoiding excessive load on it.
  • When the load of the model training processor 103 is small, the DPU 101 can perform only the image decoding operation, or only the image decoding and resizing operations, with the model training processor 103 performing the remaining operations. This keeps the resource utilization of the model training processor 103 at a high level and avoids the waste caused by leaving excessive resources on the model training processor 103 idle; a sketch of such a capability- and load-aware split appears below.
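  • The following is a minimal sketch of choosing the DPU's share of the operations from its processing capability and the trainer's load. The thresholds and the 0-to-1 inputs are illustrative assumptions.

```python
# Offloadable operations, ordered as they are applied.
CANDIDATE_OPS = ["image_decode", "resize", "augment"]

def ops_for_dpu(dpu_capability: float, trainer_load: float) -> list[str]:
    """Return the operations the DPU should perform (inputs in [0, 1])."""
    if dpu_capability > 0.8 or trainer_load > 0.8:
        return CANDIDATE_OPS      # strong DPU or busy trainer: offload all three
    if dpu_capability > 0.5:
        return CANDIDATE_OPS[:2]  # mid-range DPU: decode and resize only
    return CANDIDATE_OPS[:1]      # weak DPU: image decoding only

print(ops_for_dpu(dpu_capability=0.9, trainer_load=0.3))  # all three operations
print(ops_for_dpu(dpu_capability=0.4, trainer_load=0.3))  # ['image_decode']
```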
  • S204: The model training processor 103 performs the operations in the training operation set on the model training data.
  • In this embodiment, the DPU 101 and the model training processor 103 collaboratively process the model training workload, so after the DPU 101 outputs the model training data, the model training processor 103 can continue to process it and thereby complete the workload.
  • For example, when the training operation set includes only the model training operation, the model training processor 103 can directly use the model training data output by the DPU 101 for model training.
  • Specifically, the model training data can be input into the AI model, and the parameters in the AI model can be updated based on the inference results output by the AI model, thereby completing a single round of training of the AI model.
  • For another example, when the training operation set includes an image data transformation operation and a model training operation, the model training processor 103 can first perform the image data transformation operation on the model training data to obtain temporary data; the model training processor 103 then performs the model training operation on the temporary data to train the AI model.
  • the model training processor 103 can also return the trained AI model to the CPU 102, so that the CPU 102 feeds back the AI model to the upper-layer application or the client that interacts with the CPU 102.
  • the model training processor 103 may return the trained AI model to the DPU 101 so that the DPU 101 sends the AI model to the storage device 104 for storage, etc., which is not limited in this embodiment.
  • For example, when the model training processor 103 feeds the AI model back to the CPU 102, the CPU 102 can write the AI model into a local storage area, such as a hard disk on the computing node 1.
  • When the model training processor 103 feeds the AI model back to the DPU 101: if the DPU 101 and the storage device 104 communicate through a wired or wireless network, the DPU 101 can generate a corresponding file from the AI model and send the file through the network interface on the DPU 101 to the remote storage device 104, so that the AI model is stored in the storage device 104 in file format.
  • Alternatively, after the DPU 101 sends the AI model to the remote storage device 104, the storage device 104 can save the AI model in key-value (KV) form, where the key is created by the storage device 104 and the value is the AI model. If the DPU 101 and the storage device 104 are connected through a PCIe bus or by other means, the DPU 101 can send the AI model to the local storage device 104 in file format or KV format; a sketch of both storage paths follows.
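  • The following is a minimal sketch of the two storage paths: saving the AI model in file format, or in key-value (KV) format where the storage device creates the key. The pickle serialization and the dict-backed KV store are illustrative assumptions.

```python
import pickle

def save_as_file(model, path: str) -> None:
    """File format: generate a file from the model and write/send it."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def save_as_kv(model, kv_store: dict) -> str:
    """KV format: the storage device creates the key; the value is the model."""
    key = f"ai-model-{len(kv_store)}"  # key created on the storage-device side
    kv_store[key] = pickle.dumps(model)
    return key

store: dict = {}
key = save_as_kv({"weights": [0.1, 0.2]}, store)    # KV path
save_as_file({"weights": [0.1, 0.2]}, "model.bin")  # file path
```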
  • In this embodiment, because the DPU 101 performs some of the operations required to process the image data, such as image decoding, the operations the model training processor 103 must perform are reduced. This not only uses the hardware of the DPU 101 to accelerate the processing of model training data; in addition, the model training data generated by the DPU 101 while executing the operations in the image processing operation set occupies the memory of the DPU 101, so even if the local memory of the model training processor 103 is limited, it still has sufficient memory to train the AI model on the model training data. This prevents the limited memory of the model training processor 103 from degrading the efficiency with which it trains the AI model, thereby accelerating AI model training.
  • Moreover, the model training data generated by the DPU 101 can be stored in the memory of the DPU 101, so that when the model training processor 103 needs the model training data again (for example, when reusing the same data set to iteratively train the AI model), it can read the data directly from the memory of the DPU 101 without the DPU 101 re-reading the image data from the storage device 104 and re-performing image decoding and other operations, further accelerating AI model training.
  • The above embodiment is described using the example in which the DPU 101 and the model training processor 103 jointly process the model training workload.
  • In other embodiments, the combined computing power of the DPU 101, the CPU 102, and the model training processor 103 can also be used to accelerate the training of the AI model.
  • Figure 3 shows a schematic flow chart of another data processing method according to an embodiment of the present application. As shown in Figure 3, the method may specifically include:
  • S301: The DPU 101 accesses the storage device 104 to obtain the image data used for training the AI model, where the image data includes multiple encoded images.
  • S302: The DPU 101 performs the operations in the image processing operation set on the accessed image data to obtain model training data, where the operations for processing the image data include the operations in the image processing operation set and the operations in the training operation set; the image processing operation set includes at least an image decoding operation, and the training operation set includes at least an image data transformation operation and a model training operation.
  • For the specific implementation of steps S301 to S302, please refer to the related descriptions of steps S201 and S202 in the foregoing embodiment, which are not repeated here.
  • S303: The DPU 101 outputs the model training data to the CPU 102.
  • In this embodiment, after the DPU 101 outputs the model training data, the CPU 102 takes over and continues processing the model training data.
  • S304: The CPU 102 performs the image data transformation operation in the training operation set on the model training data to obtain temporary data.
  • the image data transformation operation may be, for example, one or more of a center cropping operation, a size adjustment operation, a data enhancement operation, and a normalization operation.
  • When the image data transformation operation includes multiple operations, the CPU 102 can execute the multiple operations sequentially on the model training data to obtain the temporary data output to the model training processor 103.
  • S305: The model training processor 103 performs the model training operation in the training operation set on the temporary data.
  • After the model training processor 103 completes training the AI model with the temporary data, if the AI model meets the training termination conditions, for example the number of training iterations reaches a preset number or the AI model converges, the model training processor 103 can return the trained AI model to the CPU 102, so that the CPU 102 can feed the AI model back to an upper-layer application or to the client that interacts with the CPU 102.
  • the model training processor 103 may return the trained AI model to the DPU 101 so that the DPU 101 sends the AI model to the storage device 104 for storage, etc., which is not limited in this embodiment.
  • In the embodiment shown in Figure 2, the computing power of the DPU 101 and the model training processor 103 is sufficient for training the AI model, so the AI model can be trained using only the DPU 101 and the model training processor 103 while keeping training efficiency at a high level.
  • In the embodiment shown in Figure 3, the computing power required to train the AI model is relatively high.
  • The computing power of the DPU 101, the CPU 102, and the model training processor 103 can therefore be combined to accelerate the training of the AI model, enabling training efficiency to reach an even higher level.
  • In the above embodiments, the DPU 101 and the CPU 102 serve a single model training processor 103; in other possible embodiments, the service of accelerating AI model training can be provided to multiple model training processors 103 at the same time.
  • For example, the DPU 101, the CPU 102, and the multiple model training processors 103 can be deployed on the same computing node, as shown in Figure 4.
  • At this time, the DPU 101 can be coupled with the multiple model training processors 103 in the computing node through the system bus and provide model training data to the multiple model training processors 103.
  • Alternatively, the DPU 101, the CPU 102, and the multiple model training processors 103 can be deployed on different computing nodes. As shown in Figure 5, the DPU 101, the CPU 102, and at least one model training processor 103 are deployed on computing node 1, and the remaining model training processors 103 may be deployed on computing node 2 and computing node 3 respectively, where computing node 2 and computing node 3 each also include a CPU and other hardware (not shown in Figure 5). At this time, the DPU 101 may be coupled with the model training processors 103 in the multiple computing nodes through at least one of the PCIe bus, the CXL bus, and the NVMe bus (or other buses). Moreover, the multiple model training processors 103 may be processors of the same type, such as GPUs; or they may include multiple processors of different types, such as GPUs, NPUs, and TPUs.
  • the method may specifically include:
  • The DPU 101 receives a training instruction for the AI model.
  • The DPU 101 acquires, from the storage device 104 according to the training instruction, the image data serving as model training samples.
  • The image data includes multiple encoded images.
  • In this embodiment, the CPU on any computing node can issue the training instruction to the DPU 101; alternatively, one of the multiple computing nodes shown in Figure 5 can serve as an agent node responsible for external communication for the system, for example receiving the user's instruction to train the AI model sent through the client and presenting the trained AI model to the client, so that the CPU in the agent node issues the training instruction to the DPU 101.
  • the model training processors 103 in multiple computing nodes can implement distributed training of the same AI model, or different model training processors 103 are responsible for training different AI models.
  • the training instruction received by the DPU 101 may instruct the use of heterogeneous processors on multiple computing nodes, so that the DPU 101 provides image data that has undergone image data transformation operations to the multiple heterogeneous processors indicated by the training instruction.
  • the training instruction may also indicate the storage location of the image data used as the AI model training sample on the storage device 104, so that the DPU 101 accesses the image data based on the storage location in the training instruction.
S603: The DPU 101 performs the image decoding operation on the acquired image data to obtain matrix data.
S604: The DPU 101 performs image data transformation operations, such as the size adjustment operation and the data enhancement operation, on the matrix data to obtain the model training data (that is, the matrix data after the image data transformation operations).
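The following host-side Python sketch illustrates the data flow of S603 and S604, assuming the Pillow and NumPy packages are available; the DPU performs these operations in its own hardware and memory, not in Python, so this is for illustration only.

```python
import io

import numpy as np
from PIL import Image  # Pillow; used here only to illustrate the data flow

def decode_image(encoded: bytes) -> np.ndarray:
    """S603: image decoding operation, encoded bytes -> (H, W, C) matrix."""
    return np.asarray(Image.open(io.BytesIO(encoded)).convert("RGB"))

def transform(matrix: np.ndarray, size=(224, 224)) -> np.ndarray:
    """S604: size adjustment + a simple data enhancement + normalization."""
    img = Image.fromarray(matrix).resize(size)   # size adjustment operation
    arr = np.asarray(img, dtype=np.float32)
    if np.random.rand() < 0.5:                   # data enhancement operation:
        arr = arr[:, ::-1, :]                    # random horizontal flip
    return (arr / 255.0 - 0.5) / 0.5             # normalization operation
```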
S605: The DPU 101 stores the model training data into the shared cache in the DPU 101.
In this embodiment, the DPU 101 may be configured with a shared cache, and the shared cache may be accessed by multiple computing nodes. In this way, after the DPU 101 writes the matrix data after the image data transformation operations into the shared cache, the model training processors 103 in the multiple computing nodes can obtain the matrix data serving as model input from the shared cache. Alternatively, in other possible embodiments, the DPU 101 may also send the matrix data after the image data transformation operations to the computing nodes one by one, so that the model training processor 103 in each computing node obtains that matrix data. This embodiment places no limitation on the specific manner in which the model training processor 103 of each computing node obtains the matrix data after the image data transformation operations.
S606: The model training processors 103 in the multiple computing nodes perform the training operation of the AI model based on the matrix data in the shared cache, completing the training business of the AI model.
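A toy version of the shared cache of S605 and S606 can be sketched as follows. A real DPU 101 cache would live in device memory reachable over the system bus or network, so the in-process dictionary and lock here are stand-ins only.

```python
import threading

class SharedCache:
    """Toy stand-in for the DPU's shared cache of step S605."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, model_training_data):
        with self._lock:
            self._data[key] = model_training_data

    def get(self, key):
        with self._lock:
            return self._data.get(key)

cache = SharedCache()
cache.put(("epoch0", "batch0"), [[0.1, 0.2]])   # DPU 101 writes (S605)
batch = cache.get(("epoch0", "batch0"))         # trainers read (S606)
print(batch)
```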
In this way, by using the DPU 101 to perform the image decoding operation and the image data transformation operations on the image data, not only is the time required to perform these operations reduced, but the memory space occupied by the temporary data generated by performing them is the memory space of the DPU 101. As a result, even if the local memory of the model training processor 103 in each computing node is limited, the model training processor 103 can still have enough memory space to store the matrix data provided by the DPU 101 and use that matrix data to train the AI model, so the model training processor 103 can use a larger batch size to iteratively train the AI model and thereby accelerate the training.
Moreover, the model training data generated by performing the image decoding and image data transformation operations on the image data can be stored in the shared cache of the DPU 101. During iterative training of the AI model, the model training processor 103 in each computing node can directly read the model training data from the shared cache of the DPU 101 in each iteration, which eliminates the need for the DPU 101 to repeatedly read the image data in the storage device 104 and repeatedly perform the image decoding and image data transformation operations on it. This not only reduces resource consumption, but also further accelerates the processing efficiency of the AI model training business.
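The effect of the shared cache on repeated epochs can be illustrated with a short sketch: the first request for a batch pays the storage read and preprocessing cost, and every later epoch is served from the cache. The dictionaries and the `preprocess` callback below are illustrative assumptions.

```python
def get_batch(cache: dict, storage: dict, batch_id: str, preprocess):
    """Return preprocessed model training data for one batch.

    Epoch 1: read the encoded images from `storage` (standing in for
    storage device 104), preprocess (decode + transform) and cache them.
    Epochs 2 and later: served straight from `cache` with no storage I/O.
    """
    data = cache.get(batch_id)
    if data is None:
        data = [preprocess(img) for img in storage[batch_id]]
        cache[batch_id] = data
    return data

cache = {}
storage = {"batch0": [b"\xff\xd8jpeg0", b"\xff\xd8jpeg1"]}  # fake encoded images
get_batch(cache, storage, "batch0", preprocess=len)   # epoch 1: decodes and caches
get_batch(cache, storage, "batch0", preprocess=len)   # epoch 2: cache hit, no re-decode
```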
In addition, when the amount of image data is large, the DPU 101 can read the image data in batches and perform the corresponding image decoding and image data transformation operations per batch. In this way, while the model training processor 103 uses the matrix data corresponding to the current batch of image data to train the AI model, the DPU 101 can continue to read the next batch of image data from the storage device 104 and perform the image decoding and image data transformation operations on it, so that after the model training processor 103 completes one training pass of the AI model, it can obtain the next batch of model training data in time and continue training. By parallelizing the image decoding operation and the image data transformation operations with the model training operation in this way, the training efficiency of the AI model can be further improved.
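One common way to realize this parallelism is a bounded producer/consumer queue, sketched below; the sleeps stand in for real decode/transform and training work, and none of this structure is mandated by the embodiments.

```python
import queue
import threading
import time

batches = queue.Queue(maxsize=2)   # bounded: the DPU stays one step ahead

def dpu_producer(num_batches: int):
    """Stands in for the DPU: read + decode + transform the next batch
    while the trainer consumes the current one."""
    for i in range(num_batches):
        time.sleep(0.01)                 # pretend decode/transform work
        batches.put(f"model_training_data_{i}")
    batches.put(None)                    # end-of-data sentinel

def trainer_consumer():
    """Stands in for the model training processor."""
    while (batch := batches.get()) is not None:
        time.sleep(0.02)                 # pretend one training iteration
        print("trained on", batch)

threading.Thread(target=dpu_producer, args=(3,), daemon=True).start()
trainer_consumer()
```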
It is worth noting that the data processing method shown in Figure 6 is only an example and is not intended to limit the process by which the DPU 101 provides accelerated AI model training to one or more model training processors 103.
For example, in other possible embodiments, multiple model training processors 103 can be deployed on the same computing node, so that the matrix data in the shared cache in the DPU 101 can be accessed by the multiple model training processors 103, thereby providing all of them with the service of accelerating AI model training.
As another example, in the data processing system 700 shown in Figure 7, a shared cache pool with larger storage space can be built based on multiple DPUs 101. The shared cache pool can include the shared caches in the multiple DPUs 101, and after any DPU 101 performs the image decoding operation (and the image data transformation operations) on the image data it has obtained, it can write the resulting model training data into the shared cache pool for storage; the model training processors 103 in multiple computing nodes are supported in accessing the model training data in the shared cache pool. In this way, the computing power available to the model training processors 103 is further expanded based on the multiple DPUs 101, further improving the efficiency of AI model training.
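A minimal sketch of such a pool follows, assuming a simple hash-based placement policy; the embodiments do not specify how keys are distributed across the DPU caches.

```python
class CachePool:
    """Toy shared cache pool built from several per-DPU caches; keys are
    placed by a simple hash, which is an illustrative policy only."""
    def __init__(self, dpu_caches):
        self.caches = dpu_caches          # e.g. one dict per DPU 101

    def _home(self, key):
        return self.caches[hash(key) % len(self.caches)]

    def put(self, key, value):
        self._home(key)[key] = value

    def get(self, key):
        return self._home(key).get(key)

pool = CachePool([{}, {}, {}])            # three DPUs contribute caches
pool.put("batch0", "model_training_data")
print(pool.get("batch0"))
```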
The data processing system 700 shown in Figure 7 takes as an example a DPU deployed independently of the computing nodes. In other possible implementations, each computing node may include a CPU, a model training processor and at least one DPU, so that the AI model training process of each model training processor 103 can be further accelerated based on the computing power provided by the multiple DPUs in the multiple computing nodes. Moreover, based on the caches in the multiple DPUs, a shared cache pool can be built across computing nodes. Alternatively, the multiple DPUs 101 shown in Figure 7 can be located in the same computing node, such as computing node 1, so that the multiple DPUs 101 in computing node 1 can be used not only for the AI model training process of computing node 1, but also, through the shared cache pool, to provide computing node 2 and computing node 3 with services for accelerating AI model training.
As another example, in the data processing system 800 shown in Figure 8, each DPU 101 can be responsible for providing the service of accelerating AI model training to at least one model training processor 103. For example, DPU 101-1 is used to provide services for the model training processor 103 in computing node 1, DPU 101-2 is used to provide services for the model training processor 103 in computing node 2, and DPU 101-3 is used to provide services for the model training processor 103 in computing node 3, and data interaction is possible between the different DPUs 101. In this way, after DPU 101-2 obtains the image data for training the AI model and performs the image decoding and image data transformation operations on it, it can share the resulting model training data with DPU 101-1 and DPU 101-3. Then, while DPU 101-2 outputs the model training data to the model training processor 103 in computing node 2 for AI model training, DPU 101-1 can output the model training data to the model training processor 103 in computing node 1, and DPU 101-3 can output it to the model training processor 103 in computing node 3, thereby accelerating the training of the AI model in each computing node.
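The Figure 8 interaction can be sketched as follows; the `share` fan-out below is one assumed realization of the data interaction between DPUs described above, not the only one.

```python
class Dpu:
    """Toy DPU that serves one node's trainer and can share produced
    data with peer DPUs, mirroring the Figure 8 description."""
    def __init__(self, name: str):
        self.name = name
        self.peers = []
        self.inbox = []   # model training data ready for the local trainer

    def share(self, model_training_data):
        self.inbox.append(model_training_data)    # keep a local copy
        for peer in self.peers:                   # and fan out to peers
            peer.inbox.append(model_training_data)

dpu1, dpu2, dpu3 = Dpu("DPU101-1"), Dpu("DPU101-2"), Dpu("DPU101-3")
dpu2.peers = [dpu1, dpu3]
dpu2.share("decoded+transformed batch")           # DPU 101-2 produced it
print([d.inbox for d in (dpu1, dpu2, dpu3)])      # all three can now serve it
```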
As yet another example, in other possible data processing systems, multiple DPUs and multiple model training processors can all be deployed in the same computing node, and each DPU can be responsible for providing the service of accelerating AI model training to at least one model training processor. Take a data processing system including DPU1, DPU2, model training processor 1 and model training processor 2 as an example. When DPU1 provides the service of accelerating AI model training to model training processor 1, it can perform operations such as image decoding (and image data transformation) on the acquired image data to obtain model training data. Then, DPU1 can not only output the model training data to model training processor 1, so that model training processor 1 trains its AI model based on the model training data, but can also output the model training data to DPU2, which provides it to model training processor 2, so that model training processor 2 can likewise train its AI model based on the model training data. Of course, the data processing system also includes hardware such as a CPU, which will not be described in detail here.
The data processing systems shown in Figure 4, Figure 5, Figure 7 and Figure 8 are provided only as exemplary descriptions of the embodiments of the present application and are not intended to limit the specific implementation of the data processing system.
Based on the same inventive concept as the above methods, an embodiment of the present application further provides a DPU. Figure 9 is a schematic structural diagram of a DPU 900 provided by an embodiment of the present application. The DPU 900 shown in Figure 9 can be, for example, the DPU 101 in the above embodiments. The DPU 900 is coupled with the first CPU and the first model training processor respectively through a system bus, or the DPU 900 and the first model training processor are different chips on the same training card. As shown in Figure 9, the DPU 900 includes:
a communication interface 901, used to obtain image data, where the image data includes multiple encoded images;
a processing chip 902, configured to perform operations in an image processing operation set on the image data to obtain model training data, where the operations of processing the image data include operations in the image processing operation set and operations in a training operation set, the image processing operation set at least includes an image decoding operation, and the training operation set at least includes a model training operation;
an output interface circuit 903, used to output the model training data, where the model training data is used for the first model training processor to perform the operations in the training operation set, or the model training data is used for the first CPU and the first model training processor to perform the operations in the training operation set.
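Purely as a structural aid, the three components of the DPU 900 can be mapped onto a small Python skeleton; the class and method names are assumptions and do not correspond to any real device API.

```python
class Dpu900:
    """Structural sketch of Figure 9: communication interface 901,
    processing chip 902, output interface circuit 903. The methods are
    placeholders, not a real driver interface."""
    def acquire_image_data(self, source):            # interface 901
        return source()                              # multiple encoded images

    def process(self, image_data, image_ops):        # processing chip 902
        for op in image_ops:                         # at least image decoding
            image_data = op(image_data)
        return image_data                            # model training data

    def output(self, model_training_data, sink):     # output circuit 903
        sink(model_training_data)

dpu = Dpu900()
data = dpu.acquire_image_data(lambda: ["jpeg0", "jpeg1"])
out = dpu.process(data, [lambda d: [f"decoded({x})" for x in d]])
dpu.output(out, print)   # hand off to the first model training processor
```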
In a possible implementation, the image processing operation set also includes image data transformation operations, and the processing chip 902 is used to: perform the image decoding operation on the image data to obtain matrix data; and perform the image data transformation operations on the matrix data to obtain the model training data.
In a possible implementation, the training operation set further includes an image data transformation operation; the model training data is used by the first CPU to perform the image data transformation operation and obtain temporary data, and the temporary data is used by the first model training processor to perform the model training operation.
In a possible implementation, the communication interface 901 is used to: obtain the artificial intelligence AI model output by the first model training processor; and send the AI model to a local or remote storage device, where the AI model is stored in the storage device in a file format or a key-value (KV) format.
In a possible implementation, the output interface circuit 903 is also used to output the model training data to a DPU 910. The DPU 910 is coupled with a second CPU and a second model training processor respectively through a system bus, or the DPU 910 and the second model training processor are different chips on the same training card. The DPU 910 is used to: receive the model training data; and output the model training data to the second model training processor, where the model training data is used by the second model training processor to perform the operations in the training operation set.
In addition, an embodiment of the present application also provides another DPU, as shown in Figure 10. The DPU 1000 shown in Figure 10 is coupled with the central processing unit CPU and multiple model training processors respectively through a system bus, or the DPU 1000 and the multiple model training processors are different chips on the same training card. The DPU 1000 includes:
a communication interface 1001, used to obtain image data, where the image data includes multiple encoded images;
a processing chip 1002, configured to perform operations in an image processing operation set on the image data to obtain model training data, where the operations of processing the image data include operations in the image processing operation set and operations in a training operation set, the image processing operation set at least includes an image decoding operation, and the training operation set at least includes a model training operation;
a data read/write interface 1003, configured to write the model training data into a shared cache accessed by the multiple model training processors, where the model training data in the shared cache is used for the multiple model training processors to execute the operations in the training operation set, or the model training data in the shared cache is used for the CPU and the multiple model training processors to execute the operations in the training operation set.
In another embodiment, the DPU 1000 shown in Figure 10 includes:
a communication interface 1001, used to obtain image data, where the image data includes multiple encoded images;
a processing chip 1002, configured to perform operations in an image processing operation set on the image data to obtain model training data, where the operations of processing the image data include operations in the image processing operation set and operations in a training operation set, the image processing operation set at least includes an image decoding operation, and the training operation set at least includes a model training operation;
a data read/write interface 1003, configured to write the model training data into a shared cache pool built based on the caches in multiple DPUs, where the multiple DPUs include the DPU 1000, and the model training data in the shared cache pool is used for the model training processor to perform the operations in the training operation set, or the model training data in the shared cache pool is used for the CPU and the model training processor to perform the operations in the training operation set.
Based on the same inventive concept as the above methods, an embodiment of the present application also provides a data processing device. The data processing device is applied to a DPU, such as the DPU 101 in the above embodiments; the DPU is coupled with the CPU and the model training processor respectively through a system bus, or the DPU and the model training processor are different chips on the same training card. The data processing device includes:
a communication module, used to obtain image data, where the image data includes a plurality of encoded images;
a processing module, configured to perform operations in an image processing operation set on the image data to obtain model training data, where the operations of processing the image data include operations in the image processing operation set and operations in a training operation set, the image processing operation set at least includes an image decoding operation, and the training operation set at least includes a model training operation;
an output module, configured to output the model training data, where the model training data is used for the model training processor to perform the operations in the training operation set, or the model training data is used for the CPU and the model training processor to perform the operations in the training operation set.
In a possible implementation, the image processing operation set also includes image data transformation operations, and the processing module is used to: perform the image decoding operation on the image data to obtain matrix data; and perform the image data transformation operations on the matrix data to obtain the model training data.
In a possible implementation, the training operation set further includes an image data transformation operation; the model training data is used by the CPU to perform the image data transformation operation and obtain temporary data, and the temporary data is used by the model training processor to perform the model training operation.
In a possible implementation, the communication module is used to: obtain the artificial intelligence AI model output by the model training processor; and send the AI model to a local or remote storage device, where the AI model is stored in the storage device in a file format or a key-value (KV) format.
In a possible implementation, the output module is also used to output the model training data to another DPU. The other DPU is coupled with another CPU and another model training processor respectively through a system bus, or the other DPU and the other model training processor are different chips on the same training card. The other DPU is used to: receive the model training data; and output the model training data to the other model training processor, where the model training data is used by the other model training processor to perform the operations in the training operation set.
In addition, an embodiment of the present application also provides another data processing device, which is applied to a data processing unit DPU. The DPU is coupled with the central processing unit CPU and multiple model training processors respectively through a system bus, or the DPU and the multiple model training processors are different chips on the same training card. The data processing device includes:
a communication module, used to obtain image data, where the image data includes a plurality of encoded images;
a processing module, configured to perform operations in an image processing operation set on the image data to obtain model training data, where the operations of processing the image data include operations in the image processing operation set and operations in a training operation set, the image processing operation set at least includes an image decoding operation, and the training operation set at least includes a model training operation;
a data writing module, configured to write the model training data into a shared cache accessed by the multiple model training processors, where the model training data in the shared cache is used for the multiple model training processors to execute the operations in the training operation set, or the model training data in the shared cache is used for the CPU and the multiple model training processors to execute the operations in the training operation set.
In another embodiment, the data processing device is applied to a target data processing unit DPU, which is coupled with the central processing unit CPU and the model training processor respectively through a system bus, or the target DPU and the model training processor are different chips on the same training card. The data processing device includes:
a communication module, used to obtain image data, where the image data includes a plurality of encoded images;
a processing module, configured to perform operations in an image processing operation set on the image data to obtain model training data, where the operations of processing the image data include operations in the image processing operation set and operations in a training operation set, the image processing operation set at least includes an image decoding operation, and the training operation set at least includes a model training operation;
a data writing module, configured to write the model training data into a shared cache pool built based on the caches in multiple DPUs, where the multiple DPUs include the target DPU, and the model training data in the shared cache pool is used for the model training processor to perform the operations in the training operation set, or the model training data in the shared cache pool is used for the CPU and the model training processor to perform the operations in the training operation set.
An embodiment of the present application also provides a DPU. As shown in Figure 11, the DPU 1100 may include a communication interface 1110 and a processor 1120. Optionally, the DPU 1100 may also include a memory 1130, where the memory 1130 may be provided inside the DPU 1100 or outside the DPU 1100. Exemplarily, each action performed by the DPU 101 in the embodiments shown in Figure 3 and Figure 4 can be implemented by the processor 1120. The processor 1120 can obtain image data through the communication interface 1110 and is used to implement any of the methods performed in Figure 2, Figure 3 and Figure 6. In the implementation process, each step of the processing flow can complete the methods executed in Figures 2, 3 and 6 through integrated logic circuits of hardware in the processor 1120 or through instructions in the form of software. For brevity, details are not repeated here. The program code executed by the processor 1120 to implement the above methods may be stored in the memory 1130, and the memory 1130 is connected to the processor 1120, for example through a coupling connection.
Some features of the embodiments of the present application may be implemented or supported by the processor 1120 executing program instructions or software code in the memory 1130. The software components loaded on the memory 1130 can be summarized functionally or logically, for example as the processing chip 902 and the output interface circuit 903 shown in Figure 9, or as the processing chip 1002 and the data read/write interface 1003 shown in Figure 10. The functions of the communication interface 901 shown in Figure 9 and the communication interface 1001 shown in Figure 10 can be realized by the communication interface 1110.
Any communication interface involved in the embodiments of this application, such as the communication interface 1110 in the DPU 1100, may be a circuit, a bus, a transceiver, or any other apparatus that can be used for information exchange with another apparatus; exemplarily, that other apparatus may be a device connected to the DPU 1100.
The processor involved in the embodiments of this application may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in conjunction with the embodiments of the present application can be directly executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
The coupling in the embodiments of this application is an indirect coupling or communication connection between apparatuses or modules, which may be in electrical, mechanical or other forms, and is used for information interaction between the apparatuses or modules.
The processor may operate in conjunction with the memory. The memory can be a non-volatile memory, such as a hard disk drive or a solid state drive, or a volatile memory, such as a random access memory. The memory is any medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The embodiments of the present application do not limit the specific connection medium between the above communication interface, processor and memory. For example, the memory, the processor and the communication interface can be connected through a bus, and the bus can be divided into an address bus, a data bus, a control bus, and so on.
An embodiment of the present application also provides a chip system. As shown in Figure 12, the chip system 1200 may include a power supply circuit 1201 and a processing circuit 1202, where the power supply circuit 1201 is used to supply power to the processing circuit 1202, and the processing circuit is used to perform the following steps:
obtaining image data, where the image data includes a plurality of encoded images;
performing operations in an image processing operation set on the image data to obtain model training data, where the operations of processing the image data include operations in the image processing operation set and operations in a training operation set, the image processing operation set at least includes an image decoding operation, and the training operation set at least includes a model training operation;
outputting the model training data, where the model training data is used for the model training processor to perform the operations in the training operation set, or the model training data is used for the CPU and the model training processor to perform the operations in the training operation set.
In practical applications, the power supply circuit 1201 may be located in the same chip as the processing circuit 1202, or the power supply circuit 1201 may be located in another chip other than the chip where the processing circuit 1202 is located. The power supply circuit 1201 includes, but is not limited to, at least one of the following: a power supply subsystem, a power management chip, a power management processor, or a power management control circuit.
Based on the above embodiments, the embodiments of the present application also provide a computer storage medium, which stores a software program. When read and executed by one or more processors, the software program can implement the methods performed by the data processing system 100 provided by any one or more of the above embodiments. The computer storage medium may include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disc.
Those skilled in the art should understand that the embodiments of the present application may be provided as methods, devices, systems, storage media or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.


Abstract

提供一种数据处理方法,DPU对获取图像数据执行图像操作集合中的操作,得到模型训练数据,该图像处理操作集合中至少包括图像解码操作,从而DPU向模型训练处理器输出模型训练数据,该模型训练数据用于供模型训练处理器执行训练操作集合中的操作;或者,DPU向CPU输出模型训练数据,该模型训练数据用于供CPU以及模型训练处理器执行训练操作集合中的操作,训练操作集合至少包括模型训练操作。如此,利用DPU对图像数据执行图像解码等操作,可以通过加速数据处理来提高AI模型训练效率,并可以通过减少对模型训练处理器有限的内存空间的占用,来提高训练AI模型的效率。

Description

数据处理方法、数据处理单元、系统及相关设备
本申请要求于2022年04月29日提交中国国家知识产权局、申请号为202210473780.6、申请名称为“一种文件系统访问方法及装置”和要求于2022年08月04日提交中国国家知识产权局、申请号为202210934646.1、申请名称为“数据处理方法、数据处理单元、系统及相关设备”的中国专利申请的优先权。其全部内容通过引用结合在本申请中。
技术领域
本申请涉及数据处理技术领域,尤其涉及一种数据处理方法、数据处理单元、系统及相关设备。
背景技术
目前,计算节点中通常可以配置有模型训练处理器,并利用该模型训练处理器处理计算节点上的模型训练业务,以此卸载计算节点中的中央处理器(central processing unit,CPU)的计算负担、提高计算节点的计算性能。举例来说,计算节点中可以配置有图形处理器(graphics processing unit,GPU),其中,GPU可以用于训练CPU提供的人工智能(artificial intelligence,AI)模型等。
但是,实际应用时,计算节点上的模型训练业务,可能对于业务处理时延的要求较高,这使得模型训练处理器训练模型的效率,可能难以满足部分应用场景的需求。
发明内容
提供一种数据处理方法、数据处理单元、系统、存储介质以及计算机程序产品,以提高计算节点上的AI模型的训练效率,从而满足实际应用场景中部分模型训练业务的低时延要求。
第一方面,本申请实施例提供一种数据处理方法,该数据处理方法由DPU执行,该DPU分别与CPU、模型训练处理器通过系统总线耦合,或者,该DPU与模型训练处理器是一个训练卡上的不同芯片,该训练卡与CPU能够在训练卡上进行数据通信;在进行数据处理时,DPU获取图像数据,该图像数据包括多个编码图像,然后,DPU对该图像数据执行图像操作集合中的操作,得到模型训练数据,其中,处理该图像数据的操作包括由DPU执行的图像处理操作集合中的操作以及非DPU执行的训练操作集合中的操作,并且,图像处理操作集合中至少包括图像解码操作,而训练操作集合至少包括模型训练操作,从而DPU向模型训练处理器输出模型训练数据,该模型训练数据用于供模型训练处理器执行训练操作集合中的操作;或者,DPU向CPU输出模型训练数据,该模型训练数据用于供CPU以及模型训练处理器执行训练操作集合中的操作。
由于通常情况下模型训练处理器主要擅长执行训练操作集合中的模型训练操作,因此,利用DPU对图像数据进行图像解码(以及执行其它操作)的速度,会比利用模型训练处理器进行图像解码(以及执行其它操作)的速度更快,从而利用DPU可以提高处理模型训练业务的效率、实现加速AI模型训练。并且,利用DPU对图像数据执行图像解码等操作,可以减少实现训练AI模型的过程中所产生的临时数据对模型训练处理器有限的内存空间的占用,这使得模型训练处理器可以具有充足的内存空间训练AI模型,从而也能提高训练AI模型的效率。
在AI训练中,模型训练处理器(例如GPU)针对图像数据通常需要执行多个操作,这多个操作组成一个集合。本方案把这个集合划分成复数个子集,把其中一个子集卸载到DPU中执行;其余子集由模型训练处理器执行,或者由模型训练处理器与CPU共同执行。这样能够节约模型训练处理器的资源占用,让模型训练处理器把资源集中用于模型训练,而不是被图像解码等操作占用大量资源。此外,DPU的位置往往更接近用户,用户数据先经过DPU(例如集成网络功能的DPU),经过DPU处理后再到达模型训练处理器,这样可以减小模型训练处理器的计算负担,从而模型训练处理器能够具有更多的算力来加速AI模型训练。
在一种实现方式中,卸载到DPU的操作是DPU比模型训练处理器更擅长的操作。换言之,DPU与模型训练处理器执行同样的操作时,DPU比模型训练处理器效率更高,或者DPU比模型训练处理器耗电更少,或者DPU比模型训练处理器占用时间更短。
在一种可能的实施方式中,DPU包括网络接口,从而DPU在获取图像数据时,具体可以是基于该网络接口通过有线网络或者无线网络获取用于训练AI模型的图像数据。如此,可以支持对远端的存储设备进行访问,以获取图像数据,从而可以在云场景中实现对AI模型的加速训练。
在一种可能的实施方式中,当DPU通过有线网络获取存储设备中的图像数据时,该有线网络具体可以是以太网或者无线带宽网。
在一种可能的实施方式中,DPU可以与存储设备连接,从而DPU可以基于该连接从存储设备获取图像数据。如此,可以在本地实现对AI模型的加速训练。
在一种可能的实施方式中,存储设备例如可以是硬盘驱动器HDD、闪存介质驱动器、叠瓦式磁记录SMR、存储阵列、存储服务器中的一种或多种,实际应用时也可以是其它方式实现。
可选地,存储设备例如可以是易失性存储器,或者可以是非易失性存储器。
在一种可能的实施方式中,当存储设备与DPU连接时,存储设备与DPU之间的通信协议可以包括:小型计算机系统接口SCSI协议、串行连接小型计算机系统接口SAS协议、外围元件快速互连PCIe协议、通用串行总线USB协议、快速非易失性存储器NVMe协议中的一种或多种。如此,DPU可以基于任意一种或者多种通信协议实现与存储设备之间的数据通信。
在一种可能的实施方式中,DPU、CPU以及模型训练处理器位于同一个服务器。
可选地,DPU可以与模型训练处理器位于不同的服务器,从而可以实现对远端的模型训练处理器提供AI模型加速训练的服务。
在一种可能的实施方式中,所述模型训练处理器为图形处理器GPU、神经网络处理器NPU、张量处理器TPU中的一种或者多种。
在一种可能的实施方式中,用于耦合DPU、CPU以及模型训练处理器的系统总线,或者而用于耦合CPU与训练卡的系统总线,可以包括外围元件快速互连PCIe总线、计算快速链接CXL总线、快速非易失性存储器NVMe总线中的一种或多种。
在一种可能的实施方式中,图像处理操作集合还包括图像数据变换操作,该图像数据变化操作例如可以是中心裁剪操作、尺寸调整操作、数据增强操作、归一化操作中的一种或者多种,或者还可以其它类型的操作等,从而DPU在执行图像处理操作集合中的操作时,可以先对图像数据执行图像解码操作,得到矩阵数据,然后再对该矩阵数据执行图像数据变换操作,得到输出给CPU或者输出给模型训练处理器的模型训练数据。
在一种可能的实施方式中,训练操作集合还包括图像数据变换操作,则模型训练数据用于被CPU执行图像数据变换操作并得到临时数据,该临时数据用于被模型训练处理器执行模 型训练操作。即,由DPU对图像数据执行图像解码操作后,将得到的模型训练数据输出给CPU;CPU对模型训练数据执行图像数据变换操作,并将得到的临时数据输出给模型训练处理器;模型训练处理器利用该临时数据实现对AI模型的训练。如此,利用DPU以及CPU的算力,可以进一步加速AI模型的训练。
在一种可能的实施方式中,模型训练处理器在训练得到的AI模型后,可以将该AI模型输出给DPU,从而DPU可以向本地或者远端的存储设备发送该AI模型,并且,该AI模型可以通过文件格式或者键值KV格式将AI模型存储于存储设备中。如此,可以实现对AI模型的本地或者云端保存。
在一种可能的实施方式中,DPU具体可以是向其它DPU输出所述模型训练数据,其中,该其它DPU分别与其它CPU、其它模型训练处理器通过系统总线耦合,或者该其它DPU与其它模型训练处理器是一个训练卡上的不同芯片。这样,其它DPU在接收到该模型训练数据后,可以向其它模型训练处理器输出该模型训练数据,该模型训练数据用于被其它模型训练处理器执行训练操作集合中的操作。如此,不仅可以实现利用DPU加速AI模型的训练,而且DPU所处理得到的模型训练数据可以共享给其它DPU,以便其它DPU能够基于共享得到的模型训练数据加速其它模型训练处理器的模型训练。
第二方面,本申请实施例还提供了一种数据处理方法,该数据处理方法由DPU执行,该DPU分别与CPU、多个模型训练处理器通过系统总线耦合,或者DPU与多个模型训练处理器是一个训练卡上的不同芯片;则,在进行数据处理时,DPU获取图像数据,该图像数据包括多个编码图像;然后,DPU对图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,该图像处理操作集合至少包括图像解码操作,该训练操作集合至少包括模型训练操作,从而DPU将模型训练数据写入供多个模型训练处理器访问的共享缓存,该共享缓存中的模型训练数据用于供多个模型训练处理器执行训练操作集合中的操作,或者,该共享缓存中的模型训练数据用于供CPU以及多个模型训练处理器执行所述训练操作集合中的操作。
由于DPU对图像数据进行图像解码(以及执行其它操作)的速度,会比模型训练处理器进行图像解码(以及执行其它操作)的速度更快,从而利用DPU可以提高处理模型训练业务的效率、实现加速AI模型训练。并且,利用DPU对图像数据执行图像解码等操作,可以减少实现训练AI模型的过程中所产生的临时数据对多个模型训练处理器有限的内存空间的占用,这使得多个模型训练处理器可以具有充足的内存空间训练AI模型,从而能够为多个模型训练处理器提高训练AI模型的效率。
第三方面,本申请实施例还提供了一种数据处理方法,该数据处理方法由目标DPU执行,该目标DPU分别与CPU、模型训练处理器通过系统总线耦合,或者目标DPU与模型训练处理器是一个训练卡上的不同芯片;则,在进行数据处理时,目标DPU获取图像数据,图像数据包括多个编码图像;然后,目标DPU对图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,该图像处理操作集合至少包括图像解码操作,该训练操作集合至少包括模型训练操作,从而目标DPU将模型训练数据写入基于多个DPU中的缓存所构建的共享缓存池,该多个DPU包括所述目标DPU,该共享缓存池中的模型训练数据用于供模型训练处理器执行训练操作集合中的操作,或者,该共享缓存池中的模型训练数据用于供CPU以及模型训练处理器执行该训练操作集合中的操作。
由于利用一个或者多个DPU对图像数据进行图像解码(以及执行其它操作)的速度,会比模型训练处理器进行图像解码(以及执行其它操作)的速度更快,从而利用一个或者多个DPU可以提高处理模型训练业务的效率、实现加速AI模型训练。并且,利用一个或者多个DPU对图像数据执行图像解码等操作,可以减少实现训练AI模型的过程中所产生的临时数据对模型训练处理器有限的内存空间的占用,这使得模型训练处理器可以具有充足的内存空间训练AI模型,从而能够为模型训练处理器提高训练AI模型的效率。
第四方面,本申请实施例提供了一种第一数据处理单元DPU,所述第一DPU分别与第一CPU、第一模型训练处理器通过系统总线耦合,或者所述第一DPU与所述模型训练处理器是一个训练卡上的不同芯片,所述DPU包括:通信接口,用于获取图像数据,所述图像数据包括多个编码图像;处理芯片,用于对所述图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理所述图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,所述图像处理操作集合至少包括图像解码操作,所述训练操作集合至少包括模型训练操作;输出接口电路,用于输出所述模型训练数据,所述模型训练数据用于供所述第一模型训练处理器执行所述训练操作集合中的操作,或者,所述模型训练数据用于供所述第一CPU以及所述第一模型训练处理器执行所述训练操作集合中的操作。
在一种可能的实施方式中,所述图像处理操作集合还包括图像数据变换操作,所述处理芯片,用于:对所述图像数据执行所述图像解码操作,得到矩阵数据;对所述矩阵数据执行所述图像数据变换操作,得到所述模型训练数据。
在一种可能的实施方式中,所述训练操作集合还包括图像数据变换操作,所述模型训练数据用于被所述第一CPU执行所述图像数据变换操作并得到临时数据,所述临时数据用于被所述第一模型训练处理器执行所述模型训练操作。
在一种可能的实施方式中,所述通信接口,用于:获取所述第一模型训练处理器输出的人工智能AI模型;向本地或者远端存储设备发送所述AI模型,所述AI模型通过文件格式或者键值KV格式把所述AI模型存储于所述存储设备中。
在一种可能的实施方式中,所述输出接口电路,还用于向第二DPU输出所述模型训练数据;所述第二DPU分别与第二CPU、第二模型训练处理器通过系统总线耦合,或者所述第二DPU与所述第二模型训练处理器是一个训练卡上的不同芯片,所述第二DPU用于:接收所述模型训练数据;向所述第二模型训练处理器输出所述模型训练数据,所述模型训练数据用于被所述第二模型训练处理器执行所述训练操作集合中的操作。
由于第四方面提供的DPU,对应于第一方面提供的数据处理方法,因此,第四方面以及第四方面中各实施方式所具有技术效果,可以参见相应的第一方面以及第一方面中各实施方式所具有的技术效果,在此不做赘述。
第五方面,本申请实施例提供一种数据处理单元DPU,所述DPU分别与中央处理器CPU、多个模型训练处理器通过系统总线耦合,或者所述DPU与所述多个模型训练处理器是一个训练卡上的不同芯片,所述数据处理装置包括:通信接口,用于获取图像数据,所述图像数据包括多个编码图像;处理芯片,用于对所述图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理所述图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,所述图像处理操作集合至少包括图像解码操作,所述训练操作集合至少包括模型训练操作;数据读写接口,用于将所述模型训练数据写入供所述多个模型训练处理器访问的共享缓存,所述共享缓存中的模型训练数据用于供所述多个模型训练处理器执行所述训练操作集合中的操作,或者,所述共享缓存中的模型训练数据用于供所述CPU以及所述多 个模型训练处理器执行所述训练操作集合中的操作。
由于第五方面提供的DPU,对应于第二方面提供的数据处理方法,因此,第五方面的实施方式所具有技术效果,可以参见相应的第二方面的实施方式所具有的技术效果,在此不做赘述。
第六方面,本申请实施例提供一种目标数据处理单元DPU,所述目标DPU分别与中央处理器CPU、模型训练处理器通过系统总线耦合,或者所述目标DPU与所述模型训练处理器是一个训练卡上的不同芯片,所述目标DPU包括:通信接口,用于获取图像数据,所述图像数据包括多个编码图像;处理芯片,用于对所述图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理所述图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,所述图像处理操作集合至少包括图像解码操作,所述训练操作集合至少包括模型训练操作;数据读写接口,用于将所述模型训练数据写入基于多个DPU中的缓存所构建的共享缓存池,所述多个DPU包括所述目标DPU,所述共享缓存池中的模型训练数据用于供所述模型训练处理器执行所述训练操作集合中的操作,或者,所述共享缓存池中的模型训练数据用于供所述CPU以及所述模型训练处理器执行所述训练操作集合中的操作。
由于第六方面提供的目标DPU,对应于第三方面提供的数据处理方法,因此,第六方面的实施方式所具有技术效果,可以参见相应的第三方面的实施方式所具有的技术效果,在此不做赘述。
第七方面,本申请实施例提供了一种DPU,该DPU用于执行上述第一方面至第三方面的任一实现方式中DPU所执行的数据处理方法。
第八方面,本申请实施例提供一种数据处理系统,该数据处理系统包括上述第一方面至第三方面的任一实现方式所述的DPU、CPU以及模型训练处理器。
第九方面,本申请实施例提供一种芯片系统,所述芯片系统包括供电电路以及处理电路,所述供电电路用于对所述处理电路进行供电,所述处理电路用于执行上述第一方面至第三方面的任一实现方式中DPU所执行的数据处理方法。
其中，供电电路可以与处理电路位于同一个芯片内，或者，供电电路可以位于处理电路所在芯片之外的另一个芯片内。供电电路包括但不限于如下至少一个：供电子系统、电源管理芯片、功耗管理处理器或功耗管理控制电路。
第十方面,本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质中存储有程序或指令,当其在计算机上运行时,使得上述第一方面至第三方面中的任一实现方式中所述的数据处理方法被执行。
第十一方面,本申请实施例还提供一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面至第三方面中的任一实现方式中所述的数据处理方法。
另外,第四方面至十一方面中任一种实现方式所带来的技术效果可参见第一方面至第三方面中不同实现方式所带来的技术效果,此处不再赘述。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请中记载的一些实施例,对于本领域普通技术人员来讲,还可以根据这些附图获得其他的附图。
图1a为本申请实施例提供的一示例性数据处理系统的结构示意图;
图1b为本申请实施例提供的另一示例性数据处理系统的结构示意图;
图2为本申请实施例提供的一种数据处理方法的流程示意图;
图3为本申请实施例提供的另一种数据处理方法的流程示意图;
图4为本申请实施例提供的另一种数据处理系统的结构示意图;
图5为本申请实施例提供的又一种数据处理系统的结构示意图;
图6为本申请实施例提供的又一种数据处理方法的流程示意图;
图7为本申请实施例提供的又一种数据处理系统的结构示意图;
图8为本申请实施例提供的再一种数据处理系统的结构示意图;
图9为本申请实施例提供的一种DPU的结构示意图;
图10为本申请实施例提供的另一种DPU的结构示意图;
图11为本申请实施例提供的一种DPU的硬件结构示意图;
图12为本申请实施例提供的一种芯片系统的结构示意图。
具体实施方式
参见图1a,为一示例性数据处理系统的结构示意图。如图1a所示,数据处理系统100可以包括数据处理单元(data processing unit,DPU)101、CPU102、以及模型训练处理器103。其中,DPU101、CPU102、模型训练处理器103之间可以通过系统总线进行耦合,该系统总线用于连接计算机内部的组件,例如可以是外围组件快速互连(peripheral component interconnect express,PCIE)总线、计算快速链接(compute express link,CXL)总线、快速非易失性存储器(non-volatile memory express,NVMe)总线中的任意一种或多种,或者可以是其它可能的总线,本申请对此并不进行限定。其中,DPU是一种板卡,可以通过PCIe接口或者其他接口的插槽插在主板(如服务器主板等)上。DPU可以和模型训练处理器(例如GPU)集成在同一个板卡上,或者分别是独立的板卡。
所述DPU拥有网络接口,例如100GB以太网络接口、infiniband网络接口。所述DPU还可以拥有访问计算机节点本地存储介质的功能,例如通过PCIe协议访问固态驱动器(SSD)。
其中,DPU101、CPU102可以与模型训练处理器103部署于同一计算节点,如图1a所示的计算节点1。示例性地,计算节点1,例如可以是终端或者服务器,又或者可以是其它具有计算能力的设备等。在其它实施方式中,DPU101、CPU102与模型训练处理器103也可以部署于不同计算节点,如DPU101部署于一个计算节点,CPU102与模型训练处理器103部署于另一个计算节点等。
作为一些示例,模型训练处理器103,例如可以是GPU、神经网络处理器(neural-network processing unit,NPU)、张量处理器(tensor processing unit,TPU)中的任意一种或者多种,或者可以是其它类型的处理器。
进一步地,数据处理系统100还可以包括存储设备104,并且,计算节点1通过DPU101与存储设备104进行通信,如图1a所示。具体地,计算节点1通过DPU101对存储设备104进行访问,如通过DPU101读取存储设备104中存储的用于训练模型的图像数据,或者将完成训练的模型通过DPU101发送至存储设备104中进行存储等。
其中,DPU101可以与存储设备104连接,如通过接口或者总线建立连接等。示例性地,存储设备104与DPU101之间的通信协议包括:小型计算机系统接口(small computer system interface,SCSI)协议、串行连接小型计算机系统接口(serial attached small computer system interface,SAS)协议、PCIe协议、通用串行总线(universal serial bus,USB)协议(如USB 3.0、USB 2.0等)、NVMe协议中的一种或多种,或者可以是其它可适用的通信 协议。
或者,DPU101可以包括网络接口,从而DPU101可以通过该网络接口,基于有线网络或者无线网络与存储设备104进行通信。作为一些示例,当DPU101与存储设备104基于有线网络进行通信时,该有线网络例如可以是以太网或者无限带宽网(infiniband)等。
其中,存储设备104例如可以是硬盘驱动器(hard disk drive,HDD)、闪存介质驱动器、叠瓦式磁记录(shingled magnetic recording,SMR)、存储阵列、存储服务器中的一种或多种。或者,按照存储设备的类型划分,存储设备104可以是非易失性存储器(non-volatile memory,NVM),如只读内存(read-only memory,ROM)、闪存(flash memory)、存储级存储器(storage class memory,SCM)等;或者,存储设备104可以是易失性存储器(volatile memory),如随机存取存储器(random-access memory,RAM)等。需要说明的是,图1a中所示的存储设备104,可以是能够提供存储服务的一个设备或者多个设备的集合,为便于描述,在此以存储设备104进行统一指代。
在图1a所示的数据处理系统100中,CPU102可以指示模型训练处理器103对该计算节点1上的一个或者多个AI模型进行训练,如指示模型训练处理器103训练目标识别模型、人脸检测模型等。
以处理计算节点1上的一个模型训练任务为例,如果仅利用模型训练处理器103所具有的单一算力执行处理该模型训练业务所需的所有步骤,由于模型训练处理器103的算力有限,这使得模型训练处理器103完成AI模型的训练所需耗时可能过长,从而导致AI模型的训练效率较低。举例来说,假设模型训练处理器103具体为GPU,则,GPU可以访问存储设备104中存储的作为训练样本的图像数据,并将访问到的图像数据存储至本地内存,然后在本地内存对该图像数据执行解码、图像数据变换(如数据增强等)、以及训练AI模型等操作,并在本地内存保存执行这些操作所产生的临时数据。由于GPU的内存空间有限,难以存储GPU处理所有图像数据所产生的临时数据,这不仅会导致迭代训练AI模型的批数据量(batch size)较小,而且,在多轮迭代训练AI模型的过程中,GPU需要多次重复读取和解码存储设备104中的同一训练数据集,从而会增加GPU迭代训练AI模型的耗时,导致GPU训练AI模型的效率较低。
基于此,本申请实施例提供了一种数据处理方法,利用DPU101协同模型训练处理器103处理业务,以此提高处理计算节点1上的AI模型的训练效率。具体实现时,在处理计算节点1上的模型训练业务时,DPU101可以获取存储设备104中存储的用于训练AI模型所需的图像数据(即AI模型的训练样本),该图像数据包括多个编码图像,其中,处理图像数据的操作包括多种操作,该多种操作可以被划入由DPU执行的图像处理操作集合、以及由模型训练处理器103(或者模型训练处理器103+CPU102)执行的训练操作集合,并且,图像处理操作集合中至少包括图像解码操作,而训练操作集合中至少包括模型训练操作。然后,DPU101可以对获取的图像数据执行图像处理操作集合中的操作,得到模型训练数据,并输出该模型训练数据。这样,模型训练处理器103可以继续对该模型训练数据执行训练操作集合中的操作,实现对AI模型的训练。或者,CPU102与模型训练处理器103针对该模型训练数据依次执行训练操作集合中的不同操作,实现对AI模型的训练。
由于通常情况下模型训练处理器103主要擅长执行训练操作集合中的模型训练操作,对图像解码等操作相对而言并不擅长。因此,利用DPU101对图像数据进行图像解码(以及执行其它操作)的速度,会比利用模型训练处理器103进行图像解码(以及执行其它操作)的速度更快,从而利用DPU101可以提高处理模型训练业务的效率、实现加速计算节点1上的AI 模型训练。并且,利用DPU101对图像数据执行图像解码等操作,可以减少实现训练AI模型的过程中所产生的临时数据对模型训练处理器103有限的内存空间的占用,这使得模型训练处理器103可以具有充足的内存空间训练AI模型,从而也能提高训练AI模型的效率,进而提高处理计算节点1上的模型训练业务的整体效率。
仍以CPU102利用GPU处理模型训练业务为例,DPU101可以从存储设备104中读取训练AI模型所需的图像数据,并在DPU101的内存中依次对该图像数据执行解码、尺寸调整操作(resize)、数据增强操作等(这些操作即为上述图像处理操作集合中的操作),得到模型训练数据;然后,DPU101可以将模型训练数据输出给GPU。GPU可以在本地内存存储该模型训练数据,并利用该模型训练数据训练AI模型。如此,由于在处理AI模型训练业务的过程中,对图像数据执行解码操作、尺寸调整操作、数据增强操作所分别生成的临时数据所占用的内存空间为DPU的内存空间,因此,即使GPU的本地内存的空间有限,GPU也能具有足够的内存空间存储DPU101输出的模型训练数据(也即经过数据增强操作所得到的数据),从而GPU可以采用更大的batch size对AI模型进行迭代训练,实现AI模型的训练加速。并且,在对AI模型进行多次迭代训练的过程中,GPU可以直接从DPU101的内存中读取模型训练数据,从而无需多次重复执行读取存储设备104中作为训练样本的图像数据、对图像数据进行解码、尺寸调整、数据增强的操作,这不仅可以降低资源消耗,还能进一步加快该AI模型训练业务的处理效率。另外,在GPU利用一个batch size的训练样本训练AI模型时,DPU101可以并行执行对其它图像数据的解码、尺寸调整、数据增强等操作,以便于GPU在完成一次AI模型过程后,能够及时获得下一batch size的训练样本继续训练AI模型,从而可以进一步加快AI模型的训练效率,摆脱GPU训练AI模型的输入输出约束(IO bound)和算力约束。
值得注意的是,图1a所示的系统架构仅作为一种示例,并不用于限定其具体实现局限于该示例。比如,在图1b所示的数据处理系统100中,DPU101与模型训练处理器103是一个训练卡200上的不同芯片,并且,CPU与该训练卡200能够在训练卡200上进行数据通信,而在训练卡200上,DPU101与模型训练处理器103可以通过片上总线进行连接。又比如,图1中的计算节点1也可以包括更多数量或者更多类型的模型训练处理器。再比如,DPU101可以同时与多个计算节点进行连接,从而可以为多个计算节点实现加速模型训练业务处理,本实施例对此并不进行限定。
并且,包括上述DPU101、CPU102、模型训练处理器103、以及存储设备104的数据处理系统100,可以适用于集中式存储的应用场景或者分布式存储的应用场景。
其中,在集中式存储应用场景中,可以由一台或多台计算节点组成中心节点,并且整个数据处理系统100的所有数据处理业务都集中部署在这个中心节点上。此时,计算节点1与实现存储设备104之间可以采用盘控分离架构,即计算节点1与存储设备104独立部署;或者,计算节点1与存储设备104之间可以采用盘控一体架构,即计算节点1可以具有槽位,并通过该槽位将存储设备104放置在该计算节点1中,与该计算节点1集成部署。
在分布式存储应用场景中,数据处理系统100中的数据可以分散存储在多个独立的存储设备104上。此时,计算节点1可以与存储设备104集成部署,使得该计算节点同时具有计算能力以及存储能力,并且在该计算节点1上可以创建虚拟机,或者也可以不创建虚拟机。或者,计算节点1与存储设备104之间可以采用存算分离架构,即计算节点1与存储设备104独立部署并通过网络进行通信。另外,存储设备104中可以包括一种或者多种不同的存储介质,本实施例对此并不进行限定。
为使本申请的上述目的、特征和优点能够更加明显易懂,下面将结合附图对本申请实施例中的各种非限定性实施方式进行示例性说明。显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,基于上述内容所获得的所有其它实施例,都属于本申请保护的范围。
如图2所示,为本申请实施例中一种数据处理方法的流程示意图,该方法可以应用于如图1a或者图1b所示的数据处理系统100中。实际应用时,该方法也可以应用于其它可适用的数据处理系统中。为便于理解与描述,下面以应用于图1a所示的数据处理系统100为例进行示例性说明,该方法具体可以包括:
S201:DPU101访问存储设备104,获取用于训练AI模型的图像数据,该图像数据包括多个编码图像。
本实施例中,存储设备104中可以存储作为AI模型训练样本的图像数据,该图像数据用于训练计算节点1上的一个或者多个AI模型,如目标识别模型、人脸检测模型等,从而处理模型训练任务时,DPU101可以通过访问存储设备104,获取其上存储的用于训练该AI模型的图像数据。
作为一种实现示例,DPU101可以在CPU102的控制下,对存储设备104进行访问。具体实现时,CPU102可以对外提供客户端,该客户端例如可以是网络浏览器,或者可以是运行在用户侧的应用程序(application),用于实现与用户之间的交互,从而CPU102可以通过该客户端接收到用户针对训练AI模型的指示信息。然后,CPU102可以根据该指示信息,确定用于训练该AI模型的图像数据在存储设备104中的存储位置,并生成包括该存储位置的训练指令,从而CPU102可以将该训练指令下发给DPU101。如此,DPU101响应该训练指令,根据该训练指令中的存储位置访问得到存储设备104中的图像数据。实际应用时,也可以通过其它方式触发DPU101访问存储设备104,本实施例对此并不进行限定。
S202:DPU101对访问得到图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,并且,图像处理操作集合至少包括图像解码操作,训练操作集合至少包括模型训练操作。
S203:DPU101向模型训练处理器103输出模型训练数据。
通常情况下,DPU101所获取的图像数据为多个经过编码(encode)处理的图像(通常称之为编码图像),因此,在利用图像数据训练AI模型的过程中,针对图像数据所执行的操作可以包括图像解码操作以及模型训练操作等。其中,图像解码操作用于对该多个编码图像进行解码,通常可以得到矩阵形式的数据,在部分应用场景中,模型训练处理器103可以直接将矩阵形式的数据作为AI模型的输入,训练AI模型(即执行模型训练操作)。
进一步地,除了图像解码操作以及模型训练操作之外,针对图像数据所执行的操作还可以图像数据变换操作,该图像数据变换操作例如可以是中心裁剪(center crop)操作、尺寸调整操作、数据增强操作、归一化(normalize)操作中的一种或者多种,或者还可以其它类型的操作等。其中,中心裁剪操作,是指将图像数据(或者对图像数据进行解码后所得到的矩阵数据)裁剪为预设尺寸的图像数据,以满足模型训练业务中对于图像数据的尺寸需求。尺寸调整操作,是指对图像数据(或者对图像数据进行解码后所得到的矩阵数据)进行缩放,以调整图像数据的尺寸大小。数据增强操作,是指对图像数据(或者对图像数据进行解码后所得到的矩阵数据)进行尺寸变化、或者像素值变化、或者视角变化等(如图像翻转、旋转、平移、缩放等),可以丰富AI模型的训练样本。归一化操作,是指对图像数据(或者对图像数据进行解码后所得到的矩阵数据)中的每个颜色通道的数据进行归一化,如将每个颜色通 道上的各个像素的像素值减去该颜色通道的平均像素值,并除以该颜色通道的像素值的方差等。
本实施例中,可以预先将该上述针对图像数据所执行的多个操作划分为由DPU101执行的操作,以及由其它处理器(如模型训练处理器103)执行的操作。为便于区分,以下将该多个操作中由DPU101执行的操作划入图像处理操作集合中,由其它处理器执行的操作划入训练操作集合中。如此,DPU101在获得图像数据后,可以对其执行图像处理操作集合中的操作,并生成相应的数据,为便于区分,以下称之为模型训练数据。
作为一种示例,针对图像数据所执行的多个操作包括图像解码操作以及模型训练操作,则,DPU101可以对图像数据执行图像解码操作后,所得到的矩阵数据即为模型训练数据。然后,模型训练处理器103可以对该模型训练数据执行模型训练操作,具体可以是将该模型训练数据作为AI模型的输入,训练AI模型。
作为又一种示例,针对图像数据所执行的多个操作,除了包括图像解码操作以及模型训练操作之外,还包括图像数据变换操作,如上述中心裁剪操作、尺寸调整操作、数据增强操作、归一化操作中的一种或者多种。
此时,DPU101所需执行的操作数量,可以根据DPU101的算力进行确定。举例来说,假设针对图像数据所执行的多个操作包括图像解码操作、尺寸调整操作、数据增强操作以及模型训练操作。当DPU101的计算能力较高、内存资源较多时,表征DPU101的算力较强,此时,可以将图像解码操作、尺寸调整操作、数据增强操作划入由DPU101执行的图像处理操作集合中,将模型训练操作划入由模型训练处理器103执行的训练操作集合中,以此可以减少模型训练处理器103的计算负担。相应地,图像处理操作集合中包括多种类型的操作,DPU101基于图像数据依次执行图像解码操作、尺寸调整操作、数据增强操作后,可以得到输出给模型训练处理器103的模型训练数据。而当DPU101的计算能力较低、或者内存资源较少时,表征DPU101的算力较差,此时,可以将图像解码操作、尺寸调整操作划入图像处理操作集合中,将数据增强操作、模型训练操作划入训练操作集合中,基于DPU101所具有的算力,尽可能减少模型训练处理器103的计算负担。
或者,DPU101所需执行的操作数量,也可以根据模型训练处理器103的当前负载大小进行确定(该模型训练处理器103例如可以是同时训练多个AI模型等)。比如,当模型训练处理器103的负载较大时,DPU101可以执行图像解码操作、尺寸调整操作、数据增强操作,模型训练处理器103执行模型训练操作,以减少模型训练处理器103在训练AI模型的过程中所需的计算量,以此可以避免该模型训练处理器103的负载过大。而当模型训练处理器103的负载较小时,DPU101可以仅执行图像解码操作,或者仅执行图像解码操作以及尺寸调整操作,由模型训练处理器103执行其余操作,这可以使得模型训练处理器103的资源利用率能够达到较高的水平,避免模型训练处理器103上过多的资源处于闲置状态而产生资源浪费。
值得注意的是,上述实现示例仅作为一些示例性说明,在其它实施例中,也可以基于其它依据将针对图像数据所需执行的多个操作划入图像处理操作集合以及训练操作集合,本实施例对此并不进行限定。
S204:模型训练处理器103对模型训练数据执行训练操作集合中的操作。
本实施例中,由DPU101以及模型训练处理器103协同处理模型训练业务,因此,在DPU101输出模型训练数据后,模型训练处理器103可以继续对该模型训练数据进行处理,以此完成对模型训练业务的处理。
具体地,当训练操作集合中仅包括模型训练操作时,模型训练处理器103可以直接根据 DPU101输出的模型训练数据进行模型训练,具体可以是将该模型训练数据输入至AI模型中,并根据该AI模型输出的推理结果对AI模型中的参数进行更新,以此完成对AI模型的一次训练过程。而当训练操作集合中包括上述图像数据变换操作以及模型训练操作时,模型训练处理器103可以先对模型训练数据执行该图像数据变换操作,得到临时数据;然后,模型训练处理器103对根据该临时数据执行模型训练操作,以实现对AI模型的训练。
在进一步的实施方式中,模型训练处理器103在根据模型训练数据完成对AI模型的训练后,若训练后的AI模型满足模型训练终止条件,如迭代训练次数达到预设次数或者AI模型收敛等,则模型训练处理器103还可以向CPU102返回完成训练的AI模型,以便CPU102向上层应用或者与该CPU102交互的客户端反馈该AI模型。或者,模型训练处理器103可以向DPU101返回完成训练的AI模型,以便DPU101将该AI模型发送至存储设备104中进行存储等,本实施例对此并不进行限定。
实际应用场景中,当模型训练处理器103向CPU102反馈AI模型时,CPU102可以将该AI模型写入在本地的存储区域,如保存在计算节点1上的硬盘等。当模型训练处理器103向DPU101反馈AI模型时,若DPU101与存储设备104通过有线网络或者无线网络进行数据通信,则DPU101可以根据该AI模型生成相应的文件,并通过DPU101上的网络接口向远端的存储设备104发送该文件,从而AI模型可以基于文件格式存储于存储设备104中;或者,DPU101将AI模型发送至远端的存储设备104后,存储设备104可以通过键值对(key-value)的形式保存该AI模型,其中,key为存储设备104所创建的键,value为AI模型。若DPU101与存储设备104通过PCIe总线等方式建立连接,则DPU101可以基于文件格式或者KV格式将AI模型发送至本地的存储设备104。
由于DPU101执行了图像解码等处理图像数据所需执行的部分操作,从而可以减少模型训练处理器103所需执行的操作,这不仅可以利用DPU101的硬件实现加速对模型训练业务数据的处理,而且,DPU101在执行图像处理操作集合中的操作所生成的模型训练数据,占用的是DPU101的内存空间,这使得即使模型训练处理器103的本地内存的空间有限,模型训练处理器103也能具有足够的内存空间根据模型训练数据训练AI模型,从而可以避免模型训练处理器103训练AI模型的效率受到模型训练处理器103中有限的内存空间的影响,实现加速AI模型效率。另外,DPU101所生成的模型训练数据,可以在DPU101的内存空间进行存储,这使得当模型训练处理器103再次需要该模型训练数据时(如重复利用同一数据集对AI模型进行迭代训练等),可以直接从DPU101的内存空间读取,而无需DPU101重新从存储设备104中读取图像数据并重新执行图像解码等操作,从而可以进一步加速AI模型的训练。
上述实施例中,是以DPU101与模型训练处理器103共同处理模型训练业务为例进行示例性说明,实际应用时,当处理模型训练业务所需的算力要求较高时,也可以结合DPU101、CPU102以及模型训练处理器103的综合算力来加速AI模型的训练。下面,结合附图,对利用DPU101、CPU102以及模型训练处理器103处理模型训练业务的过程进行介绍。
参见图3,图3示出了本申请实施例的另一种数据处理方法的流程示意图。如图3所示,该方法具体可以包括:
S301:DPU101访问存储设备104,获取用于训练AI模型的图像数据,该图像数据包括多个编码图像。
S302:DPU101对访问得到模型训练业务的数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理图像数据的操作包括图像处理操作集合中的操作以及训练操作集合 中的操作,并且,图像处理操作集合至少包括图像解码操作,训练操作集合至少包括图像数据变换操作以及模型训练操作。
本实施例中,步骤S301至步骤S302的具体实现方式,可参见前述实施例中的步骤S201以及步骤S202的相关之处描述,在此不做赘述。
S303:DPU101向CPU102输出模型训练数据。
本实施例中,DPU101在输出模型训练数据后,可以交由CPU102继续对模型训练数据进行处理。
S304:CPU102对模型训练数据执行训练操作集合中的图像数据变换操作,得到临时数据。
S305:CPU102向模型训练处理器103输出该临时数据。
其中,图像数据变换操作,例如可以是中心裁剪操作、尺寸调整操作、数据增强操作、归一化操作中的一种或者多种。并且,当图像数据变换操作包括多种操作时,CPU102可以基于模型训练数据依次执行该多种操作,得到用于输出给模型训练处理器103的临时数据。
S306:模型训练处理器103对临时数据执行训练操作集合中的模型训练操作。
进一步地,模型训练处理器103在利用临时数据完成对AI模型的训练后,若该AI模型满足模型训练终止条件,如迭代训练次数达到预设次数或者AI模型收敛等,则模型训练处理器103还可以向CPU102返回完成训练的AI模型,以便CPU102向上层应用或者与该CPU102交互的客户端反馈该AI模型。或者,模型训练处理器103可以向DPU101返回完成训练的AI模型,以便DPU101将该AI模型发送至存储设备104中进行存储等,本实施例对此并不进行限定。
实际应用时,可以根据针对不同AI模型的训练任务,确定是否综合CPU102的算力来加速训练AI模型。
举例来说,当AI模型的尺寸较小时,DPU101以及模型训练处理器103的算力足以满足训练该AI模型所需的算力,从而可以仅利用DPU101以及模型训练处理器103训练该AI模型,并且该AI模型的训练效率可以达到较高的水平。
而当AI模型的尺寸较大时，训练该AI模型所需的算力较高，此时，可以综合DPU101、CPU102以及模型训练处理器103的算力，实现对AI模型的加速训练，以此使得AI模型的训练效率能够达到较高的水平。
上述图2以及图3所示实施例中,主要介绍了利用DPU101(以及CPU102)来协助模型训练处理器103加速AI模型训练的实现过程,实际应用时,当DPU101的算力较高时,DPU101可以同时为多个模型训练处理器103提供加速AI模型训练的服务。其中,DPU101、CPU102、以及该多个模型训练处理器103,可以部署于同一计算节点,如图4所示,此时,DPU101可以与该计算节点中的多个模型训练处理器103通过系统总线进行耦合,并为该多个模型训练处理器103提供模型训练数据。或者,DPU101、CPU102、以及多个模型训练处理器103可以部署于不同的计算节点,如图5所示,DPU101、CPU102、以及至少一个模型训练处理器103部署于计算节点1中,其余模型训练处理器103可以分别部署于计算节点2以及计算节点3,并且,计算节点2以及计算节点3中分别还包括CPU以及其它硬件(图5中未示出)。此时,DPU101可以通过PCIe总线、CXL总线、NVMe总线中的至少一种总线(或者其它总线)与多个计算节点中的模型训练处理器103进行耦合。并且,该多个模型训练处理器103可以是相同类型的处理器,如均为GPU等;或者,该多个模型训练处理器103可以包括多个不同类型的处理器,如包括GPU、NPU、TPU等。
下面,结合图5所述的系统架构,对本申请实施例提供的又一种数据处理方法进行示例性说明。参见图6,该方法具体可以包括:
S601:DPU101接收到针对AI模型的训练指令。
S602:DPU101根据该训练指令,从存储设备104中获取作为模型训练样本的图像数据,该图像数据包括多个编码图像。
在图5所示的系统架构中,可以是由任意计算节点上的CPU向DPU101下发训练指令,或者,图5所示的多个计算节点中可以存在代理节点,该代理节点可以负责与系统外部进行交互,如可以接收用户通过客户端下发的训练AI模型的指示信息,以及向该客户端呈现训练完成的AI模型等,从而由代理节点中的CPU向DPU101下发训练指令等。
本实施例中,多个计算节点中的模型训练处理器103可以实现对同一AI模型的分布式训练,或者不同模型训练处理器103负责训练不同的AI模型。相应地,DPU101所接收到的训练指令,可以指示利用多个计算节点上的异构处理器,以便DPU101向该训练指令指示的多个异构处理器提供经过图像数据变换操作的图像数据。另外,该训练指令还可以指示用于作为AI模型训练样本的图像数据在存储设备104上的存储位置,以便DPU101基于该训练指令中的存储位置,访问得到图像数据。
S603:DPU101对获取的图像数据执行图像解码操作,得到矩阵数据。
S604:DPU101对矩阵数据执行尺寸调整操作以及数据增强操作等图像数据变换操作,得到模型训练数据(也即经过图像数据变换操作后的矩阵数据)。
本实施例中,步骤S603至步骤S604的具体实现过程,可参见前述实施例的相关之处描述,在此不做赘述。
S605:DPU101将模型训练数据存储至DPU101中的共享缓存。
本实施例中,DPU101中可以配置有共享缓存,该共享缓存可以被多个计算节点所访问。这样,DPU101在将经过图像数据变换操作后的矩阵数据写入共享缓存后,多个计算节点中的模型训练处理器103可以从该共享缓存中获取到作为模型输入的矩阵数据。
或者,在其它可能的实施例中,DPU101也可以是通过逐个向计算节点发送经过图像数据变换操作后的矩阵数据的方式,使得各个计算节点中的模型训练处理器103获取经过图像数据变换操作后的矩阵数据。本实施例中,对于各个计算节点的模型训练处理器103获取经过图像数据变换操作后的矩阵数据的具体实现方式,并不进行限定。
S606:多个计算节点中的模型训练处理器103基于共享缓存中的矩阵数据执行AI模型的训练操作,完成AI模型的训练业务。
如此,利用DPU101对图像数据执行图像解码操作、图像数据变换操作,不仅可以减少执行该操作所需的耗时,而且,执行这些操作所生成的临时数据所占用的内存空间为DPU101的内存空间,这使得即使各个计算节点中的模型训练处理器103的本地内存的空间有限,模型训练处理器103也能具有足够的内存空间存储DPU101提供的矩阵数据以及利用该矩阵数据对AI模型进行训练,从而模型训练处理器103可以采用更大的batch size对AI模型进行迭代训练,实现AI模型的训练加速。
并且,图像数据执行图像解码操作、图像数据变换操作所生成的模型训练数据,可以存储于DPU101的共享缓存中,这使得各个计算节点中的模型训练处理器103在迭代训练AI模型的过程中,模型训练处理器103在每轮迭代时可以直接从DPU101的共享缓存中读取该模型训练数据,从而无需DPU101多次重复执行读取存储设备104中的图像数据、以及对图像数据进行图像解码和图像数据变换操作的操作,这不仅可以降低资源消耗,还能进一步加快该AI 模型训练业务的处理效率。
另外,当图像数据的数量较大时,DPU101可以分批次读取该图像数据以及执行相应的图像解码操作以及图像数据变换操作,这样,在模型训练处理器103利用当前批次的图像数据对应的矩阵数据训练AI模型时,DPU101可以从存储设备104中继续读取下一批次的图像数据并执行图像解码和图像数据变换操作,以便于模型训练处理器103在完成一次AI模型过程后,能够及时获得下一批次的模型训练数据继续训练AI模型,从而通过将图像解码操作、图像数据变换操作与模型训练操作并行化执行,可以实现进一步加快AI模型的训练效率。
值得注意的是,图6所示的数据处理方法仅作为一种示例,并不用于限定DPU101为一个或者多个模型训练处理器103提供加速训练AI模型的过程进行限定。
比如,在其它可能的实施例中,多个模型训练处理器103可以部署于同一计算节点,从而DPU101中的共享缓存中的矩阵数据可以允许被多个模型训练处理器103所访问,用于实现为多个模型训练处理器103提供加速AI模型训练的服务。
又比如,在图7所示的数据处理系统700中,可以基于多个DPU101构建存储空间更大的共享缓存池,该共享缓存池可以包括多个DPU101中的共享缓存,并且,任意DPU101对获取到的图像数据执行图像解码操作(以及图像数据变换操作)后,可以将得到的模型训练数据写入该共享缓存池中进行存储,并支持多个计算节点中的模型训练处理器103访问该共享缓存池中的模型训练数据,以此基于该多个DPU101进一步扩展模型训练处理器103的算力,实现进一步提高AI模型训练的效率。其中,图7所示的数据处理系统700是以DPU独立于计算节点进行部署为例进行示例性说明,在其它可能的实现方式中,每个计算节点中可以包括CPU、模型训练处理器以及至少一个DPU,从而可以基于多个计算节点中的多个DPU所提供的算力,进一步加速各个模型训练处理器103的AI模型训练过程。并且,基于多个DPU中的缓存,可以实现跨计算节点构建共享缓存池。或者,图7中所示的多个DPU101可以位于同一计算节点,如均位于计算节点1等,从而可以利用计算节点1中的多个DPU101不仅可以计算节点1的AI模型训练过程,也能通过共享缓存池,为计算节点2以及计算节点3提供加速AI模型训练的服务。
再比如,在图8所示的数据处理系统800中,每个DPU101可以负责对至少一个模型训练处理器103提供加速AI模型训练的服务,如DPU101-1用于为计算节点1中的模型训练处理器103提供服务,DPU101-2用于为计算节点2中的模型训练处理器103提供服务,DPU101-3用于为计算节点3中的模型训练处理器103提供服务等,并且不同DPU101之间可以进行数据交互。这样,当DPU101-2在获取到用于训练AI模型的图像数据并对该图像数据执行图像解码操作以及图像数据变换操作后,可以将得到的模型训练数据共享给DPU101-1以及DPU101-3。这样,在DPU101-2该将该模型训练数据输出给计算节点2中的模型训练处理器103进行AI模型训练时,DPU101-1可以将模型训练数据输出给计算节点1中的模型训练处理器103,DPU101-3可以将模型训练数据输出给计算节点3中的模型训练处理器103,以此实现加速训练各个计算节点中的AI模型。
再比如,在其它可能的数据处理系统中,多个DPU以及多个模型训练处理器均可以部署于同一计算节点中,并且,每个DPU可以负责对至少一个模型训练处理器提供加速AI模型训练的服务。以数据处理系统包括DPU1、DPU2、模型训练处理器1以及模型训练处理器2为例。DPU1在为模型训练处理器1提供加速AI模型训练的服务时,可以对获取图像数据进行图像解码(以及图像数据变换)等操作,得到模型训练数据。然后,DPU1不仅可以将该模型训练数据输出给模型训练处理器1,以使得模型训练处理器1基于该模型训练数据训练AI模型, 而且,DPU1还可以将该模型训练数据输出给DPU2,并由DPU2将该模型训练数据提供给模型训练处理器2,以使得模型训练处理器2也能基于该模型训练数据训练其上的AI模型。当然,数据处理系统中还包括CPU等硬件,在此不再赘述。
上述图4、图5、图7以及图8所示的数据处理系统仅作为本申请实施例提供的一些示例性说明,并不用于限定数据处理系统的具体实现。
上文中结合图1至图8,详细描述了本申请所提供的数据处理方法,下面将结合图9至图12,分别描述根据本申请所提供的DPU以及芯片系统。
与上述方法同样的发明构思,本申请实施例还提供一种数据处理装置。参见图9,为本申请实施例提供的一种DPU900的结构示意图,图9所示的DPU900,例如可以是上述各实施例中的DPU101,DPU900分别与第一CPU、第一模型训练处理器通过系统总线耦合,或者DPU900与第一模型训练处理器是一个训练卡上的不同芯片。如图9所示,DPU900包括:
通信接口901,用于获取图像数据,所述图像数据包括多个编码图像;
处理芯片902,用于对所述图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理所述图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,所述图像处理操作集合至少包括图像解码操作,所述训练操作集合至少包括模型训练操作;
输出接口电路903,用于输出所述模型训练数据,所述模型训练数据用于供所述第一模型训练处理器执行所述训练操作集合中的操作,或者,所述模型训练数据用于供所述第一CPU以及所述第一模型训练处理器执行所述训练操作集合中的操作。
在一种可能的实施方式中,所述图像处理操作集合还包括图像数据变换操作,所述处理芯片902,用于:
对所述图像数据执行所述图像解码操作,得到矩阵数据;
对所述矩阵数据执行所述图像数据变换操作,得到所述模型训练数据。
在一种可能的实施方式中,所述训练操作集合还包括图像数据变换操作,所述模型训练数据用于被所述第一CPU执行所述图像数据变换操作并得到临时数据,所述临时数据用于被所述第一模型训练处理器执行所述模型训练操作。
在一种可能的实施方式中,其特征在于,所述通信接口901,用于:
获取所述第一模型训练处理器输出的人工智能AI模型;
向本地或者远端存储设备发送所述AI模型,所述AI模型通过文件格式或者键值KV格式把所述AI模型存储于所述存储设备中。
在一种可能的实施方式中,所述输出接口电路903,还用于向DPU910输出所述模型训练数据;所述DPU910分别与第二CPU、第二模型训练处理器通过系统总线耦合,或者所述DPU910与所述第二模型训练处理器是一个训练卡上的不同芯片,所述DPU910用于:接收所述模型训练数据;向所述第二模型训练处理器输出所述模型训练数据,所述模型训练数据用于被所述第二模型训练处理器执行所述训练操作集合中的操作。
此外,本申请实施例还提供了另一种DPU,如图10所示。其中,图10所示的DPU1000分别与中央处理器CPU、多个模型训练处理器通过系统总线耦合,或者所述DPU1000与所述多个模型训练处理器是一个训练卡上的不同芯片,所述数据处理装置1000包括:
通信接口1001,用于获取图像数据,所述图像数据包括多个编码图像;
处理芯片1002,用于对所述图像数据执行图像处理操作集合中的操作,得到模型训练数 据,其中,处理所述图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,所述图像处理操作集合至少包括图像解码操作,所述训练操作集合至少包括模型训练操作;
数据读写接口1003,用于将所述模型训练数据写入供所述多个模型训练处理器访问的共享缓存,所述共享缓存中的模型训练数据用于供所述多个模型训练处理器执行所述训练操作集合中的操作,或者,所述共享缓存中的模型训练数据用于供所述CPU以及所述多个模型训练处理器执行所述训练操作集合中的操作。
在另一个实施例中,图10所示的DPU1000包括:
通信接口1001,用于获取图像数据,所述图像数据包括多个编码图像;
处理芯片1002,用于对所述图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理所述图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,所述图像处理操作集合至少包括图像解码操作,所述训练操作集合至少包括模型训练操作;
数据读写接口1003,用于将所述模型训练数据写入基于多个DPU中的缓存所构建的共享缓存池,所述多个DPU包括所述DPU1000,所述共享缓存池中的模型训练数据用于供所述模型训练处理器执行所述训练操作集合中的操作,或者,所述共享缓存池中的模型训练数据用于供所述CPU以及所述模型训练处理器执行所述训练操作集合中的操作。
与上述方法同样的发明构思,本申请实施例还提供一种数据处理装置,该数据处理装置应用于DPU,如上述各实施例中的DPU101,该DPU分别与CPU、模型训练处理器通过系统总线耦合,或者DPU与模型训练处理器是一个训练卡上的不同芯片。数据处理装置包括:
通信模块,用于获取图像数据,所述图像数据包括多个编码图像;
处理模块,用于对所述图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理所述图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,所述图像处理操作集合至少包括图像解码操作,所述训练操作集合至少包括模型训练操作;
输出模块,用于输出所述模型训练数据,所述模型训练数据用于供所述模型训练处理器执行所述训练操作集合中的操作,或者,所述模型训练数据用于供所述CPU以及所述模型训练处理器执行所述训练操作集合中的操作。
在一种可能的实施方式中,所述图像处理操作集合还包括图像数据变换操作,所述处理模块,用于:
对所述图像数据执行所述图像解码操作,得到矩阵数据;
对所述矩阵数据执行所述图像数据变换操作,得到所述模型训练数据。
在一种可能的实施方式中,所述训练操作集合还包括图像数据变换操作,所述模型训练数据用于被所述CPU执行所述图像数据变换操作并得到临时数据,所述临时数据用于被所述模型训练处理器执行所述模型训练操作。
在一种可能的实施方式中,其特征在于,所述通信模块,用于:
获取所述模型训练处理器输出的人工智能AI模型;
向本地或者远端存储设备发送所述AI模型,所述AI模型通过文件格式或者键值KV格式把所述AI模型存储于所述存储设备中。
在一种可能的实施方式中,所述输出模块,还用于向其它DPU输出所述模型训练数据;所述其它DPU分别与其它CPU、其它模型训练处理器通过系统总线耦合,或者所述其它DPU与所述其它模型训练处理器是一个训练卡上的不同芯片,所述其它DPU用于:接收所述模型 训练数据;向所述其它模型训练处理器输出所述模型训练数据,所述模型训练数据用于被所述其它模型训练处理器执行所述训练操作集合中的操作。
此外,本申请实施例还提供了另一种数据处理装置,该数据处理装置应用于数据处理单元DPU,所述DPU分别与中央处理器CPU、多个模型训练处理器通过系统总线耦合,或者所述DPU与所述多个模型训练处理器是一个训练卡上的不同芯片,所述数据处理装置包括:
通信模块,用于获取图像数据,所述图像数据包括多个编码图像;
处理模块,用于对所述图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理所述图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,所述图像处理操作集合至少包括图像解码操作,所述训练操作集合至少包括模型训练操作;
数据写入模块,用于将所述模型训练数据写入供所述多个模型训练处理器访问的共享缓存,所述共享缓存中的模型训练数据用于供所述多个模型训练处理器执行所述训练操作集合中的操作,或者,所述共享缓存中的模型训练数据用于供所述CPU以及所述多个模型训练处理器执行所述训练操作集合中的操作。
在另一个实施例中,数据处理装置应用于目标数据处理单元DPU,所述DPU分别与中央处理器CPU、模型训练处理器通过系统总线耦合,或者所述目标DPU与所述模型训练处理器是一个训练卡上的不同芯片,所述数据处理装置包括:
通信模块,用于获取图像数据,所述图像数据包括多个编码图像;
处理模块,用于对所述图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理所述图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,所述图像处理操作集合至少包括图像解码操作,所述训练操作集合至少包括模型训练操作;
数据写入模块,用于将所述模型训练数据写入基于多个DPU中的缓存所构建的共享缓存池,所述多个DPU包括所述目标DPU,所述共享缓存池中的模型训练数据用于供所述模型训练处理器执行所述训练操作集合中的操作,或者,所述共享缓存池中的模型训练数据用于供所述CPU以及所述模型训练处理器执行所述训练操作集合中的操作。
本申请实施例还提供一种DPU,如图11所示,DPU1100中可以包括通信接口1110、处理器1120。可选的,DPU1100中还可以包括存储器1130。其中,存储器1130可以设置于DPU1100内部,还可以设置于DPU1100外部。示例性地,上述图3以及图4所示实施例中DPU101执行的各个动作均可以由处理器1120实现。处理器1120可以通过通信接口1110获取图像数据,并用于实现图2、图3以及图6中所执行的任一方法。在实现过程中,处理流程的各步骤可以通过处理器1120中的硬件的集成逻辑电路或者软件形式的指令完成图2、图3以及图6中执行的方法。为了简洁,在此不再赘述。处理器1120用于实现上述方法所执行的程序代码可以存储在存储器1130中。存储器1130和处理器1120连接,如耦合连接等。
本申请实施例的一些特征可以由处理器1120执行存储器1130中的程序指令或者软件代码来完成/支持。存储器1130上在加载的软件组件可以从功能或者逻辑上进行概括,例如,图9所示的处理芯片902、输出接口电路903,又例如图10所示的处理芯片1002、数据读写接口1003。而图9所示的通信接口901、图10所示的通信接口1001的功能可以由通信接口1110实现。
本申请实施例中涉及到的任一通信接口可以是电路、总线、收发器或者其它任意可以用于进行信息交互的装置。比如DPU1100中的通信接口1110,示例性地,该其它装置可以是与该DPU1100相连的设备等。
本申请实施例中涉及的处理器可以是通用处理器、数字信号处理器、专用集成电路、现 场可编程门阵列或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件,可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件处理器执行完成,或者用处理器中的硬件及软件模块组合执行完成。
本申请实施例中的耦合是装置、模块或模块之间的间接耦合或通信连接,可以是电性,机械或其它的形式,用于装置、模块或模块之间的信息交互。
处理器可能和存储器协同操作。存储器可以是非易失性存储器,比如硬盘或固态硬盘等,还可以是易失性存储器,例如随机存取存储器。存储器是能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。
本申请实施例中不限定上述通信接口、处理器以及存储器之间的具体连接介质。比如存储器、处理器以及通信接口之间可以通过总线连接。所述总线可以分为地址总线、数据总线、控制总线等。
本申请实施例还提供一种芯片系统,如图12所示,芯片系统1200可以包括供电电路1201以及处理电路1202,其中,供电电路1201用于对处理电路1202进行供电,处理电路用于执行如下操作步骤:
获取图像数据,所述图像数据包括多个编码图像;
对所述图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理所述图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,所述图像处理操作集合至少包括图像解码操作,所述训练操作集合至少包括模型训练操作;
输出所述模型训练数据,所述模型训练数据用于供所述模型训练处理器执行所述训练操作集合中的操作,或者,所述模型训练数据用于供所述CPU以及所述模型训练处理器执行所述训练操作集合中的操作。
实际应用时，供电电路1201可以与处理电路1202位于同一个芯片内，或者，供电电路1201可以位于处理电路1202所在芯片之外的另一个芯片内。供电电路1201包括但不限于如下至少一个：供电子系统、电源管理芯片、功耗管理处理器或功耗管理控制电路。
基于以上实施例,本申请实施例还提供了一种计算机存储介质,该存储介质中存储软件程序,该软件程序在被一个或多个处理器读取并执行时可实现上述任意一个或多个实施例提供的数据处理系统100执行的方法。所述计算机存储介质可以包括:U盘、移动硬盘、只读存储器、随机存取存储器、磁碟或者光盘等各种可以存储程序代码的介质。
本领域内的技术人员应明白,本申请的实施例可提供为方法、装置、系统、存储介质或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工 作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。
显然,本领域的技术人员可以对本申请实施例进行各种改动和变型而不脱离本申请实施例的范围。这样,倘若本申请实施例的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。

Claims (27)

  1. 一种数据处理方法,其特征在于,数据处理单元DPU分别与中央处理器CPU、模型训练处理器通过系统总线耦合,或者所述DPU与所述模型训练处理器是一个训练卡上的不同芯片,所述方法包括:
    所述DPU获取图像数据,所述图像数据包括多个编码图像;
    所述DPU对所述图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理所述图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,所述图像处理操作集合至少包括图像解码操作,所述训练操作集合至少包括模型训练操作;
    所述DPU输出所述模型训练数据,所述模型训练数据用于供所述模型训练处理器执行所述训练操作集合中的操作,或者,所述模型训练数据用于供所述CPU以及所述模型训练处理器执行所述训练操作集合中的操作。
  2. 根据权利要求1所述的方法,其特征在于,所述DPU包括网络接口,所述DPU获取所述图像数据具体包括:
    所述DPU通过有线网络或者无线网络获取所述图像数据。
  3. 根据权利要求2所述的方法,其特征在于,所述有线网络是以太网或者无线带宽网。
  4. 根据权利要求1至3任一项所述的方法,其特征在于,所述DPU与存储设备连接,所述DPU获取图像数据具体包括:
    所述DPU从所述存储设备获取所述图像数据。
  5. 根据权利要求4所述的方法,其特征在于,所述存储设备包括:
    硬盘驱动器HDD、闪存介质驱动器、叠瓦式磁记录SMR、存储阵列、存储服务器中的一种或多种。
  6. 根据权利要求4或5所述的方法,其特征在于,所述存储设备与所述DPU之间的通信协议包括:小型计算机系统接口SCSI协议、串行连接小型计算机系统接口SAS协议、外围元件快速互连PCIe协议、通用串行总线USB协议、快速非易失性存储器NVMe协议中的一种或多种。
  7. 根据权利要求1至6任一项所述的方法,其特征在于,所述DPU、CPU以及所述模型训练处理器位于同一个服务器。
  8. 根据权利要求1至7任一项所述的方法,其特征在于,所述模型训练处理器为图形处理器GPU、神经网络处理器NPU、张量处理器TPU中的一种或者多种。
  9. 根据权利要求1至8任一项所述的方法,其特征在于,所述系统总线包括外围元件快速互连PCIe总线、计算快速链接CXL总线、快速非易失性存储器NVMe总线中的一种或多种。
  10. 根据权利要求1至9任一项所述的方法,其特征在于,所述图像处理操作集合还包括图像数据变换操作,所述DPU对所述图像数据执行图像处理操作集合中的操作,得到模型训练数据,包括:
    所述DPU对所述图像数据执行所述图像解码操作,得到矩阵数据;
    所述DPU对所述矩阵数据执行所述图像数据变换操作,得到所述模型训练数据。
  11. 根据权利要求1至9任一项所述的方法,其特征在于,所述训练操作集合还包括图像数据变换操作,所述模型训练数据用于被所述CPU执行所述图像数据变换操作并得到临时数据,所述临时数据用于被所述模型训练处理器执行所述模型训练操作。
  12. 根据权利要求1至11任一项所述的方法,其特征在于,所述方法还包括:
    所述DPU获取所述模型训练处理器输出的人工智能AI模型;
    所述DPU向本地或者远端存储设备发送所述AI模型,所述AI模型通过文件格式或者键值KV格式把所述AI模型存储于所述存储设备中。
  13. 根据权利要求1至12任一项所述的方法,其特征在于,所述DPU输出所述模型训练数据,包括:
    所述DPU向其它DPU输出所述模型训练数据;
    其中,所述其它DPU分别与其它CPU、其它模型训练处理器通过系统总线耦合,或者所述其它DPU与所述其它模型训练处理器是一个训练卡上的不同芯片,所述其它DPU用于接收所述模型训练数据,并向所述其它模型训练处理器输出所述模型训练数据,所述模型训练数据用于被所述其它模型训练处理器执行所述训练操作集合中的操作。
  14. 一种数据处理方法,其特征在于,数据处理单元DPU分别与中央处理器CPU、多个模型训练处理器通过系统总线耦合,或者所述DPU与所述多个模型训练处理器是一个训练卡上的不同芯片,包括:
    所述DPU获取图像数据,所述图像数据包括多个编码图像;
    所述DPU对所述图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理所述图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,所述图像处理操作集合至少包括图像解码操作,所述训练操作集合至少包括模型训练操作;
    所述DPU将所述模型训练数据写入供所述多个模型训练处理器访问的共享缓存,所述共享缓存中的模型训练数据用于供所述多个模型训练处理器执行所述训练操作集合中的操作,或者,所述共享缓存中的模型训练数据用于供所述CPU以及所述多个模型训练处理器执行所述训练操作集合中的操作。
  15. 一种数据处理方法,其特征在于,目标数据处理单元DPU分别与中央处理器CPU、模型训练处理器通过系统总线耦合,或者所述目标DPU与所述模型训练处理器是一个训练卡上的不同芯片,包括:
    所述目标DPU获取图像数据,所述图像数据包括多个编码图像;
    所述目标DPU对所述图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理所述图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,所述图像处理操作集合至少包括图像解码操作,所述训练操作集合至少包括模型训练操作;
    所述目标DPU将所述模型训练数据写入基于多个DPU中的缓存所构建的共享缓存池,所述多个DPU包括所述目标DPU,所述共享缓存池中的模型训练数据用于供所述模型训练处理器执行所述训练操作集合中的操作,或者,所述共享缓存池中的模型训练数据用于供所述CPU以及所述模型训练处理器执行所述训练操作集合中的操作。
  16. 一种第一数据处理单元DPU,其特征在于,所述第一DPU分别与第一中央处理器CPU、第一模型训练处理器通过系统总线耦合,或者所述第一DPU与所述第一模型训练处理器是一个训练卡上的不同芯片,所述第一DPU包括:
    通信接口,用于获取图像数据,所述图像数据包括多个编码图像;
    处理芯片,用于对所述图像数据执行图像处理操作集合中的操作,得到模型训练数据,其中,处理所述图像数据的操作包括图像处理操作集合中的操作以及训练操作集合中的操作,所述图像处理操作集合至少包括图像解码操作,所述训练操作集合至少包括模型训练操作;
    输出接口电路,用于输出所述模型训练数据,所述模型训练数据用于供所述第一模型训练处理器执行所述训练操作集合中的操作,或者,所述模型训练数据用于供所述第一CPU以及所述第一模型训练处理器执行所述训练操作集合中的操作。
  17. 根据权利要求16任一项所述的第一DPU,其特征在于,所述图像处理操作集合还包括图像数据变换操作,所述处理芯片,用于:
    对所述图像数据执行所述图像解码操作,得到矩阵数据;
    对所述矩阵数据执行所述图像数据变换操作,得到所述模型训练数据。
  18. 根据权利要求16或17所述的第一DPU,其特征在于,所述训练操作集合还包括图像数据变换操作,所述模型训练数据用于被所述第一CPU执行所述图像数据变换操作并得到临时数据,所述临时数据用于被所述第一模型训练处理器执行所述模型训练操作。
  19. 根据权利要求16至18任一项所述的第一DPU,其特征在于,所述通信接口,用于:
    获取所述第一模型训练处理器输出的人工智能AI模型;
    向本地或者远端存储设备发送所述AI模型,所述AI模型通过文件格式或者键值KV格式把所述AI模型存储于所述存储设备中。
  20. 根据权利要求16至19任一项所述的第一DPU,其特征在于,所述输出接口电路,还用于向第二DPU输出所述模型训练数据;
    所述第二DPU分别与第二CPU、第二模型训练处理器通过系统总线耦合,或者所述第二DPU与所述第二模型训练处理器是一个训练卡上的不同芯片,所述第二DPU用于:
    接收所述模型训练数据;
    向所述第二模型训练处理器输出所述模型训练数据,所述模型训练数据用于被所述第二模型训练处理器执行所述训练操作集合中的操作。
  21. A data processing unit (DPU), wherein the DPU is coupled to a central processing unit (CPU) and a plurality of model training processors separately through a system bus, or the DPU and the plurality of model training processors are different chips on one training card, and the DPU comprises:
    a communication interface, configured to obtain image data, wherein the image data comprises a plurality of encoded images;
    a processing chip, configured to perform operations in an image processing operation set on the image data to obtain model training data, wherein operations for processing the image data comprise the operations in the image processing operation set and operations in a training operation set, the image processing operation set comprises at least an image decoding operation, and the training operation set comprises at least a model training operation; and
    a data read/write interface, configured to write the model training data into a shared cache accessible to the plurality of model training processors, wherein the model training data in the shared cache is used by the plurality of model training processors to perform the operations in the training operation set, or the model training data in the shared cache is used by the CPU and the plurality of model training processors to perform the operations in the training operation set.
  22. A target data processing unit (DPU), wherein the target DPU is coupled to a central processing unit (CPU) and a model training processor separately through a system bus, or the target DPU and the model training processor are different chips on one training card, and the target DPU comprises:
    a communication interface, configured to obtain image data, wherein the image data comprises a plurality of encoded images;
    a processing chip, configured to perform operations in an image processing operation set on the image data to obtain model training data, wherein operations for processing the image data comprise the operations in the image processing operation set and operations in a training operation set, the image processing operation set comprises at least an image decoding operation, and the training operation set comprises at least a model training operation; and
    a data read/write interface, configured to write the model training data into a shared cache pool constructed based on caches in a plurality of DPUs, wherein the plurality of DPUs comprise the target DPU, and the model training data in the shared cache pool is used by the model training processor to perform the operations in the training operation set, or the model training data in the shared cache pool is used by the CPU and the model training processor to perform the operations in the training operation set.
  23. A data processing unit (DPU), wherein the DPU is configured to perform the method performed by the DPU according to any one of claims 1 to 14.
  24. A data processing system, wherein the data processing system comprises the DPU according to any one of claims 1 to 14, a CPU, and a model training processor.
  25. A chip system, wherein the chip system comprises a power supply circuit and a processing circuit, the power supply circuit is configured to supply power to the processing circuit, and the processing circuit is configured to perform the method performed by the DPU according to any one of claims 1 to 14.
  26. A computer-readable storage medium, wherein the computer-readable storage medium comprises instructions, and the instructions are used to implement the method performed by the DPU according to any one of claims 1 to 14.
  27. A computer program product comprising instructions, wherein when the computer program product runs on a data processing unit (DPU), the DPU is enabled to perform the method performed by the DPU according to any one of claims 1 to 14.
PCT/CN2023/078189 2022-04-29 2023-02-24 Data processing method, data processing unit, system, and related device WO2023207295A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210473780 2022-04-29
CN202210473780.6 2022-04-29
CN202210934646.1A CN117011117A (zh) 2022-04-29 2022-08-04 Data processing method, data processing unit, system, and related device
CN202210934646.1 2022-08-04

Publications (1)

Publication Number Publication Date
WO2023207295A1 (zh)

Family

ID=88517244

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/078189 2022-04-29 2023-02-24 Data processing method, data processing unit, system, and related device WO2023207295A1 (zh)

Country Status (1)

Country Link
WO (1) WO2023207295A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180322383A1 (en) * 2017-05-02 2018-11-08 International Business Machines Corporation Storage controller accelaration for neural network training and inference
CN110892380A (zh) * 2017-07-10 2020-03-17 芬基波尔有限责任公司 Data processing unit for stream processing
CN110516817A (zh) * 2019-09-03 2019-11-29 北京华捷艾米科技有限公司 Model training data loading method and apparatus
CN110675587A (zh) * 2019-09-25 2020-01-10 深圳市中电数通智慧安全科技股份有限公司 Fire early-warning method and apparatus, terminal, and readable storage medium
CN111814959A (zh) * 2020-06-30 2020-10-23 北京百度网讯科技有限公司 Model training data processing method, apparatus, system, and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117614018A (zh) * 2024-01-24 2024-02-27 Wind power cluster collaborative intelligent management method and system
CN117614018B (zh) * 2024-01-24 2024-04-16 Wind power cluster collaborative intelligent management method and system

Similar Documents

Publication Publication Date Title
JP6974270B2 (ja) Intelligent high-bandwidth memory system and logic die therefor
CN110647480B (zh) Data processing method, remote direct memory access network interface card, and device
WO2021244194A1 (zh) Register read/write method, chip, subsystem, register bank, and terminal
CN109960671B (zh) Data transmission system and method, and computer device
US11868665B2 (en) Data processing near data storage
CN114201421B (zh) Data stream processing method, storage control node, and readable storage medium
WO2023207295A1 (zh) Data processing method, data processing unit, system, and related device
KR20220164570A (ko) Edge server with deep learning accelerator and random access memory
US20210400286A1 (en) Video Compression in Removable Storage Device having Deep Learning Accelerator and Random Access Memory
WO2023104194A1 (zh) Service processing method and apparatus
US11635904B2 (en) Matrix storage method, matrix access method, apparatus and electronic device
CN115129621B (zh) Memory management method, device, and medium, and memory management module
CN115934623B (зh) Data processing method, device, and medium based on remote direct memory access
CN114490023A (zh) ARM- and FPGA-based computational storage device for high-energy physics
WO2023134735A1 (зh) Computing device, data processing method, system, and related device
CN115079936A (зh) Data writing method and apparatus
WO2023124304A1 (зh) Cache system of chip, data processing method, device, storage medium, and chip
WO2023124428A1 (зh) Chip, accelerator card, electronic device, and data processing method
US11847049B2 (en) Processing system that increases the memory capacity of a GPGPU
CN115344393A (зh) Service processing method and related device
CN117011117A (зh) Data processing method, data processing unit, system, and related device
CN116601616A (зh) Data processing apparatus and method, and related device
WO2023134588A1 (зh) Computing system, method, and apparatus, and acceleration device
CN212873459U (зh) System for data compression and storage
CN115473861B (зh) High-performance processing system and method based on separation of communication and computation, and storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23794731

Country of ref document: EP

Kind code of ref document: A1