CN116069511A - System for deep learning, method for processing data and electronic equipment - Google Patents

Info

Publication number: CN116069511A
Application number: CN202310230370.3A
Authority: CN (China)
Prior art keywords: processor, data, result, random access, hidden layer
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 刘伟, 宿栋栋, 沈艳梅, 阚宏伟
Current Assignee: Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee: Suzhou Inspur Intelligent Technology Co Ltd
Application filed by Suzhou Inspur Intelligent Technology Co Ltd

Classifications

    • G06F 9/5011: Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5027: Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5066: Partitioning or combining of resources; algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • G06F 13/387: Information transfer, e.g. on bus, using a universal interface adapter for adaptation of different data processing systems to different peripheral devices, e.g. protocol converters for incompatible systems, open systems
    • G06F 13/4022: Bus structure; coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • G06F 13/4068: Bus structure; device-to-bus coupling; electrical coupling
    • G06N 3/04: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06F 2213/0026: PCI express (indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Multi Processors (AREA)

Abstract

Embodiments of the present application provide a system for deep learning, a method for processing data, and an electronic device. The system comprises: a dynamic random access memory connected to the processors through a data bus; and a plurality of processors configured to read input data from the dynamic random access memory through the data bus and run a deep learning model to analyze the input data and obtain an output result, wherein each processor is provided with an RDMA network interface, and communication connections are established between the RDMA network interfaces for transmitting, between the processors, intermediate processing results of at least one hidden layer of the deep learning model on the input data. The present application reduces the dependence on the data bus, avoids contention among multiple processors for the processing resources of the data bus, lowers the load on the data bus, increases the running speed of the system, and improves data processing efficiency.

Description

System for deep learning, method for processing data and electronic equipment
Technical Field
The embodiment of the application relates to the field of deep learning, in particular to a system for deep learning, a method for processing data and electronic equipment.
Background
In the related art, in application scenarios where a deep learning model is run by multiple processors, most of the data must be transferred over a data bus during operation. When several processors apply for the data bus at the same time, the data bus becomes the bottleneck of the whole system: for example, a given GPU may have to wait for the data transfers of the other three GPUs to complete before it can start working, which greatly reduces the operation speed of the whole system.
Disclosure of Invention
Embodiments of the present application provide a system for deep learning, a method for processing data, and an electronic device, so as to at least solve the problem in the related art that the heavy dependence of processors on the data bus during deep learning operations leads to low operation speed and low efficiency.
According to one embodiment of the present application, a system for deep learning is provided, comprising: a dynamic random access memory connected to the processors through a data bus and configured at least to send input data to the processors through the data bus for processing; and a plurality of processors configured to read the input data from the dynamic random access memory through the data bus, run a deep learning model to analyze the input data and obtain an output result, and return the output result to the dynamic random access memory, wherein each processor is provided with a remote direct memory access (RDMA) network interface, and communication connections are established between the RDMA network interfaces for transmitting, between the processors, intermediate processing results of at least one hidden layer of the deep learning model on the input data.
Optionally, the RDMA network interfaces in the processors establish communication connections between the processors in a serial form, wherein the plurality of processors at least includes: an input processor that reads the input data from the dynamic random access memory over the data bus, and an output processor that sends the output result to the dynamic random access memory over the data bus.
Optionally, each processor has at least two RDMA network interfaces, wherein at least one idle RDMA network interface exists in each of the input processor and the output processor, and the RDMA network interfaces other than the idle ones establish communication connections between the processors in a serial form.
Optionally, the input processor and the output processor each have one RDMA network interface, the other processors each have two RDMA network interfaces, and the RDMA network interfaces establish communication connections between the processors in a serial form.
Optionally, each processor is configured to transmit the result data of the hidden layer it has computed to the next processor connected in series with it, and so on, until the n-th result data of the n-th hidden layer is obtained in the output processor and sent to the dynamic random access memory as the output result.
Optionally, the input processor is configured to analyze the input data to obtain first result data of the first hidden layer; the first processor receives the first result data through its RDMA network interface, analyzes the first result data to obtain second result data of the second hidden layer, and transmits the second result data to the second processor through its RDMA network interface; the second processor analyzes the second result data to obtain third result data of the third hidden layer; and so on, until the n-th result data of the n-th hidden layer is obtained in the output processor and sent to the dynamic random access memory as the output result, wherein the input processor, the first processor, the second processor, and the output processor are connected in series through the RDMA network interfaces.
Optionally, the RDMA network interfaces establish communication connections between the processors in a combined serial and parallel form, wherein the plurality of processors at least includes: an input processor that reads the input data from the dynamic random access memory over the data bus, and an output processor that sends the output result to the dynamic random access memory over the data bus.
Optionally, each processor is configured to transmit the result data of the hidden layer it has computed to a next-stage processor set connected in series with it, the processors within the next-stage set being connected in parallel, and so on, until the n-th result data of the n-th hidden layer is obtained in the output processor and sent to the dynamic random access memory as the output result.
Optionally, the input processor is configured to analyze the input data to obtain first result data of the first hidden layer, divide the first result data into two parts to obtain first data and second data, and transmit the first data and the second data through the RDMA network interfaces to a first processor and a second processor connected in series with the input processor, respectively; the first processor is configured to process the first data to obtain second result data of the second hidden layer, and the second processor is configured to process the second data to obtain third result data of the second hidden layer, the first processor and the second processor being connected in parallel; the first processor transmits the second result data through the RDMA network interface to a third processor, which analyzes the second result data to obtain fourth result data of the third hidden layer; the second processor transmits the third result data through the RDMA network interface to a fourth processor, which analyzes the third result data to obtain fifth result data of the third hidden layer, the third processor and the fourth processor being connected in parallel; and so on, until the n-th result data of the n-th hidden layer is obtained in the output processor and sent to the dynamic random access memory as the output result.
Optionally, the target number of RDMA network interfaces configured on a processor is a dynamic value that is adjusted according to the data to be processed by that processor.
Optionally, the processor is further configured to determine the target number according to the load of the data to be processed, wherein the greater the load, the larger the target number.
Optionally, the system further comprises: and the central processing unit is used for writing the input data into the dynamic random access memory.
Optionally, the computational power of each processor is the same, and the processor includes at least one of: a graphics processor GPU or a field programmable gate array FPGA.
Optionally, the dynamic random access memory includes: double data rate (DDR) synchronous dynamic random access memory.
Optionally, the processor is further configured to return the output result to the dynamic random access memory through the data bus.
There is also provided, in accordance with an embodiment of the present application, a method of processing data, including: a processor set receives input data from a dynamic random access memory through a data bus; a deep learning model is run to analyze the input data and obtain an output result; and the output result is returned to the dynamic random access memory, wherein each processor in the processor set is provided with a remote direct memory access (RDMA) network interface, and communication connections are established between the RDMA network interfaces for transmitting, between the processors, intermediate processing results of at least one hidden layer of the deep learning model on the input data.
Optionally, the target number of RDMA network interfaces configured on the processor is a dynamic value that is dynamically adjusted according to the data to be processed by the processor.
Optionally, the target number is determined by: acquiring the load of the data to be processed by the processor; and determining the target number according to that load, wherein the greater the load, the larger the target number.
Optionally, the RDMA network interfaces in the processors establish communication connection between the processors in a serial connection manner, each processor is respectively used for transmitting the result data of the hidden layer obtained by self processing to the next processor in serial connection with the processor, and so on until the nth result data of the nth hidden layer is obtained in the output processor, and the nth result data is sent to the dynamic random access memory as an output result.
Optionally, the RDMA network interfaces establish communication connections between the processors in a combined serial and parallel form, and each processor transmits the result data of the hidden layer it has computed to a next-stage processor set connected in series with it, the processors within the next-stage set being connected in parallel, until the n-th result data of the n-th hidden layer is obtained in the output processor and sent to the dynamic random access memory as the output result.
Optionally, the processor comprises at least one of: a graphics processor (GPU) or a field programmable gate array (FPGA), and the dynamic random access memory comprises: double data rate (DDR) synchronous dynamic random access memory.
According to an embodiment of the present application, there is also provided an apparatus for processing data, including: a receiving module configured to receive, by a processor set, input data from a dynamic random access memory through a data bus; an operation module configured to run a deep learning model to analyze the input data and obtain an output result; and a return module configured to return the output result to the dynamic random access memory through the data bus, wherein each processor in the processor set is provided with a remote direct memory access (RDMA) network interface, and communication connections are established between the RDMA network interfaces for transmitting, between the processors, intermediate processing results of at least one hidden layer of the deep learning model on the input data.
According to an embodiment of the present application, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of any one of the method embodiments.
There is also provided, in accordance with an embodiment of the present application, an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps in any one of the method embodiments when executing the computer program.
According to the present application, the RDMA network interfaces can be used to connect the processors in series and in parallel, and the various kinds of intermediate data produced while running the deep learning model can be transmitted over the RDMA network interfaces. The load that the data bus must handle is thereby greatly reduced: the data bus is only used to transmit the input data in the initial stage and the output data in the final stage. This reduces the dependence on the data bus, avoids contention among multiple processors for the processing resources of the data bus, lowers the load on the data bus, increases the running speed of the system, and achieves the technical effect of improving data processing efficiency.
Drawings
FIG. 1 is a schematic diagram of a calculation flow of deep learning in the related art;
FIG. 2 is a schematic diagram of a computing flow in which the computing power is distributed to a plurality of GPUs for pipeline operation in the related art;
FIG. 3 is a schematic diagram of a system for deep learning according to the present application;
FIG. 4 is a schematic diagram of a system architecture for cascading FPGAs via RDMA in an embodiment of the present application;
FIG. 5 is a schematic diagram of a system architecture for connecting FPGAs in series and in parallel via RDMA in an embodiment of the present application;
FIG. 6 is a schematic diagram of an application scenario when the target number of RDMA network interfaces is three in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a deep learning model in a related embodiment of the present application;
FIG. 8 is a schematic diagram of connection relationships of FPGAs in an embodiment of the present application;
FIG. 9 is a schematic diagram of the initialization flow performed by the process for thread A and thread B in the embodiment of the present application;
FIG. 10 is a schematic diagram of the execution flow of thread A on the CPU in the embodiment of the present application;
FIG. 11 is a schematic diagram of a process flow of an initiator when an RDMA network interface performs data transmission in the related art;
FIG. 12 is a schematic diagram of a flow chart of an RDMA network interface in data transfer according to an embodiment of the present application;
FIG. 13 is a flow chart of an alternative method of processing data according to an embodiment of the present application;
FIG. 14 is a schematic diagram of an alternative apparatus for processing data according to an embodiment of the present application;
fig. 15 is a block diagram of an alternative electronic device 1500 according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings in conjunction with the embodiments.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Fig. 1 is a schematic diagram of a calculation flow of deep learning in the related art. Circles in the figure represent data, and arrows represent the relationships between the data and the calculation process. The original data is stored in the input layer, and after calculation by several hidden layers, the final calculation result is placed in the output layer. A hidden layer can be a convolution layer, a pooling layer, a fully connected layer, another operation, or a combination of the above.
In an actual computing scenario, the computation is performed by a CPU/GPU/FPGA or another device, and the various data (including the input, the output, and the data of each intermediate layer) are stored in the host memory or in the internal memory of the GPU/FPGA. Within the same host, data is transferred between the various devices/memories over the PCIE bus. FIG. 2 shows a typical deep learning application scenario (corresponding to the layers of FIG. 1), in which all the GPUs can be replaced with FPGAs. The calculation process (corresponding to the reference numerals in FIG. 2) is as follows:
(1) The CPU writes the input data into the host memory.
(2) The GPU1 reads input data from the host memory through the PCIE bus, performs (convolution, pooling, etc.) operation of the hidden layer 1, and obtains result data of the hidden layer 1.
(3) The GPU2 reads the data of the hidden layer 1 from the GPU1 through the PCIE bus, performs (convolution, pooling, etc.) operation of the hidden layer 2, and obtains the result data of the hidden layer 2.
(4) The GPU3 reads the data of the hidden layer 2 from the GPU2 through the PCIE bus, performs (convolution, pooling, etc.) operation of the hidden layer 3, and obtains the result data of the hidden layer 3.
(5) And the GPU4 reads the data of the hidden layer 3 from the GPU3 through the PCIE bus, performs final operation, and obtains final result data.
(6) GPU4 writes the result data to the host memory through the PCIE bus.
The existing scheme shown in fig. 2 distributes the computation across multiple GPUs and can be pipelined. For example, the CPU can continuously provide new input data while GPU1 performs the hidden layer 1 operation on the current batch of data and GPU2 performs the hidden layer 2 operation on the previous batch. At a given point in time, the scenario in fig. 2 can thus compute four batches of data simultaneously, making full use of the computing power of the devices. All GPUs in fig. 2 can be equivalently replaced by FPGAs.
However, the scheme shown in fig. 2 suffers from a problem when executed: almost every step uses the PCIE bus for data transfer (except step (1)), and as the whole process is pipelined, all four GPUs may apply to use the PCIE bus at the same time. The PCIE bus then becomes the bottleneck of the whole system; for example, a given GPU may have to wait for the data transfers of the other three GPUs to complete before it can start working, which greatly reduces the operation speed of the whole system. The present invention is directed to solving this problem.
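As a rough illustration of why bus contention hurts a pipelined multi-processor setup, the following minimal sketch (not part of the patent; the stage count and all timing figures are assumptions chosen for the example) compares the worst-case per-step time when every stage shares one bus against the case where each hop has its own link:

```cpp
// Hypothetical back-of-the-envelope model: if every pipeline stage must move its
// layer data over one shared PCIE bus, transfers serialize, and a stage may have
// to wait for all other stages' transfers before its own can start.
#include <cstdio>

int main() {
    const int    kStages     = 4;    // GPUs/FPGAs in the pipeline (assumed)
    const double kTransferMs = 2.0;  // per-stage bus transfer time (assumed)
    const double kComputeMs  = 5.0;  // per-stage compute time (assumed)

    // Dedicated per-hop links (e.g. RDMA between stages): transfers overlap,
    // so one pipeline step is bounded by max(compute, transfer) of a single stage.
    double dedicated_step = (kComputeMs > kTransferMs) ? kComputeMs : kTransferMs;

    // One shared bus: in the worst case the kStages transfers are serialized,
    // so a stage's step time grows by the other stages' transfer times as well.
    double shared_step = kComputeMs + kStages * kTransferMs;

    std::printf("step time, dedicated links: %.1f ms\n", dedicated_step);
    std::printf("step time, shared bus (worst case): %.1f ms\n", shared_step);
    return 0;
}
```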
The present application uses an RDMA network to connect FPGAs in series and in parallel, which solves the problem that, in the multi-GPU/FPGA case, contention for the PCIE bus reduces the operation speed of the whole deep learning system.
Technical terms or expressions that may be involved in the embodiments of the present application are explained as follows:
1. GPU (graphics processing unit): also known as a display core, vision processor, or display chip, a microprocessor dedicated to image and graphics related operations on personal computers, workstations, gaming machines, and some mobile devices (e.g., tablet computers, smartphones, etc.).
2. DDR (Double Data Rate): double data rate synchronous dynamic random access memory. Strictly speaking, DDR should be called DDR SDRAM, but it is commonly known as DDR; SDRAM is the abbreviation of Synchronous Dynamic Random Access Memory.
3. PCIE (PCI-Express): a general bus specification advocated and promoted by Intel. Its ultimate design purpose is to replace the bus transmission interfaces in existing computer systems, covering not only display interfaces but also many application interfaces such as the CPU, PCI, HDD, and the network.
4. RDMA (Remote Direct Memory Access): remote direct memory access, by which a local node (a computer or embedded device) can "directly" access the memory of a remote node. "Directly" means that the remote memory can be read and written like local memory, bypassing the complex TCP/IP protocol stack of conventional Ethernet; the CPU of the remote node does not participate in the process, and most of the work of reading and writing is done by hardware rather than software. RDMA transfers data over the network directly into a memory area of the remote system, moving data quickly into the remote system's memory without affecting the operating system, so little of either computer's processing capacity is needed. It eliminates the overhead of external memory copies and context switches, freeing memory bandwidth and CPU cycles and thereby improving application performance.
Fig. 3 shows a system for deep learning according to the present application; as shown in fig. 3, the system comprises:
the dynamic random access memory is connected with the processor through a data bus and is at least used for sending input data to the processor through the data bus for processing;
a plurality of processors, configured to read the input data from the dynamic random access memory through the data bus, run the deep learning model to analyze the input data and obtain an output result, and return the output result to the dynamic random access memory, wherein each processor is provided with a remote direct memory access (RDMA) network interface, and communication connections are established between the RDMA network interfaces for transmitting, between the processors, intermediate processing results of at least one hidden layer of the deep learning model on the input data. It can be appreciated that the RDMA network interfaces described above are all provided on RDMA network cards.
From fig. 3, it is clear that in this system, 4 processors are provided, each with two RDMA network interfaces, but it should be noted that fig. 3 only schematically shows the structure of the system, and does not constitute a limitation on the number of processors and the number of RDMA network interfaces in each processor.
It should be noted that, for a given processor, multiple RDMA network interfaces may be provided by the same RDMA network card or distributed across different RDMA network cards. It can be understood that, when a processor is equipped with multiple RDMA network cards, starting them simultaneously can speed up data transmission.
As an alternative embodiment, to further reduce reliance on the data bus, a network interface for transmitting data may be provided in the dynamic random access memory, through which the input data is sent to the processor.
In this system, the processors can be connected in series and in parallel through the RDMA network interfaces, and the various kinds of intermediate data produced while running the deep learning model are transmitted over the RDMA network interfaces. The load that the data bus must handle is thereby greatly reduced: the data bus is only used to transmit the input data in the initial stage and the output data in the final stage. This reduces the dependence on the data bus, avoids contention among multiple processors for the processing resources of the data bus, lowers the load on the data bus, increases the running speed of the system, and improves data processing efficiency.
Note that the deep learning model includes neural network models, for example a CNN (convolutional neural network). Other types of neural network models are of course also possible; the present application does not limit the specific type of the deep learning model.
In the above technical solution, the computing power of each processor is the same, and the processor includes at least one of the following: a graphics processor (GPU) or a field programmable gate array (FPGA). The dynamic random access memory includes a double data rate (DDR) SDRAM, and the data bus may be a PCIE bus.
Alternatively, the processor may be an ARM-architecture processor; the present application does not limit the specific type of the processor.
In some embodiments of the present application, the system further comprises: a central processing unit (CPU) configured to write the input data into the dynamic random access memory. It should be noted that, in some embodiments of the present application, both types of processors, GPUs and FPGAs, are configured at the same time, and the system can switch between the two processor sets by sending a control signal. For example, when the input data is determined to be data such as images, a control signal may be sent to the processor set before the data is sent to the dynamic random access memory, so that the GPUs are selected for the data processing.
Optionally, the processor is further configured to return the output result to the dynamic random access memory through the data bus. As another alternative implementation, in some embodiments of the present application the output result may instead be sent to the dynamic random access memory through an RDMA network interface; it is easy to see that sending the output result through the RDMA network interface further reduces the dependence on the data bus and avoids the bottleneck effect of the data bus.
In some embodiments of the present application, the RDMA network interfaces in the processors establish communication connections between the processors in a serial form, wherein the plurality of processors at least includes: an input processor that reads the input data from the dynamic random access memory over the data bus, and an output processor that sends the output result to the dynamic random access memory over the data bus.
In some optional embodiments of the present application, the number of RDMA network interfaces in the processors is at least two, where at least one idle RDMA network interface exists in the input processor and the output processor, and each RDMA network interface establishes a communication connection between each processor in a serial form except for the idle RDMA network interface.
It will be appreciated that the above-described idle RDMA network interface may be used as a backup interface for enabling when the RDMA primary network interface fails or the data load is large.
In other alternative embodiments of the present application, the input processor and the output processor each have one RDMA network interface, while the other processors each have two RDMA network interfaces, and the RDMA network interfaces establish communication connections between the processors in a serial form. It can be appreciated that hardware costs can be reduced in this way.
As an alternative implementation, in the technical solution in which the processors are connected in series through the RDMA network interfaces, each processor transmits the result data of the hidden layer it has computed to the next processor connected in series with it, and so on, until the n-th result data of the n-th hidden layer is obtained in the output processor and sent to the dynamic random access memory as the output result.
Specifically, the input processor is configured to analyze the input data to obtain first result data of the first hidden layer; the first processor receives the first result data through its RDMA network interface, analyzes the first result data to obtain second result data of the second hidden layer, and transmits the second result data to the second processor through its RDMA network interface; the second processor analyzes the second result data to obtain third result data of the third hidden layer; and so on, until the n-th result data of the n-th hidden layer is obtained in the output processor and sent to the dynamic random access memory as the output result, wherein the input processor, the first processor, the second processor, and the output processor are connected in series through the RDMA network interfaces.
Taking FPGAs as the processors, the serial scheme is described first. Fig. 4 is a schematic diagram of a system architecture in which communication connections are established between FPGAs in a serial form. As shown in fig. 4, every FPGA device has a built-in RDMA network communication module, and each FPGA has two RDMA network interfaces (referred to as "network ports" for short). The network ports are connected as follows: network port 2 of FPGA1 (i.e., the input processor) is connected to network port 1 of FPGA2 (i.e., the first processor); network port 2 of FPGA2 is connected to network port 1 of FPGA3; network port 2 of FPGA3 (i.e., the second processor) is connected to network port 1 of FPGA4 (i.e., the output processor); network port 1 of FPGA1 and network port 2 of FPGA4 are left idle.
The calculation process (corresponding to the reference numerals in fig. 4) is as follows:
(1) the CPU writes the input data into the host memory.
(2) The FPGA1 reads input data from the host memory through the PCIE bus, performs (convolution, pooling, etc.) operation of the hidden layer 1, and obtains result data of the hidden layer 1.
(3) The FPGA2 reads the data of the hidden layer 1 from the FPGA1 through the RDMA network connection, and performs (convolution, pooling, etc.) operation of the hidden layer 2 to obtain the result data of the hidden layer 2.
(4) The FPGA3 reads the data of the hidden layer 2 from the FPGA2 through the RDMA network, and performs (convolution, pooling, etc.) operation of the hidden layer 3 to obtain the result data of the hidden layer 3.
(5) The FPGA4 reads the data of the hidden layer 3 from the FPGA3 through the RDMA network, and performs final operation to obtain final result data.
(6) The FPGA4 writes the result data to the host memory through the PCIE bus.
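A minimal sketch of the per-stage logic in this serial scheme is given below (C++-style illustration only; the rdma_recv/rdma_send helpers, the buffer size, and the compute_hidden_layer routine are hypothetical stand-ins, not APIs defined by the patent):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helpers standing in for the FPGA's RDMA communication module and
// its hidden-layer compute kernel; the bodies here are placeholder stubs.
std::vector<float> rdma_recv(int /*network_port*/) {
    return std::vector<float>(1024, 0.0f);               // stub: pretend a block arrived
}
void rdma_send(int /*network_port*/, const std::vector<float>& /*data*/) {
    // stub: a real implementation forwards the block over the directly connected port
}
std::vector<float> compute_hidden_layer(int /*layer_id*/, const std::vector<float>& in) {
    return in;                                            // stub for convolution/pooling/etc.
}

// Loop run by an intermediate stage such as FPGA2 or FPGA3 in fig. 4: read the
// previous hidden layer's result over RDMA, compute this stage's hidden layer,
// and forward the result over RDMA. Only FPGA1 (input) and FPGA4 (output) ever
// touch the PCIE bus.
void serial_stage_loop(int layer_id, int upstream_port, int downstream_port) {
    for (;;) {
        std::vector<float> prev = rdma_recv(upstream_port);              // hidden layer (layer_id - 1)
        std::vector<float> curr = compute_hidden_layer(layer_id, prev);  // hidden layer layer_id
        rdma_send(downstream_port, curr);                                 // hand off to the next stage
    }
}
```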
It can be appreciated that, compared with the scheme shown in fig. 2, the scheme of fig. 4 greatly relieves the bottleneck effect of PCIE on the whole system, and improves the overall operation speed.
As another alternative, the processors may be connected in series and parallel through an RDMA network interface, specifically, the RDMA network interface establishes a communication connection between the processors in a serial and parallel manner, where the multiple processors at least include: the input processor reads input data from the dynamic random access memory based on the data bus, and the output processor sends output results to the dynamic random access memory based on the data bus.
In the embodiment in which the processors are connected in series and in parallel through the RDMA network interfaces, each processor transmits the result data of the hidden layer it has computed to a next-stage processor set connected in series with it. It should be noted that the processors within that next-stage set are connected in parallel; this continues until the n-th result data of the n-th hidden layer is obtained in the output processor and sent to the dynamic random access memory as the output result. It is easy to see that the parallel connection allows the result data of the previous hidden layer to be processed in parallel, further improving data processing efficiency.
Specifically, the input processor is used for analyzing and processing the input data to obtain first result data of the first hidden layer, dividing the first result data into two parts to obtain first data and second data, and respectively transmitting the first data and the second data to a first processor and a second processor which are connected in series with the input processor through the RDMA network interface; the first processor is used for processing the first data to obtain second result data of the second hidden layer, and the second processor is used for processing the second data to obtain third result data of the second hidden layer, wherein the first processor and the second processor are processors in parallel connection; the first processor is used for transmitting the second result data to the third processor through the RDMA network interface, wherein the third processor is used for analyzing and processing the second result data to obtain fourth result data of the third hidden layer; the second processor is used for transmitting the third result data to the fourth processor through the RDMA network interface, wherein the fourth processor is used for analyzing and processing the third result data to obtain fifth result data of the third hidden layer, and the third processor and the fourth processor are processors in parallel connection; and the same is repeated until the nth result data of the nth hidden layer is obtained in the output processor, and the nth result data is used as an output result to be sent to the dynamic random access memory.
The serial-plus-parallel scenario is again illustrated with FPGAs as the processors. Although the RDMA serial FPGA scheme shown in fig. 4 greatly relieves the bottleneck effect of PCIE on the whole system and improves the overall operation speed, it has limitations. To achieve a well-pipelined computation, the scheme requires that the speed of acquiring and transmitting data and the operation speeds of all FPGAs be nearly identical, which is often not feasible in practice. For example, when the computational complexity of hidden layer 2 and hidden layer 3 is high and their computation time is long, FPGA1 and FPGA4 frequently have to wait for FPGA2 and FPGA3, which to some extent reduces the speedup offered by the scheme of fig. 4. For this reason, the embodiments of the present application also provide the operation scheme shown in fig. 5, which uses RDMA to connect FPGAs both in series and in parallel.
As shown in fig. 5, 6 FPGAs are used in fig. 5 in total, and all of their ports are utilized. The connection mode of all the network ports is already described in the figure. Based on the description shown in fig. 5, the basic idea is to distribute the calculation load of the steps with high calculation complexity and long calculation time to two parallel FPGAs.
The calculation process (corresponding to the reference numerals in fig. 5) is as follows:
(1) the CPU writes the input data into the host memory.
(2) The FPGA1 (i.e. the input processor) reads input data from the host memory through the PCIE bus, performs operation of the hidden layer 1, obtains result data of the hidden layer 1, and divides the result data into two parts (i.e. Part1 and Part 2) to store the result data in the internal memory of the FPGA 1.
(3) The FPGA2 (i.e. the first processor) reads the data Part1 of the hidden layer 1 from the FPGA1 through the RDMA network connection, and performs the operation of the hidden layer 2 to obtain the result data Part1 of the hidden layer 2. Meanwhile, the FPGA3 (i.e. the second processor) reads the data Part2 of the hidden layer 1 from the FPGA1 through the RDMA network connection, and performs the operation of the hidden layer 2 to obtain the result data Part2 of the hidden layer 2.
(4) The FPGA4 (i.e. the third processor) reads the data Part1 of the hidden layer 2 from the FPGA2 through the RDMA network, and performs the operation of the hidden layer 3 to obtain the result data Part1 of the hidden layer 3. Meanwhile, the FPGA5 (i.e., the fourth processor) reads the data Part2 of the hidden layer 2 from the FPGA3 through the RDMA network connection, and performs the operation of the hidden layer 3 to obtain the result data Part2 of the hidden layer 3.
(5) The FPGA6 (i.e. the output processor) reads the two parts of data of the hidden layer 3 from the FPGA4 and the FPGA5 through the RDMA network, and performs the final operation to obtain the final result data.
(6) The FPGA6 writes the result data to the host memory through the PCIE bus.
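The splitting step performed by the input FPGA in this serial-parallel scheme can be sketched as follows (a C++-style illustration only; the rdma_send helper, the even split, and the port arguments are assumptions made for the example, not details fixed by the patent):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical RDMA send helper; in the real system the transfer is done by the
// FPGA's RDMA communication module over a directly connected network port.
void rdma_send(int /*network_port*/, const float* /*data*/, std::size_t /*count*/) {}

// Step (2) of fig. 5 as run on FPGA1: after computing hidden layer 1, split the
// result into Part1 and Part2 and hand each part to one of the two parallel
// next-stage FPGAs (FPGA2 and FPGA3) over separate RDMA links.
void split_and_dispatch(const std::vector<float>& hidden1_result,
                        int port_to_fpga2, int port_to_fpga3) {
    const std::size_t half = hidden1_result.size() / 2;   // an even split is assumed here
    rdma_send(port_to_fpga2, hidden1_result.data(), half);                                // Part1 -> FPGA2
    rdma_send(port_to_fpga3, hidden1_result.data() + half, hidden1_result.size() - half); // Part2 -> FPGA3
    // FPGA2 and FPGA3 then compute hidden layer 2 on their halves in parallel,
    // and FPGA4/FPGA5 continue the pipeline as in steps (4)-(5) of fig. 5.
}
```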
It should be noted that the scheme shown in fig. 5 is still limited by the number of RDMA network ports on each processor. To adapt to more complex situations, the number of RDMA network ports of an FPGA may be redesigned; accordingly, in the related examples of the present application, the number of RDMA network ports configured on a processor may be adjusted according to the actual situation. Specifically, the target number of RDMA network ports configured on a processor is a dynamic value that is adjusted according to the data to be processed by that processor.
In some embodiments of the present application, the processor is further configured to determine the target number according to the load of the data to be processed, wherein the greater the load, the larger the target number. Likewise, the target number of RDMA network interfaces may establish communication connections between the processors both in series and in parallel. Fig. 6 is a schematic diagram of an application scenario in which the target number of RDMA network interfaces is three; as shown in fig. 6, the calculation flow is similar to that of fig. 5 and is not repeated here.
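One simple way to realize the load-based choice of the target number is sketched below (purely illustrative; the load metric, the thresholds, and the cap on available ports are assumptions for the example, not values given by the patent):

```cpp
#include <cstdint>

// Map the load of the data to be processed (taken here as bytes per batch, an
// assumed metric) to a target number of RDMA network ports, never exceeding the
// number of ports physically present on the device.
int target_rdma_port_count(std::uint64_t load_bytes_per_batch, int ports_available) {
    int target = 1;                                          // light load: one port suffices
    if (load_bytes_per_batch > (64ull << 20))  target = 2;   // > 64 MiB per batch (assumed threshold)
    if (load_bytes_per_batch > (256ull << 20)) target = 3;   // > 256 MiB per batch (assumed threshold)
    return target < ports_available ? target : ports_available;
}
```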
It can be understood that, compared with the existing scheme shown in fig. 2, the schemes of fig. 4 and fig. 5 adopted in the present invention use the PCIE bus to transmit data only in step (2) and step (6); all other steps use mutually independent RDMA network connections, so the load on the PCIE bus is greatly reduced. When the number of FPGA devices is N (counting a pair of parallel devices as one), the data transmitted on the PCIE bus is 2/(N+1) of that in the existing scheme shown in fig. 2. That is, the load on the PCIE bus is reduced by (N-1)/(N+1), and accordingly the theoretical operation speed can be improved to (N+1)/2 times that of the existing scheme. It is easy to see that the larger N is, i.e., the more processors there are, the more pronounced the improvement in operation speed.
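The bus-load figures quoted above follow from counting PCIE transfers per batch; a short derivation, under the stated assumption that every inter-stage transfer carries roughly the same amount of data, is:

```latex
\begin{align*}
\text{PCIE transfers per batch (fig.~2)} &= \underbrace{1}_{\text{read input}} + \underbrace{(N-1)}_{\text{inter-GPU}} + \underbrace{1}_{\text{write output}} = N+1,\\
\text{PCIE transfers per batch (figs.~4/5)} &= \underbrace{1}_{\text{step (2)}} + \underbrace{1}_{\text{step (6)}} = 2,\\
\frac{\text{new PCIE load}}{\text{old PCIE load}} &= \frac{2}{N+1},\qquad
\text{load reduction} = 1-\frac{2}{N+1} = \frac{N-1}{N+1},\\
\text{theoretical speedup (bus-bound case)} &= \frac{N+1}{2}.
\end{align*}
```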
The combination of the serial and parallel schemes shown in fig. 5 and fig. 6 can effectively solve the problem in scenarios where some individual steps take excessively long to compute and become a bottleneck of the system.
In addition, the technical solution of the present application is not limited to the inside of a single machine; it can be executed across hosts and therefore has better scalability than the existing scheme. It can be understood that, in the above embodiments, each processor needs to statically allocate dedicated memory areas for the input data (read from other devices or from system memory) and for the result data (produced by its computation), respectively.
In the scenario in which parallel FPGAs are added to the RDMA serial connection, it can be seen that when the parallel FPGAs process a given piece of data, a slower operation speed on any one of them affects the efficiency of the final combination of the results. Therefore, in some embodiments of the present application, when several independent computation flows are processed (i.e., there is no ordering constraint between the computation flows, and each flow can run completely independently), the flows may be handled as follows in order to avoid interdependence between the parallel FPGAs:
determining, among the plurality of independent computation flows, the flow that requires the most computing power or the longest computation time, taking that flow as the target flow to be improved, and distributing the computation of the target flow across a plurality of parallel processors.
Specifically, independent threads may be allocated to the independent computation flows, with a one-to-one correspondence between flows and threads. The more computation flows there are to compute (i.e., the more threads to run), the more RDMA network interfaces need to be provided in the input processor and the output processor, and the more processors are needed in the processor set connected in series with the input processor; the data of each thread is processed independently by the processors in the set, without affecting one another.
For example, suppose there are three computation flows, step A, step B, and step C, where step A and step B can be performed simultaneously and are completely independent of each other. Two mutually independent threads (thread A and thread B) can then be allocated to step A and step B. Assume that, on computing devices of equal capability, step A and step C take the same computation time while step B takes twice as long as steps A and C because of differences in complexity and data volume, and that the two flows share the two hidden layers (as shown in fig. 7, a schematic structural diagram of the deep learning model in this embodiment).
Alternatively, taking FPGAs as the processors, fig. 8 is a schematic diagram of the connection relationships of the FPGAs in the above embodiment. As shown in fig. 8, the input data of thread A and thread B is processed by FPGA1 (i.e., the input processor) to obtain the hidden layer 1 data of thread A and the hidden layer 1 data of thread B; the hidden layer 1 data of thread A is sent to FPGA3 and the hidden layer 1 data of thread B to FPGA2; FPGA3 processes the hidden layer 1 data of thread A and FPGA2 processes the hidden layer 1 data of thread B, so as to obtain the hidden layer 2 data of thread A and of thread B; FPGA4 then reads the hidden layer 2 data of thread A and thread B and finally calculates the output data of thread A and the output data of thread B. It can be understood that the computation of step B can be shared by FPGA2 and FPGA3, so that the running time of step B becomes the same as that of step A, which in turn makes pipelining of the computation possible.
It should be noted that, in the above embodiment, each FPGA runs a designated kernel program. Two kernel_1 routines (named kernel_1_A and kernel_1_B, respectively) run on FPGA1; they read the input data of thread A and thread B, respectively, and compute the hidden layer 1 data (and invoke the RDMA network interface to send the computed result to the node responsible for the next layer of computation, likewise below). A kernel_2 routine runs on FPGA2 and on FPGA3, reading the hidden layer 1 data and computing the hidden layer 2 data. Two kernel_3 routines (named kernel_3_A and kernel_3_B) run on FPGA4; they read the hidden layer 2 data of thread A and thread B, respectively, and compute the output data, which is finally written into host memory via PCIe.
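The mapping of kernel routines to FPGAs described above can be summarized in a small configuration table (an illustrative C++ sketch only; the struct and the string labels are invented for the example, though the kernel names follow the description):

```cpp
#include <string>
#include <vector>

// One entry per kernel instance: which FPGA it runs on, its routine name,
// and which part of the model the routine computes.
struct KernelPlacement {
    int         fpga_id;
    std::string kernel_name;
    std::string computes;
};

// Placement corresponding to fig. 8: FPGA1 computes hidden layer 1 for both
// threads, FPGA2/FPGA3 compute hidden layer 2, FPGA4 computes the outputs.
const std::vector<KernelPlacement> kPlacement = {
    {1, "kernel_1_A", "hidden layer 1, thread A"},
    {1, "kernel_1_B", "hidden layer 1, thread B"},
    {2, "kernel_2",   "hidden layer 2"},
    {3, "kernel_2",   "hidden layer 2"},
    {4, "kernel_3_A", "output, thread A"},
    {4, "kernel_3_B", "output, thread B"},
};
```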
In the above embodiment, thread A and thread B may be started by the same process, which first needs to initialize them. Fig. 9 is a schematic diagram of the initialization flow of thread A and thread B in this embodiment of the present application. The flow includes: allocating a large buffer and dividing it equally into two parts, input_buf_A and input_buf_B, for storing the input data of thread A and thread B respectively; allocating another large buffer and dividing it equally into two parts, out_buf_A and out_buf_B, for storing the output data of thread A and thread B respectively; loading the corresponding kernel program into each FPGA through PCIe; and starting thread A and thread B.
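A host-side sketch of this initialization flow is shown below (illustrative only; load_kernel_to_fpga and the per-thread bodies are hypothetical stand-ins for the PCIe kernel loading and the thread control logic of FIG. 10, and the buffer size is an assumption):

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical PCIe helper that loads a compiled kernel image onto one FPGA
// (stub body; a real implementation goes through the PCIe driver).
void load_kernel_to_fpga(int /*fpga_id*/, const std::string& /*kernel_name*/) {}

// Per-thread control loops; their real bodies correspond to the flow of fig. 10
// (configure the kernels, feed input data, collect output data).
void thread_a_main(float* /*input_buf_a*/, float* /*out_buf_a*/) {}
void thread_b_main(float* /*input_buf_b*/, float* /*out_buf_b*/) {}

int main() {
    const std::size_t kBufFloats = 1 << 22;  // assumed total buffer size, in floats

    // Allocate one large input buffer and one large output buffer, each split
    // equally between thread A and thread B (fig. 9, first two steps).
    std::vector<float> input_buf(kBufFloats), out_buf(kBufFloats);
    float* input_buf_a = input_buf.data();
    float* input_buf_b = input_buf.data() + kBufFloats / 2;
    float* out_buf_a   = out_buf.data();
    float* out_buf_b   = out_buf.data() + kBufFloats / 2;

    // Load the designated kernel program into each FPGA over PCIe (fig. 9, third step).
    load_kernel_to_fpga(1, "kernel_1_A"); load_kernel_to_fpga(1, "kernel_1_B");
    load_kernel_to_fpga(2, "kernel_2");   load_kernel_to_fpga(3, "kernel_2");
    load_kernel_to_fpga(4, "kernel_3_A"); load_kernel_to_fpga(4, "kernel_3_B");

    // Start thread A and thread B (fig. 9, last step) and wait for them to finish.
    std::thread ta(thread_a_main, input_buf_a, out_buf_a);
    std::thread tb(thread_b_main, input_buf_b, out_buf_b);
    ta.join();
    tb.join();
    return 0;
}
```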
After thread A and thread B have been initialized, they can be used normally. Fig. 10 is a schematic diagram of the execution flow of thread A on the CPU. As shown in fig. 10, each thread is responsible for configuring its own kernels, including the transmission target (an address on the next-stage FPGA or on the host), the transmission mode (RDMA or PCIe), and the transmission interface (the RDMA network port designated for use) to be applied once the computation completes. The execution flow of thread B is similar to that of thread A and is not described again here.
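The per-kernel configuration carried out by thread A can be pictured as filling in a small descriptor before each run (a hypothetical sketch; the struct, enum, field names, and the concrete addresses are invented for illustration, and the destinations follow the fig. 8 connections described above):

```cpp
#include <cstdint>

// Where and how a kernel should ship its result once its computation finishes.
enum class TransferMode { kRdma, kPcie };

struct KernelOutputConfig {
    std::uint64_t target_address;  // address on the next-stage FPGA, or in host memory
    TransferMode  mode;            // RDMA between FPGAs, PCIe for the final write-back
    int           rdma_port;       // which RDMA network port to use (ignored for PCIe)
};

// Thread A's configuration of its two kernels: kernel_1_A forwards thread A's
// hidden layer 1 data to the next-stage FPGA over RDMA, while kernel_3_A writes
// the final output back to host memory (out_buf_A) over PCIe. The addresses
// below are placeholders only.
const KernelOutputConfig kKernel1AConfig{0x0000000010000000ull, TransferMode::kRdma, /*rdma_port=*/2};
const KernelOutputConfig kKernel3AConfig{0x00007f00dead0000ull, TransferMode::kPcie, /*rdma_port=*/0};
```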
It should be noted that, in the related art, each RDMA terminal (such as a network card or an FPGA with RDMA capability) needs to be connected to a dedicated router: the sender first transmits a data packet to the router, and the router then forwards it to the destination device. To identify which destination device a packet belongs to, the sender's RDMA device must add information such as an LID and a GID to the packet. To guard against malicious programs, the two communicating parties create their respective keys and exchange them in advance; when data is transmitted, the initiator's software configures the keys of both ends into the hardware, the hardware of both parties checks whether the keys are correct, and only then does the data transmission proceed. Fig. 11 shows the processing flow of the initiator during RDMA data transmission in the related art. As shown in fig. 11, the flow includes: after receiving a data transmission request from the kernel, checking whether the key is correct, reading data from the data source into the RDMA module, adding information such as the GID and LID in front of the data to complete packet encapsulation, and then sending the encapsulated packet to the peer.
In the embodiments described above, the RDMA network interfaces in the present application are all directly connected one-to-one, and the data is transmitted without passing through other devices such as routers, so that operations such as checking keys and encapsulating data can be avoided, and data processing time is saved.
Fig. 12 is a schematic flow chart of data transmission by the RDMA network interface in an embodiment of the present application. As shown in fig. 12, the flow includes: after receiving a data transmission request from the kernel, reading data from the data source into the RDMA module and sending it to the peer. It is easy to see that, compared with the RDMA network interface in the related art, this flow omits checking keys and prepending information such as the GID and LID for packet encapsulation. It can be understood that, in the embodiments of the present application, the RDMA network ports are connected one-to-one between devices. Compared with RDMA-based data transmission in the related art, when the FPGA assembles and parses RDMA network packets, information such as the MAC and GID can be omitted, because that information is only used to identify devices and is unnecessary in a one-to-one connection; and since a direct connection is not exposed to network attacks, security mechanisms such as keys can also be omitted, which speeds up data processing.
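The difference between the two send paths can be sketched as follows (illustrative C++ pseudocode; the frame layout, helper names, and key check are simplified stand-ins for what fig. 11 and fig. 12 describe, not a real RDMA stack):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Related-art style frame: routing/identity fields precede the payload because
// the packet must traverse a router and identify its destination device.
struct RoutedFrame {
    std::uint16_t lid;
    std::uint8_t  gid[16];
    std::uint32_t key;
    std::vector<std::uint8_t> payload;
};

// fig. 11 path (sketch): check the pre-exchanged key, then encapsulate the data
// with GID/LID information before handing the frame to the router.
bool send_via_router(const std::vector<std::uint8_t>& data,
                     std::uint32_t local_key, std::uint32_t remote_key,
                     RoutedFrame& out) {
    if (local_key != remote_key) return false;  // key check
    out.lid = 0x0001;
    std::memset(out.gid, 0, sizeof(out.gid));
    out.key = local_key;
    out.payload = data;                          // encapsulated packet goes to the router
    return true;
}

// fig. 12 path (sketch): the ports are wired one-to-one, so the payload is
// pushed straight to the peer with no key check and no GID/LID/MAC header.
std::vector<std::uint8_t> send_direct(const std::vector<std::uint8_t>& data) {
    return data;                                 // raw data sent to the opposite end as-is
}
```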
In this embodiment, a method running on a processor is provided, fig. 13 is a flowchart of a method for processing data according to an embodiment of the present application, and as shown in fig. 13, the flowchart includes the following steps:
step S1302, a processor set receives input data from a dynamic random access memory through a data bus;
as an alternative embodiment, to further reduce reliance on the data bus, a network interface for transmitting data may be provided in the dynamic random access memory, through which the input data is sent to the processor.
Step S1304, the deep learning model is operated to analyze the input data to obtain an output result;
the deep learning model includes, but is not limited to: neural network model.
Step S1306, returning the output result to the dynamic random access memory, wherein each processor is provided with a remote direct memory access (RDMA) network interface, and communication connections are established between the RDMA network interfaces for transmitting, between the processors, the intermediate processing result of at least one hidden layer of the deep learning model on the input data.
Specifically, for a given processor, multiple RDMA network interfaces may all be provided by the same RDMA network card, or may be distributed across different RDMA network cards. It can be understood that, when a processor is configured with multiple RDMA network cards, starting them at the same time can accelerate data transmission.
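As a minimal sketch, assuming hypothetical interface objects that expose send_async and wait methods, a transfer could be striped across several simultaneously started RDMA interfaces of one processor as follows.

def send_striped(interfaces, peer, data):
    # split the buffer into one chunk per interface so all links transmit at once
    n = len(interfaces)
    chunk = (len(data) + n - 1) // n
    for i, nic in enumerate(interfaces):
        part = data[i * chunk:(i + 1) * chunk]
        if part:
            nic.send_async(peer, part)
    for nic in interfaces:
        nic.wait()  # block until every in-flight transfer has completed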
In the above technical solution in step S1306, the output result may be returned to the dynamic random access memory through the data bus.
As another optional implementation of step S1306, the output result may instead be sent to the dynamic random access memory through the RDMA network interface. It is easy to note that sending the output result through the RDMA network interface further reduces the dependence on the data bus and avoids the data bus becoming a bottleneck.
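Steps S1302 to S1306 can be summarized in a short sketch; dram, bus, model and rdma are assumed helper objects used only for illustration, not the claimed implementation.

def process_once(dram, bus, model, rdma=None):
    input_data = bus.read(dram, region="input")        # S1302: receive over the data bus
    output = model.forward(input_data)                 # S1304: run the deep learning model
    if rdma is not None:
        rdma.send_to(dram, output)                     # S1306, option: return over RDMA
    else:
        bus.write(dram, region="output", data=output)  # S1306, option: return over the bus
    return output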
In addition, based on the system embodiments, in the embodiments of the present application the intermediate process data can be transmitted between processors connected in series, or in series and in parallel, through their RDMA network interfaces, which reduces the dependence on the data bus. In the above technical solution, the computing power of each processor is the same, and the processor includes at least one of the following: a graphics processor (GPU) or a field programmable gate array (FPGA); a mixture of GPUs and FPGAs may also be used. The dynamic random access memory includes a double rate synchronous dynamic random access memory (DDR), and the data bus may be a PCIe bus.
In some embodiments of the present application, both GPU and FPGA types of processors may be provided at the same time, and switching between the two sets of processors may be achieved by receiving control signals.
The processor set receives input data from the dynamic random access memory through a data bus, runs the deep learning model to analyze the input data to obtain an output result, and returns the output result to the dynamic random access memory. In the processor set, each processor is provided with a remote direct address access (RDMA) network interface, and communication connections are established between the RDMA network interfaces so that the intermediate processing result of at least one hidden layer of the deep learning model on the input data is transmitted between the processors. This reduces the dependence on the data bus, avoids multiple processors contending for the processing resources of the data bus, reduces the load on the data bus, improves the running speed of the system, and accelerates data processing.
As an alternative implementation, the RDMA network interfaces of the processors establish communication connections between the processors in serial form. Each processor transmits the hidden-layer result data obtained by its own processing to the next processor connected in series with it, and so on, until the nth result data of the nth hidden layer is obtained in the output processor and sent to the dynamic random access memory as the output result.
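The serial arrangement can be pictured with the following sketch (all objects are assumed and merely illustrative): processor i produces the result data of hidden layer i and forwards it over its RDMA interface to the processor wired after it, and the output processor writes the nth result back to memory.

def run_serial_chain(processors, input_data, dram):
    result = input_data
    for i, proc in enumerate(processors):
        result = proc.compute_hidden_layer(result)      # i-th hidden-layer result data
        if i + 1 < len(processors):
            proc.rdma.send(processors[i + 1], result)   # hand off to the next processor
    dram.write("output", result)                        # nth result data as the output
    return result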
As another alternative implementation, the RDMA network interfaces of the processors may establish communication connections between the processors in serial and parallel form. Each processor transmits the hidden-layer result data obtained by its own processing to the next-stage processor set connected in series with it, where the processors within that next-stage set are connected in parallel, and so on, until the nth result data of the nth hidden layer is obtained in the output processor and sent to the dynamic random access memory as the output result.
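The serial-and-parallel case can be sketched similarly, treating each next-stage processor set as one stage whose members work in parallel on an even split of the incoming result data; all names below are assumed for illustration.

def split_evenly(data, n):
    # divide a flat list of values into n roughly equal chunks
    size = max(1, (len(data) + n - 1) // n)
    return [data[i:i + size] for i in range(0, len(data), size)]

def run_serial_parallel(stages, input_data, dram):
    parts = [input_data]                              # output of the input processor
    for stage in stages:                              # stages follow one another in series
        merged = [x for part in parts for x in part]
        chunks = split_evenly(merged, len(stage))     # spread the load over the stage
        parts = [proc.compute_hidden_layer(chunk)     # parallel branches of this stage
                 for proc, chunk in zip(stage, chunks)]
    dram.write("output", [x for part in parts for x in part])
    return parts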
Similarly, the processors are provided with a target number of RDMA network interfaces, wherein the target number is a dynamic value dynamically adjusted according to data to be processed by the processors.
Optionally, the target number is determined as follows: acquiring the load of the data to be processed by the processor, and determining the target number according to that load, wherein the larger the load, the larger the target number.
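As one possible rule, purely an assumption for illustration (the patent does not fix a concrete formula), the target number could be derived from the pending load and capped by the number of interfaces the hardware provides.

def target_interface_count(pending_bytes, bytes_per_interface, max_interfaces):
    # more pending data -> more interfaces, up to the number physically available
    needed = -(-pending_bytes // bytes_per_interface)  # ceiling division
    return max(1, min(needed, max_interfaces))

# e.g. target_interface_count(10 * 2**30, 4 * 2**30, 4) returns 3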
Optionally, the processor is further configured to return the output result to the dynamic random access memory through the data bus. As another alternative implementation, in some embodiments of the present application the output result may be sent to the dynamic random access memory through the RDMA network interface; it is easy to note that this further reduces the dependence on the data bus and avoids the data bus becoming a bottleneck.
Fig. 14 is an apparatus for processing data according to an embodiment of the present application, as shown in fig. 14, the apparatus including:
a receiving module 140, configured to receive, by the processor set, input data from the dynamic random access memory through the data bus;
the operation module 142 is configured to operate the deep learning model to analyze the input data to obtain an output result;
a return module 144, configured to return the output result to the dynamic random access memory through the data bus; and each processor in the processor set is provided with a remote direct address access (RDMA) network interface, and communication connection is established among the RDMA network interfaces so as to be used for transmitting an intermediate processing result of at least one hidden layer in the deep learning model on input data among the processors.
In the above apparatus, the receiving module 140 is configured for the processor set to receive input data from the dynamic random access memory through the data bus; the operation module 142 is configured to run the deep learning model to analyze the input data to obtain an output result; and the return module 144 is configured to return the output result to the dynamic random access memory through the data bus. In the processor set, each processor is provided with a remote direct address access (RDMA) network interface, and communication connections are established between the RDMA network interfaces so that the intermediate processing result of at least one hidden layer of the deep learning model on the input data is transmitted between the processors. This reduces the dependence on the data bus, avoids multiple processors contending for the processing resources of the data bus, reduces the load on the data bus, improves the running speed of the system, and accelerates data processing.
In some examples of the present application, the processors are each provided with a target number of RDMA network interfaces, where the target number is a dynamic value dynamically adjusted according to data to be processed by the processors.
Optionally, the apparatus further includes a determining module configured to acquire the load of the data to be processed by the processor and to determine the target number according to that load, wherein the larger the load, the larger the target number.
In some optional examples of the present application, the apparatus further includes a distribution module configured to, after acquiring the load of the data to be processed by the processors, distribute the load evenly to a plurality of processors connected in parallel. It should be noted that the above-mentioned processor includes at least one of the following: a graphics processor (GPU) or a field programmable gate array (FPGA); a mixture of GPUs and FPGAs may also be used. The dynamic random access memory includes a double rate synchronous dynamic random access memory (DDR), and the data bus may be a PCIe bus. In some embodiments of the present application, processors of both the GPU and FPGA types may be provided at the same time, and switching between the two sets of processors may be achieved by receiving control signals.
From the description of the above embodiments, it will be clear to a person skilled in the art that the methods of the above embodiments may be implemented by software plus the necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application, or the part of it contributing to the prior art, may essentially be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application.
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; alternatively, the above modules may be located in different processors in any combination.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Specifically, the storage medium is configured to store program instructions for implementing the following functions:
the processor set receives input data from the dynamic random access memory through a data bus; the deep learning model is operated to analyze the input data to obtain an output result; returning the output result to the dynamic random access memory; and each processor in the processor set is provided with a remote direct address access (RDMA) network interface, and communication connection is established among the RDMA network interfaces so as to be used for transmitting an intermediate processing result of at least one hidden layer in the deep learning model on input data among the processors.
In one exemplary embodiment, the computer readable storage medium may include, but is not limited to: a usb disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing a computer program.
Embodiments of the present application also provide an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
In an exemplary embodiment, the electronic device may further include a transmission device connected to the processor, and an input/output device connected to the processor.
Fig. 15 illustrates a schematic block diagram of an example electronic device 1500 that may be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 15, the apparatus 1500 includes a computing unit 1501, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1502 or a computer program loaded from a storage unit 1508 into a Random Access Memory (RAM) 1503. In the RAM 1503, various programs and data required for the operation of the device 1500 may also be stored. The computing unit 1501, the ROM 1502, and the RAM 1503 are connected to each other through a bus 1504. An input/output (I/O) interface 1505 is also connected to bus 1504.
Various components in device 1500 are connected to I/O interface 1505, including: an input unit 1506 such as a keyboard, mouse, etc.; an output unit 1507 such as various types of displays, speakers, and the like; a storage unit 1508 such as a magnetic disk, an optical disk, or the like; and a communication unit 1509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1509 allows the device 1500 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1501 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running deep learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The computing unit 1501 performs the various methods and processes described above, for example, a method of processing data. For example, in some embodiments, the method of processing data may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1500 via the ROM 1502 and/or the communication unit 1509. When a computer program is loaded into the RAM 1503 and executed by the computing unit 1501, one or more steps of the method of processing data described above may be performed. Alternatively, in other embodiments, the computing unit 1501 may be configured to perform the method of processing data by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present application may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
Specific examples in this embodiment may refer to the examples described in the foregoing embodiments and exemplary implementations, and details are not repeated here.
It will be appreciated by those skilled in the art that the modules or steps of the present application described above may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices; they may be implemented in program code executable by computing devices, so that they may be stored in a storage device and executed by computing devices; in some cases, the steps shown or described may be performed in an order different from that given here; alternatively, they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the same, but rather, various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principles of the present application should be included in the protection scope of the present application.

Claims (24)

1. A system for deep learning, comprising:
the dynamic random access memory is connected with the processor through a data bus and is at least used for sending input data to the processor through the data bus for processing;
the multiple processors are used for reading the input data from the dynamic random access memory through the data bus, running the deep learning model to analyze the input data to obtain an output result, and returning the output result to the dynamic random access memory, wherein each processor is provided with a remote direct address access (RDMA) network interface, and communication connection is established between the RDMA network interfaces so as to be used for transmitting an intermediate processing result of at least one hidden layer in the deep learning model on the input data between the processors.
2. The system of claim 1, wherein the RDMA network interface in the processors establishes the communication connection between the respective processors in a serial fashion, wherein at least one of the plurality of processors comprises: the input processor is a processor for reading the input data from the dynamic random access memory based on a data bus, and the output processor is a processor for sending the output result to the dynamic random access memory based on the data bus.
3. The system of claim 2, wherein the number of RDMA network interfaces in the processor is at least two, wherein there is at least one idle RDMA network interface in the input processor and the output processor, and wherein, apart from the idle RDMA network interfaces, the RDMA network interfaces establish the communication connection between the respective processors in serial form.
4. The system of claim 2, wherein the number of RDMA network interfaces in the input processor and the output processor are each one, wherein each of the other processors except the input processor and the output processor is provided with two RDMA network interfaces, and wherein each of the RDMA network interfaces establishes the communication connection between the respective processors in a serial form.
5. The system of claim 3 or claim 4, wherein,
and each processor is respectively used for transmitting the result data of the hidden layer obtained by processing to the next processor connected in series with the processor, and so on until the nth result data of the nth hidden layer is obtained in the output processor, and the nth result data is used as an output result to be sent to the dynamic random access memory.
6. The system of claim 5, wherein the input processor is configured to analyze the input data to obtain first result data for a first hidden layer; the first processor receives the first result data through an RDMA network interface, analyzes and processes the first result data to obtain second result data of a second hidden layer, and transmits the second result data to a second processor through the RDMA network interface;
the second processor is used for analyzing and processing the second result data to obtain third result data of a third hidden layer, and the like until n-th result data of an n-th hidden layer are obtained in the output processor, and the n-th result data are sent to the dynamic random access memory as an output result; the input processor, the first processor, the second processor and the output processor are connected in series through an RDMA network interface.
7. The system of claim 1, wherein the RDMA network interface establishes the communication connection between the respective processors in a serial and parallel fashion, wherein at least one of the plurality of processors comprises: the input processor is a processor for reading the input data from the dynamic random access memory based on a data bus, and the output processor is a processor for sending the output result to the dynamic random access memory based on the data bus.
8. The system of claim 7, wherein each processor is configured to transmit the hidden-layer result data obtained by its own processing to a next-stage processor set connected in series with it, wherein the processors in the next-stage processor set are connected in parallel, and so on, until the nth result data of the nth hidden layer is obtained in the output processor, and to send the nth result data to the dynamic random access memory as an output result.
9. The system of claim 8, wherein the input processor is configured to analyze the input data to obtain first result data of a first hidden layer, divide the first result data into two parts to obtain first data and second data, and transmit the first data and the second data to a first processor connected in series with the input processor through an RDMA network interface, and a second processor, respectively;
the first processor is used for processing the first data to obtain second result data of a second hidden layer, and the second processor is used for processing the second data to obtain third result data of the second hidden layer, wherein the first processor and the second processor are processors in parallel connection;
The first processor is configured to transmit the second result data to a third processor through an RDMA network interface, where the third processor is configured to analyze and process the second result data to obtain fourth result data of a third hidden layer;
the second processor is configured to transmit the third result data to a fourth processor through an RDMA network interface, where the fourth processor is configured to perform analysis processing on the third result data to obtain fifth result data of a third hidden layer, where the third processor and the fourth processor are processors in a parallel relationship;
and the same is repeated until the nth result data of the nth hidden layer is obtained in the output processor, and the nth result data is used as an output result to be sent to the dynamic random access memory.
10. The system of claim 7, wherein the target number of RDMA network interfaces configured on the processor is a dynamic value that is dynamically adjusted based on data to be processed by the processor.
11. The system of claim 10, wherein the processor is further configured to determine the target number based on a load of data to be processed by the processor, wherein the greater the load, the greater the target number.
12. The system of claim 1, wherein the system further comprises:
and the central processing unit is used for writing the input data into the dynamic random access memory.
13. The system of claim 1, wherein the processor comprises at least one of: a graphics processor GPU or a field programmable gate array FPGA.
14. The system of claim 1, wherein the dynamic random access memory comprises: double rate synchronous dynamic random access memory DDR.
15. The system according to any one of claims 1 to 11, wherein the processor is further configured to return the output result to the dynamic random access memory through the data bus.
16. A method of processing data, comprising:
the processor set receives input data from the dynamic random access memory through a data bus;
operating a deep learning model to analyze the input data to obtain an output result;
returning the output result to the dynamic random access memory; and each processor in the processor set is provided with a remote direct address access (RDMA) network interface, and communication connection is established among the RDMA network interfaces so as to be used for transmitting an intermediate processing result of at least one hidden layer in the deep learning model on the input data among the processors.
17. The method of claim 16, wherein the target number of RDMA network interfaces configured on the processor is a dynamic value that is dynamically adjusted according to data to be processed by the processor.
18. The method of claim 17, wherein the target number is determined by:
acquiring the load of the data to be processed by the processor; and determining the target number according to the load, wherein the larger the load, the larger the target number.
19. The method of claim 16, wherein the RDMA network interfaces in the processors establish the communication connection between the processors in serial form, each processor is configured to transmit the hidden-layer result data obtained by its own processing through the RDMA network interface to the next processor connected in series with it, and so on, until the nth result data of the nth hidden layer is obtained in the output processor and the nth result data is sent to the dynamic random access memory as an output result.
20. The method of claim 18, wherein the RDMA network interfaces establish the communication connection between the processors in serial and parallel form, and each processor is configured to transmit the hidden-layer result data obtained by its own processing to a next-stage processor set connected in series with it, wherein the processors in the next-stage processor set are connected in parallel, and so on, until the nth result data of the nth hidden layer is obtained in the output processor and the nth result data is sent to the dynamic random access memory as an output result.
21. The method of claim 16, wherein the processor comprises at least one of: a graphics processor GPU or a field programmable gate array FPGA;
the dynamic random access memory includes: double rate synchronous dynamic random access memory DDR.
22. An apparatus for processing data, comprising:
the receiving module is used for receiving the input data from the dynamic random access memory through the data bus by the processor set;
the operation module is used for operating the deep learning model to analyze the input data to obtain an output result;
the return module is used for returning the output result to the dynamic random access memory through the data bus; and each processor in the processor set is provided with a remote direct address access (RDMA) network interface, and communication connection is established among the RDMA network interfaces so as to be used for transmitting an intermediate processing result of at least one hidden layer in the deep learning model on the input data among the processors.
23. A computer readable storage medium, characterized in that a computer program is stored in the computer readable storage medium, wherein the computer program, when being executed by a processor, implements the steps of the method of any of claims 16 to 21.
24. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any of claims 16 to 21 when the computer program is executed.
CN202310230370.3A 2023-03-10 2023-03-10 System for deep learning, method for processing data and electronic equipment Pending CN116069511A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310230370.3A CN116069511A (en) 2023-03-10 2023-03-10 System for deep learning, method for processing data and electronic equipment


Publications (1)

Publication Number Publication Date
CN116069511A true CN116069511A (en) 2023-05-05

Family

ID=86175156


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463448A (en) * 2017-09-28 2017-12-12 郑州云海信息技术有限公司 A kind of deep learning weight renewing method and system
CN111680791A (en) * 2020-06-16 2020-09-18 北京字节跳动网络技术有限公司 Communication method, device and system suitable for heterogeneous environment
CN113627620A (en) * 2021-07-29 2021-11-09 上海熠知电子科技有限公司 Processor module for deep learning
CN114281521A (en) * 2021-11-21 2022-04-05 苏州浪潮智能科技有限公司 Method, system, device and medium for optimizing communication efficiency of deep learning heterogeneous resources

Similar Documents

Publication Publication Date Title
EP3706394B1 (en) Writes to multiple memory destinations
US10411953B2 (en) Virtual machine fault tolerance method, apparatus, and system
US8140704B2 (en) Pacing network traffic among a plurality of compute nodes connected using a data communications network
US8676917B2 (en) Administering an epoch initiated for remote memory access
US8775698B2 (en) Performing an all-to-all data exchange on a plurality of data buffers by performing swap operations
US7797445B2 (en) Dynamic network link selection for transmitting a message between compute nodes of a parallel computer
US10725957B1 (en) Uniform memory access architecture
CN106557444B (en) Method and device for realizing SR-IOV network card and method and device for realizing dynamic migration
US10909655B2 (en) Direct memory access for graphics processing unit packet processing
US20220206969A1 (en) Data forwarding chip and server
CN104636185A (en) Service context management method, physical host, PCIE equipment and migration management equipment
WO2023221847A1 (en) Data access method based on direct communication of virtual machine device, and device and system
WO2023093043A1 (en) Data processing method and apparatus, and medium
CN114900699A (en) Video coding and decoding card virtualization method and device, storage medium and terminal
CN111078353A (en) Operation method of storage equipment and physical server
US11343176B2 (en) Interconnect address based QoS regulation
CN115686836A (en) Unloading card provided with accelerator
CN109729731B (en) Accelerated processing method and device
Shim et al. Design and implementation of initial OpenSHMEM on PCIe NTB based cloud computing
CN115934624B (en) Method, equipment and medium for managing multi-host remote direct memory access network
CN116069511A (en) System for deep learning, method for processing data and electronic equipment
CN116204448A (en) Multi-port solid state disk, control method and device thereof, medium and server
CN111158905A (en) Method and device for adjusting resources
CN117519908B (en) Virtual machine thermomigration method, computer equipment and medium
US20210328945A1 (en) Configurable receive buffer size

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230505