WO2024022046A1 - A deep learning system and method - Google Patents

A deep learning system and method

Info

Publication number
WO2024022046A1
Authority
WO
WIPO (PCT)
Prior art keywords
data flow
sub
computing
data
module
Prior art date
Application number
PCT/CN2023/105715
Other languages
English (en)
French (fr)
Inventor
林惠敏
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2024022046A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • the present application relates to the field of artificial intelligence, and more specifically, to a deep learning system and method.
  • Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Research in the field of AI includes deep learning, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc. As people conduct in-depth research in the field of AI, the subfield of deep learning has also continued to develop.
  • This application provides a deep learning system and method that can realize adaptive allocation of computing tasks, improve the utilization of computing module resources, and improve the efficiency of the computing module in processing computing tasks, thereby reducing the application difficulty of deep learning.
  • In a first aspect, a deep learning system is provided, including a processing module and N computing modules.
  • The processing module is used to divide the data flow graph into M sub-data flow graphs and, according to the mapping relationship between the parameters of the M sub-data flow graphs and the parameters of the N computing modules, assign the M sub-data flow graphs to the N computing modules, where M and N are positive integers. The N computing modules are used to calculate the data of their corresponding sub-data flow graphs.
  • In this way, the processing module can allocate the M sub-data flow graphs to the N computing modules based on the mapping relationship between the parameters of the M sub-data flow graphs and the parameters of the N computing modules. This realizes adaptive allocation of computing tasks, improves the utilization of computing module resources, improves the efficiency of the computing modules in processing computing tasks, and reduces the difficulty of applying deep learning.
  • the parameters of the M sub-data flow graphs include at least one of the following: data priorities of the M sub-data flow graphs, and data amounts of the M sub-data flow graphs.
  • the parameters of the N computing modules include at least one of the following: bandwidth between the N computing modules, computing power of the N computing modules, and storage capacity of the N computing modules.
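  • Illustration (not part of the patent text): a minimal Python sketch of one possible representation of the two parameter sets named above; the field names (`priority`, `volume`, `compute`, `memory`) and the pairwise `bandwidth` table are assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass
class SubGraph:
    name: str
    priority: int   # data (exchange) priority of the sub-data flow graph
    volume: float   # amount of data in the sub-data flow graph

@dataclass
class ComputingModule:
    name: str
    compute: float  # computing power of the module
    memory: float   # storage capacity of the module

# bandwidth between every two computing modules, keyed by a name pair
bandwidth = {("module1", "module2"): 100.0, ("module1", "module3"): 10.0}
```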
  • the processing module is configured to divide the data flow graph into M sub-data flow graphs according to the service quality index.
  • In this way, the processing module can divide the data flow graph into M sub-data flow graphs according to the service quality indicator, so that the N computing modules can calculate the data of the corresponding sub-data flow graphs. This improves the utilization of computing module resources, improves the efficiency of the computing modules in processing computing tasks, and reduces the difficulty of applying deep learning.
  • the data of the sub-data flow graph corresponding to each of the N computing modules is concurrently calculated by the N computing modules.
  • the data of the sub-data flow graph corresponding to each of the N computing modules is independently calculated by the N computing modules.
  • the N computing modules are also used to exchange data of their respective corresponding sub-data flow graphs using exchange operations.
  • The N computing modules can use exchange operations to exchange the data of their corresponding sub-data flow graphs, thereby ensuring the accuracy of the N computing modules' calculations on the sub-data flow graph data.
  • The N computing modules include at least two devices, and the at least two devices form at least one device group through an interconnection device; the computing power of one device group is greater than or equal to the computing power of one device.
  • Because the at least two devices form at least one device group through the interconnection device, the computing power of the device group is greater than or equal to the computing power of a single device, which can improve the utilization of individual devices.
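  • Illustration (not part of the patent text): a minimal sketch of the device-group property, under the simplifying assumption that a group's computing power is the sum of its members' — which trivially makes a group at least as powerful as any single device.

```python
def group_compute_power(member_powers):
    """Computing power of a device group formed over an interconnection device.

    Summing member powers is one simple model; under it, a group is always
    at least as powerful as any single device it contains.
    """
    return sum(member_powers)

group = [8.0, 8.0]  # two devices joined by an interconnection device
assert group_compute_power(group) >= max(group)
```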
  • In a second aspect, a deep learning method is provided, including: dividing the data flow graph into M sub-data flow graphs; allocating the M sub-data flow graphs to N computing modules according to the mapping relationship between the parameters of the M sub-data flow graphs and the parameters of the N computing modules, where M and N are positive integers; and calculating the data of the corresponding sub-data flow graphs.
  • In this way, the M sub-data flow graphs can be allocated to the N computing modules according to the mapping relationship between the parameters of the M sub-data flow graphs and the parameters of the N computing modules, thereby achieving adaptive allocation of computing tasks, improving the utilization of computing module resources, improving the efficiency of the computing modules in processing computing tasks, and reducing the difficulty of applying deep learning.
  • the parameters of the M sub-data flow graphs include at least one of the following: the data priority of the M sub-data flow graphs, and the data volume of the M sub-data flow graphs.
  • the parameters of the N computing modules include at least one of the following: bandwidth between the N computing modules, computing power of the N computing modules, and storage capacity of the N computing modules.
  • dividing the data flow graph into M sub-data flow graphs includes: dividing the data flow graph into M sub-data flow graphs according to the service quality indicator.
  • the data of the respective corresponding sub-data flow graphs are concurrently calculated by N computing modules.
  • the data of the respective corresponding sub-data flow graphs are independently calculated by N computing modules.
  • the method further includes: using an exchange operation to exchange data of the sub-data flow graphs corresponding to each of the N computing modules.
  • In a third aspect, a deep learning system is provided, including: a memory for storing a program; and a processor for executing the program stored in the memory. When the program stored in the memory is executed, the processor is configured to execute the method in the second aspect or any implementation of the second aspect.
  • the processor in the above third aspect can be either a central processing unit (CPU) or a combination of a CPU and a neural network computing processor.
  • The neural network computing processor here can include a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and so on.
  • A TPU is an artificial intelligence accelerator application-specific integrated circuit fully customized by Google for machine learning.
  • In a fourth aspect, a computer-readable storage medium is provided that stores program code for execution by a device; the program code includes instructions for executing the method in the second aspect or any implementation of the second aspect.
  • In a fifth aspect, a computer program product containing instructions is provided; when the computer program product is run on a computer, it causes the computer to execute the method in the second aspect or any implementation of the second aspect.
  • In a sixth aspect, a chip is provided, including a processor and a data interface; the processor reads instructions stored in a memory through the data interface and executes the method in the second aspect or any implementation of the second aspect.
  • the chip may further include a memory, in which instructions are stored, and the processor is configured to execute the instructions stored in the memory.
  • the processor is configured to execute the method in the second aspect or any one of the implementations of the second aspect.
  • In a seventh aspect, a system on a chip (SoC) is provided, where the SoC includes the deep learning system in the first aspect or any implementation of the first aspect.
  • Figure 1 shows a schematic block diagram of a deep learning system 100 provided by an embodiment of the present application.
  • Figure 2 shows a schematic block diagram of yet another deep learning system 200 provided by an embodiment of the present application.
  • Figure 3 shows a schematic structural diagram of a computing module provided by an embodiment of the present application.
  • Figure 4 shows a schematic diagram of a data flow diagram provided by an embodiment of the present application.
  • Figure 5 shows a schematic diagram of yet another data flow diagram provided by an embodiment of the present application.
  • Figure 6 shows a schematic diagram of yet another data flow diagram provided by an embodiment of the present application.
  • Figure 7 shows a schematic diagram of yet another data flow diagram provided by an embodiment of the present application.
  • Figure 8 shows a schematic diagram of a deep learning method 800 provided by the embodiment of the present application.
  • Figure 9 shows a schematic diagram of a deep learning system provided by an embodiment of the present application.
  • AI is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • AI is a branch of computer science that attempts to understand the nature of intelligence and produce a new intelligent machine that can respond in a manner similar to human intelligence.
  • AI is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of AI includes deep learning, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
  • As research in the field of AI deepens, the subfield of deep learning has also continued to develop.
  • Innovation in algorithms for images, videos, natural language and other domains continues to accelerate, and the computing resources required for computing tasks have expanded from a single chip to a computing cluster.
  • the computing chip architecture adopts a variety of packaging technologies such as single chip, dual chip, and multi-chip.
  • the memory architecture includes on-chip cache areas, local storage areas, distributed storage systems, etc., which can provide different storage capabilities.
  • the interconnection architecture between computing modules also has different forms, including full interconnection and tree communication topology.
  • This application provides a deep learning system that can allocate computing tasks without relying on the application layer. Through this system, adaptive allocation of computing tasks can be realized, the utilization of computing module resources can be improved, and the efficiency of computing modules in processing computing tasks can be improved, thus reducing the difficulty of applying deep learning.
  • FIG. 1 shows a schematic block diagram of a deep learning system 100 provided by an embodiment of the present application.
  • the deep learning system 100 includes a management module 110 , a transceiver module 120 , a processing module 130 and a computing module 140 .
  • the management module 110 can be used to obtain the parameters of the calculation module 140 and send the parameters of the calculation module 140 to the processing module 130 through the transceiver module 120 .
  • the management module 110 can obtain the computing power of the N computing modules, where N is a positive integer.
  • The transceiver module 120 can be used to send and receive information required for computing tasks. For example, the transceiver module 120 can send the parameters of the calculation module 140 obtained by the management module 110 to the processing module 130; for another example, the transceiver module 120 can obtain the data flow graph and send it to the processing module 130; for another example, the transceiver module 120 can obtain the data of the data flow graph and send it to the processing module 130.
  • the processing module 130 may be used to process information required for computing tasks.
  • the processing module 130 includes a first module, a second module and a third module.
  • the first module can be used to segment the data flow graph. For example, the first module can divide the data flow graph into M sub-data flow graphs, where M is a positive integer.
  • The second module can assign the data flow graph to the computing module 140. For example, assuming that there are M sub-data flow graphs and the computing module 140 includes N computing modules, the second module can assign the sub-data flow graph with the largest amount of data among the M sub-data flow graphs to the computing module with the highest computing power among the N computing modules.
  • the third module may be used to distribute the data of the data flow graph to the calculation module 140 .
  • the third module can distribute the data of the M sub-data flow graphs to the N computing modules.
  • the calculation module 140 can be used to calculate the data of the data flow graph. For example, assuming that the calculation module 140 includes N calculation modules, the N calculation modules can calculate the data of their respective corresponding sub-data flow graphs.
  • the process of processing computing tasks based on the deep learning system may include the following steps:
  • the management module 110 can obtain the parameters of the computing module 140. Assuming that the computing module 140 includes N computing modules, the management module 110 can obtain the computing power of the N computing modules.
  • the transceiver module 120 can send the parameters of the calculation module 140 obtained by the management module 110 to the processing module 130 .
  • the transceiver module 120 may send the obtained data flow graph to the processing module 130 .
  • In the fourth step, the processing module 130, which includes a first module and a second module, processes the data flow graph: the first module may divide the data flow graph into M sub-data flow graphs, and the second module may assign the M sub-data flow graphs to the N computing modules included in the computing module 140.
  • The second module can consider the parameters of the N computing modules when allocating the M sub-data flow graphs. For example, the second module can allocate the sub-data flow graphs with larger amounts of data among the M sub-data flow graphs to the computing modules with higher computing power among the N computing modules.
  • the transceiver module 120 may send the obtained data of the data flow graph to the processing module 130 .
  • the processing module 130 also includes a third module.
  • the third module can allocate the data of the M sub-data flow graphs to the N calculation modules included in the calculation module 140.
  • N calculation modules can calculate the data in their corresponding sub-data flow graphs.
  • In this way, the deep learning system provided by the embodiments of this application can realize adaptive allocation of computing tasks without the application layer allocating them, improving the utilization of computing module resources and the efficiency with which the computing modules process computing tasks, and thus reducing the difficulty of applying deep learning.
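  • Illustration (not part of the patent text): a runnable Python sketch of the step sequence above, with the data flow graph reduced to a flat list of operations, the split and assignment reduced to simple heuristics, and `sum` standing in for each computing module's calculation; all names here are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def split_graph(ops, m):
    """First module: divide the graph's ops into m sub-data flow graphs."""
    return [ops[i::m] for i in range(m)]

def assign(subgraphs, compute_powers):
    """Second module: the larger the sub-graph, the stronger the module."""
    by_size = sorted(range(len(subgraphs)), key=lambda i: -len(subgraphs[i]))
    by_power = sorted(range(len(compute_powers)), key=lambda j: -compute_powers[j])
    return dict(zip(by_size, by_power))  # sub-graph index -> module index

def run(ops, compute_powers):
    subgraphs = split_graph(ops, len(compute_powers))   # steps 3-4
    placement = assign(subgraphs, compute_powers)
    with ThreadPoolExecutor() as pool:                  # steps 5-6
        futures = {sg: pool.submit(sum, subgraphs[sg]) for sg in placement}
        return {sg: f.result() for sg, f in futures.items()}

print(run(list(range(10)), [4.0, 2.0, 1.0]))  # {0: 18, 1: 12, 2: 15}
```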
  • the processing module can assign the data flow graph to the computing module, so that the computing module can calculate the data of the data flow graph.
  • When the computing module includes N computing modules, the specific process by which the processing module allocates the data flow graph to the N computing modules is described in detail below with reference to Figures 2 to 7.
  • FIG. 2 is a schematic block diagram of yet another deep learning system 200 provided by an embodiment of the present application.
  • the deep learning system 200 may include a processing module 210 and N computing modules 220 .
  • the processing module 210 may be the processing module 130 in FIG. 1
  • the N calculation modules 220 may be the calculation modules 140 in FIG. 1 .
  • The processing module 210 is configured to divide the data flow graph into M sub-data flow graphs and, according to the mapping relationship between the parameters of the M sub-data flow graphs and the parameters of the N computing modules 220, assign the M sub-data flow graphs to the N computing modules 220.
  • the processing module 210 obtains the data flow graph, which may include: a transceiver module receiving the data flow graph and sending it to the processing module 210, where the transceiver module may be the transceiver module 120 in Figure 1 .
  • The processing module 210 obtains the parameters of the N computing modules 220, which may include: the management module obtains the parameters of the N computing modules 220 and sends them to the processing module 210 through the transceiver module, where the management module may be the management module 110 in Figure 1 and the transceiver module may be the transceiver module 120 in Figure 1.
  • the processing module 210 divides the data flow graph into M sub-data flow graphs, which may be executed by the first module included in the processing module 130 in FIG. 1 .
  • the processing module 210 allocates M sub-data flow graphs to N calculation modules 220, which may be executed by the second module included in the processing module 130 in Figure 1.
  • M and N are positive integers.
  • the parameters of the M sub-data flow graphs may include at least one of the following: data priorities of the M sub-data flow graphs, and data amounts of the M sub-data flow graphs.
  • the data priority of the M sub-data flow graphs may be the data exchange priority of the M sub-data flow graphs, or may be other priorities of the data of the M sub-data flow graphs, which are not limited by the embodiments of this application.
  • In the following, the case where the data priority of the M sub-data flow graphs is the data exchange priority of the M sub-data flow graphs is used as an example for explanation.
  • the data exchange priority of the M sub-data flow graphs is associated with the number of data exchanges between each sub-data flow graph.
  • When the number of data exchanges between P sub-data flow graphs among the M sub-data flow graphs is larger, the data exchange priority of those P sub-data flow graphs is higher; when the number of data exchanges between Q sub-data flow graphs among the M sub-data flow graphs is smaller, the data exchange priority of those Q sub-data flow graphs is lower.
  • P and Q are positive integers.
  • For example, assume the processing module 210 divides the data flow graph into six sub-data flow graphs, recorded as sub-data flow graph 1 to sub-data flow graph 6. According to the number of data exchanges between the sub-data flow graphs, sub-data flow graph 1, sub-data flow graph 2 and sub-data flow graph 3 are recorded as the first group of sub-data flow graphs; sub-data flow graph 4, sub-data flow graph 5 and sub-data flow graph 6 are recorded as the second group; sub-data flow graph 1 and sub-data flow graph 4 are recorded as the third group; sub-data flow graph 2 and sub-data flow graph 5 are recorded as the fourth group; and sub-data flow graph 3 and sub-data flow graph 6 are recorded as the fifth group.
  • the parameters of the N computing modules 220 may include at least one of the following: bandwidth between the N computing modules 220 , computing power of the N computing modules 220 , and storage capacity of the N computing modules 220 .
  • the bandwidth between the N computing modules 220 may be the bandwidth between every two computing modules in the N computing modules 220 .
  • For example, assuming there are three computing modules, recorded as computing module 1, computing module 2 and computing module 3, the bandwidth between the three computing modules can be the bandwidth between computing module 1 and computing module 2, the bandwidth between computing module 1 and computing module 3, or the bandwidth between computing module 2 and computing module 3.
  • The parameters of the M sub-data flow graphs are associated with the parameters of the N computing modules 220. Therefore, the processing module 210 can allocate the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between the parameters of the M sub-data flow graphs and the parameters of the N computing modules 220.
  • the parameters of the M sub-data flow graphs are associated with the parameters of the N calculation modules 220 in the following possible ways.
  • the data priority of the M sub-data flow graphs is associated with the bandwidth between the N computing modules 220 .
  • Computing modules with higher bandwidth between them can be used to calculate the data of sub-data flow graphs with higher data exchange priority; computing modules with lower bandwidth between them can be used to calculate the data of sub-data flow graphs with lower data exchange priority.
  • Another possible way is that the data priority of the M sub-data flow graphs is associated with the computing power of the N computing modules 220. A computing module with higher computing power can be used to calculate the data of a sub-data flow graph with a higher data exchange priority; a computing module with lower computing power can be used to calculate the data of a sub-data flow graph with a lower data exchange priority.
  • Another possible way is that the data priority of the M sub-data flow graphs is associated with the storage capacity of the N computing modules 220 .
  • As the data exchange priority increases, the amount of sub-data flow graph data that a computing module needs to store also increases. For example, suppose there are two computing modules, recorded as computing module 1 and computing module 2, where the sub-data flow graph corresponding to computing module 1 is sub-data flow graph 1 and the sub-data flow graph corresponding to computing module 2 is sub-data flow graph 2. When the two modules exchange data, the data that computing module 1 needs to store includes the data of sub-data flow graph 1 and the data of sub-data flow graph 2, and likewise the data that computing module 2 needs to store includes the data of sub-data flow graph 1 and the data of sub-data flow graph 2. Therefore, the data exchange priority of the M sub-data flow graphs is associated with the storage capacity of the N computing modules 220.
  • A computing module with a larger storage capacity can be used to calculate the data of a sub-data flow graph with a higher data exchange priority; a computing module with a smaller storage capacity can be used to calculate the data of a sub-data flow graph with a lower data exchange priority.
  • the data volume of the M sub-data flow graphs is related to the bandwidth between the N computing modules 220 .
  • the data volume of M sub-data flow graphs is related to the computing power of N computing modules 220 .
  • A computing module with higher computing power can be used to calculate the data of a sub-data flow graph with a larger amount of data; a computing module with lower computing power can be used to calculate the data of a sub-data flow graph with a smaller amount of data.
  • Yet another possible way is that the data volume of the M sub-data flow graphs is related to the storage capacity of the N computing modules 220. A computing module with a larger storage capacity can be used to calculate the data of a sub-data flow graph with a larger amount of data; a computing module with a smaller storage capacity can be used to calculate the data of a sub-data flow graph with a smaller amount of data.
  • The number M of sub-data flow graphs may be greater than, equal to, or smaller than the number N of computing modules; the embodiments of this application do not limit the size relationship between the number M of sub-data flow graphs and the number N of computing modules.
  • Taking the case where the number M of sub-data flow graphs is greater than the number N of computing modules as an example, the processing module 210 can allocate the M sub-data flow graphs to the N computing modules 220 in the following ways.
  • Method #A: when the number M of sub-data flow graphs is greater than the number N of computing modules, the processing module 210 can allocate the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between the data priorities of the M sub-data flow graphs and the bandwidth between the N computing modules 220. Specifically, the processing module 210 can allocate sub-data flow graphs with higher data exchange priority to computing modules with higher bandwidth between them, and allocate sub-data flow graphs with lower data exchange priority to computing modules with lower bandwidth between them.
  • Figure 3 shows a schematic structural diagram of a computing module provided by an embodiment of the present application.
  • As shown in Figure 3, assume there are 8 computing modules, recorded as computing module 1, computing module 2, computing module 3, computing module 4, computing module 5, computing module 6, computing module 7 and computing module 8. The bandwidth between each pair of computing module 1, computing module 2, computing module 3 and computing module 4 is relatively high; the bandwidth between each pair of computing module 5, computing module 6, computing module 7 and computing module 8 is also relatively high; and the bandwidth between the remaining pairs of computing modules is low. For example, the bandwidth between computing module 1 and computing module 5 is low, and the bandwidth between computing module 2 and computing module 6 is low.
  • In this scenario, suppose the sub-data flow graphs are grouped according to the data exchange priority between them: the first group of sub-data flow graphs includes sub-data flow graph 1, sub-data flow graph 2, sub-data flow graph 3 and sub-data flow graph 4; the second group includes sub-data flow graph 5, sub-data flow graph 6, sub-data flow graph 7, sub-data flow graph 8 and sub-data flow graph 9; the third group includes sub-data flow graph 1 and sub-data flow graph 5; and the fourth group includes sub-data flow graph 2 and sub-data flow graph 6. The data exchange priority within the first group and within the second group is higher, while the data exchange priority within the third group and within the fourth group is lower.
  • The processing module 210 may assign the sub-data flow graphs in the first group to computing module 1, computing module 2, computing module 3 and computing module 4, and assign the sub-data flow graphs in the second group to computing module 5, computing module 6, computing module 7 and computing module 8. In this way, the processing module 210 allocates sub-data flow graphs with higher data exchange priority (such as the first group and the second group) to computing modules with higher bandwidth between them, and allocates sub-data flow graphs with lower data exchange priority (such as the third group and the fourth group) to computing modules with lower bandwidth between them.
  • the processing module 210 may assign the sub-data flow graph 1, the sub-data flow graph 2, the sub-data flow graph 3 and the sub-data flow graph 4 to the computing module 1, the computing module 2, the computing module 3 and the computing module 4 respectively.
  • The processing module 210 can assign sub-data flow graph 5, sub-data flow graph 6 and sub-data flow graph 7 to computing module 5, computing module 6 and computing module 7 respectively, and can assign sub-data flow graph 8 and sub-data flow graph 9 to computing module 8.
  • In this way, the processing module 210 can allocate the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between the data priorities of the M sub-data flow graphs and the bandwidth between the N computing modules 220, so that adaptive allocation of computing tasks is realized without the application layer allocating them. This improves the utilization of the computing modules' bandwidth resources, improves the efficiency of the computing modules in processing computing tasks, and reduces the difficulty of applying deep learning.
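  • Illustration (not part of the patent text): a minimal sketch of Method #A under the Figure 3 assumptions — sub-data flow graph groups carry a data exchange priority, computing modules come pre-partitioned into clusters with a known intra-cluster bandwidth, and the highest-priority group is placed on the highest-bandwidth cluster. When a group has more sub-graphs than its cluster has modules, the extras pile onto the cluster's last module, matching the example where sub-data flow graphs 8 and 9 both go to computing module 8.

```python
def assign_by_priority_and_bandwidth(groups, clusters):
    """groups:   list of (data_exchange_priority, [sub-graph names])
    clusters: list of (intra_cluster_bandwidth, [module names])"""
    groups = sorted(groups, key=lambda g: -g[0])      # highest priority first
    clusters = sorted(clusters, key=lambda c: -c[0])  # highest bandwidth first
    placement = {}
    for (_, subgraphs), (_, modules) in zip(groups, clusters):
        for i, sg in enumerate(subgraphs):
            # extra sub-graphs share the cluster's last module
            placement[sg] = modules[min(i, len(modules) - 1)]
    return placement

placement = assign_by_priority_and_bandwidth(
    [(2, ["g1", "g2", "g3", "g4"]), (1, ["g5", "g6", "g7", "g8", "g9"])],
    [(100.0, ["m1", "m2", "m3", "m4"]), (90.0, ["m5", "m6", "m7", "m8"])],
)
print(placement)  # ..., 'g8': 'm8', 'g9': 'm8'
```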
  • Method #B: when the number M of sub-data flow graphs is greater than the number N of computing modules, the processing module 210 can allocate the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between the data priorities of the M sub-data flow graphs and the computing power of the N computing modules 220. Specifically, the processing module 210 can allocate sub-data flow graphs with higher data exchange priority to computing modules with higher computing power, and allocate sub-data flow graphs with lower data exchange priority to computing modules with lower computing power.
  • In this scenario, suppose the sub-data flow graphs are grouped according to the data exchange priority between them: the first group of sub-data flow graphs includes sub-data flow graph 1, sub-data flow graph 2, sub-data flow graph 3 and sub-data flow graph 4; the second group includes sub-data flow graph 5, sub-data flow graph 6, sub-data flow graph 7, sub-data flow graph 8 and sub-data flow graph 9; and the data exchange priority of the first group is higher than that of the second group.
  • The processing module 210 may assign the sub-data flow graphs in the first group to computing module 1, computing module 2, computing module 3 and computing module 4, and assign the sub-data flow graphs in the second group to computing module 5, computing module 6, computing module 7 and computing module 8. In this way, the processing module 210 allocates sub-data flow graphs with higher data exchange priority (such as the first group) to computing modules with higher computing power, and allocates sub-data flow graphs with lower data exchange priority (such as the second group) to computing modules with lower computing power.
  • the processing module 210 may assign the sub-data flow graph 1, the sub-data flow graph 2, the sub-data flow graph 3 and the sub-data flow graph 4 to the computing module 1, the computing module 2, the computing module 3 and the computing module 4 respectively.
  • The processing module 210 can assign sub-data flow graph 5, sub-data flow graph 6 and sub-data flow graph 7 to computing module 5, computing module 6 and computing module 7 respectively, and can assign sub-data flow graph 8 and sub-data flow graph 9 to computing module 8.
  • In this way, the processing module 210 can allocate the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between the data priorities of the M sub-data flow graphs and the computing power of the N computing modules 220, so that adaptive allocation of computing tasks is realized without the application layer allocating them. This improves the utilization of the computing modules' computing resources, improves the efficiency of the computing modules in processing computing tasks, and reduces the difficulty of applying deep learning.
  • Method #C: when the number M of sub-data flow graphs is greater than the number N of computing modules, the processing module 210 can allocate the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between the data priorities of the M sub-data flow graphs and the storage capacities of the N computing modules 220. Specifically, the processing module 210 can allocate sub-data flow graphs with higher data exchange priority to computing modules with larger storage capacity, and allocate sub-data flow graphs with lower data exchange priority to computing modules with smaller storage capacity.
  • In this scenario, suppose the sub-data flow graphs are grouped according to the data exchange priority between them: the first group of sub-data flow graphs includes sub-data flow graph 1, sub-data flow graph 2, sub-data flow graph 3 and sub-data flow graph 4; the second group includes sub-data flow graph 5, sub-data flow graph 6, sub-data flow graph 7, sub-data flow graph 8 and sub-data flow graph 9; and the data exchange priority of the first group is higher than that of the second group.
  • The processing module 210 may assign the sub-data flow graphs in the first group to computing module 1, computing module 2, computing module 3 and computing module 4, and assign the sub-data flow graphs in the second group to computing module 5, computing module 6, computing module 7 and computing module 8. In this way, the processing module 210 allocates sub-data flow graphs with higher data exchange priority (such as the first group) to computing modules with larger storage capacity, and allocates sub-data flow graphs with lower data exchange priority (such as the second group) to computing modules with smaller storage capacity.
  • the processing module 210 may assign the sub-data flow graph 1, the sub-data flow graph 2, the sub-data flow graph 3 and the sub-data flow graph 4 to the computing module 1, the computing module 2, the computing module 3 and the computing module 4 respectively.
  • The processing module 210 can assign sub-data flow graph 5, sub-data flow graph 6 and sub-data flow graph 7 to computing module 5, computing module 6 and computing module 7 respectively, and can assign sub-data flow graph 8 and sub-data flow graph 9 to computing module 8.
  • In this way, the processing module 210 can allocate the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between the data priorities of the M sub-data flow graphs and the storage capacities of the N computing modules 220, so that adaptive allocation of computing tasks is realized without the application layer allocating them. This improves the utilization of the computing modules' storage resources, improves the efficiency of the computing modules in processing computing tasks, and reduces the difficulty of applying deep learning.
  • Method #D: when the number M of sub-data flow graphs is greater than the number N of computing modules, the processing module 210 can allocate the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between the data volumes of the M sub-data flow graphs and the computing power of the N computing modules 220. Specifically, the processing module 210 can allocate sub-data flow graphs with a larger amount of data to computing modules with higher computing power, and allocate sub-data flow graphs with a smaller amount of data to computing modules with lower computing power.
  • In this scenario, suppose the sub-data flow graphs are grouped according to their data volume: the first group of sub-data flow graphs includes sub-data flow graph 1, sub-data flow graph 2, sub-data flow graph 3 and sub-data flow graph 4; the second group includes sub-data flow graph 5, sub-data flow graph 6, sub-data flow graph 7, sub-data flow graph 8 and sub-data flow graph 9; and the data volume of the first group is larger than that of the second group.
  • The processing module 210 may assign the sub-data flow graphs in the first group to computing module 1, computing module 2, computing module 3 and computing module 4, and assign the sub-data flow graphs in the second group to computing module 5, computing module 6, computing module 7 and computing module 8. In this way, the processing module 210 allocates sub-data flow graphs with a larger amount of data (such as the first group) to computing modules with higher computing power, and allocates sub-data flow graphs with a smaller amount of data (such as the second group) to computing modules with lower computing power.
  • the processing module 210 may assign the sub-data flow graph 1, the sub-data flow graph 2, the sub-data flow graph 3 and the sub-data flow graph 4 to the computing module 1, the computing module 2, the computing module 3 and the computing module 4 respectively.
  • The processing module 210 can assign sub-data flow graph 5, sub-data flow graph 6 and sub-data flow graph 7 to computing module 5, computing module 6 and computing module 7 respectively, and can assign sub-data flow graph 8 and sub-data flow graph 9 to computing module 8.
  • In this way, the processing module 210 can allocate the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between the data volumes of the M sub-data flow graphs and the computing power of the N computing modules 220, so that adaptive allocation of computing tasks is realized without the application layer allocating them. This improves the utilization of the computing modules' computing resources, improves the efficiency of the computing modules in processing computing tasks, and reduces the difficulty of applying deep learning.
  • Method #E: when the number M of sub-data flow graphs is greater than the number N of computing modules, the processing module 210 can allocate the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between the data volumes of the M sub-data flow graphs and the bandwidth between the N computing modules 220. Specifically, the processing module 210 can allocate sub-data flow graphs with a larger amount of data to computing modules with higher bandwidth between them, and allocate sub-data flow graphs with a smaller amount of data to computing modules with lower bandwidth between them.
  • In this scenario, suppose the sub-data flow graphs are grouped according to their data volume: the first group of sub-data flow graphs includes sub-data flow graph 1, sub-data flow graph 2, sub-data flow graph 3 and sub-data flow graph 4; the second group includes sub-data flow graph 5, sub-data flow graph 6, sub-data flow graph 7, sub-data flow graph 8 and sub-data flow graph 9; and the data volume of the first group is larger than that of the second group.
  • The processing module 210 may assign the sub-data flow graphs in the first group to computing module 1, computing module 2, computing module 3 and computing module 4, and assign the sub-data flow graphs in the second group to computing module 5, computing module 6, computing module 7 and computing module 8. In this way, the processing module 210 allocates sub-data flow graphs with a larger amount of data (such as the first group) to computing modules with higher bandwidth between them, and allocates sub-data flow graphs with a smaller amount of data (such as the second group) to computing modules with lower bandwidth between them.
  • the processing module 210 may assign the sub-data flow graph 1, the sub-data flow graph 2, the sub-data flow graph 3 and the sub-data flow graph 4 to the computing module 1, the computing module 2, the computing module 3 and the computing module 4 respectively.
  • The processing module 210 can assign sub-data flow graph 5, sub-data flow graph 6 and sub-data flow graph 7 to computing module 5, computing module 6 and computing module 7 respectively, and can assign sub-data flow graph 8 and sub-data flow graph 9 to computing module 8.
  • In this way, the processing module 210 can allocate the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between the data volumes of the M sub-data flow graphs and the bandwidth between the N computing modules 220, so that adaptive allocation of computing tasks is realized without the application layer allocating them. This improves the utilization of the computing modules' bandwidth resources, improves the efficiency of the computing modules in processing computing tasks, and reduces the difficulty of applying deep learning.
  • Method #F: when the number M of sub-data flow graphs is greater than the number N of computing modules, the processing module 210 can allocate the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between the data volumes of the M sub-data flow graphs and the storage capacities of the N computing modules 220. Specifically, the processing module 210 can allocate sub-data flow graphs with a larger amount of data to computing modules with larger storage capacity, and allocate sub-data flow graphs with a smaller amount of data to computing modules with smaller storage capacity.
  • In this scenario, suppose the sub-data flow graphs are grouped according to their data volume: the first group of sub-data flow graphs includes sub-data flow graph 1, sub-data flow graph 2, sub-data flow graph 3 and sub-data flow graph 4; the second group includes sub-data flow graph 5, sub-data flow graph 6, sub-data flow graph 7, sub-data flow graph 8 and sub-data flow graph 9; and the data volume of the first group is larger than that of the second group.
  • The processing module 210 may assign the sub-data flow graphs in the first group to computing module 1, computing module 2, computing module 3 and computing module 4, and assign the sub-data flow graphs in the second group to computing module 5, computing module 6, computing module 7 and computing module 8. In this way, the processing module 210 allocates sub-data flow graphs with a larger amount of data (such as the first group) to computing modules with larger storage capacity, and allocates sub-data flow graphs with a smaller amount of data (such as the second group) to computing modules with smaller storage capacity.
  • the processing module 210 may assign the sub-data flow graph 1, the sub-data flow graph 2, the sub-data flow graph 3 and the sub-data flow graph 4 to the computing module 1, the computing module 2, the computing module 3 and the computing module 4 respectively.
  • The processing module 210 can assign sub-data flow graph 5, sub-data flow graph 6 and sub-data flow graph 7 to computing module 5, computing module 6 and computing module 7 respectively, and can assign sub-data flow graph 8 and sub-data flow graph 9 to computing module 8.
  • In this way, the processing module 210 can allocate the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between the data volumes of the M sub-data flow graphs and the storage capacities of the N computing modules 220, so that adaptive allocation of computing tasks is realized without the application layer allocating them. This improves the utilization of the computing modules' storage resources, improves the efficiency of the computing modules in processing computing tasks, and reduces the difficulty of applying deep learning.
  • In addition, the processing module 210 can also allocate the M sub-data flow graphs to the N computing modules 220 based on the mapping relationship between the parameters of the M sub-data flow graphs and the parameters of the N computing modules 220 in other scenarios. For the specific allocation process, refer to the descriptions in Method #A to Method #F, which are not repeated here.
  • Method #A to Method #F describe scenarios in which the processing module 210 assigns the M sub-data flow graphs to the N computing modules 220 according to a single pair of parameters; however, the embodiments of this application are not limited to this. For example, the processing module 210 can assign the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between multiple parameters of the M sub-data flow graphs and one parameter of the N computing modules 220; for another example, according to the mapping relationship between one parameter of the M sub-data flow graphs and multiple parameters of the N computing modules 220; and for another example, according to the mapping relationship between multiple parameters of the M sub-data flow graphs and multiple parameters of the N computing modules 220. For the way the processing module 210 allocates the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between at least one parameter of the M sub-data flow graphs and at least one parameter of the N computing modules 220, refer to Method #A to Method #F; details are not repeated here.
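  • Illustration (not part of the patent text): one simple way to combine multiple parameters on both sides is a weighted score — rank sub-data flow graphs by a blend of priority and data volume, rank computing modules by a blend of bandwidth, computing power and storage capacity, and match the two rankings. The weights are arbitrary and for illustration only.

```python
def rank_and_match(subgraphs, modules, w_sg=(0.5, 0.5), w_mod=(0.4, 0.4, 0.2)):
    """subgraphs: list of (priority, volume) tuples;
    modules:   list of (bandwidth, compute, memory) tuples.
    Returns a sub-graph index -> module index map (assumes equal counts)."""
    score = lambda vec, w: sum(v * wi for v, wi in zip(vec, w))
    sg_order = sorted(range(len(subgraphs)),
                      key=lambda i: -score(subgraphs[i], w_sg))
    mod_order = sorted(range(len(modules)),
                       key=lambda j: -score(modules[j], w_mod))
    return dict(zip(sg_order, mod_order))

print(rank_and_match([(2, 10.0), (1, 4.0)],
                     [(10.0, 2.0, 8.0), (100.0, 8.0, 16.0)]))
# {0: 1, 1: 0}: the high-priority, high-volume graph lands on the stronger module
```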
  • The N computing modules 220 are used to calculate the data of their respective corresponding sub-data flow graphs.
  • The processing module 210 can assign the M sub-data flow graphs to the N computing modules 220 according to the mapping relationship between the parameters of the M sub-data flow graphs and the parameters of the N computing modules 220, so that the N computing modules 220 can each obtain their corresponding sub-data flow graph. For example, suppose there are three sub-data flow graphs, recorded as sub-data flow graph 1, sub-data flow graph 2 and sub-data flow graph 3, and three computing modules, recorded as computing module 1, computing module 2 and computing module 3. According to the mapping relationship between the parameters of the three sub-data flow graphs and the parameters of the three computing modules, the processing module 210 can assign sub-data flow graph 1 to computing module 1, sub-data flow graph 2 to computing module 2 and sub-data flow graph 3 to computing module 3, so that computing module 1 obtains the corresponding sub-data flow graph 1, computing module 2 obtains the corresponding sub-data flow graph 2, and computing module 3 obtains the corresponding sub-data flow graph 3.
  • The N computing modules 220 obtain the data of their respective corresponding sub-data flow graphs, which may include: the transceiver module receives the data of the data flow graph and sends it to the processing module 210; the processing module 210 obtains the data of the M sub-data flow graphs from the data of the data flow graph and distributes it to the N computing modules 220, so that the N computing modules can obtain the data of their corresponding sub-data flow graphs.
  • the transceiver module may be the transceiver module 120 in Figure 1 , and the processing module 210 distributes the data of the M sub-data flow graphs to N calculation modules 220 , which may be executed by the third module included in the processing module 130 in Figure 1 .
  • In this way, the processing module can allocate the M sub-data flow graphs to the N computing modules based on the mapping relationship between the parameters of the M sub-data flow graphs and the parameters of the N computing modules, so that adaptive allocation of computing tasks is achieved without the application layer allocating them. This improves the utilization of computing module resources, improves the efficiency of the computing modules in processing computing tasks, and reduces the difficulty of applying deep learning.
  • The processing module 210 is configured to divide the data flow graph into M sub-data flow graphs according to the service quality indicator.
  • The service quality indicator can be used to characterize the service quality of computing tasks. For example, when the service quality indicator is delay, it can be used to characterize the processing time of the computing task; for another example, when the service quality indicator is throughput, it can be used to characterize the average rate at which the data of the data flow graph is transmitted.
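  • Illustration (not part of the patent text): a sketch of how the service quality indicator might steer the split, under the interpretation used above — a delay target splits every part across all modules so they can work concurrently, while a throughput target keeps each part whole so modules can work independently. The "delay"/"throughput" strings and the part representation are assumptions.

```python
def split_by_qos(parts, n_modules, qos):
    """parts: dict like {"A": [ops...], "B": [ops...]} -> sub-data flow graphs."""
    if qos == "delay":
        # split each part into n pieces so all modules compute it concurrently
        return {f"{name}{i + 1}": ops[i::n_modules]
                for name, ops in parts.items() for i in range(n_modules)}
    if qos == "throughput":
        # keep each part whole so the modules compute independently
        return dict(parts)
    raise ValueError(f"unknown service quality indicator: {qos}")

parts = {"A": list(range(8)), "B": list(range(8))}
print(sorted(split_by_qos(parts, 4, "delay")))       # ['A1', ..., 'A4', 'B1', ..., 'B4']
print(sorted(split_by_qos(parts, 4, "throughput")))  # ['A', 'B']
```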
  • One possible way is that when the service quality indicator is delay, the data of the sub-data flow graph corresponding to each of the N computing modules is calculated concurrently by the N computing modules.
  • Figure 4 shows a schematic diagram of a data flow graph provided by an embodiment of this application. As shown in Figure 4, assume the data flow graph includes four parts A, B, C and D. The processing module 210 can divide the part-A data flow graph into the four sub-data flow graphs A1, A2, A3 and A4; the part-B data flow graph into the four sub-data flow graphs B1, B2, B3 and B4; the part-C data flow graph into the four sub-data flow graphs C1, C2, C3 and C4; and the part-D data flow graph into the four sub-data flow graphs D1, D2, D3 and D4. Assume there are four computing modules, recorded as computing module 1, computing module 2, computing module 3 and computing module 4. Based on the mapping relationship between the parameters of the four sub-data flow graphs A1, A2, A3 and A4 and the parameters of the four computing modules, the mapping relationship between the parameters of B1, B2, B3 and B4 and the parameters of the four computing modules, the mapping relationship between the parameters of C1, C2, C3 and C4 and the parameters of the four computing modules, and the mapping relationship between the parameters of D1, D2, D3 and D4 and the parameters of the four computing modules, the processing module 210 assigns these sub-data flow graphs to the four computing modules, so that the sub-data flow graphs corresponding to computing module 1 are A1, B1, C1 and D1; the sub-data flow graphs corresponding to computing module 2 are A2, B2, C2 and D2; the sub-data flow graphs corresponding to computing module 3 are A3, B3, C3 and D3; and the sub-data flow graphs corresponding to computing module 4 are A4, B4, C4 and D4.
  • The data of the four sub-data flow graphs A1, A2, A3 and A4 included in the part-A data flow graph is calculated concurrently by computing module 1, computing module 2, computing module 3 and computing module 4; the data of the four sub-data flow graphs B1, B2, B3 and B4 in part B is calculated concurrently by computing module 1, computing module 2, computing module 3 and computing module 4; the data of the four sub-data flow graphs C1, C2, C3 and C4 in part C is calculated concurrently by computing module 1, computing module 2, computing module 3 and computing module 4; and the data of the four sub-data flow graphs D1, D2, D3 and D4 in part D is calculated concurrently by computing module 1, computing module 2, computing module 3 and computing module 4.
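  • Illustration (not part of the patent text): a minimal sketch of the delay case for part A — four workers stand in for computing modules 1 to 4 and compute the data of the four sub-data flow graphs A1 to A4 at the same time; `sum` is a placeholder for the real calculation.

```python
from concurrent.futures import ThreadPoolExecutor

part_a = {"A1": [1, 2], "A2": [3, 4], "A3": [5, 6], "A4": [7, 8]}

# each worker computes the data of its own sub-data flow graph concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(zip(part_a, pool.map(sum, part_a.values())))

print(results)  # {'A1': 3, 'A2': 7, 'A3': 11, 'A4': 15}
```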
  • Figure 5 shows a schematic diagram of yet another data flow graph provided by an embodiment of the present application. As shown in Figure 5, assume that the data flow graph includes four parts A, B, C, and D. When the service quality indicator is throughput, the processing module 210 can divide the data flow graph into the four sub-data flow graphs A, B, C, and D. Assume that there are four computing modules, denoted computing module 1, computing module 2, computing module 3, and computing module 4, and assume that the processing module 210 allocates the four sub-data flow graphs A, B, C, and D to the four computing modules according to the mapping relationship between the parameters of the four sub-data flow graphs and the parameters of the four computing modules, so that the sub-data flow graph corresponding to computing module 1 is A, the sub-data flow graph corresponding to computing module 2 is B, the sub-data flow graph corresponding to computing module 3 is C, and the sub-data flow graph corresponding to computing module 4 is D. In this case, the data of the four sub-data flow graphs A, B, C, and D is computed independently by computing module 1, computing module 2, computing module 3, and computing module 4.
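For the throughput-oriented mode, one way to read "independent computation" is as a pipeline in which each computing module owns one whole sub-graph and processes every batch that reaches it. A minimal sketch, assuming queues model the links between modules and the four stage functions are placeholders rather than the patent's operators:

```python
from queue import Queue
from threading import Thread

def module_worker(stage_fn, inbox, outbox):
    # One computing module runs one whole sub-data flow graph (A, B, C, or D)
    # and processes arriving batches independently of the other modules.
    while True:
        batch = inbox.get()
        if batch is None:            # sentinel: propagate shutdown downstream
            outbox.put(None)
            return
        outbox.put(stage_fn(batch))

stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x / 4]
queues = [Queue() for _ in range(len(stages) + 1)]
workers = [Thread(target=module_worker, args=(fn, queues[i], queues[i + 1]))
           for i, fn in enumerate(stages)]
for w in workers:
    w.start()

for batch in range(8):               # a stream of batches keeps all modules busy
    queues[0].put(batch)
queues[0].put(None)

results = []
while (out := queues[-1].get()) is not None:
    results.append(out)
print(results)  # ((b + 1) * 2 - 3) / 4 for each input batch b
```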
  • For an exemplary description of how the processing module 210 allocates multiple sub-data flow graphs (for example, the four sub-data flow graphs A, B, C, and D) to the four computing modules according to the mapping relationship between the parameters of the multiple sub-data flow graphs and the parameters of the four computing modules, refer to the descriptions in Method #A to Method #F; details are not repeated here.
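To make the flavor of those mappings concrete, the following greedy sketch pairs sub-data flow graphs with computing modules in the spirit of Method #D (larger data volume goes to higher computing power). The numeric scales and module names are assumptions made up for the example, not values from the patent:

```python
def allocate(subgraphs, modules):
    # Rank sub-graphs by data volume and modules by computing power,
    # then pair them off; wrap around when M > N.
    ranked_graphs = sorted(subgraphs, key=subgraphs.get, reverse=True)
    ranked_modules = sorted(modules, key=modules.get, reverse=True)
    return {g: ranked_modules[i % len(ranked_modules)]
            for i, g in enumerate(ranked_graphs)}

subgraph_volumes = {"A": 90, "B": 70, "C": 40, "D": 20}       # MB, illustrative
module_power = {"module1": 100, "module2": 80, "module3": 60, "module4": 40}
print(allocate(subgraph_volumes, module_power))
# {'A': 'module1', 'B': 'module2', 'C': 'module3', 'D': 'module4'}
```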
  • Based on the above technical solution, the processing module can divide the data flow graph into M sub-data flow graphs according to the service quality indicator, so that the N computing modules can compute the data of their respective corresponding sub-data flow graphs. For example, when the service quality indicator is delay, the N computing modules can compute the data of their respective corresponding sub-data flow graphs concurrently; for another example, when the service quality indicator is throughput, the N computing modules can compute the data of their respective corresponding sub-data flow graphs independently. In this way, the utilization of computing module resources can be improved, the efficiency of the computing modules in processing computing tasks can be improved, and the application difficulty of deep learning can be reduced.
  • Optionally, the N computing modules 220 are further configured to exchange the data of their respective corresponding sub-data flow graphs using an exchange operation.
  • It should be understood that the exchange operation can serve as a transmission medium: multiple computing modules can transmit data to one another through this medium, completing the data exchange among the multiple computing modules and thereby achieving data synchronization between them.
  • Figure 6 shows a schematic diagram of yet another data flow graph provided by an embodiment of the present application. As shown in Figure 6, assume that there are four computing modules, denoted computing module 1, computing module 2, computing module 3, and computing module 4, and that the data flow graph includes four parts A, B, C, and D. The data of the four sub-data flow graphs A1, A2, A3, and A4 included in the part-A data flow graph is computed concurrently by the four computing modules, and the same holds for the sub-data flow graphs B1, B2, B3, and B4 of part B, C1, C2, C3, and C4 of part C, and D1, D2, D3, and D4 of part D. When computing module 1, computing module 2, computing module 3, and computing module 4 have respectively completed the computation of the data of the four sub-data flow graphs B1, B2, B3, and B4, they can use exchange operation E to exchange the data of B1, B2, B3, and B4; when they have respectively completed the computation of the data of the four sub-data flow graphs D1, D2, D3, and D4, they can use exchange operation E to exchange the data of D1, D2, D3, and D4. For example, computing module 1 and computing module 2 can use exchange operation E to exchange the data of B1 and the data of B2, so that computing module 1 holds the data of B1 and the data of B2, and computing module 2 also holds the data of B1 and the data of B2.
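One natural realization of exchange operation E is an all-gather-style collective: each module contributes the shard it computed (for example, B1–B4), and afterwards every module holds all shards. A toy sketch, assuming plain Python lists stand in for the modules' memories rather than a real interconnect collective:

```python
def exchange_e(per_module_shards):
    # All-gather reading of exchange operation E: after the exchange,
    # every computing module holds every module's shard.
    gathered = list(per_module_shards)
    return [list(gathered) for _ in per_module_shards]

b_shards = ["B1-data", "B2-data", "B3-data", "B4-data"]  # one shard per module
after = exchange_e(b_shards)
assert after[0] == after[3] == b_shards   # modules 1 and 4 are now synchronized
```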
  • Figure 7 shows a schematic diagram of yet another data flow graph provided by an embodiment of the present application. As shown in Figure 7, assume that there are four computing modules, denoted computing module 1, computing module 2, computing module 3, and computing module 4, and that the data flow graph includes four parts A, B, C, and D. The processing module 210 can divide the data flow graph into the four sub-data flow graphs A, B, C, and D, and the data of these four sub-data flow graphs is computed independently by computing module 1, computing module 2, computing module 3, and computing module 4. When the four computing modules have respectively completed the computation of the data of the four sub-data flow graphs A, B, C, and D, they can use exchange operation E to exchange the data of A, B, C, and D. For example, computing module 1 and computing module 2 can use exchange operation E to exchange the data of A and the data of B, so that computing module 1 holds the data of A and the data of B, and computing module 2 also holds the data of A and the data of B.
  • Based on the above technical solution, the N computing modules can use exchange operations to exchange the data of their respective corresponding sub-data flow graphs, which ensures the accuracy of the N computing modules' computations on the data of the sub-data flow graphs.
  • It should be understood that when the service quality indicator is delay, the N computing modules 220 may also compute the data of their respective corresponding sub-data flow graphs independently. For example, when the bandwidth between the N computing modules 220 is low and the data of the M sub-data flow graphs requires many exchange operations, the N computing modules 220 may compute the data of their respective corresponding sub-data flow graphs independently. This embodiment of the present application does not limit whether, when the service quality indicator is delay, the N computing modules 220 compute the data of their respective corresponding sub-data flow graphs independently or concurrently.
  • It should also be understood that when the service quality indicator is throughput, the N computing modules 220 may also compute the data of their respective corresponding sub-data flow graphs concurrently. For example, when the bandwidth between the N computing modules 220 is high and the data of the M sub-data flow graphs requires few exchange operations, the N computing modules 220 may compute the data of their respective corresponding sub-data flow graphs concurrently. This embodiment of the present application does not limit whether, when the service quality indicator is throughput, the N computing modules 220 compute the data of their respective corresponding sub-data flow graphs independently or concurrently.
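These two paragraphs amount to a bandwidth-aware override of the QoS default. A sketch of such a heuristic follows; the thresholds and units are invented for illustration and would be tuned per cluster:

```python
def choose_mode(qos, bandwidth_gbps, exchange_ops,
                bw_threshold=25.0, exchange_threshold=10):
    # Start from the QoS default: delay -> concurrent, throughput -> independent.
    mode = "concurrent" if qos == "delay" else "independent"
    # Low bandwidth plus many exchanges makes concurrency expensive...
    if mode == "concurrent" and bandwidth_gbps < bw_threshold \
            and exchange_ops > exchange_threshold:
        mode = "independent"
    # ...while high bandwidth plus few exchanges makes it cheap.
    elif mode == "independent" and bandwidth_gbps >= bw_threshold \
            and exchange_ops <= exchange_threshold:
        mode = "concurrent"
    return mode

print(choose_mode("delay", 10.0, 32))       # independent: links too slow
print(choose_mode("throughput", 100.0, 2))  # concurrent: exchanges are cheap
```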
  • Optionally, the N computing modules include at least two devices, the at least two devices form at least one device group through an interconnection device, and the computing power of one device group is greater than or equal to the computing power of one device.
  • For example, assume that the N computing modules include three devices, of which two are PG devices and one is an AG device. The two PG devices are formed into one device group through an interconnection device, so that the computing power of this device group is greater than or equal to the computing power of one PG device, and also greater than or equal to the computing power of one AG device.
  • Based on the above technical solution, when the N computing modules include at least two devices, the at least two devices form at least one device group through the interconnection device, so that the computing power of one device group is greater than or equal to the computing power of one device, which can improve the utilization of individual devices.
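A small sketch of the device-group idea: interconnect two devices and treat the group as one computing module whose computing power at least matches any single device. The additive computing-power model and the TFLOPS figures are assumptions for illustration only:

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    tflops: float          # illustrative computing-power figure

def form_group(devices):
    # The group's computing power is modeled as the sum of its members',
    # which trivially satisfies "group >= any single member device".
    total = sum(d.tflops for d in devices)
    return Device(name="+".join(d.name for d in devices), tflops=total)

pg1, pg2, ag = Device("PG-1", 8.0), Device("PG-2", 8.0), Device("AG-1", 12.0)
group = form_group([pg1, pg2])
print(group)                          # Device(name='PG-1+PG-2', tflops=16.0)
print(group.tflops >= pg1.tflops)     # True
print(group.tflops >= ag.tflops)      # True: the group also covers the AG device
```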
  • The deep learning method in the embodiments of the present application is described below with reference to Figure 8. The deep learning method 800 shown in Figure 8 can be executed by the deep learning system shown in Figure 1 or Figure 2; for details, refer to the foregoing description of the deep learning system. Repeated descriptions are appropriately omitted when introducing the deep learning method of the embodiments of the present application.
  • The method 800 shown in Figure 8 includes step 810 and step 820, which are described below.
  • 810: Divide a data flow graph into M sub-data flow graphs, and allocate the M sub-data flow graphs to N computing modules according to the mapping relationship between the parameters of the M sub-data flow graphs and the parameters of the N computing modules, where M and N are positive integers.
  • 820: Compute the data of the respective corresponding sub-data flow graphs.
  • Optionally, in an implementation, the parameters of the M sub-data flow graphs include at least one of the following: the data priorities of the M sub-data flow graphs and the data volumes of the M sub-data flow graphs.
  • Optionally, in an implementation, the parameters of the N computing modules include at least one of the following: the bandwidth between the N computing modules, the computing power of the N computing modules, and the storage capacity of the N computing modules.
  • Optionally, in an implementation, dividing the data flow graph into M sub-data flow graphs includes: dividing the data flow graph into M sub-data flow graphs according to the service quality indicator.
  • Optionally, in an implementation, when the service quality indicator is delay, the data of the respective corresponding sub-data flow graphs is computed concurrently by the N computing modules.
  • Optionally, in an implementation, when the service quality indicator is throughput, the data of the respective corresponding sub-data flow graphs is computed independently by the N computing modules.
  • Optionally, in an implementation, the method further includes: using an exchange operation to exchange the data of the sub-data flow graphs respectively corresponding to the N computing modules.
  • Based on the above technical solution, the processing module can allocate the M sub-data flow graphs to the N computing modules according to the mapping relationship between the parameters of the M sub-data flow graphs and the parameters of the N computing modules, which realizes adaptive allocation of computing tasks, improves the utilization of computing module resources, improves the efficiency of the computing modules in processing computing tasks, and reduces the application difficulty of deep learning.
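Putting steps 810 and 820 together, here is a compact end-to-end sketch of method 800. The splitting rule, the round-robin allocation, and the compute stub are all illustrative assumptions standing in for the parameter-based mapping described above:

```python
def method_800(dataflow_graph, modules, qos):
    # Step 810: divide the data flow graph into sub-data flow graphs and
    # allocate them to the computing modules (round-robin stands in for the
    # parameter-based mapping).
    if qos == "delay":
        n = len(modules)
        subgraphs = {f"{part}{i + 1}": batch[i::n]          # strided shards
                     for part, batch in dataflow_graph.items()
                     for i in range(n)}
    else:  # "throughput": one whole part per sub-graph
        subgraphs = dict(dataflow_graph)
    allocation = {name: modules[i % len(modules)]
                  for i, name in enumerate(subgraphs)}
    # Step 820: each module computes the data of its corresponding sub-graphs.
    results = {name: f"{allocation[name]} computed {len(data)} items"
               for name, data in subgraphs.items()}
    return allocation, results

graph = {part: list(range(8)) for part in "ABCD"}
alloc, res = method_800(graph, ["m1", "m2", "m3", "m4"], qos="delay")
print(alloc["A1"], "->", res["A1"])   # m1 -> m1 computed 2 items
```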
  • Figure 9 is a schematic diagram of the hardware structure of the deep learning system provided by the embodiment of the present application.
  • The deep learning system 900 shown in Figure 9 (the deep learning system 900 may specifically be a computer device) includes a memory 910, a processor 920, a communication interface 930, and a bus 940. The memory 910, the processor 920, and the communication interface 930 are communicatively connected to one another through the bus 940.
  • The memory 910 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 910 may store a program; when the program stored in the memory 910 is executed by the processor 920, the processor 920 is configured to perform the steps of the deep learning method in the embodiments of the present application. Specifically, the processor 920 may perform the method 800 above.
  • The processor 920 may include the processing module and the computing modules of Figure 1 or Figure 2.
  • The processor 920 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, configured to execute related programs to implement the deep learning method of the method embodiments of the present application.
  • The processor 920 may also be an integrated circuit chip with signal processing capability.
  • The above processor 920 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
  • The steps of the methods disclosed in the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. A software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 910. The processor 920 reads the information in the memory 910 and, in combination with its hardware, completes the functions to be performed by the modules included in the device shown in any one of Figures 3 to 7, or executes the deep learning method of the method embodiments of the present application.
  • The communication interface 930 uses a transceiver apparatus, such as but not limited to a transceiver, to implement communication between the device 900 and other devices or communication networks. For example, the data flow graph may be obtained through the communication interface 930.
  • The bus 940 may include a path that transfers information between the various components of the device 900 (e.g., the memory 910, the processor 920, and the communication interface 930).
  • It should be noted that although the device 900 described above shows only a memory, a processor, and a communication interface, in a specific implementation process, those skilled in the art should understand that the device 900 may also include other components necessary for normal operation. Meanwhile, according to specific needs, those skilled in the art should understand that the device 900 may also include hardware components that implement other additional functions. In addition, those skilled in the art should understand that the device 900 may also include only the components necessary to implement the embodiments of the present application, and does not necessarily include all the components shown in Figure 9.
  • Embodiments of the present application further provide a computer-readable storage medium that stores program code for execution by a device, where the program code includes instructions for executing the deep learning method in the embodiments of the present application.
  • Embodiments of the present application further provide a computer program product containing instructions; when the computer program product runs on a computer, the computer is caused to execute the deep learning method in the embodiments of the present application.
  • Embodiments of the present application further provide a chip. The chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to execute the deep learning method in the embodiments of the present application.
  • Optionally, as an implementation, the chip may further include a memory in which instructions are stored, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to execute the deep learning method in the embodiments of the present application.
  • Embodiments of the present application further provide a system-on-a-chip (SoC), and the SoC includes the deep learning system in the embodiments of the present application.
  • It should be understood that the processor in the embodiments of the present application may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
  • It should also be understood that the memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memories. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of random access memory (RAM) are available, for example, a static random access memory (static RAM, SRAM), a dynamic random access memory (dynamic RAM, DRAM), a synchronous dynamic random access memory (synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), an enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), a synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and a direct rambus random access memory (direct rambus RAM, DR RAM).
  • All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used for implementation, the foregoing embodiments may be implemented completely or partially in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive.
  • It should also be understood that "at least one" means one or more, and "a plurality of" means two or more. "At least one of the following" or similar expressions refer to any combination of the listed items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be singular or plural.
  • It should also be understood that, in the various embodiments of the present application, the sequence numbers of the foregoing processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation processes of the embodiments of the present application.
  • The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • If the functions are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A deep learning system and method, relating to the field of artificial intelligence. The deep learning method includes: dividing a data flow graph into M sub-data flow graphs, and allocating the M sub-data flow graphs to N computing modules according to a mapping relationship between parameters of the M sub-data flow graphs and parameters of the N computing modules, where M and N are positive integers; and computing the data of the respective corresponding sub-data flow graphs. The solutions of the embodiments of the present application can realize adaptive allocation of computing tasks, improve the utilization of computing module resources, improve the efficiency of the computing modules in processing computing tasks, and reduce the application difficulty of deep learning.


Claims (16)

  1. A deep learning system, characterized by comprising: a processing module and N computing modules, wherein
     the processing module is configured to divide a data flow graph into M sub-data flow graphs, and allocate the M sub-data flow graphs to the N computing modules according to a mapping relationship between parameters of the M sub-data flow graphs and parameters of the N computing modules, where M and N are positive integers; and
     the N computing modules are configured to compute data of their respective corresponding sub-data flow graphs.
  2. The system according to claim 1, characterized in that the parameters of the M sub-data flow graphs comprise at least one of the following: data priorities of the M sub-data flow graphs, and data volumes of the M sub-data flow graphs.
  3. The system according to claim 1 or 2, characterized in that the parameters of the N computing modules comprise at least one of the following:
     the bandwidth between the N computing modules, the computing power of the N computing modules, and the storage capacity of the N computing modules.
  4. The system according to any one of claims 1 to 3, characterized in that
     the processing module is configured to divide the data flow graph into the M sub-data flow graphs according to a service quality indicator.
  5. The system according to claim 4, characterized in that
     when the service quality indicator is delay, the data of the sub-data flow graphs respectively corresponding to the N computing modules is computed concurrently by the N computing modules.
  6. The system according to claim 4, characterized in that
     when the service quality indicator is throughput, the data of the sub-data flow graphs respectively corresponding to the N computing modules is computed independently by the N computing modules.
  7. The system according to any one of claims 1 to 6, characterized in that
     the N computing modules are further configured to exchange data of their respective corresponding sub-data flow graphs using an exchange operation.
  8. A deep learning method, characterized by comprising:
     dividing a data flow graph into M sub-data flow graphs, and allocating the M sub-data flow graphs to N computing modules according to a mapping relationship between parameters of the M sub-data flow graphs and parameters of the N computing modules, where M and N are positive integers; and
     computing data of the respective corresponding sub-data flow graphs.
  9. The method according to claim 8, characterized in that the parameters of the M sub-data flow graphs comprise at least one of the following: data priorities of the M sub-data flow graphs, and data volumes of the M sub-data flow graphs.
  10. The method according to claim 8 or 9, characterized in that the parameters of the N computing modules comprise at least one of the following:
     the bandwidth between the N computing modules, the computing power of the N computing modules, and the storage capacity of the N computing modules.
  11. The method according to any one of claims 8 to 10, characterized in that the dividing a data flow graph into M sub-data flow graphs comprises:
     dividing the data flow graph into the M sub-data flow graphs according to a service quality indicator.
  12. The method according to claim 11, characterized in that
     when the service quality indicator is delay, the data of the respective corresponding sub-data flow graphs is computed concurrently by the N computing modules.
  13. The method according to claim 11, characterized in that
     when the service quality indicator is throughput, the data of the respective corresponding sub-data flow graphs is computed independently by the N computing modules.
  14. The method according to any one of claims 8 to 13, characterized in that the method further comprises:
     using an exchange operation to exchange the data of the sub-data flow graphs respectively corresponding to the N computing modules.
  15. A chip, characterized in that the chip comprises a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to perform the method according to any one of claims 8 to 14.
  16. A system-on-a-chip (SoC), characterized by comprising the deep learning system according to any one of claims 1 to 7.
PCT/CN2023/105715 2022-07-28 2023-07-04 一种深度学习系统和方法 WO2024022046A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210894943.8A CN117521841A (zh) 2022-07-28 2022-07-28 一种深度学习系统和方法
CN202210894943.8 2022-07-28

Publications (1)

Publication Number Publication Date
WO2024022046A1 true WO2024022046A1 (zh) 2024-02-01

Family

ID=89705290

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/105715 WO2024022046A1 (zh) 2022-07-28 2023-07-04 一种深度学习系统和方法

Country Status (2)

Country Link
CN (1) CN117521841A (zh)
WO (1) WO2024022046A1 (zh)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190325309A1 (en) * 2017-08-19 2019-10-24 Wave Computing, Inc. Neural network output layer for machine learning
CN110399222A (zh) * 2019-07-25 2019-11-01 北京邮电大学 Gpu集群深度学习任务并行化方法、装置及电子设备
CN110515739A (zh) * 2019-10-23 2019-11-29 上海燧原智能科技有限公司 深度学习神经网络模型负载计算方法、装置、设备及介质
CN111860820A (zh) * 2020-07-31 2020-10-30 北京灵汐科技有限公司 神经网络算子的划分方法、装置及划分设备
CN112650590A (zh) * 2020-12-29 2021-04-13 北京奇艺世纪科技有限公司 任务的处理方法、装置及系统、分配方法和装置
CN113051080A (zh) * 2021-04-22 2021-06-29 杭州海康威视数字技术股份有限公司 一种计算图执行方法、装置及异构平台
CN114418127A (zh) * 2022-03-23 2022-04-29 阿里云计算有限公司 机器学习计算优化方法和平台


Also Published As

Publication number Publication date
CN117521841A (zh) 2024-02-06

Similar Documents

Publication Publication Date Title
US10452995B2 (en) Machine learning classification on hardware accelerators with stacked memory
US10540588B2 (en) Deep neural network processing on hardware accelerators with stacked memory
WO2021136137A1 (zh) 一种资源调度方法、装置及相关设备
CN102971724B (zh) 与数据中心环境内的基于单元式虚拟资源的管理有关的方法和装置
US20160379686A1 (en) Server systems with hardware accelerators including stacked memory
CN105900063A (zh) 多处理环境中的调度方法和装置
US20150067695A1 (en) Information processing system and graph processing method
CN107122244A (zh) 一种基于多gpu的图数据处理系统及方法
US20230038051A1 (en) Data transmission method and apparatus
US20220121912A1 (en) Data processing method and apparatus
CN116070682B (zh) 神经元计算机操作系统的snn模型动态映射方法及装置
Chen et al. Towards efficient allocation of graph convolutional networks on hybrid computation-in-memory architecture
US20220300323A1 (en) Job Scheduling Method and Job Scheduling Apparatus
KR102238600B1 (ko) 스케쥴러 컴퓨팅 장치, 그것을 포함하는 분산 컴퓨팅 시스템의 데이터 노드 및 그것의 방법
WO2020024207A1 (zh) 处理业务请求的方法、装置与存储系统
WO2020124488A1 (zh) 应用进程映射方法、电子装置及计算机可读存储介质
WO2024022046A1 (zh) 一种深度学习系统和方法
Sontakke et al. Optimization of hadoop mapreduce model in cloud computing environment
CN117130723A (zh) 分配信息的确定方法、装置、计算机设备和存储介质
KR101620896B1 (ko) 이기종 프로세싱 타입을 고려한 맵리듀스 프로그램 모델의 수행 성능 향상 방법, 수행 성능 향상 장치 및 수행 성능 향상 시스템
US20120066310A1 (en) Combining multiple hardware networks to achieve low-latency high-bandwidth point-to-point communication of complex types
WO2022063273A1 (zh) 一种基于numa属性的资源分配方法及装置
CN113556242A (zh) 一种基于多处理节点来进行节点间通信的方法和设备
US20230376562A1 (en) Integrated circuit apparatus for matrix multiplication operation, computing device, system, and method
WO2020063940A1 (zh) 计算装置及相关产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23845259

Country of ref document: EP

Kind code of ref document: A1