WO2020172825A1 - Method and apparatus for determining transmission policy - Google Patents

Method and apparatus for determining transmission policy

Info

Publication number
WO2020172825A1
WO2020172825A1 (PCT/CN2019/076359)
Authority
WO
WIPO (PCT)
Prior art keywords
transmission strategy
neural network
network model
transmission
information
Prior art date
Application number
PCT/CN2019/076359
Other languages
French (fr)
Chinese (zh)
Inventor
Fan Li (范礼)
Wang Haibin (王海彬)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2019/076359 priority Critical patent/WO2020172825A1/en
Priority to CN201980091568.XA priority patent/CN113412494B/en
Publication of WO2020172825A1 publication Critical patent/WO2020172825A1/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/08 — Learning methods

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method and device for determining a transmission strategy.
  • Artificial intelligence comprises the theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
  • Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theories.
  • Model parallelism divides the neural network model into multiple parts, each of which is assigned to a training node for training; however, this incurs heavy communication between training nodes, and cutting the model into parts is itself difficult. Data-parallel training instead divides the training data into multiple training data sets that are handed to multiple training nodes, without cutting the model. Data-parallel training is therefore an effective strategy for distributed training on large-scale training data.
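The data-parallel flow just described can be sketched in a few lines, assuming a toy model whose parameters and gradients are plain Python lists (the function names are illustrative, not from the application):

```python
# Minimal data-parallel sketch: each node computes gradients on its own shard,
# the gradients are averaged, and every replica applies the same update.

def average_gradients(per_node_grads):
    """Average the gradients computed by each training node on its own shard."""
    num_nodes = len(per_node_grads)
    num_params = len(per_node_grads[0])
    return [
        sum(node[p] for node in per_node_grads) / num_nodes
        for p in range(num_params)
    ]

def apply_update(params, avg_grads, lr=0.1):
    """Every node applies the same averaged gradient, keeping replicas in sync."""
    return [w - lr * g for w, g in zip(params, avg_grads)]

# Three nodes, each holding a replica of two parameters.
params = [1.0, 2.0]
grads = [[0.3, 0.6], [0.1, 0.2], [0.2, 0.4]]   # one gradient list per node
avg = average_gradients(grads)                  # ≈ [0.2, 0.4]
params = apply_update(params, avg)              # ≈ [0.98, 1.96]
```

Because every node receives the same averaged gradient, all replicas stay identical across iterations, which is the property the transmission strategies below must preserve.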
  • The embodiments of the present application provide a method and device for determining a transmission strategy; transmitting gradients according to the determined transmission strategy can effectively improve the efficiency of distributed training.
  • an embodiment of the present application provides a method for determining a transmission strategy, the method may be executed by a computing node, and the method includes:
  • generating an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model; determining the communication tail duration corresponding to the i-th transmission strategy; and generating the (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy.
  • After the computing node generates the i-th transmission strategy, it can obtain the communication tail duration corresponding to that strategy, so that a round of reinforcement learning can be completed based on it. The generated (i+1)-th transmission strategy thus tends toward the optimal transmission strategy (that is, the strategy that minimizes the communication tail duration), which helps to improve the efficiency of distributed training.
  • The method further includes: sending the i-th transmission strategy to W training nodes, where the W training nodes are used for distributed training of the first neural network model.
  • Determining the communication tail duration corresponding to the i-th transmission strategy includes: receiving, from the W training nodes, W i-th communication tail durations measured when the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model are transmitted using the i-th transmission strategy; and determining, from the W i-th communication tail durations, the communication tail duration corresponding to the i-th transmission strategy, where W is an integer greater than or equal to 1.
  • generating the i-th transmission strategy includes: generating the i-th transmission strategy through a second neural network model.
  • Generating the (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy includes: updating the parameters of the second neural network model according to the communication tail duration corresponding to the i-th transmission strategy, and generating the (i+1)-th transmission strategy through the updated second neural network model.
  • The communication tail duration corresponding to the i-th transmission strategy is used as the reward of reinforcement learning to update the parameters of the second neural network model. Since the second neural network model has a strong learning capability, continuously executing this process allows the communication tail duration to converge to its optimal value.
  • Generating the sub-transmission strategy of the (n+1)-th layer through the second neural network model specifically comprises: taking the second information of the sub-transmission strategy of the n-th layer as the input of the second neural network model to generate the first information of the sub-transmission strategy of the (n+1)-th layer, and taking the first information of the sub-transmission strategy of the (n+1)-th layer as the input of the second neural network model to generate the second information of the sub-transmission strategy of the (n+1)-th layer.
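As a rough illustration of this alternating generation, and only that, the sketch below emits, layer by layer, first the transmit/hold decision (first information) and then the logical topology (second information), each step conditioned on the previous output. The scoring function is a deterministic stand-in, not the second neural network model:

```python
# Toy sequential strategy generator: each step consumes the previous output,
# mimicking the alternation "second info of layer n -> first info of layer n+1
# -> second info of layer n+1". policy_step is a hypothetical stand-in.

TOPOLOGIES = ["ring", "tree"]

def policy_step(prev_output, step_index):
    """Stand-in for one forward step of the strategy-generating model."""
    return (len(str(prev_output)) + step_index) % 2   # deterministic 0 or 1

def generate_strategy(num_layers):
    strategy = []
    second_info = "start"                             # seed input for layer 1
    for n in range(1, num_layers + 1):
        transmit = bool(policy_step(second_info, 2 * n))            # first information
        second_info = TOPOLOGIES[policy_step(transmit, 2 * n + 1)]  # second information
        strategy.append({"layer": n, "transmit": transmit, "topology": second_info})
    return strategy

for sub in generate_strategy(3):
    print(sub)
```

In the patent's scheme the stand-in would be a learned model whose parameters are updated from the communication tail duration; the chaining of inputs and outputs is the part illustrated here.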
  • Generating the i-th transmission strategy includes: generating the i-th transmission strategy through the Q table used to record state–action values in the Q-learning algorithm. The Q table includes P states and Q actions: the P states correspond to P data volume thresholds, and the Q actions correspond to Q combinations of a state transition amount and a logical topology, where P and Q are integers greater than or equal to 1.
  • Generating the (i+1)-th transmission strategy includes: updating the Q table according to the communication tail duration corresponding to the i-th transmission strategy, and generating the (i+1)-th transmission strategy through the updated Q table.
  • The data volume threshold serves as the state dimension of the Q table, and the state transition amount and logical topology serve as the action dimension of the Q table.
  • Further, the Q table in Q-learning is updated by using the communication tail duration corresponding to the i-th transmission strategy as the reward for reinforcement learning. Since the Q-learning algorithm can take actions according to the current state and, after obtaining the corresponding reward, improve those actions, it is able to take better actions over time, that is, to obtain a better transmission strategy.
  • The i-th transmission strategy includes third information and fourth information. The third information indicates the i-th data volume threshold used when transmitting the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model; the i-th data volume threshold determines the transmission timing of those gradients. The fourth information indicates the logical topology used for each transmission.
  • Generating the i-th transmission strategy through the Q table includes: according to the Q table, obtaining the reward values for executing the Q actions in the state corresponding to the (i-1)-th data volume threshold, determining the i-th target action according to those reward values, and generating the i-th transmission strategy. The i-th data volume threshold is the sum of the state transition amount in the combination corresponding to the i-th target action and the (i-1)-th data volume threshold; the logical topology used for each transmission is the logical topology in the combination corresponding to the i-th target action.
  • The maximum data volume threshold among the P data volume thresholds is determined according to the parameter size of the first neural network model, and/or the minimum data volume threshold among the P data volume thresholds is determined according to a preset transmission efficiency.
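A minimal sketch of this Q-learning variant, under assumed numbers: four data-volume thresholds as the P states, six (threshold-shift, topology) combinations as the Q actions, and a hypothetical `measure_tail_time` function standing in for timing one real distributed iteration, with the negative tail duration as the reward:

```python
import random

# Illustrative Q-learning for the scheme above: states are data-volume
# thresholds, actions are (state transition, logical topology) combinations.
# measure_tail_time is a made-up stand-in, not a measurement from the patent.

THRESHOLDS = [1, 2, 4, 8]                                         # P = 4 states (MB)
ACTIONS = [(d, t) for d in (-1, 0, 1) for t in ("ring", "tree")]  # Q = 6 actions

def measure_tail_time(threshold_mb, topology):
    """Stand-in: tail time is shortest at a 4 MB threshold on a ring."""
    return {"ring": 0.0, "tree": 1.0}[topology] + abs(threshold_mb - 4)

def learn(steps=5000, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in THRESHOLDS]
    state = 0
    for _ in range(steps):
        a = rng.randrange(len(ACTIONS))                     # explore every step
        delta, topo = ACTIONS[a]
        nxt = min(max(state + delta, 0), len(THRESHOLDS) - 1)
        reward = -measure_tail_time(THRESHOLDS[nxt], topo)  # shorter tail = bigger reward
        q[state][a] += alpha * (reward + gamma * max(q[nxt]) - q[state][a])
        state = nxt
    return q

q = learn()
best = max(range(len(ACTIONS)), key=lambda a: q[2][a])
print("greedy action at the 4 MB state:", ACTIONS[best])    # expect (0, 'ring')
```

Because the toy environment is deterministic, the Q values settle to a fixed point and the greedy policy keeps the threshold at its best value; in the patent the reward would come from the measured communication tail duration instead.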
  • an embodiment of the present application provides a method for determining a transmission strategy.
  • the method may be executed by a computing node.
  • the method includes:
  • generating an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model; determining the communication tail duration corresponding to the i-th transmission strategy, where the communication tail duration indicates the time between the end time of the i-th iteration of the first neural network model and the start time of the (i+1)-th iteration; and, if it is determined that the communication tail duration corresponding to the i-th transmission strategy is greater than a first threshold, generating the (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy.
  • the method also includes:
  • if it is determined that the communication tail duration corresponding to the i-th transmission strategy is less than or equal to the first threshold, using the i-th transmission strategy as the (i+1)-th transmission strategy.
  • Through multiple rounds of reinforcement learning by the computing node, once the generated i-th transmission strategy makes distributed training sufficiently efficient (that is, the i-th transmission strategy is a good transmission strategy), no new transmission strategy needs to be generated. Accordingly, the training nodes can use the same transmission strategy (that is, the i-th transmission strategy) to transmit gradients in subsequent iterations of the first neural network model. In this way, the processing burden of the computing node is effectively reduced, and the training nodes can transmit gradients based on the same strategy without having to receive a newly generated one from the computing node, which effectively improves the efficiency of distributed training.
  • an embodiment of the present application provides a method for transmitting gradients.
  • the method may be executed by a training node.
  • the method includes:
  • obtaining an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model.
  • The i-th transmission strategy includes a sub-transmission strategy for each layer of the first neural network model. The sub-transmission strategy of the n-th layer includes first information and second information: the first information indicates whether to initiate a transmission after the gradient of the n-th layer's parameters is calculated, and the second information indicates the logical topology used for the transmission.
  • Using the i-th transmission strategy to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model includes: in the i-th iteration, after calculating the gradient of each layer's parameters, transmitting that gradient according to the layer's sub-transmission strategy.
  • If the first information of the n-th layer's sub-transmission strategy indicates that a transmission is initiated after the gradient of the n-th layer's parameters is calculated, the gradients to be transmitted are sent over the logical topology indicated by the second information of the n-th layer's sub-transmission strategy.
  • The i-th transmission strategy includes third information and fourth information. The third information indicates the i-th data volume threshold used when transmitting the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model; the i-th data volume threshold determines the transmission timing of those gradients. The fourth information indicates the logical topology used for each transmission.
  • Using the i-th transmission strategy to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model includes: in the i-th iteration, after calculating the gradient of each layer's parameters, if the data volume of the gradients to be transmitted is greater than or equal to the i-th data volume threshold, transmitting the gradients to be transmitted over the logical topology indicated by the fourth information.
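The threshold rule can be illustrated as follows; the layer sizes and threshold are made-up numbers, and the function only decides when a transfer would start rather than performing one:

```python
# Sketch of the data-volume-threshold rule: gradients accumulate in a buffer
# as back-propagation produces them (layer N down to layer 1), and a transfer
# is launched once the buffered volume reaches the threshold.

def transmit_with_threshold(layer_grad_sizes_mb, threshold_mb):
    """Return the list of transfers, each a (layers, total_mb) batch."""
    transfers, buffer, buffered_mb = [], [], 0.0
    for layer, size_mb in layer_grad_sizes_mb:     # order: layer N ... layer 1
        buffer.append(layer)
        buffered_mb += size_mb
        if buffered_mb >= threshold_mb:            # enough data: start a transfer
            transfers.append((tuple(buffer), buffered_mb))
            buffer, buffered_mb = [], 0.0
    if buffer:                                     # flush the remainder at the end
        transfers.append((tuple(buffer), buffered_mb))
    return transfers

# Five layers; gradients are computed from layer 5 back to layer 1.
sizes = [(5, 1.0), (4, 0.5), (3, 2.0), (2, 0.5), (1, 1.0)]
print(transmit_with_threshold(sizes, threshold_mb=2.0))
# → [((5, 4, 3), 3.5), ((2, 1), 1.5)]
```

A larger threshold batches more layers per transfer (fewer, bigger transfers); a smaller one overlaps communication with computation more aggressively, which is the trade-off the learned strategy navigates.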
  • obtaining the i-th transmission strategy includes: receiving the i-th transmission strategy sent by the computing node;
  • After using the i-th transmission strategy to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model, the method further includes: sending, to the computing node, the i-th communication tail duration of transmitting, using the i-th transmission strategy, the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model.
  • Embodiments of the present application provide a device, which may be a computing node or a training node, a computer device in which the computing node or training node is located, or a semiconductor chip disposed in the computer device.
  • the device has the function of realizing various possible designs in the first to third aspects. This function can be realized by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more units or modules corresponding to the above-mentioned functions.
  • An embodiment of the present application provides a device that includes a processor, a memory, and instructions stored in the memory and executable on the processor; when the instructions are executed, the device performs the method of the first aspect.
  • An embodiment of the present application provides a computer-readable storage medium, including instructions which, when run on a computer, cause the computer to execute the method described in any one of the possible designs of the first to third aspects.
  • the embodiments of the present application provide a computer program product, which when running on a computer, causes the computer to execute the method described in any one of the possible designs of the first aspect to the third aspect.
  • Figure 1 is a schematic diagram of an artificial intelligence main framework provided by an embodiment of the application.
  • Figure 2b is a schematic diagram of a centralized distributed training system provided by an embodiment of the application.
  • Figure 2c is a schematic diagram of a decentralized distributed training system provided by an embodiment of the application.
  • Figure 2d is a schematic diagram of transmission in the decentralized distributed training system provided by an embodiment of the application.
  • FIG. 2e is a possible schematic diagram of asynchronous parallel computing and communication provided by an embodiment of this application.
  • FIG. 2f is a schematic diagram of the relationship between the amount of transmitted data and the communication tail duration provided by an embodiment of the application.
  • Figure 3 is a schematic diagram of an architecture applicable to the implementation of this application.
  • FIG. 4 is a schematic flowchart corresponding to a method for determining a transmission strategy provided by an embodiment of the application.
  • Figure 5a is a schematic diagram of a second neural network model generating a transmission strategy.
  • FIG. 5b is an overall schematic diagram of Implementation Mode 1 provided by an embodiment of the application.
  • Figure 6a is a schematic diagram of determining the data volume threshold.
  • FIG. 6b is an overall schematic diagram of Implementation Mode 2 provided by an embodiment of this application.
  • FIG. 7 is a possible exemplary block diagram of a device for determining a transmission strategy involved in an embodiment of this application.
  • FIG. 8 is a schematic diagram of an apparatus for determining a transmission strategy provided by an embodiment of the application.
  • An artificial neural network (ANN), often simply called a neural network (NN), is, in the fields of machine learning and cognitive science, a mathematical or computational model that imitates the structure and function of a biological neural network (an animal's central nervous system, particularly the brain) and is used to estimate or approximate functions.
  • A neural network computes by connecting a large number of artificial neurons. In most cases, an artificial neural network can change its internal structure on the basis of external information: it is an adaptive system and, in general, has a learning capability.
  • A loss function, in statistics, is a function that measures the degree of loss or error. In neural networks, it can be understood as a function measuring the difference between the value predicted by the model and the label value of the training data; the neural network model can be trained with the goal of minimizing the loss function.
  • Gradient descent is a first-order optimization algorithm, usually called the steepest descent method. To find a local minimum of a function with gradient descent, one iteratively steps a specified distance from the current point in the direction opposite to the gradient (or an approximate gradient) at that point.
  • A gradient is a vector: the direction in which the directional derivative of a function at a given point attains its maximum value, as used in the gradient descent method.
  • In gradient descent, each parameter can be updated based on its gradient, thereby gradually approaching the minimum value of the neural network's loss function.
  • The backpropagation (BP) algorithm, short for "error backpropagation algorithm", is a common method for training artificial neural networks in combination with an optimization method (such as gradient descent). It computes the gradient of the loss function with respect to all the weights in the neural network and feeds the gradients to the optimization method, which updates the weights to minimize the loss function.
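A worked one-dimensional example of the gradient-descent rule just described, minimizing f(w) = (w - 3)^2 so the result can be checked by hand:

```python
# Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3.

def grad(w):
    return 2 * (w - 3)         # df/dw

w, lr = 0.0, 0.1
for _ in range(100):
    w -= lr * grad(w)          # step against the gradient direction
print(round(w, 4))             # → 3.0 (the minimum of f)
```

Each step multiplies the distance to the minimum by (1 - 2·lr) = 0.8, so after 100 steps the error is about 0.8^100 and w is numerically at the minimum.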
  • A training node can also be called a worker or a working node.
  • the training node can be a GPU or a central processing unit (CPU), which is not specifically limited.
  • A computing node can likewise be a GPU or a CPU; this is not limited.
  • A GPU, also known as a display core, vision processor, display chip, or graphics chip, is a microprocessor specialized for graphics operations that runs on personal computers, workstations, game consoles, and some mobile devices (such as tablet computers and smartphones).
  • Figure 1 shows a schematic diagram of an artificial intelligence main framework, which describes the overall workflow of the artificial intelligence system and is suitable for general artificial intelligence field requirements.
  • The "intelligent information chain" reflects a series of processes from data acquisition to processing; for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data goes through a condensing process from data to information to knowledge to wisdom. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying intelligent infrastructure and information (technologies for providing and processing information) to the industrial ecology of the system.
  • The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through the basic platform. It communicates with the outside through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); and the basic platform includes distributed computing frameworks, networks, and related platform guarantees and support, which may include cloud storage and computing, interconnection networks, and so on. For example, sensors communicate with the outside to obtain data, and these data are provided to the smart chips of the basic platform for computation.
  • the data in the upper layer of the infrastructure is used to indicate the data source in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training (such as deep learning, reinforcement learning), search, reasoning, and decision-making.
  • deep learning and reinforcement learning are important parts of artificial intelligence, and both deep learning and reinforcement learning belong to machine learning.
  • deep learning refers to the use of existing data to train algorithms to find patterns that solve the corresponding problems, and then use this pattern to predict new data.
  • Reinforcement learning is mainly learning through trial and error, that is, to determine the best answer by performing actions a limited number of times to get the maximum reward.
  • The difference between deep learning and reinforcement learning is that deep learning learns from a training set and then applies the learned knowledge to a new data set, which is static learning; reinforcement learning uses continuous feedback to adjust its own actions and obtain the best result, a process of constant trial and error and dynamic learning.
  • deep learning and reinforcement learning are not mutually exclusive concepts. The two can be used in combination. For example, deep learning can be used in reinforcement learning.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formal information to conduct machine thinking and solving problems based on reasoning control strategies.
  • the typical function is search and matching.
  • Decision-making refers to the decision-making process of intelligent information after reasoning, and usually provides functions such as classification, ranking, and prediction.
  • Smart products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productizing intelligent information decision-making and realizing practical applications. Application fields mainly include smart manufacturing, smart transportation, smart home, smart medical care, smart security, autonomous driving, safe cities, and smart terminals.
  • The embodiments of this application mainly study the data training part of the framework shown in FIG. 1, and further study how to transmit the computed gradients during distributed training of the first neural network model on training data, so as to improve the efficiency of distributed training.
  • Fig. 2a is a schematic diagram of the first neural network model.
  • the first neural network model includes multiple layers, and each layer includes at least one parameter.
  • the training of the first neural network model refers to determining the optimal parameter value according to the massive training data, so that the difference between the actual output data and the expected output data obtained by the first neural network model according to the training data meets the requirements.
  • the first neural network model includes N layers, which are the first layer to the Nth layer, and each layer in the first neural network model has a corresponding sequence.
  • the first layer is a layer that directly receives training data
  • the Nth layer is a layer that directly outputs data.
  • The first neural network model can be trained with the backpropagation algorithm, which specifically includes (for one iteration): inputting training data; calculating the actual output data from the first layer to the N-th layer from the training data (that is, forward computation); calculating the loss function value from the difference between the actual output data and the expected output data; and calculating the gradients of the parameters from the N-th layer to the first layer according to the loss function value and using the gradients to update the parameters (that is, backward computation).
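The four steps of one iteration can be traced numerically on a toy two-layer linear chain y = w2·(w1·x); the numbers are illustrative, and the layer numbering follows the text (layer 1 receives the input, layer N produces the output):

```python
# One training iteration: forward pass, loss, backward pass, parameter update.

x, target = 2.0, 8.0
w1, w2, lr = 1.0, 1.5, 0.01

# Forward computation: layer 1, then layer 2.
h = w1 * x
y = w2 * h
loss = (y - target) ** 2

# Backward computation: gradients flow from layer 2 (= layer N) down to layer 1,
# so the last layer's gradient is available first.
dy = 2 * (y - target)          # dLoss/dy
g2 = dy * h                    # dLoss/dw2, computed first
g1 = dy * w2 * x               # dLoss/dw1, computed last

w1 -= lr * g1
w2 -= lr * g2
print(round(loss, 2), round(w1, 3), round(w2, 3))
```

The fact that the last layers' gradients are ready before the first layers' is exactly what lets gradient transmission overlap with the remaining backward computation in the schemes above.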
  • The gradients computed by the multiple training nodes for a given parameter may differ, so the training nodes need to transmit the computed gradients of parameter a in order to determine the gradient average; the training nodes can then obtain parameter a updated with that average. After the training nodes have updated the parameters of every layer, they each use the updated parameters to perform the next iteration of the first neural network model.
  • a centralized manner may be used for distributed training, or a decentralized manner may be used for distributed training. The following two methods are described in detail.
  • Figure 2b is a schematic diagram of a centralized distributed training system.
  • The distributed training system includes a central server (also called a parameter server or central node) and at least one training node (such as the training nodes shown in Figure 2b).
  • the parameter server can communicate with at least one training node.
  • each training node has a copy of the first neural network model, and each training node can use designated data blocks (shards) to train the first neural network model.
  • The distributed training process is as follows: the training data set is divided into multiple data blocks; for the j-th data block, the j-th data block can be divided into 3 mini-batches, and the 3 mini-batches are trained by the 3 training nodes.
  • Each training node sends the computed gradient of a parameter to the parameter server according to the same rule (for example, after a training node computes the gradients of one layer's parameters, it sends those gradients to the parameter server). Correspondingly, taking parameter a as an example, the parameter server can determine the gradient average of parameter a from the received gradients of parameter a, update parameter a based on that average (the specific update method is not limited), and feed the updated parameter a back to the 3 training nodes.
  • In this way, the three training nodes complete the update of the parameters of each layer of the first neural network model, and can use the mini-batches of the (j+1)-th data block to perform the next iteration based on the updated parameters.
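A toy version of this centralized flow, with assumed names (`ParameterServer`, `push`) rather than anything from the application:

```python
# Each training node pushes its gradient for parameter `a` to the server,
# which averages the gradients, updates the parameter, and broadcasts the
# same new value back to every node.

class ParameterServer:
    def __init__(self, a):
        self.a = a
        self.pending = []

    def push(self, grad_a):
        """Receive one node's gradient for parameter a."""
        self.pending.append(grad_a)

    def update_and_broadcast(self, lr=0.1):
        """Average the pending gradients, update a, and return the new value."""
        avg = sum(self.pending) / len(self.pending)
        self.a -= lr * avg
        self.pending = []
        return self.a                 # every node receives the same value

server = ParameterServer(a=1.0)
for g in (0.3, 0.6, 0.9):             # gradients from 3 training nodes
    server.push(g)
new_a = server.update_and_broadcast()
print(round(new_a, 4))                # → 0.94
```

The server is a single point that every gradient must traverse, which is the bottleneck that motivates the decentralized scheme described next.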
  • Figure 2c is a schematic diagram of a decentralized distributed training system that includes at least one training node (such as training node 1, training node 2, training node 3, training node 4, and training node 5 shown in Figure 2c). The training nodes can communicate with each other, for example to transmit gradients.
  • The training nodes may have a prescribed sequence of data transmission among themselves.
  • For example, training node 1 can only transmit data to training node 2, training node 2 can only transmit data to training node 3, and training node 3 can only transmit data back to training node 1.
  • the sequence of data transmission among multiple training nodes can be pre-configured, or it can be calculated and determined by the training node according to specific rules.
  • The distributed training process can be: dividing the training data set into multiple data blocks; for the j-th data block, the j-th data block can be divided into 5 mini-batches, and the 5 mini-batches are trained by the 5 training nodes.
  • the gradient of the calculated parameter can be sent to other training nodes according to the sequence of data transmission.
  • The buffered gradients can be divided into multiple groups; each group is a slice, and each slice includes at least one gradient. The number of slices is the same as the number of training nodes training the first neural network model: if there are 5 training nodes, the buffered gradients are cut into 5 slices.
  • Each training node uses the same slicing rule. For example, suppose there are 5 training nodes in total: if each training node buffers 10 gradients to be transmitted, each slice includes 2 gradients; if each training node buffers 11 gradients to be transmitted, the gradients can be cut into 5 slices of 2, 2, 2, 2, and 3 gradients, that is, 4 slices each include 2 gradients and 1 slice includes 3 gradients.
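The slicing arithmetic above can be written directly; where the slice with the extra gradient goes is a choice, and here the remainder goes to the last slices, matching the 2, 2, 2, 2, 3 example:

```python
# Split `total` buffered gradients into `num_nodes` slices as evenly as
# possible, so slice sizes differ by at most 1.

def slice_sizes(total, num_nodes):
    base, extra = divmod(total, num_nodes)
    return [base + 1 if i >= num_nodes - extra else base
            for i in range(num_nodes)]

print(slice_sizes(10, 5))   # → [2, 2, 2, 2, 2]
print(slice_sizes(11, 5))   # → [2, 2, 2, 2, 3]
```

Because every node applies the same deterministic rule, slices with the same identifier cover the same parameters on every node, which the ring exchange below relies on.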
  • Training node i cuts its cached gradients into 5 slices identified ai to ei, where i ranges from 1 to 5; that is, training node 1 cuts its cached gradients into 5 slices identified a1 to e1, training node 2 cuts its cached gradients into 5 slices identified a2 to e2, and so on.
  • For a given slice identifier, the parameters corresponding to the gradients in that slice are consistent across training nodes.
  • For example, slice a1 includes two gradients, which are the gradients corresponding to parameter R and parameter Y; slices a2, a3, a4, and a5 also each include two gradients, likewise corresponding to parameter R and parameter Y, respectively.
• when training node 1 transmits its buffered gradients to training node 2, training node 1 first sends slice a1 to training node 2; after training node 2 receives slice a1 sent by training node 1, it adds the received slice a1 to its own slice a2 and sends the sum to training node 3 as slice a1+a2. Here, slice a1 includes the gradients corresponding to parameter R and parameter Y, which are r1 and y1 respectively, and slice a2 includes the gradients corresponding to parameter R and parameter Y, which are r2 and y2 respectively.
• that training node 2 sends the sum of slice a1 and slice a2 to training node 3 as slice a1+a2 can be understood as follows: for parameter R, the sum of the two corresponding gradients r1 and r2 is carried in slice a1+a2 as the gradient of parameter R and sent to training node 3; for parameter Y, the sum of the two corresponding gradients y1 and y2 is carried in slice a1+a2 as the gradient of parameter Y and sent to training node 3.
• after training node 3 receives slice a1+a2 sent by training node 2, it adds the received slice a1+a2 to its own slice a3 and sends the sum to training node 4 as slice a1+a2+a3; after training node 4 receives slice a1+a2+a3 sent by training node 3, it adds the received slice a1+a2+a3 to its own slice a4 and sends the sum to training node 5 as slice a1+a2+a3+a4; after training node 5 receives slice a1+a2+a3+a4 sent by training node 4, it can add the received slice a1+a2+a3+a4 to its own slice a5 and send the sum to training node 1 as slice a1+a2+a3+a4+a5.
• alternatively, training node 5 may calculate the gradient average according to the received slice a1+a2+a3+a4 and its own slice a5, and send the gradient average calculated for slice a to training node 1. Training node 1 then sends the gradient average calculated for slice a to training node 2, training node 2 sends it to training node 3, and training node 3 sends it to training node 4. In this way, training node 1 to training node 5 all obtain the gradient average calculated for slice a, and training node 1 to training node 5 can use it to update the values of parameter R and parameter Y corresponding to slice a for use in the next iteration.
• similarly, when training node 2 transmits its buffered gradients to training node 3, training node 2 first sends slice b2 to training node 3; after training node 3 receives slice b2 sent by training node 2, it adds the received slice b2 to its own slice b3 and sends the sum to training node 4 as slice b2+b3; and so on, using a process similar to the above, until training node 1 sends the calculated sum to training node 5 as slice b1+b2+b3+b4+b5. Alternatively, training node 1 may calculate the gradient average for slice b and send the gradient average calculated for slice b to training node 2.
• training node 2 then sends the gradient average calculated for slice b to training node 3, training node 3 sends it to training node 4, and training node 4 sends it to training node 5. In this way, training node 1 to training node 5 all obtain the gradient average calculated for slice b, and can use it to update the parameter values corresponding to slice b for use in the next iteration.
• similarly, training node 3 first sends slice c3 to training node 4; ...; training node 2 calculates the gradient average for slice c and sends it to training node 3; ...; until training node 1 to training node 5 obtain the gradient average calculated for slice c, and training node 1 to training node 5 can use it to update the parameter values corresponding to slice c.
• similarly, training node 4 first sends slice d4 to training node 5; ...; training node 3 calculates the gradient average for slice d and sends it to training node 4; ...; until training node 1 to training node 5 obtain the gradient average calculated for slice d, and training node 1 to training node 5 can use it to update the parameter values corresponding to slice d.
• similarly, training node 5 first sends slice e5 to training node 1; ...; training node 4 calculates the gradient average for slice e and sends it to training node 5; ...; until training node 1 to training node 5 obtain the gradient average calculated for slice e, and training node 1 to training node 5 can use it to update the parameter values corresponding to slice e.
• the mini-batch data in the j+1th data block can then be used for the next iteration based on the updated parameters.
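• The slice-by-slice averaging walked through above can be sketched functionally as follows. This is an illustrative simulation of the result of the ring exchange (it computes, for each slice identifier, the element-wise average across all nodes, rather than modeling each individual send), and the function name is hypothetical.

```python
def ring_allreduce_average(node_slices):
    """Simulate the outcome of the ring-based averaging described above.

    node_slices[i][k] is the k-th slice (a list of gradient values) held by
    training node i; every node holds one slice per slice identifier.
    Returns, for every node, the per-slice average across all nodes.
    """
    w = len(node_slices)  # number of training nodes (and slices per node)
    averaged = []
    for k in range(w):
        # Scatter-reduce phase: slice k accumulates contributions from all
        # nodes as it travels around the ring starting at node k.
        acc = list(node_slices[k][k])
        node = (k + 1) % w
        for _ in range(w - 1):
            acc = [a + b for a, b in zip(acc, node_slices[node][k])]
            node = (node + 1) % w
        averaged.append([v / w for v in acc])
    # All-gather phase: every node ends up with every averaged slice.
    return [[list(s) for s in averaged] for _ in range(w)]
```

After the call, each node holds identical averaged slices, which it can use to update the parameter values corresponding to each slice for the next iteration.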
  • the system may include one or more computer devices, and each computer device may be deployed with one or more training nodes. Training nodes deployed in the same computer device can communicate through a communication bus, and training nodes deployed in different computer devices can communicate through a network (such as a wireless network).
  • the training node needs to send the gradient out for aggregation operation during the reverse calculation process of neural network model training (that is, obtain the average value of the gradient in order to update the parameters). Further, in order to improve training efficiency, the training node also needs to overlap the layer-by-layer calculation and gradient transmission of the first neural network training, that is, the calculation and the communication are asynchronous and parallel.
• refer to Figure 2e, which is a possible schematic diagram of asynchronous parallel computing and communication.
• after the gradient of the N-th layer parameters is calculated, it can be sent out (transmission delay denoted as τN); after the gradient of the N-1th layer parameters is obtained, it can be sent out (transmission delay denoted as τN-1); and so on, after the gradient of the first layer parameters is calculated, it can be sent out (transmission delay denoted as τ1). In this way, after the gradients of the parameters of each layer of the first neural network model have all been sent out, the parameters of each layer of the first neural network model can be updated, and the next iteration can be executed.
• the last transmission can only be executed after the gradient of the first layer parameters is calculated, which leads to the phenomenon of communication tailing. If the communication tailing time of the i-th iteration (that is, τ1) is long, the time interval between the i-th iteration and the i+1th iteration becomes long, resulting in lower efficiency of distributed training of the first neural network model.
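• The communication tailing described above can be estimated with a simple timing model. This sketch is illustrative only: it assumes transmissions are serialized on one channel, each layer's transmission starts once both its gradient is ready and the channel is free, and layers are listed in the order their gradients are produced in the backward pass (N, N-1, ..., 1). The function name and the model itself are assumptions, not part of the original application.

```python
def communication_tail(compute_times, transmit_times):
    """Estimate the communication tailing duration.

    compute_times[k]  : backward-pass compute time of the k-th transmitted
                        layer (layers ordered N, N-1, ..., 1).
    transmit_times[k] : transmission delay of that layer's gradients.
    Returns the time between the end of the last gradient computation and
    the end of the last transmission (the tail delaying iteration i+1).
    """
    t_compute = 0.0  # time at which the current layer's gradient is ready
    t_comm = 0.0     # time at which the channel becomes free
    for c, tx in zip(compute_times, transmit_times):
        t_compute += c
        t_comm = max(t_comm, t_compute) + tx
    return t_comm - t_compute
```

For example, if every layer takes 1 unit to compute and 1 unit to transmit, only the last transmission (τ1) remains as tail; a slow early transmission lengthens the tail because later transmissions queue behind it.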
• the distributed training system can have multiple physical networking methods, such as InfiniBand, RDMA over Converged Ethernet (RoCE, where RDMA refers to remote direct memory access), high-speed serial computer expansion bus (peripheral component interconnect express, PCIe), NVLink interconnect, etc.
• there are also multiple logical topologies that can be used for gradient transmission, such as logical trees, rings, halving & doubling, hierarchical rings, hybrid topologies, etc.
• under different physical networking methods and different logical topologies, the curve of transmission delay against transmitted data volume shows different trends. As shown in Figure 2f, when the transmitted data volume is less than M0, the transmission delay using topo0 is less than that using topo1; otherwise, the transmission delay using topo1 is less than that using topo0.
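• The crossover behavior in Figure 2f can be illustrated with a simple linear delay model, latency = startup overhead + per-unit cost × volume. The model, the function name, and the numeric parameters below are illustrative assumptions: a topology with low startup overhead wins for small volumes, while a topology with low per-unit cost wins for large volumes.

```python
def pick_topology(volume, topos):
    """Pick the logical topology with the lowest modeled transmission delay.

    topos: mapping of topology name -> (startup_overhead, per_unit_cost);
    the delay model latency = overhead + cost * volume is illustrative.
    """
    return min(topos, key=lambda t: topos[t][0] + topos[t][1] * volume)
```

For example, with topos = {"topo0": (1.0, 0.10), "topo1": (5.0, 0.01)}, topo0 is cheaper below the crossover volume and topo1 above it, mirroring the M0 threshold in the figure.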
• in addition, the method of transmitting gradients shown in FIG. 2e is to transmit the gradient of each layer's parameters as soon as it is calculated, but the parameter amount and the parameter distribution across layers of different neural network models are often different. For example, within the same neural network model, some layers have more parameters and some layers have fewer. For layers with few parameters, initiating a transmission per layer is obviously inefficient (considering factors such as communication overhead); therefore, the calculated gradients can be accumulated to a certain amount before a transmission is initiated, for example, the gradients of two or three layers of parameters can be transmitted together, so as to improve transmission efficiency. For another example, across different neural network models, the parameter distribution of some models is relatively uniform, while the parameters of other models may be concentrated in certain layers, which may cause bursty transmission.
• at present, some deep learning frameworks (such as TensorFlow) can use third-party libraries (such as Horovod, OpenMPI) to transmit gradients. Horovod allows users to set a data volume threshold or a time threshold: when the corresponding threshold is reached, the transmission is initiated. However, this method does not provide any basis for determining the data volume threshold or time threshold; the threshold set by a user based on personal experience may not be reasonable enough, thus failing to achieve the purpose of improving the efficiency of distributed training.
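• The threshold-based accumulation described above can be sketched as follows. This is an illustrative sketch of the general idea (accumulate per-layer gradients in backward order and flush a fused transmission once the buffered data volume reaches the threshold), not Horovod's actual implementation; the function name is hypothetical.

```python
def fuse_by_threshold(layer_sizes, threshold):
    """Group per-layer gradient data volumes (in backward order) into fused
    transmissions: a transmission is initiated once the buffered volume
    reaches the threshold; any remaining partial bucket is flushed at the
    end of the iteration."""
    buckets, buf = [], []
    for size in layer_sizes:
        buf.append(size)
        if sum(buf) >= threshold:
            buckets.append(buf)  # initiate one fused transmission
            buf = []
    if buf:
        buckets.append(buf)      # flush the leftover gradients
    return buckets
```

A threshold that is too small degenerates to one transmission per layer (high overhead); a threshold that is too large delays transmission and lengthens the communication tail, which is why choosing the threshold well matters.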
• in view of this, an embodiment of the present application provides a method for determining a transmission strategy, which specifically includes: generating an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of each layer's parameters obtained in the i-th iteration of the first neural network model; determining the communication tailing duration corresponding to the i-th transmission strategy, where the communication tailing duration corresponding to the i-th transmission strategy is used to indicate the duration between the end time of the i-th iteration and the start time of the i+1th iteration of the first neural network model; and generating the i+1th transmission strategy according to the communication tailing duration corresponding to the i-th transmission strategy, where the i+1th transmission strategy is used to transmit the gradients of each layer's parameters obtained in the i+1th iteration of the first neural network model.
  • the above method can be executed by a computing node.
• after the computing node generates the i-th transmission strategy, it can obtain the communication tailing duration corresponding to the i-th transmission strategy, so that a round of reinforcement learning can be completed based on that duration. This makes the generated i+1th transmission strategy tend toward the optimal transmission strategy (that is, the transmission strategy that minimizes the communication tailing duration), which is beneficial to improving the efficiency of distributed training.
• the computing node can interact with the multiple training nodes that perform distributed training on the first neural network model, and continuously try and update the transmission strategy based on the communication tailing durations that the training nodes feed back after completing each iteration. In this way, a near-optimal transmission strategy can be generated intelligently and automatically through reinforcement learning, improving the efficiency of distributed training.
  • FIG. 3 is a schematic diagram of an architecture to which the embodiments of the application are applicable. As shown in FIG. 3, it includes a computing node and a distributed training system.
• the distributed training system may be the centralized distributed training system shown in FIG. 2b, the decentralized distributed training system shown in FIG. 2c, or another possible distributed training system, which is not specifically limited; FIG. 3 only takes the decentralized distributed training system shown in FIG. 2c as an example.
• the computing node may include an agent executor, and each training node may include an evaluator; for example, training node 1 includes evaluator 1, training node 2 includes evaluator 2, ..., and training node 5 includes evaluator 5.
• the agent executor can be a set of reinforcement learning networks or algorithms, mainly used to generate the i-th transmission strategy, send the i-th transmission strategy to evaluator 1 to evaluator 5, and update its own parameters according to the rewards (specifically, the communication tailing durations) fed back by evaluator 1 to evaluator 5.
• evaluator 1 is mainly used to obtain the i-th transmission strategy generated by the agent executor, start an iteration of the first neural network model, transmit the reverse-calculated gradients according to the i-th transmission strategy, measure the communication tailing duration, and feed the communication tailing duration back to the agent executor as a reward; the other evaluators work in the same way as evaluator 1 and are not described again. In this way, over repeated iterations, the agent executor can continuously learn and evolve from the real model training environment, and eventually tend to produce the optimal transmission strategy.
  • FIG. 4 is a schematic flowchart corresponding to a method for determining a transmission strategy provided by an embodiment of the application, as shown in FIG. 4, including:
• Step 401: The computing node generates the i-th transmission strategy, and sends the i-th transmission strategy to W training nodes respectively. Among them, the W training nodes are used for distributed training of the first neural network model; the i-th transmission strategy is used by the W training nodes to transmit the gradients of each layer's parameters obtained in the i-th iteration of the first neural network model.
• Step 402: The training node receives the i-th transmission strategy, and sends to the computing node the i-th communication tailing duration of transmitting, using the i-th transmission strategy, the gradients of each layer's parameters obtained in the i-th iteration of the first neural network model.
  • the training node described here may be any one of the W training nodes.
• Step 403: The computing node receives the W i-th communication tailing durations sent by the W training nodes.
• the i-th communication tailing duration obtained by a training node may be equal to the duration between the end time of that training node's i-th iteration on the first neural network model and the start time of its i+1th iteration. For example, the i-th communication tailing duration obtained by training node 1 may be equal to the duration between the end time of training node 1's i-th iteration on the first neural network model and the start time of its i+1th iteration.
• Step 404: The computing node determines the communication tailing duration corresponding to the i-th transmission strategy according to the W i-th communication tailing durations.
• the communication tailing duration corresponding to the i-th transmission strategy is used to indicate the duration between the end time of the i-th iteration and the start time of the i+1th iteration of the first neural network model. Since W training nodes perform distributed training on the first neural network model, it can also be understood as follows: the communication tailing duration corresponding to the i-th transmission strategy reflects the durations between the end times of the W training nodes' i-th iterations of the first neural network model and the start times of their i+1th iterations.
• there may be multiple specific implementations by which the computing node determines the communication tailing duration corresponding to the i-th transmission strategy from the W i-th communication tailing durations. For example, the computing node can determine the average of the W i-th communication tailing durations as the communication tailing duration corresponding to the i-th transmission strategy; that is, the communication tailing duration corresponding to the i-th transmission strategy is equal to the average of the durations between the end times of the W training nodes' i-th iterations of the first neural network model and the start times of their i+1th iterations.
• Step 405: The computing node generates the i+1th transmission strategy according to the communication tailing duration corresponding to the i-th transmission strategy, and sends the i+1th transmission strategy to the W training nodes; the i+1th transmission strategy is used by the W training nodes to transmit the gradients of each layer's parameters obtained in the i+1th iteration of the first neural network model.
• the value of X can be obtained according to the number of data blocks into which the training data set of the first neural network model is divided, each data block being used for one iteration of the first neural network model (for details, see the descriptions of FIG. 2b and FIG. 2c above). For example, X is equal to the number of data blocks into which the training data set of the first neural network model is divided.
• in one possible example, if the computing node determines that the communication tailing duration corresponding to the i-th transmission strategy is greater than a first threshold, it can generate the i+1th transmission strategy according to the communication tailing duration corresponding to the i-th transmission strategy; if the computing node determines that the communication tailing duration corresponding to the i-th transmission strategy is less than or equal to the first threshold, the i-th transmission strategy can be used as the i+1th transmission strategy. That is, if, through multiple rounds of reinforcement learning, the generated i-th transmission strategy already makes distributed training efficient (the i-th transmission strategy is a good enough transmission strategy), no new transmission strategy needs to be generated, and the training nodes can use the same transmission strategy (that is, the i-th transmission strategy) to transmit gradients in subsequent iterations of the first neural network model.
• the first threshold can be set according to actual needs and experience. In this way, the processing burden of the computing node can be effectively reduced, and the training nodes can transmit gradients based on the same transmission strategy without waiting to receive a newly generated transmission strategy from the computing node, which can effectively improve the efficiency of distributed training.
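• The decision logic of steps 403 to 405 in this example can be sketched as follows. This is illustrative only; the function name is hypothetical, and `regenerate` stands in for the reinforcement-learning update, which is not shown here.

```python
def next_strategy(tail_durations, current_strategy, first_threshold, regenerate):
    """Average the W reported i-th communication tailing durations, then
    either reuse the current transmission strategy (tail already at or
    below the first threshold) or generate a new one via `regenerate`."""
    avg_tail = sum(tail_durations) / len(tail_durations)
    if avg_tail <= first_threshold:
        return current_strategy          # good enough: keep using it
    return regenerate(avg_tail)          # one more round of learning
```

This mirrors the text: the averaged tailing duration serves as the per-strategy figure of merit, and the first threshold gates whether a new strategy is generated at all.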
• in the above example, the communication tailing duration corresponding to the i-th transmission strategy is compared with the first threshold to determine whether the i-th transmission strategy is a good enough transmission strategy; in other possible examples, this may be determined in other ways, which is not specifically limited.
• in another possible example, the computing node can determine the transmission strategy used in the next iteration based on the communication tailing duration of the previous iteration; that is, after the computing node determines the communication tailing duration corresponding to the i-th transmission strategy, it can directly generate the i+1th transmission strategy according to that duration, without determining whether the duration is greater than the first threshold. In this way, the transmission strategy can be adjusted in time as conditions change, improving the efficiency of distributed training.
  • the description will be mainly based on this example below.
  • the computing node may generate a transmission strategy based on a variety of possible reinforcement learning methods.
  • the following exemplarily describes two possible implementation methods.
• in implementation 1, the computing node can generate the i-th transmission strategy through a second neural network model, update the parameters of the second neural network model according to the communication tailing duration corresponding to the i-th transmission strategy, and generate the i+1th transmission strategy according to the updated second neural network model.
• the second neural network model may be a recurrent neural network (RNN) model, such as a long short-term memory (LSTM) network.
• there may be multiple ways in which the computing node updates the parameters of the second neural network model; for example, the update can be performed using the proximal policy optimization (PPO) algorithm or the asynchronous advantage actor-critic (A3C) algorithm.
• in this way, the communication tailing duration corresponding to the i-th transmission strategy is used as the reward of reinforcement learning to update the parameters of the second neural network model. Since the second neural network model (such as an RNN model) has a strong learning ability, continuously performing this process can make the communication tailing duration converge to the optimal value.
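• The learning loop can be illustrated with a deliberately simplified stand-in: treat each candidate transmission strategy as one choice, use reward = negative communication tailing duration, and update softmax preferences with a gradient-bandit rule. This is a toy sketch of the reward signal described above, not the PPO/A3C updates or the RNN model named in the text; all names and numeric settings are assumptions.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train_strategy_selector(tail_of, num_strategies, rounds=500, lr=0.2, seed=0):
    """Toy stand-in for the agent executor's loop: sample a strategy from a
    softmax over preferences, observe its communication tail via tail_of,
    and update preferences so that shorter tails become more likely."""
    rng = random.Random(seed)
    prefs = [0.0] * num_strategies
    baseline = 0.0
    for t in range(rounds):
        probs = softmax(prefs)
        # sample a strategy index from the current policy
        r, acc, choice = rng.random(), 0.0, num_strategies - 1
        for k, p in enumerate(probs):
            acc += p
            if r < acc:
                choice = k
                break
        reward = -tail_of(choice)           # shorter tail => larger reward
        baseline += (reward - baseline) / (t + 1)
        adv = reward - baseline
        for k in range(num_strategies):
            grad = (1.0 - probs[k]) if k == choice else -probs[k]
            prefs[k] += lr * adv * grad
    return prefs
```

After training against a fixed environment, the strategy with the shortest communication tail ends up with the highest preference, which is the convergence behavior the text attributes to the agent executor.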
• in implementation 1, the i-th transmission strategy may include a sub-transmission strategy for each layer of the first neural network model; the sub-transmission strategy of the n-th layer may include first information and second information, where the first information is used to indicate whether to initiate transmission after the gradient of the n-th layer parameters is calculated, and the second information is used to indicate the logical topology used for the transmission.
• the i-th transmission strategy may be a sequence in the form of {first information (whether to communicate), second information (logical topology)} repeated a certain number of times (the number depends on the number of layers of the first neural network model); each {first information (whether to communicate), second information (logical topology)} pair can be understood as the sub-transmission strategy of one layer of the first neural network model.
  • the first neural network model includes 3 layers, and the i-th transmission strategy is [ ⁇ Yes, topo0 ⁇ , ⁇ Yes, topo1 ⁇ , ⁇ Yes, topo1 ⁇ ].
• a logical topology space may be preset, and the logical topology space may include multiple logical topologies for selection; the logical topology used for transmission indicated by the second information may be one topology in the logical topology space.
• the computing node can generate a transmission strategy through the second neural network model in a self-feeding loop, in which each output of the model is fed back as its next input.
  • the specific implementation process of generating the transmission strategy will be described below with reference to FIG. 5a.
• the initial input of the second neural network model can be random content (for example, the identifier of a random logical topology in the logical topology space), and the first information of the sub-transmission strategy of the first layer can be generated according to the initial input. The first information of the sub-transmission strategy of the first layer is then used as the input of the second neural network model to generate the second information of the sub-transmission strategy of the first layer; the second information of the sub-transmission strategy of the first layer is used as the input of the second neural network model to generate the first information of the sub-transmission strategy of the second layer, and so on.
• in this way, the second neural network model generates the sub-transmission strategy of one layer of the first neural network model every two time steps, so the first transmission strategy can be generated after n*2 time steps; the computing node then sends the first transmission strategy to the W training nodes (corresponding to step 401).
• the training node receives the first transmission strategy, uses the first transmission strategy to transmit the gradients of each layer's parameters obtained in the first iteration of the first neural network model, and sends to the computing node the first communication tailing duration of transmitting, using the first transmission strategy, the gradients of each layer's parameters obtained in the first iteration of the first neural network model (corresponding to step 402); the computing node receives the first communication tailing durations sent by the W training nodes and determines the communication tailing duration corresponding to the first transmission strategy (corresponding to step 403 and step 404); the computing node updates the second neural network model according to the communication tailing duration corresponding to the first transmission strategy and, based on the updated model, generates the second transmission strategy after n*2 time steps (corresponding to step 405); the computing node sends the second transmission strategy to the W training nodes, thereby cyclically executing the above step 401 to step 405 until the training of the first neural network model is completed.
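• The self-feeding generation over n*2 time steps described above can be sketched as follows. This is illustrative: `step_model` stands in for the second neural network model (for example, one step of an LSTM cell), and the dictionary keys are hypothetical names for the first and second information.

```python
def generate_strategy(step_model, num_layers, initial_input):
    """Generate a per-layer transmission strategy autoregressively: at odd
    time steps the model emits the first information (whether to initiate
    transmission), at even time steps the second information (which
    logical topology); each output feeds back as the next input."""
    strategy, x = [], initial_input
    for _ in range(num_layers):
        communicate = step_model(x)          # first information of this layer
        topology = step_model(communicate)   # second information of this layer
        strategy.append({"communicate": communicate, "topology": topology})
        x = topology                         # output becomes the next input
    return strategy
```

With a 3-layer first neural network model, the result is a 3-element sequence of {first information, second information} pairs, matching the [{Yes, topo0}, {Yes, topo1}, {Yes, topo1}] form given earlier.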
  • FIG. 5b is an overall schematic diagram of Implementation Mode 1 provided by an embodiment of the application.
• as shown in FIG. 5b, the second neural network model and the parameter update algorithm can be run by the agent executor, and each evaluator is responsible for adding a communication operator to the reverse calculation of each layer of the first neural network model (for example, evaluator 1 is responsible for adding a communication operator to the reverse calculation of each layer of the first neural network model on training node 1), and then, according to the transmission strategy generated by the agent executor, controlling whether the communication operator initiates transmission and which logical topology it uses.
• after an iteration is completed, the evaluator feeds back the communication tailing duration corresponding to the i-th transmission strategy to the agent executor; the agent executor treats the communication tailing duration corresponding to the i-th transmission strategy as the "reward" obtained by interacting with the environment, performs a calculation according to a policy gradient method (such as PPO), and then updates the parameters of the second neural network model to generate a new transmission strategy (that is, the i+1th transmission strategy), completing a round of reinforcement learning.
  • the agent executor can generate an approximately optimal transmission strategy for the specific physical networking mode and the first neural network model.
• in implementation 2, the computing node can generate the i-th transmission strategy through the Q table (Q-Table) used to record state-action values in the Q-learning algorithm; and, according to the communication tailing duration corresponding to the i-th transmission strategy, update the Q table and generate the i+1th transmission strategy through the updated Q table.
• the Q table includes P states and Q actions: the P states correspond to P data volume thresholds, and the Q actions correspond to Q combinations each composed of a state transition amount and a logical topology, where P and Q are both integers greater than or equal to 1.
• that is, the data volume threshold is used as the state dimension of the Q table, and the state transition amount and logical topology are used as the action dimension of the Q table.
• further, the Q table in Q-learning is updated by using the communication tailing duration corresponding to the i-th transmission strategy as the reward of reinforcement learning. Since the Q-learning algorithm can take actions according to the current state and, after obtaining the corresponding rewards, improve those actions, it can gradually make better actions, that is, produce a better transmission strategy.
• in a specific implementation, the computing node may determine the minimum data volume threshold among the P data volume thresholds according to a preset transmission efficiency. For example, the preset transmission efficiency may be the lowest acceptable transmission efficiency: based on the lowest acceptable transmission efficiency, find the logical topology with the smallest transmission volume, and use the data volume corresponding to that logical topology as the minimum data volume threshold.
• the computing node can determine the maximum data volume threshold among the P data volume thresholds according to the parameter volume of the first neural network model, for example, determine a certain proportion (such as 50% or 80%) of the parameter volume of the first neural network model as the maximum data volume threshold.
• the notation in the Q table can be understood as follows:
  • "+m" means adding m to the current state; for example, if the current state is Mmin, then "+m" means transitioning to the state Mmin+m.
  • "-m" means subtracting m from the current state; for example, if the current state is Mmax, then "-m" means transitioning to the state Mmax-m.
  • "0" means keeping the current state threshold M unchanged.
  • "Topok" represents the logical topology used for transmission.
  • Q(s1,a1) represents the reward for performing the action {+m, Topo1} when the current state is Mmin; here, performing the action {+m, Topo1} refers to generating transmission strategy a, whose data volume threshold is Mmin+m and whose logical topology used for each transmission is Topo1.
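• The Q-table lookup and update can be sketched with the standard Q-learning rule, using reward = negative communication tailing duration so that actions leading to shorter tails accumulate higher Q values. The function names, the learning-rate/discount values, and the greedy selection shown are illustrative assumptions (the text also allows, for example, choosing the second-best action).

```python
def choose_action(q, state, actions):
    """Greedy selection: pick the action with the largest Q value in the
    current state's row of the Q table."""
    return max(actions, key=lambda a: q[(state, a)])

def q_update(q, state, action, next_state, tail, actions, alpha=0.5, gamma=0.9):
    """One Q-learning update with reward = -communication tailing duration:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    reward = -tail
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q
```

Here a state is a data volume threshold (e.g. Mmin) and an action is a {state transition amount, logical topology} combination (e.g. {+m, Topo1}), matching the Q-table layout described above.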
• in implementation 2, the i-th transmission strategy includes third information and fourth information. The third information is used to indicate the i-th data volume threshold for the gradients of each layer's parameters obtained in the i-th iteration of the first neural network model, and the i-th data volume threshold is used to determine the transmission timing of the gradients of each layer's parameters obtained in the i-th iteration; the fourth information is used to indicate the logical topology used for each transmission.
• for example, after the training node calculates the gradient of the n-th layer parameters of the first neural network model, if it determines that the data volume of the accumulated gradients (that is, the gradients to be transmitted) is greater than or equal to the i-th data volume threshold, it initiates a transmission and uses the logical topology indicated by the fourth information.
• the computing node can generate an initialized Q table before generating the first transmission strategy, and the values of Q(s1,a1), Q(s1,a2), ... in the initialized Q table can be a set of random numbers drawn from a Gaussian distribution.
  • according to the initialized Q table, the computing node takes the state corresponding to Mmin as the current state, determines the first target action (for example, (+m, topo2)) according to the reward values of the Q actions executed in the current state, and then generates the first transmission strategy and sends it to the W training nodes (corresponding to step 401); the first data volume threshold is the sum of the state transition amount in the combination corresponding to the first target action and the data volume threshold corresponding to the current state (i.e., Mmin+m), and the logical topology used in each transmission is the logical topology in that combination (i.e., topo2).
  • the training nodes receive the first transmission strategy, use it to transmit the gradients of the parameters of each layer obtained in the first iteration of the first neural network model, and send the resulting first communication tail duration to the computing node (corresponding to step 402). The computing node receives the first communication tail durations sent by the W training nodes and determines the communication tail duration corresponding to the first transmission strategy (corresponding to steps 403 and 404). The computing node then updates the initialized Q table according to the communication tail duration corresponding to the first transmission strategy, generates the second transmission strategy based on the updated Q table (corresponding to step 405), and sends the second transmission strategy to the W training nodes. Steps 401 to 405 are executed in a loop until the training of the first neural network model is completed.
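The loop over steps 401 to 405 can be sketched as follows. The classes below are stand-ins, the canned tail durations and the toy threshold-update rule replace the real Q-table logic, and taking the maximum of the W reported durations is an assumption, since the text does not fix how the per-node durations are combined:

```python
class TrainingNode:
    """Stand-in for a training node: runs one iteration under a strategy and
    reports its communication tail duration (here, canned values)."""
    def __init__(self, tail_durations):
        self._tails = iter(tail_durations)
    def run_iteration(self, strategy):
        return next(self._tails)

class QTableAgent:
    """Stand-in for the computing node's Q-table logic: a toy rule that keeps
    growing the threshold while the observed tail duration keeps shrinking."""
    def __init__(self, threshold=4, step=4):
        self.threshold, self.step, self.history = threshold, step, []
    def strategy(self):
        return {"threshold": self.threshold, "topology": "Topo1"}
    def update(self, tail_duration):
        self.history.append(tail_duration)
        if len(self.history) < 2 or self.history[-1] < self.history[-2]:
            self.threshold += self.step

def train(agent, nodes, iterations):
    strategy = agent.strategy()                              # step 401
    for _ in range(iterations - 1):
        tails = [n.run_iteration(strategy) for n in nodes]   # step 402
        tail = max(tails)                                    # steps 403-404
        agent.update(tail)                                   # step 405: update...
        strategy = agent.strategy()                          # ...and regenerate
    return strategy
```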
  • the computing node may determine the i-th target action based on the reward values of the Q actions. For example, it may select the action with the largest reward value among the Q actions as the i-th target action, or it may select the action with the second-largest reward value; this is not specifically limited.
  • the computing node may update the Q table according to the communication tail duration corresponding to the i-th transmission strategy, for example by using the Bellman equation; the specific update rule is not limited.
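The Bellman-equation update mentioned above can be sketched as the standard Q-learning rule. Treating the negative communication tail duration as the reward is an assumption (the text only says the duration is fed back as a reward); the learning rate and discount factor are likewise illustrative:

```python
def bellman_update(q_table, state, action, next_state,
                   tail_duration, alpha=0.1, gamma=0.9):
    """One Q-learning (Bellman) update on a dict-of-dicts Q table.
    The reward is the negative tail duration, so shorter tails score higher."""
    reward = -tail_duration
    best_next = max(q_table[next_state].values())
    q_table[state][action] += alpha * (
        reward + gamma * best_next - q_table[state][action])
```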
  • FIG. 6b is an overall schematic diagram of implementation manner 2 provided by an embodiment of the application.
  • the agent executor runs the Q-learning algorithm and related functional components; it determines and modifies the data volume threshold by interacting with the evaluators in the multiple training nodes, and updates the Q table. Each evaluator is responsible for controlling the transmission of gradients and for feeding back the communication tail duration as a reward to the agent executor.
  • for example, evaluator 1 is responsible for controlling the transmission of the gradients calculated by training node 1: when the amount of gradient data accumulated by training node 1 exceeds the data volume threshold, a transmission is initiated, and the communication tail duration of training node 1's gradient transmission is fed back to the agent executor as a reward.
  • the computing node and the training node may include hardware structures and/or software modules corresponding to each function.
  • the embodiments of the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or by computer-software-driven hardware depends on the specific application and the design constraints of the technical solution. Skilled professionals may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
  • FIG. 7 shows a possible exemplary block diagram of a device for determining a transmission strategy involved in an embodiment of the present application, and the device 700 may exist in the form of software.
  • the apparatus 700 may include: a generating unit 702 and a determining unit 703.
  • the generating unit 702 and the determining unit 703 may be collectively referred to as a processing unit, which is used to control and manage the actions of the apparatus 700.
  • the apparatus 700 may further include a communication unit 704, which is configured to support communication between the apparatus 700 and other nodes.
  • the communication unit 704 may also be referred to as a transceiver unit, and may include a receiving unit and/or a sending unit, which are used to perform receiving and sending operations, respectively.
  • the device 700 may further include a storage unit 701 for storing program codes and/or data of the device 700.
  • the generating unit 702 and the determining unit 703 may be a processor or a controller, which may implement or execute various exemplary logical blocks, modules, and circuits described in conjunction with the disclosure of the embodiments of the present application.
  • the communication unit 704 may be a communication interface, a transceiver, or a transceiver circuit, etc., where the communication interface is a general term. In a specific implementation, the communication interface may include multiple interfaces.
  • the storage unit 701 may be a memory.
  • the apparatus 700 may be the computing node in any of the foregoing embodiments.
  • the generating unit 702 and the determining unit 703 can support the apparatus 700 to execute the actions of the computing nodes in the above method examples.
  • the generating unit 702 and the determining unit 703 mainly perform the internal actions of the computing node in the method example, and the communication unit 704 may support communication between the apparatus 700 and the training node.
  • the generating unit 702 is used to perform the actions of generating the transmission strategies in step 401 and step 405 in FIG. 4; the determining unit 703 is used to perform step 404 in FIG. 4; and the communication unit 704 is used to perform the actions of sending the transmission strategies in step 403 and step 405.
  • the generating unit 702 is configured to generate an i-th transmission strategy, and the i-th transmission strategy is used to transmit the gradient of each layer parameter obtained in the i-th iteration of the first neural network model;
  • the determining unit 703 is configured to determine the communication tail duration corresponding to the i-th transmission strategy, where this duration indicates the length of time between the end time of the i-th iteration of the first neural network model and the start time of the (i+1)-th iteration;
  • the generating unit 702 is further configured to generate an (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy, where the (i+1)-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the (i+1)-th iteration of the first neural network model; where i = 1, 2, ..., X-1, X is the number of iterations of the first neural network model, and X is an integer greater than 1.
  • the generating unit 702 is specifically configured to generate the i-th transmission strategy through a second neural network model.
  • the generating unit 702 is specifically configured to: update the parameters of the second neural network model according to the communication tail duration corresponding to the i-th transmission strategy, and generate the (i+1)-th transmission strategy through the updated second neural network model.
  • the i-th transmission strategy includes a sub-transmission strategy for each layer of the first neural network model, and the sub-transmission strategy of the n-th layer includes first information and second information. The first information is used to indicate whether to initiate transmission after the gradient of the n-th layer parameters is calculated, and the second information is used to indicate the logical topology used for the transmission; n = 1, 2, ..., N, where N is the number of layers of the first neural network model, and N is an integer greater than or equal to 1.
  • the generating unit 702 generates the sub-transmission strategy of the (n+1)-th layer through the second neural network model, specifically: the second information of the sub-transmission strategy of the n-th layer is used as the input of the second neural network model to generate the first information of the sub-transmission strategy of the (n+1)-th layer, and the first information of the sub-transmission strategy of the (n+1)-th layer is used as the input of the second neural network model to generate the second information of the sub-transmission strategy of the (n+1)-th layer.
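This alternating generation can be sketched with a generic callable standing in for the second neural network model. The callable and its scalar inputs are placeholders; the text only fixes that each generated piece of information is fed back as the model's next input:

```python
def generate_sub_strategies(model, num_layers, seed_input):
    """Generate (first_info, second_info) for layers 1..num_layers, feeding
    each output of the second neural network model back in as its next input."""
    strategies, x = [], seed_input
    for _ in range(num_layers):
        first_info = model(x)            # whether to transmit after this layer
        second_info = model(first_info)  # logical topology for the transmission
        strategies.append((first_info, second_info))
        x = second_info                  # layer n's second info seeds layer n+1
    return strategies
```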
  • the generating unit 702 is specifically configured to:
  • the i-th transmission strategy is generated through the Q table used to record state-action pairs in the Q-learning algorithm; the Q table includes P states and Q actions, where the P states correspond to P data volume thresholds and the Q actions correspond to Q combinations of a state transition amount and a logical topology; P and Q are integers greater than or equal to 1.
  • the generating unit 702 is specifically configured to:
  • the Q table is updated according to the communication tail duration corresponding to the i-th transmission strategy, and the (i+1)-th transmission strategy is generated through the updated Q table.
  • the i-th transmission strategy includes third information and fourth information; the third information is used to indicate the i-th data volume threshold for transmitting the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model, and the i-th data volume threshold is used to determine the transmission timing of those gradients; the fourth information is used to indicate the logical topology used for each transmission.
  • the generating unit 702 is specifically configured to:
  • according to the Q table, the reward values for executing the Q actions in the state corresponding to the (i-1)-th data volume threshold are obtained, the i-th target action is determined according to those reward values, and the i-th transmission strategy is generated; the i-th data volume threshold is the sum of the state transition amount in the combination corresponding to the i-th target action and the (i-1)-th data volume threshold, and the logical topology used for each transmission is the logical topology in the combination corresponding to the i-th target action.
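A hedged sketch of this selection step follows. The option of taking the second-largest reward mirrors the text, while the data layout (a list of reward values parallel to the action list) is an assumption:

```python
def select_target_action(rewards, actions, prev_threshold, take_best=True):
    """Pick the i-th target action from the reward values of the Q actions
    (largest reward, or optionally the second largest), then derive the
    i-th data volume threshold and the logical topology from its combination."""
    ranked = sorted(range(len(actions)), key=lambda j: rewards[j], reverse=True)
    j = ranked[0] if take_best else ranked[1]
    delta, topology = actions[j]
    return prev_threshold + delta, topology
```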
  • the maximum data volume threshold among the P data volume thresholds is determined according to the parameter volume of the first neural network model, and/or the minimum data volume threshold among the P data volume thresholds is determined according to a preset transmission efficiency.
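To make these bounds concrete, here is a hedged sketch. The ratio, link bandwidth, and minimum useful transfer time are illustrative assumptions, since the text only says that the maximum follows from the parameter volume and the minimum from a preset transmission efficiency:

```python
def threshold_bounds(param_count, bytes_per_param=4,
                     max_ratio=0.5, link_bandwidth=1.25e9, min_transfer_s=0.01):
    """Illustrative bounds on the P data volume thresholds (in bytes).

    The maximum threshold is a ratio (e.g. 50% or 80%) of the model's
    parameter volume; the minimum is the smallest transfer that still uses
    the link efficiently, derived here from an assumed bandwidth and a
    minimum useful transfer time (both values are made up for illustration).
    """
    model_bytes = param_count * bytes_per_param
    max_threshold = max_ratio * model_bytes
    min_threshold = link_bandwidth * min_transfer_s
    return min_threshold, max_threshold
```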
  • the division of modules in the embodiments of the present application is illustrative and is only a logical function division; there may be other division methods in actual implementation.
  • the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules.
  • if the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • a computer-readable storage medium includes several instructions to enable a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium may be various mediums capable of storing program codes, such as a memory.
  • the apparatus may be the above-mentioned computer device for executing the actions performed by a computing node, or a semiconductor chip provided in that computer device.
  • the device 800 includes a memory 801, a processor 802, and a communication interface 803.
  • the processor 802 has a function of implementing the actions performed by the generating unit 702 and the determining unit 703 in FIG. 7.
  • the apparatus 800 may further include a bus 804.
  • the communication interface 803, the processor 802, and the memory 801 may be connected to each other through the communication line 804; the communication line 804 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The communication line 804 can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in FIG. 8, but this does not mean that there is only one bus or only one type of bus.
  • the processor 802 may be one or more CPUs (or GPUs), or one or more integrated circuits for controlling the execution of programs in the solutions of the present application.
  • the communication interface 803 uses any device such as a transceiver to communicate with the training node.
  • the memory 801 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • the memory can exist independently and is connected to the processor through a communication line 804.
  • the memory can also be integrated with the processor.
  • the memory 801 is used to store computer-executed instructions for executing the solutions of the present application, and the processor 802 controls the execution.
  • the processor 802 is configured to execute computer-executable instructions stored in the memory 801, so as to implement the method provided in the foregoing embodiment of the present application.
  • the computer-executable instructions in the embodiments of the present application may also be referred to as application program code, which is not specifically limited in the embodiments of the present application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
  • each process and/or block in the flowchart and/or block diagram, and the combination of processes and/or blocks in the flowchart and/or block diagram can be implemented by computer program instructions.
  • These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Abstract

A method and an apparatus for determining a transmission policy, relating to data training in the field of artificial intelligence. The method for determining a transmission policy comprises: generating an ith transmission policy, the ith transmission policy being used to transmit the gradient of parameters of layers obtained by an ith iteration of a first neural network model; determining a communication trailing duration corresponding to the ith transmission policy, the communication trailing duration corresponding to the ith transmission policy being used to indicate a duration between the end time of the ith iteration of the first neural network model and the start time of the (i+1)th iteration; and generating an (i+1)th transmission policy according to the communication trailing duration corresponding to the ith transmission policy. Thus, after the ith transmission policy is generated, a communication trailing duration corresponding to the ith transmission policy can be obtained, so a round of reinforcement learning can be completed on the basis of the communication trailing duration corresponding to the ith transmission policy, so that the generated (i+1)th transmission policy tends toward an optimal transmission policy, facilitating the improvement of the efficiency of distributed training.

Description

Method and device for determining a transmission strategy

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a method and device for determining a transmission strategy.

Background

Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
With the rapid development of artificial intelligence, neural network models trained on large data sets have achieved breakthrough improvements and widespread application in many fields. Because training a neural network model through continuous iteration is a typical computation-intensive task that requires a large amount of calculation, the training process is very time-consuming. Although graphics processing unit (GPU) hardware technology, network model structures, and training methods have all made progress in recent years, the fact that single-machine (or single-node) training takes too long cannot be avoided. Moreover, research shows that the performance of a neural network model grows linearly with the scale of the training data, and in the future the scale of training data may reach the PB or ZB level. As the scale of the training data and of the model parameters of the neural network model becomes larger and larger, the growth rate of the memory (or video memory) of a single machine will not be able to keep up. As a result, training a neural network model on a single machine can no longer meet the requirements.

Because distributed training has good flexibility and scalability and can effectively combine single-machine resources, it has become an effective means of solving the above problems. There are two main strategies for distributed training: model-parallel training and data-parallel training. Model parallelism divides the neural network model into multiple parts and hands each part to a training node for training, but there is a large amount of communication between training nodes and partitioning the model is difficult. Data-parallel training divides the training data into multiple training data sets and hands them to multiple training nodes for training, without partitioning the model. Therefore, data-parallel training is an effective strategy for distributed training on large-scale training data.

In data-parallel training, multiple training nodes need to send the calculated gradients out for an aggregation operation during the backward pass of network training. However, how to transmit the gradients calculated by multiple training nodes so as to improve the efficiency of distributed training still requires further research.
Summary of the invention

The embodiments of the present application provide a method and device for determining a transmission strategy; transmitting gradients according to the determined transmission strategy can effectively improve the efficiency of distributed training.

In a first aspect, an embodiment of the present application provides a method for determining a transmission strategy. The method may be executed by a computing node and includes:

generating an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the i-th iteration of a first neural network model; determining a communication tail duration corresponding to the i-th transmission strategy, where the communication tail duration corresponding to the i-th transmission strategy indicates the duration between the end time of the i-th iteration of the first neural network model and the start time of the (i+1)-th iteration; and generating an (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy, where the (i+1)-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the (i+1)-th iteration of the first neural network model; where i = 1, 2, ..., X-1, X is the number of iterations of the first neural network model, and X is an integer greater than 1.

With the above method, after generating the i-th transmission strategy, the computing node can obtain the communication tail duration corresponding to the i-th transmission strategy, and can therefore complete one round of reinforcement learning based on that duration, so that the generated (i+1)-th transmission strategy tends toward the optimal transmission strategy (that is, the transmission strategy that minimizes the communication tail duration), which helps improve the efficiency of distributed training.
In a possible design, after the i-th transmission strategy is generated, the method further includes: sending the i-th transmission strategy to W training nodes, where the W training nodes are used for distributed training of the first neural network model.

Determining the communication tail duration corresponding to the i-th transmission strategy includes: receiving the i-th communication tail durations taken by the W training nodes to transmit, using the i-th transmission strategy, the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model; and determining the communication tail duration corresponding to the i-th transmission strategy according to the W i-th communication tail durations, where W is an integer greater than or equal to 1.
In a possible design, generating the i-th transmission strategy includes: generating the i-th transmission strategy through a second neural network model.

In a possible design, generating the (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy includes: updating the parameters of the second neural network model according to the communication tail duration corresponding to the i-th transmission strategy, and generating the (i+1)-th transmission strategy through the updated second neural network model.

With the above method, the communication tail duration corresponding to the i-th transmission strategy is used as the reinforcement-learning reward to update the parameters of the second neural network model. Since the second neural network model has a strong learning capability, continuously executing this process can make the communication tail duration converge to the optimal value.
In a possible design, the i-th transmission strategy includes a sub-transmission strategy for each layer of the first neural network model, and the sub-transmission strategy of the n-th layer includes first information and second information, where the first information is used to indicate whether to initiate transmission after the gradient of the n-th layer parameters is calculated, and the second information is used to indicate the logical topology used for the transmission; n = 1, 2, ..., N, N is the number of layers of the first neural network model, and N is an integer greater than or equal to 1.

In a possible design, generating the sub-transmission strategy of the (n+1)-th layer through the second neural network model is specifically: using the second information of the sub-transmission strategy of the n-th layer as the input of the second neural network model to generate the first information of the sub-transmission strategy of the (n+1)-th layer, and using the first information of the sub-transmission strategy of the (n+1)-th layer as the input of the second neural network model to generate the second information of the sub-transmission strategy of the (n+1)-th layer.
In a possible design, generating the i-th transmission strategy includes: generating the i-th transmission strategy through the Q table used to record state-action pairs in the Q-learning algorithm; the Q table includes P states and Q actions, where the P states correspond to P data volume thresholds and the Q actions correspond to Q combinations of a state transition amount and a logical topology; P and Q are integers greater than or equal to 1.

In a possible design, generating the (i+1)-th transmission strategy includes: updating the Q table according to the communication tail duration corresponding to the i-th transmission strategy, and generating the (i+1)-th transmission strategy through the updated Q table.

With the above method, by constructing the Q table in the Q-learning algorithm, that is, using the data volume threshold as the state-dimension information of the Q table and using the state transition amount and logical topology as the action-dimension information, the i-th transmission strategy can be generated from the Q table. Further, the Q table is updated by using the communication tail duration corresponding to the i-th transmission strategy as the reinforcement-learning reward. Since the Q-learning algorithm can take actions according to the current state and then improve these actions after obtaining the corresponding rewards, it can take better actions, that is, obtain a better transmission strategy.
In a possible design, the i-th transmission strategy includes third information and fourth information. The third information indicates the i-th data-volume threshold for transmitting the gradients of the layer parameters obtained in the i-th iteration of the first neural network model, where the i-th data-volume threshold is used to determine the transmission timing of those gradients; the fourth information indicates the logical topology used for each transmission.

In a possible design, generating the i-th transmission strategy through the Q table includes: obtaining, from the Q table, the reward values of executing the Q actions in the state corresponding to the (i-1)-th data-volume threshold, determining the i-th target action according to those reward values, and generating the i-th transmission strategy. The i-th data-volume threshold is the sum of the (i-1)-th data-volume threshold and the state-transition amount in the combination corresponding to the i-th target action, and the logical topology used for each transmission is the logical topology in that combination.

In a possible design, the maximum of the P data-volume thresholds is determined according to the parameter quantity of the first neural network model, and/or the minimum of the P data-volume thresholds is determined according to a preset transmission efficiency.
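The Q-table design above can be illustrated with a minimal sketch. All concrete values below (the threshold list, the transition amounts, the two topology names, and the learning-rate/discount constants) are hypothetical placeholders, not values given in this application; the reward is taken to be the negative communication tail duration, consistent with shorter tails being better.

```python
import random

# Hypothetical Q table: states are P data-volume thresholds, actions are
# Q (state-transition amount, logical topology) combinations.
STATES = [1, 2, 4, 8, 16]                        # P data-volume thresholds (e.g. MB)
ACTIONS = [(delta, topo) for delta in (-1, 0, 1)
           for topo in ("ring", "tree")]         # Q combinations

q_table = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def choose_action(state, epsilon=0.1):
    """Pick the action with the highest recorded reward value (epsilon-greedy)."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])

def update(state, action, tail_duration, next_state, alpha=0.5, gamma=0.9):
    """Standard Q-learning update, with the measured communication tail
    duration of the i-th strategy fed back as a (negative) reward."""
    reward = -tail_duration
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    q_table[(state, action)] += alpha * (reward + gamma * best_next
                                         - q_table[(state, action)])
```

The target action's transition amount would then be added to the previous data-volume threshold to obtain the i-th threshold, and its topology used for each transmission, as described above.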
According to a second aspect, an embodiment of this application provides a method for determining a transmission strategy. The method may be executed by a compute node and includes:

generating an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the layer parameters obtained in the i-th iteration of a first neural network model; determining a communication tail duration corresponding to the i-th transmission strategy, which indicates the duration between the end of the i-th iteration and the start of the (i+1)-th iteration of the first neural network model; and, if it is determined that the communication tail duration corresponding to the i-th transmission strategy is greater than a first threshold, generating an (i+1)-th transmission strategy according to that communication tail duration, where the (i+1)-th transmission strategy is used to transmit the gradients of the layer parameters obtained in the (i+1)-th iteration; here i = 1, 2, ..., X-1, X is the number of iterations of the first neural network model, and X is an integer greater than 1.

In a possible design, the method further includes:

if it is determined that the communication tail duration corresponding to the i-th transmission strategy is less than or equal to the first threshold, using the i-th transmission strategy as the (i+1)-th transmission strategy.

With the above method, after several rounds of reinforcement learning, if the generated i-th transmission strategy already makes the distributed training efficient (that is, the i-th transmission strategy is a sufficiently good strategy), the compute node may stop generating new strategies; accordingly, the training nodes can use the same transmission strategy (the i-th one) to transmit gradients in the subsequent iterations of the first neural network model. This effectively reduces the processing burden of the compute node, and because the training nodes transmit gradients based on the same strategy without having to receive newly generated strategies from the compute node, it also effectively improves the efficiency of the distributed training.
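The compute node's control flow in this aspect can be sketched as the loop below. The function names and the linear `refine`/`measure_tail` callbacks are illustrative assumptions, not part of this application; they only show the "refine while the tail exceeds the first threshold, otherwise reuse" decision.

```python
def train_with_strategy_search(generate_first, refine, measure_tail,
                               iterations, first_threshold):
    """Sketch of the second-aspect flow: generate a strategy, measure its
    communication tail duration, and refine it only while the tail is
    above the first threshold; otherwise keep reusing the same strategy."""
    strategy = generate_first()
    history = []
    for _ in range(iterations):
        tail = measure_tail(strategy)           # tail under the current strategy
        history.append((strategy, tail))
        if tail > first_threshold:
            strategy = refine(strategy, tail)   # strategy i+1 from the feedback
        # else: strategy i is reused as strategy i+1
    return history
```

With a toy model where refining shortens the tail, the strategy converges and then stays fixed, matching the "no new strategy is generated" behavior described above.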
According to a third aspect, an embodiment of this application provides a method for transmitting gradients. The method may be executed by a training node and includes:

obtaining an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the layer parameters obtained in the i-th iteration of a first neural network model; and

using the i-th transmission strategy to transmit the gradients of the layer parameters obtained in the i-th iteration of the first neural network model.

In a possible design, the i-th transmission strategy includes a sub-strategy for each layer of the first neural network model; the sub-strategy of the n-th layer includes first information and second information, where the first information indicates whether a transmission is initiated after the gradients of the n-th layer parameters are computed, and the second information indicates the logical topology used for the transmission.

Using the i-th transmission strategy to transmit the gradients obtained in the i-th iteration then includes: in the i-th iteration, after the gradients of each layer's parameters are computed, transmitting them according to that layer's sub-strategy. For example, if the first information of the n-th layer's sub-strategy indicates that a transmission is initiated after the gradients of the n-th layer parameters are computed, then once those gradients are computed, the gradients to be transmitted are sent over the logical topology indicated by the second information of the n-th layer's sub-strategy.
In a possible design, the i-th transmission strategy includes third information and fourth information; the third information indicates the i-th data-volume threshold for transmitting the gradients of the layer parameters obtained in the i-th iteration of the first neural network model, where the i-th data-volume threshold is used to determine the transmission timing of those gradients, and the fourth information indicates the logical topology used for each transmission.

Using the i-th transmission strategy to transmit the gradients obtained in the i-th iteration then includes: in the i-th iteration, after the gradients of each layer's parameters are computed, if it is determined that the data volume of the gradients to be transmitted is greater than or equal to the i-th data-volume threshold, transmitting those gradients over the logical topology indicated by the fourth information.
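The threshold-based timing can be sketched as follows. The layer names, sizes, and the `send` callback are hypothetical; the sketch only shows the rule that buffered gradients are flushed once their accumulated data volume reaches the i-th threshold, with the remainder flushed at the end of the iteration.

```python
def transmit_by_threshold(layer_gradients, threshold, send):
    """Buffer per-layer gradients as they are computed (back to front) and
    flush the buffer over the indicated topology whenever its accumulated
    size reaches the i-th data-volume threshold."""
    buffer, size = [], 0
    for name, grad_size in layer_gradients:   # gradients arrive layer by layer
        buffer.append(name)
        size += grad_size
        if size >= threshold:
            send(buffer)                      # transmit over the fourth-info topology
            buffer, size = [], 0
    if buffer:
        send(buffer)                          # flush what remains at iteration end
```

For example, with layers of sizes 2, 1, and 4 and a threshold of 3, the first two layers are batched into one transmission and the third layer triggers its own.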
In a possible design, obtaining the i-th transmission strategy includes: receiving the i-th transmission strategy sent by a compute node.

After using the i-th transmission strategy to transmit the gradients of the layer parameters obtained in the i-th iteration of the first neural network model, the method further includes: sending to the compute node the i-th communication tail duration of transmitting, with the i-th transmission strategy, the gradients of the layer parameters obtained in the i-th iteration.

In a possible design, after sending the i-th communication tail duration to the compute node, the method further includes:

receiving the (i+1)-th transmission strategy sent by the compute node, where the (i+1)-th transmission strategy is generated according to the communication tail duration corresponding to the i-th transmission strategy, and the communication tail duration corresponding to the i-th transmission strategy is obtained from the i-th communication tail duration; here i = 1, 2, ..., X-1, and X is the number of iterations of the first neural network model.
According to a fourth aspect, an embodiment of this application provides an apparatus. The apparatus may be a compute node or a training node, the computer device where the compute node or training node resides, or a semiconductor chip disposed in that computer device. The apparatus has the functions of implementing the possible designs of the first to third aspects. The functions may be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more units or modules corresponding to the above functions.

According to a fifth aspect, an embodiment of this application provides an apparatus that includes a processor, a memory, and instructions stored in the memory and executable on the processor; when the instructions are executed, the apparatus performs the methods described in the possible designs of the first to third aspects.

According to a sixth aspect, an embodiment of this application provides a computer-readable storage medium that includes instructions which, when run on a computer, cause the computer to perform the method described in any possible design of the first to third aspects.

According to a seventh aspect, an embodiment of this application provides a computer program product which, when run on a computer, causes the computer to perform the method described in any possible design of the first to third aspects.
These and other aspects of this application will be clearer and easier to understand in the description of the following embodiments.
Description of the Drawings

Fig. 1 is a schematic diagram of an artificial intelligence main framework provided by an embodiment of this application;

Fig. 2a is a schematic diagram of a first neural network model provided by an embodiment of this application;

Fig. 2b is a schematic diagram of a centralized distributed training system provided by an embodiment of this application;

Fig. 2c is a schematic diagram of a decentralized distributed training system provided by an embodiment of this application;

Fig. 2d is a schematic diagram of transmission in the decentralized distributed training system provided by an embodiment of this application;

Fig. 2e is a possible schematic diagram of asynchronous parallel computation and communication provided by an embodiment of this application;

Fig. 2f is a schematic diagram of the relationship between the amount of transmitted data and the communication tail duration provided by an embodiment of this application;

Fig. 3 is a schematic diagram of an architecture to which this application is applicable;

Fig. 4 is a schematic flowchart corresponding to a method for determining a transmission strategy provided by an embodiment of this application;

Fig. 5a is a schematic diagram of a second neural network model generating a transmission strategy;

Fig. 5b is an overall schematic diagram of implementation 1 provided by an embodiment of this application;

Fig. 6a is a schematic diagram of determining a data-volume threshold;

Fig. 6b is an overall schematic diagram of implementation 2 provided by an embodiment of this application;

Fig. 7 is a possible exemplary block diagram of an apparatus for determining a transmission strategy involved in an embodiment of this application;

Fig. 8 is a schematic diagram of an apparatus for determining a transmission strategy provided by an embodiment of this application.
Detailed Description

To make the objectives, technical solutions, and advantages of this application clearer, this application is further described in detail below with reference to the accompanying drawings.

First, some terms used in this application are explained to facilitate understanding by those skilled in the art.
(1) An artificial neural network (ANN), or simply neural network (NN), is, in machine learning and cognitive science, a mathematical or computational model that imitates the structure and function of biological neural networks (the central nervous system of animals, especially the brain) and is used to estimate or approximate functions. A neural network performs computation through a large number of connected artificial neurons. In most cases, an artificial neural network can change its internal structure based on external information; it is an adaptive system with, informally speaking, the capability to learn.

(2) A loss function is, in statistics, a function that measures the degree of loss or error. In a neural network, it can be understood as a function that measures the difference between the values predicted by the model and the label values of the training data; a neural network model can be trained with the goal of minimizing the loss function.
(3) Gradient descent is a first-order optimization algorithm, also commonly called the method of steepest descent. To find a local minimum of a function using gradient descent, one searches iteratively, stepping a prescribed distance from the current point in the direction opposite to the gradient (or approximate gradient) of the function at that point.

(4) A gradient is a vector; it is the direction in which the directional derivative of a function at a point attains its maximum. In a neural network, each parameter can be updated based on its gradient, thereby gradually approaching the minimum of the network's loss function.

(5) The backpropagation (BP) algorithm, short for "error backpropagation algorithm", is a common method for training artificial neural networks, used in combination with an optimization method such as gradient descent. It computes the gradient of the loss function with respect to all the weights in the neural network and feeds the gradients to the optimization method, which updates the weights to minimize the loss function.
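Definitions (3)-(5) can be illustrated with a one-parameter example. The loss function, learning rate, and step count below are illustrative choices, not values from this application.

```python
def gradient_descent(grad_fn, w0, lr=0.1, steps=50):
    """Minimal gradient descent per definition (3): repeatedly step against
    the gradient direction. grad_fn returns dL/dw at the current point w."""
    w = w0
    for _ in range(steps):
        w -= lr * grad_fn(w)
    return w

# Example loss L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3); the minimum is at w = 3.
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

In an actual network, backpropagation plays the role of `grad_fn`, producing the gradient of the loss with respect to every weight, layer by layer from the output back to the input.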
(6) Training node: also called a worker or working node. A training node may be a GPU or a central processing unit (CPU), without specific limitation. Compute node: may likewise be a GPU or a CPU, without specific limitation.

(7) GPU: also called a display core, vision processor, display chip, or graphics chip, is a microprocessor that specializes in graphics computation on personal computers, workstations, game consoles, and some mobile devices (such as tablet computers and smartphones).

(8) Ordinal numbers such as "first" and "second" in the embodiments of this application are only for convenience of description; they neither limit the scope of the embodiments nor indicate a sequence. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" can mean: A alone, both A and B, or B alone. "At least one" means one or more; "at least two" means two or more. "At least one of", "any one of", and similar expressions refer to any combination of the listed items, including any combination of single items or plural items. For example, "at least one of a, b, or c" can mean: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be singular or plural.
Fig. 1 shows a schematic diagram of an artificial intelligence main framework. The framework describes the overall workflow of an artificial intelligence system and is applicable to general requirements in the artificial intelligence field.

The main framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).

The "intelligent information chain" reflects the sequence of processes from data acquisition to processing, for example the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes the condensation from "data" to "information" to "knowledge" to "wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of intelligence and information (providing and processing technical realizations) to the industrial ecology of the system.
(1) Infrastructure: the infrastructure provides computing-power support for the artificial intelligence system, realizes communication with the outside world, and provides support through the basic platform. Communication with the outside is done through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, and FPGA); the basic platform includes platform assurance and support such as a distributed computing framework and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to acquire data, and the data is provided to the intelligent chips of the basic platform for computation.

(2) Data: the data above the infrastructure layer indicates the data sources of the artificial intelligence field. The data involves graphics, images, speech, and text, as well as Internet-of-Things data from traditional devices, including business data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.

(3) Data processing: data processing usually includes methods such as data training (for example deep learning and reinforcement learning), searching, reasoning, and decision-making.

Deep learning and reinforcement learning are important parts of artificial intelligence, and both belong to machine learning. Deep learning refers to using existing data to train an algorithm to find patterns that solve the corresponding problem, and then using those patterns to make predictions on new data. Reinforcement learning learns mainly through trial and error, that is, it determines the best answer by performing actions a limited number of times so as to maximize the reward. The difference between the two is that deep learning learns from a training set and then applies the learned knowledge to new data sets, which is static learning, whereas reinforcement learning adjusts its own actions through continuous feedback to obtain the best result, which is a process of constant trial and error and dynamic learning. It should be noted that deep learning and reinforcement learning are not mutually exclusive concepts; the two can be used in combination, for example deep learning can be used within reinforcement learning.

Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formalized information to perform machine thinking and problem solving according to reasoning control strategies; typical functions are searching and matching. Decision-making refers to the process of making decisions on intelligent information after reasoning, and usually provides functions such as classification, ranking, and prediction.

(4) General capabilities: after the data has undergone the data processing mentioned above, some general capabilities can be formed based on the processing results, for example an algorithm or a general system, such as translation, text analysis, computer-vision processing, speech recognition, and image recognition.

(5) Intelligent products and industry applications: intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The application fields mainly include intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, autonomous driving, safe city, and intelligent terminals.
The embodiments of this application mainly study the data-training part of the framework shown in Fig. 1, and in particular the question of how to transmit the computed gradients during the distributed training of a first neural network model with training data, so as to improve the efficiency of the distributed training.
Fig. 2a is a schematic diagram of the first neural network model. As shown in Fig. 2a, the first neural network model includes multiple layers, and each layer includes at least one parameter. Training the first neural network model means determining optimal parameter values from massive training data, so that the difference between the actual output data produced by the model for the training data and the expected output data meets the requirements. In Fig. 2a, the first neural network model includes N layers, from layer 1 to layer N, and each layer has its place in the order: layer 1 directly receives the training data, and layer N directly outputs the data. In one example, the first neural network model can be trained with the backpropagation algorithm; for one iteration, this specifically includes: inputting the training data; computing the actual output data from layer 1 to layer N according to the training data (the forward computation); computing the loss function value according to the difference between the actual output data and the expected output data; and computing the gradients of the parameters from layer N to layer 1 according to the loss function value and updating the parameters with the gradients (the backward computation).
Taking data-parallel distributed training of the first neural network model with multiple training nodes as an example: in one iteration, for any parameter (say parameter a), the gradients of parameter a computed by the different training nodes may differ, so the training nodes need to transmit the computed gradients of parameter a in order to determine the gradient average. The training nodes can then obtain parameter a updated with the gradient average. After updating the parameters of each layer, the training nodes use the updated parameters of each layer to perform the next iteration of the first neural network model.
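The averaging step described above can be sketched in a few lines. The element-wise averaging and the simple gradient-descent update rule are illustrative; as noted later, the specific update manner is not limited in this application.

```python
def average_gradients(per_node_grads):
    """Each node computed its own gradient vector for the same parameters;
    the exchanged gradients are averaged element-wise across the nodes."""
    n = len(per_node_grads)
    length = len(per_node_grads[0])
    return [sum(g[i] for g in per_node_grads) / n for i in range(length)]

def apply_update(params, avg_grads, lr=0.1):
    """One possible update: step each parameter against the averaged gradient."""
    return [p - lr * g for p, g in zip(params, avg_grads)]
```

After this update, every node holds the same parameter values and can start the next iteration from an identical model copy.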
Further, when multiple training nodes perform data-parallel distributed training of the first neural network model, the distributed training can be carried out in a centralized manner or in a decentralized manner. The two manners are described in detail below.
Fig. 2b is a schematic diagram of a centralized distributed training system. The system includes a central server (which may also be called a parameter server or central node) and at least one training node (such as training node 1, training node 2, and training node 3 shown in Fig. 2b); the parameter server can communicate with the at least one training node. Each training node holds a copy of the first neural network model and can train it with designated data blocks (shards).

Specifically, in one example, the distributed training process is: the training data set is split into multiple data blocks; the j-th data block is divided into 3 mini-batches, which are trained by the 3 training nodes respectively. During training, each training node can send the gradients of the computed parameters to the parameter server according to the same rule (for example, after a training node computes the gradients of one layer's parameters, it sends those gradients to the parameter server). Taking parameter a as an example, the parameter server can determine the gradient average of parameter a from the received gradients, update parameter a based on that average (the specific update manner is not limited), and feed the updated parameter a back to the 3 training nodes. In this way, the 3 training nodes complete the update of all layer parameters of the first neural network model and, based on the updated parameters, can perform the next iteration using the mini-batches of the (j+1)-th data block.
Fig. 2c is a schematic diagram of a decentralized distributed training system. The system includes at least one training node (such as training node 1, training node 2, training node 3, training node 4, and training node 5 shown in Fig. 2c). The training nodes can communicate with one another, for example to transmit gradients.

Exemplarily, the training nodes may have an order in which they transmit data to one another; for example, training node 1 can only transmit data to training node 2, training node 2 can only transmit data to training node 3, and training node 3 can only transmit data to training node 1. The data-transmission order among the training nodes may be pre-configured, or may be computed and determined by the training nodes according to a specific rule.
Specifically, in an example, the distributed training process may be as follows: the training data set is divided into multiple data blocks; for the j-th data block, the j-th data block may be divided into 5 mini-batches, which are trained separately by the 5 training nodes. During training, each training node may send the computed parameter gradients to the other training nodes according to the data transmission order. When one training node transmits its computed gradients to another training node, it may divide the gradients into multiple groups by quantity; each group is a slice, each slice includes at least one gradient, and the number of slices is the same as the number of training nodes training the first neural network model. If there are 5 training nodes, the buffered gradients are divided into 5 slices. When the buffered gradients are divided into slices, the number of gradients in each slice is generally the same; of course, when the gradients cannot be divided evenly, the number of gradients in each slice may be roughly the same. It should be noted that every training node uses the same slicing rule. For example, suppose there are 5 training nodes in total: if each training node buffers 10 gradients to be transmitted, each slice may include 2 gradients; if each training node buffers 11 gradients to be transmitted, the gradients may be cut into 5 slices of sizes 2, 2, 2, 2, and 3, that is, 4 slices each include 2 gradients and 1 slice includes 3 gradients.
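The even-as-possible slicing rule described above can be sketched as follows; the function name and the list-based representation of gradients are illustrative, not part of the application:

```python
def split_into_slices(gradients, num_nodes):
    """Split a list of gradients into num_nodes slices whose sizes
    differ by at most one, matching the slicing rule described above."""
    base, extra = divmod(len(gradients), num_nodes)
    slices, start = [], 0
    for k in range(num_nodes):
        # When the count is not evenly divisible, the last `extra`
        # slices each receive one additional gradient (e.g. 2,2,2,2,3).
        size = base + (1 if k >= num_nodes - extra else 0)
        slices.append(gradients[start:start + size])
        start += size
    return slices
```

For 11 gradients and 5 nodes this yields slice sizes 2, 2, 2, 2, 3, as in the example above.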
As shown in Figure 2d, training node i cuts its buffered gradients into 5 slices, identified as ai to ei, where i ranges from 1 to 5 and ai to ei are the slice identifiers. That is, training node 1 cuts its buffered gradients into 5 slices a1 to e1, training node 2 cuts its buffered gradients into 5 slices a2 to e2, and so on. For a given slice identifier (for example a, b, c, d, or e), the parameters corresponding to the gradients included in the slices with that identifier are the same across nodes. For example, if slice a1 includes two gradients, namely the gradients of parameter R and parameter Y, then slices a2, a3, a4, and a5 each also include two gradients, which are likewise the gradients of parameter R and parameter Y.
When training node 1 transmits its buffered gradients to training node 2, training node 1 first sends slice a1 to training node 2. After receiving slice a1 from training node 1, training node 2 adds the received slice a1 to its own slice a2 and sends the sum to training node 3 as a single slice a1+a2. Slice a1 includes the gradients of parameter R and parameter Y, denoted r1 and y1 respectively; slice a2 includes the gradients of parameter R and parameter Y, denoted r2 and y2 respectively. Sending the sum of slice a1 and slice a2 to training node 3 as slice a1+a2 may mean that, for parameter R, the sum of its two gradients r1 and r2 is carried in slice a1+a2 as the gradient of parameter R, and for parameter Y, the sum of its two gradients y1 and y2 is carried in slice a1+a2 as the gradient of parameter Y. After receiving slice a1+a2 from training node 2, training node 3 adds it to its own slice a3 and sends the sum to training node 4 as slice a1+a2+a3. After receiving slice a1+a2+a3 from training node 3, training node 4 adds it to its own slice a4 and sends the sum to training node 5 as slice a1+a2+a3+a4. After receiving slice a1+a2+a3+a4 from training node 4, training node 5 may add it to its own slice a5 and send the sum to training node 1 as slice a1+a2+a3+a4+a5; training node 5 also calculates the gradient average from slice (a1+a2+a3+a4) and its own slice a5, and sends the gradient average calculated for slice a to training node 1. Training node 1 then sends the gradient average calculated for slice a to training node 2, training node 2 sends it to training node 3, and training node 3 sends it to training node 4. In this way, training nodes 1 to 5 all obtain the gradient average calculated for slice a, and can use it to update the values of parameter R and parameter Y corresponding to slice a, for use in the next iteration.
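The slice-a exchange described above is the reduce phase of a ring-style all-reduce followed by circulating the average back around the ring. A minimal single-process simulation of the arithmetic (not of the actual inter-node communication) might look like this; the data values are illustrative:

```python
def ring_allreduce_average(node_slices):
    """Simulate the arithmetic of the ring exchange above for one slice
    identifier: node_slices[i] is node i's local slice (a list of gradient
    values for the same parameters). The partial sum travels around the
    ring, the last node averages it, and the average then circulates so
    that every node holds it. Returns that per-position average."""
    w = len(node_slices)
    acc = list(node_slices[0])          # node 1 sends its slice first
    for i in range(1, w):
        # Each node adds its own slice to the received partial sum.
        acc = [a + b for a, b in zip(acc, node_slices[i])]
    # The final node on the ring computes the average and forwards it.
    return [a / w for a in acc]
```

For five nodes whose slice-a values are [1,2], [3,4], [5,6], [7,8], [9,10], every node ends up holding the average [5.0, 6.0].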
When training node 2 transmits its buffered gradients to training node 3, training node 2 first sends slice b2 to training node 3. After receiving slice b2 from training node 2, training node 3 adds the received slice b2 to its own slice b3 and sends the sum to training node 4 as a single slice b2+b3; and so on, following a process similar to the above, until training node 1 sends the calculated sum as a single slice (b1+b2+b3+b4+b5) to training node 5; alternatively, training node 1 calculates the gradient average for slice b and sends the gradient average calculated for slice b to training node 2. Training node 2 then sends the gradient average calculated for slice b to training node 3, training node 3 sends it to training node 4, and training node 4 sends it to training node 5. In this way, training nodes 1 to 5 all obtain the gradient average calculated for slice b, and can use it to update the parameter values corresponding to slice b, for use in the next iteration.
Similarly, training node 3 first sends slice c3 to training node 4; ...; training node 2 calculates the gradient average for slice c, and training node 2 sends the gradient average calculated for slice c to training node 3; ...; until training nodes 1 to 5 have all obtained the gradient average calculated for slice c, whereupon training nodes 1 to 5 may use it to update the parameter values corresponding to slice c.
Similarly, training node 4 first sends slice d4 to training node 5; ...; training node 3 calculates the gradient average for slice d, and training node 3 sends the gradient average calculated for slice d to training node 4; ...; until training nodes 1 to 5 have all obtained the gradient average calculated for slice d, whereupon training nodes 1 to 5 may use it to update the parameter values corresponding to slice d.
Similarly, training node 5 first sends slice e5 to training node 1; ...; training node 4 calculates the gradient average for slice e, and training node 4 sends the gradient average calculated for slice e to training node 5; ...; until training nodes 1 to 5 have all determined the gradient average calculated for slice e, whereupon training nodes 1 to 5 may use it to update the parameter values corresponding to slice e.
After the gradient averages are determined in the foregoing manner and used to update the parameter values, the next iteration can be performed, based on the updated parameters, using the mini-batches in the (j+1)-th data block.
It can be understood that the training process described above based on Figure 2c is described merely by taking, as an example, one possible logical topology (namely a ring) used to transmit gradients among training nodes 1 to 5. In other possible examples, other possible logical topologies may also be used to transmit gradients among training nodes 1 to 5, which is not specifically limited.
It should be noted that only a small number of training nodes are shown in Figures 2b and 2c above; in specific implementations, the number of training nodes may be far greater than 5. Further, taking a decentralized distributed training system as an example, the system may include one or more computer devices, and one or more training nodes may be deployed on each computer device. Training nodes deployed on the same computer device may communicate through a communication bus, and training nodes deployed on different computer devices may communicate through a network (for example, a wireless network).
According to the descriptions of Figures 2b and 2c above, a training node needs to send gradients out for an aggregation operation (that is, computing the gradient average so that the parameters can be updated) during the backward computation of neural network model training. Further, to improve training efficiency, the training node also needs to overlap the layer-by-layer computation of the first neural network's training with the gradient transmission, that is, computation and communication run asynchronously in parallel. Figure 2e is one possible schematic diagram of asynchronous parallel computation and communication. As shown in Figure 2e, in the i-th iteration of the first neural network model, after the gradient of the layer-N parameters is computed, the gradient of the layer-N parameters can be sent out (with transmission delay denoted τN); after the gradient of the layer-(N-1) parameters is computed, the gradient of the layer-(N-1) parameters can be sent out (transmission delay denoted τN-1); and so on, until the gradient of the layer-1 parameters is computed and sent out (transmission delay denoted τ1). In this way, after the gradients of the parameters of all layers of the first neural network model have been sent out, the parameters of each layer can be updated and the next iteration executed. However, when the layer-by-layer computation of the first neural network's training and the gradient transmission run asynchronously in parallel, the last transmission must be executed after the gradient of the layer-1 parameters is computed, which gives rise to a communication tail. If the communication tail duration of the i-th iteration (namely τ1) is long, the time interval between the i-th iteration and the (i+1)-th iteration becomes long, resulting in low efficiency of distributed training of the first neural network model.
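Under the overlap scheme above, the communication tail is the time the link is still busy after the layer-1 gradient finishes computing. A rough, illustrative model of this quantity, assuming all transfers are serialized on a single link (the function and its inputs are hypothetical, not part of the application):

```python
def communication_tail(compute_end_times, transfer_durations):
    """Rough model of the communication tail under computation/communication
    overlap: layer gradients become ready at compute_end_times (layer N
    first, layer 1 last), each transfer occupies the shared link for its
    duration, and the tail is the link time remaining after the backward
    pass (the last compute_end_time) finishes."""
    link_free = 0.0
    for ready, duration in zip(compute_end_times, transfer_durations):
        start = max(link_free, ready)  # wait for both the gradient and the link
        link_free = start + duration
    backward_done = compute_end_times[-1]
    return link_free - backward_done
```

For example, with gradients ready at times 1, 2, 3 and transfer durations 1, 1, 2, the last transfer starts at time 3 and the tail is 2 time units.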
In distributed training scenarios, many factors affect the communication tail duration, such as the number of training nodes in the distributed training system, the physical networking mode, the amount of data transmitted (that is, the amount of gradient data per transmission), the logical topology used for each gradient transmission, communication scheduling overhead, network congestion delay, and so on. The distributed training system may use any of multiple physical networking modes, which may involve, for example, InfiniBand, remote direct memory access (RDMA) over Converged Ethernet (RoCE), peripheral component interconnect express (PCIe), NVLink interconnects, and so on. Various logical topologies may be used to transmit gradients, such as a logical tree, a ring, halving & doubling, a hierarchical ring, a hybrid topology, and so on.
Further, the factors affecting the communication tail duration may interact with one another. For example, with the physical networking mode of the distributed training system and the number of training nodes fixed, if different logical topologies (for example topo0 and topo1) are used to transmit gradients, the curves of transmitted data amount versus communication tail duration show different trends. As illustrated in Figure 2f, when the amount of transmitted data is less than M0, the transmission delay with topo0 is smaller than with topo1; otherwise, the transmission delay with topo1 is smaller than with topo0.
In addition, in the gradient transmission manner illustrated in Figure 2e, the gradient of a layer's parameters is transmitted as soon as it is computed; however, different neural network models often differ in parameter quantity and in the distribution of parameters across layers. For example, within a single neural network model, some layers have many parameters and others have few; for a layer with few parameters, initiating a transmission is clearly inefficient (considering factors such as communication overhead). Therefore, the computed gradients may be accumulated to a certain amount before a single communication transmission is initiated; for instance, a transmission may be initiated after the gradients of two or three layers of parameters have been computed, so as to improve transmission efficiency. As another example, across different neural network models, some models have a relatively even parameter distribution, whereas the parameters of other models may be concentrated in a few layers, which may produce bursty transmissions.
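The idea of merging the gradients of several small layers into one transmission can be sketched as a greedy grouping under an assumed minimum-size threshold; the names and threshold are illustrative only:

```python
def group_layers_by_size(layer_sizes, min_bytes):
    """Greedily merge consecutive layers (in the order their gradients
    are produced by backward computation, i.e. last layer first) until
    a group reaches min_bytes, so that small layers share a single
    transmission instead of each initiating one."""
    groups, current, acc = [], [], 0
    for layer, size in layer_sizes:
        current.append(layer)
        acc += size
        if acc >= min_bytes:
            groups.append(current)
            current, acc = [], 0
    if current:  # flush the remainder even if it is below the threshold
        groups.append(current)
    return groups
```

For instance, with per-layer gradient sizes [("L3", 5), ("L2", 2), ("L1", 2)] and a threshold of 4, layer L3 is sent alone while L2 and L1 share one transmission.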
In view of the foregoing, to improve distributed training efficiency, different transmission strategies need to be designed for different distributed training systems and neural network models.
At present, some deep learning frameworks (such as TensorFlow) and third-party libraries (such as Horovod and OpenMPI) provide transmission strategies with custom mechanisms. For example, Horovod allows a user to set a data-amount threshold or a time threshold; during backward computation, if the amount of accumulated gradient data reaches the data-amount threshold or the transmission time interval reaches the time threshold, a transmission is initiated. However, because this approach provides no basis for determining the data-amount threshold or the time threshold, the thresholds a user sets based on personal experience may not be reasonable enough, so the goal of improving distributed training efficiency may not be achieved.
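The threshold mechanism described above can be paraphrased as follows; this is a simplified sketch of the described behavior with hypothetical threshold values, not Horovod's actual implementation or defaults:

```python
class FusionBuffer:
    """Buffer gradients and trigger a transmission when either the
    buffered byte count reaches byte_threshold or the time since the
    last flush reaches time_threshold (the user-set thresholds)."""
    def __init__(self, byte_threshold, time_threshold):
        self.byte_threshold = byte_threshold
        self.time_threshold = time_threshold
        self.buffered = 0
        self.last_flush = 0.0
        self.flushes = []  # byte counts of each initiated transmission

    def add(self, nbytes, now):
        self.buffered += nbytes
        if (self.buffered >= self.byte_threshold
                or now - self.last_flush >= self.time_threshold):
            self.flushes.append(self.buffered)
            self.buffered = 0
            self.last_flush = now
```

As the paragraph above notes, how well this performs depends entirely on how reasonable the two thresholds are.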
Based on this, an embodiment of this application provides a method for determining a transmission strategy, which specifically includes: generating an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the i-th iteration of a first neural network model; determining a communication tail duration corresponding to the i-th transmission strategy, where the communication tail duration corresponding to the i-th transmission strategy is used to indicate the duration between the end time of the i-th iteration and the start time of the (i+1)-th iteration of the first neural network model; and generating an (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy, where the (i+1)-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the (i+1)-th iteration of the first neural network model. The foregoing method may be executed by a computing node. In this way, after generating the i-th transmission strategy, the computing node can obtain the communication tail duration corresponding to the i-th transmission strategy, and can thus complete one round of reinforcement learning based on that duration, so that the generated (i+1)-th transmission strategy tends toward the optimal transmission strategy (namely the transmission strategy that minimizes the communication tail duration), which helps improve the efficiency of distributed training.
Further, the computing node can interact with the multiple training nodes that perform distributed training on the first neural network model, and continuously try and update transmission strategies based on the communication tail durations fed back by the multiple training nodes after each completed iteration. In this way, a near-optimal transmission strategy can be produced intelligently and automatically through reinforcement learning, improving the efficiency of distributed training.
Figure 3 is a schematic diagram of an architecture to which the embodiments of this application are applicable. As shown in Figure 3, the architecture includes a computing node and a distributed training system. The distributed training system may be the centralized distributed training system shown in Figure 2b, the decentralized distributed training system shown in Figure 2c, or another possible distributed training system, which is not specifically limited; Figure 3 merely takes the decentralized distributed training system shown in Figure 2c as an example.
In an example, the computing node may include an agent executor, and each training node may include an estimator; for example, training node 1 includes estimator 1, training node 2 includes estimator 2, ..., and training node 5 includes estimator 5.
Specifically, the agent executor may be a set of reinforcement learning networks or algorithms, mainly used to generate the i-th transmission strategy, send the i-th transmission strategy to each of estimators 1 to 5, and update its own parameters according to the rewards (specifically, the communication tail durations) fed back by estimators 1 to 5. Taking estimator 1 as an example, estimator 1 is mainly used to obtain the i-th transmission strategy generated by the agent executor, start one iteration of the first neural network model, transmit the gradients obtained by backward computation according to the i-th transmission strategy, measure the communication tail duration, and feed the communication tail duration back to the agent executor as the reward; for the other estimators, refer to the description of estimator 1, and details are not repeated. In this way, through repeated iterations, the agent executor can continuously learn and evolve from the real model training environment, eventually tending to produce the optimal transmission strategy.
It should be noted that the foregoing example is described by taking the case where each training node corresponds to one estimator; in other possible examples, multiple training nodes may correspond to one estimator, which is not specifically limited.
Based on the architecture illustrated in Figure 3, Figure 4 is a schematic flowchart corresponding to a method for determining a transmission strategy provided by an embodiment of this application. As shown in Figure 4, the method includes:
Step 401: The computing node generates an i-th transmission strategy and sends the i-th transmission strategy to each of W training nodes, where the W training nodes are used to perform distributed training on the first neural network model; the i-th transmission strategy is used by the W training nodes to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model.
Step 402: A training node receives the i-th transmission strategy, and sends to the computing node an i-th communication tail duration obtained when using the i-th transmission strategy to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model. The training node described here may be any one of the W training nodes.
Step 403: The computing node receives the W i-th communication tail durations sent by the W training nodes.
In the embodiments of this application, the i-th communication tail duration obtained by a training node (that is, the communication tail duration of the i-th iteration) may be obtained based on the duration between the end time of the i-th iteration and the start time of the (i+1)-th iteration performed by the training node on the first neural network model. In an example, the i-th communication tail duration obtained by the training node may be equal to that duration. Taking training node 1 as an example, the i-th communication tail duration obtained by training node 1 may be equal to the duration between the end time of training node 1's i-th iteration of the first neural network model and the start time of its (i+1)-th iteration.
Step 404: The computing node determines, according to the W i-th communication tail durations, the communication tail duration corresponding to the i-th transmission strategy. The communication tail duration corresponding to the i-th transmission strategy is used to indicate the duration between the end time of the i-th iteration and the start time of the (i+1)-th iteration of the first neural network model. Since W training nodes perform distributed training on the first neural network model, this can also be understood as follows: the communication tail duration corresponding to the i-th transmission strategy reflects the duration between the end time of the W training nodes' i-th iteration of the first neural network model and the start time of their (i+1)-th iteration.
In the embodiments of this application, the computing node may determine the communication tail duration corresponding to the i-th transmission strategy from the W i-th communication tail durations in multiple specific ways. For example, the computing node may determine the average of the W i-th communication tail durations as the communication tail duration corresponding to the i-th transmission strategy; that is, the communication tail duration corresponding to the i-th transmission strategy equals the average, over the W training nodes, of the duration between the end time of the i-th iteration and the start time of the (i+1)-th iteration of the first neural network model.
Step 405: The computing node generates an (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy, and sends the (i+1)-th transmission strategy to the W training nodes; the (i+1)-th transmission strategy is used by the W training nodes to transmit the gradients of the parameters of each layer obtained in the (i+1)-th iteration of the first neural network model.
Here, W is an integer greater than or equal to 1; i = 1, 2, ..., X-1, where X is the number of iterations of the first neural network model and X is an integer greater than 1. In the embodiments of this application, the value of X may be obtained according to the number of data blocks into which the training data set of the first neural network model is divided, each data block being used for one iteration of the first neural network model (see the descriptions of Figures 2b and 2c above). In an example, X equals the number of data blocks into which the training data set of the first neural network model is divided.
Specifically, for step 405 above, in an example, if the computing node determines that the communication tail duration corresponding to the i-th transmission strategy is greater than a first threshold, it may generate the (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy; if the computing node determines that the communication tail duration corresponding to the i-th transmission strategy is less than or equal to the first threshold, it may use the i-th transmission strategy as the (i+1)-th transmission strategy. That is to say, once the computing node, through multiple rounds of reinforcement learning, has generated an i-th transmission strategy that yields high distributed training efficiency (that is, the i-th transmission strategy is a relatively good transmission strategy), it may stop generating new transmission strategies; correspondingly, the training nodes may use the same transmission strategy (namely the i-th transmission strategy) to transmit gradients in subsequent iterations of the first neural network model. The first threshold may be set according to actual needs and experience. This manner can effectively reduce the processing burden of the computing node, and the training nodes can transmit gradients based on the same transmission strategy without receiving newly generated transmission strategies from the computing node, which can effectively improve the efficiency of distributed training.
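Steps 404 and 405 under this example can be sketched as follows; `regenerate` is a placeholder standing in for the reinforcement-learning update, which this sketch does not fix:

```python
def next_strategy(current_strategy, tail_durations, first_threshold, regenerate):
    """Aggregate the per-node tail durations (here by averaging, one of
    the options in step 404), then either keep the current strategy if
    the tail is already short enough, or ask the learner to generate a
    new one from the measured tail (step 405)."""
    mean_tail = sum(tail_durations) / len(tail_durations)
    if mean_tail <= first_threshold:
        return current_strategy          # strategy is good enough; reuse it
    return regenerate(mean_tail)         # otherwise learn a new strategy
```

In the variant described two paragraphs below, the threshold check is simply skipped and `regenerate` is called on every iteration.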
It should be noted that, in the foregoing example, whether the i-th transmission strategy is a relatively good transmission strategy is determined by comparing the communication tail duration corresponding to the i-th transmission strategy with the first threshold; in other possible examples, whether the i-th transmission strategy is a relatively good transmission strategy may also be determined in other ways, which is not specifically limited.
In yet another example, considering that some dynamic and unstable factors (such as communication scheduling overhead and network congestion delay) may exist among the factors affecting the communication tail duration, and that these dynamic and unstable factors may prevent a transmission strategy determined by the computing node from applying effectively across many iterations of the first neural network model, the computing node may, throughout the overall training of the first neural network model, determine the transmission strategy used in each iteration according to the communication tail duration of the previous iteration. That is, after determining the communication tail duration corresponding to the i-th transmission strategy, the computing node may directly generate the (i+1)-th transmission strategy according to that duration, without judging whether the communication tail duration corresponding to the i-th transmission strategy is greater than the first threshold, so that the transmission strategy can be adjusted in time according to changes in such factors, improving the efficiency of distributed training. The following description is mainly based on this example.
In the embodiments of this application, the computing node may generate a transmission strategy based on a variety of possible reinforcement learning methods. Two possible implementations are described below by way of example.
(1) Implementation 1
The computing node may generate the i-th transmission strategy through a second neural network model, update the parameters of the second neural network model according to the communication tailing duration corresponding to the i-th transmission strategy, and generate the (i+1)-th transmission strategy through the updated second neural network model. The second neural network model may be a recurrent neural network (RNN) model, such as a long short-term memory (LSTM) network. The computing node may update the parameters of the second neural network model in multiple ways, for example using the proximal policy optimization (PPO) algorithm or the asynchronous advantage actor-critic (A3C) algorithm. With this method, the communication tailing duration corresponding to the i-th transmission strategy is used as the reinforcement-learning reward to update the parameters of the second neural network model. Because the second neural network model (such as an RNN model) has strong learning capability, continuously performing this process allows the communication tailing duration to converge to an optimal value.
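The reward-driven update loop described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the LSTM controller is replaced by a simple softmax policy over three candidate topologies, the PPO/A3C update is replaced by a plain REINFORCE policy-gradient step, and the tailing durations are simulated; all names and numbers are illustrative assumptions.

```python
import math
import random

TOPOLOGIES = ["topo0", "topo1", "topo2"]   # illustrative topology space
theta = [0.0] * len(TOPOLOGIES)            # controller parameters

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample_strategy():
    # Sample a topology index from the current policy distribution.
    probs = softmax(theta)
    r, acc = random.random(), 0.0
    for idx, p in enumerate(probs):
        acc += p
        if r <= acc:
            return idx
    return len(probs) - 1

def measure_tail_duration(topo_idx):
    # Stand-in for the duration reported by the training nodes.
    base = [0.9, 0.4, 0.6][topo_idx]
    return base + random.uniform(0.0, 0.05)

def update(topo_idx, reward, lr=0.1):
    # REINFORCE step: grad of log pi(a) = one_hot(a) - probs.
    probs = softmax(theta)
    for k in range(len(theta)):
        grad = (1.0 if k == topo_idx else 0.0) - probs[k]
        theta[k] += lr * reward * grad

random.seed(0)
for _ in range(500):                       # one update per training iteration
    a = sample_strategy()
    tail = measure_tail_duration(a)
    update(a, reward=-tail)                # shorter tailing -> larger reward

probs = softmax(theta)
print(max(range(3), key=lambda k: probs[k]))   # tends toward the fastest topology
```

In the same spirit as the embodiment, the negative tailing duration serves as the reward, so the controller gradually shifts probability mass toward strategies that shorten the tailing time.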
Exemplarily, the i-th transmission strategy may include a sub-transmission strategy for each layer of the first neural network model, where the sub-transmission strategy of the n-th layer includes first information and second information: the first information indicates whether to initiate a transmission after the gradient of the n-th layer parameters is computed, and the second information indicates the logical topology used for the transmission. In one example, the i-th transmission strategy may be a sequence in which the form {first information (whether to communicate), second information (logical topology)} is repeated a certain number of times (the number depending on the number of layers of the first neural network model); each {first information, second information} pair can be understood as the sub-transmission strategy of one layer of the first neural network model. For example, if the first neural network model includes 3 layers, the i-th transmission strategy may be [{yes, topo0}, {yes, topo1}, {yes, topo1}]. After receiving the i-th transmission strategy, the training node uses topo0 to send the gradient of the layer-3 parameters once that gradient is computed, uses topo1 to send the gradient of the layer-2 parameters once that gradient is computed, and uses topo1 to send the gradient of the layer-1 parameters once that gradient is computed. In the embodiments of this application, a logical topology space may be preset, containing multiple candidate logical topologies; the logical topology indicated by the second information may be one of the topologies in this space.
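The per-layer application of such a strategy on a training node can be sketched as follows. The names, the gradient placeholders, and the send function are hypothetical stand-ins; the strategy list is ordered as the gradients become available during the backward pass (layer N first, layer 1 last), matching the 3-layer example above.

```python
def apply_transmission_strategy(strategy, gradients_in_backward_order, send_fn):
    """strategy and gradients are both ordered as produced by the backward
    pass, i.e. layer N first, layer 1 last. Each sub-strategy is a pair
    (first information: send or not, second information: logical topology)."""
    sent = []
    for (send, topo), (layer, grad) in zip(strategy, gradients_in_backward_order):
        if send:                      # first information: initiate transmission?
            send_fn(grad, topo)       # second information: which topology to use
            sent.append((layer, topo))
    return sent

# The 3-layer example from the text: [{yes, topo0}, {yes, topo1}, {yes, topo1}].
log = []
strategy = [(True, "topo0"), (True, "topo1"), (True, "topo1")]
grads = [(3, "grad3"), (2, "grad2"), (1, "grad1")]
result = apply_transmission_strategy(
    strategy, grads, lambda g, t: log.append((g, t)))
print(result)   # [(3, 'topo0'), (2, 'topo1'), (1, 'topo1')]
```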
Specifically, the computing node may generate the transmission strategy through the second neural network model in a self-excitation loop. The specific implementation is described below with reference to FIG. 5a. As shown in FIG. 5a, the initial input of the second neural network model may be random content (for example, the identifier of a random logical topology in the logical topology space); the first information of the layer-1 sub-transmission strategy is generated from this initial input. Feeding the first information of the layer-1 sub-transmission strategy back into the second neural network model produces the second information of the layer-1 sub-transmission strategy; feeding that second information back in produces the first information of the layer-2 sub-transmission strategy, and so on.
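The two-time-steps-per-layer generation loop can be sketched as follows. A random stand-in `cell_step` takes the place of the actual second neural network model (for example an LSTM), and the topology space is illustrative; the point of the sketch is only the feedback structure, in which each output becomes the next input and odd/even time steps alternate between the first and second information.

```python
import random

TOPOLOGY_SPACE = ["topo0", "topo1", "topo2"]   # preset logical topology space

def cell_step(prev_output, step, rng):
    # Placeholder for one forward step of the RNN controller.
    # (The stand-in ignores its input; a real RNN would condition on it.)
    if step % 2 == 0:
        return rng.choice([True, False])        # first information: send or not
    return rng.choice(TOPOLOGY_SPACE)           # second information: topology

def generate_strategy(num_layers, rng):
    out = rng.choice(TOPOLOGY_SPACE)            # random initial input
    strategy = []
    for layer in range(num_layers):             # two time steps per layer
        first = cell_step(out, 2 * layer, rng)
        second = cell_step(first, 2 * layer + 1, rng)
        strategy.append((first, second))
        out = second                            # feeds the next layer's step
    return strategy

rng = random.Random(7)
strategy = generate_strategy(3, rng)            # 3 layers -> 3*2 time steps
```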
It can be seen that the second neural network model generates the sub-transmission strategy of one layer of the first neural network model every two time steps, so the first transmission strategy is generated after n*2 time steps. The computing node sends the first transmission strategy to the W training nodes (corresponding to step 401). Each training node receives the first transmission strategy, uses it to transmit the gradients of the layer parameters obtained in the first iteration of the first neural network model, and sends the resulting first communication tailing duration to the computing node (corresponding to step 402). The computing node receives the first communication tailing durations sent by the W training nodes and determines the communication tailing duration corresponding to the first transmission strategy (corresponding to steps 403 and 404). The computing node then updates the second neural network model according to that duration and, based on the updated model, generates the second transmission strategy after another n*2 time steps (corresponding to step 405), and sends the second transmission strategy to the W training nodes. Steps 401 to 405 are executed in a loop in this way until the training of the first neural network model is completed.
FIG. 5b is an overall schematic diagram of Implementation 1 provided by an embodiment of this application. As shown in FIG. 5b, an agent executor may run the second neural network model and the parameter update algorithm, while an evaluator is responsible for adding a communication operator (send operator) to the backward computation of each layer of the first neural network model (for example, evaluator 1 adds a communication operator to the backward computation of each layer of the first neural network model on training node 1). According to the transmission strategy produced by the agent executor, the evaluator then controls whether each communication operator initiates a transmission and which logical topology it uses. At the end of the i-th iteration of the first neural network model, the evaluator feeds the communication tailing duration corresponding to the i-th transmission strategy back to the agent executor. The agent executor treats this duration as the "reward" obtained from interacting with the environment, computes a policy gradient according to a policy-gradient method (such as PPO), updates the parameters of the second neural network model, and produces a new transmission strategy (the (i+1)-th transmission strategy), completing one round of reinforcement learning. In this way, after a period of repeated iteration, the agent executor can generate an approximately optimal transmission strategy for the specific physical networking mode and the first neural network model.
(2) Implementation 2
The computing node may generate the i-th transmission strategy through the Q table (Q-Table) used to record state-action pairs in the Q-learning algorithm, update the Q table according to the communication tailing duration corresponding to the i-th transmission strategy, and generate the (i+1)-th transmission strategy through the updated Q table. The Q table includes P states and Q actions: the P states correspond to P data-volume thresholds, and the Q actions correspond to Q combinations of a state transition amount and a logical topology, where P and Q are both integers greater than or equal to 1. With this method, by constructing the Q table of the Q-learning algorithm, with the data-volume thresholds as the state-dimension information and the (state transition amount, logical topology) combinations as the action-dimension information, the i-th transmission strategy can be generated from the Q table. Further, the Q table is updated by using the communication tailing duration corresponding to the i-th transmission strategy as the reinforcement-learning reward. Because the Q-learning algorithm takes an action based on the current state, obtains the corresponding reward, and then improves its actions, it can progressively produce better actions, that is, better transmission strategies.
In one example, referring to FIG. 6a, the computing node may determine the minimum of the P data-volume thresholds according to a preset transmission efficiency. For example, the preset transmission efficiency may be the lowest acceptable transmission efficiency: the computing node finds the logical topology with the smallest transmission volume according to this efficiency and takes the data volume corresponding to that topology as the minimum data-volume threshold. The computing node may determine the maximum of the P data-volume thresholds according to the parameter count of the first neural network model, for example by taking a certain proportion (such as 50% or 80%) of the parameter count as the maximum data-volume threshold. Further, the difference between any two of the P data-volume thresholds may be an integer multiple of a preset step size (denoted m), so that P = (Mmax - Mmin)/m + 1.
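A worked sketch of the resulting threshold grid follows; the concrete numbers are illustrative assumptions, not values from the embodiment.

```python
def threshold_states(m_min, m_max, step):
    """Enumerate the P data-volume thresholds Mmin, Mmin+m, ..., Mmax."""
    assert (m_max - m_min) % step == 0, "range must be a multiple of the step"
    p = (m_max - m_min) // step + 1       # P = (Mmax - Mmin)/m + 1
    return [m_min + k * step for k in range(p)]

# e.g. a model with 10M parameters, Mmax = 50% of that, Mmin = 1M, m = 1M:
states = threshold_states(1_000_000, 5_000_000, 1_000_000)
print(len(states))   # 5 states: 1M, 2M, 3M, 4M, 5M
```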
Table 1 shows an example of the Q table.
Table 1: Example of a Q table
(Table 1 is provided as an image in the original filing; its structure, as described in the surrounding text, is as follows.)

  State \ Action   a1 (+m, Topo1)   a2 (-m, Topo1)   ...   aQ (0, Topok)
  s1 (Mmin)        Q(s1, a1)        Q(s1, a2)        ...   Q(s1, aQ)
  ...              ...              ...              ...   ...
  sP (Mmax)        Q(sP, a1)        Q(sP, a2)        ...   Q(sP, aQ)
In Table 1, "+m" means adding m to the current state: for example, if the current state is Mmin, "+m" means transitioning to the state Mmin+m. "-m" means subtracting m from the current state: for example, if the current state is Mmax, "-m" means transitioning to the state Mmax-m. "0" means keeping the current state threshold M unchanged. "Topok" denotes the logical topology used for transmission. Q(s1, a1) represents the reward for executing the action (+m, Topo1) when the current state is Mmin, where executing the action (+m, Topo1) means generating a transmission strategy a whose data-volume threshold is Mmin+m and whose logical topology for each transmission is Topo1.
Exemplarily, the i-th transmission strategy includes third information and fourth information. The third information indicates the i-th data-volume threshold for transmitting the gradients of the layer parameters obtained in the i-th iteration of the first neural network model; this threshold is used to determine the transmission timing of those gradients. The fourth information indicates the logical topology used for each transmission. For example, after computing the gradient of the n-th layer parameters of the first neural network model, if the training node determines that the data volume of the accumulated gradients (that is, the gradients awaiting transmission) is greater than or equal to the i-th data-volume threshold, it initiates a transmission using the logical topology indicated by the fourth information.
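The accumulate-then-send behaviour described above can be sketched as follows. All names are hypothetical and the gradient sizes are illustrative; gradients are listed in the order the backward pass produces them (layer N down to layer 1).

```python
def backward_with_threshold(layer_grad_sizes, threshold, send_fn, topo):
    """layer_grad_sizes lists (layer, gradient size) in backward order.
    A transmission is initiated whenever the accumulated data volume
    reaches the threshold carried by the third information; the topology
    comes from the fourth information."""
    pending, pending_size, batches = [], 0, []
    for layer, size in layer_grad_sizes:
        pending.append(layer)
        pending_size += size
        if pending_size >= threshold:        # accumulated volume reached
            send_fn(list(pending), topo)
            batches.append(list(pending))
            pending, pending_size = [], 0
    if pending:                              # flush the remainder at the end
        send_fn(list(pending), topo)
        batches.append(list(pending))
    return batches

sent = []
batches = backward_with_threshold(
    [(3, 40), (2, 70), (1, 30)], threshold=100,
    send_fn=lambda layers, topo: sent.append((layers, topo)), topo="topo1")
print(batches)   # [[3, 2], [1]]
```

With these illustrative sizes, layers 3 and 2 (40 + 70 = 110 >= 100) are sent together, and layer 1's gradient is flushed at the end of the iteration.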
Specifically, before generating the first transmission strategy, the computing node may generate an initialized Q table, in which the values Q(s1, a1), Q(s1, a2), ... may be a set of random numbers drawn from a Gaussian distribution. Based on the initialized Q table, the computing node takes the state corresponding to Mmin as the current state, determines the first target action (for example, (+m, topo2)) according to the reward values of the Q actions in that state, generates the first transmission strategy, and sends it to the W training nodes (corresponding to step 401). Here, the first data-volume threshold is the sum of the state transition amount in the combination corresponding to the first target action and the data-volume threshold corresponding to the current state (that is, Mmin+m), and the logical topology used for each transmission is the logical topology in that combination (that is, topo2). Correspondingly, each training node receives the first transmission strategy, uses it to transmit the gradients of the layer parameters obtained in the first iteration of the first neural network model, and sends the resulting first communication tailing duration to the computing node (corresponding to step 402). The computing node receives the first communication tailing durations sent by the W training nodes and determines the communication tailing duration corresponding to the first transmission strategy (corresponding to steps 403 and 404). The computing node then updates the initialized Q table according to that duration and generates the second transmission strategy based on the updated Q table (corresponding to step 405), and sends the second transmission strategy to the W training nodes. Steps 401 to 405 are executed in a loop in this way until the training of the first neural network model is completed.
It should be noted that there may be multiple ways for the computing node to determine the i-th target action according to the reward values of the Q actions. For example, the computing node may select the action with the largest reward value among the Q actions as the i-th target action, or it may select the action with the second-largest reward value; this is not specifically limited. Likewise, there may be multiple ways for the computing node to update the Q table according to the communication tailing duration corresponding to the i-th transmission strategy; for example, the computing node may update the Q table using the Bellman equation, which is also not specifically limited.
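A minimal sketch of such a Bellman-equation update follows, with the negative communication tailing duration as the reward. The learning rate, discount factor, table contents, and reward shaping are illustrative assumptions; the embodiment does not fix them.

```python
def select_action(q_table, state):
    # e.g. pick the action with the largest reward value in this state
    return max(q_table[state], key=q_table[state].get)

def bellman_update(q_table, state, action, reward, next_state,
                   alpha=0.5, gamma=0.9):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(q_table[next_state].values())
    q_table[state][action] += alpha * (
        reward + gamma * best_next - q_table[state][action])

# Tiny Q table: 2 states x 2 actions, initialised to zero for readability
# (the embodiment initialises with Gaussian random numbers).
q = {"s1": {("+m", "topo1"): 0.0, ("0", "topo2"): 0.0},
     "s2": {("+m", "topo1"): 0.0, ("0", "topo2"): 0.0}}

tail_duration = 0.4        # stand-in for the value reported by training nodes
bellman_update(q, "s1", ("+m", "topo1"),
               reward=-tail_duration, next_state="s2")
print(round(q["s1"][("+m", "topo1")], 2))   # -0.2
```

Actions that produce shorter tailing durations receive less-negative rewards, so repeated updates raise their Q values relative to slower alternatives.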
FIG. 6b is an overall schematic diagram of Implementation 2 provided by an embodiment of this application. As shown in FIG. 6b, an agent executor may run a Q-learning algorithm and its functional components, interacting with the evaluators on the training nodes to determine and modify the data-volume threshold and to update the Q table. Each evaluator is responsible for controlling the transmission of gradients and for feeding the communication tailing duration back to the agent executor as the reward. For example, evaluator 1 controls the transmission of the gradients computed by training node 1: when the data volume of the gradients accumulated by training node 1 exceeds the data-volume threshold, a transmission is initiated, and the communication tailing duration of training node 1's gradient transmission is fed back to the executor as the reward.
The foregoing describes the solutions provided by the embodiments of this application mainly from the perspective of the execution procedure. It can be understood that, to implement the foregoing functions, the computing node and the training node may include corresponding hardware structures and/or software modules for performing each function. Those skilled in the art should readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the embodiments of this application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementation should not be considered beyond the scope of this application.
Where integrated units (modules) are used, FIG. 7 shows a possible exemplary block diagram of the apparatus for determining a transmission strategy involved in the embodiments of this application; the apparatus 700 may exist in the form of software. The apparatus 700 may include a generating unit 702 and a determining unit 703, which may be collectively referred to as a processing unit used to control and manage the actions of the apparatus 700. The apparatus 700 may further include a communication unit 704 configured to support communication between the apparatus 700 and other nodes. Optionally, the communication unit 704 may also be referred to as a transceiver unit and may include a receiving unit and/or a sending unit, configured to perform receiving and sending operations respectively. The apparatus 700 may further include a storage unit 701 configured to store the program code and/or data of the apparatus 700.
The generating unit 702 and the determining unit 703 may be a processor or a controller, which may implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of the embodiments of this application. The communication unit 704 may be a communication interface, a transceiver, a transceiver circuit, or the like, where "communication interface" is a general term and, in a specific implementation, may include multiple interfaces. The storage unit 701 may be a memory.
The apparatus 700 may be the computing node in any of the foregoing embodiments. The generating unit 702 and the determining unit 703 can support the apparatus 700 in performing the actions of the computing node in the foregoing method examples. Alternatively, the generating unit 702 and the determining unit 703 mainly perform the internal actions of the computing node in the method examples, while the communication unit 704 supports communication between the apparatus 700 and the training nodes. For example, the generating unit 702 is configured to perform step 401 in FIG. 4 and the action of generating the (i+1)-th transmission strategy in step 405; the determining unit 703 is configured to perform step 404 in FIG. 4; and the communication unit 704 is configured to perform step 403 in FIG. 4 and the action of sending the (i+1)-th transmission strategy in step 405.
Specifically, in one embodiment, the generating unit 702 is configured to generate an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the layer parameters obtained in the i-th iteration of the first neural network model;
the determining unit 703 is configured to determine a communication tailing duration corresponding to the i-th transmission strategy, where this duration indicates the length of time between the end of the i-th iteration and the start of the (i+1)-th iteration of the first neural network model;
the generating unit 702 is further configured to generate an (i+1)-th transmission strategy according to the communication tailing duration corresponding to the i-th transmission strategy, where the (i+1)-th transmission strategy is used to transmit the gradients of the layer parameters obtained in the (i+1)-th iteration of the first neural network model; here i = 1, 2, ..., X-1, X is the number of iterations of the first neural network model, and X is an integer greater than 1.
In a possible design, the generating unit 702 is specifically configured to generate the i-th transmission strategy through a second neural network model.
In a possible design, the generating unit 702 is specifically configured to update the parameters of the second neural network model according to the communication tailing duration corresponding to the i-th transmission strategy, and to generate the (i+1)-th transmission strategy through the updated second neural network model.
In a possible design, the i-th transmission strategy includes a sub-transmission strategy for each layer of the first neural network model, where the sub-transmission strategy of the n-th layer includes first information and second information: the first information indicates whether to initiate a transmission after the gradient of the n-th layer parameters is computed, and the second information indicates the logical topology used for the transmission;
here n = 1, 2, ..., N, N is the number of layers of the first neural network model, and N is an integer greater than or equal to 1.
In a possible design, the generating unit 702 generates the sub-transmission strategy of the (n+1)-th layer through the second neural network model as follows:
the second information of the sub-transmission strategy of the n-th layer is used as the input of the second neural network model to generate the first information of the sub-transmission strategy of the (n+1)-th layer, and the first information of the sub-transmission strategy of the (n+1)-th layer is used as the input of the second neural network model to generate the second information of the sub-transmission strategy of the (n+1)-th layer.
In a possible design, the generating unit 702 is specifically configured to:
generate the i-th transmission strategy through the Q table used to record state-action pairs in the Q-learning algorithm, where the Q table includes P states and Q actions, the P states correspond to P data-volume thresholds, the Q actions correspond to Q combinations of a state transition amount and a logical topology, and P and Q are both integers greater than or equal to 1.
In a possible design, the generating unit 702 is specifically configured to:
update the Q table according to the communication tailing duration corresponding to the i-th transmission strategy, and generate the (i+1)-th transmission strategy through the updated Q table.
In a possible design, the i-th transmission strategy includes third information and fourth information, where the third information indicates the i-th data-volume threshold for transmitting the gradients of the layer parameters obtained in the i-th iteration of the first neural network model, the i-th data-volume threshold is used to determine the transmission timing of those gradients, and the fourth information indicates the logical topology used for each transmission.
In a possible design, the generating unit 702 is specifically configured to:
obtain, according to the Q table, the reward values for executing the Q actions in the state corresponding to the (i-1)-th data-volume threshold, determine the i-th target action according to those reward values, and generate the i-th transmission strategy, where the i-th data-volume threshold is the sum of the state transition amount in the combination corresponding to the i-th target action and the (i-1)-th data-volume threshold, and the logical topology used for each transmission is the logical topology in the combination corresponding to the i-th target action.
In a possible design, the maximum of the P data-volume thresholds is determined according to the parameter count of the first neural network model, and/or the minimum of the P data-volume thresholds is determined according to a preset transmission efficiency.
It should be noted that the division into units (modules) in the embodiments of this application is illustrative and is merely a logical function division; other division manners are possible in actual implementation. The functional modules in the embodiments of this application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium may be any medium capable of storing program code, such as a memory.
Referring to FIG. 8, a schematic diagram of an apparatus for determining a transmission strategy according to an embodiment of this application is shown. The apparatus may be the above-mentioned computer device for performing the actions performed by a computing node, or a semiconductor chip disposed in that computer device. The apparatus 800 includes a memory 801, a processor 802, and a communication interface 803. The processor 802 has the function of implementing the actions performed by the generating unit 702 and the determining unit 703 in FIG. 7. Optionally, the apparatus 800 may further include a bus 804. The communication interface 803, the processor 802, and the memory 801 may be connected to each other through the communication line 804; the communication line 804 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The communication line 804 may be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in FIG. 8, but this does not mean that there is only one bus or only one type of bus.
The processor 802 may be one or more CPUs (or GPUs), or one or more integrated circuits configured to control the execution of programs of the solutions of this application. The communication interface 803 uses any transceiver-like device to communicate with the training nodes. The memory 801 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through the communication line 804, or may be integrated with the processor. The memory 801 is used to store computer-executable instructions for executing the solutions of this application, and execution is controlled by the processor 802. The processor 802 is configured to execute the computer-executable instructions stored in the memory 801, so as to implement the methods provided in the foregoing embodiments of this application.
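The method implemented by the processor can be pictured as a feedback loop over training iterations: a policy generator emits a transmission strategy, the iteration runs, the communication tail duration (the gap between the end of iteration i and the start of iteration i+1) is measured, and that duration drives the generation of the next strategy. The sketch below is only a toy illustration of that loop: the greedy threshold update, the simulated timing, and all names are assumptions, not the patented implementation (which uses a second neural network or a Q-learning Q table as the generator).

```python
import random

def run_iteration(strategy):
    """Stand-in for one training iteration; returns a simulated
    communication tail duration (seconds). A real system would measure
    the gap between iteration i's end and iteration i+1's start."""
    # In this toy cost model a ~8 MB threshold overlaps communication
    # with computation best; the noise term stands in for system jitter.
    return abs(strategy["threshold_mb"] - 8) * 0.01 + random.uniform(0, 0.005)

def next_strategy(strategy, tail_duration, step_mb=1.0):
    """Toy policy update: nudge the data-amount threshold in the
    direction that reduced the tail duration last time."""
    candidate = dict(strategy)
    direction = candidate.get("direction", 1)
    if tail_duration > candidate.get("last_tail", float("inf")):
        direction = -direction  # the last move made things worse; reverse it
    candidate["threshold_mb"] = max(1.0, candidate["threshold_mb"] + direction * step_mb)
    candidate["direction"] = direction
    candidate["last_tail"] = tail_duration
    return candidate

def train(num_iterations=20, seed=0):
    random.seed(seed)
    # A strategy here carries a data-amount threshold and a logical topology.
    strategy = {"threshold_mb": 16.0, "topology": "ring"}
    tails = []
    for _ in range(num_iterations):
        tail = run_iteration(strategy)            # iteration i under strategy i
        tails.append(tail)
        strategy = next_strategy(strategy, tail)  # strategy i+1 from tail of i
    return strategy, tails

final, tails = train()
```

The hill-climbing update is merely the simplest possible generator; the claims replace it with a second neural network (claims 2 to 5) or a Q table (claims 6 to 9) that is updated from the same measured tail duration.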
Optionally, the computer-executable instructions in the embodiments of this application may also be referred to as application program code; this is not specifically limited in the embodiments of this application.
It should be noted that the above method and apparatus are based on the same inventive concept. Since the method and the apparatus solve the problem on similar principles, the implementations of the apparatus and the method may refer to each other, and repeated descriptions are omitted.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state disk (SSD)), or the like. This application is described with reference to the flowcharts and/or block diagrams of the method, device (system), and computer program product according to this application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus. The instruction apparatus implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, a person skilled in the art can make various changes and modifications to this application without departing from the scope of this application. If these modifications and variations fall within the scope of the claims of this application and their equivalent technologies, this application is also intended to cover them.

Claims (23)

  1. A method for determining a transmission strategy, wherein the method comprises:
    generating an i-th transmission strategy, wherein the i-th transmission strategy is used to transmit gradients of the parameters of each layer obtained in an i-th iteration of a first neural network model;
    determining a communication tail duration corresponding to the i-th transmission strategy, wherein the communication tail duration corresponding to the i-th transmission strategy indicates the duration between the end time of the i-th iteration and the start time of the (i+1)-th iteration of the first neural network model; and
    generating an (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy, wherein the (i+1)-th transmission strategy is used to transmit gradients of the parameters of each layer obtained in the (i+1)-th iteration of the first neural network model;
    wherein i = 1, 2, ..., X-1, X is the number of iterations of the first neural network model, and X is an integer greater than 1.
  2. The method according to claim 1, wherein generating the i-th transmission strategy comprises:
    generating the i-th transmission strategy through a second neural network model.
  3. The method according to claim 2, wherein generating the (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy comprises:
    updating parameters of the second neural network model according to the communication tail duration corresponding to the i-th transmission strategy, and generating the (i+1)-th transmission strategy through the updated second neural network model.
  4. The method according to claim 2 or 3, wherein the i-th transmission strategy comprises a sub-transmission strategy for each layer of the first neural network model, and the sub-transmission strategy of an n-th layer comprises first information and second information, the first information indicating whether to initiate transmission after the gradients of the parameters of the n-th layer are computed, and the second information indicating the logical topology used for the transmission;
    wherein n = 1, 2, ..., N, N is the number of layers of the first neural network model, and N is an integer greater than or equal to 1.
  5. The method according to claim 4, wherein generating the sub-transmission strategy of an (n+1)-th layer through the second neural network model specifically comprises:
    using the second information of the sub-transmission strategy of the n-th layer as input to the second neural network model to generate the first information of the sub-transmission strategy of the (n+1)-th layer, and using the first information of the sub-transmission strategy of the (n+1)-th layer as input to the second neural network model to generate the second information of the sub-transmission strategy of the (n+1)-th layer.
  6. The method according to claim 1, wherein generating the i-th transmission strategy comprises:
    generating the i-th transmission strategy through a Q table used to record state-action pairs in a Q-learning algorithm, wherein the Q table comprises P states and Q actions, the P states respectively correspond to P data amount thresholds, and the Q actions respectively correspond to Q combinations each formed by a state transition amount and a logical topology; wherein P and Q are both integers greater than or equal to 1.
  7. The method according to claim 6, wherein generating the (i+1)-th transmission strategy comprises:
    updating the Q table according to the communication tail duration corresponding to the i-th transmission strategy, and generating the (i+1)-th transmission strategy through the updated Q table.
  8. The method according to claim 6 or 7, wherein the i-th transmission strategy comprises third information and fourth information; wherein the third information indicates an i-th data amount threshold for transmitting the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model, the i-th data amount threshold being used to determine the transmission timing of those gradients; and the fourth information indicates the logical topology used for each transmission.
  9. The method according to claim 8, wherein generating the i-th transmission strategy through the Q table comprises:
    obtaining, according to the Q table, reward values for executing the Q actions in the state corresponding to the (i-1)-th data amount threshold, determining an i-th target action according to the reward values of the Q actions, and generating the i-th transmission strategy; wherein the i-th data amount threshold is the sum of the state transition amount in the combination corresponding to the i-th target action and the (i-1)-th data amount threshold, and the logical topology used for each transmission is the logical topology in the combination corresponding to the i-th target action.
  10. The method according to claim 6, wherein the largest of the P data amount thresholds is determined according to the parameter count of the first neural network model, and/or the smallest of the P data amount thresholds is determined according to a preset transmission efficiency.
  11. An apparatus for determining a transmission strategy, wherein the apparatus comprises:
    a generating unit, configured to generate an i-th transmission strategy, wherein the i-th transmission strategy is used to transmit gradients of the parameters of each layer obtained in an i-th iteration of a first neural network model; and
    a determining unit, configured to determine a communication tail duration corresponding to the i-th transmission strategy, wherein the communication tail duration corresponding to the i-th transmission strategy indicates the duration between the end time of the i-th iteration and the start time of the (i+1)-th iteration of the first neural network model;
    wherein the generating unit is further configured to generate an (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy, the (i+1)-th transmission strategy being used to transmit gradients of the parameters of each layer obtained in the (i+1)-th iteration of the first neural network model;
    wherein i = 1, 2, ..., X-1, X is the number of iterations of the first neural network model, and X is an integer greater than 1.
  12. The apparatus according to claim 11, wherein the generating unit is specifically configured to generate the i-th transmission strategy through a second neural network model.
  13. The apparatus according to claim 12, wherein the generating unit is specifically configured to: update parameters of the second neural network model according to the communication tail duration corresponding to the i-th transmission strategy, and generate the (i+1)-th transmission strategy through the updated second neural network model.
  14. The apparatus according to claim 12 or 13, wherein the i-th transmission strategy comprises a sub-transmission strategy for each layer of the first neural network model, and the sub-transmission strategy of an n-th layer comprises first information and second information, the first information indicating whether to initiate transmission after the gradients of the parameters of the n-th layer are computed, and the second information indicating the logical topology used for the transmission;
    wherein n = 1, 2, ..., N, N is the number of layers of the first neural network model, and N is an integer greater than or equal to 1.
  15. The apparatus according to claim 14, wherein the generating unit generates the sub-transmission strategy of an (n+1)-th layer through the second neural network model specifically by:
    using the second information of the sub-transmission strategy of the n-th layer as input to the second neural network model to generate the first information of the sub-transmission strategy of the (n+1)-th layer, and using the first information of the sub-transmission strategy of the (n+1)-th layer as input to the second neural network model to generate the second information of the sub-transmission strategy of the (n+1)-th layer.
  16. The apparatus according to claim 11, wherein the generating unit is specifically configured to:
    generate the i-th transmission strategy through a Q table used to record state-action pairs in a Q-learning algorithm, wherein the Q table comprises P states and Q actions, the P states respectively correspond to P data amount thresholds, and the Q actions respectively correspond to Q combinations each formed by a state transition amount and a logical topology; wherein P and Q are both integers greater than or equal to 1.
  17. The apparatus according to claim 16, wherein the generating unit is specifically configured to:
    update the Q table according to the communication tail duration corresponding to the i-th transmission strategy, and generate the (i+1)-th transmission strategy through the updated Q table.
  18. The apparatus according to claim 16 or 17, wherein the i-th transmission strategy comprises third information and fourth information; wherein the third information indicates an i-th data amount threshold for transmitting the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model, the i-th data amount threshold being used to determine the transmission timing of those gradients; and the fourth information indicates the logical topology used for each transmission.
  19. The apparatus according to claim 18, wherein the generating unit is specifically configured to:
    obtain, according to the Q table, reward values for executing the Q actions in the state corresponding to the (i-1)-th data amount threshold, determine an i-th target action according to the reward values of the Q actions, and generate the i-th transmission strategy; wherein the i-th data amount threshold is the sum of the state transition amount in the combination corresponding to the i-th target action and the (i-1)-th data amount threshold, and the logical topology used for each transmission is the logical topology in the combination corresponding to the i-th target action.
  20. The apparatus according to claim 16, wherein the largest of the P data amount thresholds is determined according to the parameter count of the first neural network model, and/or the smallest of the P data amount thresholds is determined according to a preset transmission efficiency.
  21. An apparatus for determining a transmission strategy, wherein the apparatus comprises a processor, a memory, and instructions stored in the memory and executable on the processor, which, when executed, cause the apparatus to perform the method according to any one of claims 1 to 10.
  22. A computer-readable storage medium, comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 10.
  23. A computer program product which, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 10.
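Claims 6 to 9 describe the Q-learning variant: the Q table's states are P candidate data-amount thresholds, each action is a (state transition amount, logical topology) combination, the next threshold is the current one plus the chosen transition amount, and the action is chosen from the reward values recorded in the table. The sketch below is a textbook Q-learning loop written under those assumptions; the concrete state set, action set, reward shaping, and simulated tail measurement are all hypothetical and not taken from this application, where the reward would come from the measured communication tail duration of each training iteration.

```python
import random

# P states: candidate data-amount thresholds (MB); claims 10/20 bound them by
# the model's parameter count (max) and a preset transmission efficiency (min).
STATES = [2, 4, 8, 16, 32]
# Q actions: (state transition amount, logical topology) combinations.
ACTIONS = [(-1, "ring"), (0, "ring"), (1, "ring"),
           (-1, "tree"), (0, "tree"), (1, "tree")]

def measure_tail(threshold_mb, topology):
    """Simulated communication tail duration; a real system would time the
    gap between iteration i's end and iteration i+1's start."""
    base = abs(threshold_mb - 8) * 0.01          # toy optimum at 8 MB
    return base + (0.002 if topology == "tree" else 0.0)

def choose_action(q_table, state_idx, epsilon=0.1):
    """Epsilon-greedy selection over the recorded reward values."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    row = q_table[state_idx]
    return row.index(max(row))

def train(episodes=200, alpha=0.5, gamma=0.9, seed=0):
    random.seed(seed)
    q_table = [[0.0] * len(ACTIONS) for _ in STATES]  # P x Q table of values
    state = 2  # start at the 8 MB threshold
    for _ in range(episodes):
        action = choose_action(q_table, state)
        delta, topology = ACTIONS[action]
        # next threshold = current threshold index shifted by the transition amount
        next_state = min(max(state + delta, 0), len(STATES) - 1)
        # shorter tail -> higher reward
        reward = -measure_tail(STATES[next_state], topology)
        best_next = max(q_table[next_state])
        q_table[state][action] += alpha * (reward + gamma * best_next
                                           - q_table[state][action])
        state = next_state
    return q_table, state

q_table, final_state = train()
```

Under this toy cost model the learned values steer the threshold toward 8 MB with the ring topology, the minimum of the simulated tail function; in the claimed method the table update after each iteration plays the role of generating the (i+1)-th transmission strategy from the i-th strategy's measured tail.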
PCT/CN2019/076359 2019-02-27 2019-02-27 Method and apparatus for determining transmission policy WO2020172825A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/076359 WO2020172825A1 (en) 2019-02-27 2019-02-27 Method and apparatus for determining transmission policy
CN201980091568.XA CN113412494B (en) 2019-02-27 2019-02-27 Method and device for determining transmission strategy


Publications (1)

Publication Number Publication Date
WO2020172825A1 true WO2020172825A1 (en) 2020-09-03

Family

ID=72238780

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/076359 WO2020172825A1 (en) 2019-02-27 2019-02-27 Method and apparatus for determining transmission policy

Country Status (2)

Country Link
CN (1) CN113412494B (en)
WO (1) WO2020172825A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113570067A (en) * 2021-07-23 2021-10-29 北京百度网讯科技有限公司 Synchronization method, device and program product of distributed system
CN113610241A (en) * 2021-08-03 2021-11-05 曙光信息产业(北京)有限公司 Distributed training method, device, equipment and storage medium for deep learning model
CN114205300A (en) * 2021-12-02 2022-03-18 南开大学 Flow scheduling method capable of guaranteeing flow transmission deadline under condition of incomplete flow information
US11416743B2 (en) * 2019-04-25 2022-08-16 International Business Machines Corporation Swarm fair deep reinforcement learning
CN115829053A (en) * 2022-11-25 2023-03-21 北京百度网讯科技有限公司 Model operation strategy determination method and device, electronic equipment and storage medium
WO2023040794A1 (en) * 2021-09-15 2023-03-23 华为技术有限公司 Communication method and communication apparatus
CN116962438A (en) * 2023-09-21 2023-10-27 浪潮电子信息产业股份有限公司 Gradient data synchronization method, system, electronic equipment and readable storage medium

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
CN116233857A (en) * 2021-12-02 2023-06-06 华为技术有限公司 Communication method and communication device
CN117221944A (en) * 2022-06-02 2023-12-12 华为技术有限公司 Communication method and device

Citations (2)

Publication number Priority date Publication date Assignee Title
CN105894087A (en) * 2015-01-26 2016-08-24 华为技术有限公司 System and method for training parameter set in neural network
WO2019007388A1 (en) * 2017-07-06 2019-01-10 Huawei Technologies Co., Ltd. System and method for deep learning and wireless network optimization using deep learning



Also Published As

Publication number Publication date
CN113412494B (en) 2023-03-17
CN113412494A (en) 2021-09-17


Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19916888; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 19916888; Country of ref document: EP; Kind code of ref document: A1)