WO2020172825A1 - Method and apparatus for determining transmission policy - Google Patents

Method and apparatus for determining transmission policy

Info

Publication number
WO2020172825A1
WO2020172825A1 (PCT/CN2019/076359)
Authority
WO
WIPO (PCT)
Prior art keywords
transmission strategy
neural network
network model
transmission
information
Prior art date
Application number
PCT/CN2019/076359
Other languages
French (fr)
Chinese (zh)
Inventor
Fan Li (范礼)
Wang Haibin (王海彬)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2019/076359 priority Critical patent/WO2020172825A1/en
Priority to CN201980091568.XA priority patent/CN113412494B/en
Publication of WO2020172825A1 publication Critical patent/WO2020172825A1/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/08 — Learning methods

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method and device for determining a transmission strategy.
  • Artificial intelligence comprises the theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
  • Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theories.
  • Model parallelism divides the neural network model into multiple parts, each of which is assigned to a training node for training; however, this incurs heavy communication between training nodes, and cutting the model into parts is itself difficult. Data-parallel training instead divides the training data into multiple training data sets that are handed to multiple training nodes, without cutting the model. Data-parallel training is therefore an effective strategy for distributed training on large-scale training data.
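The data-parallel flow just described can be sketched in a few lines, assuming a toy model whose parameters and gradients are plain Python lists (the function names are illustrative, not from the application):

```python
# Minimal data-parallel sketch: each node computes gradients on its own shard,
# the gradients are averaged, and every replica applies the same update.

def average_gradients(per_node_grads):
    """Average the gradients computed by each training node on its own shard."""
    num_nodes = len(per_node_grads)
    num_params = len(per_node_grads[0])
    return [
        sum(node[p] for node in per_node_grads) / num_nodes
        for p in range(num_params)
    ]

def apply_update(params, avg_grads, lr=0.1):
    """Every node applies the same averaged gradient, keeping replicas in sync."""
    return [w - lr * g for w, g in zip(params, avg_grads)]

# Three nodes, each holding a replica of two parameters.
params = [1.0, 2.0]
grads = [[0.3, 0.6], [0.1, 0.2], [0.2, 0.4]]   # one gradient list per node
avg = average_gradients(grads)                  # ≈ [0.2, 0.4]
params = apply_update(params, avg)              # ≈ [0.98, 1.96]
```

Because every node receives the same averaged gradient, all replicas stay identical across iterations, which is the property the transmission strategies below must preserve.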
  • The embodiments of the present application provide a method and device for determining a transmission strategy; transmitting gradients according to the determined transmission strategy can effectively improve the efficiency of distributed training.
  • an embodiment of the present application provides a method for determining a transmission strategy, the method may be executed by a computing node, and the method includes:
  • generating an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model; determining the communication tail duration corresponding to the i-th transmission strategy; and generating the (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy.
  • After the computing node generates the i-th transmission strategy, it can obtain the communication tail duration corresponding to that strategy, so that a round of reinforcement learning can be completed based on it. The generated (i+1)-th transmission strategy thus tends toward the optimal transmission strategy (that is, the strategy that minimizes the communication tail duration), which helps to improve the efficiency of distributed training.
  • The method further includes: sending the i-th transmission strategy to W training nodes, where the W training nodes are used for distributed training of the first neural network model.
  • Determining the communication tail duration corresponding to the i-th transmission strategy includes: receiving, from the W training nodes, W i-th communication tail durations measured when the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model are transmitted using the i-th transmission strategy; and determining, from the W i-th communication tail durations, the communication tail duration corresponding to the i-th transmission strategy, where W is an integer greater than or equal to 1.
  • generating the i-th transmission strategy includes: generating the i-th transmission strategy through a second neural network model.
  • Generating the (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy includes: updating the parameters of the second neural network model according to the communication tail duration corresponding to the i-th transmission strategy, and generating the (i+1)-th transmission strategy through the updated second neural network model.
  • The communication tail duration corresponding to the i-th transmission strategy is used as the reward of reinforcement learning to update the parameters of the second neural network model. Since the second neural network model has a strong learning capability, continuously executing this process allows the communication tail duration to converge to its optimal value.
  • Generating the sub-transmission strategy of the (n+1)-th layer through the second neural network model specifically comprises: taking the second information of the sub-transmission strategy of the n-th layer as the input of the second neural network model to generate the first information of the sub-transmission strategy of the (n+1)-th layer, and taking the first information of the sub-transmission strategy of the (n+1)-th layer as the input of the second neural network model to generate the second information of the sub-transmission strategy of the (n+1)-th layer.
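As a rough illustration of this alternating generation, and only that, the sketch below emits, layer by layer, first the transmit/hold decision (first information) and then the logical topology (second information), each step conditioned on the previous output. The scoring function is a deterministic stand-in, not the second neural network model:

```python
# Toy sequential strategy generator: each step consumes the previous output,
# mimicking the alternation "second info of layer n -> first info of layer n+1
# -> second info of layer n+1". policy_step is a hypothetical stand-in.

TOPOLOGIES = ["ring", "tree"]

def policy_step(prev_output, step_index):
    """Stand-in for one forward step of the strategy-generating model."""
    return (len(str(prev_output)) + step_index) % 2   # deterministic 0 or 1

def generate_strategy(num_layers):
    strategy = []
    second_info = "start"                             # seed input for layer 1
    for n in range(1, num_layers + 1):
        transmit = bool(policy_step(second_info, 2 * n))            # first information
        second_info = TOPOLOGIES[policy_step(transmit, 2 * n + 1)]  # second information
        strategy.append({"layer": n, "transmit": transmit, "topology": second_info})
    return strategy

for sub in generate_strategy(3):
    print(sub)
```

In the patent's scheme the stand-in would be a learned model whose parameters are updated from the communication tail duration; the chaining of inputs and outputs is the part illustrated here.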
  • Generating the i-th transmission strategy includes: generating the i-th transmission strategy through the Q table used to record state–action values in the Q-learning algorithm. The Q table includes P states and Q actions: the P states correspond to P data volume thresholds, and the Q actions correspond to Q combinations of a state transition amount and a logical topology, where P and Q are integers greater than or equal to 1.
  • Generating the (i+1)-th transmission strategy includes: updating the Q table according to the communication tail duration corresponding to the i-th transmission strategy, and generating the (i+1)-th transmission strategy through the updated Q table.
  • The data volume threshold serves as the state dimension of the Q table, and the state transition amount and logical topology serve as the action dimension of the Q table.
  • Further, the Q table in Q-learning is updated by using the communication tail duration corresponding to the i-th transmission strategy as the reward for reinforcement learning. Since the Q-learning algorithm can take actions according to the current state and, after obtaining the corresponding reward, improve those actions, it is able to take better actions over time, that is, to obtain a better transmission strategy.
  • The i-th transmission strategy includes third information and fourth information. The third information indicates the i-th data volume threshold used when transmitting the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model; the i-th data volume threshold determines the transmission timing of those gradients. The fourth information indicates the logical topology used for each transmission.
  • Generating the i-th transmission strategy through the Q table includes: according to the Q table, obtaining the reward values for executing the Q actions in the state corresponding to the (i-1)-th data volume threshold, determining the i-th target action according to those reward values, and generating the i-th transmission strategy. The i-th data volume threshold is the sum of the state transition amount in the combination corresponding to the i-th target action and the (i-1)-th data volume threshold; the logical topology used for each transmission is the logical topology in the combination corresponding to the i-th target action.
  • The maximum data volume threshold among the P data volume thresholds is determined according to the parameter size of the first neural network model, and/or the minimum data volume threshold among the P data volume thresholds is determined according to a preset transmission efficiency.
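A minimal sketch of this Q-learning variant, under assumed numbers: four data-volume thresholds as the P states, six (threshold-shift, topology) combinations as the Q actions, and a hypothetical `measure_tail_time` function standing in for timing one real distributed iteration, with the negative tail duration as the reward:

```python
import random

# Illustrative Q-learning for the scheme above: states are data-volume
# thresholds, actions are (state transition, logical topology) combinations.
# measure_tail_time is a made-up stand-in, not a measurement from the patent.

THRESHOLDS = [1, 2, 4, 8]                                         # P = 4 states (MB)
ACTIONS = [(d, t) for d in (-1, 0, 1) for t in ("ring", "tree")]  # Q = 6 actions

def measure_tail_time(threshold_mb, topology):
    """Stand-in: tail time is shortest at a 4 MB threshold on a ring."""
    return {"ring": 0.0, "tree": 1.0}[topology] + abs(threshold_mb - 4)

def learn(steps=5000, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in THRESHOLDS]
    state = 0
    for _ in range(steps):
        a = rng.randrange(len(ACTIONS))                     # explore every step
        delta, topo = ACTIONS[a]
        nxt = min(max(state + delta, 0), len(THRESHOLDS) - 1)
        reward = -measure_tail_time(THRESHOLDS[nxt], topo)  # shorter tail = bigger reward
        q[state][a] += alpha * (reward + gamma * max(q[nxt]) - q[state][a])
        state = nxt
    return q

q = learn()
best = max(range(len(ACTIONS)), key=lambda a: q[2][a])
print("greedy action at the 4 MB state:", ACTIONS[best])    # expect (0, 'ring')
```

Because the toy environment is deterministic, the Q values settle to a fixed point and the greedy policy keeps the threshold at its best value; in the patent the reward would come from the measured communication tail duration instead.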
  • an embodiment of the present application provides a method for determining a transmission strategy.
  • the method may be executed by a computing node.
  • the method includes:
  • generating an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model; determining the communication tail duration corresponding to the i-th transmission strategy, where the communication tail duration indicates the time between the end time of the i-th iteration of the first neural network model and the start time of the (i+1)-th iteration; and, if it is determined that the communication tail duration corresponding to the i-th transmission strategy is greater than a first threshold, generating the (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy.
  • the method also includes:
  • if it is determined that the communication tail duration corresponding to the i-th transmission strategy is less than or equal to the first threshold, using the i-th transmission strategy as the (i+1)-th transmission strategy.
  • Through multiple rounds of reinforcement learning by the computing node, once the generated i-th transmission strategy makes distributed training sufficiently efficient (that is, the i-th transmission strategy is a good transmission strategy), no new transmission strategy needs to be generated. Accordingly, the training nodes can use the same transmission strategy (that is, the i-th transmission strategy) to transmit gradients in subsequent iterations of the first neural network model. In this way, the processing burden of the computing node is effectively reduced, and the training nodes can transmit gradients based on the same strategy without having to receive a newly generated one from the computing node, which effectively improves the efficiency of distributed training.
  • an embodiment of the present application provides a method for transmitting gradients.
  • the method may be executed by a training node.
  • the method includes:
  • obtaining an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model.
  • The i-th transmission strategy includes a sub-transmission strategy for each layer of the first neural network model. The sub-transmission strategy of the n-th layer includes first information and second information: the first information indicates whether to initiate a transmission after the gradient of the n-th layer's parameters is calculated, and the second information indicates the logical topology used for the transmission.
  • Using the i-th transmission strategy to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model includes: in the i-th iteration, after calculating the gradient of each layer's parameters, transmitting that gradient according to the layer's sub-transmission strategy.
  • If the first information of the n-th layer's sub-transmission strategy indicates that a transmission is initiated after the gradient of the n-th layer's parameters is calculated, the gradients to be transmitted are sent over the logical topology indicated by the second information of the n-th layer's sub-transmission strategy.
  • The i-th transmission strategy includes third information and fourth information. The third information indicates the i-th data volume threshold used when transmitting the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model; the i-th data volume threshold determines the transmission timing of those gradients. The fourth information indicates the logical topology used for each transmission.
  • Using the i-th transmission strategy to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model includes: in the i-th iteration, after calculating the gradient of each layer's parameters, if the data volume of the gradients to be transmitted is greater than or equal to the i-th data volume threshold, transmitting the gradients to be transmitted over the logical topology indicated by the fourth information.
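The threshold rule can be illustrated as follows; the layer sizes and threshold are made-up numbers, and the function only decides when a transfer would start rather than performing one:

```python
# Sketch of the data-volume-threshold rule: gradients accumulate in a buffer
# as back-propagation produces them (layer N down to layer 1), and a transfer
# is launched once the buffered volume reaches the threshold.

def transmit_with_threshold(layer_grad_sizes_mb, threshold_mb):
    """Return the list of transfers, each a (layers, total_mb) batch."""
    transfers, buffer, buffered_mb = [], [], 0.0
    for layer, size_mb in layer_grad_sizes_mb:     # order: layer N ... layer 1
        buffer.append(layer)
        buffered_mb += size_mb
        if buffered_mb >= threshold_mb:            # enough data: start a transfer
            transfers.append((tuple(buffer), buffered_mb))
            buffer, buffered_mb = [], 0.0
    if buffer:                                     # flush the remainder at the end
        transfers.append((tuple(buffer), buffered_mb))
    return transfers

# Five layers; gradients are computed from layer 5 back to layer 1.
sizes = [(5, 1.0), (4, 0.5), (3, 2.0), (2, 0.5), (1, 1.0)]
print(transmit_with_threshold(sizes, threshold_mb=2.0))
# → [((5, 4, 3), 3.5), ((2, 1), 1.5)]
```

A larger threshold batches more layers per transfer (fewer, bigger transfers); a smaller one overlaps communication with computation more aggressively, which is the trade-off the learned strategy navigates.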
  • obtaining the i-th transmission strategy includes: receiving the i-th transmission strategy sent by the computing node;
  • After using the i-th transmission strategy to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model, the method further includes: sending, to the computing node, the i-th communication tail duration of transmitting, using the i-th transmission strategy, the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model.
  • Embodiments of the present application provide a device, which may be a computing node or a training node, a computer device in which the computing node or training node is located, or a semiconductor chip disposed in the computer device.
  • the device has the function of realizing various possible designs in the first to third aspects. This function can be realized by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more units or modules corresponding to the above-mentioned functions.
  • An embodiment of the present application provides a device that includes a processor, a memory, and instructions stored in the memory and executable on the processor; when the instructions are executed, the device performs the method of the first aspect.
  • An embodiment of the present application provides a computer-readable storage medium, including instructions which, when run on a computer, cause the computer to execute the method described in any one of the possible designs of the first to third aspects.
  • the embodiments of the present application provide a computer program product, which when running on a computer, causes the computer to execute the method described in any one of the possible designs of the first aspect to the third aspect.
  • Figure 1 is a schematic diagram of an artificial intelligence main framework provided by an embodiment of the application.
  • Figure 2b is a schematic diagram of a centralized distributed training system provided by an embodiment of the application.
  • Figure 2c is a schematic diagram of a decentralized distributed training system provided by an embodiment of the application.
  • Figure 2d is a schematic diagram of transmission in the decentralized distributed training system provided by an embodiment of the application.
  • FIG. 2e is a possible schematic diagram of asynchronous parallel computing and communication provided by an embodiment of this application.
  • FIG. 2f is a schematic diagram of the relationship between the amount of transmitted data and the communication tail duration provided by an embodiment of the application.
  • Figure 3 is a schematic diagram of an architecture applicable to the implementation of this application.
  • FIG. 4 is a schematic flowchart corresponding to a method for determining a transmission strategy provided by an embodiment of the application.
  • Figure 5a is a schematic diagram of a second neural network model generating a transmission strategy.
  • FIG. 5b is an overall schematic diagram of Implementation Mode 1 provided by an embodiment of the application.
  • Figure 6a is a schematic diagram of determining the data volume threshold.
  • FIG. 6b is an overall schematic diagram of Implementation Mode 2 provided by an embodiment of this application.
  • FIG. 7 is a possible exemplary block diagram of a device for determining a transmission strategy involved in an embodiment of this application.
  • FIG. 8 is a schematic diagram of an apparatus for determining a transmission strategy provided by an embodiment of the application.
  • An artificial neural network (ANN), often simply called a neural network (NN), is, in the fields of machine learning and cognitive science, a mathematical or computational model that imitates the structure and function of a biological neural network (an animal's central nervous system, particularly the brain) and is used to estimate or approximate functions.
  • A neural network computes by connecting a large number of artificial neurons. In most cases, an artificial neural network can change its internal structure on the basis of external information: it is an adaptive system and, in general, has a learning capability.
  • A loss function, in statistics, is a function that measures the degree of loss or error. In neural networks, it can be understood as a function measuring the difference between the value predicted by the model and the label value of the training data; the neural network model can be trained with the goal of minimizing the loss function.
  • Gradient descent is a first-order optimization algorithm, usually called the steepest descent method. To find a local minimum of a function with gradient descent, one iteratively steps a specified distance from the current point in the direction opposite to the gradient (or an approximate gradient) at that point.
  • A gradient is a vector: the direction in which the directional derivative of a function at a given point attains its maximum value, as used in the gradient descent method.
  • In gradient descent, each parameter can be updated based on its gradient, thereby gradually approaching the minimum value of the neural network's loss function.
  • The backpropagation (BP) algorithm, short for "error backpropagation algorithm", is a common method for training artificial neural networks in combination with an optimization method (such as gradient descent). It computes the gradient of the loss function with respect to all the weights in the neural network and feeds the gradients to the optimization method, which updates the weights to minimize the loss function.
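A worked one-dimensional example of the gradient-descent rule just described, minimizing f(w) = (w - 3)^2 so the result can be checked by hand:

```python
# Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3.

def grad(w):
    return 2 * (w - 3)         # df/dw

w, lr = 0.0, 0.1
for _ in range(100):
    w -= lr * grad(w)          # step against the gradient direction
print(round(w, 4))             # → 3.0 (the minimum of f)
```

Each step multiplies the distance to the minimum by (1 - 2·lr) = 0.8, so after 100 steps the error is about 0.8^100 and w is numerically at the minimum.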
  • A training node can also be called a worker or a working node.
  • the training node can be a GPU or a central processing unit (CPU), which is not specifically limited.
  • A computing node can likewise be a GPU or a CPU; this is not limited.
  • A GPU, also known as a display core, vision processor, display chip, or graphics chip, is a microprocessor specialized for graphics operations that runs on personal computers, workstations, game consoles, and some mobile devices (such as tablet computers and smartphones).
  • Figure 1 shows a schematic diagram of an artificial intelligence main framework, which describes the overall workflow of the artificial intelligence system and is suitable for general artificial intelligence field requirements.
  • The "intelligent information chain" reflects a series of processes from data acquisition to processing; for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data goes through a condensing process from data to information to knowledge to wisdom. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying intelligent infrastructure and information (technologies for providing and processing information) to the industrial ecology of the system.
  • The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through the basic platform. It communicates with the outside through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); and the basic platform includes distributed computing frameworks, networks, and related platform guarantees and support, which may include cloud storage and computing, interconnection networks, and so on. For example, sensors communicate with the outside to obtain data, and these data are provided to the smart chips of the basic platform for computation.
  • the data in the upper layer of the infrastructure is used to indicate the data source in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training (such as deep learning, reinforcement learning), search, reasoning, and decision-making.
  • deep learning and reinforcement learning are important parts of artificial intelligence, and both deep learning and reinforcement learning belong to machine learning.
  • deep learning refers to the use of existing data to train algorithms to find patterns that solve the corresponding problems, and then use this pattern to predict new data.
  • Reinforcement learning is mainly learning through trial and error, that is, to determine the best answer by performing actions a limited number of times to get the maximum reward.
  • The difference between deep learning and reinforcement learning is that deep learning learns from a training set and then applies the learned knowledge to a new data set, which is static learning; reinforcement learning uses continuous feedback to adjust its own actions and obtain the best result, a process of constant trial and error and dynamic learning.
  • deep learning and reinforcement learning are not mutually exclusive concepts. The two can be used in combination. For example, deep learning can be used in reinforcement learning.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formal information to conduct machine thinking and solving problems based on reasoning control strategies.
  • the typical function is search and matching.
  • Decision-making refers to the decision-making process of intelligent information after reasoning, and usually provides functions such as classification, ranking, and prediction.
  • Smart products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productizing intelligent information decision-making and realizing practical applications. Application fields mainly include smart manufacturing, smart transportation, smart home, smart medical care, smart security, autonomous driving, safe cities, and smart terminals.
  • The embodiments of this application mainly study the data training part of the framework shown in FIG. 1, and further study how to transmit the computed gradients during distributed training of the first neural network model on training data, so as to improve the efficiency of distributed training.
  • Fig. 2a is a schematic diagram of the first neural network model.
  • the first neural network model includes multiple layers, and each layer includes at least one parameter.
  • the training of the first neural network model refers to determining the optimal parameter value according to the massive training data, so that the difference between the actual output data and the expected output data obtained by the first neural network model according to the training data meets the requirements.
  • the first neural network model includes N layers, which are the first layer to the Nth layer, and each layer in the first neural network model has a corresponding sequence.
  • the first layer is a layer that directly receives training data
  • the Nth layer is a layer that directly outputs data.
  • The first neural network model can be trained with the backpropagation algorithm, which specifically includes (for one iteration): inputting training data; calculating the actual output data from the first layer to the N-th layer from the training data (that is, forward computation); calculating the loss function value from the difference between the actual output data and the expected output data; and calculating the gradients of the parameters from the N-th layer to the first layer according to the loss function value and using the gradients to update the parameters (that is, backward computation).
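The four steps of one iteration can be traced numerically on a toy two-layer linear chain y = w2·(w1·x); the numbers are illustrative, and the layer numbering follows the text (layer 1 receives the input, layer N produces the output):

```python
# One training iteration: forward pass, loss, backward pass, parameter update.

x, target = 2.0, 8.0
w1, w2, lr = 1.0, 1.5, 0.01

# Forward computation: layer 1, then layer 2.
h = w1 * x
y = w2 * h
loss = (y - target) ** 2

# Backward computation: gradients flow from layer 2 (= layer N) down to layer 1,
# so the last layer's gradient is available first.
dy = 2 * (y - target)          # dLoss/dy
g2 = dy * h                    # dLoss/dw2, computed first
g1 = dy * w2 * x               # dLoss/dw1, computed last

w1 -= lr * g1
w2 -= lr * g2
print(round(loss, 2), round(w1, 3), round(w2, 3))
```

The fact that the last layers' gradients are ready before the first layers' is exactly what lets gradient transmission overlap with the remaining backward computation in the schemes above.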
  • The gradients computed by the multiple training nodes for a given parameter may differ, so the training nodes need to transmit the computed gradients of parameter a in order to determine the gradient average; the training nodes can then obtain parameter a updated with that average. After the training nodes have updated the parameters of every layer, they each use the updated parameters to perform the next iteration of the first neural network model.
  • a centralized manner may be used for distributed training, or a decentralized manner may be used for distributed training. The following two methods are described in detail.
  • Figure 2b is a schematic diagram of a centralized distributed training system.
  • The distributed training system includes a central server (also called a parameter server or central node) and at least one training node (such as the training nodes shown in Figure 2b).
  • the parameter server can communicate with at least one training node.
  • each training node has a copy of the first neural network model, and each training node can use designated data blocks (shards) to train the first neural network model.
  • The distributed training process is as follows: the training data set is divided into multiple data blocks; for the j-th data block, the j-th data block can be divided into 3 mini-batches, and the 3 mini-batches are trained by the 3 training nodes.
  • Each training node sends the computed gradient of a parameter to the parameter server according to the same rule (for example, after a training node computes the gradients of one layer's parameters, it sends those gradients to the parameter server). Correspondingly, taking parameter a as an example, the parameter server can determine the gradient average of parameter a from the received gradients of parameter a, update parameter a based on that average (the specific update method is not limited), and feed the updated parameter a back to the 3 training nodes.
  • In this way, the three training nodes complete the update of the parameters of each layer of the first neural network model, and can use the mini-batches of the (j+1)-th data block to perform the next iteration based on the updated parameters.
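A toy version of this centralized flow, with assumed names (`ParameterServer`, `push`) rather than anything from the application:

```python
# Each training node pushes its gradient for parameter `a` to the server,
# which averages the gradients, updates the parameter, and broadcasts the
# same new value back to every node.

class ParameterServer:
    def __init__(self, a):
        self.a = a
        self.pending = []

    def push(self, grad_a):
        """Receive one node's gradient for parameter a."""
        self.pending.append(grad_a)

    def update_and_broadcast(self, lr=0.1):
        """Average the pending gradients, update a, and return the new value."""
        avg = sum(self.pending) / len(self.pending)
        self.a -= lr * avg
        self.pending = []
        return self.a                 # every node receives the same value

server = ParameterServer(a=1.0)
for g in (0.3, 0.6, 0.9):             # gradients from 3 training nodes
    server.push(g)
new_a = server.update_and_broadcast()
print(round(new_a, 4))                # → 0.94
```

The server is a single point that every gradient must traverse, which is the bottleneck that motivates the decentralized scheme described next.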
  • Figure 2c is a schematic diagram of a decentralized distributed training system that includes at least one training node (such as training node 1, training node 2, training node 3, training node 4, and training node 5 shown in Figure 2c). The training nodes can communicate with each other, for example to transmit gradients.
  • The training nodes may have a prescribed sequence of data transmission among themselves.
  • For example, training node 1 can only transmit data to training node 2, training node 2 can only transmit data to training node 3, and training node 3 can only transmit data back to training node 1.
  • the sequence of data transmission among multiple training nodes can be pre-configured, or it can be calculated and determined by the training node according to specific rules.
  • The distributed training process can be: dividing the training data set into multiple data blocks; for the j-th data block, the j-th data block can be divided into 5 mini-batches, and the 5 mini-batches are trained by the 5 training nodes.
  • the gradient of the calculated parameter can be sent to other training nodes according to the sequence of data transmission.
  • The buffered gradients can be divided into multiple groups; each group is a slice, and each slice includes at least one gradient. The number of slices is the same as the number of training nodes training the first neural network model: if there are 5 training nodes, the buffered gradients are cut into 5 slices.
  • Each training node uses the same slicing rule. For example, suppose there are 5 training nodes in total: if each training node buffers 10 gradients to be transmitted, each slice includes 2 gradients; if each training node buffers 11 gradients to be transmitted, the gradients can be cut into 5 slices of 2, 2, 2, 2, and 3 gradients, that is, 4 slices each include 2 gradients and 1 slice includes 3 gradients.
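The slicing arithmetic above can be written directly; where the slice with the extra gradient goes is a choice, and here the remainder goes to the last slices, matching the 2, 2, 2, 2, 3 example:

```python
# Split `total` buffered gradients into `num_nodes` slices as evenly as
# possible, so slice sizes differ by at most 1.

def slice_sizes(total, num_nodes):
    base, extra = divmod(total, num_nodes)
    return [base + 1 if i >= num_nodes - extra else base
            for i in range(num_nodes)]

print(slice_sizes(10, 5))   # → [2, 2, 2, 2, 2]
print(slice_sizes(11, 5))   # → [2, 2, 2, 2, 3]
```

Because every node applies the same deterministic rule, slices with the same identifier cover the same parameters on every node, which the ring exchange below relies on.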
  • Training node i cuts its cached gradients into 5 slices identified ai to ei, where i ranges from 1 to 5; that is, training node 1 cuts its cached gradients into 5 slices identified a1 to e1, training node 2 cuts its cached gradients into 5 slices identified a2 to e2, and so on.
  • For a given slice identifier, the parameters corresponding to the gradients in that slice are consistent across training nodes.
  • For example, slice a1 includes two gradients, which are the gradients corresponding to parameter R and parameter Y; slices a2, a3, a4, and a5 also each include two gradients, likewise corresponding to parameter R and parameter Y, respectively.
• when training node 1 transmits its buffered gradients to training node 2, training node 1 first sends slice a1 to training node 2; after training node 2 receives slice a1 sent by training node 1, it adds the received slice a1 to its own slice a2 and sends the sum to training node 3 as slice a1+a2. Here, slice a1 includes the gradients corresponding to parameter R and parameter Y, which are r1 and y1 respectively, and slice a2 includes the gradients corresponding to parameter R and parameter Y, which are r2 and y2 respectively.
• that training node 2 sends the sum of slice a1 and slice a2 to training node 3 as slice a1+a2 can be understood as follows: for parameter R, the sum of the two corresponding gradients r1 and r2 is carried in slice a1+a2 as the gradient of parameter R and sent to training node 3; for parameter Y, the sum of the two corresponding gradients y1 and y2 is carried in slice a1+a2 as the gradient of parameter Y and sent to training node 3.
• after training node 3 receives slice a1+a2 sent by training node 2, it adds the received slice a1+a2 to its own slice a3 and sends the sum to training node 4 as slice a1+a2+a3; after training node 4 receives slice a1+a2+a3 sent by training node 3, it adds the received slice a1+a2+a3 to its own slice a4 and sends the sum to training node 5 as slice a1+a2+a3+a4; after training node 5 receives slice a1+a2+a3+a4 sent by training node 4, it can add the received slice a1+a2+a3+a4 to its own slice a5 and send the sum to training node 1 as slice a1+a2+a3+a4+a5.
• alternatively, training node 5 may calculate the gradient average according to the received slice a1+a2+a3+a4 and its own slice a5, and send the gradient average calculated for slice a to training node 1. Training node 1 then sends the gradient average calculated for slice a to training node 2, training node 2 sends it to training node 3, and training node 3 sends it to training node 4. In this way, training node 1 to training node 5 all obtain the gradient average calculated for slice a, and training node 1 to training node 5 can use it to update the values of parameter R and parameter Y corresponding to slice a for use in the next iteration.
• similarly, when training node 2 transmits its buffered gradients to training node 3, training node 2 first sends slice b2 to training node 3; after training node 3 receives slice b2 sent by training node 2, it adds the received slice b2 to its own slice b3 and sends the sum to training node 4 as slice b2+b3; and so on, using a process similar to the above, until training node 1 sends the calculated sum to training node 5 as slice b1+b2+b3+b4+b5. Alternatively, training node 1 may calculate the gradient average for slice b and send the gradient average calculated for slice b to training node 2.
• training node 2 then sends the gradient average calculated for slice b to training node 3, training node 3 sends it to training node 4, and training node 4 sends it to training node 5. In this way, training node 1 to training node 5 all obtain the gradient average calculated for slice b, and can use it to update the parameter values corresponding to slice b for use in the next iteration.
• similarly, training node 3 first sends slice c3 to training node 4; ...; training node 2 calculates the gradient average for slice c and sends it to training node 3; ...; until training node 1 to training node 5 obtain the gradient average calculated for slice c, and training node 1 to training node 5 can use it to update the parameter values corresponding to slice c.
• similarly, training node 4 first sends slice d4 to training node 5; ...; training node 3 calculates the gradient average for slice d and sends it to training node 4; ...; until training node 1 to training node 5 obtain the gradient average calculated for slice d, and training node 1 to training node 5 can use it to update the parameter values corresponding to slice d.
• similarly, training node 5 first sends slice e5 to training node 1; ...; training node 4 calculates the gradient average for slice e and sends it to training node 5; ...; until training node 1 to training node 5 obtain the gradient average calculated for slice e, and training node 1 to training node 5 can use it to update the parameter values corresponding to slice e.
• the mini-batch data in the j+1th data block can then be used for the next iteration based on the updated parameters.
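• The slice-by-slice averaging walked through above can be sketched functionally as follows. This is an illustrative simulation of the result of the ring exchange (it computes, for each slice identifier, the element-wise average across all nodes, rather than modeling each individual send), and the function name is hypothetical.

```python
def ring_allreduce_average(node_slices):
    """Simulate the outcome of the ring-based averaging described above.

    node_slices[i][k] is the k-th slice (a list of gradient values) held by
    training node i; every node holds one slice per slice identifier.
    Returns, for every node, the per-slice average across all nodes.
    """
    w = len(node_slices)  # number of training nodes (and slices per node)
    averaged = []
    for k in range(w):
        # Scatter-reduce phase: slice k accumulates contributions from all
        # nodes as it travels around the ring starting at node k.
        acc = list(node_slices[k][k])
        node = (k + 1) % w
        for _ in range(w - 1):
            acc = [a + b for a, b in zip(acc, node_slices[node][k])]
            node = (node + 1) % w
        averaged.append([v / w for v in acc])
    # All-gather phase: every node ends up with every averaged slice.
    return [[list(s) for s in averaged] for _ in range(w)]
```

After the call, each node holds identical averaged slices, which it can use to update the parameter values corresponding to each slice for the next iteration.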
  • the system may include one or more computer devices, and each computer device may be deployed with one or more training nodes. Training nodes deployed in the same computer device can communicate through a communication bus, and training nodes deployed in different computer devices can communicate through a network (such as a wireless network).
  • the training node needs to send the gradient out for aggregation operation during the reverse calculation process of neural network model training (that is, obtain the average value of the gradient in order to update the parameters). Further, in order to improve training efficiency, the training node also needs to overlap the layer-by-layer calculation and gradient transmission of the first neural network training, that is, the calculation and the communication are asynchronous and parallel.
• refer to Figure 2e, which is a possible schematic diagram of asynchronous parallel computing and communication.
• after the gradient of the N-th layer parameters is calculated, it can be sent out (transmission delay denoted as τN); after the gradient of the N-1th layer parameters is obtained, it can be sent out (transmission delay denoted as τN-1); and so on, after the gradient of the first layer parameters is calculated, it can be sent out (transmission delay denoted as τ1). In this way, after the gradients of the parameters of each layer of the first neural network model have all been sent out, the parameters of each layer of the first neural network model can be updated, and the next iteration can be executed.
• the last transmission can only be executed after the gradient of the first layer parameters is calculated, which leads to the phenomenon of communication tailing. If the communication tailing time of the i-th iteration (that is, τ1) is long, the time interval between the i-th iteration and the i+1th iteration becomes long, resulting in lower efficiency of distributed training of the first neural network model.
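• The communication tailing described above can be estimated with a simple timing model. This sketch is illustrative only: it assumes transmissions are serialized on one channel, each layer's transmission starts once both its gradient is ready and the channel is free, and layers are listed in the order their gradients are produced in the backward pass (N, N-1, ..., 1). The function name and the model itself are assumptions, not part of the original application.

```python
def communication_tail(compute_times, transmit_times):
    """Estimate the communication tailing duration.

    compute_times[k]  : backward-pass compute time of the k-th transmitted
                        layer (layers ordered N, N-1, ..., 1).
    transmit_times[k] : transmission delay of that layer's gradients.
    Returns the time between the end of the last gradient computation and
    the end of the last transmission (the tail delaying iteration i+1).
    """
    t_compute = 0.0  # time at which the current layer's gradient is ready
    t_comm = 0.0     # time at which the channel becomes free
    for c, tx in zip(compute_times, transmit_times):
        t_compute += c
        t_comm = max(t_comm, t_compute) + tx
    return t_comm - t_compute
```

For example, if every layer takes 1 unit to compute and 1 unit to transmit, only the last transmission (τ1) remains as tail; a slow early transmission lengthens the tail because later transmissions queue behind it.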
• the distributed training system can have multiple physical networking methods, such as InfiniBand, RDMA over Converged Ethernet (RoCE, where RDMA refers to remote direct memory access), high-speed serial computer expansion bus (peripheral component interconnect express, PCIe), NVLink interconnect, etc.
• there are also multiple logical topologies that can be used for gradient transmission, such as logical trees, rings, halving & doubling, hierarchical rings, hybrid topologies, etc.
• under different physical networking methods and different logical topologies, the curve of transmission delay against transmitted data volume shows different trends. As shown in Figure 2f, when the transmitted data volume is less than M0, the transmission delay using topo0 is less than that using topo1; otherwise, the transmission delay using topo1 is less than that using topo0.
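• The crossover behavior in Figure 2f can be illustrated with a simple linear delay model, latency = startup overhead + per-unit cost × volume. The model, the function name, and the numeric parameters below are illustrative assumptions: a topology with low startup overhead wins for small volumes, while a topology with low per-unit cost wins for large volumes.

```python
def pick_topology(volume, topos):
    """Pick the logical topology with the lowest modeled transmission delay.

    topos: mapping of topology name -> (startup_overhead, per_unit_cost);
    the delay model latency = overhead + cost * volume is illustrative.
    """
    return min(topos, key=lambda t: topos[t][0] + topos[t][1] * volume)
```

For example, with topos = {"topo0": (1.0, 0.10), "topo1": (5.0, 0.01)}, topo0 is cheaper below the crossover volume and topo1 above it, mirroring the M0 threshold in the figure.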
• in addition, the method of transmitting gradients shown in FIG. 2e is to transmit the gradient of each layer's parameters as soon as it is calculated, but the parameter amount and the parameter distribution across layers of different neural network models are often different. For example, within the same neural network model, some layers have more parameters and some layers have fewer. For layers with few parameters, initiating a transmission per layer is obviously inefficient (considering factors such as communication overhead); therefore, the calculated gradients can be accumulated to a certain amount before a transmission is initiated, for example, the gradients of two or three layers of parameters can be transmitted together, so as to improve transmission efficiency. For another example, across different neural network models, the parameter distribution of some models is relatively uniform, while the parameters of other models may be concentrated in certain layers, which may cause bursty transmission.
• at present, some deep learning frameworks (such as TensorFlow) can use third-party libraries (such as Horovod, OpenMPI) to transmit gradients. Horovod allows users to set a data volume threshold or a time threshold: when the corresponding threshold is reached, the transmission is initiated. However, this method does not provide any basis for determining the data volume threshold or time threshold; the threshold set by a user based on personal experience may not be reasonable enough, thus failing to achieve the purpose of improving the efficiency of distributed training.
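• The threshold-based accumulation described above can be sketched as follows. This is an illustrative sketch of the general idea (accumulate per-layer gradients in backward order and flush a fused transmission once the buffered data volume reaches the threshold), not Horovod's actual implementation; the function name is hypothetical.

```python
def fuse_by_threshold(layer_sizes, threshold):
    """Group per-layer gradient data volumes (in backward order) into fused
    transmissions: a transmission is initiated once the buffered volume
    reaches the threshold; any remaining partial bucket is flushed at the
    end of the iteration."""
    buckets, buf = [], []
    for size in layer_sizes:
        buf.append(size)
        if sum(buf) >= threshold:
            buckets.append(buf)  # initiate one fused transmission
            buf = []
    if buf:
        buckets.append(buf)      # flush the leftover gradients
    return buckets
```

A threshold that is too small degenerates to one transmission per layer (high overhead); a threshold that is too large delays transmission and lengthens the communication tail, which is why choosing the threshold well matters.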
• in view of this, an embodiment of the present application provides a method for determining a transmission strategy, which specifically includes: generating an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of each layer's parameters obtained in the i-th iteration of the first neural network model; determining the communication tailing duration corresponding to the i-th transmission strategy, where the communication tailing duration corresponding to the i-th transmission strategy is used to indicate the duration between the end time of the i-th iteration and the start time of the i+1th iteration of the first neural network model; and generating the i+1th transmission strategy according to the communication tailing duration corresponding to the i-th transmission strategy, where the i+1th transmission strategy is used to transmit the gradients of each layer's parameters obtained in the i+1th iteration of the first neural network model.
  • the above method can be executed by a computing node.
• after the computing node generates the i-th transmission strategy, it can obtain the communication tailing duration corresponding to the i-th transmission strategy, so that a round of reinforcement learning can be completed based on that duration. This makes the generated i+1th transmission strategy tend toward the optimal transmission strategy (that is, the transmission strategy that minimizes the communication tailing duration), which is beneficial to improving the efficiency of distributed training.
• the computing node can interact with the multiple training nodes that perform distributed training on the first neural network model, and continuously try and update the transmission strategy based on the communication tailing durations that the training nodes feed back after completing each iteration. In this way, a near-optimal transmission strategy can be generated intelligently and automatically through reinforcement learning, improving the efficiency of distributed training.
  • FIG. 3 is a schematic diagram of an architecture to which the embodiments of the application are applicable. As shown in FIG. 3, it includes a computing node and a distributed training system.
• the distributed training system may be the centralized distributed training system shown in FIG. 2b, the decentralized distributed training system shown in FIG. 2c, or another possible distributed training system, which is not specifically limited; FIG. 3 only takes the decentralized distributed training system shown in FIG. 2c as an example.
• the computing node may include an agent executor, and each training node may include an evaluator; for example, training node 1 includes evaluator 1, training node 2 includes evaluator 2, ..., and training node 5 includes evaluator 5.
• the agent executor can be a set of reinforcement learning networks or algorithms, mainly used to generate the i-th transmission strategy, send the i-th transmission strategy to evaluator 1 to evaluator 5, and update its own parameters according to the rewards (specifically, the communication tailing durations) fed back by evaluator 1 to evaluator 5.
• evaluator 1 is mainly used to obtain the i-th transmission strategy generated by the agent executor, start an iteration of the first neural network model, transmit the reverse-calculated gradients according to the i-th transmission strategy, measure the communication tailing duration, and feed the communication tailing duration back to the agent executor as a reward; the other evaluators work in the same way as evaluator 1 and are not described again. In this way, over repeated iterations, the agent executor can continuously learn and evolve from the real model training environment, and eventually tend to produce the optimal transmission strategy.
  • FIG. 4 is a schematic flowchart corresponding to a method for determining a transmission strategy provided by an embodiment of the application, as shown in FIG. 4, including:
• Step 401: The computing node generates the i-th transmission strategy, and sends the i-th transmission strategy to W training nodes respectively. Among them, the W training nodes are used for distributed training of the first neural network model; the i-th transmission strategy is used by the W training nodes to transmit the gradients of each layer's parameters obtained in the i-th iteration of the first neural network model.
• Step 402: The training node receives the i-th transmission strategy, and sends to the computing node the i-th communication tailing duration of transmitting, using the i-th transmission strategy, the gradients of each layer's parameters obtained in the i-th iteration of the first neural network model.
  • the training node described here may be any one of the W training nodes.
• Step 403: The computing node receives the W i-th communication tailing durations sent by the W training nodes.
• the i-th communication tailing duration obtained by a training node may be equal to the duration between the end time of that training node's i-th iteration on the first neural network model and the start time of its i+1th iteration. For example, the i-th communication tailing duration obtained by training node 1 may be equal to the duration between the end time of training node 1's i-th iteration on the first neural network model and the start time of its i+1th iteration.
• Step 404: The computing node determines the communication tailing duration corresponding to the i-th transmission strategy according to the W i-th communication tailing durations.
• the communication tailing duration corresponding to the i-th transmission strategy is used to indicate the duration between the end time of the i-th iteration and the start time of the i+1th iteration of the first neural network model. Since W training nodes perform distributed training on the first neural network model, it can also be understood as follows: the communication tailing duration corresponding to the i-th transmission strategy reflects the durations between the end times of the W training nodes' i-th iterations of the first neural network model and the start times of their i+1th iterations.
• there may be multiple specific implementations by which the computing node determines the communication tailing duration corresponding to the i-th transmission strategy from the W i-th communication tailing durations. For example, the computing node can determine the average of the W i-th communication tailing durations as the communication tailing duration corresponding to the i-th transmission strategy; that is, the communication tailing duration corresponding to the i-th transmission strategy is equal to the average of the durations between the end times of the W training nodes' i-th iterations of the first neural network model and the start times of their i+1th iterations.
• Step 405: The computing node generates the i+1th transmission strategy according to the communication tailing duration corresponding to the i-th transmission strategy, and sends the i+1th transmission strategy to the W training nodes; the i+1th transmission strategy is used by the W training nodes to transmit the gradients of each layer's parameters obtained in the i+1th iteration of the first neural network model.
• the value of X can be obtained according to the number of data blocks into which the training data set of the first neural network model is divided, each data block being used for one iteration of the first neural network model (for details, see the descriptions of FIG. 2b and FIG. 2c above). For example, X is equal to the number of data blocks into which the training data set of the first neural network model is divided.
• in one possible example, if the computing node determines that the communication tailing duration corresponding to the i-th transmission strategy is greater than a first threshold, it can generate the i+1th transmission strategy according to the communication tailing duration corresponding to the i-th transmission strategy; if the computing node determines that the communication tailing duration corresponding to the i-th transmission strategy is less than or equal to the first threshold, the i-th transmission strategy can be used as the i+1th transmission strategy. That is, if, through multiple rounds of reinforcement learning, the generated i-th transmission strategy already makes distributed training efficient (the i-th transmission strategy is a good enough transmission strategy), no new transmission strategy needs to be generated, and the training nodes can use the same transmission strategy (that is, the i-th transmission strategy) to transmit gradients in subsequent iterations of the first neural network model.
• the first threshold can be set according to actual needs and experience. In this way, the processing burden of the computing node can be effectively reduced, and the training nodes can transmit gradients based on the same transmission strategy without waiting to receive a newly generated transmission strategy from the computing node, which can effectively improve the efficiency of distributed training.
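• The decision logic of steps 403 to 405 in this example can be sketched as follows. This is illustrative only; the function name is hypothetical, and `regenerate` stands in for the reinforcement-learning update, which is not shown here.

```python
def next_strategy(tail_durations, current_strategy, first_threshold, regenerate):
    """Average the W reported i-th communication tailing durations, then
    either reuse the current transmission strategy (tail already at or
    below the first threshold) or generate a new one via `regenerate`."""
    avg_tail = sum(tail_durations) / len(tail_durations)
    if avg_tail <= first_threshold:
        return current_strategy          # good enough: keep using it
    return regenerate(avg_tail)          # one more round of learning
```

This mirrors the text: the averaged tailing duration serves as the per-strategy figure of merit, and the first threshold gates whether a new strategy is generated at all.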
• in the above example, the communication tailing duration corresponding to the i-th transmission strategy is compared with the first threshold to determine whether the i-th transmission strategy is a good enough transmission strategy; in other possible examples, this may be determined in other ways, which is not specifically limited.
• in another possible example, the computing node can determine the transmission strategy used in the next iteration based on the communication tailing duration of the previous iteration; that is, after the computing node determines the communication tailing duration corresponding to the i-th transmission strategy, it can directly generate the i+1th transmission strategy according to that duration, without determining whether the duration is greater than the first threshold. In this way, the transmission strategy can be adjusted in time as conditions change, improving the efficiency of distributed training.
  • the description will be mainly based on this example below.
  • the computing node may generate a transmission strategy based on a variety of possible reinforcement learning methods.
  • the following exemplarily describes two possible implementation methods.
• in implementation 1, the computing node can generate the i-th transmission strategy through a second neural network model, update the parameters of the second neural network model according to the communication tailing duration corresponding to the i-th transmission strategy, and generate the i+1th transmission strategy according to the updated second neural network model.
• the second neural network model may be a recurrent neural network (RNN) model, such as a long short-term memory (LSTM) network.
• there may be multiple ways in which the computing node updates the parameters of the second neural network model; for example, the update can be performed using the proximal policy optimization (PPO) algorithm or the asynchronous advantage actor-critic (A3C) algorithm.
• in this way, the communication tailing duration corresponding to the i-th transmission strategy is used as the reward of reinforcement learning to update the parameters of the second neural network model. Since the second neural network model (such as an RNN model) has a strong learning ability, continuously performing this process can make the communication tailing duration converge to the optimal value.
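• The learning loop can be illustrated with a deliberately simplified stand-in: treat each candidate transmission strategy as one choice, use reward = negative communication tailing duration, and update softmax preferences with a gradient-bandit rule. This is a toy sketch of the reward signal described above, not the PPO/A3C updates or the RNN model named in the text; all names and numeric settings are assumptions.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train_strategy_selector(tail_of, num_strategies, rounds=500, lr=0.2, seed=0):
    """Toy stand-in for the agent executor's loop: sample a strategy from a
    softmax over preferences, observe its communication tail via tail_of,
    and update preferences so that shorter tails become more likely."""
    rng = random.Random(seed)
    prefs = [0.0] * num_strategies
    baseline = 0.0
    for t in range(rounds):
        probs = softmax(prefs)
        # sample a strategy index from the current policy
        r, acc, choice = rng.random(), 0.0, num_strategies - 1
        for k, p in enumerate(probs):
            acc += p
            if r < acc:
                choice = k
                break
        reward = -tail_of(choice)           # shorter tail => larger reward
        baseline += (reward - baseline) / (t + 1)
        adv = reward - baseline
        for k in range(num_strategies):
            grad = (1.0 - probs[k]) if k == choice else -probs[k]
            prefs[k] += lr * adv * grad
    return prefs
```

After training against a fixed environment, the strategy with the shortest communication tail ends up with the highest preference, which is the convergence behavior the text attributes to the agent executor.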
• in implementation 1, the i-th transmission strategy may include a sub-transmission strategy for each layer of the first neural network model; the sub-transmission strategy of the n-th layer may include first information and second information, where the first information is used to indicate whether to initiate transmission after the gradient of the n-th layer parameters is calculated, and the second information is used to indicate the logical topology used for the transmission.
• the i-th transmission strategy may be a sequence in the form of {first information (whether to communicate), second information (logical topology)} repeated a certain number of times (the number depends on the number of layers of the first neural network model); each {first information (whether to communicate), second information (logical topology)} pair can be understood as the sub-transmission strategy of one layer of the first neural network model.
  • the first neural network model includes 3 layers, and the i-th transmission strategy is [ ⁇ Yes, topo0 ⁇ , ⁇ Yes, topo1 ⁇ , ⁇ Yes, topo1 ⁇ ].
• a logical topology space may be preset, and the logical topology space may include multiple logical topologies for selection; the logical topology used for transmission indicated by the second information may be one topology in the logical topology space.
• the computing node can generate a transmission strategy through the second neural network model in a self-feeding loop, in which each output of the model is fed back as its next input.
  • the specific implementation process of generating the transmission strategy will be described below with reference to FIG. 5a.
• the initial input of the second neural network model can be random content (for example, the identifier of a random logical topology in the logical topology space), and the first information of the sub-transmission strategy of the first layer can be generated according to the initial input. The first information of the sub-transmission strategy of the first layer is then used as the input of the second neural network model to generate the second information of the sub-transmission strategy of the first layer; the second information of the sub-transmission strategy of the first layer is used as the input of the second neural network model to generate the first information of the sub-transmission strategy of the second layer, and so on.
• in this way, the second neural network model generates the sub-transmission strategy of one layer of the first neural network model every two time steps, so the first transmission strategy can be generated after n*2 time steps; the computing node then sends the first transmission strategy to the W training nodes (corresponding to step 401).
• the training node receives the first transmission strategy, uses the first transmission strategy to transmit the gradients of each layer's parameters obtained in the first iteration of the first neural network model, and sends to the computing node the first communication tailing duration of transmitting, using the first transmission strategy, the gradients of each layer's parameters obtained in the first iteration of the first neural network model (corresponding to step 402); the computing node receives the first communication tailing durations sent by the W training nodes and determines the communication tailing duration corresponding to the first transmission strategy (corresponding to step 403 and step 404); the computing node updates the second neural network model according to the communication tailing duration corresponding to the first transmission strategy and, based on the updated model, generates the second transmission strategy after n*2 time steps (corresponding to step 405); the computing node sends the second transmission strategy to the W training nodes, thereby cyclically executing the above step 401 to step 405 until the training of the first neural network model is completed.
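• The self-feeding generation over n*2 time steps described above can be sketched as follows. This is illustrative: `step_model` stands in for the second neural network model (for example, one step of an LSTM cell), and the dictionary keys are hypothetical names for the first and second information.

```python
def generate_strategy(step_model, num_layers, initial_input):
    """Generate a per-layer transmission strategy autoregressively: at odd
    time steps the model emits the first information (whether to initiate
    transmission), at even time steps the second information (which
    logical topology); each output feeds back as the next input."""
    strategy, x = [], initial_input
    for _ in range(num_layers):
        communicate = step_model(x)          # first information of this layer
        topology = step_model(communicate)   # second information of this layer
        strategy.append({"communicate": communicate, "topology": topology})
        x = topology                         # output becomes the next input
    return strategy
```

With a 3-layer first neural network model, the result is a 3-element sequence of {first information, second information} pairs, matching the [{Yes, topo0}, {Yes, topo1}, {Yes, topo1}] form given earlier.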
  • FIG. 5b is an overall schematic diagram of Implementation Mode 1 provided by an embodiment of the application.
• as shown in FIG. 5b, the second neural network model and the parameter update algorithm can be run by the agent executor, and each evaluator is responsible for adding a communication operator to the reverse calculation of each layer of the first neural network model (for example, evaluator 1 is responsible for adding a communication operator to the reverse calculation of each layer of the first neural network model on training node 1), and then, according to the transmission strategy generated by the agent executor, controlling whether the communication operator initiates transmission and which logical topology it uses.
• after an iteration is completed, the evaluator feeds back the communication tailing duration corresponding to the i-th transmission strategy to the agent executor; the agent executor treats the communication tailing duration corresponding to the i-th transmission strategy as the "reward" obtained by interacting with the environment, performs a calculation according to a policy gradient method (such as PPO), and then updates the parameters of the second neural network model to generate a new transmission strategy (that is, the i+1th transmission strategy), completing a round of reinforcement learning.
  • the agent executor can generate an approximately optimal transmission strategy for the specific physical networking mode and the first neural network model.
• in implementation 2, the computing node can generate the i-th transmission strategy through the Q table (Q-Table) used to record state-action values in the Q-learning algorithm; and, according to the communication tailing duration corresponding to the i-th transmission strategy, update the Q table and generate the i+1th transmission strategy through the updated Q table.
• the Q table includes P states and Q actions: the P states correspond to P data volume thresholds, and the Q actions correspond to Q combinations each composed of a state transition amount and a logical topology, where P and Q are both integers greater than or equal to 1.
• that is, the data volume threshold is used as the state dimension of the Q table, and the state transition amount and logical topology are used as the action dimension of the Q table.
• further, the Q table in Q-learning is updated by using the communication tailing duration corresponding to the i-th transmission strategy as the reward of reinforcement learning. Since the Q-learning algorithm can take actions according to the current state and, after obtaining the corresponding rewards, improve those actions, it can gradually make better actions, that is, produce a better transmission strategy.
• in a specific implementation, the computing node may determine the minimum data volume threshold among the P data volume thresholds according to a preset transmission efficiency. For example, the preset transmission efficiency may be the lowest acceptable transmission efficiency: based on the lowest acceptable transmission efficiency, find the logical topology with the smallest transmission volume, and use the data volume corresponding to that logical topology as the minimum data volume threshold.
• the computing node can determine the maximum data volume threshold among the P data volume thresholds according to the parameter volume of the first neural network model, for example, determine a certain proportion (such as 50% or 80%) of the parameter volume of the first neural network model as the maximum data volume threshold.
• the notation in the Q table can be understood as follows:
  • "+m" means adding m to the current state; for example, if the current state is Mmin, then "+m" means transitioning to the state Mmin+m.
  • "-m" means subtracting m from the current state; for example, if the current state is Mmax, then "-m" means transitioning to the state Mmax-m.
  • "0" means keeping the current state threshold M unchanged.
  • "Topok" represents the logical topology used for transmission.
  • Q(s1,a1) represents the reward for performing the action {+m, Topo1} when the current state is Mmin; here, performing the action {+m, Topo1} refers to generating transmission strategy a, whose data volume threshold is Mmin+m and whose logical topology used for each transmission is Topo1.
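• The Q-table lookup and update can be sketched with the standard Q-learning rule, using reward = negative communication tailing duration so that actions leading to shorter tails accumulate higher Q values. The function names, the learning-rate/discount values, and the greedy selection shown are illustrative assumptions (the text also allows, for example, choosing the second-best action).

```python
def choose_action(q, state, actions):
    """Greedy selection: pick the action with the largest Q value in the
    current state's row of the Q table."""
    return max(actions, key=lambda a: q[(state, a)])

def q_update(q, state, action, next_state, tail, actions, alpha=0.5, gamma=0.9):
    """One Q-learning update with reward = -communication tailing duration:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    reward = -tail
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q
```

Here a state is a data volume threshold (e.g. Mmin) and an action is a {state transition amount, logical topology} combination (e.g. {+m, Topo1}), matching the Q-table layout described above.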
• in implementation 2, the i-th transmission strategy includes third information and fourth information. The third information is used to indicate the i-th data volume threshold for the gradients of each layer's parameters obtained in the i-th iteration of the first neural network model, and the i-th data volume threshold is used to determine the transmission timing of the gradients of each layer's parameters obtained in the i-th iteration; the fourth information is used to indicate the logical topology used for each transmission.
• for example, after the training node calculates the gradient of the n-th layer parameters of the first neural network model, if it determines that the data volume of the accumulated gradients (that is, the gradients to be transmitted) is greater than or equal to the i-th data volume threshold, it initiates a transmission and uses the logical topology indicated by the fourth information.
• the computing node can generate an initialized Q table before generating the first transmission strategy, and the values of Q(s1,a1), Q(s1,a2), ... in the initialized Q table can be a set of random numbers drawn from a Gaussian distribution.
  • according to the initialized Q table, the computing node takes the state corresponding to Mmin as the current state, determines the first target action (for example, (+m, topo2)) according to the reward values of the Q actions executed in the current state, and then generates the first transmission strategy and sends it to the W training nodes (corresponding to step 401); the first data volume threshold is the sum of the state transition amount in the combination corresponding to the first target action and the data volume threshold corresponding to the current state (i.e., Mmin+m), and the logical topology used in each transmission is the logical topology in that combination (i.e., topo2).
  • the training nodes receive the first transmission strategy, use it to transmit the gradients of the parameters of each layer obtained in the first iteration of the first neural network model, and send the resulting first communication tail duration to the computing node (corresponding to step 402). The computing node receives the first communication tail durations sent by the W training nodes and determines the communication tail duration corresponding to the first transmission strategy (corresponding to steps 403 and 404). The computing node then updates the initialized Q table according to the communication tail duration corresponding to the first transmission strategy, generates the second transmission strategy based on the updated Q table (corresponding to step 405), and sends the second transmission strategy to the W training nodes. Steps 401 to 405 are executed in a loop until the training of the first neural network model is completed.
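The loop over steps 401 to 405 can be sketched as follows. The classes below are stand-ins, the canned tail durations and the toy threshold-update rule replace the real Q-table logic, and taking the maximum of the W reported durations is an assumption, since the text does not fix how the per-node durations are combined:

```python
class TrainingNode:
    """Stand-in for a training node: runs one iteration under a strategy and
    reports its communication tail duration (here, canned values)."""
    def __init__(self, tail_durations):
        self._tails = iter(tail_durations)
    def run_iteration(self, strategy):
        return next(self._tails)

class QTableAgent:
    """Stand-in for the computing node's Q-table logic: a toy rule that keeps
    growing the threshold while the observed tail duration keeps shrinking."""
    def __init__(self, threshold=4, step=4):
        self.threshold, self.step, self.history = threshold, step, []
    def strategy(self):
        return {"threshold": self.threshold, "topology": "Topo1"}
    def update(self, tail_duration):
        self.history.append(tail_duration)
        if len(self.history) < 2 or self.history[-1] < self.history[-2]:
            self.threshold += self.step

def train(agent, nodes, iterations):
    strategy = agent.strategy()                              # step 401
    for _ in range(iterations - 1):
        tails = [n.run_iteration(strategy) for n in nodes]   # step 402
        tail = max(tails)                                    # steps 403-404
        agent.update(tail)                                   # step 405: update...
        strategy = agent.strategy()                          # ...and regenerate
    return strategy
```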
  • the computing node may determine the i-th target action based on the reward values of the Q actions. For example, it may select the action with the largest reward value among the Q actions as the i-th target action, or it may select the action with the second-largest reward value; this is not specifically limited.
  • the computing node may update the Q table according to the communication tail duration corresponding to the i-th transmission strategy, for example by using the Bellman equation; the specific update rule is not limited.
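The Bellman-equation update mentioned above can be sketched as the standard Q-learning rule. Treating the negative communication tail duration as the reward is an assumption (the text only says the duration is fed back as a reward); the learning rate and discount factor are likewise illustrative:

```python
def bellman_update(q_table, state, action, next_state,
                   tail_duration, alpha=0.1, gamma=0.9):
    """One Q-learning (Bellman) update on a dict-of-dicts Q table.
    The reward is the negative tail duration, so shorter tails score higher."""
    reward = -tail_duration
    best_next = max(q_table[next_state].values())
    q_table[state][action] += alpha * (
        reward + gamma * best_next - q_table[state][action])
```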
  • FIG. 6b is an overall schematic diagram of implementation manner 2 provided by an embodiment of the application.
  • the agent executor runs the Q-learning algorithm and related functional components; it determines and modifies the data volume threshold by interacting with the evaluators in the multiple training nodes, and updates the Q table. Each evaluator is responsible for controlling the transmission of gradients and for feeding back the communication tail duration as a reward to the agent executor.
  • for example, evaluator 1 is responsible for controlling the transmission of the gradients calculated by training node 1: when the amount of gradient data accumulated by training node 1 exceeds the data volume threshold, a transmission is initiated, and the communication tail duration of training node 1's gradient transmission is fed back to the agent executor as a reward.
  • the computing node and the training node may include hardware structures and/or software modules corresponding to each function.
  • the embodiments of the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or by computer-software-driven hardware depends on the specific application and the design constraints of the technical solution. Skilled professionals may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
  • FIG. 7 shows a possible exemplary block diagram of a device for determining a transmission strategy involved in an embodiment of the present application, and the device 700 may exist in the form of software.
  • the apparatus 700 may include: a generating unit 702 and a determining unit 703.
  • the generating unit 702 and the determining unit 703 may be collectively referred to as a processing unit, which is used to control and manage the actions of the apparatus 700.
  • the apparatus 700 may further include a communication unit 704, which is configured to support communication between the apparatus 700 and other nodes.
  • the communication unit 704 may also be referred to as a transceiver unit, and may include a receiving unit and/or a sending unit, which are used to perform receiving and sending operations, respectively.
  • the device 700 may further include a storage unit 701 for storing program codes and/or data of the device 700.
  • the generating unit 702 and the determining unit 703 may be a processor or a controller, which may implement or execute various exemplary logical blocks, modules, and circuits described in conjunction with the disclosure of the embodiments of the present application.
  • the communication unit 704 may be a communication interface, a transceiver, or a transceiver circuit, etc., where the communication interface is a general term. In a specific implementation, the communication interface may include multiple interfaces.
  • the storage unit 701 may be a memory.
  • the apparatus 700 may be the computing node in any of the foregoing embodiments.
  • the generating unit 702 and the determining unit 703 can support the apparatus 700 to execute the actions of the computing nodes in the above method examples.
  • the generating unit 702 and the determining unit 703 mainly perform the internal actions of the computing node in the method example, and the communication unit 704 may support communication between the apparatus 700 and the training node.
  • the generating unit 702 is used to perform the actions of generating the transmission strategies in step 401 and step 405 in FIG. 4; the determining unit 703 is used to perform step 404 in FIG. 4; and the communication unit 704 is used to perform the actions of sending the transmission strategies in step 403 and step 405.
  • the generating unit 702 is configured to generate an i-th transmission strategy, and the i-th transmission strategy is used to transmit the gradient of each layer parameter obtained in the i-th iteration of the first neural network model;
  • the determining unit 703 is configured to determine the communication tail duration corresponding to the i-th transmission strategy, where this duration indicates the length of time between the end time of the i-th iteration of the first neural network model and the start time of the (i+1)-th iteration;
  • the generating unit 702 is further configured to generate an (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy, where the (i+1)-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the (i+1)-th iteration of the first neural network model; where i = 1, 2, ..., X-1, X is the number of iterations of the first neural network model, and X is an integer greater than 1.
  • the generating unit 702 is specifically configured to generate the i-th transmission strategy through a second neural network model.
  • the generating unit 702 is specifically configured to: update the parameters of the second neural network model according to the communication tail duration corresponding to the i-th transmission strategy, and generate the (i+1)-th transmission strategy through the updated second neural network model.
  • the i-th transmission strategy includes a sub-transmission strategy for each layer of the first neural network model, and the sub-transmission strategy of the n-th layer includes first information and second information. The first information is used to indicate whether to initiate transmission after the gradient of the n-th layer parameters is calculated, and the second information is used to indicate the logical topology used for the transmission; n = 1, 2, ..., N, where N is the number of layers of the first neural network model, and N is an integer greater than or equal to 1.
  • the generating unit 702 generates the sub-transmission strategy of the (n+1)-th layer through the second neural network model, specifically: the second information of the sub-transmission strategy of the n-th layer is used as the input of the second neural network model to generate the first information of the sub-transmission strategy of the (n+1)-th layer, and the first information of the sub-transmission strategy of the (n+1)-th layer is used as the input of the second neural network model to generate the second information of the sub-transmission strategy of the (n+1)-th layer.
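This alternating generation can be sketched with a generic callable standing in for the second neural network model. The callable and its scalar inputs are placeholders; the text only fixes that each generated piece of information is fed back as the model's next input:

```python
def generate_sub_strategies(model, num_layers, seed_input):
    """Generate (first_info, second_info) for layers 1..num_layers, feeding
    each output of the second neural network model back in as its next input."""
    strategies, x = [], seed_input
    for _ in range(num_layers):
        first_info = model(x)            # whether to transmit after this layer
        second_info = model(first_info)  # logical topology for the transmission
        strategies.append((first_info, second_info))
        x = second_info                  # layer n's second info seeds layer n+1
    return strategies
```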
  • the generating unit 702 is specifically configured to:
  • the i-th transmission strategy is generated through the Q table used to record state-action pairs in the Q-learning algorithm; the Q table includes P states and Q actions, where the P states correspond to P data volume thresholds and the Q actions correspond to Q combinations of a state transition amount and a logical topology; P and Q are integers greater than or equal to 1.
  • the generating unit 702 is specifically configured to:
  • the Q table is updated according to the communication tail duration corresponding to the i-th transmission strategy, and the (i+1)-th transmission strategy is generated through the updated Q table.
  • the i-th transmission strategy includes third information and fourth information; the third information is used to indicate the i-th data volume threshold for transmitting the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model, and the i-th data volume threshold is used to determine the transmission timing of those gradients; the fourth information is used to indicate the logical topology used for each transmission.
  • the generating unit 702 is specifically configured to:
  • according to the Q table, the reward values for executing the Q actions in the state corresponding to the (i-1)-th data volume threshold are obtained, the i-th target action is determined according to those reward values, and the i-th transmission strategy is generated; the i-th data volume threshold is the sum of the state transition amount in the combination corresponding to the i-th target action and the (i-1)-th data volume threshold, and the logical topology used for each transmission is the logical topology in the combination corresponding to the i-th target action.
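A hedged sketch of this selection step follows. The option of taking the second-largest reward mirrors the text, while the data layout (a list of reward values parallel to the action list) is an assumption:

```python
def select_target_action(rewards, actions, prev_threshold, take_best=True):
    """Pick the i-th target action from the reward values of the Q actions
    (largest reward, or optionally the second largest), then derive the
    i-th data volume threshold and the logical topology from its combination."""
    ranked = sorted(range(len(actions)), key=lambda j: rewards[j], reverse=True)
    j = ranked[0] if take_best else ranked[1]
    delta, topology = actions[j]
    return prev_threshold + delta, topology
```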
  • the maximum data volume threshold among the P data volume thresholds is determined according to the parameter volume of the first neural network model, and/or the minimum data volume threshold among the P data volume thresholds is determined according to a preset transmission efficiency.
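To make these bounds concrete, here is a hedged sketch. The ratio, link bandwidth, and minimum useful transfer time are illustrative assumptions, since the text only says that the maximum follows from the parameter volume and the minimum from a preset transmission efficiency:

```python
def threshold_bounds(param_count, bytes_per_param=4,
                     max_ratio=0.5, link_bandwidth=1.25e9, min_transfer_s=0.01):
    """Illustrative bounds on the P data volume thresholds (in bytes).

    The maximum threshold is a ratio (e.g. 50% or 80%) of the model's
    parameter volume; the minimum is the smallest transfer that still uses
    the link efficiently, derived here from an assumed bandwidth and a
    minimum useful transfer time (both values are made up for illustration).
    """
    model_bytes = param_count * bytes_per_param
    max_threshold = max_ratio * model_bytes
    min_threshold = link_bandwidth * min_transfer_s
    return min_threshold, max_threshold
```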
  • the division of modules in the embodiments of the present application is illustrative and is only a logical function division; there may be other division methods in actual implementation.
  • the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules.
  • if the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • a computer-readable storage medium includes several instructions to enable a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium may be various mediums capable of storing program codes, such as a memory.
  • the apparatus may be the above-mentioned computer device for executing the actions performed by a computing node, or a semiconductor chip provided in that computer device.
  • the device 800 includes a memory 801, a processor 802, and a communication interface 803.
  • the processor 802 has a function of implementing the actions performed by the generating unit 702 and the determining unit 703 in FIG. 7.
  • the apparatus 800 may further include a bus 804.
  • the communication interface 803, the processor 802, and the memory 801 may be connected to each other through the communication line 804; the communication line 804 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The communication line 804 can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in FIG. 8, but this does not mean that there is only one bus or only one type of bus.
  • the processor 802 may be one or more CPUs (or GPUs), or one or more integrated circuits for controlling the execution of programs in the solutions of the present application.
  • the communication interface 803 uses any device such as a transceiver to communicate with the training node.
  • the memory 801 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • the memory can exist independently and is connected to the processor through a communication line 804.
  • the memory can also be integrated with the processor.
  • the memory 801 is used to store computer-executed instructions for executing the solutions of the present application, and the processor 802 controls the execution.
  • the processor 802 is configured to execute computer-executable instructions stored in the memory 801, so as to implement the method provided in the foregoing embodiment of the present application.
  • the computer-executable instructions in the embodiments of the present application may also be referred to as application program code, which is not specifically limited in the embodiments of the present application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
  • each process and/or block in the flowchart and/or block diagram, and the combination of processes and/or blocks in the flowchart and/or block diagram can be implemented by computer program instructions.
  • These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Abstract

A method and an apparatus for determining a transmission policy, relating to data training in the field of artificial intelligence. The method for determining a transmission policy comprises: generating an ith transmission policy, the ith transmission policy being used to transmit the gradient of parameters of layers obtained by an ith iteration of a first neural network model; determining a communication trailing duration corresponding to the ith transmission policy, the communication trailing duration corresponding to the ith transmission policy being used to indicate a duration between the end time of the ith iteration of the first neural network model and the start time of the (i+1)th iteration; and generating an (i+1)th transmission policy according to the communication trailing duration corresponding to the ith transmission policy. Thus, after the ith transmission policy is generated, a communication trailing duration corresponding to the ith transmission policy can be obtained, so a round of reinforcement learning can be completed on the basis of the communication trailing duration corresponding to the ith transmission policy, so that the generated (i+1)th transmission policy tends toward an optimal transmission policy, facilitating the improvement of the efficiency of distributed training.

Description

Method and device for determining a transmission strategy

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a method and device for determining a transmission strategy.

Background

Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
With the rapid development of artificial intelligence, neural network models trained on large data sets have achieved breakthrough improvements and widespread application in many fields. Because training a neural network model through continuous iteration is a typical computation-intensive task that requires a large amount of calculation, the training process is very time-consuming. Although graphics processing unit (GPU) hardware technology, network model structures, and training methods have all made progress in recent years, the fact that single-machine (or single-node) training takes too long cannot be avoided. Moreover, research shows that the performance of a neural network model grows linearly with the scale of the training data, and in the future the scale of training data may reach the PB or ZB level. As the scale of the training data and of the model parameters of the neural network model becomes larger and larger, the growth rate of the memory (or video memory) of a single machine will not be able to keep up. As a result, training a neural network model on a single machine can no longer meet the requirements.

Because distributed training has good flexibility and scalability and can effectively combine single-machine resources, it has become an effective means of solving the above problems. There are two main strategies for distributed training: model-parallel training and data-parallel training. Model parallelism divides the neural network model into multiple parts and hands each part to a training node for training, but there is a large amount of communication between training nodes and partitioning the model is difficult. Data-parallel training divides the training data into multiple training data sets and hands them to multiple training nodes for training, without partitioning the model. Therefore, data-parallel training is an effective strategy for distributed training on large-scale training data.

In data-parallel training, multiple training nodes need to send the calculated gradients out for an aggregation operation during the backward pass of network training. However, how to transmit the gradients calculated by multiple training nodes so as to improve the efficiency of distributed training still requires further research.
Summary of the invention

The embodiments of the present application provide a method and device for determining a transmission strategy; transmitting gradients according to the determined transmission strategy can effectively improve the efficiency of distributed training.

In a first aspect, an embodiment of the present application provides a method for determining a transmission strategy. The method may be executed by a computing node and includes:

generating an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the i-th iteration of a first neural network model; determining a communication tail duration corresponding to the i-th transmission strategy, where the communication tail duration corresponding to the i-th transmission strategy indicates the duration between the end time of the i-th iteration of the first neural network model and the start time of the (i+1)-th iteration; and generating an (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy, where the (i+1)-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the (i+1)-th iteration of the first neural network model; where i = 1, 2, ..., X-1, X is the number of iterations of the first neural network model, and X is an integer greater than 1.

With the above method, after generating the i-th transmission strategy, the computing node can obtain the communication tail duration corresponding to the i-th transmission strategy, and can therefore complete one round of reinforcement learning based on that duration, so that the generated (i+1)-th transmission strategy tends toward the optimal transmission strategy (that is, the transmission strategy that minimizes the communication tail duration), which helps improve the efficiency of distributed training.
In a possible design, after the i-th transmission strategy is generated, the method further includes: sending the i-th transmission strategy to W training nodes, where the W training nodes are used for distributed training of the first neural network model.

Determining the communication tail duration corresponding to the i-th transmission strategy includes: receiving the i-th communication tail durations taken by the W training nodes to transmit, using the i-th transmission strategy, the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model; and determining the communication tail duration corresponding to the i-th transmission strategy according to the W i-th communication tail durations, where W is an integer greater than or equal to 1.
In a possible design, generating the i-th transmission strategy includes: generating the i-th transmission strategy through a second neural network model.

In a possible design, generating the (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy includes: updating the parameters of the second neural network model according to the communication tail duration corresponding to the i-th transmission strategy, and generating the (i+1)-th transmission strategy through the updated second neural network model.

With the above method, the communication tail duration corresponding to the i-th transmission strategy is used as the reinforcement-learning reward to update the parameters of the second neural network model. Since the second neural network model has a strong learning capability, continuously executing this process can make the communication tail duration converge to the optimal value.
In a possible design, the i-th transmission strategy includes a sub-transmission strategy for each layer of the first neural network model, and the sub-transmission strategy of the n-th layer includes first information and second information, where the first information is used to indicate whether to initiate transmission after the gradient of the n-th layer parameters is calculated, and the second information is used to indicate the logical topology used for the transmission; n = 1, 2, ..., N, N is the number of layers of the first neural network model, and N is an integer greater than or equal to 1.

In a possible design, generating the sub-transmission strategy of the (n+1)-th layer through the second neural network model is specifically: using the second information of the sub-transmission strategy of the n-th layer as the input of the second neural network model to generate the first information of the sub-transmission strategy of the (n+1)-th layer, and using the first information of the sub-transmission strategy of the (n+1)-th layer as the input of the second neural network model to generate the second information of the sub-transmission strategy of the (n+1)-th layer.
In a possible design, generating the i-th transmission strategy includes: generating the i-th transmission strategy through the Q table used to record state-action pairs in the Q-learning algorithm; the Q table includes P states and Q actions, where the P states correspond to P data volume thresholds and the Q actions correspond to Q combinations of a state transition amount and a logical topology; P and Q are integers greater than or equal to 1.

In a possible design, generating the (i+1)-th transmission strategy includes: updating the Q table according to the communication tail duration corresponding to the i-th transmission strategy, and generating the (i+1)-th transmission strategy through the updated Q table.

With the above method, by constructing the Q table in the Q-learning algorithm, that is, using the data volume threshold as the state-dimension information of the Q table and using the state transition amount and logical topology as the action-dimension information, the i-th transmission strategy can be generated from the Q table. Further, the Q table is updated by using the communication tail duration corresponding to the i-th transmission strategy as the reinforcement-learning reward. Since the Q-learning algorithm can take actions according to the current state and then improve these actions after obtaining the corresponding rewards, it can take better actions, that is, obtain a better transmission strategy.
In a possible design, the i-th transmission strategy includes third information and fourth information. The third information indicates the i-th data-volume threshold for transmitting the gradients of the layer parameters obtained in the i-th iteration of the first neural network model, where the i-th data-volume threshold is used to determine the transmission timing of those gradients; the fourth information indicates the logical topology used for each transmission.

In a possible design, generating the i-th transmission strategy through the Q table includes: obtaining, from the Q table, the reward values of executing the Q actions in the state corresponding to the (i-1)-th data-volume threshold, determining the i-th target action according to those reward values, and generating the i-th transmission strategy. The i-th data-volume threshold is the sum of the (i-1)-th data-volume threshold and the state-transition amount in the combination corresponding to the i-th target action, and the logical topology used for each transmission is the logical topology in that combination.

In a possible design, the maximum of the P data-volume thresholds is determined according to the parameter quantity of the first neural network model, and/or the minimum of the P data-volume thresholds is determined according to a preset transmission efficiency.
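The Q-table design above can be illustrated with a minimal sketch. All concrete values below (the threshold list, the transition amounts, the two topology names, and the learning-rate/discount constants) are hypothetical placeholders, not values given in this application; the reward is taken to be the negative communication tail duration, consistent with shorter tails being better.

```python
import random

# Hypothetical Q table: states are P data-volume thresholds, actions are
# Q (state-transition amount, logical topology) combinations.
STATES = [1, 2, 4, 8, 16]                        # P data-volume thresholds (e.g. MB)
ACTIONS = [(delta, topo) for delta in (-1, 0, 1)
           for topo in ("ring", "tree")]         # Q combinations

q_table = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def choose_action(state, epsilon=0.1):
    """Pick the action with the highest recorded reward value (epsilon-greedy)."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])

def update(state, action, tail_duration, next_state, alpha=0.5, gamma=0.9):
    """Standard Q-learning update, with the measured communication tail
    duration of the i-th strategy fed back as a (negative) reward."""
    reward = -tail_duration
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    q_table[(state, action)] += alpha * (reward + gamma * best_next
                                         - q_table[(state, action)])
```

The target action's transition amount would then be added to the previous data-volume threshold to obtain the i-th threshold, and its topology used for each transmission, as described above.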
According to a second aspect, an embodiment of this application provides a method for determining a transmission strategy. The method may be executed by a compute node and includes:

generating an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the layer parameters obtained in the i-th iteration of a first neural network model; determining a communication tail duration corresponding to the i-th transmission strategy, which indicates the duration between the end of the i-th iteration and the start of the (i+1)-th iteration of the first neural network model; and, if it is determined that the communication tail duration corresponding to the i-th transmission strategy is greater than a first threshold, generating an (i+1)-th transmission strategy according to that communication tail duration, where the (i+1)-th transmission strategy is used to transmit the gradients of the layer parameters obtained in the (i+1)-th iteration; here i = 1, 2, ..., X-1, X is the number of iterations of the first neural network model, and X is an integer greater than 1.

In a possible design, the method further includes:

if it is determined that the communication tail duration corresponding to the i-th transmission strategy is less than or equal to the first threshold, using the i-th transmission strategy as the (i+1)-th transmission strategy.

With the above method, after several rounds of reinforcement learning, if the generated i-th transmission strategy already makes the distributed training efficient (that is, the i-th transmission strategy is a sufficiently good strategy), the compute node may stop generating new strategies; accordingly, the training nodes can use the same transmission strategy (the i-th one) to transmit gradients in the subsequent iterations of the first neural network model. This effectively reduces the processing burden of the compute node, and because the training nodes transmit gradients based on the same strategy without having to receive newly generated strategies from the compute node, it also effectively improves the efficiency of the distributed training.
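The compute node's control flow in this aspect can be sketched as the loop below. The function names and the linear `refine`/`measure_tail` callbacks are illustrative assumptions, not part of this application; they only show the "refine while the tail exceeds the first threshold, otherwise reuse" decision.

```python
def train_with_strategy_search(generate_first, refine, measure_tail,
                               iterations, first_threshold):
    """Sketch of the second-aspect flow: generate a strategy, measure its
    communication tail duration, and refine it only while the tail is
    above the first threshold; otherwise keep reusing the same strategy."""
    strategy = generate_first()
    history = []
    for _ in range(iterations):
        tail = measure_tail(strategy)           # tail under the current strategy
        history.append((strategy, tail))
        if tail > first_threshold:
            strategy = refine(strategy, tail)   # strategy i+1 from the feedback
        # else: strategy i is reused as strategy i+1
    return history
```

With a toy model where refining shortens the tail, the strategy converges and then stays fixed, matching the "no new strategy is generated" behavior described above.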
According to a third aspect, an embodiment of this application provides a method for transmitting gradients. The method may be executed by a training node and includes:

obtaining an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the layer parameters obtained in the i-th iteration of a first neural network model; and

using the i-th transmission strategy to transmit the gradients of the layer parameters obtained in the i-th iteration of the first neural network model.

In a possible design, the i-th transmission strategy includes a sub-strategy for each layer of the first neural network model; the sub-strategy of the n-th layer includes first information and second information, where the first information indicates whether a transmission is initiated after the gradients of the n-th layer parameters are computed, and the second information indicates the logical topology used for the transmission.

Using the i-th transmission strategy to transmit the gradients obtained in the i-th iteration then includes: in the i-th iteration, after the gradients of each layer's parameters are computed, transmitting them according to that layer's sub-strategy. For example, if the first information of the n-th layer's sub-strategy indicates that a transmission is initiated after the gradients of the n-th layer parameters are computed, then once those gradients are computed, the gradients to be transmitted are sent over the logical topology indicated by the second information of the n-th layer's sub-strategy.
In a possible design, the i-th transmission strategy includes third information and fourth information; the third information indicates the i-th data-volume threshold for transmitting the gradients of the layer parameters obtained in the i-th iteration of the first neural network model, where the i-th data-volume threshold is used to determine the transmission timing of those gradients, and the fourth information indicates the logical topology used for each transmission.

Using the i-th transmission strategy to transmit the gradients obtained in the i-th iteration then includes: in the i-th iteration, after the gradients of each layer's parameters are computed, if it is determined that the data volume of the gradients to be transmitted is greater than or equal to the i-th data-volume threshold, transmitting those gradients over the logical topology indicated by the fourth information.
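The threshold-based timing can be sketched as follows. The layer names, sizes, and the `send` callback are hypothetical; the sketch only shows the rule that buffered gradients are flushed once their accumulated data volume reaches the i-th threshold, with the remainder flushed at the end of the iteration.

```python
def transmit_by_threshold(layer_gradients, threshold, send):
    """Buffer per-layer gradients as they are computed (back to front) and
    flush the buffer over the indicated topology whenever its accumulated
    size reaches the i-th data-volume threshold."""
    buffer, size = [], 0
    for name, grad_size in layer_gradients:   # gradients arrive layer by layer
        buffer.append(name)
        size += grad_size
        if size >= threshold:
            send(buffer)                      # transmit over the fourth-info topology
            buffer, size = [], 0
    if buffer:
        send(buffer)                          # flush what remains at iteration end
```

For example, with layers of sizes 2, 1, and 4 and a threshold of 3, the first two layers are batched into one transmission and the third layer triggers its own.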
In a possible design, obtaining the i-th transmission strategy includes: receiving the i-th transmission strategy sent by a compute node.

After using the i-th transmission strategy to transmit the gradients of the layer parameters obtained in the i-th iteration of the first neural network model, the method further includes: sending to the compute node the i-th communication tail duration of transmitting, with the i-th transmission strategy, the gradients of the layer parameters obtained in the i-th iteration.

In a possible design, after sending the i-th communication tail duration to the compute node, the method further includes:

receiving the (i+1)-th transmission strategy sent by the compute node, where the (i+1)-th transmission strategy is generated according to the communication tail duration corresponding to the i-th transmission strategy, and the communication tail duration corresponding to the i-th transmission strategy is obtained from the i-th communication tail duration; here i = 1, 2, ..., X-1, and X is the number of iterations of the first neural network model.
According to a fourth aspect, an embodiment of this application provides an apparatus. The apparatus may be a compute node or a training node, the computer device where the compute node or training node resides, or a semiconductor chip disposed in that computer device. The apparatus has the functions of implementing the possible designs of the first to third aspects. The functions may be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more units or modules corresponding to the above functions.

According to a fifth aspect, an embodiment of this application provides an apparatus that includes a processor, a memory, and instructions stored in the memory and executable on the processor; when the instructions are executed, the apparatus performs the methods described in the possible designs of the first to third aspects.

According to a sixth aspect, an embodiment of this application provides a computer-readable storage medium that includes instructions which, when run on a computer, cause the computer to perform the method described in any possible design of the first to third aspects.

According to a seventh aspect, an embodiment of this application provides a computer program product which, when run on a computer, causes the computer to perform the method described in any possible design of the first to third aspects.
These and other aspects of this application will be clearer and easier to understand in the description of the following embodiments.
Description of the Drawings

Fig. 1 is a schematic diagram of an artificial intelligence main framework provided by an embodiment of this application;

Fig. 2a is a schematic diagram of a first neural network model provided by an embodiment of this application;

Fig. 2b is a schematic diagram of a centralized distributed training system provided by an embodiment of this application;

Fig. 2c is a schematic diagram of a decentralized distributed training system provided by an embodiment of this application;

Fig. 2d is a schematic diagram of transmission in the decentralized distributed training system provided by an embodiment of this application;

Fig. 2e is a possible schematic diagram of asynchronous parallel computation and communication provided by an embodiment of this application;

Fig. 2f is a schematic diagram of the relationship between the amount of transmitted data and the communication tail duration provided by an embodiment of this application;

Fig. 3 is a schematic diagram of an architecture to which this application is applicable;

Fig. 4 is a schematic flowchart corresponding to a method for determining a transmission strategy provided by an embodiment of this application;

Fig. 5a is a schematic diagram of a second neural network model generating a transmission strategy;

Fig. 5b is an overall schematic diagram of implementation 1 provided by an embodiment of this application;

Fig. 6a is a schematic diagram of determining a data-volume threshold;

Fig. 6b is an overall schematic diagram of implementation 2 provided by an embodiment of this application;

Fig. 7 is a possible exemplary block diagram of an apparatus for determining a transmission strategy involved in an embodiment of this application;

Fig. 8 is a schematic diagram of an apparatus for determining a transmission strategy provided by an embodiment of this application.
Detailed Description

To make the objectives, technical solutions, and advantages of this application clearer, this application is further described in detail below with reference to the accompanying drawings.

First, some terms used in this application are explained to facilitate understanding by those skilled in the art.
(1) An artificial neural network (ANN), or simply neural network (NN), is, in machine learning and cognitive science, a mathematical or computational model that imitates the structure and function of biological neural networks (the central nervous system of animals, especially the brain) and is used to estimate or approximate functions. A neural network performs computation through a large number of connected artificial neurons. In most cases, an artificial neural network can change its internal structure based on external information; it is an adaptive system with, informally speaking, the capability to learn.

(2) A loss function is, in statistics, a function that measures the degree of loss or error. In a neural network, it can be understood as a function that measures the difference between the values predicted by the model and the label values of the training data; a neural network model can be trained with the goal of minimizing the loss function.
(3) Gradient descent is a first-order optimization algorithm, also commonly called the method of steepest descent. To find a local minimum of a function using gradient descent, one searches iteratively, stepping a prescribed distance from the current point in the direction opposite to the gradient (or approximate gradient) of the function at that point.

(4) A gradient is a vector; it is the direction in which the directional derivative of a function at a point attains its maximum. In a neural network, each parameter can be updated based on its gradient, thereby gradually approaching the minimum of the network's loss function.

(5) The backpropagation (BP) algorithm, short for "error backpropagation algorithm", is a common method for training artificial neural networks, used in combination with an optimization method such as gradient descent. It computes the gradient of the loss function with respect to all the weights in the neural network and feeds the gradients to the optimization method, which updates the weights to minimize the loss function.
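Definitions (3)-(5) can be illustrated with a one-parameter example. The loss function, learning rate, and step count below are illustrative choices, not values from this application.

```python
def gradient_descent(grad_fn, w0, lr=0.1, steps=50):
    """Minimal gradient descent per definition (3): repeatedly step against
    the gradient direction. grad_fn returns dL/dw at the current point w."""
    w = w0
    for _ in range(steps):
        w -= lr * grad_fn(w)
    return w

# Example loss L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3); the minimum is at w = 3.
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

In an actual network, backpropagation plays the role of `grad_fn`, producing the gradient of the loss with respect to every weight, layer by layer from the output back to the input.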
(6) Training node: also called a worker or working node. A training node may be a GPU or a central processing unit (CPU), without specific limitation. Compute node: may likewise be a GPU or a CPU, without specific limitation.

(7) GPU: also called a display core, vision processor, display chip, or graphics chip, is a microprocessor that specializes in graphics computation on personal computers, workstations, game consoles, and some mobile devices (such as tablet computers and smartphones).

(8) Ordinal numbers such as "first" and "second" in the embodiments of this application are only for convenience of description; they neither limit the scope of the embodiments nor indicate a sequence. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" can mean: A alone, both A and B, or B alone. "At least one" means one or more; "at least two" means two or more. "At least one of", "any one of", and similar expressions refer to any combination of the listed items, including any combination of single items or plural items. For example, "at least one of a, b, or c" can mean: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be singular or plural.
Fig. 1 shows a schematic diagram of an artificial intelligence main framework. The framework describes the overall workflow of an artificial intelligence system and is applicable to general requirements in the artificial intelligence field.

The main framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).

The "intelligent information chain" reflects the sequence of processes from data acquisition to processing, for example the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes the condensation from "data" to "information" to "knowledge" to "wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of intelligence and information (providing and processing technical realizations) to the industrial ecology of the system.
(1) Infrastructure: the infrastructure provides computing-power support for the artificial intelligence system, realizes communication with the outside world, and provides support through the basic platform. Communication with the outside is done through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, and FPGA); the basic platform includes platform assurance and support such as a distributed computing framework and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to acquire data, and the data is provided to the intelligent chips of the basic platform for computation.

(2) Data: the data above the infrastructure layer indicates the data sources of the artificial intelligence field. The data involves graphics, images, speech, and text, as well as Internet-of-Things data from traditional devices, including business data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.

(3) Data processing: data processing usually includes methods such as data training (for example deep learning and reinforcement learning), searching, reasoning, and decision-making.

Deep learning and reinforcement learning are important parts of artificial intelligence, and both belong to machine learning. Deep learning refers to using existing data to train an algorithm to find patterns that solve the corresponding problem, and then using those patterns to make predictions on new data. Reinforcement learning learns mainly through trial and error, that is, it determines the best answer by performing actions a limited number of times so as to maximize the reward. The difference between the two is that deep learning learns from a training set and then applies the learned knowledge to new data sets, which is static learning, whereas reinforcement learning adjusts its own actions through continuous feedback to obtain the best result, which is a process of constant trial and error and dynamic learning. It should be noted that deep learning and reinforcement learning are not mutually exclusive concepts; the two can be used in combination, for example deep learning can be used within reinforcement learning.

Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formalized information to perform machine thinking and problem solving according to reasoning control strategies; typical functions are searching and matching. Decision-making refers to the process of making decisions on intelligent information after reasoning, and usually provides functions such as classification, ranking, and prediction.

(4) General capabilities: after the data has undergone the data processing mentioned above, some general capabilities can be formed based on the processing results, for example an algorithm or a general system, such as translation, text analysis, computer-vision processing, speech recognition, and image recognition.

(5) Intelligent products and industry applications: intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The application fields mainly include intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, autonomous driving, safe city, and intelligent terminals.
The embodiments of this application mainly study the data-training part of the framework shown in Fig. 1, and in particular the question of how to transmit the computed gradients during the distributed training of a first neural network model with training data, so as to improve the efficiency of the distributed training.
Fig. 2a is a schematic diagram of the first neural network model. As shown in Fig. 2a, the first neural network model includes multiple layers, and each layer includes at least one parameter. Training the first neural network model means determining optimal parameter values from massive training data, so that the difference between the actual output data produced by the model for the training data and the expected output data meets the requirements. In Fig. 2a, the first neural network model includes N layers, from layer 1 to layer N, and each layer has its place in the order: layer 1 directly receives the training data, and layer N directly outputs the data. In one example, the first neural network model can be trained with the backpropagation algorithm; for one iteration, this specifically includes: inputting the training data; computing the actual output data from layer 1 to layer N according to the training data (the forward computation); computing the loss function value according to the difference between the actual output data and the expected output data; and computing the gradients of the parameters from layer N to layer 1 according to the loss function value and updating the parameters with the gradients (the backward computation).
Taking data-parallel distributed training of the first neural network model with multiple training nodes as an example: in one iteration, for any parameter (say parameter a), the gradients of parameter a computed by the different training nodes may differ, so the training nodes need to transmit the computed gradients of parameter a in order to determine the gradient average. The training nodes can then obtain parameter a updated with the gradient average. After updating the parameters of each layer, the training nodes use the updated parameters of each layer to perform the next iteration of the first neural network model.
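The averaging step described above can be sketched in a few lines. The element-wise averaging and the simple gradient-descent update rule are illustrative; as noted later, the specific update manner is not limited in this application.

```python
def average_gradients(per_node_grads):
    """Each node computed its own gradient vector for the same parameters;
    the exchanged gradients are averaged element-wise across the nodes."""
    n = len(per_node_grads)
    length = len(per_node_grads[0])
    return [sum(g[i] for g in per_node_grads) / n for i in range(length)]

def apply_update(params, avg_grads, lr=0.1):
    """One possible update: step each parameter against the averaged gradient."""
    return [p - lr * g for p, g in zip(params, avg_grads)]
```

After this update, every node holds the same parameter values and can start the next iteration from an identical model copy.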
Further, when multiple training nodes perform data-parallel distributed training of the first neural network model, the distributed training can be carried out in a centralized manner or in a decentralized manner. The two manners are described in detail below.
Fig. 2b is a schematic diagram of a centralized distributed training system. The system includes a central server (which may also be called a parameter server or central node) and at least one training node (such as training node 1, training node 2, and training node 3 shown in Fig. 2b); the parameter server can communicate with the at least one training node. Each training node holds a copy of the first neural network model and can train it with designated data blocks (shards).

Specifically, in one example, the distributed training process is: the training data set is split into multiple data blocks; the j-th data block is divided into 3 mini-batches, which are trained by the 3 training nodes respectively. During training, each training node can send the gradients of the computed parameters to the parameter server according to the same rule (for example, after a training node computes the gradients of one layer's parameters, it sends those gradients to the parameter server). Taking parameter a as an example, the parameter server can determine the gradient average of parameter a from the received gradients, update parameter a based on that average (the specific update manner is not limited), and feed the updated parameter a back to the 3 training nodes. In this way, the 3 training nodes complete the update of all layer parameters of the first neural network model and, based on the updated parameters, can perform the next iteration using the mini-batches of the (j+1)-th data block.
Fig. 2c is a schematic diagram of a decentralized distributed training system. The system includes at least one training node (such as training node 1, training node 2, training node 3, training node 4, and training node 5 shown in Fig. 2c). The training nodes can communicate with one another, for example to transmit gradients.

Exemplarily, the training nodes may have an order in which they transmit data to one another; for example, training node 1 can only transmit data to training node 2, training node 2 can only transmit data to training node 3, and training node 3 can only transmit data to training node 1. The data-transmission order among the training nodes may be pre-configured, or may be computed and determined by the training nodes according to a specific rule.
Specifically, in an example, the distributed training process may be as follows: the training data set is divided into multiple data blocks; for the j-th data block, the j-th data block may be divided into 5 mini-batches, which are trained separately by the 5 training nodes. During training, each training node may send the computed parameter gradients to the other training nodes according to the data transmission order. When one training node transmits its computed gradients to another training node, it may divide the gradients into multiple groups by quantity; each group is a slice, each slice includes at least one gradient, and the number of slices is the same as the number of training nodes training the first neural network model. If there are 5 training nodes, the buffered gradients are divided into 5 slices. When the buffered gradients are divided into slices, the number of gradients in each slice is generally the same; of course, when the gradients cannot be divided evenly, the number of gradients in each slice may be roughly the same. It should be noted that every training node uses the same slicing rule. For example, suppose there are 5 training nodes in total: if each training node buffers 10 gradients to be transmitted, each slice may include 2 gradients; if each training node buffers 11 gradients to be transmitted, the gradients may be cut into 5 slices of sizes 2, 2, 2, 2, and 3, that is, 4 slices each include 2 gradients and 1 slice includes 3 gradients.
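The even-as-possible slicing rule described above can be sketched as follows; the function name and the list-based representation of gradients are illustrative, not part of the application:

```python
def split_into_slices(gradients, num_nodes):
    """Split a list of gradients into num_nodes slices whose sizes
    differ by at most one, matching the slicing rule described above."""
    base, extra = divmod(len(gradients), num_nodes)
    slices, start = [], 0
    for k in range(num_nodes):
        # When the count is not evenly divisible, the last `extra`
        # slices each receive one additional gradient (e.g. 2,2,2,2,3).
        size = base + (1 if k >= num_nodes - extra else 0)
        slices.append(gradients[start:start + size])
        start += size
    return slices
```

For 11 gradients and 5 nodes this yields slice sizes 2, 2, 2, 2, 3, as in the example above.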
As shown in Figure 2d, training node i cuts its buffered gradients into 5 slices, identified as ai to ei, where i ranges from 1 to 5 and ai to ei are the slice identifiers. That is, training node 1 cuts its buffered gradients into 5 slices a1 to e1, training node 2 cuts its buffered gradients into 5 slices a2 to e2, and so on. For a given slice identifier (for example a, b, c, d, or e), the parameters corresponding to the gradients included in the slices with that identifier are the same across nodes. For example, if slice a1 includes two gradients, namely the gradients of parameter R and parameter Y, then slices a2, a3, a4, and a5 each also include two gradients, which are likewise the gradients of parameter R and parameter Y.
When training node 1 transmits its buffered gradients to training node 2, training node 1 first sends slice a1 to training node 2. After receiving slice a1 from training node 1, training node 2 adds the received slice a1 to its own slice a2 and sends the sum to training node 3 as a single slice a1+a2. Slice a1 includes the gradients of parameter R and parameter Y, denoted r1 and y1 respectively; slice a2 includes the gradients of parameter R and parameter Y, denoted r2 and y2 respectively. Sending the sum of slice a1 and slice a2 to training node 3 as slice a1+a2 may mean that, for parameter R, the sum of its two gradients r1 and r2 is carried in slice a1+a2 as the gradient of parameter R, and for parameter Y, the sum of its two gradients y1 and y2 is carried in slice a1+a2 as the gradient of parameter Y. After receiving slice a1+a2 from training node 2, training node 3 adds it to its own slice a3 and sends the sum to training node 4 as slice a1+a2+a3. After receiving slice a1+a2+a3 from training node 3, training node 4 adds it to its own slice a4 and sends the sum to training node 5 as slice a1+a2+a3+a4. After receiving slice a1+a2+a3+a4 from training node 4, training node 5 may add it to its own slice a5 and send the sum to training node 1 as slice a1+a2+a3+a4+a5; training node 5 also calculates the gradient average from slice (a1+a2+a3+a4) and its own slice a5, and sends the gradient average calculated for slice a to training node 1. Training node 1 then sends the gradient average calculated for slice a to training node 2, training node 2 sends it to training node 3, and training node 3 sends it to training node 4. In this way, training nodes 1 to 5 all obtain the gradient average calculated for slice a, and can use it to update the values of parameter R and parameter Y corresponding to slice a, for use in the next iteration.
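The slice-a exchange described above is the reduce phase of a ring-style all-reduce followed by circulating the average back around the ring. A minimal single-process simulation of the arithmetic (not of the actual inter-node communication) might look like this; the data values are illustrative:

```python
def ring_allreduce_average(node_slices):
    """Simulate the arithmetic of the ring exchange above for one slice
    identifier: node_slices[i] is node i's local slice (a list of gradient
    values for the same parameters). The partial sum travels around the
    ring, the last node averages it, and the average then circulates so
    that every node holds it. Returns that per-position average."""
    w = len(node_slices)
    acc = list(node_slices[0])          # node 1 sends its slice first
    for i in range(1, w):
        # Each node adds its own slice to the received partial sum.
        acc = [a + b for a, b in zip(acc, node_slices[i])]
    # The final node on the ring computes the average and forwards it.
    return [a / w for a in acc]
```

For five nodes whose slice-a values are [1,2], [3,4], [5,6], [7,8], [9,10], every node ends up holding the average [5.0, 6.0].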
When training node 2 transmits its buffered gradients to training node 3, training node 2 first sends slice b2 to training node 3. After receiving slice b2 from training node 2, training node 3 adds the received slice b2 to its own slice b3 and sends the sum to training node 4 as a single slice b2+b3; and so on, following a process similar to the above, until training node 1 sends the calculated sum as a single slice (b1+b2+b3+b4+b5) to training node 5; alternatively, training node 1 calculates the gradient average for slice b and sends the gradient average calculated for slice b to training node 2. Training node 2 then sends the gradient average calculated for slice b to training node 3, training node 3 sends it to training node 4, and training node 4 sends it to training node 5. In this way, training nodes 1 to 5 all obtain the gradient average calculated for slice b, and can use it to update the parameter values corresponding to slice b, for use in the next iteration.
Similarly, training node 3 first sends slice c3 to training node 4; ...; training node 2 calculates the gradient average for slice c, and training node 2 sends the gradient average calculated for slice c to training node 3; ...; until training nodes 1 to 5 have all obtained the gradient average calculated for slice c, whereupon training nodes 1 to 5 may use it to update the parameter values corresponding to slice c.
Similarly, training node 4 first sends slice d4 to training node 5; ...; training node 3 calculates the gradient average for slice d, and training node 3 sends the gradient average calculated for slice d to training node 4; ...; until training nodes 1 to 5 have all obtained the gradient average calculated for slice d, whereupon training nodes 1 to 5 may use it to update the parameter values corresponding to slice d.
Similarly, training node 5 first sends slice e5 to training node 1; ...; training node 4 calculates the gradient average for slice e, and training node 4 sends the gradient average calculated for slice e to training node 5; ...; until training nodes 1 to 5 have all determined the gradient average calculated for slice e, whereupon training nodes 1 to 5 may use it to update the parameter values corresponding to slice e.
After the gradient averages are determined in the foregoing manner and used to update the parameter values, the next iteration can be performed, based on the updated parameters, using the mini-batches in the (j+1)-th data block.
It can be understood that the training process described above based on Figure 2c is described merely by taking, as an example, one possible logical topology (namely a ring) used to transmit gradients among training nodes 1 to 5. In other possible examples, other possible logical topologies may also be used to transmit gradients among training nodes 1 to 5, which is not specifically limited.
It should be noted that only a small number of training nodes are shown in Figures 2b and 2c above; in specific implementations, the number of training nodes may be far greater than 5. Further, taking a decentralized distributed training system as an example, the system may include one or more computer devices, and one or more training nodes may be deployed on each computer device. Training nodes deployed on the same computer device may communicate through a communication bus, and training nodes deployed on different computer devices may communicate through a network (for example, a wireless network).
According to the descriptions of Figures 2b and 2c above, a training node needs to send gradients out for an aggregation operation (that is, computing the gradient average so that the parameters can be updated) during the backward computation of neural network model training. Further, to improve training efficiency, the training node also needs to overlap the layer-by-layer computation of the first neural network's training with the gradient transmission, that is, computation and communication run asynchronously in parallel. Figure 2e is one possible schematic diagram of asynchronous parallel computation and communication. As shown in Figure 2e, in the i-th iteration of the first neural network model, after the gradient of the layer-N parameters is computed, the gradient of the layer-N parameters can be sent out (with transmission delay denoted τN); after the gradient of the layer-(N-1) parameters is computed, the gradient of the layer-(N-1) parameters can be sent out (transmission delay denoted τN-1); and so on, until the gradient of the layer-1 parameters is computed and sent out (transmission delay denoted τ1). In this way, after the gradients of the parameters of all layers of the first neural network model have been sent out, the parameters of each layer can be updated and the next iteration executed. However, when the layer-by-layer computation of the first neural network's training and the gradient transmission run asynchronously in parallel, the last transmission must be executed after the gradient of the layer-1 parameters is computed, which gives rise to a communication tail. If the communication tail duration of the i-th iteration (namely τ1) is long, the time interval between the i-th iteration and the (i+1)-th iteration becomes long, resulting in low efficiency of distributed training of the first neural network model.
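Under the overlap scheme above, the communication tail is the time the link is still busy after the layer-1 gradient finishes computing. A rough, illustrative model of this quantity, assuming all transfers are serialized on a single link (the function and its inputs are hypothetical, not part of the application):

```python
def communication_tail(compute_end_times, transfer_durations):
    """Rough model of the communication tail under computation/communication
    overlap: layer gradients become ready at compute_end_times (layer N
    first, layer 1 last), each transfer occupies the shared link for its
    duration, and the tail is the link time remaining after the backward
    pass (the last compute_end_time) finishes."""
    link_free = 0.0
    for ready, duration in zip(compute_end_times, transfer_durations):
        start = max(link_free, ready)  # wait for both the gradient and the link
        link_free = start + duration
    backward_done = compute_end_times[-1]
    return link_free - backward_done
```

For example, with gradients ready at times 1, 2, 3 and transfer durations 1, 1, 2, the last transfer starts at time 3 and the tail is 2 time units.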
In distributed training scenarios, many factors affect the communication tail duration, such as the number of training nodes in the distributed training system, the physical networking mode, the amount of data transmitted (that is, the amount of gradient data per transmission), the logical topology used for each gradient transmission, communication scheduling overhead, network congestion delay, and so on. The distributed training system may use any of multiple physical networking modes, which may involve, for example, InfiniBand, remote direct memory access (RDMA) over Converged Ethernet (RoCE), peripheral component interconnect express (PCIe), NVLink interconnects, and so on. Various logical topologies may be used to transmit gradients, such as a logical tree, a ring, halving & doubling, a hierarchical ring, a hybrid topology, and so on.
Further, the factors affecting the communication tail duration may interact with one another. For example, with the physical networking mode of the distributed training system and the number of training nodes fixed, if different logical topologies (for example topo0 and topo1) are used to transmit gradients, the curves of transmitted data amount versus communication tail duration show different trends. As illustrated in Figure 2f, when the amount of transmitted data is less than M0, the transmission delay with topo0 is smaller than with topo1; otherwise, the transmission delay with topo1 is smaller than with topo0.
In addition, in the gradient transmission manner illustrated in Figure 2e, the gradient of a layer's parameters is transmitted as soon as it is computed; however, different neural network models often differ in parameter quantity and in the distribution of parameters across layers. For example, within a single neural network model, some layers have many parameters and others have few; for a layer with few parameters, initiating a transmission is clearly inefficient (considering factors such as communication overhead). Therefore, the computed gradients may be accumulated to a certain amount before a single communication transmission is initiated; for instance, a transmission may be initiated after the gradients of two or three layers of parameters have been computed, so as to improve transmission efficiency. As another example, across different neural network models, some models have a relatively even parameter distribution, whereas the parameters of other models may be concentrated in a few layers, which may produce bursty transmissions.
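The idea of merging the gradients of several small layers into one transmission can be sketched as a greedy grouping under an assumed minimum-size threshold; the names and threshold are illustrative only:

```python
def group_layers_by_size(layer_sizes, min_bytes):
    """Greedily merge consecutive layers (in the order their gradients
    are produced by backward computation, i.e. last layer first) until
    a group reaches min_bytes, so that small layers share a single
    transmission instead of each initiating one."""
    groups, current, acc = [], [], 0
    for layer, size in layer_sizes:
        current.append(layer)
        acc += size
        if acc >= min_bytes:
            groups.append(current)
            current, acc = [], 0
    if current:  # flush the remainder even if it is below the threshold
        groups.append(current)
    return groups
```

For instance, with per-layer gradient sizes [("L3", 5), ("L2", 2), ("L1", 2)] and a threshold of 4, layer L3 is sent alone while L2 and L1 share one transmission.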
In view of the foregoing, to improve distributed training efficiency, different transmission strategies need to be designed for different distributed training systems and neural network models.
At present, some deep learning frameworks (such as TensorFlow) and third-party libraries (such as Horovod and OpenMPI) provide transmission strategies with custom mechanisms. For example, Horovod allows a user to set a data-amount threshold or a time threshold; during backward computation, if the amount of accumulated gradient data reaches the data-amount threshold or the transmission time interval reaches the time threshold, a transmission is initiated. However, because this approach provides no basis for determining the data-amount threshold or the time threshold, the thresholds a user sets based on personal experience may not be reasonable enough, so the goal of improving distributed training efficiency may not be achieved.
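The threshold mechanism described above can be paraphrased as follows; this is a simplified sketch of the described behavior with hypothetical threshold values, not Horovod's actual implementation or defaults:

```python
class FusionBuffer:
    """Buffer gradients and trigger a transmission when either the
    buffered byte count reaches byte_threshold or the time since the
    last flush reaches time_threshold (the user-set thresholds)."""
    def __init__(self, byte_threshold, time_threshold):
        self.byte_threshold = byte_threshold
        self.time_threshold = time_threshold
        self.buffered = 0
        self.last_flush = 0.0
        self.flushes = []  # byte counts of each initiated transmission

    def add(self, nbytes, now):
        self.buffered += nbytes
        if (self.buffered >= self.byte_threshold
                or now - self.last_flush >= self.time_threshold):
            self.flushes.append(self.buffered)
            self.buffered = 0
            self.last_flush = now
```

As the paragraph above notes, how well this performs depends entirely on how reasonable the two thresholds are.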
Based on this, an embodiment of this application provides a method for determining a transmission strategy, which specifically includes: generating an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the i-th iteration of a first neural network model; determining a communication tail duration corresponding to the i-th transmission strategy, where the communication tail duration corresponding to the i-th transmission strategy is used to indicate the duration between the end time of the i-th iteration and the start time of the (i+1)-th iteration of the first neural network model; and generating an (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy, where the (i+1)-th transmission strategy is used to transmit the gradients of the parameters of each layer obtained in the (i+1)-th iteration of the first neural network model. The foregoing method may be executed by a computing node. In this way, after generating the i-th transmission strategy, the computing node can obtain the communication tail duration corresponding to the i-th transmission strategy, and can thus complete one round of reinforcement learning based on that duration, so that the generated (i+1)-th transmission strategy tends toward the optimal transmission strategy (namely the transmission strategy that minimizes the communication tail duration), which helps improve the efficiency of distributed training.
Further, the computing node can interact with the multiple training nodes that perform distributed training on the first neural network model, and continuously try and update transmission strategies based on the communication tail durations fed back by the multiple training nodes after each completed iteration. In this way, a near-optimal transmission strategy can be produced intelligently and automatically through reinforcement learning, improving the efficiency of distributed training.
Figure 3 is a schematic diagram of an architecture to which the embodiments of this application are applicable. As shown in Figure 3, the architecture includes a computing node and a distributed training system. The distributed training system may be the centralized distributed training system shown in Figure 2b, the decentralized distributed training system shown in Figure 2c, or another possible distributed training system, which is not specifically limited; Figure 3 merely takes the decentralized distributed training system shown in Figure 2c as an example.
In an example, the computing node may include an agent executor, and each training node may include an estimator; for example, training node 1 includes estimator 1, training node 2 includes estimator 2, ..., and training node 5 includes estimator 5.
Specifically, the agent executor may be a set of reinforcement learning networks or algorithms, mainly used to generate the i-th transmission strategy, send the i-th transmission strategy to each of estimators 1 to 5, and update its own parameters according to the rewards (specifically, the communication tail durations) fed back by estimators 1 to 5. Taking estimator 1 as an example, estimator 1 is mainly used to obtain the i-th transmission strategy generated by the agent executor, start one iteration of the first neural network model, transmit the gradients obtained by backward computation according to the i-th transmission strategy, measure the communication tail duration, and feed the communication tail duration back to the agent executor as the reward; for the other estimators, refer to the description of estimator 1, and details are not repeated. In this way, through repeated iterations, the agent executor can continuously learn and evolve from the real model training environment, eventually tending to produce the optimal transmission strategy.
It should be noted that the foregoing example is described by taking the case where each training node corresponds to one estimator; in other possible examples, multiple training nodes may correspond to one estimator, which is not specifically limited.
Based on the architecture illustrated in Figure 3, Figure 4 is a schematic flowchart corresponding to a method for determining a transmission strategy provided by an embodiment of this application. As shown in Figure 4, the method includes:
Step 401: The computing node generates an i-th transmission strategy and sends the i-th transmission strategy to each of W training nodes, where the W training nodes are used to perform distributed training on the first neural network model; the i-th transmission strategy is used by the W training nodes to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model.
Step 402: A training node receives the i-th transmission strategy, and sends to the computing node an i-th communication tail duration obtained when using the i-th transmission strategy to transmit the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model. The training node described here may be any one of the W training nodes.
Step 403: The computing node receives the W i-th communication tail durations sent by the W training nodes.
In the embodiments of this application, the i-th communication tail duration obtained by a training node (that is, the communication tail duration of the i-th iteration) may be obtained based on the duration between the end time of the i-th iteration and the start time of the (i+1)-th iteration performed by the training node on the first neural network model. In an example, the i-th communication tail duration obtained by the training node may be equal to that duration. Taking training node 1 as an example, the i-th communication tail duration obtained by training node 1 may be equal to the duration between the end time of training node 1's i-th iteration of the first neural network model and the start time of its (i+1)-th iteration.
Step 404: The computing node determines, according to the W i-th communication tail durations, the communication tail duration corresponding to the i-th transmission strategy. The communication tail duration corresponding to the i-th transmission strategy is used to indicate the duration between the end time of the i-th iteration and the start time of the (i+1)-th iteration of the first neural network model. Since W training nodes perform distributed training on the first neural network model, this can also be understood as follows: the communication tail duration corresponding to the i-th transmission strategy reflects the duration between the end time of the W training nodes' i-th iteration of the first neural network model and the start time of their (i+1)-th iteration.
In the embodiments of this application, the computing node may determine the communication tail duration corresponding to the i-th transmission strategy from the W i-th communication tail durations in multiple specific ways. For example, the computing node may determine the average of the W i-th communication tail durations as the communication tail duration corresponding to the i-th transmission strategy; that is, the communication tail duration corresponding to the i-th transmission strategy equals the average, over the W training nodes, of the duration between the end time of the i-th iteration and the start time of the (i+1)-th iteration of the first neural network model.
Step 405: The computing node generates an (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy, and sends the (i+1)-th transmission strategy to the W training nodes; the (i+1)-th transmission strategy is used by the W training nodes to transmit the gradients of the parameters of each layer obtained in the (i+1)-th iteration of the first neural network model.
Here, W is an integer greater than or equal to 1; i = 1, 2, ..., X-1, where X is the number of iterations of the first neural network model and X is an integer greater than 1. In the embodiments of this application, the value of X may be obtained according to the number of data blocks into which the training data set of the first neural network model is divided, each data block being used for one iteration of the first neural network model (see the descriptions of Figures 2b and 2c above). In an example, X equals the number of data blocks into which the training data set of the first neural network model is divided.
Specifically, for step 405 above, in an example, if the computing node determines that the communication tail duration corresponding to the i-th transmission strategy is greater than a first threshold, it may generate the (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy; if the computing node determines that the communication tail duration corresponding to the i-th transmission strategy is less than or equal to the first threshold, it may use the i-th transmission strategy as the (i+1)-th transmission strategy. That is to say, once the computing node, through multiple rounds of reinforcement learning, has generated an i-th transmission strategy that yields high distributed training efficiency (that is, the i-th transmission strategy is a relatively good transmission strategy), it may stop generating new transmission strategies; correspondingly, the training nodes may use the same transmission strategy (namely the i-th transmission strategy) to transmit gradients in subsequent iterations of the first neural network model. The first threshold may be set according to actual needs and experience. This manner can effectively reduce the processing burden of the computing node, and the training nodes can transmit gradients based on the same transmission strategy without receiving newly generated transmission strategies from the computing node, which can effectively improve the efficiency of distributed training.
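Steps 404 and 405 under this example can be sketched as follows; `regenerate` is a placeholder standing in for the reinforcement-learning update, which this sketch does not fix:

```python
def next_strategy(current_strategy, tail_durations, first_threshold, regenerate):
    """Aggregate the per-node tail durations (here by averaging, one of
    the options in step 404), then either keep the current strategy if
    the tail is already short enough, or ask the learner to generate a
    new one from the measured tail (step 405)."""
    mean_tail = sum(tail_durations) / len(tail_durations)
    if mean_tail <= first_threshold:
        return current_strategy          # strategy is good enough; reuse it
    return regenerate(mean_tail)         # otherwise learn a new strategy
```

In the variant described two paragraphs below, the threshold check is simply skipped and `regenerate` is called on every iteration.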
It should be noted that, in the foregoing example, whether the i-th transmission strategy is a relatively good transmission strategy is determined by comparing the communication tail duration corresponding to the i-th transmission strategy with the first threshold; in other possible examples, whether the i-th transmission strategy is a relatively good transmission strategy may also be determined in other ways, which is not specifically limited.
In yet another example, considering that some dynamic and unstable factors (such as communication scheduling overhead and network congestion delay) may exist among the factors affecting the communication tail duration, and that these dynamic and unstable factors may prevent a transmission strategy determined by the computing node from applying effectively across many iterations of the first neural network model, the computing node may, throughout the overall training of the first neural network model, determine the transmission strategy used in each iteration according to the communication tail duration of the previous iteration. That is, after determining the communication tail duration corresponding to the i-th transmission strategy, the computing node may directly generate the (i+1)-th transmission strategy according to that duration, without judging whether the communication tail duration corresponding to the i-th transmission strategy is greater than the first threshold, so that the transmission strategy can be adjusted in time according to changes in such factors, improving the efficiency of distributed training. The following description is mainly based on this example.
In the embodiments of this application, the computing node may generate a transmission strategy based on a variety of possible reinforcement learning methods. Two possible implementations are described below by way of example.
(1) Implementation 1
The computing node may generate the i-th transmission strategy through a second neural network model, update the parameters of the second neural network model according to the communication tailing duration corresponding to the i-th transmission strategy, and generate the (i+1)-th transmission strategy through the updated second neural network model. The second neural network model may be a recurrent neural network (RNN) model, such as a long short-term memory (LSTM) network. The computing node may update the parameters of the second neural network model in multiple ways, for example using the proximal policy optimization (PPO) algorithm or the asynchronous advantage actor-critic (A3C) algorithm. With this method, the communication tailing duration corresponding to the i-th transmission strategy is used as the reinforcement-learning reward to update the parameters of the second neural network model. Because the second neural network model (such as an RNN model) has strong learning capability, continuously performing this process allows the communication tailing duration to converge to an optimal value.
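The reward-driven update loop described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the LSTM controller is replaced by a simple softmax policy over three candidate topologies, the PPO/A3C update is replaced by a plain REINFORCE policy-gradient step, and the tailing durations are simulated; all names and numbers are illustrative assumptions.

```python
import math
import random

TOPOLOGIES = ["topo0", "topo1", "topo2"]   # illustrative topology space
theta = [0.0] * len(TOPOLOGIES)            # controller parameters

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample_strategy():
    # Sample a topology index from the current policy distribution.
    probs = softmax(theta)
    r, acc = random.random(), 0.0
    for idx, p in enumerate(probs):
        acc += p
        if r <= acc:
            return idx
    return len(probs) - 1

def measure_tail_duration(topo_idx):
    # Stand-in for the duration reported by the training nodes.
    base = [0.9, 0.4, 0.6][topo_idx]
    return base + random.uniform(0.0, 0.05)

def update(topo_idx, reward, lr=0.1):
    # REINFORCE step: grad of log pi(a) = one_hot(a) - probs.
    probs = softmax(theta)
    for k in range(len(theta)):
        grad = (1.0 if k == topo_idx else 0.0) - probs[k]
        theta[k] += lr * reward * grad

random.seed(0)
for _ in range(500):                       # one update per training iteration
    a = sample_strategy()
    tail = measure_tail_duration(a)
    update(a, reward=-tail)                # shorter tailing -> larger reward

probs = softmax(theta)
print(max(range(3), key=lambda k: probs[k]))   # tends toward the fastest topology
```

In the same spirit as the embodiment, the negative tailing duration serves as the reward, so the controller gradually shifts probability mass toward strategies that shorten the tailing time.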
Exemplarily, the i-th transmission strategy may include a sub-transmission strategy for each layer of the first neural network model, where the sub-transmission strategy of the n-th layer includes first information and second information: the first information indicates whether to initiate a transmission after the gradient of the n-th layer parameters is computed, and the second information indicates the logical topology used for the transmission. In one example, the i-th transmission strategy may be a sequence in which the form {first information (whether to communicate), second information (logical topology)} is repeated a certain number of times (the number depending on the number of layers of the first neural network model); each {first information, second information} pair can be understood as the sub-transmission strategy of one layer of the first neural network model. For example, if the first neural network model includes 3 layers, the i-th transmission strategy may be [{yes, topo0}, {yes, topo1}, {yes, topo1}]. After receiving the i-th transmission strategy, the training node uses topo0 to send the gradient of the layer-3 parameters once that gradient is computed, uses topo1 to send the gradient of the layer-2 parameters once that gradient is computed, and uses topo1 to send the gradient of the layer-1 parameters once that gradient is computed. In the embodiments of this application, a logical topology space may be preset, containing multiple candidate logical topologies; the logical topology indicated by the second information may be one of the topologies in this space.
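The per-layer application of such a strategy on a training node can be sketched as follows. The names, the gradient placeholders, and the send function are hypothetical stand-ins; the strategy list is ordered as the gradients become available during the backward pass (layer N first, layer 1 last), matching the 3-layer example above.

```python
def apply_transmission_strategy(strategy, gradients_in_backward_order, send_fn):
    """strategy and gradients are both ordered as produced by the backward
    pass, i.e. layer N first, layer 1 last. Each sub-strategy is a pair
    (first information: send or not, second information: logical topology)."""
    sent = []
    for (send, topo), (layer, grad) in zip(strategy, gradients_in_backward_order):
        if send:                      # first information: initiate transmission?
            send_fn(grad, topo)       # second information: which topology to use
            sent.append((layer, topo))
    return sent

# The 3-layer example from the text: [{yes, topo0}, {yes, topo1}, {yes, topo1}].
log = []
strategy = [(True, "topo0"), (True, "topo1"), (True, "topo1")]
grads = [(3, "grad3"), (2, "grad2"), (1, "grad1")]
result = apply_transmission_strategy(
    strategy, grads, lambda g, t: log.append((g, t)))
print(result)   # [(3, 'topo0'), (2, 'topo1'), (1, 'topo1')]
```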
Specifically, the computing node may generate the transmission strategy through the second neural network model in a self-excitation loop. The specific implementation is described below with reference to FIG. 5a. As shown in FIG. 5a, the initial input of the second neural network model may be random content (for example, the identifier of a random logical topology in the logical topology space); the first information of the layer-1 sub-transmission strategy is generated from this initial input. Feeding the first information of the layer-1 sub-transmission strategy back into the second neural network model produces the second information of the layer-1 sub-transmission strategy; feeding that second information back in produces the first information of the layer-2 sub-transmission strategy, and so on.
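The two-time-steps-per-layer generation loop can be sketched as follows. A random stand-in `cell_step` takes the place of the actual second neural network model (for example an LSTM), and the topology space is illustrative; the point of the sketch is only the feedback structure, in which each output becomes the next input and odd/even time steps alternate between the first and second information.

```python
import random

TOPOLOGY_SPACE = ["topo0", "topo1", "topo2"]   # preset logical topology space

def cell_step(prev_output, step, rng):
    # Placeholder for one forward step of the RNN controller.
    # (The stand-in ignores its input; a real RNN would condition on it.)
    if step % 2 == 0:
        return rng.choice([True, False])        # first information: send or not
    return rng.choice(TOPOLOGY_SPACE)           # second information: topology

def generate_strategy(num_layers, rng):
    out = rng.choice(TOPOLOGY_SPACE)            # random initial input
    strategy = []
    for layer in range(num_layers):             # two time steps per layer
        first = cell_step(out, 2 * layer, rng)
        second = cell_step(first, 2 * layer + 1, rng)
        strategy.append((first, second))
        out = second                            # feeds the next layer's step
    return strategy

rng = random.Random(7)
strategy = generate_strategy(3, rng)            # 3 layers -> 3*2 time steps
```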
It can be seen that the second neural network model generates the sub-transmission strategy of one layer of the first neural network model every two time steps, so the first transmission strategy is generated after n*2 time steps. The computing node sends the first transmission strategy to the W training nodes (corresponding to step 401). Each training node receives the first transmission strategy, uses it to transmit the gradients of the layer parameters obtained in the first iteration of the first neural network model, and sends the resulting first communication tailing duration to the computing node (corresponding to step 402). The computing node receives the first communication tailing durations sent by the W training nodes and determines the communication tailing duration corresponding to the first transmission strategy (corresponding to steps 403 and 404). The computing node then updates the second neural network model according to that duration and, based on the updated model, generates the second transmission strategy after another n*2 time steps (corresponding to step 405), and sends the second transmission strategy to the W training nodes. Steps 401 to 405 are executed in a loop in this way until the training of the first neural network model is completed.
FIG. 5b is an overall schematic diagram of Implementation 1 provided by an embodiment of this application. As shown in FIG. 5b, an agent executor may run the second neural network model and the parameter update algorithm, while an evaluator is responsible for adding a communication operator (send operator) to the backward computation of each layer of the first neural network model (for example, evaluator 1 adds a communication operator to the backward computation of each layer of the first neural network model on training node 1). According to the transmission strategy produced by the agent executor, the evaluator then controls whether each communication operator initiates a transmission and which logical topology it uses. At the end of the i-th iteration of the first neural network model, the evaluator feeds the communication tailing duration corresponding to the i-th transmission strategy back to the agent executor. The agent executor treats this duration as the "reward" obtained from interacting with the environment, computes a policy gradient according to a policy-gradient method (such as PPO), updates the parameters of the second neural network model, and produces a new transmission strategy (the (i+1)-th transmission strategy), completing one round of reinforcement learning. In this way, after a period of repeated iteration, the agent executor can generate an approximately optimal transmission strategy for the specific physical networking mode and the first neural network model.
(2) Implementation 2
The computing node may generate the i-th transmission strategy through the Q table (Q-Table) used to record state-action pairs in the Q-learning algorithm, update the Q table according to the communication tailing duration corresponding to the i-th transmission strategy, and generate the (i+1)-th transmission strategy through the updated Q table. The Q table includes P states and Q actions: the P states correspond to P data-volume thresholds, and the Q actions correspond to Q combinations of a state transition amount and a logical topology, where P and Q are both integers greater than or equal to 1. With this method, by constructing the Q table of the Q-learning algorithm, with the data-volume thresholds as the state-dimension information and the (state transition amount, logical topology) combinations as the action-dimension information, the i-th transmission strategy can be generated from the Q table. Further, the Q table is updated by using the communication tailing duration corresponding to the i-th transmission strategy as the reinforcement-learning reward. Because the Q-learning algorithm takes an action based on the current state, obtains the corresponding reward, and then improves its actions, it can progressively produce better actions, that is, better transmission strategies.
In one example, referring to FIG. 6a, the computing node may determine the minimum of the P data-volume thresholds according to a preset transmission efficiency. For example, the preset transmission efficiency may be the lowest acceptable transmission efficiency: the computing node finds the logical topology with the smallest transmission volume according to this efficiency and takes the data volume corresponding to that topology as the minimum data-volume threshold. The computing node may determine the maximum of the P data-volume thresholds according to the parameter count of the first neural network model, for example by taking a certain proportion (such as 50% or 80%) of the parameter count as the maximum data-volume threshold. Further, the difference between any two of the P data-volume thresholds may be an integer multiple of a preset step size (denoted m), so that P = (Mmax - Mmin)/m + 1.
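A worked sketch of the resulting threshold grid follows; the concrete numbers are illustrative assumptions, not values from the embodiment.

```python
def threshold_states(m_min, m_max, step):
    """Enumerate the P data-volume thresholds Mmin, Mmin+m, ..., Mmax."""
    assert (m_max - m_min) % step == 0, "range must be a multiple of the step"
    p = (m_max - m_min) // step + 1       # P = (Mmax - Mmin)/m + 1
    return [m_min + k * step for k in range(p)]

# e.g. a model with 10M parameters, Mmax = 50% of that, Mmin = 1M, m = 1M:
states = threshold_states(1_000_000, 5_000_000, 1_000_000)
print(len(states))   # 5 states: 1M, 2M, 3M, 4M, 5M
```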
Table 1 shows an example of the Q table.
Table 1: Example of a Q table
(Table 1 is provided as an image in the original filing; its structure, as described in the surrounding text, is as follows.)

  State \ Action   a1 (+m, Topo1)   a2 (-m, Topo1)   ...   aQ (0, Topok)
  s1 (Mmin)        Q(s1, a1)        Q(s1, a2)        ...   Q(s1, aQ)
  ...              ...              ...              ...   ...
  sP (Mmax)        Q(sP, a1)        Q(sP, a2)        ...   Q(sP, aQ)
In Table 1, "+m" means adding m to the current state: for example, if the current state is Mmin, "+m" means transitioning to the state Mmin+m. "-m" means subtracting m from the current state: for example, if the current state is Mmax, "-m" means transitioning to the state Mmax-m. "0" means keeping the current state threshold M unchanged. "Topok" denotes the logical topology used for transmission. Q(s1, a1) represents the reward for executing the action (+m, Topo1) when the current state is Mmin, where executing the action (+m, Topo1) means generating a transmission strategy a whose data-volume threshold is Mmin+m and whose logical topology for each transmission is Topo1.
Exemplarily, the i-th transmission strategy includes third information and fourth information. The third information indicates the i-th data-volume threshold for transmitting the gradients of the layer parameters obtained in the i-th iteration of the first neural network model; this threshold is used to determine the transmission timing of those gradients. The fourth information indicates the logical topology used for each transmission. For example, after computing the gradient of the n-th layer parameters of the first neural network model, if the training node determines that the data volume of the accumulated gradients (that is, the gradients awaiting transmission) is greater than or equal to the i-th data-volume threshold, it initiates a transmission using the logical topology indicated by the fourth information.
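The accumulate-then-send behaviour described above can be sketched as follows. All names are hypothetical and the gradient sizes are illustrative; gradients are listed in the order the backward pass produces them (layer N down to layer 1).

```python
def backward_with_threshold(layer_grad_sizes, threshold, send_fn, topo):
    """layer_grad_sizes lists (layer, gradient size) in backward order.
    A transmission is initiated whenever the accumulated data volume
    reaches the threshold carried by the third information; the topology
    comes from the fourth information."""
    pending, pending_size, batches = [], 0, []
    for layer, size in layer_grad_sizes:
        pending.append(layer)
        pending_size += size
        if pending_size >= threshold:        # accumulated volume reached
            send_fn(list(pending), topo)
            batches.append(list(pending))
            pending, pending_size = [], 0
    if pending:                              # flush the remainder at the end
        send_fn(list(pending), topo)
        batches.append(list(pending))
    return batches

sent = []
batches = backward_with_threshold(
    [(3, 40), (2, 70), (1, 30)], threshold=100,
    send_fn=lambda layers, topo: sent.append((layers, topo)), topo="topo1")
print(batches)   # [[3, 2], [1]]
```

With these illustrative sizes, layers 3 and 2 (40 + 70 = 110 >= 100) are sent together, and layer 1's gradient is flushed at the end of the iteration.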
Specifically, before generating the first transmission strategy, the computing node may generate an initialized Q table, in which the values Q(s1, a1), Q(s1, a2), ... may be a set of random numbers drawn from a Gaussian distribution. Based on the initialized Q table, the computing node takes the state corresponding to Mmin as the current state, determines the first target action (for example, (+m, topo2)) according to the reward values of the Q actions in that state, generates the first transmission strategy, and sends it to the W training nodes (corresponding to step 401). Here, the first data-volume threshold is the sum of the state transition amount in the combination corresponding to the first target action and the data-volume threshold corresponding to the current state (that is, Mmin+m), and the logical topology used for each transmission is the logical topology in that combination (that is, topo2). Correspondingly, each training node receives the first transmission strategy, uses it to transmit the gradients of the layer parameters obtained in the first iteration of the first neural network model, and sends the resulting first communication tailing duration to the computing node (corresponding to step 402). The computing node receives the first communication tailing durations sent by the W training nodes and determines the communication tailing duration corresponding to the first transmission strategy (corresponding to steps 403 and 404). The computing node then updates the initialized Q table according to that duration and generates the second transmission strategy based on the updated Q table (corresponding to step 405), and sends the second transmission strategy to the W training nodes. Steps 401 to 405 are executed in a loop in this way until the training of the first neural network model is completed.
It should be noted that there may be multiple ways for the computing node to determine the i-th target action according to the reward values of the Q actions. For example, the computing node may select the action with the largest reward value among the Q actions as the i-th target action, or it may select the action with the second-largest reward value; this is not specifically limited. Likewise, there may be multiple ways for the computing node to update the Q table according to the communication tailing duration corresponding to the i-th transmission strategy; for example, the computing node may update the Q table using the Bellman equation, which is also not specifically limited.
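A minimal sketch of such a Bellman-equation update follows, with the negative communication tailing duration as the reward. The learning rate, discount factor, table contents, and reward shaping are illustrative assumptions; the embodiment does not fix them.

```python
def select_action(q_table, state):
    # e.g. pick the action with the largest reward value in this state
    return max(q_table[state], key=q_table[state].get)

def bellman_update(q_table, state, action, reward, next_state,
                   alpha=0.5, gamma=0.9):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(q_table[next_state].values())
    q_table[state][action] += alpha * (
        reward + gamma * best_next - q_table[state][action])

# Tiny Q table: 2 states x 2 actions, initialised to zero for readability
# (the embodiment initialises with Gaussian random numbers).
q = {"s1": {("+m", "topo1"): 0.0, ("0", "topo2"): 0.0},
     "s2": {("+m", "topo1"): 0.0, ("0", "topo2"): 0.0}}

tail_duration = 0.4        # stand-in for the value reported by training nodes
bellman_update(q, "s1", ("+m", "topo1"),
               reward=-tail_duration, next_state="s2")
print(round(q["s1"][("+m", "topo1")], 2))   # -0.2
```

Actions that produce shorter tailing durations receive less-negative rewards, so repeated updates raise their Q values relative to slower alternatives.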
FIG. 6b is an overall schematic diagram of Implementation 2 provided by an embodiment of this application. As shown in FIG. 6b, an agent executor may run a Q-learning algorithm and its functional components, interacting with the evaluators on the training nodes to determine and modify the data-volume threshold and to update the Q table. Each evaluator is responsible for controlling the transmission of gradients and for feeding the communication tailing duration back to the agent executor as the reward. For example, evaluator 1 controls the transmission of the gradients computed by training node 1: when the data volume of the gradients accumulated by training node 1 exceeds the data-volume threshold, a transmission is initiated, and the communication tailing duration of training node 1's gradient transmission is fed back to the executor as the reward.
The foregoing describes the solutions provided by the embodiments of this application mainly from the perspective of the execution procedure. It can be understood that, to implement the foregoing functions, the computing node and the training node may include corresponding hardware structures and/or software modules for performing each function. Those skilled in the art should readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the embodiments of this application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementation should not be considered beyond the scope of this application.
Where integrated units (modules) are used, FIG. 7 shows a possible exemplary block diagram of the apparatus for determining a transmission strategy involved in the embodiments of this application; the apparatus 700 may exist in the form of software. The apparatus 700 may include a generating unit 702 and a determining unit 703, which may be collectively referred to as a processing unit used to control and manage the actions of the apparatus 700. The apparatus 700 may further include a communication unit 704 configured to support communication between the apparatus 700 and other nodes. Optionally, the communication unit 704 may also be referred to as a transceiver unit and may include a receiving unit and/or a sending unit, configured to perform receiving and sending operations respectively. The apparatus 700 may further include a storage unit 701 configured to store the program code and/or data of the apparatus 700.
The generating unit 702 and the determining unit 703 may be a processor or a controller, which may implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of the embodiments of this application. The communication unit 704 may be a communication interface, a transceiver, a transceiver circuit, or the like, where "communication interface" is a general term and, in a specific implementation, may include multiple interfaces. The storage unit 701 may be a memory.
The apparatus 700 may be the computing node in any of the foregoing embodiments. The generating unit 702 and the determining unit 703 can support the apparatus 700 in performing the actions of the computing node in the foregoing method examples. Alternatively, the generating unit 702 and the determining unit 703 mainly perform the internal actions of the computing node in the method examples, while the communication unit 704 supports communication between the apparatus 700 and the training nodes. For example, the generating unit 702 is configured to perform step 401 in FIG. 4 and the action of generating the (i+1)-th transmission strategy in step 405; the determining unit 703 is configured to perform step 404 in FIG. 4; and the communication unit 704 is configured to perform step 403 in FIG. 4 and the action of sending the (i+1)-th transmission strategy in step 405.
Specifically, in one embodiment, the generating unit 702 is configured to generate an i-th transmission strategy, where the i-th transmission strategy is used to transmit the gradients of the layer parameters obtained in the i-th iteration of the first neural network model;
the determining unit 703 is configured to determine a communication tailing duration corresponding to the i-th transmission strategy, where this duration indicates the length of time between the end of the i-th iteration and the start of the (i+1)-th iteration of the first neural network model;
the generating unit 702 is further configured to generate an (i+1)-th transmission strategy according to the communication tailing duration corresponding to the i-th transmission strategy, where the (i+1)-th transmission strategy is used to transmit the gradients of the layer parameters obtained in the (i+1)-th iteration of the first neural network model; here i = 1, 2, ..., X-1, X is the number of iterations of the first neural network model, and X is an integer greater than 1.
In a possible design, the generating unit 702 is specifically configured to generate the i-th transmission strategy through a second neural network model.
In a possible design, the generating unit 702 is specifically configured to update the parameters of the second neural network model according to the communication tailing duration corresponding to the i-th transmission strategy, and to generate the (i+1)-th transmission strategy through the updated second neural network model.
In a possible design, the i-th transmission strategy includes a sub-transmission strategy for each layer of the first neural network model, where the sub-transmission strategy of the n-th layer includes first information and second information: the first information indicates whether to initiate a transmission after the gradient of the n-th layer parameters is computed, and the second information indicates the logical topology used for the transmission;
here n = 1, 2, ..., N, N is the number of layers of the first neural network model, and N is an integer greater than or equal to 1.
In a possible design, the generating unit 702 generates the sub-transmission strategy of the (n+1)-th layer through the second neural network model as follows:
the second information of the sub-transmission strategy of the n-th layer is used as the input of the second neural network model to generate the first information of the sub-transmission strategy of the (n+1)-th layer, and the first information of the sub-transmission strategy of the (n+1)-th layer is used as the input of the second neural network model to generate the second information of the sub-transmission strategy of the (n+1)-th layer.
In a possible design, the generating unit 702 is specifically configured to:
generate the i-th transmission strategy through the Q table used to record state-action pairs in the Q-learning algorithm, where the Q table includes P states and Q actions, the P states correspond to P data-volume thresholds, the Q actions correspond to Q combinations of a state transition amount and a logical topology, and P and Q are both integers greater than or equal to 1.
In a possible design, the generating unit 702 is specifically configured to:
update the Q table according to the communication tailing duration corresponding to the i-th transmission strategy, and generate the (i+1)-th transmission strategy through the updated Q table.
In a possible design, the i-th transmission strategy includes third information and fourth information, where the third information indicates the i-th data-volume threshold for transmitting the gradients of the layer parameters obtained in the i-th iteration of the first neural network model, the i-th data-volume threshold is used to determine the transmission timing of those gradients, and the fourth information indicates the logical topology used for each transmission.
In a possible design, the generating unit 702 is specifically configured to:
obtain, according to the Q table, the reward values for executing the Q actions in the state corresponding to the (i-1)-th data-volume threshold, determine the i-th target action according to those reward values, and generate the i-th transmission strategy, where the i-th data-volume threshold is the sum of the state transition amount in the combination corresponding to the i-th target action and the (i-1)-th data-volume threshold, and the logical topology used for each transmission is the logical topology in the combination corresponding to the i-th target action.
In a possible design, the maximum of the P data-volume thresholds is determined according to the parameter count of the first neural network model, and/or the minimum of the P data-volume thresholds is determined according to a preset transmission efficiency.
It should be noted that the division into units (modules) in the embodiments of this application is illustrative and is merely a logical function division; other division manners are possible in actual implementation. The functional modules in the embodiments of this application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium may be any medium capable of storing program code, such as a memory.
Referring to FIG. 8, a schematic diagram of an apparatus for determining a transmission strategy according to an embodiment of this application is shown. The apparatus may be the above-mentioned computer device for performing the actions performed by a computing node, or a semiconductor chip disposed in that computer device. The apparatus 800 includes a memory 801, a processor 802, and a communication interface 803. The processor 802 has the function of implementing the actions performed by the generating unit 702 and the determining unit 703 in FIG. 7. Optionally, the apparatus 800 may further include a bus 804. The communication interface 803, the processor 802, and the memory 801 may be connected to each other through the communication line 804; the communication line 804 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The communication line 804 may be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in FIG. 8, but this does not mean that there is only one bus or only one type of bus.
The processor 802 may be one or more CPUs (or GPUs), or one or more integrated circuits configured to control the execution of programs of the solutions of this application. The communication interface 803 uses any transceiver-like device to communicate with the training nodes. The memory 801 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through the communication line 804, or may be integrated with the processor. The memory 801 is used to store computer-executable instructions for executing the solutions of this application, and execution is controlled by the processor 802. The processor 802 is configured to execute the computer-executable instructions stored in the memory 801, so as to implement the methods provided in the foregoing embodiments of this application.
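The method implemented by the processor can be pictured as a feedback loop over training iterations: a policy generator emits a transmission strategy, the iteration runs, the communication tail duration (the gap between the end of iteration i and the start of iteration i+1) is measured, and that duration drives the generation of the next strategy. The sketch below is only a toy illustration of that loop: the greedy threshold update, the simulated timing, and all names are assumptions, not the patented implementation (which uses a second neural network or a Q-learning Q table as the generator).

```python
import random

def run_iteration(strategy):
    """Stand-in for one training iteration; returns a simulated
    communication tail duration (seconds). A real system would measure
    the gap between iteration i's end and iteration i+1's start."""
    # In this toy cost model a ~8 MB threshold overlaps communication
    # with computation best; the noise term stands in for system jitter.
    return abs(strategy["threshold_mb"] - 8) * 0.01 + random.uniform(0, 0.005)

def next_strategy(strategy, tail_duration, step_mb=1.0):
    """Toy policy update: nudge the data-amount threshold in the
    direction that reduced the tail duration last time."""
    candidate = dict(strategy)
    direction = candidate.get("direction", 1)
    if tail_duration > candidate.get("last_tail", float("inf")):
        direction = -direction  # the last move made things worse; reverse it
    candidate["threshold_mb"] = max(1.0, candidate["threshold_mb"] + direction * step_mb)
    candidate["direction"] = direction
    candidate["last_tail"] = tail_duration
    return candidate

def train(num_iterations=20, seed=0):
    random.seed(seed)
    # A strategy here carries a data-amount threshold and a logical topology.
    strategy = {"threshold_mb": 16.0, "topology": "ring"}
    tails = []
    for _ in range(num_iterations):
        tail = run_iteration(strategy)            # iteration i under strategy i
        tails.append(tail)
        strategy = next_strategy(strategy, tail)  # strategy i+1 from tail of i
    return strategy, tails

final, tails = train()
```

The hill-climbing update is merely the simplest possible generator; the claims replace it with a second neural network (claims 2 to 5) or a Q table (claims 6 to 9) that is updated from the same measured tail duration.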
Optionally, the computer-executable instructions in the embodiments of this application may also be referred to as application program code; this is not specifically limited in the embodiments of this application.
It should be noted that the above method and apparatus are based on the same inventive concept. Since the method and the apparatus solve the problem on similar principles, the implementations of the apparatus and the method may refer to each other, and repeated descriptions are omitted.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state disk (SSD)), or the like. This application is described with reference to the flowcharts and/or block diagrams of the method, device (system), and computer program product according to this application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus. The instruction apparatus implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, a person skilled in the art can make various changes and modifications to this application without departing from the scope of this application. If these modifications and variations fall within the scope of the claims of this application and their equivalent technologies, this application is also intended to cover them.

Claims (23)

  1. A method for determining a transmission strategy, wherein the method comprises:
    generating an i-th transmission strategy, wherein the i-th transmission strategy is used to transmit gradients of the parameters of each layer obtained in an i-th iteration of a first neural network model;
    determining a communication tail duration corresponding to the i-th transmission strategy, wherein the communication tail duration corresponding to the i-th transmission strategy indicates the duration between the end time of the i-th iteration and the start time of the (i+1)-th iteration of the first neural network model; and
    generating an (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy, wherein the (i+1)-th transmission strategy is used to transmit gradients of the parameters of each layer obtained in the (i+1)-th iteration of the first neural network model;
    wherein i = 1, 2, ..., X-1, X is the number of iterations of the first neural network model, and X is an integer greater than 1.
  2. The method according to claim 1, wherein generating the i-th transmission strategy comprises:
    generating the i-th transmission strategy through a second neural network model.
  3. The method according to claim 2, wherein generating the (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy comprises:
    updating parameters of the second neural network model according to the communication tail duration corresponding to the i-th transmission strategy, and generating the (i+1)-th transmission strategy through the updated second neural network model.
  4. The method according to claim 2 or 3, wherein the i-th transmission strategy comprises a sub-transmission strategy for each layer of the first neural network model, and the sub-transmission strategy of an n-th layer comprises first information and second information, the first information indicating whether to initiate transmission after the gradients of the parameters of the n-th layer are computed, and the second information indicating the logical topology used for the transmission;
    wherein n = 1, 2, ..., N, N is the number of layers of the first neural network model, and N is an integer greater than or equal to 1.
  5. The method according to claim 4, wherein generating the sub-transmission strategy of an (n+1)-th layer through the second neural network model specifically comprises:
    using the second information of the sub-transmission strategy of the n-th layer as input to the second neural network model to generate the first information of the sub-transmission strategy of the (n+1)-th layer, and using the first information of the sub-transmission strategy of the (n+1)-th layer as input to the second neural network model to generate the second information of the sub-transmission strategy of the (n+1)-th layer.
  6. The method according to claim 1, wherein generating the i-th transmission strategy comprises:
    generating the i-th transmission strategy through a Q table used to record state-action pairs in a Q-learning algorithm, wherein the Q table comprises P states and Q actions, the P states respectively correspond to P data amount thresholds, and the Q actions respectively correspond to Q combinations each formed by a state transition amount and a logical topology; wherein P and Q are both integers greater than or equal to 1.
  7. The method according to claim 6, wherein generating the (i+1)-th transmission strategy comprises:
    updating the Q table according to the communication tail duration corresponding to the i-th transmission strategy, and generating the (i+1)-th transmission strategy through the updated Q table.
  8. The method according to claim 6 or 7, wherein the i-th transmission strategy comprises third information and fourth information; wherein the third information indicates an i-th data amount threshold for transmitting the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model, the i-th data amount threshold being used to determine the transmission timing of those gradients; and the fourth information indicates the logical topology used for each transmission.
  9. The method according to claim 8, wherein generating the i-th transmission strategy through the Q table comprises:
    obtaining, according to the Q table, reward values for executing the Q actions in the state corresponding to the (i-1)-th data amount threshold, determining an i-th target action according to the reward values of the Q actions, and generating the i-th transmission strategy; wherein the i-th data amount threshold is the sum of the state transition amount in the combination corresponding to the i-th target action and the (i-1)-th data amount threshold, and the logical topology used for each transmission is the logical topology in the combination corresponding to the i-th target action.
  10. The method according to claim 6, wherein the largest of the P data amount thresholds is determined according to the parameter count of the first neural network model, and/or the smallest of the P data amount thresholds is determined according to a preset transmission efficiency.
  11. An apparatus for determining a transmission strategy, wherein the apparatus comprises:
    a generating unit, configured to generate an i-th transmission strategy, wherein the i-th transmission strategy is used to transmit gradients of the parameters of each layer obtained in an i-th iteration of a first neural network model; and
    a determining unit, configured to determine a communication tail duration corresponding to the i-th transmission strategy, wherein the communication tail duration corresponding to the i-th transmission strategy indicates the duration between the end time of the i-th iteration and the start time of the (i+1)-th iteration of the first neural network model;
    wherein the generating unit is further configured to generate an (i+1)-th transmission strategy according to the communication tail duration corresponding to the i-th transmission strategy, the (i+1)-th transmission strategy being used to transmit gradients of the parameters of each layer obtained in the (i+1)-th iteration of the first neural network model;
    wherein i = 1, 2, ..., X-1, X is the number of iterations of the first neural network model, and X is an integer greater than 1.
  12. The apparatus according to claim 11, wherein the generating unit is specifically configured to generate the i-th transmission strategy through a second neural network model.
  13. The apparatus according to claim 12, wherein the generating unit is specifically configured to: update parameters of the second neural network model according to the communication tail duration corresponding to the i-th transmission strategy, and generate the (i+1)-th transmission strategy through the updated second neural network model.
  14. The apparatus according to claim 12 or 13, wherein the i-th transmission strategy comprises a sub-transmission strategy for each layer of the first neural network model, and the sub-transmission strategy of an n-th layer comprises first information and second information, the first information indicating whether to initiate transmission after the gradients of the parameters of the n-th layer are computed, and the second information indicating the logical topology used for the transmission;
    wherein n = 1, 2, ..., N, N is the number of layers of the first neural network model, and N is an integer greater than or equal to 1.
  15. The apparatus according to claim 14, wherein the generating unit generates the sub-transmission strategy of an (n+1)-th layer through the second neural network model specifically by:
    using the second information of the sub-transmission strategy of the n-th layer as input to the second neural network model to generate the first information of the sub-transmission strategy of the (n+1)-th layer, and using the first information of the sub-transmission strategy of the (n+1)-th layer as input to the second neural network model to generate the second information of the sub-transmission strategy of the (n+1)-th layer.
  16. The apparatus according to claim 11, wherein the generating unit is specifically configured to:
    generate the i-th transmission strategy through a Q table used to record state-action pairs in a Q-learning algorithm, wherein the Q table comprises P states and Q actions, the P states respectively correspond to P data amount thresholds, and the Q actions respectively correspond to Q combinations each formed by a state transition amount and a logical topology; wherein P and Q are both integers greater than or equal to 1.
  17. The apparatus according to claim 16, wherein the generating unit is specifically configured to:
    update the Q table according to the communication tail duration corresponding to the i-th transmission strategy, and generate the (i+1)-th transmission strategy through the updated Q table.
  18. The apparatus according to claim 16 or 17, wherein the i-th transmission strategy comprises third information and fourth information; wherein the third information indicates an i-th data amount threshold for transmitting the gradients of the parameters of each layer obtained in the i-th iteration of the first neural network model, the i-th data amount threshold being used to determine the transmission timing of those gradients; and the fourth information indicates the logical topology used for each transmission.
  19. The apparatus according to claim 18, wherein the generating unit is specifically configured to:
    obtain, according to the Q table, reward values for executing the Q actions in the state corresponding to the (i-1)-th data amount threshold, determine an i-th target action according to the reward values of the Q actions, and generate the i-th transmission strategy; wherein the i-th data amount threshold is the sum of the state transition amount in the combination corresponding to the i-th target action and the (i-1)-th data amount threshold, and the logical topology used for each transmission is the logical topology in the combination corresponding to the i-th target action.
  20. The apparatus according to claim 16, wherein the largest of the P data amount thresholds is determined according to the parameter count of the first neural network model, and/or the smallest of the P data amount thresholds is determined according to a preset transmission efficiency.
  21. An apparatus for determining a transmission strategy, wherein the apparatus comprises a processor, a memory, and instructions stored in the memory and executable on the processor, which, when executed, cause the apparatus to perform the method according to any one of claims 1 to 10.
  22. A computer-readable storage medium, comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 10.
  23. A computer program product which, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 10.
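Claims 6 to 9 describe the Q-learning variant: the Q table's states are P candidate data-amount thresholds, each action is a (state transition amount, logical topology) combination, the next threshold is the current one plus the chosen transition amount, and the action is chosen from the reward values recorded in the table. The sketch below is a textbook Q-learning loop written under those assumptions; the concrete state set, action set, reward shaping, and simulated tail measurement are all hypothetical and not taken from this application, where the reward would come from the measured communication tail duration of each training iteration.

```python
import random

# P states: candidate data-amount thresholds (MB); claims 10/20 bound them by
# the model's parameter count (max) and a preset transmission efficiency (min).
STATES = [2, 4, 8, 16, 32]
# Q actions: (state transition amount, logical topology) combinations.
ACTIONS = [(-1, "ring"), (0, "ring"), (1, "ring"),
           (-1, "tree"), (0, "tree"), (1, "tree")]

def measure_tail(threshold_mb, topology):
    """Simulated communication tail duration; a real system would time the
    gap between iteration i's end and iteration i+1's start."""
    base = abs(threshold_mb - 8) * 0.01          # toy optimum at 8 MB
    return base + (0.002 if topology == "tree" else 0.0)

def choose_action(q_table, state_idx, epsilon=0.1):
    """Epsilon-greedy selection over the recorded reward values."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    row = q_table[state_idx]
    return row.index(max(row))

def train(episodes=200, alpha=0.5, gamma=0.9, seed=0):
    random.seed(seed)
    q_table = [[0.0] * len(ACTIONS) for _ in STATES]  # P x Q table of values
    state = 2  # start at the 8 MB threshold
    for _ in range(episodes):
        action = choose_action(q_table, state)
        delta, topology = ACTIONS[action]
        # next threshold = current threshold index shifted by the transition amount
        next_state = min(max(state + delta, 0), len(STATES) - 1)
        # shorter tail -> higher reward
        reward = -measure_tail(STATES[next_state], topology)
        best_next = max(q_table[next_state])
        q_table[state][action] += alpha * (reward + gamma * best_next
                                           - q_table[state][action])
        state = next_state
    return q_table, state

q_table, final_state = train()
```

Under this toy cost model the learned values steer the threshold toward 8 MB with the ring topology, the minimum of the simulated tail function; in the claimed method the table update after each iteration plays the role of generating the (i+1)-th transmission strategy from the i-th strategy's measured tail.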
PCT/CN2019/076359 2019-02-27 2019-02-27 Method and apparatus for determining transmission policy WO2020172825A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/076359 WO2020172825A1 (en) 2019-02-27 2019-02-27 Method and apparatus for determining transmission policy
CN201980091568.XA CN113412494B (en) 2019-02-27 2019-02-27 Method and device for determining transmission strategy


Publications (1)

Publication Number Publication Date
WO2020172825A1 true WO2020172825A1 (en) 2020-09-03

Family

ID=72238780

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/076359 WO2020172825A1 (en) 2019-02-27 2019-02-27 Method and apparatus for determining transmission policy

Country Status (2)

Country Link
CN (1) CN113412494B (en)
WO (1) WO2020172825A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113570067A (en) * 2021-07-23 2021-10-29 北京百度网讯科技有限公司 Synchronization method, device and program product of distributed system
CN113610241A (en) * 2021-08-03 2021-11-05 曙光信息产业(北京)有限公司 Distributed training method, device, equipment and storage medium for deep learning model
CN114205300A (en) * 2021-12-02 2022-03-18 南开大学 Flow scheduling method capable of guaranteeing flow transmission deadline under condition of incomplete flow information
US11416743B2 (en) * 2019-04-25 2022-08-16 International Business Machines Corporation Swarm fair deep reinforcement learning
CN115829053A (en) * 2022-11-25 2023-03-21 北京百度网讯科技有限公司 Model operation strategy determination method and device, electronic equipment and storage medium
WO2023040794A1 (en) * 2021-09-15 2023-03-23 华为技术有限公司 Communication method and communication apparatus
CN116962438A (en) * 2023-09-21 2023-10-27 浪潮电子信息产业股份有限公司 Gradient data synchronization method, system, electronic equipment and readable storage medium

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
CN116233857A (en) * 2021-12-02 2023-06-06 华为技术有限公司 Communication method and communication device
CN117221944A (en) * 2022-06-02 2023-12-12 华为技术有限公司 Communication method and device

Citations (2)

Publication number Priority date Publication date Assignee Title
CN105894087A (en) * 2015-01-26 2016-08-24 华为技术有限公司 System and method for training parameter set in neural network
WO2019007388A1 (en) * 2017-07-06 2019-01-10 Huawei Technologies Co., Ltd. System and method for deep learning and wireless network optimization using deep learning



Also Published As

Publication number Publication date
CN113412494B (en) 2023-03-17
CN113412494A (en) 2021-09-17


Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19916888; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 19916888; Country of ref document: EP; Kind code of ref document: A1)