WO2023222113A1 - Sparse parameter update method, training node, device, and storage medium - Google Patents

Sparse parameter update method, training node, device, and storage medium

Info

Publication number
WO2023222113A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
node
parameter
card
gradient
Prior art date
Application number
PCT/CN2023/095266
Other languages
English (en)
French (fr)
Inventor
王国威
苏磊
刘静
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023222113A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application relates to the field of deep learning technology, and in particular to a sparse parameter update method, training nodes, equipment and storage media.
  • Training data is processed into discrete features through one-hot or multi-hot encoding, and embedding parameters are needed to convert the input into continuous vectors for further processing. In a single training pass, only some of the embedding parameters participate in the calculation and are updated. Such parameters are called sparse parameters.
  • The total volume of sparse parameters can reach 10 TB to 30 TB, and the memory capacity of a training card (also called a training accelerator card) is not enough to store all of them, so sparse parameters are generally stored in the memory of the server or on a solid state disk (SSD) connected to the server.
  • Sparse parameters are stored in a distributed manner in the memory of multiple servers or on their attached SSDs, and training cards are deployed on the servers.
  • A single update of the sparse parameters proceeds as follows: each training node obtains the training data and the sparse parameters corresponding to that training data from the memory or SSD. Based on the obtained training data, each training node trains to obtain the gradients of the sparse parameters corresponding to the training data.
  • Each training node transmits the gradients of the sparse parameters obtained by training to all other training nodes through the network.
  • Each training node aggregates the gradients of the sparse parameters that it stores, and updates those sparse parameters based on the aggregation result.
  • Because the gradients calculated by all training nodes must be exchanged after each training node calculates the gradients of the sparse parameters, this process consumes a large amount of network transmission resources.
  • This application provides a sparse parameter update method, training nodes, equipment and storage media, which can save transmission resources and improve sparse parameter update efficiency.
  • This application provides a method for updating sparse parameters, applied to an artificial intelligence model training system.
  • The system includes a first parameter node, a first training node, and a second training node.
  • The first training node includes a first training card.
  • The second training node includes a second training card.
  • The method includes: the first training node obtains a first parameter set from the first parameter node, where the first parameter set includes multiple parameters; the first training node uses the multiple parameters to train on the data to be trained and obtains a first gradient set.
  • The first gradient set includes those gradients, among the multiple gradients corresponding to the multiple parameters of the first parameter set, that are distributed to the second training card.
  • The first training node sends the first gradient set and the parameters corresponding to the gradients in the first gradient set to the second training card; the second training card updates the parameters corresponding to the gradients in the first gradient set according to those gradients; the second training card sends the updated parameters to the first parameter node.
  • Training cards are used to update the sparse parameters, so the sparse parameters are updated more efficiently, which can improve the training efficiency of the model.
  • The training node partitions the sparse parameters it obtains and transmits them to other training cards. Each sparse parameter and its corresponding gradient are transmitted to only one training card, so no data is transmitted repeatedly, which also saves network transmission resources.
  • The method further includes: the first training node obtains a second parameter set from each of N second parameter nodes, where each second parameter set includes multiple parameters. The first training node using the multiple parameters to train on the data to be trained to obtain the first gradient set includes: multiple training cards in the first training node use the parameters of the first parameter set and of the N obtained second parameter sets to train on the data to be trained, where the multiple training cards include the first training card; the first training card aggregates the gradient data obtained by the multiple training cards after training; and the first training card partitions the aggregated gradient data by parameter node to obtain the first gradient set corresponding to the first parameter node, or partitions the aggregated gradient data by training node to obtain the first gradient set corresponding to the second training node.
  • The first gradient set corresponding to the second training node also includes those gradients, among the multiple gradients corresponding to the multiple parameters of the N second parameter sets, that are distributed to the second training card.
  • The first training card in the first training node obtains all parameters of the first training node and the corresponding gradients. The first training card then partitions the gradients by parameter node or by training node and distributes them to the corresponding training cards, so that those training cards update the parameters.
  • the training card is used to calculate the gradient of sparse parameters and update the sparse parameters, making the update of sparse parameters more efficient.
  • The method further includes: the first training node obtains a second parameter set from each of N second parameter nodes, where each second parameter set includes multiple parameters. The first training node using the multiple parameters to train on the data to be trained to obtain the first gradient set includes: multiple training cards in the first training node use the parameters of the first parameter set and of the N obtained second parameter sets to train on the data to be trained, where the multiple training cards include the first training card; and the first training card partitions its own post-training gradient data according to the training cards in the system to obtain the first gradient set corresponding to the second training card.
  • The first gradient set also includes those gradients, among the first training card's post-training gradient data corresponding to the parameters of the N second parameter sets, that are distributed to the second training card.
  • each training card of the first training node calculates the gradient in parallel and divides the gradient in parallel, which can update parameters in parallel and improve training efficiency.
  • The method further includes: the first training node obtains a second parameter set from each of N second parameter nodes, where each second parameter set includes multiple parameters. The first training node using the multiple parameters to train on the data to be trained to obtain the first gradient set includes: multiple training cards in the first training node use the parameters of the first parameter set and of the N obtained second parameter sets to train on the data to be trained, where the multiple training cards include the first training card; and the first training card partitions its own post-training gradient data by parameter node to obtain the first gradient set corresponding to the second training card.
  • Each training card of the first training node calculates gradients in parallel and partitions the gradients by parameter node in parallel, so that the parameters distributed to the same training card belong to the same parameter node, which allows the training card to send the updated parameters to the corresponding parameter node at the same time.
  • The second training card updating the parameters corresponding to the gradients in the first gradient set according to those gradients includes: the second training card aggregates the gradients in multiple gradient sets received from multiple training nodes in the system, where the multiple gradient sets include the first gradient set; and the second training card updates the parameters corresponding to the gradients in the first gradient set using the aggregated gradients.
  • The first training card partitioning the aggregated gradient data by parameter node to obtain the first gradient set corresponding to the first parameter node includes: determining an index value according to the feature of the parameter corresponding to each piece of aggregated gradient data, where an index value indicates a parameter node; and classifying the gradients corresponding to parameters with a first index value into the first gradient set, where the first index value indicates the first parameter node.
  • The present application provides a first training node, which has the function of implementing the steps performed by the first training node in the first aspect.
  • the application provides a computer device.
  • The computer device includes a processor and a memory; the memory stores computer instructions; the processor is used to execute the computer instructions, so that the computer device performs part of the sparse parameter update method provided by the first aspect or any optional manner of the first aspect.
  • the present application provides a computer-readable storage medium, which stores at least one computer instruction.
  • The computer instruction is read by a processor to cause the computer device to perform part of the method by which a training node updates sparse parameters, as provided by the first aspect or any optional manner of the first aspect.
  • the present application provides a computer program product.
  • the computer program product includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs part of the method by which a training node updates sparse parameters, as provided by the first aspect or any optional manner of the first aspect.
  • Figure 1 is a schematic diagram of the system architecture provided by an exemplary embodiment of the present application.
  • Figure 2 is a schematic diagram of the system architecture provided by an exemplary embodiment of the present application.
  • Figure 3 is a schematic diagram of the hardware structure of the device provided by an exemplary embodiment of the present application.
  • Figure 4 is a schematic flowchart of a method for updating sparse parameters provided by an exemplary embodiment of the present application
  • Figure 5 is a schematic flowchart of a method for updating sparse parameters provided by an exemplary embodiment of the present application
  • Figure 6 is a schematic diagram of an example update of sparse parameters provided by an exemplary embodiment of the present application.
  • Figure 7 is a schematic flowchart of a method for updating sparse parameters provided by an exemplary embodiment of the present application.
  • Figure 8 is a schematic diagram of an example update of sparse parameters provided by an exemplary embodiment of the present application.
  • An artificial neural network (ANN) is a structure and model in the fields of machine learning and cognitive science that imitates biological neural networks (such as the central nervous system of animals, for example the brain).
  • Artificial neural networks perform calculations by connecting a large number of artificial neurons.
  • Artificial neural networks can also be referred to as neural network (NN) models or neural network-like models.
  • the embedding layer is a layer in the neural network model that converts the features of the input layer into a vector with fixed dimensions.
  • Parameters refer to the parameters in the neural network model, specifically the weight matrix W and bias vector b of each neural network unit.
  • the process of neural network model training is a process of continuously adjusting parameter values to optimize the performance of the neural network model.
  • Sparse parameters are a type of parameters in neural network models.
  • the characteristic of sparse parameters is that in each round of training, only some of the sparse parameters will be activated. "Activation" refers to participating in forward calculation and reverse update.
  • the input is processed into discrete features through one-hot encoding or multi-hot encoding. It is necessary to use embedding parameters in the embedding layer to convert the input into a continuous vector for further processing. During a single training process, only part of the embedding parameters are calculated and updated. Such parameters are called sparse parameters.
  • the data volume of sparse parameters is relatively large, and the memory capacity of the training card is not enough to store all sparse parameters, so sparse parameters are generally stored in the memory of the server or the SSD connected to the server.
  • Sparse parameters have the following characteristics:
  • 1. The amount of data is huge. For example, the number of sparse parameters in a recommendation model can reach 10^12 to 10^14, consuming 10 TB to 30 TB of storage, so training cannot be completed entirely within the training card.
  • 2. The parameters are sparse, that is, only a small part of all sparse parameters is used in each training pass. For example, when the mini-batch size is 10000 and each sample has 1000 features, 10^7 sparse parameters are used.
  • 3. Training involves a large amount of calculation and data transmission, so computing resources and network resources easily become bottlenecks.
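  • The sparsity example above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the count is an upper bound that assumes no two samples share a feature.

```python
# Values taken from the example in the text.
mini_batch_size = 10_000      # samples per mini-batch
features_per_sample = 1_000   # sparse features per sample

# Upper bound on sparse parameters touched in one pass (no deduplication).
touched = mini_batch_size * features_per_sample

# Total number of sparse parameters quoted in the text.
total_low, total_high = 10**12, 10**14

# Fraction of all sparse parameters that one pass can touch at most.
fraction_max = touched / total_low
fraction_min = touched / total_high

print(touched, fraction_min, fraction_max)
```

  • Even in the most favorable case, a single pass touches at most one sparse parameter in 100000, which is why only the touched parameters and their gradients need to move between nodes.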
  • this application reduces the amount of data transmission by optimizing the sparse parameter update process, thereby reducing the single-round training time and ultimately optimizing the overall training time of the model.
  • System architecture 100 is an illustration of a system architecture for updating sparse parameters.
  • System architecture 100 includes parameter node 101 and training node 102. Among them, the parameter node 101 is used to store parameters, etc., and the training node 102 is used to update parameters.
  • the training node 102 may be called a worker node.
  • the parameter node 101 and the training node 102 are respectively deployed on different physical nodes, and the parameter node 101 and the training node 102 are connected through a wireless network or a wired network.
  • the embodiment of the present application does not limit the number and type of parameter nodes 101 and training nodes 102.
  • the system architecture 100 includes N parameter nodes 101 and M training nodes 102. N and M are both integers. N and M may be the same or different.
  • Each training node 102 is deployed with at least one training card, that is, each training node 102 includes at least one training card.
  • A training card can be a neural network processing unit (NPU) or a graphics processing unit (GPU). The number of training cards deployed on different training nodes 102 can be the same or different.
  • When the parameter node 101 and the training node 102 are deployed on different physical nodes, the parameter node 101 may also be called a parameter server.
  • a parameter node 101 and a training node 102 are deployed on the same physical node.
  • the parameter node 101 and the training node 102 are connected through a bus inside the physical node.
  • Some of the parameter nodes 101 and some of the training nodes 102 are deployed on the same physical nodes, with one parameter node 101 and one training node 102 per physical node, while the remaining parameter nodes 101 and training nodes 102 are deployed on different physical nodes.
  • any two training nodes 102 are connected through a wireless network or a wired network.
  • The training cards of different training nodes 102 can optionally be connected through an inter-card transmission network, that is, training cards of different training nodes 102 can communicate with each other directly.
  • the device 300 shown in FIG. 3 is an illustration of the hardware structure of the device in the above-mentioned system architecture 100.
  • the device 300 is configured as a parameter node 101, a training node 102 or a physical node.
  • the device 300 is, for example, a host or a server.
  • Device 300 is optionally implemented with a generic bus architecture.
  • Device 300 includes at least one processor 301, communication bus 302, memory 303, and at least one network interface 304.
  • The processor 301 is, for example, a general-purpose central processing unit (CPU), a network processor (NP), a GPU, an NPU, a data processing unit (DPU), a microprocessor, or one or more integrated circuits used to implement the solution of this application.
  • the processor 301 includes an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • PLD is, for example, a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a general array logic (GAL), or any combination thereof.
  • Communication bus 302 is used to transfer information between the above-mentioned components.
  • the communication bus 302 can be divided into an address bus, a data bus, a control bus, etc.
  • Only one thick line is used in Figure 3, but this does not mean that there is only one bus or one type of bus.
  • The memory 303 is, for example, a read-only memory (ROM) or another type of static storage device that can store static information and instructions; a random access memory (RAM) or another type of dynamic storage device that can store information and instructions; an electrically erasable programmable read-only memory (EEPROM); a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like); a magnetic disk storage medium or other magnetic storage device; or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without limitation.
  • the memory 303 exists independently, for example, and is connected to the processor 301 through the communication bus 302 .
  • the memory 303 may also be integrated with the processor 301.
  • the memory 303 is used to save data obtained by the device 300 during the update process of sparse parameters, etc.
  • Network interface 304 uses any transceiver-like device for communicating with other devices or communications networks.
  • the network interface 304 includes a wired network interface and may also include a wireless network interface.
  • the wired network interface may be an Ethernet interface, for example.
  • the Ethernet interface can be an optical interface, an electrical interface, or a combination thereof.
  • The wireless network interface may be a wireless local area network (WLAN) interface, a cellular network interface, or a combination thereof.
  • the processor 301 may include one or more CPUs.
  • the device 300 may include multiple processors.
  • processors can be a single-core processor (single-CPU) or a multi-core processor (multi-CPU).
  • a processor here may refer to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
  • the memory 303 is used to store program codes for executing the solution of the present application, and the processor 301 can execute the program codes stored in the memory 303. That is, the device 300 can implement the sparse parameter updating method provided by the method embodiment through the processor 301 and the program code in the memory 303 .
  • Sparse parameters are stored in a distributed manner in the memory of multiple parameter nodes 101 or on their attached SSDs. Each sparse parameter corresponds to a feature, and the feature of a sparse parameter can also be called an index.
  • Based on the feature of a sparse parameter, both the parameter node 101 that stores the sparse parameter and the sparse parameter itself can be located.
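  • The feature-based lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not specify how a feature maps to a parameter node, so the sketch assumes integer features and a modulo placement rule, and `ParameterNode`, `locate`, and `fetch` are hypothetical names.

```python
class ParameterNode:
    """Hypothetical in-memory stand-in for one parameter node."""

    def __init__(self):
        self.store = {}  # feature (index) -> sparse parameter value


def locate(feature, nodes):
    # Assumed placement rule: feature modulo the number of parameter nodes.
    return nodes[feature % len(nodes)]


def fetch(features, nodes):
    """Return {feature: parameter} by querying the node that owns each feature."""
    return {f: locate(f, nodes).store[f] for f in features}


# Populate three parameter nodes with ten sparse parameters.
nodes = [ParameterNode() for _ in range(3)]
for feature in range(10):
    locate(feature, nodes).store[feature] = float(feature) * 0.1

# A training node only needs the features present in its batch.
fetched = fetch([4, 7, 9], nodes)
```

  • Any deterministic rule would serve in place of the modulo; what matters is that every training node and training card derives the same owner for a given feature.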
  • The system includes a first parameter node, a first training node, and a second training node, that is, one parameter node and two training nodes.
  • The system can also include multiple parameter nodes and multiple training nodes.
  • the first parameter node is used to store sparse parameters
  • the first training node and the second training node are used to update sparse parameters.
  • the first training node includes a first training card
  • the second training node includes a second training card.
  • the first training node obtains sparse parameters from each parameter node.
  • the first training node uses the obtained sparse parameters to train on the data to be trained, and obtains the gradient corresponding to the obtained sparse parameters.
  • the first training node determines the gradient distributed to each training card in the system, and the first training node sends the gradient and the corresponding sparse parameter to the corresponding training card.
  • the training card uses the received gradient to update the received sparse parameters, and then the training card sends the updated sparse parameters to the corresponding parameter node according to the characteristics of the sparse parameters.
  • the first training node obtains a first parameter set from a first parameter node.
  • the first parameter set includes a plurality of parameters, and the plurality of parameters are all sparse parameters.
  • the first training node uses the plurality of parameters to train on the data to be trained, and obtains a first gradient set.
  • the first gradient set includes the gradients distributed to the second training card among the plurality of gradients corresponding to the plurality of parameters.
  • the first training node sends the first gradient set and parameters corresponding to the gradients in the first gradient set to the second training card.
  • the second training card uses the gradients in the first gradient set to update parameters corresponding to the gradients in the first gradient set to obtain updated parameters.
  • The second training card sends the updated parameters to the first parameter node.
  • the first parameter node replaces the parameters before the update with the updated parameters.
  • the training card is used to update the sparse parameters, so that the update efficiency of the sparse parameters is higher.
  • the training node divides the sparse parameters it obtains and then transmits them to other training cards. Each sparse parameter and the corresponding gradient will only be transmitted to one training card. There is no need to repeatedly transmit data, and it can also save network transmission resources.
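  • The flow above (pull parameters, compute gradients, send each gradient with its parameter to exactly one training card, update on that card, push back) can be sketched in a single process. This is an illustration only: the plain gradient-descent update, the learning rate, the modulo placement rule, and the names `partition_by_card` and `update_on_card` are assumptions that the text does not specify.

```python
from collections import defaultdict

LEARNING_RATE = 0.5  # assumed value, chosen so the arithmetic is exact


def partition_by_card(gradients, parameters, num_cards):
    """Assign each (parameter, gradient) pair to exactly one training card."""
    shards = defaultdict(dict)
    for feature, grad in gradients.items():
        card = feature % num_cards  # assumed placement rule
        shards[card][feature] = (parameters[feature], grad)
    return shards


def update_on_card(shard):
    """Assumed update rule: plain gradient descent on the receiving card."""
    return {f: p - LEARNING_RATE * g for f, (p, g) in shard.items()}


# First parameter node: feature -> sparse parameter value.
parameter_node = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}

# The first training node pulls the first parameter set and trains on it.
first_parameter_set = dict(parameter_node)
gradients = {0: 0.5, 1: -1.0, 2: 0.0, 3: 2.0}  # stand-in for real training

# Card 0 plays the first training card, card 1 the second training card.
shards = partition_by_card(gradients, first_parameter_set, num_cards=2)

# The second training card updates what it received and pushes it back.
updated = update_on_card(shards[1])
parameter_node.update(updated)
```

  • Each feature appears in exactly one shard, so no parameter or gradient is sent to more than one card.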
  • Figure 4 takes the first training node and the second training node as an example for illustration.
  • the update process of sparse parameters of other training nodes can be seen in Figure 4 and will not be described again here.
  • Figure 4 shows only one update of the sparse parameters; every update of the sparse parameters follows the process shown in Figure 4.
  • The first training card can be the main training card in the first training node or any training card in the first training node.
  • Likewise, the second training card can be the main training card in the second training node or any training card in the second training node.
  • The main training card is the training card used to update sparse parameters in the corresponding training node. The updating process of sparse parameters in these two cases is described below.
  • The first training node includes the first training card and a third training card.
  • The first training card and the third training card are both training cards participating in the training.
  • The first training card is the main training card in the first training node, and there are one or more third training cards.
  • A partial training model runs on each training card participating in training on the first training node.
  • The partial training models running on all training cards participating in training form a complete training model. After training is completed, the complete training model can be used for inference; for example, it can be a recommendation model or a natural language processing model.
  • the embodiments of this application do not limit the specific structure of the model.
  • The second training node includes a second training card and a fourth training card.
  • Both the second training card and the fourth training card are training cards participating in training.
  • The second training card is the main training card in the second training node, and there are one or more fourth training cards.
  • A partial training model runs on each training card participating in the training on the second training node.
  • The partial training models running on all training cards participating in the training form a complete training model, which is the same as the complete training model on the first training node.
  • Step 501 The first training node obtains the data to be trained for this round of training.
  • the data to be trained can be stored in a distributed manner on parameter nodes or on other servers.
  • the data to be trained can also be called training data.
  • the first training node obtains the data to be trained for its current round of training, which is called a batch of data.
  • the batch data is the sum of the data to be trained used by the training cards participating in the training on the first training node.
  • the number of data to be trained in each round can be preset.
  • The following description uses the case where the data to be trained is stored on the parameter node as an example.
  • Step 502 The first training node calculates the characteristics of the sparse parameters required for the data to be trained, and obtains the sparse parameters from the corresponding parameter node based on the characteristics of the sparse parameters.
  • the sparse parameters required for the data to be trained are parameters used to convert the data to be trained into a fixed-dimensional vector.
  • The processor (such as a CPU) of the first training node calculates the characteristics of the sparse parameters required for the batch data, queries the corresponding parameter node based on those characteristics, and obtains the sparse parameters from that parameter node.
  • the system architecture includes a first parameter node and N second parameter nodes, where N is an integer greater than or equal to 1.
  • The sparse parameters obtained by the first training node from the first parameter node constitute the first parameter set, and the sparse parameters obtained by the first training node from each of the N second parameter nodes constitute a second parameter set. In this way, the first training node obtains one first parameter set and N second parameter sets.
  • different data to be trained may correspond to the same sparse parameters, so the sparse parameters can be deduplicated first and then obtained from the parameter node, thus saving transmission resources.
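  • The deduplication mentioned above can be sketched as follows, assuming that each sample in a batch is represented as a list of sparse-parameter features; `unique_features` is a hypothetical helper name.

```python
def unique_features(batch):
    """Return each feature once, in first-seen order."""
    # dict.fromkeys keeps insertion order and drops repeated keys.
    return list(dict.fromkeys(f for sample in batch for f in sample))


# Three samples that share some features.
batch = [[3, 7, 7], [7, 9], [3]]
to_fetch = unique_features(batch)

requested_without_dedup = sum(len(sample) for sample in batch)
requested_with_dedup = len(to_fetch)
```

  • In this toy batch, six lookups collapse into three requests to the parameter nodes.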
  • Step 503 The first training node transmits the data to be trained, the sparse parameters, and the characteristics of the sparse parameters to each training card of the first training node.
  • the processor (such as CPU, etc.) of the first training node broadcasts the batch data, sparse parameters, and characteristics of the sparse parameters to each training card of the first training node.
  • The first training node transmits the batch data, the sparse parameters, and the features of the sparse parameters to each training card through a scatter operation and an all reduce operation.
  • Here, each training card refers to each training card participating in training on the first training node.
  • The transmission through the scatter operation and the all reduce operation proceeds as follows: assuming that the first training node has P training cards, the first training node divides the batch data, the sparse parameters, and the features of the sparse parameters into P parts each, and sends one part of the batch data, one part of the sparse parameters, and one part of the features to each training card, so that different training cards obtain different data. Each training card then obtains all of the batch data, sparse parameters, and features of the sparse parameters through the all reduce operation.
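  • The data movement described above can be simulated in a single process. This sketch only reproduces the end effect stated in the text (each card first holds one of P parts, then every card holds everything); it does not call any real collective communication library, and `split_into` and `scatter_then_collect` are hypothetical names.

```python
def split_into(items, p):
    """Split items into p nearly equal contiguous parts."""
    quotient, remainder = divmod(len(items), p)
    parts, start = [], 0
    for i in range(p):
        end = start + quotient + (1 if i < remainder else 0)
        parts.append(items[start:end])
        start = end
    return parts


def scatter_then_collect(items, p):
    # Scatter: card i receives only parts[i].
    parts = split_into(items, p)
    # Collective step: afterwards every card holds the concatenation of all parts.
    full = [x for part in parts for x in part]
    per_card_view = [list(full) for _ in range(p)]
    return parts, per_card_view


batch_data = list(range(7))
local, per_card_view = scatter_then_collect(batch_data, p=3)
```

  • The host sends each element only once during the scatter, and the remaining copies travel between the training cards.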
  • Step 504 Each training card in the first training node uses the corresponding data to be trained to train the sparse parameters corresponding to the data to be trained, and obtains the gradient corresponding to the sparse parameters.
  • Each training card in the first training node obtains its own mini-batch data from the batch data, and then trains using the sparse parameters corresponding to the mini-batch data to obtain the gradients corresponding to those sparse parameters.
  • the mini-batch data of different training cards are not exactly the same. They may be partially the same or completely different.
  • Step 505 The first training card aggregates the gradient data obtained by multiple training cards in the first training node after completing training.
  • the first training card obtains the sparse parameters and corresponding gradients on the third training card.
  • the third training card is a training card participating in the training in the first training node except the first training card.
  • the third training card trains to obtain the gradient corresponding to its own sparse parameter, and the third training card sends the gradient corresponding to the sparse parameter, the characteristics of the sparse parameter, and the sparse parameter to the first training card.
  • an all reduce sum operation is performed between the first training card and the third training card, so that the first training card obtains the characteristics of the sparse parameters, the sparse parameters and the corresponding gradients on the third training card.
  • the first training card can obtain, for each sparse parameter, the sum of its gradients over all training cards participating in training in the first training node. In this way, after the all reduce sum operation, because the gradients corresponding to the same sparse parameter on the first training node have been added together, the amount of data transmitted by the first training card to the main training cards of other training nodes is reduced.
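  • The effect of the intra-node sum can be sketched as follows. This is an illustration only: gradients are assumed to be keyed by an integer feature id and stored as lists of floats, and `sum_gradients_within_node` is a hypothetical helper, not an operation named in the application.

```python
from typing import Dict, List


def sum_gradients_within_node(
    per_card_grads: List[Dict[int, List[float]]]
) -> Dict[int, List[float]]:
    """Add up the gradients of each sparse parameter across the cards of one node."""
    summed: Dict[int, List[float]] = {}
    for card_grads in per_card_grads:
        for feature, grad in card_grads.items():
            if feature in summed:
                summed[feature] = [a + b for a, b in zip(summed[feature], grad)]
            else:
                summed[feature] = list(grad)
    return summed


# Feature 101 is touched by both cards, so only one summed gradient leaves the node.
card_1 = {101: [1.0, 2.0], 102: [0.5, 0.5]}
card_3 = {101: [3.0, 4.0], 103: [1.0, 1.0]}
node_grads = sum_gradients_within_node([card_1, card_3])
```

  • Four per-card gradients collapse into three per-node gradients here, which is the reduction in cross-node traffic that the text describes.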
  • Step 506 The first training card segments the aggregated gradient data according to training nodes.
  • the first training card determines the training nodes participating in the training in the system architecture, and uses the characteristics of the sparse parameters corresponding to the aggregated gradient data to determine the training nodes to which each sparse parameter is distributed. Since gradients correspond to sparse parameters one-to-one, determining the training nodes to which the sparse parameters are distributed also determines the training nodes to which the gradients corresponding to the sparse parameters are distributed.
  • the first training card calculates a hash value of the feature of each sparse parameter.
  • the first training card uses the hash value of the feature of each sparse parameter to determine the training node to which each sparse parameter is distributed. For example, each training node corresponds to an index, and the hash value of the index of each training node is determined. For any sparse parameter, the hash value of the feature of the sparse parameter is determined, the training node whose hash value is closest to the hash value of the feature of the sparse parameter is selected, and the selected training node is determined as the training node to which the sparse parameter is distributed. Since each training node has only one main training card, determining the training node also determines the main training card.
  • the gradients distributed to the main training card of the second training node (i.e., the second training card) form the first gradient set.
  • the first gradient set includes, among the gradients corresponding to the parameters of the first parameter set, the gradients distributed to the second training card.
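  • A hedged sketch of the closest-hash assignment in step 506 follows. The application does not name a hash function or define "closest", so SHA-256 and the smallest absolute difference are assumptions made here for illustration; `stable_hash`, `assign_node`, and `split_by_node` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, List, Sequence


def stable_hash(value: int) -> int:
    """Deterministic 64-bit hash of an integer (assumed choice: SHA-256)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign_node(
    feature: int,
    node_indices: Sequence[int],
    hash_fn: Callable[[int], int] = stable_hash,
) -> int:
    """Pick the training node whose index hash is closest to the feature hash."""
    feature_hash = hash_fn(feature)
    return min(node_indices, key=lambda idx: abs(hash_fn(idx) - feature_hash))


def split_by_node(
    grads: Dict[int, List[float]],
    node_indices: Sequence[int],
    hash_fn: Callable[[int], int] = stable_hash,
) -> Dict[int, Dict[int, List[float]]]:
    """Group aggregated gradients into one gradient set per destination node."""
    shards: Dict[int, Dict[int, List[float]]] = {idx: {} for idx in node_indices}
    for feature, grad in grads.items():
        shards[assign_node(feature, node_indices, hash_fn)][feature] = grad
    return shards


# With an identity hash the result is easy to follow by hand.
shards = split_by_node({30: [1.0], 70: [2.0]}, [0, 100], hash_fn=lambda v: v)
```

  • In this toy run, the set addressed to node 100 plays the role of the first gradient set, and the set addressed to node 0 stays on the local main training card.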
  • Step 507 The first training node sends the first gradient set and the sparse parameters corresponding to the gradients in the first gradient set to the second training card.
  • the first training node uses the network between the first training node and the second training node to transmit the first gradient set and the sparse parameters corresponding to the gradients in the first gradient set to the second training node, and the second training node then sends the received sparse parameters and corresponding gradients to its main training card.
  • alternatively, when an inter-card transmission network exists between the first training card and the second training card, the first gradient set and the sparse parameters corresponding to the gradients in the first gradient set are distributed directly to the second training card through the inter-card transmission network.
  • the first training card can also send the sparse parameter features to the second training card.
  • the features of the sparse parameters can be used later to update the updated sparse parameters to the parameter node.
  • the second training card can use the sparse parameters and look up the table to find the features of the sparse parameters.
  • Step 508 The second training card uses the gradient in the first gradient set to update the sparse parameters corresponding to the gradients in the first gradient set.
  • the second training card adds all gradients corresponding to any sparse parameter, calculates the average value, and determines the calculated average value as the gradient aggregation result corresponding to any sparse parameter.
  • the second training card uses the gradient aggregation result corresponding to the sparse parameter and applies the gradient descent method to adjust the sparse parameter in the direction opposite to the gradient, obtaining the updated sparse parameter.
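  • Step 508 can be sketched as an average followed by one plain gradient descent step. The learning rate is an assumption, since the application does not specify one, and `aggregate_and_update` is a hypothetical helper.

```python
from typing import List, Tuple


def aggregate_and_update(
    param: List[float],
    grads: List[List[float]],
    learning_rate: float,
) -> Tuple[List[float], List[float]]:
    """Average all gradients of one sparse parameter, then step against the gradient."""
    count = len(grads)
    averaged = [sum(g[i] for g in grads) / count for i in range(len(param))]
    updated = [p - learning_rate * a for p, a in zip(param, averaged)]
    return averaged, updated


# Two gradients for the same sparse parameter, e.g. one local and one received.
averaged, updated = aggregate_and_update(
    param=[1.0, 2.0],
    grads=[[2.0, 4.0], [4.0, 8.0]],
    learning_rate=0.5,
)
```

  • A production optimizer would typically keep extra state (for example momentum), which this sketch deliberately leaves out.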
  • Step 509 The second training card sends the updated sparse parameters to the corresponding parameter node.
  • the second training card uses the characteristics of sparse parameters to determine the index value.
  • the second training card uses the index value to correspond to the parameter node, and sends the updated sparse parameters to the corresponding parameter node.
  • Step 510 The parameter node stores the received sparse parameters.
  • the second training card of the second training node also sends to the first training card the sparse parameters and gradients, among those of the second training node, that are distributed to the first training card. The second training card has obtained the sparse parameters and corresponding gradients of all training cards (the second training card and the fourth training card) on the second training node.
  • the first training node receives the sparse parameters and corresponding gradients sent by the second training card to the first training card.
  • the first training card aggregates the gradients corresponding to the current sparse parameters on the first training card, obtains the gradient aggregation results corresponding to the current sparse parameters, and updates the current sparse parameters based on the gradient aggregation results.
  • the current sparse parameters include the sparse parameters that the first training card distributes to itself and the sparse parameters received from the second training card.
  • the first training node uses the network between the first training node and the second training node to receive the sparse parameters and corresponding gradients that the second training card of the second training node sends to the first training card.
  • the first training node sends the received sparse parameters and corresponding gradients to the first training card.
  • when an inter-card transmission network exists between the first training card and the second training card, the first training card receives the sparse parameters and corresponding gradients sent to itself by the second training card through the inter-card transmission network.
  • the current sparse parameters on the first training card include the sparse parameters distributed to itself and the sparse parameters received from the second training card.
  • the first training card aggregates the gradients corresponding to any sparse parameter to obtain the gradient aggregation result corresponding to that sparse parameter, and uses the gradient aggregation result to update the sparse parameter, obtaining the updated sparse parameter.
  • the first training card uses the features of sparse parameters to determine the index value.
  • the first training card uses the index value to correspond to the parameter node, and updates the updated sparse parameters to the corresponding parameter node.
  • the second training card also sends the features of the sparse parameters to the first training card.
  • the first training card can also receive the characteristics of the sparse parameters sent by the second training card. In this way, the features of the sparse parameters can be used later to update the updated sparse parameters to the parameter node.
  • the first training card can use the sparse parameters and look up the table to find the characteristics of the sparse parameters.
  • the training node to which the sparse parameter is distributed is determined based on the index of the training node and the characteristics of the sparse parameter.
  • the training node stores the updated sparse parameters to the corresponding parameter node through the network with the parameter node.
  • gradient segmentation is performed according to training nodes.
  • segmentation can also be performed according to parameter nodes, to ensure that sparse parameters stored on the same parameter node are split to the same training node.
  • the mapping relationship between parameter nodes and training nodes can be stored in advance.
  • the main training card uses the characteristics of the sparse parameters to determine an index value. Different index values correspond to different parameter nodes, and one index value corresponds to only one parameter node. Using an index value, the corresponding parameter node can be found, and the training node corresponding to that parameter node can then be determined from the mapping relationship. In this way, in the process shown in Figure 5, the gradients corresponding to the sparse parameters with the first index value are classified into the first gradient set, and the first index value corresponds to a certain parameter node.
  • the main training card can use the characteristics of one sparse parameter to determine the corresponding parameter node, instead of using the characteristics of each sparse parameter to determine the corresponding parameter node.
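  • The parameter-node-based split can be sketched as a two-stage lookup. The application does not say how the index value is derived from a feature, so the modulo rule below is purely an assumption, and the node names and helper names are hypothetical.

```python
from typing import Dict, List


def parameter_node_of(feature: int, num_parameter_nodes: int) -> int:
    """Assumed rule: the index value is the feature id modulo the node count."""
    return feature % num_parameter_nodes


def group_by_training_node(
    features: List[int],
    num_parameter_nodes: int,
    param_to_training: Dict[int, str],
) -> Dict[str, List[int]]:
    """Route each feature to the training node mapped to its parameter node."""
    groups: Dict[str, List[int]] = {}
    for feature in features:
        param_node = parameter_node_of(feature, num_parameter_nodes)
        groups.setdefault(param_to_training[param_node], []).append(feature)
    return groups


# Mapping stored in advance: parameter nodes 0 and 1 -> train-0, node 2 -> train-1.
mapping = {0: "train-0", 1: "train-0", 2: "train-1"}
groups = group_by_training_node([3, 4, 5, 8], 3, mapping)
```

  • Because every feature in a group shares its parameter node mapping, a training node can send its updated parameters to a small, known set of parameter nodes instead of resolving the destination per parameter.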
  • the system architecture 100 includes two physical nodes.
  • the two physical nodes are physical node 1 and physical node 2.
  • Two training cards are inserted into each physical node.
  • the two training cards on physical node 1 are training card 1 and training card 2.
  • Training card 1 is the main training card.
  • the two training cards on physical node 2 are training card 3 and training card 4.
  • Training card 3 is the main training card.
  • the training card can be an NPU.
  • Step 601 The training node in physical node 1 obtains the data to be trained for this round of training, which is called a batch data.
  • the CPU of physical node 1 calculates the characteristics of the sparse parameters required for the batch data, and obtains the sparse parameters from the corresponding parameter node based on the characteristics of the sparse parameters.
  • the sparse parameter is represented by W1, which is part of the total sparse parameters. It is assumed that W1 includes three subsets A, B and C.
  • Step 602 Physical node 1 transmits the batch data, W1, and W1 features to the two training cards through scatter operations and all reduce operations.
  • Step 603 The two training cards in physical node 1 use the corresponding mini-batch data to train the sparse parameters corresponding to the mini-batch data, and obtain the gradient corresponding to the sparse parameters.
  • the sparse parameters corresponding to the mini-batch data of training card 1 are represented by subsets A1, B1, and C1, and the obtained gradients are represented by A11, B11, and C11.
  • the sparse parameters corresponding to the mini-batch data of training card 2 are represented by subsets A2, B2, and C2, and the obtained gradients are represented by A22, B22, and C22.
  • Step 604 The main training card (training card 1) in physical node 1 obtains the sparse parameters and corresponding gradients on training card 2 in physical node 1.
  • the sparse parameters and corresponding gradients in the current physical node 1 are represented as subsets A33, B33, and C33, and each subset includes multiple sparse parameters and corresponding gradients.
  • Step 605 The main training card in physical node 1 divides these subsets into shards A44 and B44 according to the characteristics of the sparse parameters and the index of the training node.
  • the main training cards corresponding to A44 and B44 are, respectively, the main training card of physical node 1 and the main training card of physical node 2.
  • Step 606 The main training card of physical node 1 sends the fragment B44 and the characteristics of the sparse parameters in fragment B44 to the main training card of physical node 2.
  • Step 607 The main training card of physical node 1 receives the fragment C44 sent by the main training card of physical node 2 and the characteristics of the sparse parameters in fragment C44.
  • Step 608 The main training card of physical node 1 aggregates the gradients in A44 and C44 to obtain the gradient aggregation result A55. Based on the gradient aggregation result A55, the sparse parameters in A44 and C44 are updated.
  • Step 609 The main training card of physical node 1 updates the updated sparse parameters to the corresponding parameter node.
  • physical node 2 is processed similarly to physical node 1.
  • the sparse parameters obtained by physical node 2 are represented by W2, including three subsets D, E and F.
  • the sparse parameters corresponding to the mini-batch data of training card 3 are expressed as subsets D1, E1, and F1, and the obtained gradients are expressed as subsets D11, E11, and F11; the sparse parameters corresponding to the mini-batch data of training card 4 are expressed as subsets D2, E2, and F2, and the obtained gradients are expressed as subsets D22, F22, and E22.
  • the sparse parameters and corresponding gradients on training cards 3 and 4 in physical node 2 are merged and expressed as subsets D33, F33 and E33.
  • Each subset includes multiple sparse parameters and corresponding gradients.
  • Training card 3 divides subsets D33, F33, and E33 into shards C44 and D44 based on the characteristics of the sparse parameters in subsets D33, F33, and E33. Training card 3 sends C44 and the characteristics of the sparse parameters in C44 to training card 1 of physical node 1. Training card 3 of physical node 2 aggregates the gradients in B44 and D44 to obtain the gradient aggregation result B55. Based on the gradient aggregation result B55, the sparse parameters are updated.
  • Adopting the solution shown in Figure 5, in which the training card performs gradient aggregation and parameter updating, reduces CPU and physical node memory usage compared with the traditional approach of updating parameters on parameter nodes, making gradient aggregation and parameter updating more efficient. Moreover, when there is an inter-card transmission network, data transmission depends less on the host network, further alleviating host network bottlenecks.
  • FIG. 7 takes the interaction between the first training node and the second training node included in the system as an example for illustration.
  • Step 701 The first training node obtains the data to be trained on each training card for this round of training.
  • the data to be trained can be stored in a distributed manner on parameter nodes or on other servers.
  • the data to be trained can also be called training data.
  • the first training node obtains the data to be trained for this round of training on each training card.
  • the data to be trained for this round of training on each training card is called a mini-batch data.
  • the amount of data to be trained in each round of training can be set in advance.
  • Step 702 The first training node calculates the characteristics of the sparse parameters required for the data to be trained in this round of each training card, and obtains the sparse parameters from the corresponding parameter node based on the characteristics of the sparse parameters.
  • the processor (such as CPU, etc.) of the first training node calculates the characteristics of the required sparse parameters. Based on the characteristics of the sparse parameter, the corresponding parameter node is queried, and the sparse parameter is obtained from the parameter node.
  • the sparse parameters can be deduplicated first and then obtained from the parameter node. In this way, transmission resources can be saved.
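  • The deduplication mentioned above can be sketched as follows, assuming that sparse parameters are looked up by an integer feature id and that the parameter node behaves like a dictionary. `fetch_sparse_params` and the in-memory `store` are hypothetical stand-ins for the real lookup.

```python
from typing import Dict, List, Tuple


def fetch_sparse_params(
    features: List[int],
    parameter_store: Dict[int, List[float]],
) -> Tuple[Dict[int, List[float]], int]:
    """Fetch each distinct feature once and report how many lookups were made."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unique_features = list(dict.fromkeys(features))
    fetched = {feature: parameter_store[feature] for feature in unique_features}
    return fetched, len(unique_features)


store = {5: [0.1], 3: [0.2], 7: [0.3]}
params, lookups = fetch_sparse_params([5, 3, 5, 7, 3], store)
```

  • Five requested features result in three lookups in this toy run, which is the saving in transmission resources that the text refers to.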
  • Step 703 The first training node transmits the data to be trained, the sparse parameters, and the characteristics of the sparse parameters of each training card to each training card.
  • the processor of the first training node transmits the data to be trained, the sparse parameters, and the characteristics of the sparse parameters of each training card to each training card.
  • Step 704 Each training card in the first training node uses the corresponding data to be trained to train the sparse parameters corresponding to the data to be trained, and obtains the gradient of the sparse parameters corresponding to the data to be trained.
  • Step 705 Each training card in the first training node determines the training card of the training node to which its own sparse parameters are distributed according to each training card in the system.
  • each training card in the first training node determines the training card of each training node, and uses the hash value of the characteristics of its own sparse parameters to determine the training card of the training node to which its own sparse parameters are distributed.
  • the training card determines a hash value of a feature of its own sparse parameter.
  • the training card of each training node corresponds to an index, and the hash value of the index of each training card is determined.
  • the training card whose hash value is closest to the hash value of the feature of the sparse parameter is selected from the training cards, and the selected training card is determined as the training card to which the sparse parameter is distributed.
  • the gradients distributed to the second training card of the second training node form the first gradient set.
  • the first gradient set includes, among the gradients corresponding to the multiple parameters of the first parameter set, the gradients distributed to the second training card, and, among the gradient data obtained by the first training card after completing training for the N second parameter sets, the gradients distributed to the second training card.
  • Step 706 Each training card in the first training node distributes its own sparse parameters and corresponding gradients to the training card of the corresponding training node.
  • sending to the second training card is used as an example in Figure 7 for explanation.
  • the first training node uses the network between the first training node and the second training node to distribute the sparse parameters and corresponding gradients to the second training node.
  • the second training node then sends the received sparse parameters and corresponding gradients to the second training card.
  • the identification of the training card will also be sent, indicating the training card to which the sparse parameters and corresponding gradients are sent.
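  • A hedged sketch of how a receiving training node might use the training card identification to forward incoming data is shown below. The message layout and the names `GradientMessage` and `dispatch` are hypothetical; the application only states that the identification is sent along with the sparse parameters and gradients.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GradientMessage:
    target_card: int        # identification of the destination training card
    features: List[int]     # features of the sparse parameters
    params: List[float]     # sparse parameter values (scalars for brevity)
    grads: List[float]      # gradients matching `params` position by position


def dispatch(
    messages: List[GradientMessage],
    local_cards: List[int],
) -> Dict[int, List[GradientMessage]]:
    """Forward each received message to the queue of the card it names."""
    queues: Dict[int, List[GradientMessage]] = {card: [] for card in local_cards}
    for message in messages:
        if message.target_card not in queues:
            raise KeyError(f"card {message.target_card} is not on this node")
        queues[message.target_card].append(message)
    return queues


incoming = [
    GradientMessage(3, [11], [0.5], [0.1]),
    GradientMessage(4, [12], [0.7], [0.2]),
    GradientMessage(3, [13], [0.9], [0.3]),
]
queues = dispatch(incoming, local_cards=[3, 4])
```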
  • in step 706, when gradients and sparse parameters are transmitted between training cards in the same training node, the bus can be used for transmission.
  • each training card can also send features of sparse parameters to the second training card.
  • the features of the sparse parameters can be used later to update the updated sparse parameters to the parameter node.
  • the second training card can use the sparse parameters and look up the table to find the sparse parameter features.
  • Step 707 The second training card uses the gradient in the first gradient set to update the sparse parameters corresponding to the gradients in the first gradient set.
  • Step 708 The second training card sends the updated sparse parameters to the corresponding parameter node.
  • for step 707 and step 708, refer to the process shown in Figure 5; details are not repeated here.
  • Step 709 The parameter node stores the received sparse parameters.
  • each training card of the first training node receives the sparse parameters and corresponding gradients sent to itself by other training cards.
  • other training cards include training cards in the first training node except the first training card and training cards of other training nodes.
  • the other training cards include a second training card.
  • Each training card in the first training node aggregates the gradient corresponding to its current sparse parameter, obtains the gradient aggregation result corresponding to the current sparse parameter, and updates the current sparse parameter based on the gradient aggregation result.
  • the current sparse parameters include the part of its original sparse parameters that it keeps and the received sparse parameters.
  • Each training card in the first training node stores the updated sparse parameters to the corresponding parameter node.
  • the gradient segmentation is performed according to training cards.
  • it can also be performed according to parameter nodes, to ensure that sparse parameters stored on the same parameter node are split to the same training card.
  • the mapping relationship between parameter nodes and training cards can be stored in advance.
  • the parameter node can correspond to one or more training cards.
  • Any training card uses the characteristics of the sparse parameters to determine an index value.
  • the index value corresponds to a parameter node, and the training card corresponding to that parameter node can be determined from the mapping relationship. In this way, after updating the sparse parameters, the training card can use the characteristics of one sparse parameter to determine the corresponding parameter node, without using the characteristics of each sparse parameter to determine the corresponding parameter node.
  • the system architecture 100 includes two physical nodes.
  • the two physical nodes are physical node 1 and physical node 2.
  • Two training cards are inserted into the physical node 1, and the two training cards are training card 1 and training card 2.
  • one training card is inserted into physical node 2; this training card is training card 3. A training card can be an NPU.
  • Step 801 The training node in physical node 1 obtains the data to be trained for this round of training of each training card, and the data to be trained for each training card is called a mini-batch data.
  • the CPU of physical node 1 calculates the characteristics of the sparse parameters required for the data to be trained on each training card, and obtains the sparse parameters from the corresponding parameter node based on the characteristics of the sparse parameters.
  • the sparse parameters used in training card 1 are represented by W11, which are full sparse parameters. It is assumed that W11 includes three subsets G, H and I.
  • the sparse parameters used in training card 2 are represented by W12, which are full sparse parameters. It is assumed that W12 includes three subsets M, N and O.
  • Step 802 Physical node 1 transmits the mini-batch data, the sparse parameters, and the features of the sparse parameters corresponding to the data to be trained to the two training cards.
  • Step 803 Each training card in physical node 1 uses the corresponding mini-batch data to train the sparse parameters corresponding to the mini-batch data, and obtains the gradient corresponding to the sparse parameters.
  • The sparse parameters and obtained gradients on training card 1 are represented as subsets G1, H1, and I1, and the sparse parameters and obtained gradients on training card 2 are represented as subsets M1, N1, and O1.
  • Each subset includes sparse parameters and corresponding gradients.
  • Step 804 Training card 1 in physical node 1 determines, based on the characteristics of the sparse parameters in subsets G1, H1, and I1, that the training cards corresponding to subsets G1, H1, and I1 are, respectively, training card 1 and training card 2 in physical node 1 and training card 3 in physical node 2.
  • Training card 2 in physical node 1 determines, based on the characteristics of the sparse parameters in subsets M1, N1, and O1, that the training cards corresponding to subsets M1, N1, and O1 are, respectively, training card 2 and training card 1 in physical node 1 and training card 3 in physical node 2.
  • Step 805 Training card 1 in physical node 1 sends subset I1 to training card 3 in physical node 2. Training card 1 in physical node 1 sends subset H1 to training card 2 in physical node 1. Training card 2 in physical node 1 sends subset O1 to training card 3 in physical node 2. Training card 2 in physical node 1 sends subset N1 to training card 1 in physical node 1.
  • Step 806 The training card 1 in the physical node 1 receives the subset N1 sent by the training card 2, and receives the subset J1 sent by the training card 3 in the physical node 2.
  • Training card 2 in physical node 1 receives the subset H1 sent by training card 1, and receives the subset K1 sent by training card 3 in physical node 2.
  • Step 807 The training card 1 in the physical node 1 aggregates the gradients in the subsets N1, G1 and J1 to obtain the gradient aggregation result A66. Based on the gradient aggregation result A66, the sparse parameters in the subsets N1, G1 and J1 are updated.
  • the training card 2 in the physical node 1 aggregates the gradients in the subsets M1, H1 and K1 to obtain the gradient aggregation result A77. Based on the gradient aggregation result A77, the sparse parameters in the subsets M1, H1 and K1 are updated.
  • Step 808 Training card 1 and training card 2 in physical node 1 update the updated sparse parameters to the corresponding parameter nodes.
  • the first training card mentioned above is any training card in the physical node.
  • the processing of physical node 2 is similar to that of physical node 1.
  • the sparse parameters obtained by physical node 2 are represented by W13, including three subsets J, K and L.
  • Subsets J1, K1, and L1 are obtained.
  • Subsets J1, K1, and L1 include sparse parameters and corresponding gradients respectively.
  • the training card 3 in the physical node 2 aggregates the gradients in the subsets L1, I1 and O1 to obtain the gradient aggregation result B66. Based on the gradient aggregation result B66, the sparse parameters in the subsets L1, I1 and O1 are updated.
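  • The card-to-card exchange of the Figure 8 example can be sketched with three simulated training cards. Scalar gradients are used for brevity, and the owner rule (feature id modulo the number of cards) is only a stand-in for the hash-based rule of step 705; `owner_card`, `exchange`, and `average` are hypothetical names.

```python
from typing import Dict, List


def owner_card(feature: int, num_cards: int) -> int:
    """Assumed stand-in for the hash-based choice of the destination card."""
    return feature % num_cards


def exchange(per_card_grads: List[Dict[int, float]]) -> List[Dict[int, List[float]]]:
    """Every card sends each gradient directly to the card that owns its feature."""
    num_cards = len(per_card_grads)
    inbox: List[Dict[int, List[float]]] = [{} for _ in range(num_cards)]
    for grads in per_card_grads:
        for feature, grad in grads.items():
            owner = owner_card(feature, num_cards)
            inbox[owner].setdefault(feature, []).append(grad)
    return inbox


def average(received: Dict[int, List[float]]) -> Dict[int, float]:
    """Gradient aggregation result per sparse parameter on one card."""
    return {feature: sum(gs) / len(gs) for feature, gs in received.items()}


# Cards 0 and 1 share a physical node in the example; card 2 is on the other node.
cards = [{0: 1.0, 1: 2.0, 2: 3.0}, {0: 3.0, 1: 4.0}, {2: 5.0}]
inbox = exchange(cards)
results = [average(box) for box in inbox]
```

  • No card first merges gradients with its neighbour on the same node; each card sends directly to the owning card and every owner aggregates only what it receives.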
  • Adopting the solution shown in Figure 8 using the training card for gradient aggregation and parameter updating, compared with the traditional parameter node method for parameter updating, reduces CPU and physical node memory usage, making gradient aggregation and parameter updating more efficient. Moreover, when there is an inter-card transmission network, data transmission can reduce dependence on the host network and further alleviate host network bottlenecks.
  • multiple training cards within a training node do not need to synchronize sparse parameters within the node, and the advantages of parallelism of multiple training cards can be fully utilized.
  • the training card can directly transmit the gradient after calculating it, without the need for more complicated operations.
  • the network data transmission volume of sparse parameters on a single training card is W_k.
  • the upper limit of the total network data transmission volume of sparse parameters is N_worker*N_device_per_worker*W_k.
  • the network data transmission volume of sparse parameters is (N_worker-1)*N_device_per_worker*W_k/N_worker.
  • N_worker is the number of training nodes
  • N_device_per_worker is the number of training cards in each training node.
  • the network data transmission volume of a single training node is (N_worker-1)*(W_k+G_k)/N_worker, where N_worker is the number of training nodes and G_k is the gradient corresponding to the sparse parameters on a single training node; data transmission within a training node is not counted in the network data transmission volume. Without considering gradient compression, W_k is equal to G_k.
  • the network data transmission volume of a single training node can also be expressed as 2*(N_worker-1)*G_k/N_worker; the factor of 2 appears because the sparse parameters also need to be transmitted.
  • the parameter node and the training node are deployed on the same physical node.
  • the amount of network transmission data is reduced to (N_worker-1)*N_device_per_worker*W_k/N_worker+(N_worker-1)*(W_k+G_k).
  • the network data transmission volume is (N_worker-1)*G_k/N_worker.
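  • To make the expressions above easier to compare, the following sketch evaluates them exactly as written for one arbitrary configuration. The function names are hypothetical, the sample values are not taken from the application, and the units of W_k and G_k are left abstract.

```python
def sparse_param_upper_bound(n_worker: int, n_device_per_worker: int, w_k: float) -> float:
    """N_worker * N_device_per_worker * W_k"""
    return n_worker * n_device_per_worker * w_k


def sparse_param_volume(n_worker: int, n_device_per_worker: int, w_k: float) -> float:
    """(N_worker - 1) * N_device_per_worker * W_k / N_worker"""
    return (n_worker - 1) * n_device_per_worker * w_k / n_worker


def single_node_volume(n_worker: int, w_k: float, g_k: float) -> float:
    """(N_worker - 1) * (W_k + G_k) / N_worker"""
    return (n_worker - 1) * (w_k + g_k) / n_worker


def gradient_only_volume(n_worker: int, g_k: float) -> float:
    """(N_worker - 1) * G_k / N_worker"""
    return (n_worker - 1) * g_k / n_worker


# Arbitrary sample values: 4 training nodes, 8 cards per node, W_k = G_k = 1.0.
n_worker, n_device_per_worker, w_k, g_k = 4, 8, 1.0, 1.0
upper = sparse_param_upper_bound(n_worker, n_device_per_worker, w_k)
fetch = sparse_param_volume(n_worker, n_device_per_worker, w_k)
exchange_volume = single_node_volume(n_worker, w_k, g_k)
grad_only = gradient_only_volume(n_worker, g_k)
```

  • With W_k equal to G_k, `single_node_volume` matches the alternative form 2*(N_worker-1)*G_k/N_worker given in the text.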
  • an embodiment of the present application provides a computer program product.
  • the computer program product includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs part of the method performed by the first training node in the process shown in FIG. 4 .
  • the disclosed system architecture, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the modules is only a logical function division. In actual implementation, there may be other division methods.
  • multiple modules or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be indirect coupling or communication connection through some interfaces, devices or modules, or may be electrical, mechanical or other forms of connection.
  • the modules described as separate components may or may not be physically separated.
  • the components shown as modules may or may not be physical modules, that is, they may be located in one place, or they may be distributed to multiple network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the embodiments of the present application.
  • each module in each embodiment of the present application can be integrated into one processing module, or each module can exist physically alone, or two or more modules can be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or software modules.
  • the integrated module is implemented in the form of a software function module and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part of it that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which can be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage media include: U disk, mobile hard disk, ROM, RAM, magnetic disk or optical disk and other media that can store program codes.
  • the terms "first" and "second" are used to distinguish identical or similar items that have substantially the same functions. It should be understood that there is no logical or timing dependency between "first" and "second", and that the terms do not limit the number or the execution order. It should also be understood that, although the description uses the terms first, second, etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another. For example, without departing from the scope of the various examples, a first sparse parameter may be referred to as a second sparse parameter, and similarly, the second sparse parameter may be referred to as a first sparse parameter. Both the first sparse parameter and the second sparse parameter may be sparse parameters, and in some cases, may be separate and different sparse parameters.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

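The present application provides a sparse parameter update method, a training node, a device, and a storage medium, and belongs to the field of deep learning technology. The method is applied to an artificial intelligence model training system that includes a first parameter node, a first training node, and a second training node. The method includes: the first training node obtains a first parameter set from the first parameter node; the first training node trains on the data to be trained using the parameters in the first parameter set to obtain a first gradient set, where the first gradient set includes those gradients, among the gradients corresponding to the parameters of the first parameter set, that are distributed to a second training card of the second training node; the first training node sends the first gradient set and the parameters corresponding to the gradients in the first gradient set to the second training card; the second training card updates those parameters according to the gradients in the first gradient set; and the second training card sends the updated parameters to the first parameter node. The solution of the present application can save transmission resources.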

Description

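Sparse parameter update method, training node, device, and storage medium
This application claims priority to Chinese patent application No. 202210555107.7, filed with the China Patent Office on May 19, 2022 and entitled “Sparse parameter update method, training node, device, and storage medium”, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of deep learning technology, and in particular to a sparse parameter update method, a training node, a device, and a storage medium.
Background
In some neural network models, training data is processed into discrete features through one-hot or multi-hot encoding, and embedding parameters are needed to convert the input into continuous vectors for further processing. In any single training pass, only some of the embedding parameters take part in the computation and are updated. Such parameters are called sparse parameters. The total volume of sparse parameters reaches the order of 10 TB to 30 TB, and the memory of a training card (also called a training accelerator card) is not large enough to hold all of them, so sparse parameters are generally stored in server memory or in a solid state disk (SSD) attached to the server.
In the related art, sparse parameters are stored in a distributed manner in the memory of multiple servers or in their attached SSDs, and training cards are deployed on the servers. One sparse parameter update proceeds as follows: each training node obtains the training data for this round and the sparse parameters corresponding to that training data from memory or SSD; each training node trains on the obtained training data to obtain the gradients of the corresponding sparse parameters; each training node transmits the gradients it obtained over the network to all other training nodes; and each training node aggregates the gradients of the sparse parameters it stores and updates those sparse parameters based on the aggregated result.
Because the gradients computed by all training nodes must be exchanged among all of them once they have been computed, a large amount of network transmission resources is consumed.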
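Summary
The present application provides a sparse parameter update method, a training node, a device, and a storage medium, which can save transmission resources and improve the efficiency of updating sparse parameters.
In a first aspect, the present application provides a sparse parameter update method. The method is applied to an artificial intelligence model training system that includes a first parameter node, a first training node, and a second training node, where the first training node includes a first training card and the second training node includes a second training card. The method includes: the first training node obtains a first parameter set from the first parameter node, the first parameter set including multiple parameters; the first training node trains on the data to be trained using the multiple parameters to obtain a first gradient set, the first gradient set including those gradients, among the multiple gradients corresponding to the multiple parameters of the first parameter set, that are distributed to the second training card; the first training node sends the first gradient set and the parameters corresponding to the gradients in the first gradient set to the second training card; the second training card updates the parameters corresponding to the gradients in the first gradient set according to those gradients; and the second training card sends the updated parameters to the first parameter node.
In this solution, training cards perform the sparse parameter updates, which makes the updates more efficient and in turn improves model training efficiency. In addition, a training node partitions the sparse parameters it has obtained and transmits them to other training cards, so each sparse parameter and its gradient are transmitted to only one training card. No data is transmitted repeatedly, which also saves network transmission resources.
In one example, the method further includes: the first training node obtains one second parameter set from each of N second parameter nodes, each second parameter set including multiple parameters. The first training node training on the data to be trained using the multiple parameters to obtain the first gradient set includes: multiple training cards in the first training node train on the data to be trained using the parameters of the first parameter set and of the N obtained second parameter sets, the multiple training cards including the first training card; the first training card aggregates the gradient data obtained by the multiple training cards after training; and the first training card partitions the aggregated gradient data by parameter node to obtain the first gradient set corresponding to the first parameter node, or the first training card partitions the aggregated gradient data by training node to obtain the first gradient set corresponding to the second training node, where the first gradient set corresponding to the second training node further includes those gradients, among the multiple gradients corresponding to the multiple parameters of the N second parameter sets, that are distributed to the second training card.
In this solution, the first training card of the first training node obtains all parameters of the first training node and their gradients, then partitions the gradients by parameter node or by training node and distributes them to the corresponding training cards, which update the parameters. Training cards thus both compute the gradients of the sparse parameters and update the sparse parameters, which makes the updates more efficient.
In one example, the method further includes: the first training node obtains one second parameter set from each of N second parameter nodes, each second parameter set including multiple parameters. The first training node training on the data to be trained using the multiple parameters to obtain the first gradient set includes: multiple training cards in the first training node train on the data to be trained using the parameters of the first parameter set and of the N obtained second parameter sets, the multiple training cards including the first training card; and the first training card partitions its own gradient data obtained after training according to the training cards in the system to obtain the first gradient set corresponding to the second training card, where the first gradient set further includes those gradients, within the gradient data of the first training card after training, that correspond to parameters of the N second parameter sets and are distributed to the second training card.
In this solution, the training cards of the first training node compute gradients in parallel and partition the gradients in parallel, so the parameters can be updated in parallel and training efficiency improves.
In one example, the method further includes: the first training node obtains one second parameter set from each of N second parameter nodes, each second parameter set including multiple parameters. The first training node training on the data to be trained using the multiple parameters to obtain the first gradient set includes: multiple training cards in the first training node train on the data to be trained using the parameters of the first parameter set and of the N obtained second parameter sets, the multiple training cards including the first training card; and the first training card partitions its own gradient data obtained after training by parameter node to obtain the first gradient set corresponding to the second training card.
In this solution, a mapping is established between parameter nodes and training cards. The training cards of the first training node compute gradients in parallel and partition them by parameter node in parallel, so the parameters distributed to the same training card belong to the same parameter node, and the training card can write the updated parameters back to that parameter node together.
In one example, the second training card updating the parameters corresponding to the gradients in the first gradient set according to those gradients includes: the second training card aggregates the gradients in multiple gradient sets received from multiple training nodes in the system, the multiple gradient sets including the first gradient set; and the parameters corresponding to the gradients in the first gradient set are updated using the aggregated gradients.
In this solution, updating parameters with aggregated gradients makes the updated parameters more accurate.
In one example, the first training card partitioning the aggregated gradient data by parameter node to obtain the first gradient set corresponding to the first parameter node includes: determining an index value from the feature of the parameter corresponding to each piece of aggregated gradient data, one index value indicating one parameter node; and placing the gradients corresponding to parameters that have a first index value into the first gradient set, the first index value indicating the first parameter node.
In a second aspect, the present application provides a first training node that has the functions performed by the first training node in the first aspect.
In a third aspect, the present application provides a computer device that includes a processor and a memory. The memory stores computer instructions, and the processor executes the computer instructions so that the computer device performs part of the sparse parameter update method provided by the first aspect or any optional implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium that stores at least one computer instruction. The computer instruction is read by a processor so that a computer device performs the part of the method, provided by the first aspect or any optional implementation of the first aspect, in which a training node updates sparse parameters.
In a fifth aspect, the present application provides a computer program product that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the part of the method, provided by the first aspect or any optional implementation of the first aspect, in which a training node updates sparse parameters.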
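Brief description of the drawings
Figure 1 is a schematic diagram of a system architecture provided by an exemplary embodiment of the present application;
Figure 2 is a schematic diagram of a system architecture provided by an exemplary embodiment of the present application;
Figure 3 is a schematic diagram of the hardware structure of a device provided by an exemplary embodiment of the present application;
Figure 4 is a schematic flowchart of a sparse parameter update method provided by an exemplary embodiment of the present application;
Figure 5 is a schematic flowchart of a sparse parameter update method provided by an exemplary embodiment of the present application;
Figure 6 is a schematic diagram of a sparse parameter update example provided by an exemplary embodiment of the present application;
Figure 7 is a schematic flowchart of a sparse parameter update method provided by an exemplary embodiment of the present application;
Figure 8 is a schematic diagram of a sparse parameter update example provided by an exemplary embodiment of the present application.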
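Detailed description
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the drawings.
Some terms and concepts used in the embodiments of the present application are explained below.
1. An artificial neural network (ANN) model is, in the fields of machine learning and cognitive science, a mathematical or computational model that imitates the structure and function of a biological neural network (such as the central nervous system of an animal, for example the brain) and is used to estimate or approximate functions. An artificial neural network performs computation through a large number of interconnected artificial neurons. An artificial neural network may also be referred to simply as a neural network (NN) model or a neural-network-like model.
2. An embedding layer is a layer in a neural network model that converts the features of the input layer into vectors of a fixed dimension.
3. Parameters are the parameters of a neural network model, specifically the weight matrix W and bias vector b of each neural network unit. With the architecture of the neural network model fixed, training the model is the process of continually adjusting parameter values so that the model performs best.
4. Sparse parameters are one kind of parameter in a neural network model. Their defining characteristic is that in each round of training only some of the sparse parameters are activated, where “activated” means taking part in the forward computation and the backward update. For example, in neural network models for recommendation systems or natural language processing, the input is processed into discrete features through one-hot or multi-hot encoding, and the embedding layer must use embedding parameters to convert the input into continuous vectors for further processing. In any single training pass, only some of the embedding parameters take part in the computation and are updated. Such parameters are called sparse parameters.
The background of the embodiments of the present application is described below.
The data volume of sparse parameters is large, and the memory of a training card is not large enough to hold all of them, so sparse parameters are generally stored in server memory or in an SSD attached to the server. Updating sparse parameters has the following characteristics: 1. The data volume is huge. For example, the number of sparse parameters in a recommendation model can reach 10^12 to 10^14, consuming storage on the order of 10 TB to 30 TB, so complete training cannot be carried out on a training card. 2. The parameters are sparse, that is, each training pass uses only a very small fraction of all sparse parameters. For example, with a mini-batch of 10000 samples and 1000 features per sample, 10^7 sparse parameters are used. 3. Training involves a large amount of computation and data transmission, so computing resources and network resources easily become bottlenecks.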
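The sparsity figures quoted above can be checked with a few lines of arithmetic. This is only a worked illustration: the variable names are made up for the example, and it assumes no feature is shared between samples, so the count is an upper bound.

```python
# Figures taken from the paragraph above.
mini_batch_size = 10_000      # samples per mini-batch
features_per_sample = 1_000   # sparse features looked up per sample

# Upper bound on sparse parameters touched in one pass (no duplicates assumed).
active_params = mini_batch_size * features_per_sample

# Total number of sparse parameters quoted for a recommendation model.
total_params_low, total_params_high = 10**12, 10**14

# Share of all sparse parameters that one pass can touch.
largest_share = active_params / total_params_low
smallest_share = active_params / total_params_high
```

Even in the most favourable case, a single pass touches about one hundred-thousandth of the parameters, which is why only the activated parameters are fetched and exchanged.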
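In view of these problems, the present application optimizes the sparse parameter update process to reduce the amount of data transmitted, which shortens the time of a single training round and ultimately the overall training time of the model.
The technical solutions provided by the embodiments of the present application are described below from several angles, in the following order: the system architecture, the hardware structure of the devices in the system architecture, and the sparse parameter update method.
The system architecture of the embodiments of the present application is described below.
An embodiment of the present application provides an architecture for an artificial intelligence model training system, referred to as system architecture 100. System architecture 100 is an example of a system architecture for updating sparse parameters. It includes parameter nodes 101 and training nodes 102. A parameter node 101 stores parameters and the like, and a training node 102 updates parameters. A training node 102 may be called a worker.
In one example, the parameter nodes 101 and the training nodes 102 are deployed on different physical nodes and are connected to each other by a wireless or wired network. The embodiments of the present application do not limit the number or type of parameter nodes 101 and training nodes 102. For example, referring to Figure 1, system architecture 100 includes N parameter nodes 101 and M training nodes 102, where N and M are integers that may or may not be equal. At least one training card is deployed on each training node 102, that is, each training node 102 includes at least one training card. A training card may be a neural network processing unit (NPU) or a graphics processing unit (GPU), and the number of training cards deployed on different training nodes 102 may or may not be the same. When the parameter nodes 101 and the training nodes 102 are deployed on different physical nodes, a parameter node 101 may also be called a parameter server.
In another example, one parameter node 101 and one training node 102 are deployed on the same physical node. Referring to Figure 2, the parameter node 101 and the training node 102 are connected by a bus inside the physical node.
In another example, some parameter nodes 101 and some training nodes 102 are deployed on the same physical nodes, with one parameter node 101 and one training node 102 per physical node, while the other parameter nodes 101 and training nodes 102 are deployed on different physical nodes.
In all three examples above, any two training nodes 102 are connected by a wireless or wired network. Training cards of different training nodes 102 are optionally connected by an inter-card transmission network, that is, training cards of different training nodes 102 can communicate with each other directly.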
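The hardware structure of the devices in system architecture 100 is described below.
Referring to Figure 3, the device 300 shown there is an example of the hardware structure of a device in system architecture 100. Optionally, device 300 is configured as a parameter node 101, a training node 102, or a physical node. Device 300 is, for example, a host or a server.
Device 300 is optionally implemented with a general bus architecture. Device 300 includes at least one processor 301, a communication bus 302, a memory 303, and at least one network interface 304.
The processor 301 is, for example, a general-purpose central processing unit (CPU), a network processor (NP), a GPU, an NPU, a data processing unit (DPU), a microprocessor, or one or more integrated circuits for implementing the solutions of the present application. For example, the processor 301 includes an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD is, for example, a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The communication bus 302 transfers information between the above components. The communication bus 302 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, Figure 3 shows only one thick line, but this does not mean that there is only one bus or one type of bus.
The memory 303 is, for example, a read-only memory (ROM) or another type of static storage device that can store static information and instructions; or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions; or an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited thereto. The memory 303 may, for example, exist independently and be connected to the processor 301 through the communication bus 302. The memory 303 may also be integrated with the processor 301.
Optionally, the memory 303 stores data that device 300 obtains during the sparse parameter update process.
The network interface 304 uses any transceiver-like apparatus to communicate with other devices or communication networks. The network interface 304 includes a wired network interface and may also include a wireless network interface. The wired network interface may be, for example, an Ethernet interface, which may be an optical interface, an electrical interface, or a combination thereof. The wireless network interface may be a wireless local area network (WLAN) interface, a cellular network interface, or a combination thereof.
In a specific implementation, as an embodiment, the processor 301 may include one or more CPUs.
In a specific implementation, as an embodiment, device 300 may include multiple processors. Each of these processors may be a single-core processor (single-CPU) or a multi-core processor (multi-CPU). A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
In some embodiments, the memory 303 stores the program code for executing the solutions of the present application, and the processor 301 can execute the program code stored in the memory 303. That is, device 300 can implement the sparse parameter update method provided by the method embodiments through the processor 301 and the program code in the memory 303.
Before the sparse parameter update method is described, the concepts related to the storage of sparse parameters are described first.
Sparse parameters are stored in a distributed manner in the memory of multiple parameter nodes 101 or in their attached SSDs. Each sparse parameter has a corresponding feature, which may also be called an index. For any sparse parameter, its feature can be used to locate both the parameter node 101 that stores it and the sparse parameter itself.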
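As a rough illustration of how a feature can locate a parameter node, the sketch below maps a feature string to a node index with a stable hash followed by a modulo. The source only states that the feature locates the node; the hash-and-modulo rule, the function names, and the example feature string are assumptions made for this sketch.

```python
import hashlib


def stable_hash(feature: str) -> int:
    """Deterministic 64-bit hash of a feature (unlike Python's salted hash())."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def locate_parameter_node(feature: str, num_parameter_nodes: int) -> int:
    """Return the index of the parameter node assumed to store this feature."""
    return stable_hash(feature) % num_parameter_nodes


# Hypothetical feature; any worker computing this gets the same node index.
node_index = locate_parameter_node("user_id:42", num_parameter_nodes=4)
```

A deterministic hash matters here because every training node and training card must agree on where a given sparse parameter lives.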
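The flow of the sparse parameter update method is described below. The method is applied to an artificial intelligence model training system that includes a first parameter node, a first training node, and a second training node. Only one parameter node and two training nodes are shown here; in practice the system may include multiple parameter nodes and multiple training nodes. The first parameter node stores sparse parameters, and the first training node and the second training node update sparse parameters. The first training node includes a first training card, and the second training node includes a second training card.
Specifically, the first training node obtains sparse parameters from the parameter nodes. The first training node trains on the data to be trained using the obtained sparse parameters and obtains the gradients corresponding to those sparse parameters. The first training node then determines which gradients are distributed to which training card in the system and sends the gradients and the corresponding sparse parameters to the corresponding training cards. Each training card updates the received sparse parameters using the received gradients and then, according to the features of the sparse parameters, sends the updated sparse parameters to the corresponding parameter nodes.
For example, referring to Figure 4, the first training node obtains a first parameter set from the first parameter node. The first parameter set includes multiple parameters, all of which are sparse parameters. The first training node trains on the data to be trained using the multiple parameters and obtains a first gradient set, which includes those gradients, among the multiple gradients corresponding to the multiple parameters, that are distributed to the second training card. The first training node sends the first gradient set and the parameters corresponding to the gradients in the first gradient set to the second training card. The second training card uses the gradients in the first gradient set to update the corresponding parameters and obtains the updated parameters. The second training card sends the updated parameters to the first parameter node, which replaces the pre-update parameters with the updated ones.
With the flow shown in Figure 4, training cards perform the sparse parameter updates, which makes the updates more efficient. In addition, a training node partitions the sparse parameters it has obtained and transmits them to other training cards, so each sparse parameter and its gradient are transmitted to only one training card. No data is transmitted repeatedly, which also saves network transmission resources.
It should be noted that Figure 4 uses the first training node and the second training node as an example; other training nodes update sparse parameters in the same way, which is not repeated here. Figure 4 also shows only one sparse parameter update; every sparse parameter update follows the flow shown in Figure 4.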
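The Figure 4 flow can be sketched end to end in a few lines. This is a toy, single-process illustration rather than the patented implementation: the dictionary standing in for the first parameter node, the placeholder gradient function, the ownership rule for the second training card, and the plain gradient-descent step are all assumptions introduced for the example.

```python
# Toy stand-in for the first parameter node: feature -> sparse parameter value.
parameter_node = {"f1": 1.0, "f2": 2.0, "f3": 3.0}


def fetch(features):
    """First training node pulls the first parameter set."""
    return {f: parameter_node[f] for f in features}


def placeholder_gradients(params):
    """Stand-in for real training; returns one gradient per parameter."""
    return {f: 0.5 * value for f, value in params.items()}


# Hypothetical ownership rule: the second training card updates f2 and f3.
SECOND_CARD_FEATURES = {"f2", "f3"}

first_parameter_set = fetch(["f1", "f2", "f3"])
gradients = placeholder_gradients(first_parameter_set)

# First gradient set: only the gradients destined for the second training card.
first_gradient_set = {f: g for f, g in gradients.items() if f in SECOND_CARD_FEATURES}
# The matching parameters travel together with the gradients.
sent_parameters = {f: first_parameter_set[f] for f in first_gradient_set}

# Second training card applies a plain gradient-descent step (assumed optimizer).
LEARNING_RATE = 0.1
updated = {f: sent_parameters[f] - LEARNING_RATE * first_gradient_set[f]
           for f in first_gradient_set}

# Second training card writes the updated values back to the parameter node.
parameter_node.update(updated)
```

Note that `f1` never reaches the second training card: each parameter and its gradient go to exactly one card, which is where the saving in transmitted data comes from.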
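In the flow shown in Figure 4, the first training card may be the main training card of the first training node or any training card of the first training node, and the second training card may be the main training card of the second training node or any training card of the second training node. A main training card is the training card in its training node that is used to update sparse parameters. The sparse parameter update process in each of these two cases is described below.
In the first case, the first training node includes a first training card and a third training card, both of which take part in training. The first training card is the main training card of the first training node, and there are one or more third training cards. Each training card taking part in training on the first training node runs part of the training model, and the parts running on all participating training cards together form a complete training model. Once trained, the complete model can be used for inference; it may be, for example, a recommendation model or a natural language processing model, and the embodiments of the present application do not limit its specific structure. Likewise, the second training node includes a second training card and a fourth training card, both of which take part in training. The second training card is the main training card of the second training node, and there are one or more fourth training cards. Each training card taking part in training on the second training node runs part of the training model, and the parts running on all participating training cards together form a complete training model that is the same as the one on the first training node. For the sparse parameter update flow, see steps 501 to 510 in Figure 5.
Step 501: the first training node obtains the data to be trained for the current training round.
In this embodiment, the data to be trained may be stored in a distributed manner on the parameter nodes or on other servers, and may also be called training data. The data that the first training node obtains for its current round is called one batch of data. One batch is the total of the data to be trained used by the training cards taking part in training on the first training node, and the amount of data per round can be preset. The flow in Figure 5 is described using the case where the data to be trained is stored on the parameter nodes.
Step 502: the first training node computes the features of the sparse parameters required by the data to be trained and obtains those sparse parameters from the corresponding parameter nodes based on the features.
The sparse parameters required by the data to be trained are the parameters used to convert that data into fixed-dimension vectors.
In this embodiment, a processor of the first training node (such as a CPU) computes the features of the sparse parameters required by the batch. Based on these features, it looks up the corresponding parameter nodes and obtains the sparse parameters from them.
In step 502, the system architecture includes a first parameter node and N second parameter nodes, where N is an integer greater than or equal to 1. The sparse parameters that the first training node obtains from the first parameter node form the first parameter set, and the sparse parameters it obtains from each of the N second parameter nodes form a second parameter set. The first training node thus obtains one first parameter set and N second parameter sets.
In one example, different pieces of data to be trained may correspond to the same sparse parameter, so the sparse parameters can be deduplicated before they are fetched from the parameter nodes, which saves transmission resources.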
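The deduplication mentioned above can be sketched as follows. The batch layout (a list of samples, each a list of feature strings) and the function name are assumptions made for the illustration.

```python
def unique_features(batch):
    """Collect each feature once, preserving first-seen order."""
    seen = {}
    for sample in batch:
        for feature in sample:
            seen.setdefault(feature, None)  # dicts keep insertion order
    return list(seen)


# Hypothetical batch: three samples that share some features.
batch = [
    ["user:1", "item:7"],
    ["user:2", "item:7"],
    ["user:1", "item:9"],
]
needed = unique_features(batch)  # only these features are fetched
```

Six lookups collapse to four fetches here; with real recommendation data, where popular items recur across many samples, the reduction can be much larger.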
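Step 503: the first training node transmits the data to be trained, the sparse parameters, and the features of the sparse parameters to each of its training cards.
In this embodiment, a processor of the first training node (such as a CPU) broadcasts the batch data, the sparse parameters, and the features of the sparse parameters to each training card of the first training node. Alternatively, the first training node transmits the batch data, the sparse parameters, and the features of the sparse parameters to the training cards through a scatter operation and an all reduce operation. Here, “each training card” means each training card taking part in training on the first training node.
Transmission through the scatter operation and the all reduce operation works as follows. Assume the first training node has P training cards. The first training node divides the batch data, the sparse parameters, and the features of the sparse parameters into P parts each, and sends one part of the batch data, one part of the sparse parameters, and one part of the features to each training card, with different training cards receiving different data. The training cards then obtain the complete batch data, sparse parameters, and features through the all reduce operation.
Step 504: each training card in the first training node uses its data to be trained to train the sparse parameters corresponding to that data and obtains the gradients corresponding to those sparse parameters.
In this embodiment, each training card in the first training node takes its own mini-batch from the batch data, then uses the mini-batch to train the corresponding sparse parameters and obtains their gradients.
The mini-batches of different training cards are not identical; they may overlap partly or not at all.
Step 505: the first training card aggregates the gradient data obtained by the multiple training cards of the first training node after training.
In this embodiment, the first training card obtains the sparse parameters and corresponding gradients on the third training card, which is a training card of the first training node, other than the first training card, that takes part in training. For example, the third training card obtains the gradients of its own sparse parameters through training and sends the gradients, the features of the sparse parameters, and the sparse parameters to the first training card. As another example, the first training card and the third training card perform an all reduce sum operation so that the first training card obtains the features, the sparse parameters, and the corresponding gradients on the third training card. Through the all reduce sum operation, the first training card obtains, for each sparse parameter, the sum of its gradients over all training cards taking part in training on the first training node. Because the gradients that correspond to the same sparse parameter on the first training node are added together, the amount of data that the first training card transmits to the main training cards of other training nodes is reduced.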
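The effect of the all reduce sum in step 505 can be illustrated with a minimal single-process sketch. It does not use a real collective-communication library: each training card's gradients are modelled as a dictionary from feature to a scalar gradient, and both that representation and the function name are assumptions for the example.

```python
from collections import defaultdict


def reduce_sum_by_feature(per_card_gradients):
    """Sum the gradients of identical sparse parameters across cards."""
    total = defaultdict(float)
    for card_gradients in per_card_gradients:
        for feature, gradient in card_gradients.items():
            total[feature] += gradient
    return dict(total)


# Hypothetical gradients on the first and third training cards.
card_1 = {"a": 1.0, "b": 2.0}
card_3 = {"a": 0.5, "c": 3.0}
aggregated = reduce_sum_by_feature([card_1, card_3])
```

Feature `a` appears on both cards but leaves the node only once, which is the reduction in cross-node traffic that the step describes.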
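Step 506: the first training card partitions the aggregated gradient data by training node.
In this embodiment, the first training card determines the training nodes taking part in training in the system architecture and uses the features of the sparse parameters corresponding to the aggregated gradient data to determine the training node to which each sparse parameter is distributed. Because gradients and sparse parameters correspond one to one, determining the training node for a sparse parameter also determines the training node for its gradient.
In one example, the first training card computes the hash value of the feature of each sparse parameter and uses these hash values to determine the training node to which each sparse parameter is distributed. For example, each training node has an index, and the hash value of each training node's index is determined. For any sparse parameter, the hash value of its feature is determined, the training node whose hash value is closest to that of the feature is selected, and the selected training node is the one to which the sparse parameter is distributed. Because each training node has only one main training card, determining the training node also determines the main training card.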
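The closest-hash rule in step 506 might look like the sketch below. The choice of SHA-256, the absolute-difference notion of “closest”, and the node identifiers are assumptions; the source does not specify the hash function or the distance measure.

```python
import hashlib


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash shared by all training cards."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


def pick_training_node(feature: str, node_ids):
    """Choose the node whose index hash is closest to the feature hash."""
    feature_hash = stable_hash(feature)
    return min(node_ids, key=lambda node: abs(stable_hash(node) - feature_hash))


def split_by_training_node(gradients, node_ids):
    """Partition feature -> gradient into one gradient set per training node."""
    shards = {node: {} for node in node_ids}
    for feature, gradient in gradients.items():
        shards[pick_training_node(feature, node_ids)][feature] = gradient
    return shards


# Hypothetical node identifiers and aggregated gradients.
nodes = ["worker-0", "worker-1"]
shards = split_by_training_node({"a": 1.5, "b": 2.0, "c": 3.0}, nodes)
```

Because the rule depends only on the feature and the node list, every main training card sends gradients for the same sparse parameter to the same destination without any coordination.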
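In the flow shown in Figure 5, the gradients determined in this way to be distributed to the main training card of the second training node (that is, the second training card) form the first gradient set. The first gradient set includes those gradients, among the multiple gradients corresponding to the multiple sparse parameters of the first parameter set, that are distributed to the second training card, and those gradients, among the multiple gradients corresponding to the multiple sparse parameters of the N second parameter sets, that are distributed to the second training card.
Step 507: the first training node sends the first gradient set and the sparse parameters corresponding to the gradients in the first gradient set to the second training card.
In this embodiment, when there is no inter-card transmission network between the first training card and the second training card, the first training node uses the network between the first training node and the second training node to transmit the first gradient set and the corresponding sparse parameters to the second training node, which then passes the received sparse parameters and gradients down to its main training card.
When there is an inter-card transmission network between the first training card and the second training card, the first gradient set and the corresponding sparse parameters are distributed directly to the second training card over the inter-card transmission network.
In one example, the first training card may also send the features of the sparse parameters to the second training card, so that the features can later be used to write the updated sparse parameters back to the parameter nodes.
When the first training card does not send the features of the sparse parameters to the second training card, the second training card can use the sparse parameters to look up their features in a table.
Step 508: the second training card uses the gradients in the first gradient set to update the sparse parameters corresponding to those gradients.
In this embodiment, for any sparse parameter, the second training card adds up all gradients corresponding to that sparse parameter, computes the average, and takes the average as the gradient aggregation result for that sparse parameter. The second training card then uses this gradient aggregation result and the gradient descent method to iteratively adjust the sparse parameter in the direction opposite to the gradient, obtaining the updated sparse parameter.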
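Step 508 amounts to averaging the received gradients and taking one gradient-descent step. The sketch below shows this for a single sparse parameter represented as a list of floats; the function name, the learning rate, and the example values are assumptions for the illustration.

```python
def aggregate_and_update(parameter, gradients, learning_rate):
    """Average the gradients for one sparse parameter, then step against them."""
    count = len(gradients)
    mean_gradient = [
        sum(gradient[i] for gradient in gradients) / count
        for i in range(len(parameter))
    ]
    # Gradient descent: move opposite to the aggregated gradient.
    return [p - learning_rate * g for p, g in zip(parameter, mean_gradient)]


# One embedding vector and the gradients received for it from two training nodes.
new_value = aggregate_and_update(
    parameter=[1.0, 2.0],
    gradients=[[0.5, 1.0], [1.5, 0.0]],
    learning_rate=0.5,
)
```

Here the mean gradient is `[1.0, 0.5]`, so the parameter moves from `[1.0, 2.0]` to `[0.5, 1.75]`.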
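Step 509: the second training card sends the updated sparse parameters to the corresponding parameter nodes.
In this embodiment, the second training card determines an index value from the feature of each sparse parameter, maps the index value to a parameter node, and sends the updated sparse parameter to that parameter node.
Step 510: the parameter node stores the received sparse parameters.
In the flow shown in Figure 5, the second training card of the second training node, like the first training node, also sends to the first training card the sparse parameters and gradients of the second training node that are distributed to the first training card. The second training card obtains the sparse parameters and corresponding gradients of all training cards on the second training node (the second training card and the fourth training card). The first training node receives the sparse parameters and corresponding gradients that the second training card sends to the first training card. The first training card aggregates the gradients corresponding to the current sparse parameters on the first training card to obtain their gradient aggregation results and updates the current sparse parameters based on these results. The current sparse parameters include the sparse parameters that the first training card distributed to itself and the sparse parameters received from the second training card.
In this embodiment, when there is no inter-card transmission network between the first training card and the second training card, the first training node uses the network between the first training node and the second training node to receive the sparse parameters and corresponding gradients that the second training card of the second training node sends to the first training card. The first training node then passes the received sparse parameters and gradients down to the first training card.
When there is an inter-card transmission network between the first training card and the second training card, the first training card receives the sparse parameters and corresponding gradients sent to it by the second training card over the inter-card transmission network.
After the first training card obtains the sparse parameters and corresponding gradients sent to it by the second training card, the current sparse parameters on the first training card include the sparse parameters it distributed to itself and those received from the second training card.
For any one of the current sparse parameters, the first training card aggregates the gradients corresponding to that sparse parameter to obtain its gradient aggregation result and uses the result to update the sparse parameter, obtaining the updated sparse parameter.
The first training card determines an index value from the feature of each sparse parameter, maps the index value to a parameter node, and writes the updated sparse parameter back to that parameter node.
In one example, the second training card also sends the features of the sparse parameters to the first training card, and the first training card receives them, so that the features can later be used to write the updated sparse parameters back to the parameter nodes.
When the second training card does not send the features of the sparse parameters to the first training card, the first training card can use the sparse parameters to look up their features in a table.
It should be noted that, in the flow shown in Figure 5, when a parameter node and a training node are deployed on the same physical node, a sparse parameter is distributed to the main training card of the physical node that stores it in order to reduce the amount of data transmitted between physical nodes, and the index of a training node mentioned above is the index of the physical node to which it belongs. In this way, after the main training card updates the sparse parameters, it stores the updated sparse parameters to the corresponding parameter node directly over the internal peripheral component interconnect express (PCIe) bus, without transmitting them over the network between physical nodes.
When a parameter node and a training node are not deployed on the same physical node, the training node to which a sparse parameter is distributed is determined from the indexes of the training nodes and the feature of the sparse parameter. After the main training card updates the sparse parameters, the training node stores the updated sparse parameters to the corresponding parameter node over the network between it and the parameter node.
In the flow shown in Figure 5, gradients are partitioned by training node. In another example, step 506 may instead partition by parameter node, ensuring that sparse parameters stored on the same parameter node are assigned to the same training node. For example, a mapping between parameter nodes and training nodes can be stored in advance. The main training card determines an index value from the feature of a sparse parameter; different index values correspond to different parameter nodes, and one index value corresponds to only one parameter node, so an index value identifies a parameter node, and the mapping then gives the training node corresponding to that parameter node. For the flow shown in Figure 5, this can be regarded as placing the gradients corresponding to sparse parameters that have a first index value into the first gradient set, where the first index value corresponds to a particular parameter node.
With this scheme, after the main training card updates the sparse parameters, it can determine the corresponding parameter node from the feature of a single sparse parameter, without having to determine it from the feature of every sparse parameter.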
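Partitioning by parameter node with a pre-stored mapping can be sketched as follows. Integer feature IDs, the modulo rule for the index value, and the mapping contents are assumptions chosen to keep the example small; the source only requires that one index value identify one parameter node.

```python
def index_value(feature_id: int, num_parameter_nodes: int) -> int:
    """Index value derived from the feature; one value per parameter node."""
    return feature_id % num_parameter_nodes


# Hypothetical pre-stored mapping: parameter node index -> training node.
PARAM_NODE_TO_TRAINING_NODE = {0: "worker-0", 1: "worker-1"}


def split_by_parameter_node(gradients, mapping):
    """Group gradients so that each training node gets one parameter node's data."""
    gradient_sets = {training_node: {} for training_node in mapping.values()}
    for feature_id, gradient in gradients.items():
        parameter_node = index_value(feature_id, len(mapping))
        gradient_sets[mapping[parameter_node]][feature_id] = gradient
    return gradient_sets


gradient_sets = split_by_parameter_node(
    {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}, PARAM_NODE_TO_TRAINING_NODE
)
```

Everything that lands on `worker-0` belongs to parameter node 0, so after updating, that card can resolve the destination once from any single feature and write the whole set back together.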
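For a better understanding of the flow shown in Figure 5, the following description uses the case where one parameter node and one training node are deployed on the same physical node.
Referring to Figure 6, system architecture 100 includes two physical nodes, physical node 1 and physical node 2, each with two training cards installed. The two training cards on physical node 1 are training card 1 and training card 2, with training card 1 as the main training card. The two training cards on physical node 2 are training card 3 and training card 4, with training card 3 as the main training card. The training cards may be NPUs.
Step 601: the training node in physical node 1 obtains the data to be trained for the current round, called one batch of data. The CPU of physical node 1 computes the features of the sparse parameters required by the batch and obtains those sparse parameters from the corresponding parameter nodes based on the features.
The sparse parameters are denoted W1 and are part of the full set of sparse parameters. Assume W1 includes three subsets A, B, and C.
Step 602: physical node 1 transmits the batch data, W1, and the features of W1 to the two training cards through a scatter operation and an all reduce operation.
Step 603: the two training cards in physical node 1 each use their mini-batch to train the sparse parameters corresponding to that mini-batch and obtain the corresponding gradients. The sparse parameters corresponding to the mini-batch of training card 1 are denoted as subsets A1, B1, and C1, and the gradients obtained as A11, B11, and C11. The sparse parameters corresponding to the mini-batch of training card 2 are denoted as subsets A2, B2, and C2, and the gradients obtained as A22, B22, and C22.
Step 604: the main training card in physical node 1 (training card 1) obtains the sparse parameters and corresponding gradients on training card 2 in physical node 1. The sparse parameters and corresponding gradients now on physical node 1 are denoted as subsets A33, B33, and C33, each of which includes multiple sparse parameters and their gradients.
Step 605: the main training card in physical node 1 determines shards A44 and B44 according to the features of the sparse parameters and the indexes of the training nodes. The main training cards corresponding to A44 and B44 are the main training card of physical node 1 and the main training card of physical node 2, respectively.
Step 606: the main training card of physical node 1 sends shard B44 and the features of the sparse parameters in shard B44 to the main training card of physical node 2.
Step 607: the main training card of physical node 1 receives shard C44 and the features of the sparse parameters in shard C44 from the main training card of physical node 2.
Step 608: the main training card of physical node 1 aggregates the gradients in A44 and C44 to obtain gradient aggregation result A55 and, based on A55, updates the sparse parameters in A44 and C44.
Step 609: the main training card of physical node 1 writes the updated sparse parameters back to the corresponding parameter nodes.
In Figure 6, physical node 2 is processed in a similar way to physical node 1. The sparse parameters obtained by physical node 2 are denoted W2 and include three subsets D, E, and F. The sparse parameters corresponding to the mini-batch of training card 3 are denoted as subsets D1, E1, and F1, and the gradients obtained as subsets D11, E11, and F11. The sparse parameters corresponding to the mini-batch of training card 4 are denoted as subsets D2, E2, and F2, and the gradients obtained as subsets D22, E22, and F22. After the sparse parameters and corresponding gradients on training card 3 and training card 4 in physical node 2 are merged, they are denoted as subsets D33, E33, and F33, each of which includes multiple sparse parameters and their gradients. Training card 3 divides subsets D33, E33, and F33 into shards C44 and D44 according to the features of the sparse parameters they contain, and sends C44 and the features of the sparse parameters in C44 to training card 1 of physical node 1. Training card 3 of physical node 2 aggregates the gradients in B44 and D44 to obtain gradient aggregation result B55 and, based on B55, updates the sparse parameters.
With the solution shown in Figure 5, training cards perform gradient aggregation and parameter updates. Compared with the traditional approach in which parameter nodes update the parameters, this reduces CPU usage and physical node memory usage and makes gradient aggregation and parameter updates more efficient. In addition, when an inter-card transmission network exists, data transmission relies less on the host network, which further relieves the host network bottleneck.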
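In the second case, the complete training model is deployed on every training card taking part in training on the first training node and the second training node. For the sparse parameter update flow, see steps 701 to 709 in Figure 7, which is described using the interaction between the first training node and the second training node of the system as an example.
Step 701: the first training node obtains the data to be trained for the current round for each of its training cards.
In this embodiment, the data to be trained may be stored in a distributed manner on the parameter nodes or on other servers, and may also be called training data. The first training node obtains the data to be trained for the current round for each training card. The data for one training card in one round is called one mini-batch, and the amount of data per round can be preset.
Step 702: the first training node computes the features of the sparse parameters required by each training card's data for the current round and obtains those sparse parameters from the corresponding parameter nodes based on the features.
In this embodiment, a processor of the first training node (such as a CPU) computes the features of the required sparse parameters. Based on these features, it looks up the corresponding parameter nodes and obtains the sparse parameters from them.
In one example, the data to be trained of different training cards may correspond to the same sparse parameter, so the sparse parameters can be deduplicated before they are fetched from the parameter nodes, which saves transmission resources.
Step 703: the first training node transmits each training card's data to be trained, sparse parameters, and features of the sparse parameters to that training card.
In this embodiment, the processor of the first training node transmits each training card's data to be trained, sparse parameters, and features of the sparse parameters to that training card.
Step 704: each training card in the first training node uses its data to be trained to train the sparse parameters corresponding to that data and obtains the gradients of those sparse parameters.
Step 705: each training card in the first training node determines, over the training cards in the system, the training card of the training node to which each of its sparse parameters is distributed.
In this embodiment, each training card in the first training node determines the training cards of all training nodes and uses the hash values of the features of its own sparse parameters to determine the training card to which each of its sparse parameters is distributed.
In one example, any training card in the first training node determines the hash values of the features of its own sparse parameters. The training cards of each training node have indexes, and the hash value of each training card's index is determined. For any sparse parameter, the training card whose hash value is closest to the hash value of the sparse parameter's feature is selected, and the selected training card is the one to which the sparse parameter is distributed.
In the flow shown in Figure 7, the gradients determined in this way to be distributed to the second training card of the second training node form the first gradient set. The first gradient set includes those gradients, among the multiple gradients corresponding to the multiple sparse parameters of the first parameter set, that are distributed to the second training card, and those gradients, within the gradient data of the first training card after training, that correspond to sparse parameters of the N second parameter sets and are distributed to the second training card.
Step 706: each training card in the first training node distributes its sparse parameters and corresponding gradients to the corresponding training cards of the training nodes.
In this embodiment, Figure 7 uses sending to the second training card as an example. When there is no inter-card transmission network between the training cards of the first training node and the second training card, the first training node uses the network between the first training node and the second training node to distribute the sparse parameters and corresponding gradients to the second training node, which then passes them down to the second training card. In this case, the identifier of the destination training card is sent along with the sparse parameters and gradients to indicate which training card they are intended for.
When there is an inter-card transmission network between the training cards of the first training node and the second training card, the sparse parameters and corresponding gradients are distributed to the second training card over the inter-card transmission network.
Step 706 also involves training cards within the same training node transmitting gradients and sparse parameters to each other, for which a bus can be used.
In one example, each training card may also send the features of the sparse parameters to the second training card, so that the features can later be used to write the updated sparse parameters back to the parameter nodes.
When the training cards do not send the features of the sparse parameters to the second training card, the second training card can use the sparse parameters to look up their features in a table.
Step 707: the second training card uses the gradients in the first gradient set to update the sparse parameters corresponding to those gradients.
Step 708: the second training card sends the updated sparse parameters to the corresponding parameter nodes.
For details of steps 707 and 708, see the flow shown in Figure 5; they are not described again here.
Step 709: the parameter node stores the received sparse parameters.
In the flow shown in Figure 7, each training card of the first training node receives the sparse parameters and corresponding gradients sent to it by other training cards. For the first training card, the other training cards include the training cards of the first training node other than the first training card and the training cards of other training nodes, including the second training card. Each training card in the first training node aggregates the gradients corresponding to its current sparse parameters to obtain their gradient aggregation results and updates the current sparse parameters based on these results. The current sparse parameters include the part of its original sparse parameters that it kept and the sparse parameters it received. Each training card in the first training node stores the updated sparse parameters to the corresponding parameter nodes.
In the flow shown in Figure 7, gradients are partitioned by training card. In another example, step 705 may instead partition by parameter node, ensuring that sparse parameters stored on the same parameter node are assigned to the same training card. For example, a mapping between parameter nodes and training cards can be stored in advance, in which a parameter node may correspond to one or more training cards. Any training card determines an index value from the feature of a sparse parameter; the index value identifies a parameter node, and the mapping then gives the training card corresponding to that parameter node. In this way, after a training card updates the sparse parameters, it can determine the corresponding parameter node from the feature of a single sparse parameter, without having to determine it from the feature of every sparse parameter.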
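For a better understanding of the flow shown in Figure 7, the following description uses the case where one parameter node and one training node are deployed on the same physical node.
Referring to Figure 8, system architecture 100 includes two physical nodes, physical node 1 and physical node 2. Physical node 1 has two training cards installed, training card 1 and training card 2, and physical node 2 has one training card installed, training card 3. The training cards may be NPUs.
Step 801: the training node in physical node 1 obtains the data to be trained for the current round for each training card; the data for each training card is called one mini-batch. The CPU of physical node 1 computes the features of the sparse parameters required by each training card's data and obtains those sparse parameters from the corresponding parameter nodes based on the features.
The sparse parameters used by training card 1 are denoted W11 and are part of the full set of sparse parameters. Assume W11 includes three subsets G, H, and I. The sparse parameters used by training card 2 are denoted W12 and are also part of the full set of sparse parameters. Assume W12 includes three subsets M, N, and O.
Step 802: physical node 1 transmits each training card's mini-batch, sparse parameters, and features of the sparse parameters to the two training cards.
Step 803: each training card in physical node 1 uses its mini-batch to train the sparse parameters corresponding to that mini-batch and obtains the corresponding gradients. The sparse parameters and gradients on training card 1 are denoted as subsets G1, H1, and I1, and those on training card 2 as subsets M1, N1, and O1. Each subset includes sparse parameters and their gradients.
Step 804: training card 1 in physical node 1 determines, from the features of the sparse parameters in subsets G1, H1, and I1, that the training cards corresponding to G1, H1, and I1 are training card 1 in physical node 1, training card 2 in physical node 1, and training card 3 in physical node 2, respectively. Training card 2 in physical node 1 determines, from the features of the sparse parameters in subsets M1, N1, and O1, that the training cards corresponding to M1, N1, and O1 are training card 2 in physical node 1, training card 1 in physical node 1, and training card 3 in physical node 2, respectively.
Step 805: training card 1 in physical node 1 sends subset I1 to training card 3 in physical node 2 and subset H1 to training card 2 in physical node 1. Training card 2 in physical node 1 sends subset O1 to training card 3 in physical node 2 and subset N1 to training card 1 in physical node 1.
Step 806: training card 1 in physical node 1 receives subset N1 from training card 2 and subset J1 from training card 3 in physical node 2. Training card 2 in physical node 1 receives subset H1 from training card 1 and subset K1 from training card 3 in physical node 2.
Step 807: training card 1 in physical node 1 aggregates the gradients in subsets N1, G1, and J1 to obtain gradient aggregation result A66 and, based on A66, updates the sparse parameters in subsets N1, G1, and J1. Training card 2 in physical node 1 aggregates the gradients in subsets M1, H1, and K1 to obtain gradient aggregation result A77 and, based on A77, updates the sparse parameters in subsets M1, H1, and K1.
Step 808: training card 1 and training card 2 in physical node 1 write the updated sparse parameters back to the corresponding parameter nodes.
The first training card mentioned earlier is any training card in a physical node.
In Figure 8, physical node 2 is processed in a similar way to physical node 1. The sparse parameters obtained by physical node 2 are denoted W13 and include three subsets J, K, and L. After training card 3 in physical node 2 computes the gradients, it obtains subsets J1, K1, and L1, each of which includes sparse parameters and their gradients. Training card 3 in physical node 2 aggregates the gradients in subsets L1, I1, and O1 to obtain gradient aggregation result B66 and, based on B66, updates the sparse parameters in subsets L1, I1, and O1.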
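The exchange in Figure 8 is effectively an all-to-all: every training card splits its own gradients by destination card, and each card collects what is addressed to it. The single-process sketch below imitates that pattern with integer feature IDs and a modulo ownership rule; both are assumptions for the illustration, since the example in the text routes by feature hash.

```python
def owner_card(feature_id: int, num_cards: int) -> int:
    """Assumed rule for which training card updates a given sparse parameter."""
    return feature_id % num_cards


def all_to_all(per_card_gradients):
    """Route every gradient to its owner card; returns one inbox per card."""
    num_cards = len(per_card_gradients)
    inboxes = [{} for _ in range(num_cards)]  # feature_id -> list of gradients
    for card_gradients in per_card_gradients:
        for feature_id, gradient in card_gradients.items():
            destination = owner_card(feature_id, num_cards)
            inboxes[destination].setdefault(feature_id, []).append(gradient)
    return inboxes


# Hypothetical gradients computed by training cards 1, 2 and 3.
inboxes = all_to_all([
    {0: 1.0, 1: 2.0, 2: 3.0},
    {0: 3.0, 4: 1.0},
    {2: 1.0, 5: 2.0},
])
```

Each inbox then holds every gradient for the parameters that card owns, so the cards can aggregate and update in parallel with no further synchronization inside the node.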
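With the solution shown in Figure 8, training cards perform gradient aggregation and parameter updates. Compared with the traditional approach in which parameter nodes update the parameters, this reduces CPU usage and physical node memory usage and makes gradient aggregation and parameter updates more efficient. In addition, when an inter-card transmission network exists, data transmission relies less on the host network, which further relieves the host network bottleneck.
Moreover, the multiple training cards inside a training node do not need to synchronize sparse parameters within the node, so the parallelism of multiple training cards can be fully exploited. A training card can also transmit gradients directly once it has computed them, with no more complex operations required.
In the embodiments of the present application, assume the network data volume of the sparse parameters on a single training card is W_k. When sparse parameters are fetched, the upper bound on their total network data volume is N_worker*N_device_per_worker*W_k. When parameter nodes and training nodes are deployed on the same physical nodes, the network data volume of the sparse parameters during fetching is (N_worker-1)*N_device_per_worker*W_k/N_worker. Here N_worker is the number of training nodes and N_device_per_worker is the number of training cards in each training node.
When sparse parameters and their gradients are transmitted, the network data volume of a single training node is (N_worker-1)*(W_k+G_k)/N_worker, where N_worker is the number of training nodes and G_k is the gradient data of the sparse parameters on a single training node. Data transmitted inside a training node is not counted as network data volume. Without gradient compression, W_k equals G_k, so the network data volume of a single training node can also be written as 2*(N_worker-1)*G_k/N_worker, where the factor of 2 arises because the sparse parameters must be transmitted as well.
For training on one batch of data, when parameter nodes and training nodes are deployed on the same physical nodes, the network data volume is reduced to (N_worker-1)*N_device_per_worker*W_k/N_worker+(N_worker-1)*(W_k+G_k).
In addition, when parameter nodes and training nodes are not deployed on the same physical nodes, the network data volume for writing the sparse parameters back to the parameter nodes is (N_worker-1)*G_k/N_worker.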
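The individual traffic expressions above can be evaluated for a concrete configuration. The functions below transcribe the fetch, exchange, and write-back formulas as they are stated in the text; the function names and the example configuration (4 training nodes, 8 cards each, 100 units of data) are assumptions for illustration.

```python
def fetch_upper_bound(n_worker, n_device_per_worker, w_k):
    """Upper bound on total traffic when fetching sparse parameters."""
    return n_worker * n_device_per_worker * w_k


def fetch_colocated(n_worker, n_device_per_worker, w_k):
    """Fetch traffic when parameter and training nodes share physical nodes."""
    return (n_worker - 1) * n_device_per_worker * w_k / n_worker


def exchange_per_node(n_worker, w_k, g_k):
    """Per-node traffic for sending sparse parameters and their gradients."""
    return (n_worker - 1) * (w_k + g_k) / n_worker


def write_back(n_worker, g_k):
    """Write-back traffic when the nodes are not co-located."""
    return (n_worker - 1) * g_k / n_worker


# Hypothetical configuration; W_k equals G_k because no gradient compression is assumed.
N_WORKER, N_DEVICE, W_K, G_K = 4, 8, 100, 100
upper = fetch_upper_bound(N_WORKER, N_DEVICE, W_K)
colocated = fetch_colocated(N_WORKER, N_DEVICE, W_K)
exchange = exchange_per_node(N_WORKER, W_K, G_K)
written = write_back(N_WORKER, G_K)
```

With these numbers, co-location cuts the fetch traffic from at most 3200 units to 600, and the per-node exchange of 150 units equals 2*(N_worker-1)*G_k/N_worker, as the text notes for the uncompressed case.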
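In one example, an embodiment of the present application provides a computer program product that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the part of the method performed by the first training node in the flow shown in Figure 4.
A person of ordinary skill in the art will appreciate that the method steps and units described in connection with the embodiments disclosed in the present application can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the steps and components of the embodiments have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. A person of ordinary skill in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system architecture, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into modules is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or modules, and may be electrical, mechanical, or of another form.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
In addition, the modules in the embodiments of the present application may be integrated into one processing module, each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or as software modules.
If an integrated module is implemented as a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the part of the technical solution of the present application that is essential or that contributes to the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods in the embodiments of the present application. The aforementioned storage media include a USB flash drive, a removable hard disk, ROM, RAM, a magnetic disk, an optical disc, and other media that can store program code.
In the present application, terms such as “first” and “second” are used to distinguish identical or similar items that have substantially the same role and function. It should be understood that “first” and “second” imply no logical or temporal dependency and place no limit on quantity or execution order. It should also be understood that, although the following description uses the terms first, second, and so on to describe various elements, these elements should not be limited by the terms. The terms are only used to distinguish one element from another. For example, without departing from the scope of the various examples, a first sparse parameter may be referred to as a second sparse parameter, and similarly a second sparse parameter may be referred to as a first sparse parameter. Both the first sparse parameter and the second sparse parameter may be sparse parameters, and in some cases they may be separate and different sparse parameters.
In the present application, the term “at least one” means one or more, and the term “multiple” means two or more.
The above are only specific implementations of the present application, but the scope of protection of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.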

Claims (10)

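  1. A sparse parameter update method, characterized in that the method is applied to an artificial intelligence model training system, the system comprising a first parameter node, a first training node, and a second training node, the first training node comprising a first training card and the second training node comprising a second training card, the method comprising:
    obtaining, by the first training node, a first parameter set from the first parameter node, the first parameter set comprising multiple parameters;
    training, by the first training node, on data to be trained using the multiple parameters to obtain a first gradient set, the first gradient set comprising those gradients, among the multiple gradients corresponding to the multiple parameters of the first parameter set, that are distributed to the second training card;
    sending, by the first training node, the first gradient set and the parameters corresponding to the gradients in the first gradient set to the second training card;
    updating, by the second training card, the parameters corresponding to the gradients in the first gradient set according to the gradients in the first gradient set; and
    sending, by the second training card, the updated parameters to the first parameter node.
  2. The method according to claim 1, characterized in that the method further comprises:
    obtaining, by the first training node, one second parameter set from each of N second parameter nodes, each second parameter set comprising multiple parameters;
    wherein the training, by the first training node, on the data to be trained using the multiple parameters to obtain the first gradient set comprises:
    training, by multiple training cards in the first training node, on the data to be trained using the parameters of the first parameter set and of the N obtained second parameter sets, the multiple training cards comprising the first training card;
    aggregating, by the first training card, the gradient data obtained by the multiple training cards after training; and
    partitioning, by the first training card, the aggregated gradient data by parameter node to obtain the first gradient set corresponding to the first parameter node, or partitioning, by the first training card, the aggregated gradient data by training node to obtain the first gradient set corresponding to the second training node, the first gradient set corresponding to the second training node further comprising those gradients, among the multiple gradients corresponding to the multiple parameters of the N second parameter sets, that are distributed to the second training card.
  3. The method according to claim 1, characterized in that the method further comprises:
    obtaining, by the first training node, one second parameter set from each of N second parameter nodes, each second parameter set comprising multiple parameters;
    wherein the training, by the first training node, on the data to be trained using the multiple parameters to obtain the first gradient set comprises:
    training, by multiple training cards in the first training node, on the data to be trained using the parameters of the first parameter set and of the N obtained second parameter sets, the multiple training cards comprising the first training card; and
    partitioning, by the first training card, the gradient data of the first training card after training according to the training cards in the system to obtain the first gradient set corresponding to the second training card, the first gradient set further comprising those gradients, within the gradient data of the first training card after training, that correspond to parameters of the N second parameter sets and are distributed to the second training card.
  4. The method according to any one of claims 1 to 3, characterized in that the updating, by the second training card, of the parameters corresponding to the gradients in the first gradient set according to the gradients in the first gradient set comprises:
    aggregating, by the second training card, the gradients in multiple gradient sets received from multiple training nodes in the system, the multiple gradient sets comprising the first gradient set; and
    updating the parameters corresponding to the gradients in the first gradient set using the aggregated gradients.
  5. The method according to claim 2, characterized in that the partitioning, by the first training card, of the aggregated gradient data by parameter node to obtain the first gradient set corresponding to the first parameter node comprises:
    determining an index value according to the feature of the parameter corresponding to each piece of aggregated gradient data, one index value indicating one parameter node; and
    placing the gradients corresponding to parameters that have a first index value into the first gradient set, the first index value indicating the first parameter node.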
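  6. A first training node, characterized in that the first training node belongs to an artificial intelligence model training system, the system further comprising a first parameter node and a second training node, the first training node comprising a first training card and the second training node comprising a second training card;
    the first training node is configured to:
    obtain a first parameter set from the first parameter node, the first parameter set comprising multiple parameters;
    train on data to be trained using the multiple parameters to obtain a first gradient set, the first gradient set comprising those gradients, among the multiple gradients corresponding to the multiple parameters of the first parameter set, that are distributed to the second training card; and
    send the first gradient set and the parameters corresponding to the gradients in the first gradient set to the second training card, so that the second training card updates the parameters corresponding to the gradients in the first gradient set.
  7. The first training node according to claim 6, characterized in that the first training node is further configured to:
    obtain one second parameter set from each of N second parameter nodes, each second parameter set comprising multiple parameters;
    multiple training cards in the first training node are configured to:
    train on the data to be trained using the parameters of the first parameter set and of the N obtained second parameter sets, the multiple training cards comprising the first training card;
    the first training card is configured to:
    aggregate the gradient data obtained by the multiple training cards after training; and
    partition the aggregated gradient data by parameter node to obtain the first gradient set corresponding to the first parameter node, or partition the aggregated gradient data by training node to obtain the first gradient set corresponding to the second training node, the first gradient set further comprising those gradients, among the multiple gradients corresponding to the multiple parameters of the N second parameter sets, that are distributed to the second training card.
  8. The first training node according to claim 6, characterized in that the first training node is further configured to:
    obtain one second parameter set from each of N second parameter nodes, each second parameter set comprising multiple parameters;
    multiple training cards in the first training node are configured to:
    train on the data to be trained using the parameters of the first parameter set and of the N obtained second parameter sets, the multiple training cards comprising the first training card;
    the first training card is configured to:
    partition the gradient data of the first training card after training according to the training cards in the system to obtain the first gradient set corresponding to the second training card, the first gradient set further comprising those gradients, within the gradient data of the first training card after training, that correspond to parameters of the N second parameter sets and are distributed to the second training card.
  9. A computer device, characterized in that the computer device comprises a memory and a processor;
    the memory stores computer instructions; and
    the processor is configured to execute the computer instructions so that the computer device performs the method performed by a training node in any one of claims 1 to 5.
  10. A computer-readable storage medium, characterized in that the storage medium stores at least one computer instruction, and the computer instruction is read by a processor so that a computer device performs the method performed by a training node in any one of claims 1 to 5.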
PCT/CN2023/095266 2022-05-19 2023-05-19 稀疏参数的更新方法、训练节点、设备和存储介质 WO2023222113A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210555107.7A CN117151184A (zh) 2022-05-19 2022-05-19 稀疏参数的更新方法、训练节点、设备和存储介质
CN202210555107.7 2022-05-19

Publications (1)

Publication Number Publication Date
WO2023222113A1 true WO2023222113A1 (zh) 2023-11-23

Family

ID=88834717

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/095266 WO2023222113A1 (zh) 2022-05-19 2023-05-19 稀疏参数的更新方法、训练节点、设备和存储介质

Country Status (2)

Country Link
CN (1) CN117151184A (zh)
WO (1) WO2023222113A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034385A (zh) * 2017-06-12 2018-12-18 辉达公司 用稀疏数据训练神经网络的系统和方法
CN112508190A (zh) * 2020-12-10 2021-03-16 上海燧原科技有限公司 结构化稀疏参数的处理方法、装置、设备及存储介质
CN113642734A (zh) * 2020-05-11 2021-11-12 阿里巴巴集团控股有限公司 一种深度学习模型的分布式训练方法、装置以及计算设备
CN113660113A (zh) * 2021-07-27 2021-11-16 上海大学 面向分布式机器学习的自适应稀疏参数模型设计与量化传输方法
US20210374544A1 (en) * 2019-04-11 2021-12-02 Huawei Technologies Co.,Ltd. Leveraging lagging gradients in machine-learning model training
CN113971428A (zh) * 2020-07-24 2022-01-25 北京达佳互联信息技术有限公司 数据处理方法、系统、设备、程序产品及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034385A (zh) * 2017-06-12 2018-12-18 辉达公司 用稀疏数据训练神经网络的系统和方法
US20210374544A1 (en) * 2019-04-11 2021-12-02 Huawei Technologies Co.,Ltd. Leveraging lagging gradients in machine-learning model training
CN113642734A (zh) * 2020-05-11 2021-11-12 阿里巴巴集团控股有限公司 一种深度学习模型的分布式训练方法、装置以及计算设备
CN113971428A (zh) * 2020-07-24 2022-01-25 北京达佳互联信息技术有限公司 数据处理方法、系统、设备、程序产品及存储介质
CN112508190A (zh) * 2020-12-10 2021-03-16 上海燧原科技有限公司 结构化稀疏参数的处理方法、装置、设备及存储介质
CN113660113A (zh) * 2021-07-27 2021-11-16 上海大学 面向分布式机器学习的自适应稀疏参数模型设计与量化传输方法

Also Published As

Publication number Publication date
CN117151184A (zh) 2023-12-01

Similar Documents

Publication Publication Date Title
US10728091B2 (en) Topology-aware provisioning of hardware accelerator resources in a distributed environment
CN110134636B (zh) 模型训练方法、服务器和计算机可读存储介质
CN110262901B (zh) 一种数据处理方法及数据处理系统
CN114861911B (zh) 深度学习模型的训练方法、装置、系统、设备和介质
WO2020228378A1 (zh) 一种确定数据库的配置参数的方法及装置
US20170091668A1 (en) System and method for network bandwidth aware distributed learning
WO2023093355A1 (zh) 针对分布式图学习的数据融合方法及装置
CN113435682A (zh) 分布式训练的梯度压缩
CN112235344B (zh) 一种面向分布式机器学习的稀疏通信模型的实现方法
US11886225B2 (en) Message processing method and apparatus in distributed system
US20220156324A1 (en) Graph refactorization method and graph refactorization apparatus
WO2023222113A1 (zh) 稀疏参数的更新方法、训练节点、设备和存储介质
TWI758223B (zh) 具有動態最小批次尺寸之運算方法,以及用於執行該方法之運算系統及電腦可讀儲存媒體
US20230403232A1 (en) Data Transmission System and Method, and Related Device
WO2017113865A1 (zh) 一种大数据增量计算方法和装置
WO2024055529A1 (zh) 放置组成员选择方法、装置、设备及可读存储介质
Zhan et al. Field programmable gate array‐based all‐layer accelerator with quantization neural networks for sustainable cyber‐physical systems
CN115879543A (zh) 一种模型训练方法、装置、设备、介质及系统
CN116737069A (zh) 数据传输方法、装置、系统和计算机设备
CN117035045A (zh) 模型参数更新方法、装置、设备、存储介质和程序产品
CN115292044A (zh) 数据处理方法、装置、电子设备及存储介质
CN110325980A (zh) 用于数据库绑定型应用的用户界面后端集群的扩展技术
CN113971428A (zh) 数据处理方法、系统、设备、程序产品及存储介质
CN111652346A (zh) 一种基于分层优化范式的大规模图深度学习计算框架
WO2023151216A1 (zh) 图数据处理的方法和芯片

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23807062

Country of ref document: EP

Kind code of ref document: A1