WO2024001861A1 - Model training method, apparatus and system, and related device - Google Patents

Model training method, apparatus and system, and related device

Info

Publication number
WO2024001861A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
gradient data
communication
communication domain
devices
Prior art date
Application number
PCT/CN2023/101224
Other languages
French (fr)
Chinese (zh)
Inventor
郝日佩
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2024001861A1 publication Critical patent/WO2024001861A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/098Distributed learning, e.g. federated learning

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a model training method, device, system and related equipment.
  • the scale of AI models is gradually increasing.
  • the parameter volume of the Pangu model in the field of natural language processing (NLP) can be as high as 200 billion, and the data volume of training samples can be as high as 40 terabytes.
  • a larger amount of model parameters and sample data will require higher computing power to train the AI model.
  • the huge computing power required for AI model training can be solved through distributed training of AI models.
  • the training samples of the AI model can be equally divided into multiple sample subsets, and each sample subset, together with a copy of the AI model, is assigned to a device, so that each device uses its sample subset to iteratively train the AI model and generate the gradients used to update it.
  • in each round, different devices exchange the gradient data they each generate and perform gradient fusion to calculate the global gradient data (that is, the data obtained after gradient fusion of the gradient data generated by all devices), and then update the parameters of the AI model trained on each device based on the global gradient data.
  • multiple rounds of such iterative updates are performed on the parameters of the AI model until its training is finally completed.
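  • To illustrate this baseline scheme only (not part of the claimed method), the following minimal Python sketch mimics data-parallel training: every device computes a gradient on its own sample subset, the gradients of all devices are fused into global gradient data, and every device applies the same update. The helper local_gradient is a hypothetical placeholder.

```python
import numpy as np

def local_gradient(params, samples):
    # Hypothetical placeholder: a real device would compute the gradient of a
    # loss on its own sample subset; a toy value is used here for illustration.
    return params - np.mean(samples, axis=0)

def train_round(params, sample_subsets, lr=0.1):
    # Each device trains on its own subset and produces gradient data.
    grads = [local_gradient(params, subset) for subset in sample_subsets]
    # Gradient fusion across all devices yields the global gradient data.
    global_grad = np.mean(grads, axis=0)
    # Every device applies the same update, keeping the model replicas in sync.
    return params - lr * global_grad

rng = np.random.default_rng(0)
subsets = np.split(rng.normal(size=(16, 4)), 4)   # 4 devices, 4 samples each
params = np.zeros(4)
for _ in range(5):
    params = train_round(params, subsets)
```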
  • embodiments of the present application provide a model training method, which can be executed by a corresponding model training device.
  • the model training device obtains an AI model to be trained and determines multiple communication domains.
  • the AI model can, for example, be an AI model with a large number of model parameters or a large amount of sample data, such as the Pangu model.
  • Each communication domain determined includes multiple devices. For example, all devices used to train the AI model can be divided into multiple communication domains, etc.
  • during each round of distributed training of the AI model using the devices in the multiple communication domains, the model training device uses the local gradient data corresponding to each communication domain to update the AI model trained in that communication domain.
  • the local gradient data corresponding to each communication domain is obtained by gradient fusion of the gradient data generated by the multiple devices in that communication domain.
  • in addition, every multiple rounds (the number of interval rounds can be a fixed number or a random number), the model training device uses global gradient data to update the AI model trained separately in each communication domain.
  • the global gradient data is obtained by gradient fusion of the gradient data generated by the multiple devices in the multiple communication domains, so that the global gradient data can be used to perform a global update of the AI model.
  • since each communication domain independently uses the gradient data generated by the multiple devices within it to update the AI model it trains, the problem that the overall training progress of the AI model is slowed down because some communication domains make slow progress over a period of time can be alleviated, thereby improving the overall training efficiency of the AI model.
  • furthermore, since the model training device uses global gradient data to update the AI model every multiple rounds, the training effect of the AI model can still reach a high level. On this basis, because devices in different communication domains do not need to exchange gradient data during each round of model training, the communication resources required to train the AI model can be effectively reduced.
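  • As a non-authoritative sketch of the schedule just described (the helpers fuse, compute_grad and dummy_grad are assumptions, not from the application), each communication domain fuses only its own devices' gradients in ordinary rounds, while every T-th round the gradients of all domains are fused into global gradient data:

```python
import numpy as np

def fuse(grads):
    # Gradient fusion is modelled here as an element-wise average.
    return np.mean(grads, axis=0)

def train(compute_grad, domains, rounds, T, dim, lr=0.01):
    # Each communication domain keeps its own replica of the AI model
    # between global synchronisations.
    replicas = {i: np.zeros(dim) for i in range(len(domains))}
    for rnd in range(1, rounds + 1):
        grads = {i: [compute_grad(replicas[i], dev) for dev in domain]
                 for i, domain in enumerate(domains)}
        if rnd % T == 0:
            # Interval round: fuse gradient data from the devices of ALL domains.
            global_grad = fuse([g for gs in grads.values() for g in gs])
            for i in replicas:
                replicas[i] = replicas[i] - lr * global_grad
        else:
            # Ordinary round: each domain only uses its local gradient data.
            for i in replicas:
                replicas[i] = replicas[i] - lr * fuse(grads[i])
    return replicas

# Example with two communication domains of two devices each and a dummy
# gradient function (purely illustrative).
dummy_grad = lambda params, dev: params + dev
print(train(dummy_grad, [[1, 2], [3, 4]], rounds=4, T=2, dim=3))
```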
  • the AI model can be independently trained and updated for each of the multiple communication domains.
  • taking one of the multiple communication domains (the target communication domain) as an example, during each round of training, the multiple devices in the target communication domain exchange the gradient data they each generate and perform gradient fusion on the exchanged gradient data to generate the local gradient data corresponding to the target communication domain.
  • the local gradient data corresponding to the target communication domain is then used to update the AI model trained in the target communication domain.
  • the remaining communication domains can use similar methods to train the AI models for which they are responsible. In this way, each communication domain can perform gradient updates of the AI model within a local scope by exchanging gradient data internally, and the efficiency with which different communication domains update the AI model is not affected by other communication domains, which can improve the overall training efficiency of the AI model.
  • the specific method may be to obtain the version number of the activation operation corresponding to the target communication domain and the version number of the interaction operation, where the activation operation is used to trigger the exchange of gradient data between different devices in the target communication domain,
  • and the interaction operation refers to the operation of exchanging gradient data between different devices in the target communication domain. Therefore, when the version number of the activation operation is greater than or equal to the version number of the interaction operation,
  • the multiple devices in the target communication domain exchange the gradient data they each generate. In this way, by limiting the number of times gradient data is exchanged, each communication domain can prevent some communication domains from performing too many exchanges, thereby avoiding asynchronous conflicts between multiple communication domains.
  • the physical connection between the multiple devices in the target communication domain is a ring connection, such as a connection based on an HCCS ring mode.
  • the model training device can obtain the device topology relationship.
  • the topological relationship is used to indicate the connection relationship between multiple devices used to train the AI model.
  • the multiple devices used to train the AI model are divided to obtain multiple communication domains.
  • the communication rate between different devices in each communication domain is higher than the communication rate between devices in different communication domains.
  • alternatively, the user can configure, for each of the devices used to train the AI model, the communication domain to which that device belongs.
  • specifically, the model training device may generate a first configuration interface, which is used to present to the user the identifiers of the multiple devices used to train the AI model, so that the user can configure, on the first configuration interface, the communication domain to which each device belongs.
  • in this way, the model training device can respond to the user's first configuration operation and determine the communication domain to which each of the multiple devices used to train the AI model belongs, thereby dividing the multiple communication domains.
  • the user can configure multiple communication domains, so that the user can intervene in the training of the AI model and achieve better model training effects.
  • before training the AI model, the model training device can also generate a second configuration interface.
  • the second configuration interface is used to present multiple interaction strategies to the user, where each interaction strategy indicates a way of exchanging gradient data between the multiple devices in a communication domain, such as the allgather, allreduce, ring-allreduce, or halving-doubling allreduce strategies.
  • in response to the user's second configuration operation for these multiple interaction strategies, the model training device determines how gradient data is exchanged between the multiple devices in each communication domain; different communication domains can use the same interaction strategy to exchange gradient data, or can use different interaction strategies.
  • the user can manually configure the interaction strategy in each communication domain, so that the user can intervene in the training of the AI model.
  • the most appropriate interaction strategy can be configured according to the device characteristics in each communication domain, so as to achieve a better model training effect.
  • devices in different communication domains are located on the same computing node, or devices in multiple communication domains are located on different computing nodes.
  • the devices in each communication domain include processors, chips, servers, etc., so that distributed training of AI models can be implemented based on devices with different granularities.
  • embodiments of the present application provide a model training device.
  • the device includes: an acquisition module for acquiring an AI model to be trained; and a determination module for determining multiple communication domains.
  • each of the multiple communication domains includes multiple devices; and an update module, configured to use the local gradient data corresponding to each communication domain to update the AI model trained in that communication domain during each round of distributed training of the AI model using the devices in the multiple communication domains.
  • the local gradient data corresponding to each communication domain is obtained by gradient fusion based on the gradient data respectively generated by the multiple devices in that communication domain.
  • in addition, every multiple rounds, the AI model trained separately in each communication domain is updated using global gradient data.
  • the global gradient data is obtained by gradient fusion based on the gradient data respectively generated by the multiple devices in the multiple communication domains.
  • the update module is configured to: exchange, between the multiple devices in a target communication domain, the gradient data each device generates, where the target communication domain is one of the multiple communication domains; perform gradient fusion in the target communication domain according to the gradient data exchanged between the multiple devices to generate the local gradient data corresponding to the target communication domain; and use the local gradient data corresponding to the target communication domain to update the AI model trained in the target communication domain.
  • the update module is configured to: obtain the version number of the activation operation corresponding to the target communication domain and the version number of the interaction operation, where the activation operation is used to trigger the exchange of gradient data between different devices in the target communication domain,
  • and the interaction operation is the operation of exchanging gradient data between different devices in the target communication domain; and, when the version number of the activation operation is greater than or equal to the version number of the interaction operation, exchange, between the multiple devices in the target communication domain, the gradient data each device generates.
  • the physical connection between multiple devices in the target communication domain is a ring connection.
  • the determining module is configured to: obtain a device topology relationship, which indicates a connection relationship between the multiple devices used to train the AI model; and, according to the device topology relationship, divide the multiple devices used to train the AI model to obtain the multiple communication domains, wherein the communication rate between different devices in each communication domain is higher than the communication rate between devices in different communication domains.
  • the determining module is configured to: generate a first configuration interface, where the first configuration interface is used to present to the user the identifiers of the multiple devices used to train the AI model; and, in response to the user's first configuration operation, determine the communication domain to which each of the multiple devices used to train the AI model belongs.
  • the determination module is further configured to: before training the AI model, generate a second configuration interface, where the second configuration interface is used to present multiple interaction strategies to the user, and each of the multiple interaction strategies is used to indicate a way of exchanging gradient data between the multiple devices in a communication domain; and, in response to a second configuration operation by the user for the multiple interaction strategies, determine the way in which the multiple devices in each communication domain exchange gradient data.
  • devices in different communication domains are located on the same computing node, or devices in multiple communication domains are located on different computing nodes.
  • the devices in each communication domain include a processor, a chip or a server.
  • since the model training device provided in the second aspect corresponds to the model training method provided in the first aspect, the technical effects of the second aspect and of each embodiment in the second aspect can be found in the corresponding descriptions of the first aspect and of each embodiment in the first aspect, and will not be described in detail here.
  • embodiments of the present application provide a model training system, characterized in that the model training system includes a plurality of devices and is used to perform the model training method in the above-mentioned first aspect or any implementation of the first aspect.
  • embodiments of the present application provide a computing device.
  • the computing device includes a processor and a memory; the memory is used to store instructions, and the processor executes the instructions stored in the memory, so that the computing device performs the model training method in the above first aspect or any possible implementation of the first aspect.
  • the memory can be integrated into the processor or independent of the processor.
  • the computing device may also include a bus, through which the processor is connected to the memory.
  • the memory may include a read-only memory and a random access memory.
  • embodiments of the present application further provide a computer-readable storage medium, which stores programs or instructions that, when run on at least one computer, cause the at least one computer to execute the model training method in the above-mentioned first aspect or any implementation of the first aspect.
  • embodiments of the present application further provide a computer program product containing instructions that, when run on at least one computer, cause the at least one computer to execute the model training method in the above first aspect or any implementation of the first aspect.
  • Figure 1 is a schematic diagram of interactive gradient data of four devices provided by the embodiment of the present application.
  • Figure 2 is a schematic architectural diagram of an exemplary model training system provided by an embodiment of the present application.
  • Figure 3 is a schematic flow chart of a model training method provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of the topology between NPU1 to NPU8 used to train the AI model
  • Figure 5 is a schematic diagram of an exemplary configuration interface provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of interactive gradient data between four processors
  • Figure 7 is a schematic diagram of another exemplary configuration interface provided by an embodiment of the present application.
  • Figure 8 is a schematic diagram of processor 2 notifying other processors of interactive gradient data
  • Figure 9 is an architectural schematic diagram of an exemplary server provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of distributed training for the Pangu model provided by the embodiment of the present application.
  • Figure 11 is a schematic structural diagram of a model training device provided by an embodiment of the present application.
  • Figure 12 is a schematic diagram of the hardware structure of a computing device provided by an embodiment of the present application.
  • the equipment for training AI models can be processor-level equipment, such as neural network processor (neural-network processing unit, NPU), graphics processor (graphics processing unit, GPU), etc.
  • the device for training the AI model can be a chip-level device, such as multiple chips connected to the host.
  • the device for training the AI model can be a server-level device, such as multiple independent servers.
  • the multiple processors may be located on the same server (the server may constitute a computing node), or may be located on different servers.
  • the multiple devices for training the AI model are server-level devices, the multiple devices can be located in the same data center (the data center can be regarded as a computing node), or the multiple devices can be located in different data centers, that is, AI models can be trained distributedly across data centers.
  • in the process of iteratively training the AI model, the multiple devices usually use the ring-allreduce method to exchange the gradient data generated by each device in each round of AI model training, generate new gradient data through gradient fusion of the gradient data obtained from the exchange,
  • and use the new gradient data to update the AI model parameters.
  • take the use of 4 devices to train an AI model as an example.
  • device 1 to device 4 are shown in Figure 1.
  • each device uses a subset of samples to train the AI model and generates the corresponding gradient data; device 1 to device 4 can then each divide the gradient data obtained by training into 4 shards according to the number of devices.
  • the gradient data of device 1 can be divided into shards a1, b1, c1, and d1;
  • the gradient data of device 2 can be divided into shards a2, b2, c2, and d2;
  • the gradient data of device 3 can be divided into shards a3, b3, c3, and d3;
  • and the gradient data of device 4 can be divided into shards a4, b4, c4, and d4.
  • device 1 sends shard a1 to device 2,
  • device 2 sends shard b2 to device 3,
  • device 3 sends shard c3 to device 4,
  • and device 4 sends shard d4 to device 1.
  • after that, each device can perform gradient fusion on its own stored gradient data shard and the received shard to generate new gradient data.
  • for example, device 1 can perform gradient fusion on shard d4 sent by device 4 and its stored shard d1 to generate a new gradient data shard D1, and use D1 to overwrite shard d1.
  • assuming shard d4 is {3,5,4,2,7,2}
  • and shard d1 is {1,5,6,9,11,21},
  • then D1 obtained by gradient fusion can be {2,5,5,6,9,12} (the values at corresponding positions are added and averaged, and the average is rounded up).
  • similarly, device 2 can generate a new gradient data shard A2 and use A2 to overwrite shard a2;
  • device 3 can generate a new gradient data shard B3 and use B3 to overwrite shard b3;
  • and device 4 can generate a new gradient data shard C4 and use C4 to overwrite shard c4.
  • after that, device 1 to device 4 interact for the second time:
  • device 1 sends shard D1 to device 2,
  • device 2 sends shard A2 to device 3,
  • device 3 sends shard B3 to device 4,
  • and device 4 sends shard C4 to device 1.
  • each device once again performs gradient fusion on its own stored gradient data shards and the received shards, generates new gradient data shards, and uses them to replace the originally stored shards. Through multiple such interactions among device 1 to device 4, each device ends up holding one gradient data shard that is obtained by gradient fusion of the corresponding shards from all of device 1 to device 4, as shown in Figure 1.
  • device 1 to device 4 then continue to interact and share the fused shards stored on each device with the other devices, so that each device can obtain the gradient fusion result based on the gradient data of all devices.
  • the gradient fusion result is shown in Figure 1, and each device can use it to update the parameters of the AI model. In this way, the multiple devices complete one round of the training process for the AI model.
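  • For illustration, the ring-allreduce pattern described above can be simulated in pure Python as follows (a simplified sketch only; real implementations overlap communication with computation, and the shard indexing here is just one possible convention):

```python
import numpy as np
from copy import deepcopy

def ring_allreduce(grads, average=True):
    """Simulate ring-allreduce over equal-length 1-D gradient arrays, one per
    device, and return the fused gradient every device ends up holding."""
    n = len(grads)
    shards = [list(np.array_split(np.asarray(g, dtype=float), n)) for g in grads]
    # Phase 1 (scatter-reduce): in step s, device i sends shard (i - s) mod n
    # to device (i + 1) mod n, which adds it to its own copy of that shard.
    for s in range(n - 1):
        sent = deepcopy(shards)              # model "simultaneous" sends
        for i in range(n):
            k = (i - s) % n
            shards[(i + 1) % n][k] = shards[(i + 1) % n][k] + sent[i][k]
    # Phase 2 (allgather): each fully fused shard is circulated around the ring.
    for s in range(n - 1):
        sent = deepcopy(shards)
        for i in range(n):
            k = (i + 1 - s) % n
            shards[(i + 1) % n][k] = sent[i][k]
    fused = np.concatenate(shards[0])
    return fused / n if average else fused

# Four devices, each with its own gradient data, as in Figure 1.
grads = [np.arange(8, dtype=float) + d for d in range(4)]
print(ring_allreduce(grads))                 # element-wise average of the four gradients
```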
  • embodiments of the present application provide a model training method, which can be executed by a corresponding model training device to improve the training efficiency of the AI model.
  • specifically, the model training device obtains the AI model to be trained and determines multiple communication domains, each of which includes multiple devices.
  • during each round of training the AI model using the devices in the multiple communication domains, the model training device uses the local gradient data corresponding to each communication domain to update the AI model trained in that communication domain.
  • the local gradient data corresponding to each communication domain is obtained by gradient fusion based on the gradient data generated by the multiple devices in that communication domain.
  • in addition, every multiple rounds of training the AI model using the multiple communication domains, the model training device uses global gradient data to update the AI model trained separately in each communication domain, where the global gradient data is obtained by gradient fusion of the gradient data generated by the multiple devices in the multiple communication domains.
  • since each communication domain independently uses the gradient data generated by the multiple devices within it to update the AI model it trains, the problem that the overall training progress of the AI model is slowed down because some communication domains make slow progress over a period of time can be alleviated, thereby improving the overall training efficiency of the AI model.
  • the model training device can classify device 1 and device 2 into communication domain 1, and classify device 3 and device 4 into communication domain 2.
  • device 1 and device 2 in communication domain 1 train the AI model respectively and obtain corresponding gradient data.
  • the model training device can perform gradient fusion on the gradient data in communication domain 1, and use the generated local gradient data to update the AI model trained by device 1 and device 2.
  • the model training device will also use the local gradient data generated in the communication domain 2 to update the AI model trained by the device 3 and the device 4 during this round of training.
  • for example, assume that in the first round of training, communication domain 1 takes 40 seconds to complete the AI model training and update,
  • while communication domain 2 takes 60 seconds to complete the AI model training and update;
  • and that in the second round of training, communication domain 1 takes 55 seconds to complete the AI model training and update,
  • while communication domain 2 takes 40 seconds to complete the AI model training and update.
  • further assume that, after the second round, updating the AI model globally based on the local gradient data generated by the two communication domains takes 10 seconds.
  • since communication domain 1 and communication domain 2 are independent of each other during the two rounds of model training, communication domain 1 takes 95 seconds (40 seconds + 55 seconds) and communication domain 2 takes 100 seconds (60 seconds + 40 seconds), so the overall training time of the AI model is 110 seconds (100 seconds + 10 seconds), which is less than the 125 seconds (60 seconds + 55 seconds + 10 seconds) required by the existing ring-allreduce-based method of training AI models, thereby improving the overall training efficiency of the AI model.
  • in addition, since the model training device uses global gradient data to update the AI model every multiple rounds, the training effect of the AI model can still reach a high level.
  • on this basis, because devices in different communication domains do not need to exchange gradient data during each round of model training, the communication resources required to train the AI model can be effectively reduced.
  • the above-mentioned model training device for executing the model training method can be deployed in the system architecture shown in Figure 2.
  • the system architecture shown in Figure 2 may include a deep learning framework 201, a computing architecture 202, firmware and drivers 203, and a hardware layer 204.
  • the deep learning framework 201 can integrate development components, pre-trained models and other resources, shield users from the underlying complex hardware, and provide users with services for the rapid development of AI models.
  • the deep learning framework 201 may be, for example, the TensorFlow framework, the PyTorch framework, the MindSpore framework, etc., or may be other types of deep learning frameworks, which are not limited.
  • the computing architecture 202 is used to provide an open programming interface, support users in quickly building AI applications and services based on AI models, and call multiple processors in the hardware layer 204 to achieve parallel training of AI models. Furthermore, the computing architecture 202 can also implement functions such as graph-level and operator-level compilation optimization and automatic tuning of AI models.
  • the computing architecture 202 may be, for example, a neural network computing architecture (compute architecture for neural networks, CANN), etc., or may be other applicable architectures.
  • Firmware and driver 203 are used to respond to the call of the computing architecture 202 to the hardware layer 204 and use multiple processors in the hardware layer 204 to perform corresponding data processing operations, such as using multiple processors in the hardware layer 204 for parallelization. Training AI models, etc.
  • the hardware layer 204 includes multiple processors, such as processor 1 to processor 8 in Figure 2, and other devices such as memory, network card, etc. (not shown in Figure 2), which are used to provide data processing capabilities for the upper layer and support Upper level services.
  • the processors included in the hardware layer 204 may include, for example, one or more of a central processing unit (CPU), a neural network processing unit (NPU), a graphics processing unit (GPU), or a data processing unit (DPU), or may include other types of processors, which are not limited here.
  • the model training device can also be deployed in other types of system architecture to implement distributed training of AI models.
  • the hardware layer 204 may include multiple servers, that is, the AI model may be distributedly trained at the server granularity.
  • Figure 3 is a schematic flow chart of a model training method in an embodiment of the present application.
  • This method can be applied to the system architecture shown in Figure 2. In practical applications, this method can also be applied to other applicable system architectures. To facilitate understanding and description, the following is an example of an application to the system architecture shown in Figure 2.
  • the method may specifically include:
  • S301 Obtain the AI model to be trained.
  • when users develop AI applications on the deep learning framework 201, they can provide the AI model used to implement the AI application to the deep learning framework 201, so that the deep learning framework 201 can provide the AI model to the computing architecture 202.
  • the model training device is used to trigger the training of the AI model.
  • the user can write a training script (an executable file written based on a specific format) on the deep learning framework 201, and the training script can be integrated with the file of the AI model built by the user on the deep learning framework 201.
  • the deep learning framework 201 can provide the training script to the model training device in the computing architecture 202, so that the model training device can parse the AI model from the training script and perform distributed training on the AI model according to the model training logic indicated by the training script.
  • alternatively, the deep learning framework 201 can provide the user with a configuration interface that presents multiple pre-built AI models, so that the deep learning framework 201 can determine the AI model to be trained according to the user's selection operation for one of the AI models.
  • the configuration interface can also present a variety of deep learning algorithms that can be used to train AI models, so that users can select a deep learning algorithm on the configuration interface and configure corresponding parameters based on the selected deep learning algorithm, such as the learning rate, loss function, etc.
  • the deep learning framework 201 can provide the AI model, deep learning algorithm, and configured parameters selected by the user to the model training device, so that the model training device performs distributed training on the AI model based on the deep learning algorithm and configured parameters.
  • the model training device can also obtain the AI model to be trained through other methods, which is not limited in this embodiment.
  • S302 Determine multiple communication domains, each of the multiple communication domains including multiple devices.
  • the model training device can use N processors in the hardware layer 204 to train the AI model, where N is a positive integer (such as N is 8, 16, etc.). Moreover, before training the AI model, the model training device can first divide the N processors for training the AI model into multiple sets, each set including at least two processors, and the processors in each set can form a communication domain. For example, the model training device can divide 8 processors into 2 communication domains, each communication domain including 4 processors, etc.
  • processors in different communication domains can train the AI model independently. For example, when the processors in communication domain 1 complete a round of training of the AI model, they do not need to wait for the processors in communication domain 2 to also complete that round of AI model training, and can directly execute the next round of training of the AI model.
  • processors in each communication domain can exchange, through allreduce, ring-allreduce and other methods, the gradient data generated in each round of AI model training.
  • this embodiment provides the following two implementation examples for determining multiple communication domains:
  • the model training device can classify multiple devices with higher affinity into the same communication domain based on the affinity between the devices.
  • the model training device can obtain the device topology relationship between the N processors in the hardware layer 204.
  • the device topology relationship is used to indicate the connection relationship between the N processors.
  • according to the device topology relationship, the model training device can divide the N processors used to train the AI model to obtain multiple communication domains.
  • the communication rate between different processors in each communication domain is higher than the communication rate between processors in different communication domains.
  • the model training device can divide multiple processors that are physically connected in a ring connection into one communication domain according to the topological relationship of the device, so that multiple communication domains can be divided. For example, physical connections may be established between processors in the same communication domain based on a Huawei cache coherence system (huawei cache coherence system, HCCS) ring connection method.
  • the N processors used to train the AI model include NPU1 to NPU8 as shown in Figure 4.
  • NPU1 to NPU4 are connected in full-mesh mode, NPU5 to NPU8 are also connected in full-mesh mode, and NPU1 and NPU5 can be connected through the CPU.
  • the model training device can determine to classify NPU1 to NPU4 into communication domain 1 and classify NPU5 to NPU8 into communication domain 2 according to the topological structure between NPU1 to NPU8.
  • the communication rate between NPU1 to NPU4 is higher than the communication rate between NPUs across communication domains.
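  • A minimal sketch of such topology-based division (a simplified greedy grouping; the dictionary format mapping each device to the peers it reaches over a high-speed full-mesh/HCCS link is an assumption made for illustration):

```python
def build_communication_domains(topology):
    """topology: dict mapping a device id to the set of devices it is
    connected to over a high-speed link (e.g. full mesh / HCCS ring)."""
    domains, assigned = [], set()
    for dev, peers in topology.items():
        if dev in assigned:
            continue
        # Group the device with its still-unassigned high-speed peers.
        domain = {dev} | {p for p in peers if p not in assigned}
        assigned |= domain
        domains.append(sorted(domain))
    return domains

topology = {
    "NPU1": {"NPU2", "NPU3", "NPU4"}, "NPU2": {"NPU1", "NPU3", "NPU4"},
    "NPU3": {"NPU1", "NPU2", "NPU4"}, "NPU4": {"NPU1", "NPU2", "NPU3"},
    "NPU5": {"NPU6", "NPU7", "NPU8"}, "NPU6": {"NPU5", "NPU7", "NPU8"},
    "NPU7": {"NPU5", "NPU6", "NPU8"}, "NPU8": {"NPU5", "NPU6", "NPU7"},
}
print(build_communication_domains(topology))
# [['NPU1', 'NPU2', 'NPU3', 'NPU4'], ['NPU5', 'NPU6', 'NPU7', 'NPU8']]
```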
  • the model training device can generate a configuration interface. For example, it can generate a configuration interface as shown in Figure 5.
  • the configuration interface includes the identifiers (such as processor names) of M processors that can be used to train the AI model, where M is a positive integer greater than or equal to N. The model training device can present the configuration interface to the user through the deep learning framework 201, so that the user can select, from the presented M processors, the N processors for training the AI model this time, and further configure the communication domain to which each selected processor belongs.
  • the model training device may execute the initialization process of the communication domain.
  • specifically, the model training device may respond to the user's configuration operation, determine the communication domain to which each of the N processors used to train the AI model belongs, and thereby divide the processors to obtain multiple communication domains and determine the size of each communication domain.
  • the number of processors included in each communication domain may be the same or different.
  • the configuration interface shown in Figure 5 presents 16 processors for the user to select, and based on the user's selection operation on the processor, it is determined to select processor 1 to processor 8 to train the AI model. Then, the user can create two communication domains on the configuration interface, namely communication domain 1 and communication domain 2, and specify the communication domains to which processors 1 to 8 respectively belong on the configuration interface. In this way, the model training device can determine the processors included in each communication domain according to the user's configuration of the communication domain to which each processor belongs, thereby obtaining multiple communication domains.
  • in actual application, the model training device can also determine multiple communication domains in other ways. For example, after the model training device determines the multiple processors selected by the user, processors located in the same server can be classified into one communication domain, etc., which is not limited in this embodiment.
  • S303 In the process of training the AI model using multiple communication domains in each round, use the local gradient data corresponding to each communication domain to update the AI model trained in the communication domain.
  • the local gradient data corresponding to each communication domain is obtained by gradient fusion based on the gradient data generated by the multiple processors in that communication domain.
  • the model training device can use processors in the multiple communication domains to perform distributed training on the AI model.
  • specifically, the model training device can allocate an AI model and a subset of training samples for training the AI model to each processor. Different processors are assigned the same AI model but different subsets of training samples, and each training sample subset includes at least one training sample. Then, during each round of training, each processor trains the AI model using the assigned subset of training samples, and generates gradient data based on the difference between the AI model's inference results on the training samples and the actual results; the gradient data is used to perform gradient updates on the parameters of the AI model. Since each processor trains the AI model based on only part of the training samples (that is, a subset of training samples), different processors can exchange the gradient data each generates and perform gradient fusion, and then update the parameters of the AI model on each processor based on the gradient fusion result, so as to achieve the effect of using multiple training sample subsets to train the AI model.
  • gradient data is not exchanged between all processors.
  • Each processor only exchanges gradient data within the communication domain to which it belongs, and performs gradient fusion and model parameter updating. This prevents the model training processes in different communication domains from interfering with each other.
  • take NPU1 to NPU8 shown in Figure 4 as an example.
  • NPU1 to NPU4 only exchange gradient data within communication domain 1, and do not exchange gradient data with NPU5 to NPU8 in communication domain 2.
  • likewise, NPU5 to NPU8 only exchange gradient data within communication domain 2, and do not exchange gradient data with the NPUs in communication domain 1. In this way, each communication domain can directly execute the next round of the model training process after completing gradient data exchange, gradient fusion, and model parameter updating, without waiting for other communication domains to complete their round of AI model training.
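  • For instance, with a PyTorch distributed backend (shown here only as an illustrative sketch, not the implementation disclosed in this application; it assumes init_process_group has already been called on every rank), gradient exchange can be restricted to a communication domain by creating one process group per domain:

```python
import torch
import torch.distributed as dist

def make_domain_group(domains, rank):
    """domains: list of rank lists, e.g. [[0, 1, 2, 3], [4, 5, 6, 7]].
    Every rank must call new_group() for every domain, in the same order,
    and then keep only the group it belongs to."""
    groups = [dist.new_group(ranks=r) for r in domains]
    for ranks, group in zip(domains, groups):
        if rank in ranks:
            return group

def local_gradient_fusion(grad: torch.Tensor, group) -> torch.Tensor:
    # All-reduce only among the devices of one communication domain,
    # then average: this yields the domain's local gradient data.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)
    grad /= dist.get_world_size(group=group)
    return grad
```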
  • multiple processors in each communication domain can exchange gradient data based on any strategy.
  • one communication domain among multiple communication domains (hereinafter referred to as the target communication domain) is taken as an example for illustrative explanation.
  • Processors in the remaining communication domains can refer to similar processes for data exchange.
  • the new gradient data generated by gradient fusion based on the gradient data of all processors in each communication domain is called local gradient data.
  • data interaction can be carried out between multiple processors in the target communication domain based on the following implementation methods.
  • multiple processors in the target communication domain can exchange gradient data based on any one of strategies including allgather, allreduce, ring-allreduce, and halving-doubling allreduce.
  • for example, assume that processor 1 to processor 4 in the target communication domain generate gradient data a, b, c, and d respectively, as shown in Figure 6. Processor 1 can exchange gradient data with processor 2 and, at the same time, processor 3 and processor 4 can exchange gradient data.
  • then, processor 1 and processor 2 can each perform gradient fusion on gradient data a and gradient data b to generate gradient data M;
  • and processor 3 and processor 4 can each perform gradient fusion on gradient data c and gradient data d to generate gradient data N.
  • processor 1 can exchange gradient data with processor 3.
  • processor 1 sends gradient data M to processor 3, and processor 3 sends gradient data N to processor 1.
  • processor 2 and processor 4 exchange gradient data.
  • the processor 1 and the processor 3 can perform gradient fusion on the gradient data M and the gradient data N to generate the gradient data X.
  • the gradient data X is the data generated by gradient fusion of gradient data a, b, c, and d.
  • the processor 2 and the processor 4 can also perform gradient fusion on the gradient data M and the gradient data N to generate the gradient data X. In this way, after two interactions, each processor can obtain the gradient data X generated by gradient fusion based on the gradient data of all processors in the target communication domain.
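  • The pairwise exchange pattern in this example can be sketched as a recursive-doubling style allreduce (pure Python, for illustration only; it assumes the number of processors is a power of two):

```python
import numpy as np

def recursive_doubling_allreduce(grads):
    """Each element of grads is one processor's gradient data; after log2(n)
    exchange steps every processor holds the same fused result X."""
    vals = [np.asarray(g, dtype=float) for g in grads]
    n, step = len(vals), 1
    while step < n:
        fused = list(vals)
        for i in range(n):
            partner = i ^ step            # step 1: 0<->1, 2<->3; step 2: 0<->2, 1<->3
            fused[i] = vals[i] + vals[partner]
        vals, step = fused, step * 2
    return [v / n for v in vals]          # fusion modelled as an average

a, b, c, d = (np.array([1.0, 2.0]), np.array([3.0, 4.0]),
              np.array([5.0, 6.0]), np.array([7.0, 8.0]))
print(recursive_doubling_allreduce([a, b, c, d])[0])   # -> [4. 5.] on every processor
```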
  • the strategy for interacting gradient data adopted in each communication domain can be configured by the user.
  • the model training device can present the configuration interface as shown in Figure 7 to the user through the deep learning framework 201, so that the user can configure the interaction strategy for each communication domain on the configuration interface.
  • the configuration interface can provide multiple candidate interaction strategies for each communication domain, such as allgather, allreduce, ring-allreduce, halving-doubling allreduce, etc.
  • in this way, the user can choose an interaction strategy for each communication domain from the multiple candidates, and the interaction strategies adopted by different communication domains may be the same or different, which is not limited in this embodiment.
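  • A hypothetical configuration sketch of such per-domain interaction strategies (the field names and device identifiers below are illustrative assumptions, not a disclosed format):

```python
# Strategies the (hypothetical) configuration interface could offer.
INTERACTION_STRATEGIES = {"allgather", "allreduce", "ring-allreduce",
                          "halving-doubling-allreduce"}

domain_config = {
    "communication_domain_1": {"devices": ["NPU1", "NPU2", "NPU3", "NPU4"],
                               "strategy": "ring-allreduce"},
    "communication_domain_2": {"devices": ["NPU5", "NPU6", "NPU7", "NPU8"],
                               "strategy": "halving-doubling-allreduce"},
}

# Validate the user's second configuration operation before training starts.
for name, cfg in domain_config.items():
    assert cfg["strategy"] in INTERACTION_STRATEGIES, name
```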
  • processors that have completed AI model training can give priority to the interaction of gradient data without waiting for all other processors in the target communication domain to also complete AI model training.
  • the efficiency of interacting gradient data between multiple processors within the target communication domain can be improved.
  • for example, processor 1 to processor 4 train the AI model in parallel. Assuming that processor 2 is the first in the target communication domain to complete training of the AI model, processor 2 can generate an activation message and use it to notify processor 1, processor 3 and processor 4 to start exchanging gradient data. In actual application, based on the physical connections and communication rules between processors, processor 2 can first send an activation message to processor 1 to notify processor 1 to start exchanging gradient data, then send an activation message to processor 4, and an activation message is likewise sent to processor 3 to notify it to start exchanging gradient data, as shown in Figure 8.
  • in this way, if processor 1 is the second to complete AI model training,
  • processor 2 and processor 1 can directly exchange gradient data (and perform gradient fusion). Then, if processor 3 is the third to complete AI model training, processor 2 can further exchange gradient data with processor 3. Moreover, when processor 4 also completes AI model training, processor 2 then exchanges gradient data with processor 4. In this way, processor 2 can obtain the gradient data generated by all processors in the target communication domain. Finally, processor 2 can send the local gradient data generated based on the gradient data of the four processors to the remaining processors, so that the local gradient data is used to update the parameters of the AI model on each processor, as shown in Figure 8.
  • further, each communication domain can, by limiting the number of times gradient data is exchanged, prevent some communication domains from performing too many gradient data exchanges, thereby avoiding asynchronous conflicts between multiple communication domains.
  • specifically, the first processor in the target communication domain to complete AI model training can generate an activation message.
  • the activation message is used to notify the remaining processors in the target communication domain to start exchanging gradient data, and the activation message includes the version number of the activation operation.
  • the processor that completes AI model training first can also obtain the version number of the currently executed interaction operation, where the interaction operation is the operation of exchanging gradient data between different devices in the target communication domain.
  • in this way, the processor can compare the version number of the activation operation with the version number of the interaction operation.
  • when the version number of the activation operation is greater than or equal to the version number of the interaction operation, the processor starts to exchange gradient data with the other processors; otherwise, the exchange of gradient data is not performed between the multiple processors.
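  • A minimal sketch of this version-number check (the counter initialisation and helper names are assumptions made for illustration; the application does not specify them):

```python
class DomainSyncState:
    def __init__(self):
        self.activation_version = 0    # number of activation operations issued
        self.interaction_version = 1   # version number of the next interaction

    def activate(self):
        # A device finished its round of training and triggers the exchange.
        self.activation_version += 1

    def try_interact(self, exchange_fn):
        # Interact only while the activation version number is greater than
        # or equal to the interaction version number, as described above.
        if self.activation_version >= self.interaction_version:
            exchange_fn()                    # exchange gradient data
            self.interaction_version += 1    # record the completed exchange
            return True
        return False                         # prevents extra, conflicting exchanges

state = DomainSyncState()
state.activate()
print(state.try_interact(lambda: None))      # True: the exchange may proceed
print(state.try_interact(lambda: None))      # False: no additional exchange
```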
  • multiple processors in the target communication domain can exchange gradient data through shared memory.
  • multiple processors in the target communication domain can be configured with shared memory, and multiple processors can access the shared memory.
  • after each processor in the target communication domain completes a round of training for the AI model and generates gradient data, it can write the gradient data to a designated area in the shared memory, so that the shared memory stores the gradient data generated by each of the multiple processors.
  • each processor can access the gradient data generated by all processors in the target communication domain from the shared memory, and by performing gradient fusion on these gradient data, local gradient data can be obtained.
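  • For illustration, a shared-memory exchange of this kind could look like the following Python sketch (assuming Python 3.8+ and a fixed gradient size; in real use the writers would be separate processes attaching to the same shared memory block by name):

```python
import numpy as np
from multiprocessing import shared_memory

NUM_DEVICES, GRAD_SIZE = 4, 1024
shm = shared_memory.SharedMemory(create=True, size=NUM_DEVICES * GRAD_SIZE * 8)
slots = np.ndarray((NUM_DEVICES, GRAD_SIZE), dtype=np.float64, buffer=shm.buf)

def publish_gradient(device_index: int, grad: np.ndarray) -> None:
    slots[device_index, :] = grad            # write into this device's designated area

def fuse_local_gradient() -> np.ndarray:
    return slots.mean(axis=0)                # local gradient data for the domain

for i in range(NUM_DEVICES):
    publish_gradient(i, np.full(GRAD_SIZE, float(i)))
print(fuse_local_gradient()[:3])             # -> [1.5 1.5 1.5]

del slots                                    # release the view before closing
shm.close(); shm.unlink()                    # free the shared memory block
```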
  • multiple processors in each communication domain can interact with gradient data and generate local gradient data by referring to the above method.
  • the above-mentioned implementation methods of exchanging gradient data in the communication domain are only examples.
  • multiple devices in each communication domain can also use other methods to exchange gradient data, which is not limited in this embodiment.
  • the inference performance (such as inference accuracy, etc.) of the AI model trained in each communication domain is usually difficult to reach the level of training based on the full set of training samples.
  • therefore, in order to improve the inference performance of the AI model, after multiple rounds of training on the AI model are completed, gradient data can be exchanged between the multiple communication domains to update the AI model on each processor based on the gradient data generated by all processors. Specifically, gradient fusion can be performed on the gradient data generated by all processors, and the new gradient data generated by gradient fusion (hereinafter referred to as global gradient data) can be used to update the parameters of the AI model on each processor. In this way, the inference performance of the finally trained AI model can usually reach that of an AI model trained based on the full set of training samples.
  • during implementation, each communication domain trains the AI model independently, and the processors in each communication domain can count the current number of iterations of the AI model in each round of training. If the current number of iterations is an integer multiple of a value T (the configured synchronization interval), then not only do the multiple processors in the communication domain exchange gradient data in the manner described above and generate the local gradient data corresponding to the communication domain through gradient fusion, but the communication domain also exchanges local gradient data with the other communication domains, so that each communication domain can obtain the local gradient data generated separately by all communication domains. In this way, by performing gradient fusion on the local gradient data generated in all communication domains, global gradient data can be obtained and used to update the parameters of the AI model in each communication domain.
  • the way in which local gradient data is exchanged between multiple communication domains is similar to the way in which gradient data is exchanged between multiple processors in each communication domain.
  • for example, multiple communication domains can exchange local gradient data based on any one of the allgather, allreduce, ring-allreduce, and halving-doubling allreduce strategies, or multiple communication domains can exchange local gradient data sequentially in the order in which they complete the (m*T)-th round of model training (m is a positive integer), or multiple communication domains can exchange local gradient data through a shared storage area, etc., which is not limited in this embodiment.
  • in the above implementation, local gradient data is exchanged between different communication domains.
  • in other implementations, the gradient data generated by each processor can instead be exchanged directly between the multiple communication domains.
  • for example, a processor in each communication domain can aggregate the gradient data generated by each processor in that communication domain to obtain the gradient data set corresponding to the communication domain,
  • where the gradient data set includes the gradient data generated by all processors in the communication domain, so that the processors responsible for aggregating gradient data in the multiple communication domains can exchange their respective gradient data sets.
  • alternatively, all processors participating in AI model training can directly exchange the gradient data they each generate, etc.
  • in this way, each communication domain can obtain the gradient data generated by the processors in all communication domains, so that by performing gradient fusion on all the gradient data, the global gradient data can be obtained and used to update the parameters of the AI model in each communication domain.
  • in the above embodiment, the case in which the multiple communication domains exchange gradient data (or local gradient data) at a fixed interval of (T-1) rounds is taken as an example for illustration.
  • in actual application, the number of model training rounds between two exchanges of gradient data may not be a fixed value.
  • for example, in the process of distributed AI model training, when each communication domain has iteratively trained the AI model 1000 times, gradient data (or local gradient data) is exchanged between the multiple communication domains for the first time, with an interval of 1000 training rounds.
  • then, when the number of iterations in each communication domain reaches 1900, gradient data (or local gradient data) is exchanged between the multiple communication domains for the second time, with an interval of 900 rounds; when the number of iterations reaches 2700, the third exchange takes place, with an interval of 800 rounds; and when the number of iterations reaches 3400, the fourth exchange takes place, with an interval of 700 rounds, and so on.
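  • A small sketch of such a non-fixed synchronisation schedule, assuming (purely for illustration) that the interval shrinks by 100 rounds after every global exchange, as in the example above:

```python
def global_sync_iterations(first_interval=1000, shrink=100, min_interval=100):
    # Yield the iteration counts at which a global gradient exchange occurs.
    step, interval = 0, first_interval
    while interval >= min_interval:
        step += interval
        yield step
        interval -= shrink                    # shorter gaps as training progresses

schedule = global_sync_iterations()
print([next(schedule) for _ in range(4)])     # -> [1000, 1900, 2700, 3400]
```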
  • in the above embodiment, the case in which the devices in the communication domains are specifically processors is used for illustration.
  • in actual application, the devices in the communication domains may also be chips or servers.
  • the specific implementation process of performing distributed training of the AI model with such devices can be understood with reference to the relevant descriptions in this embodiment, and will not be described in detail here.
  • in this embodiment, each communication domain uses the gradient data generated by its internal processors to update the AI model trained in that communication domain. This can alleviate the impact that slow training progress in some communication domains has on the overall training progress of the AI model, that is, it can improve the overall training efficiency of the AI model. For example, if the progress of set 1 in the first round is delayed by 3 seconds and the progress of set 2 in the second round is delayed by 5 seconds, the overall progress is not delayed by 8 seconds, but only by the slowest single delay, which is 5 seconds.
  • for example, the system architecture described in Figure 2 can be deployed in a server.
  • the server includes 4 CPUs and can be connected to 8 NPU chips, as shown in Figure 9.
  • NPU1 to NPU8 in the server can implement distributed training of the Pangu model (an AI model).
  • in actual application, distributed training of the Pangu model can also be implemented based on NPU chips in multiple servers.
  • the training method is similar to the implementation of distributed training of the Pangu model using multiple NPUs in a single server, and can be understood with reference to that implementation.
  • each CPU can support eight DDR4 dual in-line memory modules (double data rate 4 dual in-line memory modules, DDR4 DIMMs), and CPU1 to CPU4 can be fully interconnected (full mesh).
  • the CPU in the server can provide a bandwidth capability of 90GB/s (gigabytes per second), of which each CPU can provide a one-way bandwidth of 30GB/s and a two-way bandwidth of 60GB/s.
  • NPU1 to NPU4 can be fully interconnected and can be located on one NPU motherboard.
  • NPU5 to NPU8 can be fully interconnected and can be located on another NPU motherboard.
  • the connection between the NPUs and the CPUs can be based on the peripheral component interconnect express (PCIe) bus (only part of the connections between the NPUs and the CPUs is shown in Figure 9), so that NPU1 to NPU4 can exchange data with NPU5 to NPU8 through the CPUs in the server.
  • Each NPU motherboard can provide a bandwidth capacity of 90GB/s, of which each NPU can provide a one-way bandwidth of 30GB/s and a two-way bandwidth of 60GB/s.
  • the distributed training process is shown in Figure 10.
  • Users can provide a training script to the server.
  • the training script can include the Pangu model file, specify NPU1 to NPU8 to train the Pangu model, and define NPU1 to NPU4 to belong to communication domain 1, and NPU5 to NPU8 to belong to communication domain 2.
  • the CPU on the host side of the server can parse the Pangu model to be trained from the training script, and determine the multiple NPUs used for distributed training of the Pangu model this time and the communication domain to which each NPU belongs.
  • the CPU can extract the calculation graph according to the training script.
  • the calculation graph includes multiple nodes, and there are edges connecting different nodes. Among them, the nodes in the calculation graph are used to indicate the calculations defined in the training script, and the edges between the nodes are used to indicate the dependencies between different calculations.
  • the extracted calculation graph can be saved to a trans-flash card.
  • the CPU can compile the calculation graph in the flash card, generate an intermediate representation (IR), and provide the IR to the compiler.
  • the compiler can define one or more operator libraries, such as the neural network (NN) operator library shown in Figure 10, the Huawei collective communication library (huawei collective communication library, HCCL) operator library, etc.
  • the NN operator library can include convolution layer operators, pooling layer operators, loss functions, etc.
  • the HCCL operator library can include operators used to define data communication methods, such as the allreduce operator, the allgather operator, etc.
  • the CPU can use the compiler to determine the operators that need to be executed sequentially for distributed training of the Pangu model, generate corresponding device instructions, and send the device instructions to the NPU on the device side.
  • NPU1 to NPU8 on the device side can execute the corresponding operators in a loop based on the device instructions issued by the host side and perform gradient updates on the Pangu model until the iteration termination conditions are met, thereby realizing distributed training of the Pangu model.
  • NPU1 to NPU4 in communication domain 1 and NPU5 to NPU8 in communication domain 2 train the Pangu model separately, and every (T-1) rounds of model training, communication domain 1 and communication domain 2 exchange the gradient data generated by training the Pangu model, so as to achieve a global gradient update of the Pangu model.
  • the device side can send the training results to the host side.
  • the training results may include, for example, the trained Pangu model, attribute information of the Pangu model (such as inference accuracy), etc.
  • FIG. 11 is a schematic structural diagram of a model training device provided by an embodiment of the present application.
  • the model training device 1100 shown in FIG. 11 may, for example, be the model training device mentioned in the embodiment shown in FIG. 3 .
  • the model training device 1100 includes:
  • Acquisition module 1101, configured to obtain the AI model to be trained;
  • Determining module 1102, configured to determine multiple communication domains, each of the multiple communication domains including multiple devices;
  • Update module 1103, configured to: during each round of distributed training of the AI model using the devices in the multiple communication domains, update the AI model trained in each communication domain using the local gradient data corresponding to that communication domain, where the local gradient data corresponding to each communication domain is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in that communication domain; and, when the AI model is distributedly trained using the devices in the multiple communication domains at an interval of multiple rounds, update the AI model separately trained in each communication domain using global gradient data, which is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in the multiple communication domains.
  • the update module 1103 is used to:
  • multiple devices in a target communication domain exchange the gradient data they each generate, and the target communication domain is one of the multiple communication domains;
  • the target communication domain performs gradient fusion based on the gradient data exchanged between the multiple devices to generate local gradient data corresponding to the target communication domain;
  • the target communication domain uses the local gradient data corresponding to the target communication domain to update the AI model trained by the target communication domain.
  • the update module 1103 is used to:
  • obtain the version number of the activation operation corresponding to the target communication domain and the version number of the interaction operation, where the activation operation is used to trigger the exchange of gradient data between different devices in the target communication domain, and the interaction operation is the operation of exchanging gradient data between different devices in the target communication domain;
  • when the version number of the activation operation is greater than or equal to the version number of the interaction operation, the multiple devices in the target communication domain exchange the gradient data they each generate.
  • the physical connection between multiple devices in the target communication domain is a ring connection.
  • the determining module 1102 is used to:
  • obtain a device topology relationship indicating the connection relationship between the multiple devices used to train the AI model; and divide, according to the device topology relationship, the multiple devices used to train the AI model to obtain the multiple communication domains, wherein the communication rate between different devices in each communication domain is higher than the communication rate between devices in different communication domains.
  • the determining module 1102 is used to:
  • generate a first configuration interface used to present to the user the identifiers of the multiple devices used to train the AI model; and determine, in response to a first configuration operation by the user, the communication domain to which each of the multiple devices used to train the AI model belongs.
  • the determining module 1102 is also used to:
  • generate a second configuration interface before the AI model is trained, where the second configuration interface is used to present multiple interaction strategies to the user, and each of the multiple interaction strategies is used to indicate a manner of exchanging gradient data between the multiple devices in a communication domain;
  • determine, in response to a second configuration operation by the user on the multiple interaction strategies, the manner of exchanging gradient data between the multiple devices in each communication domain.
  • devices in different communication domains are located on the same computing node, or devices in multiple communication domains are located on different computing nodes.
  • the devices in each communication domain include a processor, a chip or a server.
  • the model training device 1100 shown in Figure 11 corresponds to the model training device in the embodiment shown in Figure 3. Therefore, for the specific implementation of each functional module in the model training device 1100 and its technical effects, reference may be made to the relevant descriptions in the foregoing embodiments, which are not repeated here.
  • the computing device 1200 may include a communication interface 1210 and a processor 1220.
  • the computing device 1200 may also include a memory 1230.
  • the memory 1230 may be disposed inside the computing device 1200 or may be disposed outside the computing device 1200 .
  • each action performed by the model training device in the embodiment shown in FIG. 3 can be implemented by the processor 1220.
  • the processor 1220 can obtain the AI model to be trained and the multiple communication domains through the communication interface 1210, and thereby implement the method executed in the embodiment shown in Figure 3.
  • in the implementation process, each step of the processing flow can be completed by an integrated logic circuit of hardware in the processor 1220 or by instructions in the form of software, so as to complete the method executed in Figure 3.
  • the program code executed by the processor 1220 to implement the above method may be stored in the memory 1230 .
  • the memory 1230 is connected to the processor 1220, for example, by a coupling connection.
  • Some features of the embodiments of the present application may be implemented/supported by the processor 1220 executing program instructions or software codes in the memory 1230.
  • the software components loaded in the memory 1230 may be divided functionally or logically.
  • Any communication interface involved in the embodiments of this application may be a circuit, bus, transceiver, or any other device that can be used for information exchange.
  • taking the communication interface 1210 in the computing device 1200 as an example, the other device may be a device connected to the computing device 1200, or the like.
  • the processor involved in the embodiments of this application may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, which may implement or execute each method, step and logical block diagram disclosed in the embodiments of this application.
  • a general-purpose processor may be a microprocessor or any conventional processor, etc.
  • the steps of the methods disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware processor for execution, or can be executed by a combination of hardware and software modules in the processor.
  • the coupling in the embodiments of this application is an indirect coupling or communication connection between devices or modules, which may be in electrical, mechanical or other forms, and is used for information interaction between the devices or modules.
  • the processor may operate in conjunction with the memory.
  • the memory can be a non-volatile memory, such as a hard disk or a solid state drive, or a volatile memory, such as a random access memory.
  • the memory may also be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • the embodiments of the present application do not limit the specific connection medium between the above communication interface, processor and memory.
  • the memory, processor and communication interface can be connected through a bus.
  • the bus can be divided into address bus, data bus, control bus, etc.
  • embodiments of the present application also provide a computer storage medium, which stores a software program.
  • when the software program is read and executed by one or more processors, it can implement the method executed by the model training device provided by any one or more of the above embodiments.
  • the computer storage medium may include: U disk, mobile hard disk, read-only memory, random access memory, magnetic disk or optical disk and other various media that can store program codes.
  • embodiments of the present application may be provided as methods, devices, systems, storage media or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Abstract

Provided is a model training method, comprising: acquiring an AI model to be trained, and determining a plurality of communication domains; and during the process of training the AI model in each round, by using local gradient data corresponding to each communication domain, updating the AI model, wherein the local gradient data corresponding to each communication domain is obtained by performing gradient fusion according to gradient data respectively generated by a plurality of devices in the communication domain, and when the AI model is trained at an interval of a plurality of rounds, updating, by using total gradient data, the AI model respectively trained in each communication domain, wherein the total gradient data is obtained by performing gradient fusion according to the gradient data in the plurality of communication domains. Therefore, since it is only at an interval of a plurality of rounds of training that an AI model is updated by using gradient data generated by all devices training the AI model, the problem of the overall training progress of an AI model being slowed down due to the progress of training the AI model within a period of time in some communication domains being relatively low can be alleviated, such that the overall training efficiency of the AI model can be improved.

Description

Model training method, apparatus and system, and related device
This application claims priority to the Chinese patent application No. 202210760755.6, filed with the China National Intellectual Property Administration on June 29, 2022 and entitled "Deep learning method, apparatus and system", and to the Chinese patent application No. 202211148350.3, filed with the China National Intellectual Property Administration on September 20, 2022 and entitled "Model training method, apparatus and system, and related device", both of which are incorporated herein by reference in their entirety.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a model training method, apparatus and system, and related device.
Background
With the development of artificial intelligence (AI), the scale of AI models is gradually increasing. For example, the Pangu model in the field of natural language processing (NLP) can have as many as 200 billion parameters, and the data volume of its training samples can be as high as 40 terabytes. Correspondingly, larger amounts of model parameters and sample data require higher computing power to train the AI model.
At present, the huge computing power required for AI model training can be provided through distributed training of the AI model. In a specific implementation, the training samples of the AI model can be equally divided into multiple sample subsets, and each sample subset, together with the AI model, is assigned to one device, so that each device iteratively trains the AI model using one sample subset and generates the gradient used to update the AI model. Then, different devices exchange the gradient data they each generate and perform gradient fusion to calculate global gradient data (that is, the data obtained after gradient fusion of the gradient data generated by all devices), and the parameters of the AI model trained by each device are updated based on the global gradient data. In this way, the parameters of the AI model are iteratively updated over multiple rounds, and the training of the AI model is finally completed.
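Written out as a standard data-parallel formulation (not a quotation from this application), the global gradient data described above is a fusion, commonly an average, of the per-device gradients, which is then applied to every replica of the model parameters:

$$ g_{\text{global}} = \frac{1}{N}\sum_{i=1}^{N} g_i, \qquad \theta \leftarrow \theta - \eta\, g_{\text{global}}, $$

where $g_i$ is the gradient computed by device $i$ on its sample subset, $N$ is the number of devices, $\theta$ denotes the AI model parameters and $\eta$ the learning rate.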
In actual application scenarios, different devices usually exchange gradient data in a ring-allreduce manner, that is, the flow of the gradient data exchanged between the different devices forms a ring, and by performing multiple rounds of data exchange and gradient fusion, every device can obtain the global gradient data. However, this way of exchanging gradient data usually results in low training efficiency of the AI model and high consumption of communication resources.
Summary
A model training method, apparatus, system, storage medium and computer program product are provided, so as to improve the training efficiency of an AI model and reduce the communication resources consumed in training the AI model.
In a first aspect, an embodiment of this application provides a model training method, which can be executed by a corresponding model training device. Specifically, the model training device obtains an AI model to be trained and determines multiple communication domains. The AI model may be, for example, an AI model with a large amount of model parameters or sample data, such as the Pangu model. Each determined communication domain includes multiple devices; for example, all devices used to train the AI model may be divided into the multiple communication domains. In each round of distributed training of the AI model using the devices in the multiple communication domains, the model training device updates the AI model trained in each communication domain using the local gradient data corresponding to that communication domain, where the local gradient data corresponding to each communication domain is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in that communication domain. Furthermore, when the AI model is distributedly trained using the devices in the multiple communication domains at an interval of multiple rounds (the interval may be a fixed or a random number of rounds), the model training device updates the AI model separately trained in each communication domain using global gradient data, which is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in the multiple communication domains, so that the global AI model is updated using the global gradient data.
Since the AI model is updated using the gradient data generated by all devices training the AI model only once every multiple rounds of training, while in each intermediate round each communication domain independently updates the AI model it trains using the gradient data generated by the multiple devices within it, the problem that the overall training progress of the AI model is dragged down because some communication domains make slow progress in training the AI model during a period of time can be alleviated, so the overall training efficiency of the AI model can be improved. Moreover, the model training device updates the AI model using the global gradient data every multiple rounds, which ensures that the training effect of the AI model can reach a high level. On this basis, since devices in different communication domains do not need to exchange gradient data during each intermediate round of model training, the communication resources consumed in training the AI model can be effectively reduced.
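As a rough, self-contained sketch of the training schedule just described (local gradient fusion inside each communication domain every round, global fusion only every T rounds), the following Python simulation uses NumPy arrays to stand in for device gradients. All names and the choice of T are illustrative assumptions, not taken from this application.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_DOMAINS = 2          # e.g. communication domain 1 and communication domain 2
DEVICES_PER_DOMAIN = 4   # e.g. four NPUs per domain
PARAM_SIZE = 8           # toy parameter vector
T = 4                    # global gradient fusion every T rounds (interval chosen arbitrarily)
LR = 0.1

# One parameter replica per device; all replicas start identical.
params = np.zeros((NUM_DOMAINS, DEVICES_PER_DOMAIN, PARAM_SIZE))

def backprop(replicas):
    # Stand-in for each device computing a gradient on its own sample subset.
    return rng.normal(size=replicas.shape)

for rnd in range(1, 13):
    grads = backprop(params)                  # shape: (domains, devices, params)
    if rnd % T == 0:
        # Global round: fuse the gradients of every device in every communication domain.
        fused = grads.mean(axis=(0, 1), keepdims=True)
    else:
        # Ordinary round: fuse gradients only inside each communication domain.
        fused = grads.mean(axis=1, keepdims=True)
    params -= LR * fused                      # every replica applies its fused gradient

# Replicas inside one domain always receive the same fused gradient, so they stay identical.
assert np.allclose(params[0, 0], params[0, 1])
```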
In a possible implementation, during each round of training, each of the multiple communication domains can train and update the AI model independently. Taking one target communication domain as an example, in each round of training, the multiple devices in the target communication domain exchange the gradient data they each generate, and gradient fusion is performed based on the gradient data exchanged between the multiple devices to generate local gradient data corresponding to the target communication domain, so that the AI model trained by the target communication domain is updated using the local gradient data corresponding to the target communication domain. Other communication domains can train the AI models they are responsible for in a similar manner. In this way, each communication domain can perform a gradient update of the AI model within a local scope by exchanging gradient data internally, and the efficiency with which different communication domains update the AI model is not affected by other communication domains, which can improve the overall training efficiency of the AI model.
In a possible implementation, when the multiple devices in the target communication domain exchange the gradient data they each generate, the version number of the activation operation corresponding to the target communication domain and the version number of the interaction operation may be obtained, where the activation operation is used to trigger the exchange of gradient data between different devices in the target communication domain, and the interaction operation is the operation of exchanging gradient data between different devices in the target communication domain; when the version number of the activation operation is greater than or equal to the version number of the interaction operation, the multiple devices in the target communication domain exchange the gradient data they each generate. In this way, each communication domain can limit the number of times gradient data is exchanged, so as to prevent some communication domains from performing too many gradient data exchanges and thereby avoid asynchronous conflicts between the multiple communication domains.
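One possible reading of the version-number check described above is a simple counter comparison kept per communication domain; the sketch below is only an interpretation, and every name in it is invented for illustration.

```python
class DomainExchangeGate:
    """Hypothetical gate for one communication domain: gradient exchange runs only when
    the activation counter has caught up with (or passed) the exchange counter."""

    def __init__(self):
        self.activation_version = 0    # incremented each time an exchange is triggered
        self.interaction_version = 0   # incremented each time an exchange actually runs

    def activate(self):
        self.activation_version += 1

    def try_exchange(self, do_exchange):
        # Condition taken from the text: activation version >= interaction version.
        if self.activation_version >= self.interaction_version:
            do_exchange()                      # devices in the domain swap their gradient data
            self.interaction_version += 1
            return True
        return False                           # the domain would otherwise exchange too often


gate = DomainExchangeGate()
print(gate.try_exchange(lambda: None))   # True  (0 >= 0)
print(gate.try_exchange(lambda: None))   # False (0 >= 1 fails) until activate() is called
gate.activate()
print(gate.try_exchange(lambda: None))   # True  (1 >= 1)
```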
In a possible implementation, the physical connection between the multiple devices in the target communication domain is a ring connection, for example, a connection based on an HCCS ring.
In a possible implementation, when determining the multiple communication domains, multiple devices with high affinity can be placed in the same communication domain. In a specific implementation, the model training device can obtain a device topology relationship, which indicates the connection relationship between the multiple devices used to train the AI model, and then divide the multiple devices used to train the AI model according to the device topology relationship to obtain the multiple communication domains, where the communication rate between different devices in each communication domain is higher than the communication rate between devices in different communication domains. In this way, by placing devices with higher communication rates in the same communication domain, the efficiency of exchanging gradient data between different devices in the communication domain during subsequent model training can be improved, which improves the overall training efficiency of the AI model.
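A minimal sketch of this kind of affinity-based partitioning, assuming a hypothetical pairwise bandwidth table (the fast intra-board versus slow inter-board numbers are illustrative): devices joined, directly or transitively, by links at or above a threshold end up in the same communication domain.

```python
def partition_by_affinity(bandwidth, threshold):
    """Group devices into communication domains: two devices share a domain if they are
    connected (directly or transitively) by links whose bandwidth is >= threshold."""
    n = len(bandwidth)
    seen, domains = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, domain = [start], []
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            domain.append(u)
            stack.extend(v for v in range(n)
                         if v != u and v not in seen and bandwidth[u][v] >= threshold)
        domains.append(sorted(domain))
    return domains

# Hypothetical topology: devices 0-3 sit on one board with fast links, 4-7 on another board;
# the two boards talk only over a slower path.
FAST, SLOW = 30, 8
bandwidth = [[FAST if i // 4 == j // 4 else SLOW for j in range(8)] for i in range(8)]

print(partition_by_affinity(bandwidth, threshold=FAST))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
```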
In a possible implementation, the user can configure the communication domain to which each device used to train the AI model belongs. In a specific implementation, the model training device can generate a first configuration interface, which is used to present to the user the identifiers of the multiple devices used to train the AI model, so that the user can configure the communication domain to which each device belongs on the first configuration interface; the model training device can then determine, in response to a first configuration operation by the user, the communication domain to which each of the multiple devices used to train the AI model belongs, thereby obtaining the multiple communication domains. In this way, the user can configure the multiple communication domains, which makes it convenient for the user to intervene in the training of the AI model and achieve a better model training effect.
In a possible implementation, before training the AI model, the model training device can also generate a second configuration interface, which is used to present multiple interaction strategies to the user, where each interaction strategy indicates a manner of exchanging gradient data between the multiple devices in a communication domain, such as an allgather, allreduce, ring-allreduce or halving-doubling allreduce strategy; the model training device then determines, in response to a second configuration operation by the user on the multiple interaction strategies, the manner of exchanging gradient data between the multiple devices in each communication domain. The devices in different communication domains may exchange gradient data using the same interaction strategy, or may use different interaction strategies. In this way, the user can manually configure the interaction strategy within each communication domain so as to intervene in the training of the AI model, for example by configuring the most suitable interaction strategy according to the characteristics of the devices in each communication domain, thereby achieving a better model training effect.
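Purely for illustration, the outcome of such a per-domain strategy choice could be carried as plain configuration data handed to the training job; none of the keys or values below are taken from an actual interface or library.

```python
# Hypothetical result of the user's second configuration operation: one strategy per domain.
interaction_strategies = {
    "communication_domain_1": "ring-allreduce",             # e.g. ring-connected NPUs on one board
    "communication_domain_2": "halving-doubling-allreduce",
}

def exchange_gradients(domain, gradient_tensors, strategies):
    strategy = strategies[domain]
    # A real implementation would dispatch to the collective operator selected for this
    # domain (for example an allreduce variant); here the decision is only reported.
    print(f"{domain}: fusing {len(gradient_tensors)} tensors using {strategy}")

exchange_gradients("communication_domain_1", [0.1, 0.2, 0.3], interaction_strategies)
```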
In a possible implementation, the devices in different communication domains are located on the same computing node, or the devices in the multiple communication domains are located on different computing nodes.
In a possible implementation, the devices in each communication domain include processors, chips or servers, so that distributed training of the AI model can be implemented based on devices of different granularities.
In a second aspect, an embodiment of this application provides a model training apparatus. The apparatus includes: an acquisition module, configured to obtain an AI model to be trained; a determining module, configured to determine multiple communication domains, each of the multiple communication domains including multiple devices; and an update module, configured to: during each round of distributed training of the AI model using the devices in the multiple communication domains, update the AI model trained in each communication domain using the local gradient data corresponding to that communication domain, where the local gradient data corresponding to each communication domain is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in that communication domain; and, when the AI model is distributedly trained using the devices in the multiple communication domains at an interval of multiple rounds, update the AI model separately trained in each communication domain using global gradient data, which is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in the multiple communication domains.
In a possible implementation, the update module is configured to: enable the multiple devices in a target communication domain to exchange the gradient data they each generate, where the target communication domain is one of the multiple communication domains; perform gradient fusion in the target communication domain based on the gradient data exchanged between the multiple devices to generate local gradient data corresponding to the target communication domain; and update, by the target communication domain, the AI model trained by the target communication domain using the local gradient data corresponding to the target communication domain.
In a possible implementation, the update module is configured to: obtain the version number of the activation operation corresponding to the target communication domain and the version number of the interaction operation, where the activation operation is used to trigger the exchange of gradient data between different devices in the target communication domain, and the interaction operation is the operation of exchanging gradient data between different devices in the target communication domain; and, when the version number of the activation operation is greater than or equal to the version number of the interaction operation, enable the multiple devices in the target communication domain to exchange the gradient data they each generate.
In a possible implementation, the physical connection between the multiple devices in the target communication domain is a ring connection.
In a possible implementation, the determining module is configured to: obtain a device topology relationship, which indicates the connection relationship between the multiple devices used to train the AI model; and divide, according to the device topology relationship, the multiple devices used to train the AI model to obtain the multiple communication domains, where the communication rate between different devices in each communication domain is higher than the communication rate between devices in different communication domains.
In a possible implementation, the determining module is configured to: generate a first configuration interface, which is used to present to the user the identifiers of the multiple devices used to train the AI model; and determine, in response to a first configuration operation by the user, the communication domain to which each of the multiple devices used to train the AI model belongs.
In a possible implementation, the determining module is further configured to: before the AI model is trained, generate a second configuration interface, which is used to present multiple interaction strategies to the user, where each of the multiple interaction strategies indicates a manner of exchanging gradient data between the multiple devices in a communication domain; and determine, in response to a second configuration operation by the user on the multiple interaction strategies, the manner of exchanging gradient data between the multiple devices in each communication domain.
In a possible implementation, the devices in different communication domains are located on the same computing node, or the devices in the multiple communication domains are located on different computing nodes.
In a possible implementation, the devices in each communication domain include processors, chips or servers.
Since the model training apparatus provided in the second aspect corresponds to the model training method provided in the first aspect, for the technical effects of the second aspect and its implementations, reference may be made to the technical effects of the corresponding first aspect and its implementations, which are not repeated here.
In a third aspect, an embodiment of this application provides a model training system. The model training system includes multiple devices and is configured to perform the model training method in the first aspect or any implementation of the first aspect.
In a fourth aspect, an embodiment of this application provides a computing device. The computing device includes a processor and a memory; the memory is configured to store instructions, and the processor executes the instructions stored in the memory, so that the computing device performs the model training method in the first aspect or any possible implementation of the first aspect. It should be noted that the memory may be integrated into the processor or may be independent of the processor. The computing device may also include a bus, through which the processor is connected to the memory. The memory may include a readable memory and a random access memory.
In a fifth aspect, an embodiment of this application further provides a computer-readable storage medium, which stores a program or instructions that, when run on at least one computer, cause the at least one computer to perform the model training method in the first aspect or any implementation of the first aspect.
In a sixth aspect, an embodiment of this application further provides a computer program product containing instructions that, when run on at least one computer, cause the at least one computer to perform the model training method in the first aspect or any implementation of the first aspect.
In addition, for the technical effects brought by any implementation of the second to sixth aspects, reference may be made to the technical effects brought by the different implementations of the first aspect, which are not repeated here.
Description of drawings
To describe the technical solutions in the embodiments of this application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the accompanying drawings in the following description are only some embodiments recorded in this application, and a person of ordinary skill in the art may further obtain other drawings based on these accompanying drawings.
Figure 1 is a schematic diagram of four devices exchanging gradient data according to an embodiment of this application;
Figure 2 is a schematic architectural diagram of an exemplary model training system according to an embodiment of this application;
Figure 3 is a schematic flowchart of a model training method according to an embodiment of this application;
Figure 4 is a schematic diagram of the topology between NPU1 to NPU8 used to train an AI model;
Figure 5 is a schematic diagram of an exemplary configuration interface according to an embodiment of this application;
Figure 6 is a schematic diagram of four processors exchanging gradient data;
Figure 7 is a schematic diagram of another exemplary configuration interface according to an embodiment of this application;
Figure 8 is a schematic diagram of processor 2 notifying the remaining processors to exchange gradient data;
Figure 9 is a schematic architectural diagram of an exemplary server according to an embodiment of this application;
Figure 10 is a schematic flowchart of distributed training of the Pangu model according to an embodiment of this application;
Figure 11 is a schematic structural diagram of a model training apparatus according to an embodiment of this application;
Figure 12 is a schematic diagram of the hardware structure of a computing device according to an embodiment of this application.
Detailed description of embodiments
In practical applications, when the number of parameters in an AI model to be trained and the amount of sample data used to train the AI model are large, the limited computing power of a single device may be insufficient to complete the training of the AI model alone. Therefore, the AI model can be trained jointly by combining the computing power of multiple devices through distributed training. The devices used to train the AI model may be processor-level devices, such as neural-network processing units (NPUs) or graphics processing units (GPUs); or chip-level devices, such as multiple chips connected to a host; or server-level devices, such as multiple independent servers. When the multiple devices used to train the AI model are processor-level or chip-level devices, the multiple processors may be located in the same server (which may constitute a computing node) or in different servers. When the multiple devices used to train the AI model are server-level devices, the multiple devices may be located in the same data center (which may be regarded as one computing node), or may be located in different data centers, that is, the AI model may be distributedly trained across data centers.
In the process of iteratively training the AI model, the multiple devices usually exchange, in a ring-allreduce manner, the gradient data generated by each device in each round of training the AI model, and update the parameters of the AI model using the new gradient data generated by performing gradient fusion on the exchanged gradient data.
Taking training an AI model with four devices as an example, as shown for device 1 to device 4 in Figure 1, in each round of iterative training of the AI model, each device trains the AI model using one sample subset and generates corresponding gradient data; device 1 to device 4 then each divide their gradient data into four chunks according to the number of devices. The gradient data of device 1 is divided into chunks a1, b1, c1 and d1; the gradient data of device 2 into chunks a2, b2, c2 and d2; the gradient data of device 3 into chunks a3, b3, c3 and d3; and the gradient data of device 4 into chunks a4, b4, c4 and d4. In the first exchange among device 1 to device 4, device 1 sends chunk a1 to device 2, device 2 sends chunk b2 to device 3, device 3 sends chunk c3 to device 4, and device 4 sends chunk d4 to device 1. Each device then performs gradient fusion between the chunk it received and the corresponding chunk it stores to generate a new chunk of gradient data; for example, device 1 fuses chunk d4 sent by device 4 with its stored chunk d1 to generate a new chunk D1 and overwrites chunk d1 with D1. For example, if chunk d4 is {3,5,4,2,7,2} and chunk d1 is {1,5,6,9,11,21}, then D1 obtained by fusing d4 and d1 is {2,5,5,6,9,12} (the values at corresponding positions are added, the average is calculated, and the average is rounded up). Similarly, device 2 generates a new chunk A2 and overwrites chunk a2 with it; device 3 generates a new chunk B3 and overwrites chunk b3 with it; and device 4 generates a new chunk C4 and overwrites chunk c4 with it.
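The fusion rule used in this example (elementwise sum, halved, rounded up) can be checked directly against the quoted chunks:

$$ D_1 = \left\lceil \frac{d_4 + d_1}{2} \right\rceil = \left\lceil \frac{(3,5,4,2,7,2) + (1,5,6,9,11,21)}{2} \right\rceil = (2,5,5,6,9,12). $$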
Then device 1 to device 4 perform a second exchange: device 1 sends chunk D1 to device 2, device 2 sends chunk A2 to device 3, device 3 sends chunk B3 to device 4, and device 4 sends chunk C4 to device 1; each device again performs gradient fusion between the chunk it stores and the chunk it received, generates a new chunk of gradient data, and replaces the originally stored chunk with it. After multiple exchanges among device 1 to device 4, each device holds one chunk of gradient data that is obtained by gradient fusion of the corresponding chunks of device 1 to device 4, as shown in Figure 1.
Then device 1 to device 4 continue to exchange data, sharing the gradient-fused chunk stored by each device with the other devices, so that each device can obtain the gradient fusion result produced from the gradient data of all devices, as shown in Figure 1, and each device can then update the parameters of the AI model using this gradient fusion result. In this way, the multiple devices complete one round of training of the AI model.
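To make the two phases concrete, the following self-contained Python sketch simulates a ring-allreduce over four devices with four chunks each, mirroring the layout of Figure 1. It reuses the pairwise ceil-average fusion from the worked example above purely for illustration; production implementations typically sum the chunks and divide once at the end, and nothing here is taken from an actual collective-communication library.

```python
import math

def fuse(a, b):
    # Fusion rule from the worked example above: elementwise mean, rounded up.
    return [math.ceil((x + y) / 2) for x, y in zip(a, b)]

def ring_allreduce(chunks):
    """chunks[i][j] = j-th gradient chunk initially held by device i (n devices, n chunks each).
    Returns the per-device chunk lists after the reduce-scatter and allgather phases."""
    n = len(chunks)
    state = [[list(c) for c in device] for device in chunks]

    # Phase 1 (reduce-scatter): at step s, device i sends chunk (i - s) mod n to its right
    # neighbour, which fuses it into its own copy of that chunk.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, state[i][(i - s) % n]) for i in range(n)]
        for i, j, payload in sends:
            state[(i + 1) % n][j] = fuse(state[(i + 1) % n][j], payload)

    # Phase 2 (allgather): each device now owns one fully fused chunk, which is passed
    # around the ring until every device holds every fused chunk.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, state[i][(i + 1 - s) % n]) for i in range(n)]
        for i, j, payload in sends:
            state[(i + 1) % n][j] = list(payload)
    return state

# Four devices, four scalar chunks each, mirroring the a/b/c/d layout of Figure 1.
devices = [[[1], [2], [3], [4]],
           [[5], [6], [7], [8]],
           [[9], [10], [11], [12]],
           [[13], [14], [15], [16]]]
result = ring_allreduce(devices)
print(all(result[i] == result[0] for i in range(len(result))))   # True: all devices agree
```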
Since the speed at which different devices exchange gradient data usually differs, for example, because of load or resource specifications, some devices have a high latency in sending or receiving gradient data, which drags down the overall efficiency of exchanging gradient data among the multiple devices and thus affects the training efficiency of the AI model. In addition, frequent exchange of gradient data between multiple devices also leads to high consumption of the communication resources needed to train the AI model.
On this basis, an embodiment of this application provides a model training method, which can be executed by a corresponding model training device and is used to improve the training efficiency of the AI model. In a specific implementation, the model training device obtains an AI model to be trained and determines multiple communication domains, each of which includes multiple devices; in each round of training the AI model using the multiple communication domains, the model training device updates the AI model trained in each communication domain using the local gradient data corresponding to that communication domain, where the local gradient data corresponding to each communication domain is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in that communication domain; and, every multiple rounds of training the AI model using the multiple communication domains, the model training device updates the AI model separately trained in each communication domain using global gradient data, which is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in the multiple communication domains.
Since the AI model is updated using the gradient data generated by all devices training the AI model only once every multiple rounds of training, while in each intermediate round each communication domain independently updates the AI model it trains using the gradient data generated by the multiple devices within it, the problem that the overall training progress of the AI model is dragged down because some communication domains make slow progress in training the AI model during a period of time can be alleviated, so the overall training efficiency of the AI model can be improved.
For ease of understanding, still taking the iterative training of the AI model by device 1 to device 4 as an example, the model training device may place device 1 and device 2 in communication domain 1, and device 3 and device 4 in communication domain 2. In each round of training the AI model, device 1 and device 2 in communication domain 1 each train the AI model and obtain corresponding gradient data; the model training device performs gradient fusion on the gradient data within communication domain 1 and updates the AI model trained by device 1 and device 2 using the generated local gradient data. At the same time, during that round of training, the model training device also updates the AI model trained by device 3 and device 4 using the local gradient data generated in communication domain 2.
Assume that in the first round of training the AI model, communication domain 1 takes 40 seconds to complete AI model training and updating, and communication domain 2 takes 60 seconds; in the second round, communication domain 1 takes 55 seconds and communication domain 2 takes 40 seconds; and the AI model is then updated globally based on the local gradient data generated by the two communication domains in the second round, which is assumed to take 10 seconds. Since communication domain 1 and communication domain 2 are independent of each other during the two rounds of model training, and the model training of communication domain 1 takes 95 seconds (that is, 40 seconds + 55 seconds) while that of communication domain 2 takes 100 seconds (60 seconds + 40 seconds), the overall training time of the AI model is 110 seconds (that is, 100 seconds + 10 seconds), which is less than the 125 seconds (that is, 60 seconds + 55 seconds + 10 seconds) required by the existing ring-allreduce-based way of training the AI model, so the overall training efficiency of the AI model can be improved.
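The comparison in this example can be summarised as follows, using the per-round figures assumed above:

$$ T_{\text{grouped}} = \max(40+55,\; 60+40) + 10 = 110\ \text{s} \;<\; T_{\text{per-round global}} = \max(40,60) + \max(55,40) + 10 = 60 + 55 + 10 = 125\ \text{s}. $$

In the per-round-global scheme every round waits for the slower communication domain, whereas the grouped scheme pays the cross-domain synchronisation cost only once over these rounds.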
Moreover, the model training device updates the AI model using the global gradient data every multiple rounds, which ensures that the training effect of the AI model can reach a high level. On this basis, since devices in different communication domains do not need to exchange gradient data during each intermediate round of model training, the communication resources consumed in training the AI model can be effectively reduced.
For example, the above model training device for executing the model training method may be deployed in the system architecture shown in Figure 2. The system architecture shown in Figure 2 may include a deep learning framework 201, a computing architecture 202, firmware and drivers 203, and a hardware layer 204.
The deep learning framework 201 can integrate resources such as development components and pre-trained models, shield users from the perception of the underlying complex hardware, and provide users with services for rapidly developing AI models. For example, the deep learning framework 201 may be the TensorFlow framework, the PyTorch framework, the MindSpore framework, or another type of deep learning framework, which is not limited here.
The computing architecture 202 is used to provide an open programming interface, support users in quickly building AI applications and services based on AI models, and invoke the multiple processors in the hardware layer 204 to achieve parallelized training of AI models. Further, the computing architecture 202 can also implement functions such as graph-level and operator-level compilation optimization and automatic tuning of AI models. For example, the computing architecture 202 may be a compute architecture for neural networks (CANN) or another applicable architecture.
The firmware and drivers 203 are used to respond to the invocation of the hardware layer 204 by the computing architecture 202 and use the multiple processors in the hardware layer 204 to perform corresponding data processing operations, such as using the multiple processors in the hardware layer 204 to train AI models in parallel.
The hardware layer 204 includes multiple processors, such as processor 1 to processor 8 in Figure 2, and also includes other components such as memory and network cards (not shown in Figure 2), and is used to provide data processing capabilities for the upper layers and support upper-layer services. For example, the processors included in the hardware layer 204 may include one or more of a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU) and a data processing unit (DPU), or may include other types of processors, which is not limited here.
The system architecture shown in Figure 2 is only an example. In practical applications, the model training device may also be deployed in other types of system architectures to implement distributed training of AI models. For example, in other possible system architectures, the hardware layer 204 may include multiple servers, that is, the AI model may be distributedly trained at server granularity.
To make the above objectives, features and advantages of this application clearer and easier to understand, various non-limiting implementations in the embodiments of this application are described below by way of example with reference to the accompanying drawings. Obviously, the described embodiments are some rather than all of the embodiments of this application. Based on the embodiments in this application, all other embodiments obtained based on the above content fall within the protection scope of this application.
Figure 3 is a schematic flowchart of a model training method according to an embodiment of this application. The method can be applied to the system architecture shown in Figure 2, and in practical applications it can also be applied to other applicable system architectures. For ease of understanding and description, its application to the system architecture shown in Figure 2 is taken as an example below. The method may specifically include the following steps.
S301: Obtain the AI model to be trained.
In an actual application scenario, when a user develops an AI application on the deep learning framework 201, the user can provide the deep learning framework 201 with the AI model used to implement the AI application, so that the deep learning framework 201 can provide the AI model to the model training device in the computing architecture 202 to trigger the training of the AI model.
As one implementation example, the user can write a training script (an executable file written in a specific format) on the deep learning framework 201, and the training script can integrate the file of the AI model built by the user on the deep learning framework 201. The deep learning framework 201 can then provide the training script to the model training device in the computing architecture 202, so that the model training device can parse the AI model from the training script and perform distributed training on the AI model according to the model training logic indicated by the training script.
As another implementation example, the deep learning framework 201 can provide the user with a configuration interface that presents multiple AI models whose construction has been completed, so that the deep learning framework 201 can determine, according to the user's selection operation on an AI model, the AI model to be trained selected by the user. Further, the configuration interface can also present multiple deep learning algorithms that can be used to train AI models, so that the user can select a deep learning algorithm on the configuration interface and configure corresponding parameters, such as the learning rate and the loss function, based on the selected deep learning algorithm. The deep learning framework 201 can then provide the AI model, the deep learning algorithm and the configured parameters selected by the user to the model training device, so that the model training device performs distributed training on the AI model based on the deep learning algorithm and the configured parameters. Of course, the model training device can also obtain the AI model to be trained in other ways, which is not limited in this embodiment.
S302: Determine multiple communication domains, where each of the multiple communication domains includes multiple devices.
In this embodiment, the model training apparatus may train the AI model using N processors in the hardware layer 204, where N is a positive integer (for example, 8 or 16). Moreover, before training the AI model, the model training apparatus may first divide the N processors used to train the AI model into multiple sets, each set including at least two processors, and the processors in each set may form one communication domain. For example, the model training apparatus may divide 8 processors into 2 communication domains, each communication domain including 4 processors.
During training of the AI model, processors in different communication domains can train the AI model independently. For example, after the processors in communication domain 1 complete one round of training of the AI model, they can directly execute the next round of training without waiting for the processors in communication domain 2 to also complete that round. Moreover, the processors within each communication domain can exchange the gradient data generated in each round of training the AI model by means such as allreduce or ring-allreduce.
For ease of understanding, this embodiment provides the following two implementation examples of determining the multiple communication domains:
In the first implementation example, the model training apparatus may classify multiple devices with higher mutual affinity into the same communication domain according to the affinity between devices. In a specific implementation, the model training apparatus may obtain the device topology relationship between the N processors in the hardware layer 204, where the device topology relationship indicates the connection relationship between the N processors, so that the model training apparatus can divide the N processors used to train the AI model according to the device topology relationship to obtain multiple communication domains, where the communication rate between different processors in each communication domain is higher than the communication rate between processors in different communication domains. In actual applications, the model training apparatus may, according to the device topology relationship, classify multiple processors whose physical connection forms a ring into one communication domain, thereby obtaining the multiple communication domains. Exemplarily, the processors in the same communication domain may be physically connected based on a Huawei cache coherence system (HCCS) ring connection.
For example, assume that the N processors used to train the AI model include NPU1 to NPU8 as shown in Figure 4, where NPU1 to NPU4 are connected in full mesh mode, NPU5 to NPU8 are also connected in full mesh mode, and NPU1 and NPU5 are connected through a CPU. Then, according to the topology among NPU1 to NPU8, the model training apparatus may determine to classify NPU1 to NPU4 into communication domain 1 and NPU5 to NPU8 into communication domain 2. Normally, the communication rate among NPU1 to NPU4 is higher than the communication rate between NPUs across communication domains.
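As a sketch of how such a topology-based division could be expressed in code, the snippet below groups devices whose pairwise links are high-bandwidth (for example, full-mesh or ring connected) into one communication domain. The device names and the `intra_board_links` variable are hypothetical placeholders rather than data from this application; this only illustrates the grouping logic, assuming the topology is available as a list of fast links.

```python
from collections import defaultdict

def build_communication_domains(devices, fast_links):
    """Group devices into communication domains using union-find over
    high-bandwidth links (e.g. HCCS / full-mesh connections).  Devices that
    are only reachable over slower paths (e.g. via the CPU) end up in
    different domains."""
    parent = {d: d for d in devices}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path compression
            d = parent[d]
        return d

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in fast_links:      # only fast links merge devices
        union(a, b)

    domains = defaultdict(list)
    for d in devices:
        domains[find(d)].append(d)
    return list(domains.values())

# Hypothetical topology matching the Figure 4 example: NPU1-4 and NPU5-8
# are each full-mesh connected, and the two boards only meet via the CPU.
npus = [f"NPU{i}" for i in range(1, 9)]
intra_board_links = [(f"NPU{i}", f"NPU{j}")
                     for group in ([1, 2, 3, 4], [5, 6, 7, 8])
                     for i in group for j in group if i < j]

print(build_communication_domains(npus, intra_board_links))
# -> [['NPU1', 'NPU2', 'NPU3', 'NPU4'], ['NPU5', 'NPU6', 'NPU7', 'NPU8']]
```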
In the second implementation example, the model training apparatus may generate a configuration interface, for example the one shown in Figure 5, which includes the identifiers (such as processor names) of M processors available for training the AI model, where M is a positive integer greater than or equal to N. The model training apparatus can then present the configuration interface to the user through the deep learning framework 201, so that the user can select, from the M presented processors, the N processors to be used for training the AI model this time, and further configure, for each selected processor, the communication domain to which it belongs. Correspondingly, the model training apparatus may execute an initialization procedure for the communication domains: specifically, in response to the user's configuration operation, it may determine the communication domain to which each of the N processors used to train the AI model belongs, thereby obtaining multiple communication domains, and determine the scale of each communication domain. The number of processors included in each communication domain may be the same or different.
For example, assume that the configuration interface shown in Figure 5 presents 16 processors for the user to select, and that, based on the user's selection operation, processor 1 to processor 8 are selected to train the AI model. The user can then create two communication domains on the configuration interface, namely communication domain 1 and communication domain 2, and specify on the interface the communication domain to which each of processor 1 to processor 8 belongs. In this way, the model training apparatus can determine the processors included in each communication domain according to the user's configuration, thereby obtaining the multiple communication domains.
Of course, the above implementations of determining the communication domains are only examples. In actual applications, the model training apparatus may also determine the multiple communication domains in other ways; for example, after determining the multiple processors selected by the user, it may classify the processors located in the same server into one communication domain, which is not limited in this embodiment.
S303: In each round of training the AI model using the multiple communication domains, update the AI model trained in each communication domain using the local gradient data corresponding to that communication domain, where the local gradient data corresponding to each communication domain is obtained by performing gradient fusion on the gradient data respectively generated by the multiple processors in that communication domain.
After determining the multiple communication domains, the model training apparatus may use the processors in the multiple communication domains to perform distributed training on the AI model.
In a specific implementation, the model training apparatus may allocate, to each processor, the AI model and a training sample subset used to train it, where different processors are allocated the same AI model but different training sample subsets, and each training sample subset includes at least one training sample. In each round of training, each processor trains the AI model using its allocated training sample subset and generates gradient data based on the difference between the inference result of the AI model on the training samples and the actual result; the gradient data is used to perform gradient updates on the parameters of the AI model. Since each processor trains the AI model based on only part of the training samples (that is, one training sample subset), different processors can exchange the gradient data they respectively generate and perform gradient fusion, and then update the parameters of the AI model on each processor based on the fused result, thereby achieving the effect of training the AI model with multiple training sample subsets.
In this embodiment, during each round of model training, gradient data is not exchanged among all processors; each processor exchanges gradient data only within the communication domain to which it belongs, and gradient fusion and model parameter updating are performed there, so that the model training processes in different communication domains do not interfere with each other. Taking training the AI model with NPU1 to NPU8 shown in Figure 4 as an example, in each round of model training, NPU1 to NPU4 exchange gradient data only within communication domain 1 and do not exchange gradient data with NPU5 to NPU8 in communication domain 2. Similarly, NPU5 to NPU8 exchange gradient data only within communication domain 2 and do not exchange gradient data with the NPUs in communication domain 1. In this way, after completing gradient data exchange, gradient fusion and model parameter updating, each communication domain can directly execute the next round of model training without waiting for the other communication domains to complete their round of training.
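The per-domain exchange described above maps naturally onto the process-group abstraction of common deep learning frameworks. The following minimal sketch uses PyTorch's `torch.distributed` as a stand-in (this application does not prescribe a particular framework or collective library); the group layout and the averaging by group size are illustrative choices, not requirements of the method.

```python
import torch
import torch.distributed as dist

def setup_domains(domain_layout):
    """Create one process group per communication domain.
    domain_layout: list of rank lists, e.g. [[0, 1, 2, 3], [4, 5, 6, 7]].
    Every rank must call new_group() for every domain, but only keeps
    the group it belongs to."""
    my_rank = dist.get_rank()
    my_group, my_ranks = None, None
    for ranks in domain_layout:
        group = dist.new_group(ranks=ranks)
        if my_rank in ranks:
            my_group, my_ranks = group, ranks
    return my_group, my_ranks

def local_gradient_fusion(model, group, group_size):
    """One round of intra-domain gradient fusion: average the gradients
    of the devices in this communication domain only."""
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=group)
            param.grad /= group_size
```

In each intervening training round, a worker would call `local_gradient_fusion` after its backward pass and then step its optimizer, so that only the devices inside its own communication domain synchronize.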
The multiple processors within each communication domain can exchange gradient data based on any strategy. For ease of understanding and description, one of the multiple communication domains (hereinafter referred to as the target communication domain) is taken below as an example; processors in the remaining communication domains can exchange data by following a similar process. The new gradient data generated within each communication domain by performing gradient fusion on the gradient data of all its processors is referred to as local gradient data. Exemplarily, the multiple processors in the target communication domain can exchange data based on the following implementations.
In the first implementation example, the multiple processors in the target communication domain can exchange gradient data based on any one of the allgather, allreduce, ring-allreduce and halving-doubling allreduce strategies.
Taking the exchange of gradient data based on the allreduce strategy as an example, assume that the target communication domain includes 4 processors, namely processor 1, processor 2, processor 3 and processor 4, and that the gradient data on these four processors are gradient data a, gradient data b, gradient data c and gradient data d respectively, as shown in Figure 6. In the first exchange, processor 1 exchanges gradient data with processor 2 while processor 3 exchanges gradient data with processor 4. At this point, processor 1 and processor 2 can generate gradient data M by performing gradient fusion on gradient data a and gradient data b, and processor 3 and processor 4 can generate gradient data N by performing gradient fusion on gradient data c and gradient data d. In the second exchange, processor 1 exchanges gradient data with processor 3: specifically, processor 1 sends gradient data M to processor 3 and processor 3 sends gradient data N to processor 1. At the same time, processor 2 exchanges gradient data with processor 4. Processor 1 and processor 3 can then generate gradient data X by performing gradient fusion on gradient data M and gradient data N, where gradient data X is the data generated by gradient fusion of gradient data a, gradient data b, gradient data c and gradient data d. Likewise, processor 2 and processor 4 can also generate gradient data X by fusing gradient data M and gradient data N. In this way, after two exchanges, each processor obtains the gradient data X generated by gradient fusion of the gradient data of all processors in the target communication domain.
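Purely to illustrate the two-exchange pattern above, the following self-contained sketch simulates the four processors with plain Python lists and uses addition as the fusion operation; the concrete gradient values are made up for the example, and a real implementation may fuse gradients differently (for example by averaging).

```python
# Simulated gradients a, b, c, d held by processors 1-4 (toy values).
grads = {1: [1.0, 2.0], 2: [3.0, 4.0], 3: [5.0, 6.0], 4: [7.0, 8.0]}

def fuse(x, y):
    """Gradient fusion, here element-wise summation."""
    return [xi + yi for xi, yi in zip(x, y)]

# Exchange 1: (1,2) and (3,4) swap and fuse -> M and N.
m = fuse(grads[1], grads[2])          # gradient data M
n = fuse(grads[3], grads[4])          # gradient data N
grads[1] = grads[2] = m
grads[3] = grads[4] = n

# Exchange 2: (1,3) and (2,4) swap and fuse -> X on every processor.
x = fuse(m, n)                        # gradient data X
for p in grads:
    grads[p] = x

print(grads)  # every processor now holds X = a + b + c + d
```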
Further, the strategy for exchanging gradient data adopted within each communication domain can be configured by the user. For example, the model training apparatus may present, through the deep learning framework 201, the configuration interface shown in Figure 7 to the user, so that the user can configure an exchange strategy for each communication domain on that interface. Specifically, as shown in Figure 7, the configuration interface may provide multiple candidate exchange strategies for each communication domain, such as allgather, allreduce, ring-allreduce and halving-doubling allreduce, so that the user can configure one exchange strategy for each communication domain from the candidates; the exchange strategies adopted by different communication domains may be the same or different, which is not limited in this embodiment.
In the second implementation, the speeds at which different processors in the target communication domain train the AI model may differ. When the processors in the target communication domain exchange gradient data, a processor that has completed its AI model training can start exchanging gradient data first, without waiting for all the remaining processors in the target communication domain to also finish training, which improves the efficiency of exchanging gradient data among the multiple processors in the target communication domain.
Still taking a target communication domain with 4 processors as an example, processor 1 to processor 4 train the AI model in parallel. Assume that processor 2 is the first in the target communication domain to complete training of the AI model; processor 2 can then generate an activation message and use it to notify processor 1, processor 3 and processor 4 to start exchanging gradient data. In actual applications, based on the physical connections and communication rules between the processors, processor 2 may first send an activation message to processor 1 to notify it to start exchanging gradient data, then send an activation message to processor 4, and processor 1 may relay an activation message to processor 3, as shown in Figure 8. In this way, if processor 1 is the second to complete AI model training, processor 2 and processor 1 can directly exchange gradient data (and perform gradient data fusion). Then, if processor 3 is the third to complete training, processor 2 can exchange gradient data with processor 3; and when processor 4 also completes training, processor 2 exchanges gradient data with processor 4. In this way, processor 2 can obtain the gradient data generated by all processors in the target communication domain. Finally, processor 2 can send the local gradient data generated from the gradient data of the 4 processors to the remaining processors, so that the local gradient data can be used to update the parameters of the AI model on each processor, as shown in Figure 8.
Further, since each communication domain independently performs AI model training and gradient data fusion during the multiple rounds of training, each communication domain can limit the number of gradient data exchanges it performs, so as to prevent some communication domains from executing too many exchanges and thereby avoid asynchronous conflicts between the multiple communication domains. In a specific implementation, the processor in the target communication domain that finishes AI model training first can generate an activation message, which notifies the remaining processors in the target communication domain to start exchanging gradient data and which carries the version number of an activation operation, where the activation operation is used to trigger the exchange of gradient data between different processors in the target communication domain. In addition, that processor can obtain the version number of the currently executed exchange operation, where the exchange operation is the operation of exchanging gradient data between different devices in the target communication domain, and then compare the version number of the activation operation with the version number of the exchange operation. When the version number of the activation operation is greater than or equal to the version number of the exchange operation, the processor starts exchanging gradient data with the other processors; otherwise, the gradient data exchange is not performed among the multiple processors.
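One possible reading of this version-number check is sketched below; the field names, the way the versions are stored and the decision to simply skip the exchange are assumptions made for illustration, not details fixed by this application.

```python
class DomainSyncState:
    """Tracks how many activation and exchange operations a communication
    domain has seen, so that a fast domain cannot be driven into extra,
    conflicting exchange rounds."""
    def __init__(self):
        self.activation_version = 0   # version carried by activation messages
        self.exchange_version = 0     # version of the last completed exchange

    def on_training_round_finished(self):
        # The fastest device raises the activation version and notifies peers.
        self.activation_version += 1
        return {"type": "activate", "version": self.activation_version}

    def may_exchange(self):
        # Exchange only if the activation is not older than the last exchange.
        return self.activation_version >= self.exchange_version

    def on_exchange_done(self):
        self.exchange_version += 1

state = DomainSyncState()
msg = state.on_training_round_finished()
if state.may_exchange():
    # ... exchange gradient data with the other devices in the domain ...
    state.on_exchange_done()
```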
In the third implementation example, the multiple processors in the target communication domain can exchange gradient data through shared memory. In a specific implementation, the multiple processors in the target communication domain can be configured with a shared memory that all of them can access. In this way, after each processor in the target communication domain completes one round of training of the AI model and generates gradient data, it can write the gradient data to a designated area in the shared memory, so that the shared memory stores the gradient data respectively generated by the multiple processors. Each processor can then read, from the shared memory, the gradient data generated by all processors in the target communication domain and obtain the local gradient data by performing gradient fusion on that gradient data.
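As a rough illustration of this shared-memory variant, the sketch below uses Python's multiprocessing.shared_memory module and NumPy to give each device a fixed slot in a shared buffer; the slot layout, sizes and the averaging step are assumptions for the example only, not details prescribed here.

```python
import numpy as np
from multiprocessing import shared_memory

GRAD_SIZE = 4          # number of gradient elements per device (toy value)
NUM_DEVICES = 4        # devices in the target communication domain

# One device (or the host) creates the shared region once.
shm = shared_memory.SharedMemory(create=True, name="domain_grads",
                                 size=NUM_DEVICES * GRAD_SIZE * 8)
table = np.ndarray((NUM_DEVICES, GRAD_SIZE), dtype=np.float64, buffer=shm.buf)

def publish_gradient(device_idx, grad):
    """Each device writes its gradient into its designated slot."""
    table[device_idx, :] = grad

def fuse_local_gradient():
    """Any device can read all slots and fuse them (here: averaging)."""
    return table.mean(axis=0)

# Toy usage: every device publishes, then reads the fused result.
for i in range(NUM_DEVICES):
    publish_gradient(i, np.full(GRAD_SIZE, float(i + 1)))
print(fuse_local_gradient())   # -> [2.5 2.5 2.5 2.5]

del table                      # drop the NumPy view before releasing the region
shm.close()
shm.unlink()
```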
It should be noted that, in each round of training the AI model, the multiple processors in each communication domain can exchange gradient data and generate local gradient data in the manner described above. Moreover, the above implementations of exchanging gradient data within a communication domain are only examples; in other embodiments, the multiple devices in each communication domain may also exchange gradient data in other ways, which is not limited in this embodiment.
S304: Every time multiple rounds of distributed training of the AI model by the devices in the multiple communication domains have elapsed, update the AI model trained in each communication domain using global gradient data, where the global gradient data is obtained by performing gradient fusion on the gradient data respectively generated by the multiple processors in all communication domains.
Since each communication domain trains the AI model using only part of the training sample subsets, the inference performance (such as inference accuracy) of the AI model trained by a single communication domain usually has difficulty reaching the inference performance of an AI model trained on the full set of training samples. Therefore, in this embodiment, after multiple rounds of training of the AI model have been completed, gradient data can be exchanged between the multiple communication domains so that the AI model on each processor is updated based on the gradient data generated by all processors. Specifically, gradient fusion may be performed on the gradient data generated by all processors, and the new gradient data generated by this fusion (hereinafter referred to as global gradient data) may be used to update the parameters of the AI model on each processor. In this way, the inference performance of the finally trained AI model can usually reach the inference performance of the AI model trained on the full set of training samples.
In a possible implementation, each communication domain trains the AI model independently, and in each round the processors in each communication domain can count the current number of iterations of the AI model. If the current number of iterations is an integer multiple of a value T, then not only do the multiple processors in the communication domain exchange gradient data in the manner described above and generate the local gradient data corresponding to the communication domain through gradient fusion, but the communication domain also exchanges local gradient data with the other communication domains, so that each communication domain can obtain the local gradient data respectively generated by all communication domains. In this way, the global gradient data can be obtained by performing gradient fusion on the local gradient data generated by all communication domains, and the global gradient data is used to update the parameters of the AI model in each communication domain.
The way in which local gradient data is exchanged between multiple communication domains is similar to the way in which gradient data is exchanged between the multiple processors in each communication domain. For example, the multiple communication domains can exchange local gradient data based on any one of the allgather, allreduce, ring-allreduce and halving-doubling allreduce strategies; or the multiple communication domains can exchange local gradient data in the order in which they complete the (m*T)-th round of model training (m being a positive integer); or the multiple communication domains can exchange local gradient data based on a shared storage area, which is not limited in this embodiment.
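Continuing the torch.distributed sketch from step S303, the training-loop skeleton below shows one way the per-round local fusion and the periodic global fusion could be combined; the interval T, the averaging choices and the loop structure are illustrative assumptions rather than mandated details.

```python
import torch.distributed as dist

def train(model, optimizer, data_loader, domain_group, domain_size, T):
    """Distributed training loop: fuse gradients inside the communication
    domain every round, and across all domains every T rounds."""
    world_size = dist.get_world_size()
    for step, (inputs, labels) in enumerate(data_loader, start=1):
        optimizer.zero_grad()
        loss = model(inputs, labels)          # assumes the model returns a loss
        loss.backward()

        global_round = (step % T == 0)
        for param in model.parameters():
            if param.grad is None:
                continue
            if global_round:
                # S304: fuse across all devices in all communication domains.
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                param.grad /= world_size
            else:
                # S303: fuse only inside this device's communication domain.
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM,
                                group=domain_group)
                param.grad /= domain_size

        optimizer.step()
```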
In the above implementation, local gradient data is exchanged between different communication domains; in another possible implementation, the multiple communication domains can directly exchange the gradient data respectively generated by the individual processors.
For example, when the number of times the processors in each communication domain have iteratively trained the AI model is an integer multiple of the value T, one processor in each communication domain can aggregate the gradient data respectively generated by the processors in that communication domain to obtain a gradient data set corresponding to the communication domain, where the gradient data set includes the gradient data respectively generated by all processors in that communication domain; the processors responsible for aggregating gradient data in the multiple communication domains can then exchange their respective gradient data sets. Alternatively, when the number of times the processors in each communication domain have iteratively trained the AI model is an integer multiple of the value T, all processors participating in training the AI model can directly exchange the gradient data they respectively generate. In either case, each communication domain can obtain the gradient data respectively generated by the processors in all communication domains, so that the global gradient data can be obtained by performing gradient fusion on all the gradient data and used to update the parameters of the AI model in each communication domain.
In the above implementations, the multiple communication domains exchanging gradient data (or local gradient data) every (T-1) rounds is used as an example. In other embodiments, the number of model training rounds between successive exchanges among the multiple communication domains may not be a fixed value. For example, in the process of distributed training of the AI model, when each communication domain has iteratively trained the AI model 1000 times, the multiple communication domains exchange gradient data (or local gradient data) for the first time, with an interval of 1000 training rounds; then, when each communication domain has iteratively trained the AI model 1900 times, they exchange gradient data (or local gradient data) for the second time, with an interval of 900 training rounds; when each communication domain has reached 2700 iterations, they exchange for the third time, with an interval of 800 training rounds; when each communication domain has reached 3400 iterations, they exchange for the fourth time, with an interval of 700 training rounds; and so on.
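The decreasing synchronization interval in this example (1000, 900, 800, 700, ...) can be captured by a small schedule helper such as the one below; the concrete numbers and the decision to stop shrinking at some minimum interval are assumptions used only to mirror the example.

```python
def global_sync_steps(first_interval=1000, decrement=100, min_interval=100):
    """Yield the iteration counts at which all communication domains
    exchange gradient data, with the gap shrinking over time
    (1000, 1900, 2700, 3400, ... for the example values)."""
    step, interval = 0, first_interval
    while True:
        step += interval
        yield step
        interval = max(interval - decrement, min_interval)

schedule = global_sync_steps()
print([next(schedule) for _ in range(4)])   # -> [1000, 1900, 2700, 3400]
```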
It should be noted that, in this embodiment, the devices in the communication domains are described as processors for illustrative purposes; in other embodiments, the devices in the communication domains may also be chips or servers, and the specific implementation of distributed training of the AI model by such devices can be understood with reference to the relevant description of this embodiment and is not repeated here.
In this embodiment, the gradient data generated by all processors training the AI model is used to update the AI model only once every multiple rounds of model training, while in each intervening round each communication domain independently updates the AI model it trains using only the gradient data generated by its own processors. This alleviates the impact on the overall training progress of the AI model caused by slow training progress in some communication domains, that is, it improves the overall training efficiency of the AI model. For example, if the progress of communication domain 1 is delayed by 3 seconds in round 1 and the progress of communication domain 2 is delayed by 5 seconds in round 2, the overall progress is not delayed by 8 seconds but only by the single slowest delay, namely 5 seconds. Moreover, updating the AI model with global gradient data every multiple rounds of model training ensures that the training effect of the AI model reaches a relatively high level. On this basis, since the processors in different communication domains do not need to exchange gradient data in the intervening rounds, the communication resources consumed in training the AI model can be effectively reduced.
The specific implementation of distributed training of the AI model is described below with reference to a specific application scenario. In this application scenario, the system architecture described in Figure 1 can be deployed in a server that includes 4 CPUs and can be externally connected to 8 NPU chips, as shown in Figure 9, so that NPU1 to NPU8 in the server can be used to implement distributed training of the Pangu model (a type of AI model). In other embodiments, distributed training of the Pangu model may also be implemented based on NPU chips in multiple servers; the training method is similar to distributed training of the Pangu model using multiple NPUs in one server and can be understood by reference.
In the server shown in Figure 9, each CPU can support 8 double data rate 4 dual inline memory modules (DDR4 DIMMs), and CPU1 to CPU4 can be fully interconnected (full mesh). The CPUs in the server can provide a bandwidth capability of 90 GB/s (gigabytes per second), where each CPU can provide a unidirectional bandwidth of 30 GB/s and a bidirectional bandwidth of 60 GB/s.
Among the 8 NPU chips externally connected to the server, NPU1 to NPU4 can be fully interconnected and located on one NPU board, and NPU5 to NPU8 can be fully interconnected and located on another NPU board. Moreover, there are connections between the 8 externally connected NPU chips and the CPUs, for example based on a peripheral component interconnect express (PCIE) bus (only part of the connections between the NPUs and the CPUs are shown in Figure 9), so that NPU1 to NPU4 can exchange data with NPU5 to NPU8 through the CPUs in the server. Each NPU board can provide a bandwidth capability of 90 GB/s, where each NPU can provide a unidirectional bandwidth of 30 GB/s and a bidirectional bandwidth of 60 GB/s.
Based on the server shown in Figure 9, distributed training of the Pangu model can be implemented; the distributed training process is shown in Figure 10. The user can provide a training script to the server, where the training script can include the file of the Pangu model, specify that NPU1 to NPU8 are used to train the Pangu model, and define that NPU1 to NPU4 belong to communication domain 1 and NPU5 to NPU8 belong to communication domain 2.
In this way, the CPU on the host side of the server can parse the Pangu model to be trained from the training script, and determine the multiple NPUs used for this distributed training of the Pangu model and the communication domain to which each NPU belongs.
Then, the CPU can extract a computation graph according to the training script, where the computation graph includes multiple nodes connected by edges. The nodes in the computation graph indicate the computations defined in the training script, and the edges between the nodes indicate the dependencies between different computations. The extracted computation graph can be saved to a trans-flash card.
Next, the CPU can compile the computation graph in the flash card to generate an intermediate representation (IR) and provide the IR to a compiler. The compiler can define one or more operator libraries, such as the neural network (NN) operator library and the Huawei collective communication library (HCCL) operator library shown in Figure 10. Exemplarily, the NN operator library may include convolution layer operators, pooling layer operators, loss functions and the like; the HCCL operator library may include operators used to define data communication methods, such as the allreduce operator and the allgather operator.
In this way, the CPU can use the compiler to determine the operators that need to be executed in sequence for distributed training of the Pangu model, generate corresponding device instructions accordingly, and deliver the device instructions to the NPUs on the device side.
NPU1 to NPU8 on the device side can, based on the device instructions delivered by the host side, cyclically execute the corresponding operators and perform gradient updates on the Pangu model until the iteration termination condition is met, thereby implementing distributed training of the Pangu model. In the process of distributed training of the Pangu model, NPU1 to NPU4 in communication domain 1 and NPU5 to NPU8 in communication domain 2 each train the Pangu model separately, and communication domain 1 and communication domain 2 exchange the gradient data generated by training the Pangu model every (T-1) rounds of model training, so as to perform a global gradient update on the Pangu model. For the specific training process, refer to the relevant description of the embodiment shown in Figure 3 above, which is not repeated here.
Finally, after completing the distributed training of the Pangu model, the device side can send the training result to the host side, where the training result may include, for example, the trained Pangu model and attribute information of the Pangu model (such as inference accuracy).
The model training method provided by the present application has been described in detail above with reference to Figures 1 to 10. The model training apparatus and the computing device provided by the present application are described below with reference to Figures 11 and 12.
Based on the same inventive concept as the above method, an embodiment of the present application further provides a model training apparatus. Figure 11 is a schematic structural diagram of a model training apparatus provided by an embodiment of the present application. The model training apparatus 1100 shown in Figure 11 may, for example, be the model training apparatus mentioned in the embodiment shown in Figure 3 above. As shown in Figure 11, the model training apparatus 1100 includes:
an acquisition module 1101, configured to acquire the AI model to be trained;
a determination module 1102, configured to determine multiple communication domains, where each of the multiple communication domains includes multiple devices;
an update module 1103, configured to: in each round of distributed training of the AI model using the devices in the multiple communication domains, update the AI model trained in each communication domain using the local gradient data corresponding to that communication domain, where the local gradient data corresponding to each communication domain is obtained by performing gradient fusion on the gradient data respectively generated by the multiple devices in that communication domain; and, when multiple rounds of distributed training of the AI model using the devices in the multiple communication domains have elapsed, update the AI model respectively trained in each communication domain using global gradient data, where the global gradient data is obtained by performing gradient fusion on the gradient data respectively generated by the multiple devices in the multiple communication domains.
In a possible implementation, the update module 1103 is configured to:
exchange, between multiple devices in a target communication domain, the gradient data they respectively generate, where the target communication domain is one of the multiple communication domains;
perform, by the target communication domain, gradient fusion according to the gradient data exchanged between the multiple devices, to generate local gradient data corresponding to the target communication domain; and
update, by the target communication domain, the AI model trained by the target communication domain using the local gradient data corresponding to the target communication domain.
In a possible implementation, the update module 1103 is configured to:
obtain the version number of an activation operation corresponding to the target communication domain and the version number of an exchange operation, where the activation operation is used to trigger the exchange of gradient data between different devices in the target communication domain, and the exchange operation is the operation of exchanging gradient data between different devices in the target communication domain; and
when the version number of the activation operation is greater than or equal to the version number of the exchange operation, cause the multiple devices in the target communication domain to exchange the gradient data they respectively generate.
In a possible implementation, the physical connection between the multiple devices in the target communication domain is a ring connection.
In a possible implementation, the determination module 1102 is configured to:
obtain a device topology relationship, where the device topology relationship indicates the connection relationship between the multiple devices used to train the AI model; and
divide, according to the device topology relationship, the multiple devices used to train the AI model to obtain the multiple communication domains, where the communication rate between different devices in each communication domain is higher than the communication rate between devices in different communication domains.
In a possible implementation, the determination module 1102 is configured to:
generate a first configuration interface, where the first configuration interface is used to present to the user the identifiers of multiple devices available for training the AI model; and
determine, in response to a first configuration operation of the user, the communication domain to which each of the multiple devices used to train the AI model belongs.
In a possible implementation, the determination module 1102 is further configured to:
generate a second configuration interface before the AI model is trained, where the second configuration interface is used to present multiple exchange strategies to the user, and each of the multiple exchange strategies indicates one way of exchanging gradient data between the multiple devices in a communication domain; and
determine, in response to a second configuration operation of the user on the multiple exchange strategies, the way in which gradient data is exchanged between the multiple devices in each communication domain.
In a possible implementation, the devices in different communication domains are located on the same computing node, or the devices in the multiple communication domains are located on different computing nodes.
In a possible implementation, the devices in each communication domain include processors, chips or servers.
The model training apparatus 1100 shown in Figure 11 corresponds to the model training apparatus in the embodiment shown in Figure 3; therefore, for the specific implementation of each functional module in the model training apparatus 1100 and the technical effects it achieves, refer to the relevant description of the foregoing embodiment, which is not repeated here.
An embodiment of the present application further provides a computing device. As shown in Figure 12, the computing device 1200 may include a communication interface 1210 and a processor 1220. Optionally, the computing device 1200 may further include a memory 1230, which may be arranged inside or outside the computing device 1200. Exemplarily, each action performed by the model training apparatus in the embodiment shown in Figure 3 above may be implemented by the processor 1220. The processor 1220 may obtain the AI model to be trained and the multiple communication domains through the communication interface 1210 and is configured to implement the method executed in Figure 3. During implementation, each step of the processing flow may complete the method executed in Figure 3 through integrated logic circuits of hardware in the processor 1220 or through instructions in the form of software. For brevity, details are not repeated here. The program code executed by the processor 1220 to implement the above method may be stored in the memory 1230, and the memory 1230 is connected to the processor 1220, for example, by coupling.
Some features of the embodiments of the present application may be implemented/supported by the processor 1220 executing program instructions or software code in the memory 1230. The software components loaded on the memory 1230 may be summarized functionally or logically.
Any communication interface involved in the embodiments of the present application, such as the communication interface 1210 in the computing device 1200, may be a circuit, a bus, a transceiver or any other apparatus that can be used for information exchange; exemplarily, the other apparatus may be a device connected to the computing device 1200.
The processor involved in the embodiments of the present application may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in conjunction with the embodiments of the present application may be directly executed by a hardware processor, or by a combination of hardware and software modules in the processor.
The coupling in the embodiments of the present application is an indirect coupling or communication connection between apparatuses or modules, which may be in electrical, mechanical or other forms, and is used for information exchange between the apparatuses or modules.
The processor may operate in cooperation with the memory. The memory may be a non-volatile memory, such as a hard disk drive or a solid-state drive, or a volatile memory, such as a random access memory. The memory is any medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The embodiments of the present application do not limit the specific connection medium between the above communication interface, processor and memory. For example, the memory, the processor and the communication interface may be connected by a bus, and the bus may be divided into an address bus, a data bus, a control bus, and so on.
Based on the above embodiments, an embodiment of the present application further provides a computer storage medium storing a software program which, when read and executed by one or more processors, can implement the method performed by the model training apparatus provided by any one or more of the above embodiments. The computer storage medium may include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disc.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, an apparatus, a system, a storage medium or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The terms "first", "second" and the like in the description, claims and above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the terms used in this way are interchangeable where appropriate; this is merely the manner of distinguishing objects with the same attributes adopted when describing the embodiments of the present application.
Obviously, those skilled in the art can make various changes and modifications to the embodiments of the present application without departing from the scope of the embodiments of the present application. Thus, if these modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to cover these modifications and variations.

Claims (22)

  1. A model training method, characterized in that the method comprises:
    obtaining an AI model to be trained;
    determining a plurality of communication domains, wherein each of the plurality of communication domains comprises a plurality of devices;
    in each round of distributed training of the AI model by the devices in the plurality of communication domains, updating the AI model trained by each communication domain by using local gradient data corresponding to that communication domain, wherein the local gradient data corresponding to each communication domain is obtained by gradient fusion of gradient data respectively generated by the plurality of devices in that communication domain;
    at intervals of multiple rounds of distributed training of the AI model by the devices in the plurality of communication domains, updating the AI model respectively trained by each communication domain by using global gradient data, wherein the global gradient data is obtained by gradient fusion of gradient data respectively generated by the plurality of devices in the plurality of communication domains.
  2. The method according to claim 1, characterized in that updating the AI model trained by each communication domain by using the local gradient data corresponding to that communication domain comprises:
    exchanging, among a plurality of devices in a target communication domain, the gradient data respectively generated by the devices, wherein the target communication domain is one of the plurality of communication domains;
    performing, by the target communication domain, gradient fusion based on the gradient data exchanged among the plurality of devices, to generate local gradient data corresponding to the target communication domain;
    updating, by the target communication domain, the AI model trained by the target communication domain by using the local gradient data corresponding to the target communication domain.
  3. The method according to claim 2, characterized in that exchanging, among the plurality of devices in the target communication domain, the gradient data respectively generated by the devices comprises:
    obtaining a version number of an activation operation corresponding to the target communication domain and a version number of an exchange operation, wherein the activation operation is used to trigger the exchange of gradient data among different devices in the target communication domain, and the exchange operation is the operation of exchanging gradient data among different devices in the target communication domain;
    when the version number of the activation operation is greater than or equal to the version number of the exchange operation, exchanging, among the plurality of devices in the target communication domain, the gradient data respectively generated by the devices.
  4. The method according to claim 2 or 3, characterized in that the physical connection between the plurality of devices in the target communication domain is a ring connection.
  5. The method according to any one of claims 1 to 4, characterized in that determining the plurality of communication domains comprises:
    obtaining a device topology relationship, wherein the device topology relationship indicates connection relationships between the plurality of devices used to train the AI model;
    dividing, based on the device topology relationship, the plurality of devices used to train the AI model to obtain the plurality of communication domains, wherein the communication rate between different devices within each communication domain is higher than the communication rate between devices in different communication domains.
  6. The method according to any one of claims 1 to 4, characterized in that determining the plurality of communication domains comprises:
    generating a first configuration interface, wherein the first configuration interface is used to present, to a user, identifiers of the plurality of devices used to train the AI model;
    in response to a first configuration operation of the user, determining the communication domain to which each of the plurality of devices used to train the AI model belongs.
  7. The method according to any one of claims 1 to 6, characterized in that, before the AI model is trained, the method further comprises:
    generating a second configuration interface, wherein the second configuration interface is used to present a plurality of exchange strategies to a user, and each of the plurality of exchange strategies is used to indicate one manner of exchanging gradient data among the plurality of devices in a communication domain;
    in response to a second configuration operation of the user on the plurality of exchange strategies, determining the manner of exchanging gradient data among the plurality of devices in each communication domain.
  8. The method according to any one of claims 1 to 7, characterized in that devices in different communication domains are located on a same computing node, or the devices in the plurality of communication domains are respectively located on different computing nodes.
  9. The method according to any one of claims 1 to 8, characterized in that the devices in each communication domain comprise a processor, a chip, or a server.
  10. A model training apparatus, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain an AI model to be trained;
    a determining module, configured to determine a plurality of communication domains, wherein each of the plurality of communication domains comprises a plurality of devices;
    an updating module, configured to: in each round of distributed training of the AI model by the devices in the plurality of communication domains, update the AI model trained by each communication domain by using local gradient data corresponding to that communication domain, wherein the local gradient data corresponding to each communication domain is obtained by gradient fusion of gradient data respectively generated by the plurality of devices in that communication domain; and, at intervals of multiple rounds of distributed training of the AI model by the devices in the plurality of communication domains, update the AI model respectively trained by each communication domain by using global gradient data, wherein the global gradient data is obtained by gradient fusion of gradient data respectively generated by the plurality of devices in the plurality of communication domains.
  11. The apparatus according to claim 10, characterized in that the updating module is configured to:
    exchange, among a plurality of devices in a target communication domain, the gradient data respectively generated by the devices, wherein the target communication domain is one of the plurality of communication domains;
    perform, by the target communication domain, gradient fusion based on the gradient data exchanged among the plurality of devices, to generate local gradient data corresponding to the target communication domain;
    update, by the target communication domain, the AI model trained by the target communication domain by using the local gradient data corresponding to the target communication domain.
  12. The apparatus according to claim 11, characterized in that the updating module is configured to:
    obtain a version number of an activation operation corresponding to the target communication domain and a version number of an exchange operation, wherein the activation operation is used to trigger the exchange of gradient data among different devices in the target communication domain, and the exchange operation is the operation of exchanging gradient data among different devices in the target communication domain;
    when the version number of the activation operation is greater than or equal to the version number of the exchange operation, exchange, among the plurality of devices in the target communication domain, the gradient data respectively generated by the devices.
  13. The apparatus according to claim 11 or 12, characterized in that the physical connection between the plurality of devices in the target communication domain is a ring connection.
  14. The apparatus according to any one of claims 10 to 13, characterized in that the determining module is configured to:
    obtain a device topology relationship, wherein the device topology relationship indicates connection relationships between the plurality of devices used to train the AI model;
    divide, based on the device topology relationship, the plurality of devices used to train the AI model to obtain the plurality of communication domains, wherein the communication rate between different devices within each communication domain is higher than the communication rate between devices in different communication domains.
  15. The apparatus according to any one of claims 10 to 13, characterized in that the determining module is configured to:
    generate a first configuration interface, wherein the first configuration interface is used to present, to a user, identifiers of the plurality of devices used to train the AI model;
    in response to a first configuration operation of the user, determine the communication domain to which each of the plurality of devices used to train the AI model belongs.
  16. The apparatus according to any one of claims 10 to 15, characterized in that the determining module is further configured to:
    before the AI model is trained, generate a second configuration interface, wherein the second configuration interface is used to present a plurality of exchange strategies to a user, and each of the plurality of exchange strategies is used to indicate one manner of exchanging gradient data among the plurality of devices in a communication domain;
    in response to a second configuration operation of the user on the plurality of exchange strategies, determine the manner of exchanging gradient data among the plurality of devices in each communication domain.
  17. The apparatus according to any one of claims 10 to 16, characterized in that devices in different communication domains are located on a same computing node, or the devices in the plurality of communication domains are respectively located on different computing nodes.
  18. The apparatus according to any one of claims 10 to 17, characterized in that the devices in each communication domain comprise a processor, a chip, or a server.
  19. A model training system, characterized in that the model training system comprises a plurality of devices, and the model training system is configured to perform the method according to any one of claims 1 to 9.
  20. A computing device, characterized in that the computing device comprises a processor and a memory;
    the processor is configured to execute instructions stored in the memory, so that the computing device performs the method according to any one of claims 1 to 9.
  21. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions which, when run on at least one computing device, cause the at least one computing device to perform the method according to any one of claims 1 to 9.
  22. A computer program product containing instructions which, when run on at least one computing device, cause the at least one computing device to perform the method according to any one of claims 1 to 9.
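
For illustration only — the sketch below is not part of the claims or of the original disclosure — the training loop recited in claims 1 to 3 can be approximated in plain Python. All names (Device, CommunicationDomain, global_sync_interval) are hypothetical, the per-device gradient computation is simulated with NumPy instead of a real accelerator, and in-process averaging stands in for a collective-communication primitive; treating the global fusion as replacing the local fusion every K rounds is one possible reading of claim 1.

```python
# Illustrative sketch only: simulates the hierarchical gradient fusion of claims 1-3.
# All names are hypothetical; a real system would use a collective-communication
# library (e.g. a ring all-reduce) instead of averaging arrays in one process.
import numpy as np


class Device:
    """One training device holding a replica of the AI model parameters."""

    def __init__(self, param_size, rng):
        self.params = np.zeros(param_size)
        self.rng = rng

    def compute_gradient(self):
        # Stand-in for forward/backward on this device's local sample subset.
        return self.rng.normal(size=self.params.shape)


class CommunicationDomain:
    """A group of devices whose mutual links are fast (claim 1)."""

    def __init__(self, devices):
        self.devices = devices
        self.activation_version = 0  # version of the triggering activation operation (claim 3)
        self.exchange_version = 0    # version of the gradient-exchange operation (claim 3)

    def exchange_and_fuse(self, grads):
        # In this sketch the activation is issued inline; in practice it would be
        # triggered by separate control logic before the exchange is attempted.
        self.activation_version += 1
        if self.activation_version < self.exchange_version:
            return None  # exchange not yet activated, skip this round
        self.exchange_version += 1
        # Gradient fusion = element-wise averaging of the devices' gradients.
        return np.mean(grads, axis=0)


def train(domains, rounds, global_sync_interval, lr=0.1):
    for step in range(1, rounds + 1):
        per_domain_grads = [(d, [dev.compute_gradient() for dev in d.devices])
                            for d in domains]
        if step % global_sync_interval == 0:
            # Every K rounds: fuse across all devices of all domains (global gradient data).
            all_grads = [g for _, grads in per_domain_grads for g in grads]
            global_grad = np.mean(all_grads, axis=0)
            for domain, _ in per_domain_grads:
                for dev in domain.devices:
                    dev.params -= lr * global_grad
        else:
            # Other rounds: fuse only within each communication domain (local gradient data).
            for domain, grads in per_domain_grads:
                local_grad = domain.exchange_and_fuse(grads)
                if local_grad is not None:
                    for dev in domain.devices:
                        dev.params -= lr * local_grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    domains = [CommunicationDomain([Device(4, rng) for _ in range(2)]) for _ in range(2)]
    train(domains, rounds=8, global_sync_interval=4)
    print(domains[0].devices[0].params)
```

In an actual deployment, the intra-domain np.mean would typically be replaced by a ring all-reduce over the ring-connected devices of claim 4, and the cross-domain fusion by a second, slower collective spanning the communication domains.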
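Likewise, a minimal sketch of the topology-based partitioning of claim 5, assuming a simplified topology format in which each device is described by a (device_id, node_id) pair and devices sharing a node communicate faster with each other than across nodes — both assumptions are illustrative simplifications, not requirements of the claim:

```python
# Illustrative sketch only: one way to derive communication domains from a device
# topology (claim 5). The topology format and the "same node => faster link"
# rule are hypothetical simplifications.
from collections import defaultdict
from typing import Dict, List, Tuple


def partition_into_domains(topology: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group devices into communication domains, one domain per computing node."""
    domains: Dict[str, List[str]] = defaultdict(list)
    for device_id, node_id in topology:
        domains[node_id].append(device_id)
    return dict(domains)


if __name__ == "__main__":
    topo = [("npu0", "node0"), ("npu1", "node0"),
            ("npu2", "node1"), ("npu3", "node1")]
    print(partition_into_domains(topo))
    # {'node0': ['npu0', 'npu1'], 'node1': ['npu2', 'npu3']}
```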
PCT/CN2023/101224 2022-06-29 2023-06-20 Model training method, apparatus and system, and related device WO2024001861A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210760755.6 2022-06-29
CN202210760755 2022-06-29
CN202211148350.3A CN117312839A (en) 2022-06-29 2022-09-20 Model training method, device, system and related equipment
CN202211148350.3 2022-09-20

Publications (1)

Publication Number Publication Date
WO2024001861A1 (en)

Family

ID=89236100

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101224 WO2024001861A1 (en) 2022-06-29 2023-06-20 Model training method, apparatus and system, and related device

Country Status (2)

Country Link
CN (1) CN117312839A (en)
WO (1) WO2024001861A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210158558A (en) * 2020-06-24 2021-12-31 대구대학교 산학협력단 Edge computing system using ai block-type module architecture
CN114221736A (en) * 2020-09-04 2022-03-22 华为技术有限公司 Data processing method, device, equipment and medium
CN113656175A (en) * 2021-08-18 2021-11-16 北京百度网讯科技有限公司 Method, apparatus and program product for training models based on distributed systems
CN113867959A (en) * 2021-09-29 2021-12-31 苏州浪潮智能科技有限公司 Training task resource scheduling method, device, equipment and medium
CN114579311A (en) * 2022-03-04 2022-06-03 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for executing distributed computing task

Also Published As

Publication number Publication date
CN117312839A (en) 2023-12-29

Similar Documents

Publication Publication Date Title
US11487589B2 (en) Self-adaptive batch dataset partitioning for distributed deep learning using hybrid set of accelerators
CN112997138A (en) Artificial intelligence enabled management of storage media access
US9424079B2 (en) Iteration support in a heterogeneous dataflow engine
US11294599B1 (en) Registers for restricted memory
JP7451614B2 (en) On-chip computational network
US11900113B2 (en) Data flow processing method and related device
US11494681B1 (en) Quantum instruction compiler for optimizing hybrid algorithms
US11281967B1 (en) Event-based device performance monitoring
US11694075B2 (en) Partitioning control dependency edge in computation graph
WO2023142502A1 (en) Loop instruction processing method and apparatus, and chip, electronic device, and storage medium
KR20110028212A (en) Autonomous subsystem architecture
WO2022057310A1 (en) Method, apparatus and system for training graph neural network
US11416749B2 (en) Execution synchronization and tracking
US20210158131A1 (en) Hierarchical partitioning of operators
CN116151363B (en) Distributed Reinforcement Learning System
KR20200053318A (en) System managing calculation processing graph of artificial neural network and method managing calculation processing graph using thereof
US11275661B1 (en) Test generation of a distributed system
US11461622B2 (en) Dynamic code loading for multiple executions on a sequential processor
CN113011553A (en) Accelerator, method of operating an accelerator, and apparatus including an accelerator
US11308396B2 (en) Neural network layer-by-layer debugging
US10684834B2 (en) Method and apparatus for detecting inter-instruction data dependency
WO2024001861A1 (en) Model training method, apparatus and system, and related device
US9189448B2 (en) Routing image data across on-chip networks
EP4377844A1 (en) Mixing sparsity compression
US20210373790A1 (en) Inference in memory

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23830041

Country of ref document: EP

Kind code of ref document: A1