WO2024001861A1 - Model training method, apparatus and system, and related device - Google Patents

Model training method, apparatus and system, and related device

Info

Publication number
WO2024001861A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
gradient data
communication
communication domain
devices
Prior art date
Application number
PCT/CN2023/101224
Other languages
French (fr)
Chinese (zh)
Inventor
郝日佩
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2024001861A1 publication Critical patent/WO2024001861A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/098Distributed learning, e.g. federated learning

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a model training method, device, system and related equipment.
  • the scale of AI models is gradually increasing.
  • the parameter volume of the Pangu model in the field of natural language processing (NLP) can be as high as 200 billion, and the data volume of training samples can be as high as 40 terabytes.
  • a larger amount of model parameters and sample data will require higher computing power to train the AI model.
  • the huge computing power required for AI model training can be solved through distributed training of AI models.
  • the training samples of the AI model can be equally divided into multiple sample subsets, and each sample subset, together with a copy of the AI model, is assigned to a device, so that each device uses its sample subset to iteratively train the AI model and generate the gradients used to update it.
  • in each round, different devices exchange the gradient data they each generate and perform gradient fusion to calculate the global gradient data (that is, the data obtained after gradient fusion of the gradient data generated by all devices), and then update the parameters of the AI model trained on each device based on the global gradient data.
  • multiple rounds of such iterative updates are performed on the parameters of the AI model until its training is finally completed.
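  • To illustrate this baseline scheme only (not part of the claimed method), the following minimal Python sketch mimics data-parallel training: every device computes a gradient on its own sample subset, the gradients of all devices are fused into global gradient data, and every device applies the same update. The helper local_gradient is a hypothetical placeholder.

```python
import numpy as np

def local_gradient(params, samples):
    # Hypothetical placeholder: a real device would compute the gradient of a
    # loss on its own sample subset; a toy value is used here for illustration.
    return params - np.mean(samples, axis=0)

def train_round(params, sample_subsets, lr=0.1):
    # Each device trains on its own subset and produces gradient data.
    grads = [local_gradient(params, subset) for subset in sample_subsets]
    # Gradient fusion across all devices yields the global gradient data.
    global_grad = np.mean(grads, axis=0)
    # Every device applies the same update, keeping the model replicas in sync.
    return params - lr * global_grad

rng = np.random.default_rng(0)
subsets = np.split(rng.normal(size=(16, 4)), 4)   # 4 devices, 4 samples each
params = np.zeros(4)
for _ in range(5):
    params = train_round(params, subsets)
```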
  • embodiments of the present application provide a model training method, which can be executed by a corresponding model training device.
  • the model training device obtains an AI model to be trained and determines multiple communication domains.
  • the AI model can, for example, be an AI model with a large number of model parameters or a large amount of sample data, such as the Pangu model.
  • Each communication domain determined includes multiple devices. For example, all devices used to train the AI model can be divided into multiple communication domains, etc.
  • during each round of distributed training of the AI model using the devices in the multiple communication domains, the model training device uses the local gradient data corresponding to each communication domain to update the AI model trained in that communication domain.
  • the local gradient data corresponding to each communication domain is obtained by gradient fusion of the gradient data generated by the multiple devices in that communication domain.
  • in addition, every multiple rounds (the number of interval rounds can be a fixed number or a random number), the model training device uses global gradient data to update the AI model trained separately in each communication domain.
  • the global gradient data is obtained by gradient fusion of the gradient data generated by the multiple devices in the multiple communication domains, so that the global gradient data can be used to perform a global update of the AI model.
  • since each communication domain independently uses the gradient data generated by the multiple devices within it to update the AI model it trains, the problem that the overall training progress of the AI model is slowed down because some communication domains make slow progress over a period of time can be alleviated, thereby improving the overall training efficiency of the AI model.
  • furthermore, since the model training device uses global gradient data to update the AI model every multiple rounds, the training effect of the AI model can still reach a high level. On this basis, because devices in different communication domains do not need to exchange gradient data during each round of model training, the communication resources required to train the AI model can be effectively reduced.
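  • As a non-authoritative sketch of the schedule just described (the helpers fuse, compute_grad and dummy_grad are assumptions, not from the application), each communication domain fuses only its own devices' gradients in ordinary rounds, while every T-th round the gradients of all domains are fused into global gradient data:

```python
import numpy as np

def fuse(grads):
    # Gradient fusion is modelled here as an element-wise average.
    return np.mean(grads, axis=0)

def train(compute_grad, domains, rounds, T, dim, lr=0.01):
    # Each communication domain keeps its own replica of the AI model
    # between global synchronisations.
    replicas = {i: np.zeros(dim) for i in range(len(domains))}
    for rnd in range(1, rounds + 1):
        grads = {i: [compute_grad(replicas[i], dev) for dev in domain]
                 for i, domain in enumerate(domains)}
        if rnd % T == 0:
            # Interval round: fuse gradient data from the devices of ALL domains.
            global_grad = fuse([g for gs in grads.values() for g in gs])
            for i in replicas:
                replicas[i] = replicas[i] - lr * global_grad
        else:
            # Ordinary round: each domain only uses its local gradient data.
            for i in replicas:
                replicas[i] = replicas[i] - lr * fuse(grads[i])
    return replicas

# Example with two communication domains of two devices each and a dummy
# gradient function (purely illustrative).
dummy_grad = lambda params, dev: params + dev
print(train(dummy_grad, [[1, 2], [3, 4]], rounds=4, T=2, dim=3))
```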
  • the AI model can be independently trained and updated for each of the multiple communication domains.
  • taking one of the multiple communication domains (the target communication domain) as an example, during each round of training, the multiple devices in the target communication domain exchange the gradient data they each generate and perform gradient fusion on the exchanged gradient data to generate the local gradient data corresponding to the target communication domain.
  • the local gradient data corresponding to the target communication domain is then used to update the AI model trained in the target communication domain.
  • the remaining communication domains can use similar methods to train the AI models for which they are responsible. In this way, each communication domain can perform gradient updates of the AI model within a local scope by exchanging gradient data internally, and the efficiency with which different communication domains update the AI model is not affected by other communication domains, which can improve the overall training efficiency of the AI model.
  • the specific method may be to obtain the version number of the activation operation corresponding to the target communication domain and the version number of the interaction operation, where the activation operation is used to trigger the exchange of gradient data between different devices in the target communication domain,
  • and the interaction operation refers to the operation of exchanging gradient data between different devices in the target communication domain. Therefore, when the version number of the activation operation is greater than or equal to the version number of the interaction operation,
  • the multiple devices in the target communication domain exchange the gradient data they each generate. In this way, by limiting the number of times gradient data is exchanged, each communication domain can prevent some communication domains from performing too many exchanges, thereby avoiding asynchronous conflicts between multiple communication domains.
  • the physical connection between the multiple devices in the target communication domain is a ring connection, such as a connection based on an HCCS ring mode.
  • the model training device can obtain the device topology relationship.
  • the topological relationship is used to indicate the connection relationship between multiple devices used to train the AI model.
  • the multiple devices used to train the AI model are divided to obtain multiple communication domains.
  • the communication rate between different devices in each communication domain is higher than the communication rate between devices in different communication domains.
  • alternatively, the user can configure, for each of the devices used to train the AI model, the communication domain to which that device belongs.
  • specifically, the model training device may generate a first configuration interface, which is used to present to the user the identifiers of the multiple devices used to train the AI model, so that the user can configure, on the first configuration interface, the communication domain to which each device belongs.
  • in this way, the model training device can respond to the user's first configuration operation and determine the communication domain to which each of the multiple devices used to train the AI model belongs, thereby dividing the multiple communication domains.
  • the user can configure multiple communication domains, so that the user can intervene in the training of the AI model and achieve better model training effects.
  • before training the AI model, the model training device can also generate a second configuration interface.
  • the second configuration interface is used to present multiple interaction strategies to the user, where each interaction strategy indicates a way of exchanging gradient data between the multiple devices in a communication domain, such as the allgather, allreduce, ring-allreduce, or halving-doubling allreduce strategies.
  • in response to the user's second configuration operation for these multiple interaction strategies, the model training device determines how gradient data is exchanged between the multiple devices in each communication domain; different communication domains can use the same interaction strategy to exchange gradient data, or can use different interaction strategies.
  • the user can manually configure the interaction strategy in each communication domain, so that the user can intervene in the training of the AI model.
  • the most appropriate interaction strategy can be configured according to the device characteristics in each communication domain, so as to achieve a better model training effect.
  • devices in different communication domains are located on the same computing node, or devices in multiple communication domains are located on different computing nodes.
  • the devices in each communication domain include processors, chips, servers, etc., so that distributed training of AI models can be implemented based on devices with different granularities.
  • embodiments of the present application provide a model training device.
  • the device includes: an acquisition module for acquiring an AI model to be trained; and a determination module for determining multiple communication domains.
  • each of the multiple communication domains includes multiple devices; and an update module, configured to use the local gradient data corresponding to each communication domain to update the AI model trained in that communication domain during each round of distributed training of the AI model using the devices in the multiple communication domains.
  • the local gradient data corresponding to each communication domain is obtained by gradient fusion based on the gradient data respectively generated by the multiple devices in that communication domain.
  • in addition, every multiple rounds, the AI model trained separately in each communication domain is updated using global gradient data.
  • the global gradient data is obtained by gradient fusion based on the gradient data respectively generated by the multiple devices in the multiple communication domains.
  • the update module is configured to: exchange, between the multiple devices in a target communication domain, the gradient data each device generates, where the target communication domain is one of the multiple communication domains; perform gradient fusion in the target communication domain according to the gradient data exchanged between the multiple devices to generate the local gradient data corresponding to the target communication domain; and use the local gradient data corresponding to the target communication domain to update the AI model trained in the target communication domain.
  • the update module is configured to: obtain the version number of the activation operation corresponding to the target communication domain and the version number of the interaction operation, where the activation operation is used to trigger the exchange of gradient data between different devices in the target communication domain,
  • and the interaction operation is the operation of exchanging gradient data between different devices in the target communication domain; and, when the version number of the activation operation is greater than or equal to the version number of the interaction operation, exchange, between the multiple devices in the target communication domain, the gradient data each device generates.
  • the physical connection between multiple devices in the target communication domain is a ring connection.
  • the determining module is configured to: obtain a device topology relationship, which indicates a connection relationship between the multiple devices used to train the AI model; and, according to the device topology relationship, divide the multiple devices used to train the AI model to obtain the multiple communication domains, wherein the communication rate between different devices in each communication domain is higher than the communication rate between devices in different communication domains.
  • the determining module is configured to: generate a first configuration interface, where the first configuration interface is used to present to the user the identifiers of the multiple devices used to train the AI model; and, in response to the user's first configuration operation, determine the communication domain to which each of the multiple devices used to train the AI model belongs.
  • the determination module is further configured to: before training the AI model, generate a second configuration interface, where the second configuration interface is used to present multiple interaction strategies to the user, and each of the multiple interaction strategies is used to indicate a way of exchanging gradient data between the multiple devices in a communication domain; and, in response to a second configuration operation by the user for the multiple interaction strategies, determine the way in which the multiple devices in each communication domain exchange gradient data.
  • devices in different communication domains are located on the same computing node, or devices in multiple communication domains are located on different computing nodes.
  • the devices in each communication domain include a processor, a chip or a server.
  • since the model training device provided in the second aspect corresponds to the model training method provided in the first aspect, the technical effects of the second aspect and of each embodiment in the second aspect can be found in the corresponding descriptions of the first aspect and of each embodiment in the first aspect, and will not be described in detail here.
  • embodiments of the present application provide a model training system, characterized in that the model training system includes a plurality of devices and is used to perform the model training method in the above-mentioned first aspect or any implementation of the first aspect.
  • embodiments of the present application provide a computing device.
  • the computing device includes a processor and a memory; the memory is used to store instructions, and the processor executes the instructions stored in the memory, so that the computing device performs the model training method in the above first aspect or any possible implementation of the first aspect.
  • the memory can be integrated into the processor or independent of the processor.
  • the computing device may also include a bus, through which the processor is connected to the memory.
  • the memory may include a read-only memory and a random access memory.
  • embodiments of the present application further provide a computer-readable storage medium, which stores programs or instructions that, when run on at least one computer, cause the at least one computer to execute the model training method in the above-mentioned first aspect or any implementation of the first aspect.
  • embodiments of the present application further provide a computer program product containing instructions that, when run on at least one computer, cause the at least one computer to execute the model training method in the above first aspect or any implementation of the first aspect.
  • Figure 1 is a schematic diagram of interactive gradient data of four devices provided by the embodiment of the present application.
  • Figure 2 is a schematic architectural diagram of an exemplary model training system provided by an embodiment of the present application.
  • Figure 3 is a schematic flow chart of a model training method provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of the topology between NPU1 to NPU8 used to train the AI model
  • Figure 5 is a schematic diagram of an exemplary configuration interface provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of interactive gradient data between four processors
  • Figure 7 is a schematic diagram of another exemplary configuration interface provided by an embodiment of the present application.
  • Figure 8 is a schematic diagram of processor 2 notifying other processors of interactive gradient data
  • Figure 9 is an architectural schematic diagram of an exemplary server provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of distributed training for the Pangu model provided by the embodiment of the present application.
  • Figure 11 is a schematic structural diagram of a model training device provided by an embodiment of the present application.
  • Figure 12 is a schematic diagram of the hardware structure of a computing device provided by an embodiment of the present application.
  • the equipment for training AI models can be processor-level equipment, such as neural network processor (neural-network processing unit, NPU), graphics processor (graphics processing unit, GPU), etc.
  • the device for training the AI model can be a chip-level device, such as multiple chips connected to the host.
  • the device for training the AI model can be a server-level device, such as multiple independent servers.
  • the multiple processors may be located on the same server (the server may constitute a computing node), or may be located on different servers.
  • the multiple devices for training the AI model are server-level devices, the multiple devices can be located in the same data center (the data center can be regarded as a computing node), or the multiple devices can be located in different data centers, that is, AI models can be trained distributedly across data centers.
  • in the process of iteratively training the AI model, the multiple devices usually use the ring-allreduce method to exchange the gradient data generated by each device in each round of AI model training, generate new gradient data through gradient fusion of the gradient data obtained from the exchange,
  • and use the new gradient data to update the AI model parameters.
  • take the use of 4 devices to train an AI model as an example.
  • device 1 to device 4 are shown in Figure 1.
  • each device uses a subset of samples to train the AI model and generates the corresponding gradient data; device 1 to device 4 can then each divide the gradient data obtained by training into 4 shards according to the number of devices.
  • the gradient data of device 1 can be divided into shards a1, b1, c1, and d1;
  • the gradient data of device 2 can be divided into shards a2, b2, c2, and d2;
  • the gradient data of device 3 can be divided into shards a3, b3, c3, and d3;
  • and the gradient data of device 4 can be divided into shards a4, b4, c4, and d4.
  • device 1 sends shard a1 to device 2,
  • device 2 sends shard b2 to device 3,
  • device 3 sends shard c3 to device 4,
  • and device 4 sends shard d4 to device 1.
  • after that, each device can perform gradient fusion on its own stored gradient data shard and the received shard to generate new gradient data.
  • for example, device 1 can perform gradient fusion on shard d4 sent by device 4 and its stored shard d1 to generate a new gradient data shard D1, and use D1 to overwrite shard d1.
  • assuming shard d4 is {3,5,4,2,7,2}
  • and shard d1 is {1,5,6,9,11,21},
  • then D1 obtained by gradient fusion can be {2,5,5,6,9,12} (the values at corresponding positions are added and averaged, and the average is rounded up).
  • similarly, device 2 can generate a new gradient data shard A2 and use A2 to overwrite shard a2;
  • device 3 can generate a new gradient data shard B3 and use B3 to overwrite shard b3;
  • and device 4 can generate a new gradient data shard C4 and use C4 to overwrite shard c4.
  • after that, device 1 to device 4 interact for the second time:
  • device 1 sends shard D1 to device 2,
  • device 2 sends shard A2 to device 3,
  • device 3 sends shard B3 to device 4,
  • and device 4 sends shard C4 to device 1.
  • each device once again performs gradient fusion on its own stored gradient data shards and the received shards, generates new gradient data shards, and uses them to replace the originally stored shards. Through multiple such interactions among device 1 to device 4, each device ends up holding one gradient data shard that is obtained by gradient fusion of the corresponding shards from all of device 1 to device 4, as shown in Figure 1.
  • device 1 to device 4 then continue to interact and share the fused shards stored on each device with the other devices, so that each device can obtain the gradient fusion result based on the gradient data of all devices.
  • the gradient fusion result is shown in Figure 1, and each device can use it to update the parameters of the AI model. In this way, the multiple devices complete one round of the training process for the AI model.
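  • For illustration, the ring-allreduce pattern described above can be simulated in pure Python as follows (a simplified sketch only; real implementations overlap communication with computation, and the shard indexing here is just one possible convention):

```python
import numpy as np
from copy import deepcopy

def ring_allreduce(grads, average=True):
    """Simulate ring-allreduce over equal-length 1-D gradient arrays, one per
    device, and return the fused gradient every device ends up holding."""
    n = len(grads)
    shards = [list(np.array_split(np.asarray(g, dtype=float), n)) for g in grads]
    # Phase 1 (scatter-reduce): in step s, device i sends shard (i - s) mod n
    # to device (i + 1) mod n, which adds it to its own copy of that shard.
    for s in range(n - 1):
        sent = deepcopy(shards)              # model "simultaneous" sends
        for i in range(n):
            k = (i - s) % n
            shards[(i + 1) % n][k] = shards[(i + 1) % n][k] + sent[i][k]
    # Phase 2 (allgather): each fully fused shard is circulated around the ring.
    for s in range(n - 1):
        sent = deepcopy(shards)
        for i in range(n):
            k = (i + 1 - s) % n
            shards[(i + 1) % n][k] = sent[i][k]
    fused = np.concatenate(shards[0])
    return fused / n if average else fused

# Four devices, each with its own gradient data, as in Figure 1.
grads = [np.arange(8, dtype=float) + d for d in range(4)]
print(ring_allreduce(grads))                 # element-wise average of the four gradients
```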
  • embodiments of the present application provide a model training method, which can be executed by a corresponding model training device to improve the training efficiency of the AI model.
  • specifically, the model training device obtains the AI model to be trained and determines multiple communication domains, each of which includes multiple devices.
  • during each round of training the AI model using the devices in the multiple communication domains, the model training device uses the local gradient data corresponding to each communication domain to update the AI model trained in that communication domain.
  • the local gradient data corresponding to each communication domain is obtained by gradient fusion based on the gradient data generated by the multiple devices in that communication domain.
  • in addition, every multiple rounds of training the AI model using the multiple communication domains, the model training device uses global gradient data to update the AI model trained separately in each communication domain, where the global gradient data is obtained by gradient fusion of the gradient data generated by the multiple devices in the multiple communication domains.
  • since each communication domain independently uses the gradient data generated by the multiple devices within it to update the AI model it trains, the problem that the overall training progress of the AI model is slowed down because some communication domains make slow progress over a period of time can be alleviated, thereby improving the overall training efficiency of the AI model.
  • the model training device can classify device 1 and device 2 into communication domain 1, and classify device 3 and device 4 into communication domain 2.
  • device 1 and device 2 in communication domain 1 train the AI model respectively and obtain corresponding gradient data.
  • the model training device can perform gradient fusion on the gradient data in communication domain 1, and use the generated local gradient data to update the AI model trained by device 1 and device 2.
  • the model training device will also use the local gradient data generated in the communication domain 2 to update the AI model trained by the device 3 and the device 4 during this round of training.
  • for example, assume that in the first round of training, communication domain 1 takes 40 seconds to complete the AI model training and update,
  • while communication domain 2 takes 60 seconds to complete the AI model training and update;
  • and that in the second round of training, communication domain 1 takes 55 seconds to complete the AI model training and update,
  • while communication domain 2 takes 40 seconds to complete the AI model training and update.
  • further assume that, after the second round, updating the AI model globally based on the local gradient data generated by the two communication domains takes 10 seconds.
  • since communication domain 1 and communication domain 2 are independent of each other during the two rounds of model training, communication domain 1 takes 95 seconds (40 seconds + 55 seconds) and communication domain 2 takes 100 seconds (60 seconds + 40 seconds), so the overall training time of the AI model is 110 seconds (100 seconds + 10 seconds), which is less than the 125 seconds (60 seconds + 55 seconds + 10 seconds) required by the existing ring-allreduce-based method of training AI models, thereby improving the overall training efficiency of the AI model.
  • in addition, since the model training device uses global gradient data to update the AI model every multiple rounds, the training effect of the AI model can still reach a high level.
  • on this basis, because devices in different communication domains do not need to exchange gradient data during each round of model training, the communication resources required to train the AI model can be effectively reduced.
  • the above-mentioned model training device for executing the model training method can be deployed in the system architecture shown in Figure 2.
  • the system architecture shown in Figure 2 may include a deep learning framework 201, a computing architecture 202, firmware and drivers 203, and a hardware layer 204.
  • the deep learning framework 201 can integrate development components, pre-trained models and other resources, shield users from the underlying complex hardware, and provide users with services for the rapid development of AI models.
  • the deep learning framework 201 may be, for example, the TensorFlow framework, the PyTorch framework, the MindSpore framework, etc., or may be other types of deep learning frameworks, which are not limited.
  • the computing architecture 202 is used to provide an open programming interface, support users in quickly building AI applications and services based on AI models, and call multiple processors in the hardware layer 204 to achieve parallel training of AI models. Furthermore, the computing architecture 202 can also implement functions such as graph-level and operator-level compilation optimization and automatic tuning of AI models.
  • the computing architecture 202 may be, for example, a neural network computing architecture (compute architecture for neural networks, CANN), etc., or may be other applicable architectures.
  • Firmware and driver 203 are used to respond to the call of the computing architecture 202 to the hardware layer 204 and use multiple processors in the hardware layer 204 to perform corresponding data processing operations, such as using multiple processors in the hardware layer 204 for parallelization. Training AI models, etc.
  • the hardware layer 204 includes multiple processors, such as processor 1 to processor 8 in Figure 2, and other devices such as memory, network card, etc. (not shown in Figure 2), which are used to provide data processing capabilities for the upper layer and support Upper level services.
  • the processors included in the hardware layer 204 may include, for example, one or more of a central processing unit (CPU), a neural network processing unit (NPU), a graphics processing unit (GPU), or a data processing unit (DPU), or may include other types of processors, which are not limited here.
  • the model training device can also be deployed in other types of system architecture to implement distributed training of AI models.
  • the hardware layer 204 may include multiple servers, that is, the AI model may be distributedly trained at the server granularity.
  • Figure 3 is a schematic flow chart of a model training method in an embodiment of the present application.
  • This method can be applied to the system architecture shown in Figure 2. In practical applications, this method can also be applied to other applicable system architectures. To facilitate understanding and description, the following is an example of an application to the system architecture shown in Figure 2.
  • the method may specifically include:
  • S301 Obtain the AI model to be trained.
  • when users develop AI applications on the deep learning framework 201, they can provide the AI model used to implement the AI application to the deep learning framework 201, so that the deep learning framework 201 can provide the AI model to the computing architecture 202.
  • the model training device is used to trigger the training of the AI model.
  • the user can write a training script (an executable file written based on a specific format) on the deep learning framework 201, and the training script can be integrated with the file of the AI model built by the user on the deep learning framework 201.
  • the deep learning framework 201 can provide the training script to the model training device in the computing architecture 202, so that the model training device can parse the AI model from the training script and perform distributed training on the AI model according to the model training logic indicated by the training script.
  • alternatively, the deep learning framework 201 can provide the user with a configuration interface that presents multiple pre-built AI models, so that the deep learning framework 201 can determine the AI model to be trained according to the user's selection operation for one of the AI models.
  • the configuration interface can also present a variety of deep learning algorithms that can be used to train AI models, so that users can select a deep learning algorithm on the configuration interface and configure corresponding parameters based on the selected deep learning algorithm, such as the learning rate, loss function, etc.
  • the deep learning framework 201 can provide the AI model, deep learning algorithm, and configured parameters selected by the user to the model training device, so that the model training device performs distributed training on the AI model based on the deep learning algorithm and configured parameters.
  • the model training device can also obtain the AI model to be trained through other methods, which is not limited in this embodiment.
  • S302 Determine multiple communication domains, each of the multiple communication domains including multiple devices.
  • the model training device can use N processors in the hardware layer 204 to train the AI model, where N is a positive integer (such as N is 8, 16, etc.). Moreover, before training the AI model, the model training device can first divide the N processors for training the AI model into multiple sets, each set including at least two processors, and the processors in each set can form a communication domain. For example, the model training device can divide 8 processors into 2 communication domains, each communication domain including 4 processors, etc.
  • processors in different communication domains can train the AI model independently. For example, when the processors in communication domain 1 complete a round of training of the AI model, they do not need to wait for the processors in communication domain 2 to also complete that round of AI model training, and can directly execute the next round of training of the AI model.
  • processors in each communication domain can exchange, through allreduce, ring-allreduce and other methods, the gradient data generated in each round of AI model training.
  • this embodiment provides the following two implementation examples for determining multiple communication domains:
  • the model training device can classify multiple devices with higher affinity into the same communication domain based on the affinity between the devices.
  • the model training device can obtain the device topology relationship between the N processors in the hardware layer 204.
  • the device topology relationship is used to indicate the connection relationship between the N processors.
  • according to the device topology relationship, the model training device can divide the N processors used to train the AI model to obtain multiple communication domains.
  • the communication rate between different processors in each communication domain is higher than the communication rate between processors in different communication domains.
  • the model training device can divide multiple processors that are physically connected in a ring connection into one communication domain according to the topological relationship of the device, so that multiple communication domains can be divided. For example, physical connections may be established between processors in the same communication domain based on a Huawei cache coherence system (huawei cache coherence system, HCCS) ring connection method.
  • the N processors used to train the AI model include NPU1 to NPU8 as shown in Figure 4.
  • NPU1 to NPU4 are connected in full-mesh mode, NPU5 to NPU8 are also connected in full-mesh mode, and NPU1 and NPU5 can be connected through the CPU.
  • the model training device can determine to classify NPU1 to NPU4 into communication domain 1 and classify NPU5 to NPU8 into communication domain 2 according to the topological structure between NPU1 to NPU8.
  • the communication rate between NPU1 to NPU4 is higher than the communication rate between NPUs across communication domains.
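  • A minimal sketch of such topology-based division (a simplified greedy grouping; the dictionary format mapping each device to the peers it reaches over a high-speed full-mesh/HCCS link is an assumption made for illustration):

```python
def build_communication_domains(topology):
    """topology: dict mapping a device id to the set of devices it is
    connected to over a high-speed link (e.g. full mesh / HCCS ring)."""
    domains, assigned = [], set()
    for dev, peers in topology.items():
        if dev in assigned:
            continue
        # Group the device with its still-unassigned high-speed peers.
        domain = {dev} | {p for p in peers if p not in assigned}
        assigned |= domain
        domains.append(sorted(domain))
    return domains

topology = {
    "NPU1": {"NPU2", "NPU3", "NPU4"}, "NPU2": {"NPU1", "NPU3", "NPU4"},
    "NPU3": {"NPU1", "NPU2", "NPU4"}, "NPU4": {"NPU1", "NPU2", "NPU3"},
    "NPU5": {"NPU6", "NPU7", "NPU8"}, "NPU6": {"NPU5", "NPU7", "NPU8"},
    "NPU7": {"NPU5", "NPU6", "NPU8"}, "NPU8": {"NPU5", "NPU6", "NPU7"},
}
print(build_communication_domains(topology))
# [['NPU1', 'NPU2', 'NPU3', 'NPU4'], ['NPU5', 'NPU6', 'NPU7', 'NPU8']]
```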
  • the model training device can generate a configuration interface. For example, it can generate a configuration interface as shown in Figure 5.
  • the configuration interface includes the identifiers (such as processor names) of M processors that can be used to train the AI model, where M is a positive integer greater than or equal to N. The model training device can present the configuration interface to the user through the deep learning framework 201, so that the user can select, from the presented M processors, the N processors for training the AI model this time, and further configure the communication domain to which each selected processor belongs.
  • the model training device may execute the initialization process of the communication domain.
  • specifically, the model training device may respond to the user's configuration operation, determine the communication domain to which each of the N processors used to train the AI model belongs, and thereby divide the processors to obtain multiple communication domains and determine the size of each communication domain.
  • the number of processors included in each communication domain may be the same or different.
  • the configuration interface shown in Figure 5 presents 16 processors for the user to select, and based on the user's selection operation on the processor, it is determined to select processor 1 to processor 8 to train the AI model. Then, the user can create two communication domains on the configuration interface, namely communication domain 1 and communication domain 2, and specify the communication domains to which processors 1 to 8 respectively belong on the configuration interface. In this way, the model training device can determine the processors included in each communication domain according to the user's configuration of the communication domain to which each processor belongs, thereby obtaining multiple communication domains.
  • in actual application, the model training device can also determine multiple communication domains in other ways. For example, after the model training device determines the multiple processors selected by the user, processors located in the same server can be classified into one communication domain, etc., which is not limited in this embodiment.
  • S303 In the process of training the AI model using multiple communication domains in each round, use the local gradient data corresponding to each communication domain to update the AI model trained in the communication domain.
  • the local gradient data corresponding to each communication domain is obtained by gradient fusion based on the gradient data generated by the multiple processors in that communication domain.
  • the model training device can use processors in the multiple communication domains to perform distributed training on the AI model.
  • specifically, the model training device can allocate an AI model and a subset of training samples for training the AI model to each processor. Different processors are assigned the same AI model but different subsets of training samples, and each training sample subset includes at least one training sample. Then, during each round of training, each processor trains the AI model using the assigned subset of training samples, and generates gradient data based on the difference between the AI model's inference results on the training samples and the actual results; the gradient data is used to perform gradient updates on the parameters of the AI model. Since each processor trains the AI model based on only part of the training samples (that is, a subset of training samples), different processors can exchange the gradient data each generates and perform gradient fusion, and then update the parameters of the AI model on each processor based on the gradient fusion result, so as to achieve the effect of using multiple training sample subsets to train the AI model.
  • gradient data is not exchanged between all processors.
  • Each processor only exchanges gradient data within the communication domain to which it belongs, and performs gradient fusion and model parameter updating. This prevents the model training processes in different communication domains from interfering with each other.
  • take NPU1 to NPU8 shown in Figure 4 as an example.
  • NPU1 to NPU4 only exchange gradient data within communication domain 1, and do not exchange gradient data with NPU5 to NPU8 in communication domain 2.
  • likewise, NPU5 to NPU8 only exchange gradient data within communication domain 2, and do not exchange gradient data with the NPUs in communication domain 1. In this way, each communication domain can directly execute the next round of the model training process after completing gradient data exchange, gradient fusion, and model parameter updating, without waiting for other communication domains to complete their round of AI model training.
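  • For instance, with a PyTorch distributed backend (shown here only as an illustrative sketch, not the implementation disclosed in this application; it assumes init_process_group has already been called on every rank), gradient exchange can be restricted to a communication domain by creating one process group per domain:

```python
import torch
import torch.distributed as dist

def make_domain_group(domains, rank):
    """domains: list of rank lists, e.g. [[0, 1, 2, 3], [4, 5, 6, 7]].
    Every rank must call new_group() for every domain, in the same order,
    and then keep only the group it belongs to."""
    groups = [dist.new_group(ranks=r) for r in domains]
    for ranks, group in zip(domains, groups):
        if rank in ranks:
            return group

def local_gradient_fusion(grad: torch.Tensor, group) -> torch.Tensor:
    # All-reduce only among the devices of one communication domain,
    # then average: this yields the domain's local gradient data.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)
    grad /= dist.get_world_size(group=group)
    return grad
```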
  • multiple processors in each communication domain can exchange gradient data based on any strategy.
  • one communication domain among multiple communication domains (hereinafter referred to as the target communication domain) is taken as an example for illustrative explanation.
  • Processors in the remaining communication domains can refer to similar processes for data exchange.
  • the new gradient data generated by gradient fusion based on the gradient data of all processors in each communication domain is called local gradient data.
  • data interaction can be carried out between multiple processors in the target communication domain based on the following implementation methods.
  • multiple processors in the target communication domain can exchange gradient data based on any one of strategies including allgather, allreduce, ring-allreduce, and halving-doubling allreduce.
  • for example, assume that processor 1 to processor 4 in the target communication domain generate gradient data a, b, c, and d respectively, as shown in Figure 6. Processor 1 can exchange gradient data with processor 2 and, at the same time, processor 3 and processor 4 can exchange gradient data.
  • then, processor 1 and processor 2 can each perform gradient fusion on gradient data a and gradient data b to generate gradient data M;
  • and processor 3 and processor 4 can each perform gradient fusion on gradient data c and gradient data d to generate gradient data N.
  • processor 1 can exchange gradient data with processor 3.
  • processor 1 sends gradient data M to processor 3, and processor 3 sends gradient data N to processor 1.
  • processor 2 and processor 4 exchange gradient data.
  • the processor 1 and the processor 3 can perform gradient fusion on the gradient data M and the gradient data N to generate the gradient data X.
  • the gradient data X is the data generated by gradient fusion of gradient data a, b, c, and d.
  • the processor 2 and the processor 4 can also perform gradient fusion on the gradient data M and the gradient data N to generate the gradient data X. In this way, after two interactions, each processor can obtain the gradient data X generated by gradient fusion based on the gradient data of all processors in the target communication domain.
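  • The pairwise exchange pattern in this example can be sketched as a recursive-doubling style allreduce (pure Python, for illustration only; it assumes the number of processors is a power of two):

```python
import numpy as np

def recursive_doubling_allreduce(grads):
    """Each element of grads is one processor's gradient data; after log2(n)
    exchange steps every processor holds the same fused result X."""
    vals = [np.asarray(g, dtype=float) for g in grads]
    n, step = len(vals), 1
    while step < n:
        fused = list(vals)
        for i in range(n):
            partner = i ^ step            # step 1: 0<->1, 2<->3; step 2: 0<->2, 1<->3
            fused[i] = vals[i] + vals[partner]
        vals, step = fused, step * 2
    return [v / n for v in vals]          # fusion modelled as an average

a, b, c, d = (np.array([1.0, 2.0]), np.array([3.0, 4.0]),
              np.array([5.0, 6.0]), np.array([7.0, 8.0]))
print(recursive_doubling_allreduce([a, b, c, d])[0])   # -> [4. 5.] on every processor
```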
  • the strategy for interacting gradient data adopted in each communication domain can be configured by the user.
  • the model training device can present the configuration interface as shown in Figure 7 to the user through the deep learning framework 201, so that the user can configure the interaction strategy for each communication domain on the configuration interface.
  • the configuration interface can provide multiple candidate interaction strategies for each communication domain, such as allgather, allreduce, ring-allreduce, halving-doubling allreduce, etc.
  • in this way, the user can choose an interaction strategy for each communication domain from the multiple candidates, and the interaction strategies adopted by different communication domains may be the same or different, which is not limited in this embodiment.
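  • A hypothetical configuration sketch of such per-domain interaction strategies (the field names and device identifiers below are illustrative assumptions, not a disclosed format):

```python
# Strategies the (hypothetical) configuration interface could offer.
INTERACTION_STRATEGIES = {"allgather", "allreduce", "ring-allreduce",
                          "halving-doubling-allreduce"}

domain_config = {
    "communication_domain_1": {"devices": ["NPU1", "NPU2", "NPU3", "NPU4"],
                               "strategy": "ring-allreduce"},
    "communication_domain_2": {"devices": ["NPU5", "NPU6", "NPU7", "NPU8"],
                               "strategy": "halving-doubling-allreduce"},
}

# Validate the user's second configuration operation before training starts.
for name, cfg in domain_config.items():
    assert cfg["strategy"] in INTERACTION_STRATEGIES, name
```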
  • processors that have completed AI model training can give priority to the interaction of gradient data without waiting for all other processors in the target communication domain to also complete AI model training.
  • the efficiency of interacting gradient data between multiple processors within the target communication domain can be improved.
  • for example, processor 1 to processor 4 train the AI model in parallel. Assuming that processor 2 is the first in the target communication domain to complete training of the AI model, processor 2 can generate an activation message and use it to notify processor 1, processor 3 and processor 4 to start exchanging gradient data. In actual application, based on the physical connections and communication rules between processors, processor 2 can first send an activation message to processor 1 to notify processor 1 to start exchanging gradient data, then send an activation message to processor 4, and an activation message is likewise sent to processor 3 to notify it to start exchanging gradient data, as shown in Figure 8.
  • in this way, if processor 1 is the second to complete AI model training,
  • processor 2 and processor 1 can directly exchange gradient data (and perform gradient fusion). Then, if processor 3 is the third to complete AI model training, processor 2 can further exchange gradient data with processor 3. Moreover, when processor 4 also completes AI model training, processor 2 then exchanges gradient data with processor 4. In this way, processor 2 can obtain the gradient data generated by all processors in the target communication domain. Finally, processor 2 can send the local gradient data generated based on the gradient data of the four processors to the remaining processors, so that the local gradient data is used to update the parameters of the AI model on each processor, as shown in Figure 8.
  • further, each communication domain can, by limiting the number of times gradient data is exchanged, prevent some communication domains from performing too many gradient data exchanges, thereby avoiding asynchronous conflicts between multiple communication domains.
  • specifically, the first processor in the target communication domain to complete AI model training can generate an activation message.
  • the activation message is used to notify the remaining processors in the target communication domain to start exchanging gradient data, and the activation message includes the version number of the activation operation.
  • the processor that completes AI model training first can also obtain the version number of the currently executed interaction operation, where the interaction operation is the operation of exchanging gradient data between different devices in the target communication domain.
  • in this way, the processor can compare the version number of the activation operation with the version number of the interaction operation.
  • when the version number of the activation operation is greater than or equal to the version number of the interaction operation, the processor starts to exchange gradient data with the other processors; otherwise, the exchange of gradient data is not performed between the multiple processors.
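  • A minimal sketch of this version-number check (the counter initialisation and helper names are assumptions made for illustration; the application does not specify them):

```python
class DomainSyncState:
    def __init__(self):
        self.activation_version = 0    # number of activation operations issued
        self.interaction_version = 1   # version number of the next interaction

    def activate(self):
        # A device finished its round of training and triggers the exchange.
        self.activation_version += 1

    def try_interact(self, exchange_fn):
        # Interact only while the activation version number is greater than
        # or equal to the interaction version number, as described above.
        if self.activation_version >= self.interaction_version:
            exchange_fn()                    # exchange gradient data
            self.interaction_version += 1    # record the completed exchange
            return True
        return False                         # prevents extra, conflicting exchanges

state = DomainSyncState()
state.activate()
print(state.try_interact(lambda: None))      # True: the exchange may proceed
print(state.try_interact(lambda: None))      # False: no additional exchange
```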
  • multiple processors in the target communication domain can exchange gradient data through shared memory.
  • multiple processors in the target communication domain can be configured with shared memory, and multiple processors can access the shared memory.
  • after each processor in the target communication domain completes a round of training for the AI model and generates gradient data, it can write the gradient data to a designated area in the shared memory, so that the shared memory stores the gradient data generated by each of the multiple processors.
  • each processor can access the gradient data generated by all processors in the target communication domain from the shared memory, and by performing gradient fusion on these gradient data, local gradient data can be obtained.
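  • For illustration, a shared-memory exchange of this kind could look like the following Python sketch (assuming Python 3.8+ and a fixed gradient size; in real use the writers would be separate processes attaching to the same shared memory block by name):

```python
import numpy as np
from multiprocessing import shared_memory

NUM_DEVICES, GRAD_SIZE = 4, 1024
shm = shared_memory.SharedMemory(create=True, size=NUM_DEVICES * GRAD_SIZE * 8)
slots = np.ndarray((NUM_DEVICES, GRAD_SIZE), dtype=np.float64, buffer=shm.buf)

def publish_gradient(device_index: int, grad: np.ndarray) -> None:
    slots[device_index, :] = grad            # write into this device's designated area

def fuse_local_gradient() -> np.ndarray:
    return slots.mean(axis=0)                # local gradient data for the domain

for i in range(NUM_DEVICES):
    publish_gradient(i, np.full(GRAD_SIZE, float(i)))
print(fuse_local_gradient()[:3])             # -> [1.5 1.5 1.5]

del slots                                    # release the view before closing
shm.close(); shm.unlink()                    # free the shared memory block
```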
  • multiple processors in each communication domain can interact with gradient data and generate local gradient data by referring to the above method.
  • the above-mentioned implementation methods of exchanging gradient data in the communication domain are only examples.
  • multiple devices in each communication domain can also use other methods to exchange gradient data, which is not limited in this embodiment.
  • the inference performance (such as inference accuracy, etc.) of the AI model trained in each communication domain is usually difficult to reach the level of training based on the full set of training samples.
  • therefore, in order to improve the inference performance of the AI model, after multiple rounds of training on the AI model are completed, gradient data can be exchanged between the multiple communication domains to update the AI model on each processor based on the gradient data generated by all processors. Specifically, gradient fusion can be performed on the gradient data generated by all processors, and the new gradient data generated by gradient fusion (hereinafter referred to as global gradient data) can be used to update the parameters of the AI model on each processor. In this way, the inference performance of the finally trained AI model can usually reach that of an AI model trained based on the full set of training samples.
  • during implementation, each communication domain trains the AI model independently, and the processors in each communication domain can count the current number of iterations of the AI model in each round of training. If the current number of iterations is an integer multiple of a value T (the configured synchronization interval), then not only do the multiple processors in the communication domain exchange gradient data in the manner described above and generate the local gradient data corresponding to the communication domain through gradient fusion, but the communication domain also exchanges local gradient data with the other communication domains, so that each communication domain can obtain the local gradient data generated separately by all communication domains. In this way, by performing gradient fusion on the local gradient data generated in all communication domains, global gradient data can be obtained and used to update the parameters of the AI model in each communication domain.
  • the way in which local gradient data is exchanged between multiple communication domains is similar to the way in which gradient data is exchanged between multiple processors in each communication domain.
  • for example, multiple communication domains can exchange local gradient data based on any one of the allgather, allreduce, ring-allreduce, and halving-doubling allreduce strategies, or multiple communication domains can exchange local gradient data sequentially in the order in which they complete the (m*T)-th round of model training (m is a positive integer), or multiple communication domains can exchange local gradient data through a shared storage area, etc., which is not limited in this embodiment.
  • in the above implementation, local gradient data is exchanged between different communication domains.
  • in other implementations, the gradient data generated by each processor can instead be exchanged directly between the multiple communication domains.
  • for example, a processor in each communication domain can aggregate the gradient data generated by each processor in that communication domain to obtain the gradient data set corresponding to the communication domain,
  • where the gradient data set includes the gradient data generated by all processors in the communication domain, so that the processors responsible for aggregating gradient data in the multiple communication domains can exchange their respective gradient data sets.
  • alternatively, all processors participating in AI model training can directly exchange the gradient data they each generate, etc.
  • in this way, each communication domain can obtain the gradient data generated by the processors in all communication domains, so that by performing gradient fusion on all the gradient data, the global gradient data can be obtained and used to update the parameters of the AI model in each communication domain.
  • in the above embodiment, the case in which the multiple communication domains exchange gradient data (or local gradient data) at a fixed interval of (T-1) rounds is taken as an example for illustration.
  • in actual application, the number of model training rounds between two exchanges of gradient data may not be a fixed value.
  • for example, in the process of distributed AI model training, when each communication domain has iteratively trained the AI model 1000 times, gradient data (or local gradient data) is exchanged between the multiple communication domains for the first time, with an interval of 1000 training rounds.
  • then, when the number of iterations in each communication domain reaches 1900, gradient data (or local gradient data) is exchanged between the multiple communication domains for the second time, with an interval of 900 rounds; when the number of iterations reaches 2700, the third exchange takes place, with an interval of 800 rounds; and when the number of iterations reaches 3400, the fourth exchange takes place, with an interval of 700 rounds, and so on.
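  • A small sketch of such a non-fixed synchronisation schedule, assuming (purely for illustration) that the interval shrinks by 100 rounds after every global exchange, as in the example above:

```python
def global_sync_iterations(first_interval=1000, shrink=100, min_interval=100):
    # Yield the iteration counts at which a global gradient exchange occurs.
    step, interval = 0, first_interval
    while interval >= min_interval:
        step += interval
        yield step
        interval -= shrink                    # shorter gaps as training progresses

schedule = global_sync_iterations()
print([next(schedule) for _ in range(4)])     # -> [1000, 1900, 2700, 3400]
```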
  • in the above embodiment, the case in which the devices in the communication domains are specifically processors is used for illustration.
  • in actual application, the devices in the communication domains may also be chips or servers.
  • the specific implementation process of performing distributed training of the AI model with such devices can be understood with reference to the relevant descriptions in this embodiment, and will not be described in detail here.
  • in this embodiment, each communication domain uses the gradient data generated by its internal processors to update the AI model trained in that communication domain. This can alleviate the impact that slow training progress in some communication domains has on the overall training progress of the AI model, that is, it can improve the overall training efficiency of the AI model. For example, if the progress of set 1 in the first round is delayed by 3 seconds and the progress of set 2 in the second round is delayed by 5 seconds, the overall progress is not delayed by 8 seconds, but only by the slowest single delay, which is 5 seconds.
  • for example, the system architecture described in Figure 2 can be deployed in a server.
  • the server includes 4 CPUs and can be connected to 8 NPU chips, as shown in Figure 9.
  • NPU1 to NPU8 in the server can implement distributed training of the Pangu model (an AI model).
  • in actual application, distributed training of the Pangu model can also be implemented based on NPU chips in multiple servers.
  • the training method is similar to the implementation of distributed training of the Pangu model using multiple NPUs in a single server, and can be understood with reference to that implementation.
  • each CPU can support eight DDR4 dual in-line memory modules (double data rate 4 dual in-line memory modules, DDR4 DIMMs), and CPU1 to CPU4 can be fully interconnected (full mesh).
  • the CPU in the server can provide a bandwidth capability of 90GB/s (gigabytes per second), of which each CPU can provide a one-way bandwidth of 30GB/s and a two-way bandwidth of 60GB/s.
  • NPU1 to NPU4 can be fully interconnected and can be located on one NPU motherboard.
  • NPU5 to NPU8 can be fully interconnected and can be located on another NPU motherboard.
  • the connection between the NPUs and the CPUs can be based on the peripheral component interconnect express (PCIe) bus (only part of the connections between the NPUs and the CPUs is shown in Figure 9), so that NPU1 to NPU4 can exchange data with NPU5 to NPU8 through the CPUs in the server.
  • Each NPU motherboard can provide a bandwidth capacity of 90GB/s, of which each NPU can provide a one-way bandwidth of 30GB/s and a two-way bandwidth of 60GB/s.
  • the distributed training process is shown in Figure 10.
  • Users can provide a training script to the server.
  • the training script can include the Pangu model file, specify NPU1 to NPU8 to train the Pangu model, and define NPU1 to NPU4 to belong to communication domain 1, and NPU5 to NPU8 to belong to communication domain 2.
  • the CPU on the host side of the server can parse the Pangu model to be trained from the training script, and determine the multiple NPUs used for distributed training of the Pangu model this time and the communication domain to which each NPU belongs.
  • the CPU can extract the calculation graph according to the training script.
  • the calculation graph includes multiple nodes, and there are edges connecting different nodes. Among them, the nodes in the calculation graph are used to indicate the calculations defined in the training script, and the edges between the nodes are used to indicate the dependencies between different calculations.
  • the extracted calculation graph can be saved to a trans-flash card.
  • the CPU can compile the calculation graph in the flash card, generate an intermediate representation (IR), and provide the IR to the compiler.
  • the compiler can define one or more operator libraries, such as the neural network (NN) operator library shown in Figure 10, the Huawei collective communication library (huawei collective communication library, HCCL) operator library, etc.
  • the NN operator library can include convolution layer operators, pooling layer operators, loss functions, etc.
  • the HCCL operator library can include operators used to define data communication methods, such as the allreduce operator, the allgather operator, etc.
  • the CPU can use the compiler to determine the operators that need to be executed sequentially for distributed training of the Pangu model, generate corresponding device instructions, and send the device instructions to the NPU on the device side.
  • NPU1 to NPU8 on the device side can execute the corresponding operators in a loop based on the device instructions issued by the host side and perform gradient updates on the Pangu model until the iteration termination conditions are met, thereby realizing distributed training of the Pangu model.
  • NPU1 to NPU4 in communication domain 1 and NPU5 to NPU8 in communication domain 2 train the Pangu model separately, and every (T-1) rounds of model training, communication domain 1 and communication domain 2 exchange the gradient data generated by training the Pangu model, so as to achieve a global gradient update of the Pangu model.
  • the device side can send the training results to the host side.
  • the training results may include, for example, the trained Pangu model, attribute information of the Pangu model (such as inference accuracy), etc.
  • FIG. 11 is a schematic structural diagram of a model training device provided by an embodiment of the present application.
  • the model training device 1100 shown in FIG. 11 may, for example, be the model training device mentioned in the embodiment shown in FIG. 3 .
  • the model training device 1100 includes:
  • Acquisition module 1101, configured to obtain the AI model to be trained;
  • Determining module 1102, configured to determine multiple communication domains, each of the multiple communication domains including multiple devices;
  • Update module 1103, configured to: during each round of distributed training of the AI model using the devices in the multiple communication domains, update the AI model trained in each communication domain using the local gradient data corresponding to that communication domain, where the local gradient data corresponding to each communication domain is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in that communication domain; and, when the AI model is distributedly trained using the devices in the multiple communication domains at an interval of multiple rounds, update the AI model separately trained in each communication domain using global gradient data, which is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in the multiple communication domains.
  • the update module 1103 is used to:
  • multiple devices in a target communication domain exchange the gradient data they each generate, and the target communication domain is one of the multiple communication domains;
  • the target communication domain performs gradient fusion based on the gradient data exchanged between the multiple devices to generate local gradient data corresponding to the target communication domain;
  • the target communication domain uses the local gradient data corresponding to the target communication domain to update the AI model trained by the target communication domain.
  • the update module 1103 is used to:
  • obtain the version number of the activation operation corresponding to the target communication domain and the version number of the interaction operation, where the activation operation is used to trigger the exchange of gradient data between different devices in the target communication domain, and the interaction operation is the operation of exchanging gradient data between different devices in the target communication domain;
  • when the version number of the activation operation is greater than or equal to the version number of the interaction operation, the multiple devices in the target communication domain exchange the gradient data they each generate.
  • the physical connection between multiple devices in the target communication domain is a ring connection.
  • the determining module 1102 is used to:
  • obtain a device topology relationship indicating the connection relationship between the multiple devices used to train the AI model; and divide, according to the device topology relationship, the multiple devices used to train the AI model to obtain the multiple communication domains, wherein the communication rate between different devices in each communication domain is higher than the communication rate between devices in different communication domains.
  • the determining module 1102 is used to:
  • generate a first configuration interface used to present to the user the identifiers of the multiple devices used to train the AI model; and determine, in response to a first configuration operation by the user, the communication domain to which each of the multiple devices used to train the AI model belongs.
  • the determining module 1102 is also used to:
  • generate a second configuration interface before the AI model is trained, where the second configuration interface is used to present multiple interaction strategies to the user, and each of the multiple interaction strategies is used to indicate a manner of exchanging gradient data between the multiple devices in a communication domain;
  • determine, in response to a second configuration operation by the user on the multiple interaction strategies, the manner of exchanging gradient data between the multiple devices in each communication domain.
  • devices in different communication domains are located on the same computing node, or devices in multiple communication domains are located on different computing nodes.
  • the devices in each communication domain include a processor, a chip or a server.
  • the model training device 1100 shown in Figure 11 corresponds to the model training device in the embodiment shown in Figure 3. Therefore, for the specific implementation of each functional module in the model training device 1100 and its technical effects, reference may be made to the relevant descriptions in the foregoing embodiments, which are not repeated here.
  • the computing device 1200 may include a communication interface 1210 and a processor 1220.
  • the computing device 1200 may also include a memory 1230.
  • the memory 1230 may be disposed inside the computing device 1200 or may be disposed outside the computing device 1200 .
  • each action performed by the model training device in the embodiment shown in FIG. 3 can be implemented by the processor 1220.
  • the processor 1220 can obtain the AI model to be trained and the multiple communication domains through the communication interface 1210, and thereby implement the method executed in the embodiment shown in Figure 3.
  • in the implementation process, each step of the processing flow can be completed by an integrated logic circuit of hardware in the processor 1220 or by instructions in the form of software, so as to complete the method executed in Figure 3.
  • the program code executed by the processor 1220 to implement the above method may be stored in the memory 1230 .
  • the memory 1230 is connected to the processor 1220, for example, by a coupling connection.
  • Some features of the embodiments of the present application may be implemented/supported by the processor 1220 executing program instructions or software codes in the memory 1230.
  • the software components loaded in the memory 1230 may be divided functionally or logically.
  • Any communication interface involved in the embodiments of this application may be a circuit, bus, transceiver, or any other device that can be used for information exchange.
  • taking the communication interface 1210 in the computing device 1200 as an example, the other device may be a device connected to the computing device 1200, or the like.
  • the processor involved in the embodiments of this application may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, which may implement or execute each method, step and logical block diagram disclosed in the embodiments of this application.
  • a general-purpose processor may be a microprocessor or any conventional processor, etc.
  • the steps of the methods disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware processor for execution, or can be executed by a combination of hardware and software modules in the processor.
  • the coupling in the embodiments of this application is an indirect coupling or communication connection between devices or modules, which may be in electrical, mechanical or other forms, and is used for information interaction between the devices or modules.
  • the processor may operate in conjunction with the memory.
  • the memory can be a non-volatile memory, such as a hard disk or a solid state drive, or a volatile memory, such as a random access memory.
  • the memory may also be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • the embodiments of the present application do not limit the specific connection medium between the above communication interface, processor and memory.
  • the memory, processor and communication interface can be connected through a bus.
  • the bus can be divided into address bus, data bus, control bus, etc.
  • embodiments of the present application also provide a computer storage medium, which stores a software program.
  • when the software program is read and executed by one or more processors, it can implement the method executed by the model training device provided by any one or more of the above embodiments.
  • the computer storage medium may include: U disk, mobile hard disk, read-only memory, random access memory, magnetic disk or optical disk and other various media that can store program codes.
  • embodiments of the present application may be provided as methods, devices, systems, storage media or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Abstract

Provided is a model training method, comprising: acquiring an AI model to be trained, and determining a plurality of communication domains; and during the process of training the AI model in each round, by using local gradient data corresponding to each communication domain, updating the AI model, wherein the local gradient data corresponding to each communication domain is obtained by performing gradient fusion according to gradient data respectively generated by a plurality of devices in the communication domain, and when the AI model is trained at an interval of a plurality of rounds, updating, by using total gradient data, the AI model respectively trained in each communication domain, wherein the total gradient data is obtained by performing gradient fusion according to the gradient data in the plurality of communication domains. Therefore, since it is only at an interval of a plurality of rounds of training that an AI model is updated by using gradient data generated by all devices training the AI model, the problem of the overall training progress of an AI model being slowed down due to the progress of training the AI model within a period of time in some communication domains being relatively low can be alleviated, such that the overall training efficiency of the AI model can be improved.

Description

Model training method, apparatus and system, and related device
This application claims priority to the Chinese patent application No. 202210760755.6, filed with the China National Intellectual Property Administration on June 29, 2022 and entitled "Deep learning method, apparatus and system", and to the Chinese patent application No. 202211148350.3, filed with the China National Intellectual Property Administration on September 20, 2022 and entitled "Model training method, apparatus and system, and related device", both of which are incorporated herein by reference in their entirety.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a model training method, apparatus and system, and related device.
Background
With the development of artificial intelligence (AI), the scale of AI models is gradually increasing. For example, the Pangu model in the field of natural language processing (NLP) can have as many as 200 billion parameters, and the data volume of its training samples can be as high as 40 terabytes. Correspondingly, larger amounts of model parameters and sample data require higher computing power to train the AI model.
At present, the huge computing power required for AI model training can be provided through distributed training of the AI model. In a specific implementation, the training samples of the AI model can be equally divided into multiple sample subsets, and each sample subset, together with the AI model, is assigned to one device, so that each device iteratively trains the AI model using one sample subset and generates the gradient used to update the AI model. Then, different devices exchange the gradient data they each generate and perform gradient fusion to calculate global gradient data (that is, the data obtained after gradient fusion of the gradient data generated by all devices), and the parameters of the AI model trained by each device are updated based on the global gradient data. In this way, the parameters of the AI model are iteratively updated over multiple rounds, and the training of the AI model is finally completed.
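Written out as a standard data-parallel formulation (not a quotation from this application), the global gradient data described above is a fusion, commonly an average, of the per-device gradients, which is then applied to every replica of the model parameters:

$$ g_{\text{global}} = \frac{1}{N}\sum_{i=1}^{N} g_i, \qquad \theta \leftarrow \theta - \eta\, g_{\text{global}}, $$

where $g_i$ is the gradient computed by device $i$ on its sample subset, $N$ is the number of devices, $\theta$ denotes the AI model parameters and $\eta$ the learning rate.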
In actual application scenarios, different devices usually exchange gradient data in a ring-allreduce manner, that is, the flow of the gradient data exchanged between the different devices forms a ring, and by performing multiple rounds of data exchange and gradient fusion, every device can obtain the global gradient data. However, this way of exchanging gradient data usually results in low training efficiency of the AI model and high consumption of communication resources.
Summary
A model training method, apparatus, system, storage medium and computer program product are provided, so as to improve the training efficiency of an AI model and reduce the communication resources consumed in training the AI model.
In a first aspect, an embodiment of this application provides a model training method, which can be executed by a corresponding model training device. Specifically, the model training device obtains an AI model to be trained and determines multiple communication domains. The AI model may be, for example, an AI model with a large amount of model parameters or sample data, such as the Pangu model. Each determined communication domain includes multiple devices; for example, all devices used to train the AI model may be divided into the multiple communication domains. In each round of distributed training of the AI model using the devices in the multiple communication domains, the model training device updates the AI model trained in each communication domain using the local gradient data corresponding to that communication domain, where the local gradient data corresponding to each communication domain is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in that communication domain. Furthermore, when the AI model is distributedly trained using the devices in the multiple communication domains at an interval of multiple rounds (the interval may be a fixed or a random number of rounds), the model training device updates the AI model separately trained in each communication domain using global gradient data, which is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in the multiple communication domains, so that the global AI model is updated using the global gradient data.
Since the AI model is updated using the gradient data generated by all devices training the AI model only once every multiple rounds of training, while in each intermediate round each communication domain independently updates the AI model it trains using the gradient data generated by the multiple devices within it, the problem that the overall training progress of the AI model is dragged down because some communication domains make slow progress in training the AI model during a period of time can be alleviated, so the overall training efficiency of the AI model can be improved. Moreover, the model training device updates the AI model using the global gradient data every multiple rounds, which ensures that the training effect of the AI model can reach a high level. On this basis, since devices in different communication domains do not need to exchange gradient data during each intermediate round of model training, the communication resources consumed in training the AI model can be effectively reduced.
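As a rough, self-contained sketch of the training schedule just described (local gradient fusion inside each communication domain every round, global fusion only every T rounds), the following Python simulation uses NumPy arrays to stand in for device gradients. All names and the choice of T are illustrative assumptions, not taken from this application.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_DOMAINS = 2          # e.g. communication domain 1 and communication domain 2
DEVICES_PER_DOMAIN = 4   # e.g. four NPUs per domain
PARAM_SIZE = 8           # toy parameter vector
T = 4                    # global gradient fusion every T rounds (interval chosen arbitrarily)
LR = 0.1

# One parameter replica per device; all replicas start identical.
params = np.zeros((NUM_DOMAINS, DEVICES_PER_DOMAIN, PARAM_SIZE))

def backprop(replicas):
    # Stand-in for each device computing a gradient on its own sample subset.
    return rng.normal(size=replicas.shape)

for rnd in range(1, 13):
    grads = backprop(params)                  # shape: (domains, devices, params)
    if rnd % T == 0:
        # Global round: fuse the gradients of every device in every communication domain.
        fused = grads.mean(axis=(0, 1), keepdims=True)
    else:
        # Ordinary round: fuse gradients only inside each communication domain.
        fused = grads.mean(axis=1, keepdims=True)
    params -= LR * fused                      # every replica applies its fused gradient

# Replicas inside one domain always receive the same fused gradient, so they stay identical.
assert np.allclose(params[0, 0], params[0, 1])
```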
In a possible implementation, during each round of training, each of the multiple communication domains can train and update the AI model independently. Taking one target communication domain as an example, in each round of training, the multiple devices in the target communication domain exchange the gradient data they each generate, and gradient fusion is performed based on the gradient data exchanged between the multiple devices to generate local gradient data corresponding to the target communication domain, so that the AI model trained by the target communication domain is updated using the local gradient data corresponding to the target communication domain. Other communication domains can train the AI models they are responsible for in a similar manner. In this way, each communication domain can perform a gradient update of the AI model within a local scope by exchanging gradient data internally, and the efficiency with which different communication domains update the AI model is not affected by other communication domains, which can improve the overall training efficiency of the AI model.
In a possible implementation, when the multiple devices in the target communication domain exchange the gradient data they each generate, the version number of the activation operation corresponding to the target communication domain and the version number of the interaction operation may be obtained, where the activation operation is used to trigger the exchange of gradient data between different devices in the target communication domain, and the interaction operation is the operation of exchanging gradient data between different devices in the target communication domain; when the version number of the activation operation is greater than or equal to the version number of the interaction operation, the multiple devices in the target communication domain exchange the gradient data they each generate. In this way, each communication domain can limit the number of times gradient data is exchanged, so as to prevent some communication domains from performing too many gradient data exchanges and thereby avoid asynchronous conflicts between the multiple communication domains.
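One possible reading of the version-number check described above is a simple counter comparison kept per communication domain; the sketch below is only an interpretation, and every name in it is invented for illustration.

```python
class DomainExchangeGate:
    """Hypothetical gate for one communication domain: gradient exchange runs only when
    the activation counter has caught up with (or passed) the exchange counter."""

    def __init__(self):
        self.activation_version = 0    # incremented each time an exchange is triggered
        self.interaction_version = 0   # incremented each time an exchange actually runs

    def activate(self):
        self.activation_version += 1

    def try_exchange(self, do_exchange):
        # Condition taken from the text: activation version >= interaction version.
        if self.activation_version >= self.interaction_version:
            do_exchange()                      # devices in the domain swap their gradient data
            self.interaction_version += 1
            return True
        return False                           # the domain would otherwise exchange too often


gate = DomainExchangeGate()
print(gate.try_exchange(lambda: None))   # True  (0 >= 0)
print(gate.try_exchange(lambda: None))   # False (0 >= 1 fails) until activate() is called
gate.activate()
print(gate.try_exchange(lambda: None))   # True  (1 >= 1)
```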
In a possible implementation, the physical connection between the multiple devices in the target communication domain is a ring connection, for example, a connection based on an HCCS ring.
In a possible implementation, when determining the multiple communication domains, multiple devices with high affinity can be placed in the same communication domain. In a specific implementation, the model training device can obtain a device topology relationship, which indicates the connection relationship between the multiple devices used to train the AI model, and then divide the multiple devices used to train the AI model according to the device topology relationship to obtain the multiple communication domains, where the communication rate between different devices in each communication domain is higher than the communication rate between devices in different communication domains. In this way, by placing devices with higher communication rates in the same communication domain, the efficiency of exchanging gradient data between different devices in the communication domain during subsequent model training can be improved, which improves the overall training efficiency of the AI model.
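A minimal sketch of this kind of affinity-based partitioning, assuming a hypothetical pairwise bandwidth table (the fast intra-board versus slow inter-board numbers are illustrative): devices joined, directly or transitively, by links at or above a threshold end up in the same communication domain.

```python
def partition_by_affinity(bandwidth, threshold):
    """Group devices into communication domains: two devices share a domain if they are
    connected (directly or transitively) by links whose bandwidth is >= threshold."""
    n = len(bandwidth)
    seen, domains = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, domain = [start], []
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            domain.append(u)
            stack.extend(v for v in range(n)
                         if v != u and v not in seen and bandwidth[u][v] >= threshold)
        domains.append(sorted(domain))
    return domains

# Hypothetical topology: devices 0-3 sit on one board with fast links, 4-7 on another board;
# the two boards talk only over a slower path.
FAST, SLOW = 30, 8
bandwidth = [[FAST if i // 4 == j // 4 else SLOW for j in range(8)] for i in range(8)]

print(partition_by_affinity(bandwidth, threshold=FAST))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
```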
In a possible implementation, the user can configure the communication domain to which each device used to train the AI model belongs. In a specific implementation, the model training device can generate a first configuration interface, which is used to present to the user the identifiers of the multiple devices used to train the AI model, so that the user can configure the communication domain to which each device belongs on the first configuration interface; the model training device can then determine, in response to a first configuration operation by the user, the communication domain to which each of the multiple devices used to train the AI model belongs, thereby obtaining the multiple communication domains. In this way, the user can configure the multiple communication domains, which makes it convenient for the user to intervene in the training of the AI model and achieve a better model training effect.
In a possible implementation, before training the AI model, the model training device can also generate a second configuration interface, which is used to present multiple interaction strategies to the user, where each interaction strategy indicates a manner of exchanging gradient data between the multiple devices in a communication domain, such as an allgather, allreduce, ring-allreduce or halving-doubling allreduce strategy; the model training device then determines, in response to a second configuration operation by the user on the multiple interaction strategies, the manner of exchanging gradient data between the multiple devices in each communication domain. The devices in different communication domains may exchange gradient data using the same interaction strategy, or may use different interaction strategies. In this way, the user can manually configure the interaction strategy within each communication domain so as to intervene in the training of the AI model, for example by configuring the most suitable interaction strategy according to the characteristics of the devices in each communication domain, thereby achieving a better model training effect.
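Purely for illustration, the outcome of such a per-domain strategy choice could be carried as plain configuration data handed to the training job; none of the keys or values below are taken from an actual interface or library.

```python
# Hypothetical result of the user's second configuration operation: one strategy per domain.
interaction_strategies = {
    "communication_domain_1": "ring-allreduce",             # e.g. ring-connected NPUs on one board
    "communication_domain_2": "halving-doubling-allreduce",
}

def exchange_gradients(domain, gradient_tensors, strategies):
    strategy = strategies[domain]
    # A real implementation would dispatch to the collective operator selected for this
    # domain (for example an allreduce variant); here the decision is only reported.
    print(f"{domain}: fusing {len(gradient_tensors)} tensors using {strategy}")

exchange_gradients("communication_domain_1", [0.1, 0.2, 0.3], interaction_strategies)
```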
In a possible implementation, the devices in different communication domains are located on the same computing node, or the devices in the multiple communication domains are located on different computing nodes.
In a possible implementation, the devices in each communication domain include processors, chips or servers, so that distributed training of the AI model can be implemented based on devices of different granularities.
In a second aspect, an embodiment of this application provides a model training apparatus. The apparatus includes: an acquisition module, configured to obtain an AI model to be trained; a determining module, configured to determine multiple communication domains, each of the multiple communication domains including multiple devices; and an update module, configured to: during each round of distributed training of the AI model using the devices in the multiple communication domains, update the AI model trained in each communication domain using the local gradient data corresponding to that communication domain, where the local gradient data corresponding to each communication domain is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in that communication domain; and, when the AI model is distributedly trained using the devices in the multiple communication domains at an interval of multiple rounds, update the AI model separately trained in each communication domain using global gradient data, which is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in the multiple communication domains.
In a possible implementation, the update module is configured to: enable the multiple devices in a target communication domain to exchange the gradient data they each generate, where the target communication domain is one of the multiple communication domains; perform gradient fusion in the target communication domain based on the gradient data exchanged between the multiple devices to generate local gradient data corresponding to the target communication domain; and update, by the target communication domain, the AI model trained by the target communication domain using the local gradient data corresponding to the target communication domain.
In a possible implementation, the update module is configured to: obtain the version number of the activation operation corresponding to the target communication domain and the version number of the interaction operation, where the activation operation is used to trigger the exchange of gradient data between different devices in the target communication domain, and the interaction operation is the operation of exchanging gradient data between different devices in the target communication domain; and, when the version number of the activation operation is greater than or equal to the version number of the interaction operation, enable the multiple devices in the target communication domain to exchange the gradient data they each generate.
In a possible implementation, the physical connection between the multiple devices in the target communication domain is a ring connection.
In a possible implementation, the determining module is configured to: obtain a device topology relationship, which indicates the connection relationship between the multiple devices used to train the AI model; and divide, according to the device topology relationship, the multiple devices used to train the AI model to obtain the multiple communication domains, where the communication rate between different devices in each communication domain is higher than the communication rate between devices in different communication domains.
In a possible implementation, the determining module is configured to: generate a first configuration interface, which is used to present to the user the identifiers of the multiple devices used to train the AI model; and determine, in response to a first configuration operation by the user, the communication domain to which each of the multiple devices used to train the AI model belongs.
In a possible implementation, the determining module is further configured to: before the AI model is trained, generate a second configuration interface, which is used to present multiple interaction strategies to the user, where each of the multiple interaction strategies indicates a manner of exchanging gradient data between the multiple devices in a communication domain; and determine, in response to a second configuration operation by the user on the multiple interaction strategies, the manner of exchanging gradient data between the multiple devices in each communication domain.
In a possible implementation, the devices in different communication domains are located on the same computing node, or the devices in the multiple communication domains are located on different computing nodes.
In a possible implementation, the devices in each communication domain include processors, chips or servers.
Since the model training apparatus provided in the second aspect corresponds to the model training method provided in the first aspect, for the technical effects of the second aspect and its implementations, reference may be made to the technical effects of the corresponding first aspect and its implementations, which are not repeated here.
In a third aspect, an embodiment of this application provides a model training system. The model training system includes multiple devices and is configured to perform the model training method in the first aspect or any implementation of the first aspect.
In a fourth aspect, an embodiment of this application provides a computing device. The computing device includes a processor and a memory; the memory is configured to store instructions, and the processor executes the instructions stored in the memory, so that the computing device performs the model training method in the first aspect or any possible implementation of the first aspect. It should be noted that the memory may be integrated into the processor or may be independent of the processor. The computing device may also include a bus, through which the processor is connected to the memory. The memory may include a readable memory and a random access memory.
In a fifth aspect, an embodiment of this application further provides a computer-readable storage medium, which stores a program or instructions that, when run on at least one computer, cause the at least one computer to perform the model training method in the first aspect or any implementation of the first aspect.
In a sixth aspect, an embodiment of this application further provides a computer program product containing instructions that, when run on at least one computer, cause the at least one computer to perform the model training method in the first aspect or any implementation of the first aspect.
In addition, for the technical effects brought by any implementation of the second to sixth aspects, reference may be made to the technical effects brought by the different implementations of the first aspect, which are not repeated here.
Description of drawings
To describe the technical solutions in the embodiments of this application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the accompanying drawings in the following description are only some embodiments recorded in this application, and a person of ordinary skill in the art may further obtain other drawings based on these accompanying drawings.
Figure 1 is a schematic diagram of four devices exchanging gradient data according to an embodiment of this application;
Figure 2 is a schematic architectural diagram of an exemplary model training system according to an embodiment of this application;
Figure 3 is a schematic flowchart of a model training method according to an embodiment of this application;
Figure 4 is a schematic diagram of the topology between NPU1 to NPU8 used to train an AI model;
Figure 5 is a schematic diagram of an exemplary configuration interface according to an embodiment of this application;
Figure 6 is a schematic diagram of four processors exchanging gradient data;
Figure 7 is a schematic diagram of another exemplary configuration interface according to an embodiment of this application;
Figure 8 is a schematic diagram of processor 2 notifying the remaining processors to exchange gradient data;
Figure 9 is a schematic architectural diagram of an exemplary server according to an embodiment of this application;
Figure 10 is a schematic flowchart of distributed training of the Pangu model according to an embodiment of this application;
Figure 11 is a schematic structural diagram of a model training apparatus according to an embodiment of this application;
Figure 12 is a schematic diagram of the hardware structure of a computing device according to an embodiment of this application.
Detailed description of embodiments
In practical applications, when the number of parameters in an AI model to be trained and the amount of sample data used to train the AI model are large, the limited computing power of a single device may be insufficient to complete the training of the AI model alone. Therefore, the AI model can be trained jointly by combining the computing power of multiple devices through distributed training. The devices used to train the AI model may be processor-level devices, such as neural-network processing units (NPUs) or graphics processing units (GPUs); or chip-level devices, such as multiple chips connected to a host; or server-level devices, such as multiple independent servers. When the multiple devices used to train the AI model are processor-level or chip-level devices, the multiple processors may be located in the same server (which may constitute a computing node) or in different servers. When the multiple devices used to train the AI model are server-level devices, the multiple devices may be located in the same data center (which may be regarded as one computing node), or may be located in different data centers, that is, the AI model may be distributedly trained across data centers.
In the process of iteratively training the AI model, the multiple devices usually exchange, in a ring-allreduce manner, the gradient data generated by each device in each round of training the AI model, and update the parameters of the AI model using the new gradient data generated by performing gradient fusion on the exchanged gradient data.
Taking training an AI model with four devices as an example, as shown for device 1 to device 4 in Figure 1, in each round of iterative training of the AI model, each device trains the AI model using one sample subset and generates corresponding gradient data; device 1 to device 4 then each divide their gradient data into four chunks according to the number of devices. The gradient data of device 1 is divided into chunks a1, b1, c1 and d1; the gradient data of device 2 into chunks a2, b2, c2 and d2; the gradient data of device 3 into chunks a3, b3, c3 and d3; and the gradient data of device 4 into chunks a4, b4, c4 and d4. In the first exchange among device 1 to device 4, device 1 sends chunk a1 to device 2, device 2 sends chunk b2 to device 3, device 3 sends chunk c3 to device 4, and device 4 sends chunk d4 to device 1. Each device then performs gradient fusion between the chunk it received and the corresponding chunk it stores to generate a new chunk of gradient data; for example, device 1 fuses chunk d4 sent by device 4 with its stored chunk d1 to generate a new chunk D1 and overwrites chunk d1 with D1. For example, if chunk d4 is {3,5,4,2,7,2} and chunk d1 is {1,5,6,9,11,21}, then D1 obtained by fusing d4 and d1 is {2,5,5,6,9,12} (the values at corresponding positions are added, the average is calculated, and the average is rounded up). Similarly, device 2 generates a new chunk A2 and overwrites chunk a2 with it; device 3 generates a new chunk B3 and overwrites chunk b3 with it; and device 4 generates a new chunk C4 and overwrites chunk c4 with it.
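The fusion rule used in this example (elementwise sum, halved, rounded up) can be checked directly against the quoted chunks:

$$ D_1 = \left\lceil \frac{d_4 + d_1}{2} \right\rceil = \left\lceil \frac{(3,5,4,2,7,2) + (1,5,6,9,11,21)}{2} \right\rceil = (2,5,5,6,9,12). $$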
Then device 1 to device 4 perform a second exchange: device 1 sends chunk D1 to device 2, device 2 sends chunk A2 to device 3, device 3 sends chunk B3 to device 4, and device 4 sends chunk C4 to device 1; each device again performs gradient fusion between the chunk it stores and the chunk it received, generates a new chunk of gradient data, and replaces the originally stored chunk with it. After multiple exchanges among device 1 to device 4, each device holds one chunk of gradient data that is obtained by gradient fusion of the corresponding chunks of device 1 to device 4, as shown in Figure 1.
Then device 1 to device 4 continue to exchange data, sharing the gradient-fused chunk stored by each device with the other devices, so that each device can obtain the gradient fusion result produced from the gradient data of all devices, as shown in Figure 1, and each device can then update the parameters of the AI model using this gradient fusion result. In this way, the multiple devices complete one round of training of the AI model.
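To make the two phases concrete, the following self-contained Python sketch simulates a ring-allreduce over four devices with four chunks each, mirroring the layout of Figure 1. It reuses the pairwise ceil-average fusion from the worked example above purely for illustration; production implementations typically sum the chunks and divide once at the end, and nothing here is taken from an actual collective-communication library.

```python
import math

def fuse(a, b):
    # Fusion rule from the worked example above: elementwise mean, rounded up.
    return [math.ceil((x + y) / 2) for x, y in zip(a, b)]

def ring_allreduce(chunks):
    """chunks[i][j] = j-th gradient chunk initially held by device i (n devices, n chunks each).
    Returns the per-device chunk lists after the reduce-scatter and allgather phases."""
    n = len(chunks)
    state = [[list(c) for c in device] for device in chunks]

    # Phase 1 (reduce-scatter): at step s, device i sends chunk (i - s) mod n to its right
    # neighbour, which fuses it into its own copy of that chunk.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, state[i][(i - s) % n]) for i in range(n)]
        for i, j, payload in sends:
            state[(i + 1) % n][j] = fuse(state[(i + 1) % n][j], payload)

    # Phase 2 (allgather): each device now owns one fully fused chunk, which is passed
    # around the ring until every device holds every fused chunk.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, state[i][(i + 1 - s) % n]) for i in range(n)]
        for i, j, payload in sends:
            state[(i + 1) % n][j] = list(payload)
    return state

# Four devices, four scalar chunks each, mirroring the a/b/c/d layout of Figure 1.
devices = [[[1], [2], [3], [4]],
           [[5], [6], [7], [8]],
           [[9], [10], [11], [12]],
           [[13], [14], [15], [16]]]
result = ring_allreduce(devices)
print(all(result[i] == result[0] for i in range(len(result))))   # True: all devices agree
```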
Since the speed at which different devices exchange gradient data usually differs, for example, because of load or resource specifications, some devices have a high latency in sending or receiving gradient data, which drags down the overall efficiency of exchanging gradient data among the multiple devices and thus affects the training efficiency of the AI model. In addition, frequent exchange of gradient data between multiple devices also leads to high consumption of the communication resources needed to train the AI model.
On this basis, an embodiment of this application provides a model training method, which can be executed by a corresponding model training device and is used to improve the training efficiency of the AI model. In a specific implementation, the model training device obtains an AI model to be trained and determines multiple communication domains, each of which includes multiple devices; in each round of training the AI model using the multiple communication domains, the model training device updates the AI model trained in each communication domain using the local gradient data corresponding to that communication domain, where the local gradient data corresponding to each communication domain is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in that communication domain; and, every multiple rounds of training the AI model using the multiple communication domains, the model training device updates the AI model separately trained in each communication domain using global gradient data, which is obtained by gradient fusion of the gradient data respectively generated by the multiple devices in the multiple communication domains.
Since the AI model is updated using the gradient data generated by all devices training the AI model only once every multiple rounds of training, while in each intermediate round each communication domain independently updates the AI model it trains using the gradient data generated by the multiple devices within it, the problem that the overall training progress of the AI model is dragged down because some communication domains make slow progress in training the AI model during a period of time can be alleviated, so the overall training efficiency of the AI model can be improved.
For ease of understanding, still taking the iterative training of the AI model by device 1 to device 4 as an example, the model training device may place device 1 and device 2 in communication domain 1, and device 3 and device 4 in communication domain 2. In each round of training the AI model, device 1 and device 2 in communication domain 1 each train the AI model and obtain corresponding gradient data; the model training device performs gradient fusion on the gradient data within communication domain 1 and updates the AI model trained by device 1 and device 2 using the generated local gradient data. At the same time, during that round of training, the model training device also updates the AI model trained by device 3 and device 4 using the local gradient data generated in communication domain 2.
Assume that in the first round of training the AI model, communication domain 1 takes 40 seconds to complete AI model training and updating, and communication domain 2 takes 60 seconds; in the second round, communication domain 1 takes 55 seconds and communication domain 2 takes 40 seconds; and the AI model is then updated globally based on the local gradient data generated by the two communication domains in the second round, which is assumed to take 10 seconds. Since communication domain 1 and communication domain 2 are independent of each other during the two rounds of model training, and the model training of communication domain 1 takes 95 seconds (that is, 40 seconds + 55 seconds) while that of communication domain 2 takes 100 seconds (60 seconds + 40 seconds), the overall training time of the AI model is 110 seconds (that is, 100 seconds + 10 seconds), which is less than the 125 seconds (that is, 60 seconds + 55 seconds + 10 seconds) required by the existing ring-allreduce-based way of training the AI model, so the overall training efficiency of the AI model can be improved.
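The comparison in this example can be summarised as follows, using the per-round figures assumed above:

$$ T_{\text{grouped}} = \max(40+55,\; 60+40) + 10 = 110\ \text{s} \;<\; T_{\text{per-round global}} = \max(40,60) + \max(55,40) + 10 = 60 + 55 + 10 = 125\ \text{s}. $$

In the per-round-global scheme every round waits for the slower communication domain, whereas the grouped scheme pays the cross-domain synchronisation cost only once over these rounds.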
Moreover, the model training device updates the AI model using the global gradient data every multiple rounds, which ensures that the training effect of the AI model can reach a high level. On this basis, since devices in different communication domains do not need to exchange gradient data during each intermediate round of model training, the communication resources consumed in training the AI model can be effectively reduced.
For example, the above model training device for executing the model training method may be deployed in the system architecture shown in Figure 2. The system architecture shown in Figure 2 may include a deep learning framework 201, a computing architecture 202, firmware and drivers 203, and a hardware layer 204.
The deep learning framework 201 can integrate resources such as development components and pre-trained models, shield users from the perception of the underlying complex hardware, and provide users with services for rapidly developing AI models. For example, the deep learning framework 201 may be the TensorFlow framework, the PyTorch framework, the MindSpore framework, or another type of deep learning framework, which is not limited here.
The computing architecture 202 is used to provide an open programming interface, support users in quickly building AI applications and services based on AI models, and invoke the multiple processors in the hardware layer 204 to achieve parallelized training of AI models. Further, the computing architecture 202 can also implement functions such as graph-level and operator-level compilation optimization and automatic tuning of AI models. For example, the computing architecture 202 may be a compute architecture for neural networks (CANN) or another applicable architecture.
The firmware and drivers 203 are used to respond to the invocation of the hardware layer 204 by the computing architecture 202 and use the multiple processors in the hardware layer 204 to perform corresponding data processing operations, such as using the multiple processors in the hardware layer 204 to train AI models in parallel.
The hardware layer 204 includes multiple processors, such as processor 1 to processor 8 in Figure 2, and also includes other components such as memory and network cards (not shown in Figure 2), and is used to provide data processing capabilities for the upper layers and support upper-layer services. For example, the processors included in the hardware layer 204 may include one or more of a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU) and a data processing unit (DPU), or may include other types of processors, which is not limited here.
The system architecture shown in Figure 2 is only an example. In practical applications, the model training device may also be deployed in other types of system architectures to implement distributed training of AI models. For example, in other possible system architectures, the hardware layer 204 may include multiple servers, that is, the AI model may be distributedly trained at server granularity.
To make the above objectives, features and advantages of this application clearer and easier to understand, various non-limiting implementations in the embodiments of this application are described below by way of example with reference to the accompanying drawings. Obviously, the described embodiments are some rather than all of the embodiments of this application. Based on the embodiments in this application, all other embodiments obtained based on the above content fall within the protection scope of this application.
Figure 3 is a schematic flowchart of a model training method according to an embodiment of this application. The method can be applied to the system architecture shown in Figure 2, and in practical applications it can also be applied to other applicable system architectures. For ease of understanding and description, its application to the system architecture shown in Figure 2 is taken as an example below. The method may specifically include the following steps.
S301: Obtain the AI model to be trained.
In an actual application scenario, when a user develops an AI application on the deep learning framework 201, the user can provide the deep learning framework 201 with the AI model used to implement the AI application, so that the deep learning framework 201 can provide the AI model to the model training device in the computing architecture 202 to trigger the training of the AI model.
As one implementation example, the user can write a training script (an executable file written in a specific format) on the deep learning framework 201, and the training script can integrate the file of the AI model built by the user on the deep learning framework 201. The deep learning framework 201 can then provide the training script to the model training device in the computing architecture 202, so that the model training device can parse the AI model from the training script and perform distributed training on the AI model according to the model training logic indicated by the training script.
As another implementation example, the deep learning framework 201 can provide the user with a configuration interface that presents multiple AI models whose construction has been completed, so that the deep learning framework 201 can determine, according to the user's selection operation on an AI model, the AI model to be trained selected by the user. Further, the configuration interface can also present multiple deep learning algorithms that can be used to train AI models, so that the user can select a deep learning algorithm on the configuration interface and configure corresponding parameters, such as the learning rate and the loss function, based on the selected deep learning algorithm. The deep learning framework 201 can then provide the AI model, the deep learning algorithm and the configured parameters selected by the user to the model training device, so that the model training device performs distributed training on the AI model based on the deep learning algorithm and the configured parameters. Of course, the model training device can also obtain the AI model to be trained in other ways, which is not limited in this embodiment.
S302: Determine multiple communication domains, where each of the multiple communication domains includes multiple devices.
In this embodiment, the model training apparatus may train the AI model using N processors in the hardware layer 204, where N is a positive integer (for example, 8 or 16). Moreover, before training the AI model, the model training apparatus may first divide the N processors used to train the AI model into multiple sets, each set including at least two processors, and the processors in each set may form one communication domain. For example, the model training apparatus may divide 8 processors into 2 communication domains, each communication domain including 4 processors.
During training of the AI model, processors in different communication domains can train the AI model independently. For example, after the processors in communication domain 1 complete one round of training of the AI model, they can directly execute the next round of training without waiting for the processors in communication domain 2 to also complete that round. Moreover, the processors within each communication domain can exchange the gradient data generated in each round of training the AI model by means such as allreduce or ring-allreduce.
For ease of understanding, this embodiment provides the following two implementation examples of determining the multiple communication domains:
In the first implementation example, the model training apparatus may classify multiple devices with higher mutual affinity into the same communication domain according to the affinity between devices. In a specific implementation, the model training apparatus may obtain the device topology relationship between the N processors in the hardware layer 204, where the device topology relationship indicates the connection relationship between the N processors, so that the model training apparatus can divide the N processors used to train the AI model according to the device topology relationship to obtain multiple communication domains, where the communication rate between different processors in each communication domain is higher than the communication rate between processors in different communication domains. In actual applications, the model training apparatus may, according to the device topology relationship, classify multiple processors whose physical connection forms a ring into one communication domain, thereby obtaining the multiple communication domains. Exemplarily, the processors in the same communication domain may be physically connected based on a Huawei cache coherence system (HCCS) ring connection.
For example, assume that the N processors used to train the AI model include NPU1 to NPU8 as shown in Figure 4, where NPU1 to NPU4 are connected in full mesh mode, NPU5 to NPU8 are also connected in full mesh mode, and NPU1 and NPU5 are connected through a CPU. Then, according to the topology among NPU1 to NPU8, the model training apparatus may determine to classify NPU1 to NPU4 into communication domain 1 and NPU5 to NPU8 into communication domain 2. Normally, the communication rate among NPU1 to NPU4 is higher than the communication rate between NPUs across communication domains.
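As a sketch of how such a topology-based division could be expressed in code, the snippet below groups devices whose pairwise links are high-bandwidth (for example, full-mesh or ring connected) into one communication domain. The device names and the `intra_board_links` variable are hypothetical placeholders rather than data from this application; this only illustrates the grouping logic, assuming the topology is available as a list of fast links.

```python
from collections import defaultdict

def build_communication_domains(devices, fast_links):
    """Group devices into communication domains using union-find over
    high-bandwidth links (e.g. HCCS / full-mesh connections).  Devices that
    are only reachable over slower paths (e.g. via the CPU) end up in
    different domains."""
    parent = {d: d for d in devices}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path compression
            d = parent[d]
        return d

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in fast_links:      # only fast links merge devices
        union(a, b)

    domains = defaultdict(list)
    for d in devices:
        domains[find(d)].append(d)
    return list(domains.values())

# Hypothetical topology matching the Figure 4 example: NPU1-4 and NPU5-8
# are each full-mesh connected, and the two boards only meet via the CPU.
npus = [f"NPU{i}" for i in range(1, 9)]
intra_board_links = [(f"NPU{i}", f"NPU{j}")
                     for group in ([1, 2, 3, 4], [5, 6, 7, 8])
                     for i in group for j in group if i < j]

print(build_communication_domains(npus, intra_board_links))
# -> [['NPU1', 'NPU2', 'NPU3', 'NPU4'], ['NPU5', 'NPU6', 'NPU7', 'NPU8']]
```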
In the second implementation example, the model training apparatus may generate a configuration interface, for example the one shown in Figure 5, which includes the identifiers (such as processor names) of M processors available for training the AI model, where M is a positive integer greater than or equal to N. The model training apparatus can then present the configuration interface to the user through the deep learning framework 201, so that the user can select, from the M presented processors, the N processors to be used for training the AI model this time, and further configure, for each selected processor, the communication domain to which it belongs. Correspondingly, the model training apparatus may execute an initialization procedure for the communication domains: specifically, in response to the user's configuration operation, it may determine the communication domain to which each of the N processors used to train the AI model belongs, thereby obtaining multiple communication domains, and determine the scale of each communication domain. The number of processors included in each communication domain may be the same or different.
For example, assume that the configuration interface shown in Figure 5 presents 16 processors for the user to select, and that, based on the user's selection operation, processor 1 to processor 8 are selected to train the AI model. The user can then create two communication domains on the configuration interface, namely communication domain 1 and communication domain 2, and specify on the interface the communication domain to which each of processor 1 to processor 8 belongs. In this way, the model training apparatus can determine the processors included in each communication domain according to the user's configuration, thereby obtaining the multiple communication domains.
Of course, the above implementations of determining the communication domains are only examples. In actual applications, the model training apparatus may also determine the multiple communication domains in other ways; for example, after determining the multiple processors selected by the user, it may classify the processors located in the same server into one communication domain, which is not limited in this embodiment.
S303: In each round of training the AI model using the multiple communication domains, update the AI model trained in each communication domain using the local gradient data corresponding to that communication domain, where the local gradient data corresponding to each communication domain is obtained by performing gradient fusion on the gradient data respectively generated by the multiple processors in that communication domain.
After determining the multiple communication domains, the model training apparatus may use the processors in the multiple communication domains to perform distributed training on the AI model.
In a specific implementation, the model training apparatus may allocate, to each processor, the AI model and a training sample subset used to train it, where different processors are allocated the same AI model but different training sample subsets, and each training sample subset includes at least one training sample. In each round of training, each processor trains the AI model using its allocated training sample subset and generates gradient data based on the difference between the inference result of the AI model on the training samples and the actual result; the gradient data is used to perform gradient updates on the parameters of the AI model. Since each processor trains the AI model based on only part of the training samples (that is, one training sample subset), different processors can exchange the gradient data they respectively generate and perform gradient fusion, and then update the parameters of the AI model on each processor based on the fused result, thereby achieving the effect of training the AI model with multiple training sample subsets.
In this embodiment, during each round of model training, gradient data is not exchanged among all processors; each processor exchanges gradient data only within the communication domain to which it belongs, and gradient fusion and model parameter updating are performed there, so that the model training processes in different communication domains do not interfere with each other. Taking training the AI model with NPU1 to NPU8 shown in Figure 4 as an example, in each round of model training, NPU1 to NPU4 exchange gradient data only within communication domain 1 and do not exchange gradient data with NPU5 to NPU8 in communication domain 2. Similarly, NPU5 to NPU8 exchange gradient data only within communication domain 2 and do not exchange gradient data with the NPUs in communication domain 1. In this way, after completing gradient data exchange, gradient fusion and model parameter updating, each communication domain can directly execute the next round of model training without waiting for the other communication domains to complete their round of training.
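The per-domain exchange described above maps naturally onto the process-group abstraction of common deep learning frameworks. The following minimal sketch uses PyTorch's `torch.distributed` as a stand-in (this application does not prescribe a particular framework or collective library); the group layout and the averaging by group size are illustrative choices, not requirements of the method.

```python
import torch
import torch.distributed as dist

def setup_domains(domain_layout):
    """Create one process group per communication domain.
    domain_layout: list of rank lists, e.g. [[0, 1, 2, 3], [4, 5, 6, 7]].
    Every rank must call new_group() for every domain, but only keeps
    the group it belongs to."""
    my_rank = dist.get_rank()
    my_group, my_ranks = None, None
    for ranks in domain_layout:
        group = dist.new_group(ranks=ranks)
        if my_rank in ranks:
            my_group, my_ranks = group, ranks
    return my_group, my_ranks

def local_gradient_fusion(model, group, group_size):
    """One round of intra-domain gradient fusion: average the gradients
    of the devices in this communication domain only."""
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=group)
            param.grad /= group_size
```

In each intervening training round, a worker would call `local_gradient_fusion` after its backward pass and then step its optimizer, so that only the devices inside its own communication domain synchronize.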
The multiple processors within each communication domain can exchange gradient data based on any strategy. For ease of understanding and description, one of the multiple communication domains (hereinafter referred to as the target communication domain) is taken below as an example; processors in the remaining communication domains can exchange data by following a similar process. The new gradient data generated within each communication domain by performing gradient fusion on the gradient data of all its processors is referred to as local gradient data. Exemplarily, the multiple processors in the target communication domain can exchange data based on the following implementations.
In the first implementation example, the multiple processors in the target communication domain can exchange gradient data based on any one of the allgather, allreduce, ring-allreduce and halving-doubling allreduce strategies.
Taking the exchange of gradient data based on the allreduce strategy as an example, assume that the target communication domain includes 4 processors, namely processor 1, processor 2, processor 3 and processor 4, and that the gradient data on these four processors are gradient data a, gradient data b, gradient data c and gradient data d respectively, as shown in Figure 6. In the first exchange, processor 1 exchanges gradient data with processor 2 while processor 3 exchanges gradient data with processor 4. At this point, processor 1 and processor 2 can generate gradient data M by performing gradient fusion on gradient data a and gradient data b, and processor 3 and processor 4 can generate gradient data N by performing gradient fusion on gradient data c and gradient data d. In the second exchange, processor 1 exchanges gradient data with processor 3: specifically, processor 1 sends gradient data M to processor 3 and processor 3 sends gradient data N to processor 1. At the same time, processor 2 exchanges gradient data with processor 4. Processor 1 and processor 3 can then generate gradient data X by performing gradient fusion on gradient data M and gradient data N, where gradient data X is the data generated by gradient fusion of gradient data a, gradient data b, gradient data c and gradient data d. Likewise, processor 2 and processor 4 can also generate gradient data X by fusing gradient data M and gradient data N. In this way, after two exchanges, each processor obtains the gradient data X generated by gradient fusion of the gradient data of all processors in the target communication domain.
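Purely to illustrate the two-exchange pattern above, the following self-contained sketch simulates the four processors with plain Python lists and uses addition as the fusion operation; the concrete gradient values are made up for the example, and a real implementation may fuse gradients differently (for example by averaging).

```python
# Simulated gradients a, b, c, d held by processors 1-4 (toy values).
grads = {1: [1.0, 2.0], 2: [3.0, 4.0], 3: [5.0, 6.0], 4: [7.0, 8.0]}

def fuse(x, y):
    """Gradient fusion, here element-wise summation."""
    return [xi + yi for xi, yi in zip(x, y)]

# Exchange 1: (1,2) and (3,4) swap and fuse -> M and N.
m = fuse(grads[1], grads[2])          # gradient data M
n = fuse(grads[3], grads[4])          # gradient data N
grads[1] = grads[2] = m
grads[3] = grads[4] = n

# Exchange 2: (1,3) and (2,4) swap and fuse -> X on every processor.
x = fuse(m, n)                        # gradient data X
for p in grads:
    grads[p] = x

print(grads)  # every processor now holds X = a + b + c + d
```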
Further, the strategy for exchanging gradient data adopted within each communication domain can be configured by the user. For example, the model training apparatus may present, through the deep learning framework 201, the configuration interface shown in Figure 7 to the user, so that the user can configure an exchange strategy for each communication domain on that interface. Specifically, as shown in Figure 7, the configuration interface may provide multiple candidate exchange strategies for each communication domain, such as allgather, allreduce, ring-allreduce and halving-doubling allreduce, so that the user can configure one exchange strategy for each communication domain from the candidates; the exchange strategies adopted by different communication domains may be the same or different, which is not limited in this embodiment.
In the second implementation, the speeds at which different processors in the target communication domain train the AI model may differ. When the processors in the target communication domain exchange gradient data, a processor that has completed its AI model training can start exchanging gradient data first, without waiting for all the remaining processors in the target communication domain to also finish training, which improves the efficiency of exchanging gradient data among the multiple processors in the target communication domain.
Still taking a target communication domain with 4 processors as an example, processor 1 to processor 4 train the AI model in parallel. Assume that processor 2 is the first in the target communication domain to complete training of the AI model; processor 2 can then generate an activation message and use it to notify processor 1, processor 3 and processor 4 to start exchanging gradient data. In actual applications, based on the physical connections and communication rules between the processors, processor 2 may first send an activation message to processor 1 to notify it to start exchanging gradient data, then send an activation message to processor 4, and processor 1 may relay an activation message to processor 3, as shown in Figure 8. In this way, if processor 1 is the second to complete AI model training, processor 2 and processor 1 can directly exchange gradient data (and perform gradient data fusion). Then, if processor 3 is the third to complete training, processor 2 can exchange gradient data with processor 3; and when processor 4 also completes training, processor 2 exchanges gradient data with processor 4. In this way, processor 2 can obtain the gradient data generated by all processors in the target communication domain. Finally, processor 2 can send the local gradient data generated from the gradient data of the 4 processors to the remaining processors, so that the local gradient data can be used to update the parameters of the AI model on each processor, as shown in Figure 8.
Further, since each communication domain independently performs AI model training and gradient data fusion during the multiple rounds of training, each communication domain can limit the number of gradient data exchanges it performs, so as to prevent some communication domains from executing too many exchanges and thereby avoid asynchronous conflicts between the multiple communication domains. In a specific implementation, the processor in the target communication domain that finishes AI model training first can generate an activation message, which notifies the remaining processors in the target communication domain to start exchanging gradient data and which carries the version number of an activation operation, where the activation operation is used to trigger the exchange of gradient data between different processors in the target communication domain. In addition, that processor can obtain the version number of the currently executed exchange operation, where the exchange operation is the operation of exchanging gradient data between different devices in the target communication domain, and then compare the version number of the activation operation with the version number of the exchange operation. When the version number of the activation operation is greater than or equal to the version number of the exchange operation, the processor starts exchanging gradient data with the other processors; otherwise, the gradient data exchange is not performed among the multiple processors.
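One possible reading of this version-number check is sketched below; the field names, the way the versions are stored and the decision to simply skip the exchange are assumptions made for illustration, not details fixed by this application.

```python
class DomainSyncState:
    """Tracks how many activation and exchange operations a communication
    domain has seen, so that a fast domain cannot be driven into extra,
    conflicting exchange rounds."""
    def __init__(self):
        self.activation_version = 0   # version carried by activation messages
        self.exchange_version = 0     # version of the last completed exchange

    def on_training_round_finished(self):
        # The fastest device raises the activation version and notifies peers.
        self.activation_version += 1
        return {"type": "activate", "version": self.activation_version}

    def may_exchange(self):
        # Exchange only if the activation is not older than the last exchange.
        return self.activation_version >= self.exchange_version

    def on_exchange_done(self):
        self.exchange_version += 1

state = DomainSyncState()
msg = state.on_training_round_finished()
if state.may_exchange():
    # ... exchange gradient data with the other devices in the domain ...
    state.on_exchange_done()
```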
In the third implementation example, the multiple processors in the target communication domain can exchange gradient data through shared memory. In a specific implementation, the multiple processors in the target communication domain can be configured with a shared memory that all of them can access. In this way, after each processor in the target communication domain completes one round of training of the AI model and generates gradient data, it can write the gradient data to a designated area in the shared memory, so that the shared memory stores the gradient data respectively generated by the multiple processors. Each processor can then read, from the shared memory, the gradient data generated by all processors in the target communication domain and obtain the local gradient data by performing gradient fusion on that gradient data.
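As a rough illustration of this shared-memory variant, the sketch below uses Python's multiprocessing.shared_memory module and NumPy to give each device a fixed slot in a shared buffer; the slot layout, sizes and the averaging step are assumptions for the example only, not details prescribed here.

```python
import numpy as np
from multiprocessing import shared_memory

GRAD_SIZE = 4          # number of gradient elements per device (toy value)
NUM_DEVICES = 4        # devices in the target communication domain

# One device (or the host) creates the shared region once.
shm = shared_memory.SharedMemory(create=True, name="domain_grads",
                                 size=NUM_DEVICES * GRAD_SIZE * 8)
table = np.ndarray((NUM_DEVICES, GRAD_SIZE), dtype=np.float64, buffer=shm.buf)

def publish_gradient(device_idx, grad):
    """Each device writes its gradient into its designated slot."""
    table[device_idx, :] = grad

def fuse_local_gradient():
    """Any device can read all slots and fuse them (here: averaging)."""
    return table.mean(axis=0)

# Toy usage: every device publishes, then reads the fused result.
for i in range(NUM_DEVICES):
    publish_gradient(i, np.full(GRAD_SIZE, float(i + 1)))
print(fuse_local_gradient())   # -> [2.5 2.5 2.5 2.5]

del table                      # drop the NumPy view before releasing the region
shm.close()
shm.unlink()
```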
It should be noted that, in each round of training the AI model, the multiple processors in each communication domain can exchange gradient data and generate local gradient data in the manner described above. Moreover, the above implementations of exchanging gradient data within a communication domain are only examples; in other embodiments, the multiple devices in each communication domain may also exchange gradient data in other ways, which is not limited in this embodiment.
S304: Every time multiple rounds of distributed training of the AI model by the devices in the multiple communication domains have elapsed, update the AI model trained in each communication domain using global gradient data, where the global gradient data is obtained by performing gradient fusion on the gradient data respectively generated by the multiple processors in all communication domains.
Since each communication domain trains the AI model using only part of the training sample subsets, the inference performance (such as inference accuracy) of the AI model trained by a single communication domain usually has difficulty reaching the inference performance of an AI model trained on the full set of training samples. Therefore, in this embodiment, after multiple rounds of training of the AI model have been completed, gradient data can be exchanged between the multiple communication domains so that the AI model on each processor is updated based on the gradient data generated by all processors. Specifically, gradient fusion may be performed on the gradient data generated by all processors, and the new gradient data generated by this fusion (hereinafter referred to as global gradient data) may be used to update the parameters of the AI model on each processor. In this way, the inference performance of the finally trained AI model can usually reach the inference performance of the AI model trained on the full set of training samples.
In a possible implementation, each communication domain trains the AI model independently, and in each round the processors in each communication domain can count the current number of iterations of the AI model. If the current number of iterations is an integer multiple of a value T, then not only do the multiple processors in the communication domain exchange gradient data in the manner described above and generate the local gradient data corresponding to the communication domain through gradient fusion, but the communication domain also exchanges local gradient data with the other communication domains, so that each communication domain can obtain the local gradient data respectively generated by all communication domains. In this way, the global gradient data can be obtained by performing gradient fusion on the local gradient data generated by all communication domains, and the global gradient data is used to update the parameters of the AI model in each communication domain.
The way in which local gradient data is exchanged between multiple communication domains is similar to the way in which gradient data is exchanged between the multiple processors in each communication domain. For example, the multiple communication domains can exchange local gradient data based on any one of the allgather, allreduce, ring-allreduce and halving-doubling allreduce strategies; or the multiple communication domains can exchange local gradient data in the order in which they complete the (m*T)-th round of model training (m being a positive integer); or the multiple communication domains can exchange local gradient data based on a shared storage area, which is not limited in this embodiment.
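Continuing the torch.distributed sketch from step S303, the training-loop skeleton below shows one way the per-round local fusion and the periodic global fusion could be combined; the interval T, the averaging choices and the loop structure are illustrative assumptions rather than mandated details.

```python
import torch.distributed as dist

def train(model, optimizer, data_loader, domain_group, domain_size, T):
    """Distributed training loop: fuse gradients inside the communication
    domain every round, and across all domains every T rounds."""
    world_size = dist.get_world_size()
    for step, (inputs, labels) in enumerate(data_loader, start=1):
        optimizer.zero_grad()
        loss = model(inputs, labels)          # assumes the model returns a loss
        loss.backward()

        global_round = (step % T == 0)
        for param in model.parameters():
            if param.grad is None:
                continue
            if global_round:
                # S304: fuse across all devices in all communication domains.
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                param.grad /= world_size
            else:
                # S303: fuse only inside this device's communication domain.
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM,
                                group=domain_group)
                param.grad /= domain_size

        optimizer.step()
```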
In the above implementation, local gradient data is exchanged between different communication domains; in another possible implementation, the multiple communication domains can directly exchange the gradient data respectively generated by the individual processors.
For example, when the number of times the processors in each communication domain have iteratively trained the AI model is an integer multiple of the value T, one processor in each communication domain can aggregate the gradient data respectively generated by the processors in that communication domain to obtain a gradient data set corresponding to the communication domain, where the gradient data set includes the gradient data respectively generated by all processors in that communication domain; the processors responsible for aggregating gradient data in the multiple communication domains can then exchange their respective gradient data sets. Alternatively, when the number of times the processors in each communication domain have iteratively trained the AI model is an integer multiple of the value T, all processors participating in training the AI model can directly exchange the gradient data they respectively generate. In either case, each communication domain can obtain the gradient data respectively generated by the processors in all communication domains, so that the global gradient data can be obtained by performing gradient fusion on all the gradient data and used to update the parameters of the AI model in each communication domain.
In the above implementations, the multiple communication domains exchanging gradient data (or local gradient data) every (T-1) rounds is used as an example. In other embodiments, the number of model training rounds between successive exchanges among the multiple communication domains may not be a fixed value. For example, in the process of distributed training of the AI model, when each communication domain has iteratively trained the AI model 1000 times, the multiple communication domains exchange gradient data (or local gradient data) for the first time, with an interval of 1000 training rounds; then, when each communication domain has iteratively trained the AI model 1900 times, they exchange gradient data (or local gradient data) for the second time, with an interval of 900 training rounds; when each communication domain has reached 2700 iterations, they exchange for the third time, with an interval of 800 training rounds; when each communication domain has reached 3400 iterations, they exchange for the fourth time, with an interval of 700 training rounds; and so on.
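The decreasing synchronization interval in this example (1000, 900, 800, 700, ...) can be captured by a small schedule helper such as the one below; the concrete numbers and the decision to stop shrinking at some minimum interval are assumptions used only to mirror the example.

```python
def global_sync_steps(first_interval=1000, decrement=100, min_interval=100):
    """Yield the iteration counts at which all communication domains
    exchange gradient data, with the gap shrinking over time
    (1000, 1900, 2700, 3400, ... for the example values)."""
    step, interval = 0, first_interval
    while True:
        step += interval
        yield step
        interval = max(interval - decrement, min_interval)

schedule = global_sync_steps()
print([next(schedule) for _ in range(4)])   # -> [1000, 1900, 2700, 3400]
```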
It should be noted that, in this embodiment, the devices in the communication domains are described as processors for illustrative purposes; in other embodiments, the devices in the communication domains may also be chips or servers, and the specific implementation of distributed training of the AI model by such devices can be understood with reference to the relevant description of this embodiment and is not repeated here.
In this embodiment, the gradient data generated by all processors training the AI model is used to update the AI model only once every multiple rounds of model training, while in each intervening round each communication domain independently updates the AI model it trains using only the gradient data generated by its own processors. This alleviates the impact on the overall training progress of the AI model caused by slow training progress in some communication domains, that is, it improves the overall training efficiency of the AI model. For example, if the progress of communication domain 1 is delayed by 3 seconds in round 1 and the progress of communication domain 2 is delayed by 5 seconds in round 2, the overall progress is not delayed by 8 seconds but only by the single slowest delay, namely 5 seconds. Moreover, updating the AI model with global gradient data every multiple rounds of model training ensures that the training effect of the AI model reaches a relatively high level. On this basis, since the processors in different communication domains do not need to exchange gradient data in the intervening rounds, the communication resources consumed in training the AI model can be effectively reduced.
The specific implementation of distributed training of the AI model is described below with reference to a specific application scenario. In this application scenario, the system architecture described in Figure 1 can be deployed in a server that includes 4 CPUs and can be externally connected to 8 NPU chips, as shown in Figure 9, so that NPU1 to NPU8 in the server can be used to implement distributed training of the Pangu model (a type of AI model). In other embodiments, distributed training of the Pangu model may also be implemented based on NPU chips in multiple servers; the training method is similar to distributed training of the Pangu model using multiple NPUs in one server and can be understood by reference.
In the server shown in Figure 9, each CPU can support 8 double data rate 4 dual inline memory modules (DDR4 DIMMs), and CPU1 to CPU4 can be fully interconnected (full mesh). The CPUs in the server can provide a bandwidth capability of 90 GB/s (gigabytes per second), where each CPU can provide a unidirectional bandwidth of 30 GB/s and a bidirectional bandwidth of 60 GB/s.
Among the 8 NPU chips externally connected to the server, NPU1 to NPU4 can be fully interconnected and located on one NPU board, and NPU5 to NPU8 can be fully interconnected and located on another NPU board. Moreover, there are connections between the 8 externally connected NPU chips and the CPUs, for example based on a peripheral component interconnect express (PCIE) bus (only part of the connections between the NPUs and the CPUs are shown in Figure 9), so that NPU1 to NPU4 can exchange data with NPU5 to NPU8 through the CPUs in the server. Each NPU board can provide a bandwidth capability of 90 GB/s, where each NPU can provide a unidirectional bandwidth of 30 GB/s and a bidirectional bandwidth of 60 GB/s.
Based on the server shown in Figure 9, distributed training of the Pangu model can be implemented; the distributed training process is shown in Figure 10. The user can provide a training script to the server, where the training script can include the file of the Pangu model, specify that NPU1 to NPU8 are used to train the Pangu model, and define that NPU1 to NPU4 belong to communication domain 1 and NPU5 to NPU8 belong to communication domain 2.
In this way, the CPU on the host side of the server can parse the Pangu model to be trained from the training script, and determine the multiple NPUs used for this distributed training of the Pangu model and the communication domain to which each NPU belongs.
Then, the CPU can extract a computation graph according to the training script, where the computation graph includes multiple nodes connected by edges. The nodes in the computation graph indicate the computations defined in the training script, and the edges between the nodes indicate the dependencies between different computations. The extracted computation graph can be saved to a trans-flash card.
Next, the CPU can compile the computation graph in the flash card to generate an intermediate representation (IR) and provide the IR to a compiler. The compiler can define one or more operator libraries, such as the neural network (NN) operator library and the Huawei collective communication library (HCCL) operator library shown in Figure 10. Exemplarily, the NN operator library may include convolution layer operators, pooling layer operators, loss functions and the like; the HCCL operator library may include operators used to define data communication methods, such as the allreduce operator and the allgather operator.
In this way, the CPU can use the compiler to determine the operators that need to be executed in sequence for distributed training of the Pangu model, generate corresponding device instructions accordingly, and deliver the device instructions to the NPUs on the device side.
NPU1 to NPU8 on the device side can, based on the device instructions delivered by the host side, cyclically execute the corresponding operators and perform gradient updates on the Pangu model until the iteration termination condition is met, thereby implementing distributed training of the Pangu model. In the process of distributed training of the Pangu model, NPU1 to NPU4 in communication domain 1 and NPU5 to NPU8 in communication domain 2 each train the Pangu model separately, and communication domain 1 and communication domain 2 exchange the gradient data generated by training the Pangu model every (T-1) rounds of model training, so as to perform a global gradient update on the Pangu model. For the specific training process, refer to the relevant description of the embodiment shown in Figure 3 above, which is not repeated here.
Finally, after completing the distributed training of the Pangu model, the device side can send the training result to the host side, where the training result may include, for example, the trained Pangu model and attribute information of the Pangu model (such as inference accuracy).
The model training method provided by the present application has been described in detail above with reference to Figures 1 to 10. The model training apparatus and the computing device provided by the present application are described below with reference to Figures 11 and 12.
Based on the same inventive concept as the above method, an embodiment of the present application further provides a model training apparatus. Figure 11 is a schematic structural diagram of a model training apparatus provided by an embodiment of the present application. The model training apparatus 1100 shown in Figure 11 may, for example, be the model training apparatus mentioned in the embodiment shown in Figure 3 above. As shown in Figure 11, the model training apparatus 1100 includes:
an acquisition module 1101, configured to acquire the AI model to be trained;
a determination module 1102, configured to determine multiple communication domains, where each of the multiple communication domains includes multiple devices;
an update module 1103, configured to: in each round of distributed training of the AI model using the devices in the multiple communication domains, update the AI model trained in each communication domain using the local gradient data corresponding to that communication domain, where the local gradient data corresponding to each communication domain is obtained by performing gradient fusion on the gradient data respectively generated by the multiple devices in that communication domain; and, when multiple rounds of distributed training of the AI model using the devices in the multiple communication domains have elapsed, update the AI model respectively trained in each communication domain using global gradient data, where the global gradient data is obtained by performing gradient fusion on the gradient data respectively generated by the multiple devices in the multiple communication domains.
In a possible implementation, the update module 1103 is configured to:
exchange, between multiple devices in a target communication domain, the gradient data they respectively generate, where the target communication domain is one of the multiple communication domains;
perform, by the target communication domain, gradient fusion according to the gradient data exchanged between the multiple devices, to generate local gradient data corresponding to the target communication domain; and
update, by the target communication domain, the AI model trained by the target communication domain using the local gradient data corresponding to the target communication domain.
In a possible implementation, the update module 1103 is configured to:
obtain the version number of an activation operation corresponding to the target communication domain and the version number of an exchange operation, where the activation operation is used to trigger the exchange of gradient data between different devices in the target communication domain, and the exchange operation is the operation of exchanging gradient data between different devices in the target communication domain; and
when the version number of the activation operation is greater than or equal to the version number of the exchange operation, cause the multiple devices in the target communication domain to exchange the gradient data they respectively generate.
In a possible implementation, the physical connection between the multiple devices in the target communication domain is a ring connection.
In a possible implementation, the determination module 1102 is configured to:
obtain a device topology relationship, where the device topology relationship indicates the connection relationship between the multiple devices used to train the AI model; and
divide, according to the device topology relationship, the multiple devices used to train the AI model to obtain the multiple communication domains, where the communication rate between different devices in each communication domain is higher than the communication rate between devices in different communication domains.
In a possible implementation, the determination module 1102 is configured to:
generate a first configuration interface, where the first configuration interface is used to present to the user the identifiers of multiple devices available for training the AI model; and
determine, in response to a first configuration operation of the user, the communication domain to which each of the multiple devices used to train the AI model belongs.
In a possible implementation, the determination module 1102 is further configured to:
generate a second configuration interface before the AI model is trained, where the second configuration interface is used to present multiple exchange strategies to the user, and each of the multiple exchange strategies indicates one way of exchanging gradient data between the multiple devices in a communication domain; and
determine, in response to a second configuration operation of the user on the multiple exchange strategies, the way in which gradient data is exchanged between the multiple devices in each communication domain.
In a possible implementation, the devices in different communication domains are located on the same computing node, or the devices in the multiple communication domains are located on different computing nodes.
In a possible implementation, the devices in each communication domain include processors, chips or servers.
The model training apparatus 1100 shown in Figure 11 corresponds to the model training apparatus in the embodiment shown in Figure 3; therefore, for the specific implementation of each functional module in the model training apparatus 1100 and the technical effects it achieves, refer to the relevant description of the foregoing embodiment, which is not repeated here.
An embodiment of the present application further provides a computing device. As shown in Figure 12, the computing device 1200 may include a communication interface 1210 and a processor 1220. Optionally, the computing device 1200 may further include a memory 1230, which may be arranged inside or outside the computing device 1200. Exemplarily, each action performed by the model training apparatus in the embodiment shown in Figure 3 above may be implemented by the processor 1220. The processor 1220 may obtain the AI model to be trained and the multiple communication domains through the communication interface 1210 and is configured to implement the method executed in Figure 3. During implementation, each step of the processing flow may complete the method executed in Figure 3 through integrated logic circuits of hardware in the processor 1220 or through instructions in the form of software. For brevity, details are not repeated here. The program code executed by the processor 1220 to implement the above method may be stored in the memory 1230, and the memory 1230 is connected to the processor 1220, for example, by coupling.
Some features of the embodiments of the present application may be implemented/supported by the processor 1220 executing program instructions or software code in the memory 1230. The software components loaded on the memory 1230 may be summarized functionally or logically.
Any communication interface involved in the embodiments of the present application, such as the communication interface 1210 in the computing device 1200, may be a circuit, a bus, a transceiver or any other apparatus that can be used for information exchange; exemplarily, the other apparatus may be a device connected to the computing device 1200.
The processor involved in the embodiments of the present application may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in conjunction with the embodiments of the present application may be directly executed by a hardware processor, or by a combination of hardware and software modules in the processor.
The coupling in the embodiments of the present application is an indirect coupling or communication connection between apparatuses or modules, which may be in electrical, mechanical or other forms, and is used for information exchange between the apparatuses or modules.
The processor may operate in cooperation with the memory. The memory may be a non-volatile memory, such as a hard disk drive or a solid-state drive, or a volatile memory, such as a random access memory. The memory is any medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The embodiments of the present application do not limit the specific connection medium between the above communication interface, processor and memory. For example, the memory, the processor and the communication interface may be connected by a bus, and the bus may be divided into an address bus, a data bus, a control bus, and so on.
Based on the above embodiments, an embodiment of the present application further provides a computer storage medium storing a software program which, when read and executed by one or more processors, can implement the method performed by the model training apparatus provided by any one or more of the above embodiments. The computer storage medium may include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disc.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, an apparatus, a system, a storage medium or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The terms "first", "second" and the like in the description, claims and above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the terms used in this way are interchangeable where appropriate; this is merely the manner of distinguishing objects with the same attributes adopted when describing the embodiments of the present application.
Obviously, those skilled in the art can make various changes and modifications to the embodiments of the present application without departing from the scope of the embodiments of the present application. Thus, if these modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to cover these modifications and variations.

Claims (22)

  1. A model training method, characterized in that the method comprises:
    obtaining an AI model to be trained;
    determining a plurality of communication domains, wherein each of the plurality of communication domains comprises a plurality of devices;
    in each round of distributed training of the AI model by the devices in the plurality of communication domains, updating the AI model trained by each communication domain by using local gradient data corresponding to that communication domain, wherein the local gradient data corresponding to each communication domain is obtained by gradient fusion of gradient data respectively generated by the plurality of devices in that communication domain;
    at intervals of multiple rounds of distributed training of the AI model by the devices in the plurality of communication domains, updating the AI model respectively trained by each communication domain by using global gradient data, wherein the global gradient data is obtained by gradient fusion of gradient data respectively generated by the plurality of devices in the plurality of communication domains.
  2. The method according to claim 1, characterized in that updating the AI model trained by each communication domain by using the local gradient data corresponding to that communication domain comprises:
    exchanging, among a plurality of devices in a target communication domain, the gradient data respectively generated by the devices, wherein the target communication domain is one of the plurality of communication domains;
    performing, by the target communication domain, gradient fusion based on the gradient data exchanged among the plurality of devices, to generate local gradient data corresponding to the target communication domain;
    updating, by the target communication domain, the AI model trained by the target communication domain by using the local gradient data corresponding to the target communication domain.
  3. The method according to claim 2, characterized in that exchanging, among the plurality of devices in the target communication domain, the gradient data respectively generated by the devices comprises:
    obtaining a version number of an activation operation corresponding to the target communication domain and a version number of an exchange operation, wherein the activation operation is used to trigger the exchange of gradient data among different devices in the target communication domain, and the exchange operation is the operation of exchanging gradient data among different devices in the target communication domain;
    when the version number of the activation operation is greater than or equal to the version number of the exchange operation, exchanging, among the plurality of devices in the target communication domain, the gradient data respectively generated by the devices.
  4. The method according to claim 2 or 3, characterized in that the physical connection between the plurality of devices in the target communication domain is a ring connection.
  5. The method according to any one of claims 1 to 4, characterized in that determining the plurality of communication domains comprises:
    obtaining a device topology relationship, wherein the device topology relationship indicates connection relationships between the plurality of devices used to train the AI model;
    dividing, based on the device topology relationship, the plurality of devices used to train the AI model to obtain the plurality of communication domains, wherein the communication rate between different devices within each communication domain is higher than the communication rate between devices in different communication domains.
  6. The method according to any one of claims 1 to 4, characterized in that determining the plurality of communication domains comprises:
    generating a first configuration interface, wherein the first configuration interface is used to present, to a user, identifiers of the plurality of devices used to train the AI model;
    in response to a first configuration operation of the user, determining the communication domain to which each of the plurality of devices used to train the AI model belongs.
  7. The method according to any one of claims 1 to 6, characterized in that, before the AI model is trained, the method further comprises:
    generating a second configuration interface, wherein the second configuration interface is used to present a plurality of exchange strategies to a user, and each of the plurality of exchange strategies is used to indicate one manner of exchanging gradient data among the plurality of devices in a communication domain;
    in response to a second configuration operation of the user on the plurality of exchange strategies, determining the manner of exchanging gradient data among the plurality of devices in each communication domain.
  8. The method according to any one of claims 1 to 7, characterized in that devices in different communication domains are located on a same computing node, or the devices in the plurality of communication domains are respectively located on different computing nodes.
  9. The method according to any one of claims 1 to 8, characterized in that the devices in each communication domain comprise a processor, a chip, or a server.
  10. A model training apparatus, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain an AI model to be trained;
    a determining module, configured to determine a plurality of communication domains, wherein each of the plurality of communication domains comprises a plurality of devices;
    an updating module, configured to: in each round of distributed training of the AI model by the devices in the plurality of communication domains, update the AI model trained by each communication domain by using local gradient data corresponding to that communication domain, wherein the local gradient data corresponding to each communication domain is obtained by gradient fusion of gradient data respectively generated by the plurality of devices in that communication domain; and, at intervals of multiple rounds of distributed training of the AI model by the devices in the plurality of communication domains, update the AI model respectively trained by each communication domain by using global gradient data, wherein the global gradient data is obtained by gradient fusion of gradient data respectively generated by the plurality of devices in the plurality of communication domains.
  11. The apparatus according to claim 10, characterized in that the updating module is configured to:
    exchange, among a plurality of devices in a target communication domain, the gradient data respectively generated by the devices, wherein the target communication domain is one of the plurality of communication domains;
    perform, by the target communication domain, gradient fusion based on the gradient data exchanged among the plurality of devices, to generate local gradient data corresponding to the target communication domain;
    update, by the target communication domain, the AI model trained by the target communication domain by using the local gradient data corresponding to the target communication domain.
  12. The apparatus according to claim 11, characterized in that the updating module is configured to:
    obtain a version number of an activation operation corresponding to the target communication domain and a version number of an exchange operation, wherein the activation operation is used to trigger the exchange of gradient data among different devices in the target communication domain, and the exchange operation is the operation of exchanging gradient data among different devices in the target communication domain;
    when the version number of the activation operation is greater than or equal to the version number of the exchange operation, exchange, among the plurality of devices in the target communication domain, the gradient data respectively generated by the devices.
  13. The apparatus according to claim 11 or 12, characterized in that the physical connection between the plurality of devices in the target communication domain is a ring connection.
  14. The apparatus according to any one of claims 10 to 13, characterized in that the determining module is configured to:
    obtain a device topology relationship, wherein the device topology relationship indicates connection relationships between the plurality of devices used to train the AI model;
    divide, based on the device topology relationship, the plurality of devices used to train the AI model to obtain the plurality of communication domains, wherein the communication rate between different devices within each communication domain is higher than the communication rate between devices in different communication domains.
  15. The apparatus according to any one of claims 10 to 13, characterized in that the determining module is configured to:
    generate a first configuration interface, wherein the first configuration interface is used to present, to a user, identifiers of the plurality of devices used to train the AI model;
    in response to a first configuration operation of the user, determine the communication domain to which each of the plurality of devices used to train the AI model belongs.
  16. The apparatus according to any one of claims 10 to 15, characterized in that the determining module is further configured to:
    before the AI model is trained, generate a second configuration interface, wherein the second configuration interface is used to present a plurality of exchange strategies to a user, and each of the plurality of exchange strategies is used to indicate one manner of exchanging gradient data among the plurality of devices in a communication domain;
    in response to a second configuration operation of the user on the plurality of exchange strategies, determine the manner of exchanging gradient data among the plurality of devices in each communication domain.
  17. The apparatus according to any one of claims 10 to 16, characterized in that devices in different communication domains are located on a same computing node, or the devices in the plurality of communication domains are respectively located on different computing nodes.
  18. The apparatus according to any one of claims 10 to 17, characterized in that the devices in each communication domain comprise a processor, a chip, or a server.
  19. A model training system, characterized in that the model training system comprises a plurality of devices, and the model training system is configured to perform the method according to any one of claims 1 to 9.
  20. A computing device, characterized in that the computing device comprises a processor and a memory;
    the processor is configured to execute instructions stored in the memory, so that the computing device performs the method according to any one of claims 1 to 9.
  21. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions which, when run on at least one computing device, cause the at least one computing device to perform the method according to any one of claims 1 to 9.
  22. A computer program product containing instructions which, when run on at least one computing device, cause the at least one computing device to perform the method according to any one of claims 1 to 9.
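
For illustration only — the sketch below is not part of the claims or of the original disclosure — the training loop recited in claims 1 to 3 can be approximated in plain Python. All names (Device, CommunicationDomain, global_sync_interval) are hypothetical, the per-device gradient computation is simulated with NumPy instead of a real accelerator, and in-process averaging stands in for a collective-communication primitive; treating the global fusion as replacing the local fusion every K rounds is one possible reading of claim 1.

```python
# Illustrative sketch only: simulates the hierarchical gradient fusion of claims 1-3.
# All names are hypothetical; a real system would use a collective-communication
# library (e.g. a ring all-reduce) instead of averaging arrays in one process.
import numpy as np


class Device:
    """One training device holding a replica of the AI model parameters."""

    def __init__(self, param_size, rng):
        self.params = np.zeros(param_size)
        self.rng = rng

    def compute_gradient(self):
        # Stand-in for forward/backward on this device's local sample subset.
        return self.rng.normal(size=self.params.shape)


class CommunicationDomain:
    """A group of devices whose mutual links are fast (claim 1)."""

    def __init__(self, devices):
        self.devices = devices
        self.activation_version = 0  # version of the triggering activation operation (claim 3)
        self.exchange_version = 0    # version of the gradient-exchange operation (claim 3)

    def exchange_and_fuse(self, grads):
        # In this sketch the activation is issued inline; in practice it would be
        # triggered by separate control logic before the exchange is attempted.
        self.activation_version += 1
        if self.activation_version < self.exchange_version:
            return None  # exchange not yet activated, skip this round
        self.exchange_version += 1
        # Gradient fusion = element-wise averaging of the devices' gradients.
        return np.mean(grads, axis=0)


def train(domains, rounds, global_sync_interval, lr=0.1):
    for step in range(1, rounds + 1):
        per_domain_grads = [(d, [dev.compute_gradient() for dev in d.devices])
                            for d in domains]
        if step % global_sync_interval == 0:
            # Every K rounds: fuse across all devices of all domains (global gradient data).
            all_grads = [g for _, grads in per_domain_grads for g in grads]
            global_grad = np.mean(all_grads, axis=0)
            for domain, _ in per_domain_grads:
                for dev in domain.devices:
                    dev.params -= lr * global_grad
        else:
            # Other rounds: fuse only within each communication domain (local gradient data).
            for domain, grads in per_domain_grads:
                local_grad = domain.exchange_and_fuse(grads)
                if local_grad is not None:
                    for dev in domain.devices:
                        dev.params -= lr * local_grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    domains = [CommunicationDomain([Device(4, rng) for _ in range(2)]) for _ in range(2)]
    train(domains, rounds=8, global_sync_interval=4)
    print(domains[0].devices[0].params)
```

In an actual deployment, the intra-domain np.mean would typically be replaced by a ring all-reduce over the ring-connected devices of claim 4, and the cross-domain fusion by a second, slower collective spanning the communication domains.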
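Likewise, a minimal sketch of the topology-based partitioning of claim 5, assuming a simplified topology format in which each device is described by a (device_id, node_id) pair and devices sharing a node communicate faster with each other than across nodes — both assumptions are illustrative simplifications, not requirements of the claim:

```python
# Illustrative sketch only: one way to derive communication domains from a device
# topology (claim 5). The topology format and the "same node => faster link"
# rule are hypothetical simplifications.
from collections import defaultdict
from typing import Dict, List, Tuple


def partition_into_domains(topology: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group devices into communication domains, one domain per computing node."""
    domains: Dict[str, List[str]] = defaultdict(list)
    for device_id, node_id in topology:
        domains[node_id].append(device_id)
    return dict(domains)


if __name__ == "__main__":
    topo = [("npu0", "node0"), ("npu1", "node0"),
            ("npu2", "node1"), ("npu3", "node1")]
    print(partition_into_domains(topo))
    # {'node0': ['npu0', 'npu1'], 'node1': ['npu2', 'npu3']}
```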
PCT/CN2023/101224 2022-06-29 2023-06-20 Model training method, apparatus and system, and related device WO2024001861A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210760755.6 2022-06-29
CN202210760755 2022-06-29
CN202211148350.3A CN117312839A (en) 2022-06-29 2022-09-20 Model training method, device, system and related equipment
CN202211148350.3 2022-09-20

Publications (1)

Publication Number Publication Date
WO2024001861A1 (en)

Family

ID=89236100

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101224 WO2024001861A1 (en) 2022-06-29 2023-06-20 Model training method, apparatus and system, and related device

Country Status (2)

Country Link
CN (1) CN117312839A (en)
WO (1) WO2024001861A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210158558A (en) * 2020-06-24 2021-12-31 대구대학교 산학협력단 Edge computing system using ai block-type module architecture
CN114221736A (en) * 2020-09-04 2022-03-22 华为技术有限公司 Data processing method, device, equipment and medium
CN113656175A (en) * 2021-08-18 2021-11-16 北京百度网讯科技有限公司 Method, apparatus and program product for training models based on distributed systems
CN113867959A (en) * 2021-09-29 2021-12-31 苏州浪潮智能科技有限公司 Training task resource scheduling method, device, equipment and medium
CN114579311A (en) * 2022-03-04 2022-06-03 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for executing distributed computing task

Also Published As

Publication number Publication date
CN117312839A (en) 2023-12-29

Similar Documents

Publication Publication Date Title
US11487589B2 (en) Self-adaptive batch dataset partitioning for distributed deep learning using hybrid set of accelerators
CN112997138A (en) Artificial intelligence enabled management of storage media access
US9424079B2 (en) Iteration support in a heterogeneous dataflow engine
US11294599B1 (en) Registers for restricted memory
JP7451614B2 (en) On-chip computational network
US11900113B2 (en) Data flow processing method and related device
US11494681B1 (en) Quantum instruction compiler for optimizing hybrid algorithms
US11281967B1 (en) Event-based device performance monitoring
US11694075B2 (en) Partitioning control dependency edge in computation graph
WO2023142502A1 (en) Loop instruction processing method and apparatus, and chip, electronic device, and storage medium
KR20110028212A (en) Autonomous subsystem architecture
WO2022057310A1 (en) Method, apparatus and system for training graph neural network
US11416749B2 (en) Execution synchronization and tracking
US20210158131A1 (en) Hierarchical partitioning of operators
CN116151363B (en) Distributed Reinforcement Learning System
KR20200053318A (en) System managing calculation processing graph of artificial neural network and method managing calculation processing graph using thereof
US11275661B1 (en) Test generation of a distributed system
US11461622B2 (en) Dynamic code loading for multiple executions on a sequential processor
CN113011553A (en) Accelerator, method of operating an accelerator, and apparatus including an accelerator
US11308396B2 (en) Neural network layer-by-layer debugging
US10684834B2 (en) Method and apparatus for detecting inter-instruction data dependency
WO2024001861A1 (en) Model training method, apparatus and system, and related device
US9189448B2 (en) Routing image data across on-chip networks
EP4377844A1 (en) Mixing sparsity compression
US20210373790A1 (en) Inference in memory

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23830041

Country of ref document: EP

Kind code of ref document: A1