WO2024094058A1 - Model training method and related apparatus - Google Patents


Info

Publication number
WO2024094058A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
model
training
training device
type
Prior art date
Application number
PCT/CN2023/129042
Other languages
English (en)
French (fr)
Inventor
张琦
潘邵武
吴天诚
张昭举
刘璐
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2024094058A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present application relates to the field of artificial intelligence (AI) technology, and in particular to a model training method and related devices.
  • Machine learning technology can automatically mine the information contained in the data. Therefore, machine learning models trained with a large amount of data have been applied in various scenarios, such as face recognition, voice translation, and medical auxiliary diagnosis. In practical applications, the accuracy and generalization ability of machine learning models are crucial, and these rely on the use of a large amount of data to train machine learning models.
  • Federated Learning is a distributed machine learning technology. Its core idea is to conduct distributed model training among multiple data sources with local data. Without the need to exchange local data, it only exchanges intermediate results to build a global model based on fused data, thereby achieving a balance between data privacy protection and data sharing computing.
  • the present application provides a model training method that can shorten the idle waiting time during the entire iterative training process and improve the overall training efficiency.
  • the first aspect of the present application provides a model training method, which is applied to the field of artificial intelligence technology.
  • the method includes: a first training device obtains a training data set, and the training data set includes multiple data.
  • the multiple data obtained by the first training device can be a batch of data, that is, the number of multiple data is the same as the batch size of the model.
  • the batch size indicates the amount of data samples used to train the model at the same time, that is, the amount of data samples that the model needs to use in one iterative training process.
  • the first training device divides the plurality of data into a plurality of groups of data, each of the plurality of groups of data includes at least one data, and the amount of each group of data is related to the communication capability of the first training device. For example, the amount of each group of data is positively correlated with the communication capability of the first training device. That is, the stronger the communication capability of the first training device, the larger the amount of each group of data.
  • Based on the model deployed in the first training device, the first training device processes the multiple groups of data in batches, group by group, to train the model, and transmits the data obtained by processing the multiple groups of data to the second training device in batches.
  • the second training device and the first training device jointly participate in the training of the model. Specifically, multiple parts of data will be obtained by processing multiple groups of data through the model, and each group of data corresponds to a unique part of data.
  • each part of the multiple parts of data in this step is also transmitted to the second training device in batches.
  • multiple data in the training process are divided into multiple groups of data, and the data volume of each group of data is related to the communication capability of the training device.
  • the training device processes multiple groups of data in batches through the model and transmits multiple parts of data obtained by processing multiple groups of data to other training devices in batches to perform model training.
  • While the training device transmits the data obtained by processing the previous group of data to other training devices, it can continue to process the subsequent group of data. This hides the model training time within the communication time between the training devices, which shortens the idle waiting time in the entire iterative training process and improves the overall training efficiency.
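The effect of overlapping computation with communication can be illustrated with a small timing model. This is only a sketch under simplifying assumptions that are not in the application: a single link that sends one part at a time, fixed per-group compute and transfer times, and hypothetical function names and numbers.

```python
def serial_time(compute, comm):
    """Total time when the whole batch is processed first, then transmitted."""
    return sum(compute) + sum(comm)


def pipelined_time(compute, comm):
    """Total time when the part from group i is sent while group i+1 is processed."""
    compute_done = 0.0  # time at which the current group finishes processing
    link_free = 0.0     # time at which the link finishes the previous transfer
    for c, t in zip(compute, comm):
        compute_done += c
        # A transfer starts once its data exists and the link is idle.
        link_free = max(compute_done, link_free) + t
    return link_free


# Hypothetical numbers: 4 groups, 2 time units of compute and 3 of transfer each.
compute = [2.0, 2.0, 2.0, 2.0]
comm = [3.0, 3.0, 3.0, 3.0]
print(serial_time(compute, comm))     # 20.0
print(pipelined_time(compute, comm))  # 14.0
```

With a single group the two functions return the same value, which corresponds to the ungrouped case.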
  • the operations performed during the training model include a first type of operation and a second type of operation.
  • the first type of operation is used to generate multiple parts of data transmitted to the second training device based on multiple sets of data, that is, the first type of operation is an operation that relies on multiple sets of data to generate data to be transmitted.
  • the second type of operation is used to generate data that is only processed by the first training device, that is, executing the second type of operation does not affect the generation of the data to be transmitted.
  • the execution priority of the first type of operation is higher than the execution priority of the second type of operation. That is, during the training process, when there are both first type of operations and second type of operations that can be executed at the same time, the first training device will prioritize the execution of the first type of operation. After all the first type of operations are executed, the first training device will execute the second type of operation.
  • the training device gives priority to executing the first type of operations so that it can continuously generate data to be transmitted and avoid the training device being in a communication idle state as much as possible, so as to ensure that the time for communication and training to be carried out in parallel is as long as possible, thereby shortening the idle waiting time in the entire iteration process and improving the overall training efficiency.
  • the first type of operation includes a first sub-type of operation and a second sub-type of operation
  • the operation result of the first sub-type of operation is used to obtain the input of the second sub-type of operation
  • the operation result of the second sub-type of operation is used to transmit to the second training device.
  • the operation result obtained by the first training device when executing the second sub-type of operation is the data that needs to be transmitted to the second training device
  • the operation result obtained by the first training device when executing the first sub-type of operation is used as the input of the second sub-type of operation, that is, the execution of the second sub-type of operation depends on the first sub-type of operation.
  • Only after the first sub-type of operation has been executed can the data used as the input of the second sub-type of operation be obtained, and only then can the second sub-type of operation be executed.
  • The execution priority of the second sub-type of operation is higher than that of the first sub-type of operation.
  • The first training device must first execute the first sub-type of operation and obtain its result before it can execute the second sub-type of operation on that result. That is, the second sub-type of operation can be executed only after its operating conditions are met. Once those conditions are met, however, the first training device gives priority to the second sub-type of operation so as to generate the data to be transmitted to the second training device as soon as possible.
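The priority ordering described above (second sub-type first, then first sub-type, then second type) could be sketched with a priority queue. The class name, the operation labels, and the rule that only ready operations are submitted are illustrative assumptions, not details from the application.

```python
import heapq

# Smaller number = higher priority (a hypothetical encoding of the ordering above).
PRIORITY = {
    "first_type/second_sub_type": 0,  # produces data to transmit
    "first_type/first_sub_type": 1,   # produces inputs for the second sub-type
    "second_type": 2,                 # local-only work
}


class Scheduler:
    """Holds operations that are ready to run and returns the most urgent one."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker that keeps FIFO order within one priority

    def submit(self, kind, name):
        heapq.heappush(self._heap, (PRIORITY[kind], self._seq, name))
        self._seq += 1

    def next_op(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


sched = Scheduler()
sched.submit("second_type", "local_backward_g0")
sched.submit("first_type/first_sub_type", "forward_g1")
sched.submit("first_type/second_sub_type", "pack_output_g0")
order = [sched.next_op() for _ in range(3)]
print(order)  # ['pack_output_g0', 'forward_g1', 'local_backward_g0']
```

In this sketch a second sub-type operation is submitted only after the first sub-type operation it depends on has finished, so the dependency is handled outside the queue.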
  • the first training device caches the data generated by executing the first type of operation in a first queue, and the first queue is used to cache the data to be transmitted to the second training device.
  • When a large amount of data has accumulated in the first queue, the first training device stops executing the first type of operation and executes the second type of operation.
  • A large amount of data in the first queue means that the first training device sends data to the second training device far more slowly than it generates data by performing the first type of operation. In that case, continuing to prioritize the first type of operation will not improve the overall training efficiency. To avoid excessive memory overhead, the first training device can switch to performing the second type of operation, so that it does not generate too much data to be transmitted and accumulate it in memory.
  • When the first queue is no longer backed up, or when the data used to support the execution of the second type of operation has been processed, the first training device stops executing the second type of operation and continues to execute the first type of operation.
  • the data used to support the execution of the second type of operation may refer to data used as input for the second type of operation, i.e., input data for the second type of operation.
  • the first training device may continue to preferentially perform the first type of operation to continuously generate data to be transmitted to the second training device to ensure the continuity of data communication. Moreover, after the first training device has processed the input data of the second type of operation, the data used to support the execution of the second type of operation has been processed, and the first training device cannot continue to perform the second type of operation, so the first training device switches to continue to perform the first type of operation.
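The switching behaviour between the two types of operation resembles back-pressure with two watermarks. The sketch below assumes two hypothetical thresholds on the length of the first queue; the threshold values and the function name are illustrative only and are not taken from the application.

```python
HIGH_WATERMARK = 3  # hypothetical: leave first-type work when this many items are queued
LOW_WATERMARK = 1   # hypothetical: resume first-type work at or below this level


def choose_mode(current_mode, queue_len, second_type_inputs):
    """Return 'first' or 'second': the type of operation to run next."""
    if current_mode == "first":
        # Too much untransmitted data: switch to local work if any exists.
        if queue_len >= HIGH_WATERMARK and second_type_inputs > 0:
            return "second"
        return "first"
    # Currently running second-type operations.
    if queue_len <= LOW_WATERMARK or second_type_inputs == 0:
        return "first"
    return "second"


print(choose_mode("first", 3, 2))   # second
print(choose_mode("second", 2, 2))  # second
print(choose_mode("second", 1, 2))  # first
print(choose_mode("second", 2, 0))  # first
```

Using two different thresholds keeps the device from flipping between modes on every queue change; a single threshold would also satisfy the behaviour described in the text.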
  • the training in which the first training device participates is federated learning, such as horizontal federated learning, vertical federated learning, or federated transfer learning.
  • the first type of operation includes operations for processing multiple groups of data based on the model
  • the second type of operation includes operations for performing reverse gradient calculations on the model based on data obtained from the second training device.
  • the model includes a first sub-model and a second sub-model; the first type of operation includes operations for processing multiple groups of data based on the first sub-model, and operations for processing the second sub-model based on data obtained from a second training device; the second type of operation includes operations for performing reverse gradient calculations on the first sub-model.
  • the number of the multiple data is related to the batch size of the model; during the training process of the model, the target gradient is related to the multiple gradients obtained based on the multiple sets of data, and the target gradient is used by the first training device to update the model. For example, when multiple gradients are obtained based on the multiple sets of data, the target gradient is obtained by averaging the multiple gradients.
  • This solution obtains the target gradient from the multiple gradients and then updates the model based on the target gradient. This ensures that the model is still updated on the basis of a full batch of data, so that the accuracy of model training is not affected.
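A small numeric check shows why averaging the per-group gradients can reproduce the full-batch gradient. The sketch assumes groups of equal size and a hypothetical one-parameter model y = w * x with a mean squared error loss; none of these specifics come from the application.

```python
def grad(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Gradient computed on the whole batch at once.
full_batch = grad(w, xs, ys)

# Gradients computed per group, then averaged into the target gradient.
groups = [(xs[0:2], ys[0:2]), (xs[2:4], ys[2:4])]
group_grads = [grad(w, gx, gy) for gx, gy in groups]
target = sum(group_grads) / len(group_grads)

print(full_batch, target)  # -22.5 -22.5
```

If the groups had different sizes, a plain average would no longer match the full-batch gradient and the per-group gradients would need to be weighted by group size.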
  • the number of multiple data is related to the batch size of the model, the number of multiple groups of data is positively correlated with the amount of data to be transmitted, and the number of multiple groups of data is negatively correlated with the communication capability of the first training device and the training duration; wherein the amount of data to be transmitted is the amount of data to be transmitted generated after processing the multiple groups of data, and the training duration is the duration for the first training device to train the model based on the multiple groups of data.
  • The larger the amount of data to be transmitted that is generated by processing the multiple groups of data, the greater the number of groups, so that less data to be transmitted is generated per group and no single group produces an excessive amount.
  • The stronger the communication capability of the first training device, the more data the first training device can transmit to the second training device per unit time, and the fewer groups the data needs to be divided into.
  • The longer it takes the first training device to train the model based on the multiple groups of data, the more slowly it generates data to be transmitted, and the fewer groups the data needs to be divided into.
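One way to turn these three correlations into a number is to compare the total transfer time with the total training time. The formula below is purely a hypothetical heuristic that happens to satisfy the stated correlations; the application does not give a formula, and the function and parameter names are invented for illustration.

```python
import math


def num_groups(transmit_bytes, bandwidth_bytes_per_s, train_seconds, batch_size):
    """Hypothetical heuristic: groups ~ transfer time / training time, clamped."""
    transfer_seconds = transmit_bytes / bandwidth_bytes_per_s
    raw = math.ceil(transfer_seconds / train_seconds)
    # At least one group, and never more groups than samples in the batch.
    return max(1, min(batch_size, raw))


print(num_groups(8e6, 1e6, 2.0, 64))   # 4
print(num_groups(8e6, 4e6, 2.0, 64))   # 1  (stronger link, fewer groups)
print(num_groups(16e6, 1e6, 2.0, 64))  # 8  (more data to send, more groups)
print(num_groups(8e6, 1e6, 4.0, 64))   # 2  (longer training, fewer groups)
```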
  • the second aspect of the present application provides a model training device, which is a first training device, comprising:
  • An acquisition module is used to acquire a training data set, where the training data set includes multiple data;
  • a processing module used for dividing the plurality of data into a plurality of groups of data, each of the plurality of groups of data includes at least one data, and the data volume of each group of data is related to the communication capability of the first training device;
  • the processing module is also used to process multiple groups of data in batches based on the model deployed in the first training device, so as to train the model and transmit the data obtained by processing the multiple groups of data to the second training device in batches.
  • the second training device and the first training device jointly participate in the training of the model.
  • the operations performed during the training model include first-category operations and second-category operations, where the first-category operations are used to generate data to be transmitted to a second training device based on multiple sets of data, and the second-category operations are used to generate data to be processed only by the first training device, and the execution priority of the first-category operations is higher than the execution priority of the second-category operations.
  • the first type of operation includes a first sub-type of operation and a second sub-type of operation
  • the operation result of the first sub-type of operation is used to obtain the input of the second sub-type of operation
  • the operation result of the second sub-type of operation is used to be transmitted to the second training device.
  • the execution priority of the second subclass operation is higher than that of the first subclass operation.
  • The processing module is further configured to stop executing the first type of operation and execute the second type of operation.
  • The processing module is further configured to stop executing the second type of operation and continue executing the first type of operation.
  • the training in which the first training device participates is federated learning.
  • the first type of operation includes operations for processing multiple groups of data based on the model
  • the second type of operation includes operations for performing reverse gradient calculations on the model based on data obtained from the second training device.
  • the model includes a first sub-model and a second sub-model
  • the first type of operation includes an operation for processing multiple groups of data based on the first sub-model, and an operation for processing the second sub-model based on data obtained from the second training device;
  • the second type of operation includes operations for performing reverse gradient calculation on the first sub-model.
  • the number of multiple data is related to the batch size of the model
  • the target gradient is related to multiple gradients obtained based on multiple groups of data, and the target gradient is used by the first training device to update the model.
  • the number of the multiple data is related to the batch size of the model, the number of the multiple data groups is positively correlated with the amount of data to be transmitted, and the number of the multiple data groups is negatively correlated with the communication capability of the first training device and the training duration;
  • the amount of data to be transmitted is the amount of data to be transmitted generated after processing multiple sets of data
  • the training duration is the time it takes the first training device to train the model based on the multiple groups of data.
  • the third aspect of the present application provides a model training device, which may include a processor, the processor and a memory are coupled, the memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the method described in the first aspect or any implementation of the first aspect is implemented.
  • For details, refer to the first aspect; they are not repeated here.
  • the model training device further includes a communication interface, through which the model training device transmits data to other model training devices, or receives data transmitted by other model training devices.
  • the model training device can transmit data to other model training devices based on the communication interface through Remote Direct Memory Access (RDMA) technology, and the specific method of transmitting data between model training devices is not limited here.
  • RDMA refers to transferring data directly to the computer's storage area through the network, that is, quickly moving data from one system to the remote system memory without causing any impact on the operating system, thereby not occupying the computer's processing resources. Therefore, when the model training devices use RDMA technology to realize data transmission between them, the model training device can store the data to be transmitted in the memory, so that the data in the memory can be directly inserted into the memory of another model training device through the network, thereby reducing the consumption of the processing resources of the two model training devices.
  • the fourth aspect of the present application provides a model training system, comprising at least two model training devices, which jointly participate in the training of the model, and any one of the at least two model training devices uses the method described in any implementation method of the first aspect above to interact with other model training devices and perform model training.
  • A fifth aspect of the present application provides a computer-readable storage medium in which a computer program is stored.
  • When the computer program is run on a computer, the computer executes the method described in any implementation of the first aspect.
  • a sixth aspect of the present application provides a circuit system, the circuit system comprising a processing circuit, the processing circuit being configured to execute the method described in any implementation of the first aspect above.
  • A seventh aspect of the present application provides a computer program product which, when executed on a computer, enables the computer to execute the method described in any implementation of the first aspect.
  • An eighth aspect of the present application provides a chip system, which includes a processor for supporting a server or a threshold value acquisition device in implementing the functions involved in any implementation of the first aspect, for example, sending or processing the data and/or information involved in the above method.
  • the chip system also includes a memory, which is used to store program instructions and data necessary for the server or communication device.
  • the chip system can be composed of chips, or it can include chips and other discrete devices.
  • FIG. 1 is a schematic diagram of an application scenario of a model training method provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of the structure of an electronic device 101 provided in an embodiment of the present application.
  • FIG. 3 is a flow chart of a model training method provided in an embodiment of the present application.
  • FIG. 4A is a schematic diagram of a plurality of training devices performing model training provided in an embodiment of the present application.
  • FIG. 4B is a schematic diagram showing a comparison of training durations when training data is grouped and when training data is not grouped, provided in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a training process provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another training process provided in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a training architecture provided in an embodiment of the present application.
  • FIG. 8 is a schematic flow chart of parallel task scheduling performed by a training device 3 provided in an embodiment of the present application.
  • FIG. 9 is a schematic flow chart of parallel task scheduling performed by a training device 4 provided in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the structure of a model training device provided in an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a structure of an execution device provided in an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a structure of a chip provided in an embodiment of the present application.
  • FIG. 13 is a schematic diagram of the structure of a computer-readable storage medium provided in an embodiment of the present application.
  • the naming or numbering of the steps in the present application does not mean that the steps in the method flow must be executed in the time/logical sequence indicated by the naming or numbering.
  • the process steps that have been named or numbered can change the execution order according to the technical purpose to be achieved, as long as the same or similar technical effects can be achieved.
  • the division of units in this application is a logical division. There may be other division methods when it is implemented in actual applications. For example, multiple units can be combined or integrated into another system, or some features can be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed can be through some interfaces, and the indirect coupling or communication connection between units can be electrical or other similar forms, which are not limited in this application.
  • the units or sub-units described as separate components may or may not be physically separated, may or may not be physical units, or may be distributed in multiple circuit units, and some or all of the units may be selected according to actual needs to achieve the purpose of the present application.
  • Distributed training refers to the use of multiple devices located in different locations to train the model, thereby effectively utilizing the computing resources of each device to complete model training tasks that are difficult to complete with a single device.
  • Federated learning is essentially a model training method that can achieve data sharing and jointly build models on the basis of ensuring data privacy security and legal compliance.
  • The core idea of federated learning is that when multiple data sources jointly participate in model training, the model is trained jointly by exchanging only intermediate parameters of the model, without any transfer of the original data, so the original data does not need to leave its local environment. This method achieves a balance between data privacy protection and data sharing analysis, that is, the data application mode of "data available but invisible".
  • federated learning can be divided into three categories: horizontal federated learning (HFL), vertical federated learning (VFL) and federated transfer learning (FTL).
  • Horizontal federated learning means that the data of different participants in federated learning have a large overlap of features, but the overlap of data samples (i.e., samples to which the features belong) is not high.
  • For example, the participants in federated learning are two banks serving different regional markets. The customer groups they serve are quite different, but the customer features may overlap to a high degree because the business models are similar.
  • Vertical federated learning means that the data samples of different participants in federated learning have a large overlap, but the overlap of sample features is not high.
  • For example, two companies (a bank and an e-commerce company) provide different services to customers and hold different aspects of customer data, but the customer groups served by the two companies overlap substantially.
  • vertical federated learning has the widest application scenario. Therefore, the vertical federated learning algorithm is conducive to establishing cooperation between enterprises, using their respective unique data to jointly build more powerful models.
  • Federated transfer learning means that the data of different participants in federated learning do not overlap very much in terms of features and sample dimensions.
  • Neural networks are often referred to as models. A neural network is an algorithmic mathematical model that mimics the behavioral characteristics of animal neural networks and performs distributed parallel information processing. It relies on the complexity of the system and processes information by adjusting the interconnections among a large number of internal nodes.
  • the neural network may be composed of neural units.
  • a neural unit may refer to an operation unit that takes x_s (i.e., input data) and an intercept of 1 as input, and the output of the operation unit may be:
  • $$h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$
  • where s = 1, 2, ..., n, and n is a natural number greater than 1;
  • W_s is the weight of x_s;
  • b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into the output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
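A single neural unit can be written in a few lines. The sketch assumes the usual weighted-sum form, that is, the activation function applied to the sum of each input times its weight plus the bias, with the sigmoid as the activation; the function names and the example numbers are illustrative.

```python
import math


def sigmoid(z):
    """Sigmoid activation: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(xs, ws, b, f=sigmoid):
    """Output f(sum_s W_s * x_s + b) of a single neural unit."""
    return f(sum(w * x for w, x in zip(ws, xs)) + b)


# 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5
print(neural_unit([1.0, 2.0], [0.5, -0.25], 0.0))  # 0.5
```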
  • a neural network is a network formed by connecting multiple single neural units mentioned above, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, which can be an area composed of several neural units.
  • the loss function is used to measure the difference between the predicted value and the actual value as a reference for model performance.
  • the process of model training is to continuously make predictions through training data, continuously adjust the difference between the predicted output and the expected output, and make the value of the loss function continuously smaller.
  • The gradient is a vector that points in the direction along which the directional derivative of a function at a given point is largest; that is, at that point the function changes fastest along the gradient direction, with the largest rate of change.
  • Gradient descent is a method for finding the minimum of an objective function.
  • To find a local minimum of a function with gradient descent, it is necessary to iteratively step from the current point in the direction opposite to the gradient at that point.
  • In model training, the gradient descent method is used to find the minimum value of the loss function.
  • The loss function decreases fastest in the direction opposite to the gradient, so the extreme point can be found fastest along that direction.
  • When the gradient vector is 0, the loss function has reached a minimum point and the model accuracy has reached a maximum point.
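The iteration can be sketched on a toy problem. The loss function, learning rate, and step count below are hypothetical choices made only so the example runs; they are not values from the application.

```python
def gradient_descent(grad_fn, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_fn(x)  # move in the direction opposite to the gradient
    return x


# Hypothetical loss L(x) = (x - 3)^2 with gradient 2 * (x - 3); its minimum is at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 6))  # 3.0
```

At x = 3 the gradient is 0, so further steps no longer change x.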
  • the back propagation method is a specific implementation of the gradient descent method on the neural network.
  • the neural network can use the error back propagation (BP) algorithm to correct the size of the parameters in the initial model during the training process, so that the error loss of the model becomes smaller and smaller.
  • Back propagation is a backward pass driven by the error loss, which aims to obtain the optimal model parameters, such as the weight matrix.
  • Hyperparameters can include the batch size, the learning rate in the gradient descent method, and the number of iterations (epochs).
  • The batch size indicates the number of data samples that are passed to the program at a time for training the model. For example, assuming that the training data set has 1000 data, if batch_size is set to 100, the program first uses the first 100 data in the training data set, that is, the 1st to 100th data, to train the model. After the training based on the 1st to 100th data is completed (that is, the weights of the model have been updated), the model is trained using the 101st to 200th data, and so on, until all 1000 data in the training set have been used with the tenth batch. The 1st to 100th data belong to the same batch, the 101st to 200th data belong to the next batch, and so on, so the training data set contains 10 batches.
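The batching in this example can be reproduced directly. The helper name `iter_batches` is illustrative, and the data are simply the integers 1 to 1000 standing in for real samples.

```python
def iter_batches(data, batch_size):
    """Yield consecutive slices of `data` with at most `batch_size` items each."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


data = list(range(1, 1001))  # stand-in for 1000 training samples
batches = list(iter_batches(data, batch_size=100))

print(len(batches))                   # 10
print(batches[0][0], batches[0][-1])  # 1 100
print(batches[1][0], batches[1][-1])  # 101 200
```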
  • In model training, the significance of setting the batch size is to train the model based on multiple sample data at the same time.
  • On the one hand, this effectively exploits the parallelism of hardware devices such as GPUs and NPUs, which process multiple sample data in parallel and so improve model training efficiency. On the other hand, the average gradient value of the model can be determined from multiple sample data, which improves the training accuracy of the model and avoids the gradient fluctuations that arise during training when the model gradient is determined from a single sample.
  • For example, in an advertising scenario, the advertising platform is the party that displays advertisements. The advertising platform holds data such as user characteristics and click and exposure behaviors, while advertisers hold conversion behavior data such as user downloads, activations, and purchases. Due to factors such as user privacy protection and commercial protection, it is difficult for advertisers and advertising platforms to exchange data directly, so the useful data is scattered between the two parties, creating data silos.
  • With federated learning, advertising platforms and advertisers can train jointly without sharing data resources and without the data leaving their local environments, thereby building models together.
  • an embodiment of the present application provides a model training method, which divides multiple data of the same batch during the training process into multiple groups of data, and the training device processes the multiple groups of data in sequence through the model to perform model training; and, during the model training process, the training device gives priority to executing operations to generate data to be transmitted, thereby hiding the model training time in the communication time between the training devices, and avoiding the training devices from being in a communication idle state as much as possible, ultimately shortening the idle waiting time in the entire iterative process and improving the overall training efficiency.
  • FIG 1 is a schematic diagram of an application scenario of a model training method provided in an embodiment of the present application.
  • training device 10 and training device 20 are two training devices participating in distributed model training.
  • training device 10 can obtain training data for training the model from a local database 1;
  • training device 20 can obtain training data for training the model from a local database 2.
  • training device 10 and training device 20 cannot directly exchange data in database 1 and database 2.
  • model 1 is deployed on training device 10
  • model 2 is deployed on training device 20.
  • Training device 10 trains model 1 based on training data obtained from database 1
  • training device 20 trains model 2 based on training data obtained from database 2.
  • training device 10 and training device 20 will exchange intermediate parameters of the model (for example, data obtained by processing training data by local model) with each other, so that both parties can cooperate with each other to complete the training of local models.
  • the training device may divide a batch of multiple training data into multiple groups of data, and process the multiple groups of data group by group through the model to gradually generate multiple parts of data to be transmitted to other training devices. Then, while processing the groups of data, the training device transmits one or more parts of data generated by processing earlier groups to other training devices, so that data processing and data transmission are executed in parallel. After acquiring data from other training devices, the training device continues to process the acquired data to complete the training of the model. Moreover, while training the model, the training device gives priority to executing the operations that generate the data to be transmitted, to avoid the training device being in a communication idle state.
  • the present embodiment is described by taking distributed training with two training devices as an example.
  • the devices participating in the distributed training may be two or more training devices, and the number of the training devices participating in the distributed training is not limited herein.
  • the model training method provided in the embodiment of the present application can be applied to an electronic device, or a chip system on an electronic device, such as a graphics processing unit (GPU) or a neural-network processing unit (NPU) on an electronic device.
  • the electronic device can be, for example, a server, a mobile phone, a personal computer (PC), a laptop, a tablet computer, a smart TV, a mobile internet device (MID), a wearable device, a virtual reality (VR) device, an augmented reality (AR) device, a wireless electronic device in industrial control, a wireless electronic device in self-driving, a wireless electronic device in remote medical surgery, a wireless electronic device in smart grid, a wireless electronic device in transportation safety, a wireless electronic device in a smart city, a wireless electronic device in a smart home, etc.
  • the method provided in the embodiment of the present application will be introduced below by taking the method provided in the embodiment of the present application applied to a server as an example.
  • FIG. 2 is a schematic diagram of the structure of an electronic device 101 provided in an embodiment of the present application.
  • the electronic device 101 includes a processor 103, and the processor 103 is coupled to a system bus 105.
  • the processor 103 may be one or more processors, each of which may include one or more processor cores.
  • a display adapter (video adapter) 107 which may drive a display 109, and the display 109 is coupled to the system bus 105.
  • the system bus 105 is coupled to an input-output (I/O) bus via a bus bridge 111.
  • An I/O interface 115 is coupled to the I/O bus.
  • the I/O interface 115 communicates with a variety of I/O devices, such as an input device 117 (such as a touch screen), an external memory 121 (for example, a hard disk, a floppy disk, an optical disk or a USB flash drive), a multimedia interface, etc.
  • a transceiver 123 which may send and/or receive radio communication signals
  • a camera 155 which may capture static and dynamic digital video images
  • an external USB port 125 may be a USB interface.
  • the processor 103 may be any conventional processor, including a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, or a combination thereof.
  • the processor may be a dedicated device such as an ASIC.
  • the electronic device 101 can communicate with the software deployment server 149 through the network interface 129.
  • the network interface 129 is a hardware network interface, such as a network card.
  • the network 127 can be an external network, such as the Internet, or an internal network, such as Ethernet or a virtual private network (VPN).
  • the network 127 can also be a wireless network, such as a WiFi network, a cellular network, etc.
  • the hard disk drive interface 131 is coupled to the system bus 105.
  • the hard disk drive interface is connected to the hard disk drive 133.
  • the internal memory 135 is coupled to the system bus 105.
  • the data running in the internal memory 135 may include an operating system (OS) 137 of the electronic device 101, an application 143, and a scheduler.
  • the operating system consists of a shell 139 and a kernel 141.
  • Shell 139 is an interface between the user and the kernel of the operating system.
  • the shell is the outermost layer of the operating system.
  • the shell manages the interaction between the user and the operating system: it waits for user input, interprets user input to the operating system, and processes various operating system output results.
  • the kernel 141 consists of those parts of the operating system that manage memory, files, peripherals, and system resources.
  • the kernel 141 interacts directly with the hardware.
  • the operating system kernel usually runs processes and provides communication between processes, provides CPU time slice management, interrupts, memory management, IO management, etc.
  • when the electronic device 101 is a smart phone, the application 143 includes an instant messaging-related program. In one embodiment, when the application 143 needs to be executed, the electronic device 101 can download the application 143 from the software deployment server 149.
  • Figure 3 is a flow chart of a model training method provided in an embodiment of the present application.
  • the model training method provided in an embodiment of the present application is applied to a first training device, and the method includes the following steps 301-303.
  • Step 301 Obtain a training data set, where the training data set includes multiple data.
  • the first training device is a training device that participates in distributed training.
  • the first training device is, for example, the above-mentioned electronic device, or a chip in the electronic device.
  • before training the model, the first training device can obtain a local training data set, which is used to train the model.
  • a batch size can be obtained, where the batch size indicates the amount of data samples used to train the model at the same time, that is, the amount of data samples that the model needs to use in one iterative training process.
  • all the data in the training dataset can be divided into multiple batches, and the amount of data in each batch is the same, and the amount of data in each batch is determined by the batch size. For example, assuming that the training dataset includes 10,000 images and the batch size is 100, the training dataset can be divided into 100 batches, each batch including 100 images. For example, the first batch includes images 1-100, the second batch includes images 101-200, and so on.
  • the multiple data acquired by the first training device are a batch of data, that is, the number of the multiple data is the same as the batch size of the model.
  • Step 302 divide the plurality of data into a plurality of groups of data, each of the plurality of groups of data includes at least one data, and the number of the plurality of groups of data is related to the communication capability of the first training device.
  • the first training device can further divide the multiple data to obtain multiple groups of data. For example, assuming that the multiple data are a batch of data, and the multiple data include 100 data in total, the first training device can divide the 100 data into 5 groups of data, each group of data includes 20 data.
  • the data volume of each set of data is related to the communication capability of the first training device.
  • the data volume of each set of data is positively correlated with the communication capability of the first training device. That is, the stronger the communication capability of the first training device, the larger the data volume of each set of data; the weaker the communication capability of the first training device, the smaller the data volume of each set of data.
  • the worse the communication capability of the first training device is, the less data the first training device transmits to other training devices per unit time, and the more time the first training device takes to transmit a specific amount of data to other training devices. Therefore, when the communication capability of the first training device is worse, dividing the multiple data into more groups of data can hide the training duration in the communication duration as much as possible, thereby reducing the overall training delay overhead.
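  • The division in step 302 can be sketched as follows. This is a minimal illustration, assuming contiguous groups of near-equal size; the function name split_into_groups and the handling of a remainder (spreading it over the first groups) are choices made for this example, since the application only requires that each group contain at least one data.

```python
def split_into_groups(batch, num_groups):
    """Split one batch into num_groups contiguous groups of near-equal size."""
    if not 1 <= num_groups <= len(batch):
        raise ValueError("num_groups must be between 1 and len(batch)")
    base, extra = divmod(len(batch), num_groups)
    groups, start = [], 0
    for i in range(num_groups):
        # The first `extra` groups take one additional sample (assumption).
        size = base + (1 if i < extra else 0)
        groups.append(batch[start:start + size])
        start += size
    return groups


batch = list(range(100))               # one batch of 100 data
groups = split_into_groups(batch, 5)   # 5 groups of 20 data each
print([len(g) for g in groups])
```

  • With a weaker link, the same batch would simply be split with a larger num_groups, so that each transmitted part is smaller.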
  • Step 303 based on the model deployed in the first training device, multiple groups of data are processed in batches in units of groups to train the model and multiple parts of data obtained by processing the multiple groups of data are transmitted to the second training device in batches.
  • the second training device and the first training device jointly participate in the training of the model.
  • each group of data of the multiple groups of data is input into the model in batches in units of groups, so as to realize batch processing of the multiple groups of data.
  • each processing of a group of data will generate a part of data, and this part of data generated needs to be transmitted to the second training device so that the second training device can cooperate to complete the training of the model.
  • multiple parts of data will be obtained by processing multiple groups of data through the model, and each group of data corresponds to a unique part of data.
  • each part of the multiple parts of data in this step is also generated in batches and transmitted to the second training device in batches.
  • while the first training device processes the multiple groups of data group by group based on the model, it continuously transmits the multiple parts of data obtained by processing those groups to the second training device in batches, so that the second training device cooperates with the first training device to complete the training of the model in the first training device.
  • the second training device includes one or more training devices.
  • the training in which the first training device and the second training device jointly participate is, for example, the above-mentioned federated learning, such as horizontal federated learning, vertical federated learning, or federated transfer learning, which is not specifically limited here.
  • the second training device processes the obtained multiple parts of data based on the model in the second training device and returns the processing results to the first training device.
  • the first training device updates the weight of the model in the first training device based on the processing results returned by the second training device, thereby achieving one iteration of model training.
  • the multiple parts of data obtained by processing the multiple groups of data can be transmitted in batches to the second training device, so that the second training device can perform calculations based on the acquired data as early as possible, thereby hiding the training time of the second training device in the communication time, reducing the overall training time of the model, and improving the training efficiency of the model.
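  • The overlap of processing and transmission in step 303 can be sketched with a background sender thread. This is only an illustrative sketch: train_with_overlap, forward, and send are hypothetical names, forward stands in for processing one group through the local model, and send stands in for the actual transfer to the second training device.

```python
import queue
import threading


def train_with_overlap(groups, forward, send):
    """Process groups one by one while a sender thread transmits finished parts."""
    outbox = queue.Queue()

    def sender():
        while True:
            part = outbox.get()
            if part is None:      # sentinel: no more parts to transmit
                break
            send(part)

    worker = threading.Thread(target=sender)
    worker.start()
    try:
        for group in groups:
            # The next group is processed while earlier parts are still being sent.
            outbox.put(forward(group))
    finally:
        outbox.put(None)
        worker.join()


received = []                                   # stand-in for the second training device
groups = [[1, 2], [3, 4], [5, 6]]
train_with_overlap(groups,
                   forward=lambda g: [2 * x for x in g],   # hypothetical model forward
                   send=received.append)                   # hypothetical transfer
print(received)
```

  • A single sender thread with a FIFO queue keeps the parts in group order, which matches the one-to-one correspondence between groups and transmitted parts described above.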
  • the number of the plurality of data acquired by the first training device is related to the batch size of the model, for example, the number of the plurality of data is the same as the batch size of the model.
  • the target gradient is related to the multiple gradients obtained based on the multiple sets of data during the training of the model, and the target gradient is used by the first training device to update the model. For example, when multiple gradients are obtained based on the multiple sets of data, the target gradient is obtained by averaging the multiple gradients.
  • the training device usually trains the model based on multiple data in a batch at the same time, thereby obtaining a gradient, and updating the model based on the obtained gradient. Since the multiple data in a batch are divided into multiple groups of data in this scheme, and the model is trained in sequence based on the multiple groups of data, multiple gradients related to the multiple groups of data are obtained. In this case, this scheme obtains the target gradient based on multiple gradients, and then updates the model based on the target gradient, which can ensure that the model is updated based on a batch of data, ensuring that the accuracy of model training will not be affected.
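  • The averaging of per-group gradients into a target gradient can be checked numerically with a small sketch. The scalar model y = w * x, the mean-squared-error loss, and the sample values are assumptions made for this example; the equality with the full-batch gradient shown here relies on the groups having equal size.

```python
def mse_gradient(w, samples):
    """Gradient of the mean squared error for the scalar model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)


w = 0.5
batch = [(float(x), 3.0 * x) for x in range(1, 9)]   # one batch of 8 samples
groups = [batch[0:4], batch[4:8]]                    # two groups of equal size

group_grads = [mse_gradient(w, g) for g in groups]   # one gradient per group
target_grad = sum(group_grads) / len(group_grads)    # target gradient by averaging
full_grad = mse_gradient(w, batch)                   # gradient of the undivided batch

print(target_grad, full_grad)
```

  • Because the two values agree, updating the model with the target gradient is equivalent to updating it once with the whole batch, which is why grouping does not change the result of the iteration in this setting.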
  • FIG. 4A is a schematic diagram of a plurality of training devices performing model training provided in an embodiment of the present application.
  • model 1 is deployed in training device 1
  • model 2 is deployed in training device 2.
  • training device 1 is, for example, the first training device in the embodiment of the present application
  • training device 2 is, for example, the second training device in the embodiment of the present application.
  • training data is first input into model 1, and training device 1 performs forward calculation based on model 1, that is, processes the input training data based on model 1 to obtain data that needs to be transmitted to training device 2. Then, training device 1 transmits the data obtained by performing forward calculation on model 1 to training device 2. After receiving the data transmitted by training device 1, training device 2 performs forward calculation on model 2 based on local training data and data transmitted by training device 1, and performs reverse gradient calculation on model 2 based on the result of forward calculation of model 2 to obtain gradient data corresponding to model 2. Then, training device 2 transmits the gradient data obtained by reverse gradient calculation to training device 1, and training device 1 continues to perform reverse gradient calculation on model 1 based on the received data. Finally, training device 1 updates the weights in model 1 based on the gradient data obtained by reverse gradient calculation, thereby realizing one iteration training of model 1.
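  • The message flow of FIG. 4A can be traced with a toy example in which each model is a single scalar weight. Everything concrete here is an assumption for illustration: the function names, the scalar models, the squared-error loss, and the choice that the label y is the local training data held by training device 2.

```python
def device1_forward(w1, x):
    """Training device 1: forward calculation on model 1; the result is sent to device 2."""
    return w1 * x


def device2_forward_backward(w2, h, y):
    """Training device 2: forward and reverse gradient calculation on model 2."""
    pred = w2 * h
    d_pred = 2.0 * (pred - y)     # derivative of the loss (pred - y) ** 2
    grad_w2 = d_pred * h          # gradient for model 2, kept on device 2
    grad_h = d_pred * w2          # gradient data returned to device 1
    return grad_w2, grad_h


def device1_backward_update(w1, x, grad_h, lr):
    """Training device 1: reverse gradient calculation on model 1 and weight update."""
    grad_w1 = grad_h * x
    return w1 - lr * grad_w1


w1, w2, x, y = 1.0, 1.0, 2.0, 10.0
h = device1_forward(w1, x)                          # device 1 -> device 2
_, grad_h = device2_forward_backward(w2, h, y)      # device 2 -> device 1
new_w1 = device1_backward_update(w1, x, grad_h, lr=0.01)
print(h, grad_h, new_w1)
```

  • Only the intermediate value h and the gradient grad_h cross the device boundary; the raw input x stays on training device 1 and the label y stays on training device 2.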
  • Figure 4B is a schematic diagram of training duration comparison when the training data is grouped and not grouped, provided in an embodiment of the present application.
  • the training duration comparisons in the two cases shown in Figure 4B are based on the model training process shown in Figure 4A.
  • when the data is not grouped, the training device 1 does not group the input training data when training the model 1, that is, the model 1 is trained based on a batch of training data at the same time. Specifically, the training device 1 first performs a forward calculation on multiple data of the same batch based on the model 1 (that is, calculation A in Figure 4B) to obtain data A; then, the training device 1 transmits data A to the training device 2.
  • After the training device 2 receives the data A transmitted by the training device 1, the training device 2 performs a forward calculation and a reverse gradient calculation on the model 2 based on the data A (that is, calculation B in Figure 4B) to obtain data B; then, the training device 2 transmits data B to the training device 1. After training device 1 receives data B transmitted by training device 2, training device 1 performs reverse gradient calculation on model 1 (i.e., calculation C in Figure 4B), and finally updates the weights in the model based on the obtained gradient, completing one iterative training of the model.
  • when the data is grouped, the training device 1 groups the input training data when training the model 1, that is, divides a batch of training data into two groups to train the model 1. Specifically, the training device 1 first divides a batch of input data into two groups, and performs forward calculation on multiple data of the first group based on the model 1 (i.e., calculation A1 in FIG. 4B), to obtain data A1. Then, the training device 1 transmits data A1 to the training device 2; while the training device 1 transmits data A1 to the training device 2, the training device 1 continues to perform forward calculation on multiple data of the second group based on the model 1 (i.e., calculation A2 in FIG. 4B), to obtain data A2. Since the data transmission time is longer than the calculation time of the model 1, the data A2 obtained by calculation A2 is transmitted only after the transmission of data A1 has finished.
  • After the training device 2 receives the data A1 transmitted by the training device 1, the training device 2 performs forward calculation and reverse gradient calculation on the model 2 based on the data A1 (i.e., calculation B1 in FIG. 4B), and obtains data B1; then, the training device 2 transmits the data B1 to the training device 1. In addition, after the training device 2 completes the calculation B1, when the training device 2 receives the data A2 transmitted by the training device 1, the training device 2 continues to perform forward calculation and reverse gradient calculation on the model 2 based on the data A2 (i.e., calculation B2 in FIG. 4B), obtains data B2, and transmits the data B2 to the training device 1.
  • After training device 1 receives data B1 transmitted by training device 2, training device 1 performs reverse gradient calculation on model 1 based on data B1 (i.e., calculation C1 in FIG. 4B); after training device 1 receives data B2 transmitted by training device 2, training device 1 performs reverse gradient calculation on model 1 based on data B2 (i.e., calculation C2 in FIG. 4B). Finally, training device 1 updates the weights in the model based on the gradients obtained by calculation C1 and calculation C2, completing one iteration of model training.
  • some calculations and data transmissions in training device 1 and training device 2 can be executed in parallel.
  • calculation A2 and data A1 transmission in training device 1 can be executed in parallel
  • data A2 transmission in training device 1 and calculation B1 in training device 2 can be executed in parallel, thereby achieving the parallelization of the training time and communication time of one party with the training time of the other party, that is, hiding the training time of the other party in the communication time, thereby shortening the idle waiting time in the entire iterative training process, saving a lot of training time, and effectively improving the training efficiency of the model.
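  • The comparison in FIG. 4B can be reproduced with a simple timing model. This is a sketch under stated assumptions, not a measurement: the durations are made-up numbers, each device runs one calculation at a time, each direction of the link carries one transfer at a time, splitting into g groups divides every calculation and transfer time by g, and training device 1 runs all forward calculations before any reverse gradient calculation.

```python
def iteration_time(num_groups, t_a, t_send_a, t_b, t_send_b, t_c):
    """Duration of one training iteration when the batch is split into num_groups groups."""
    g = num_groups
    a, sa, b, sb, c = (t / g for t in (t_a, t_send_a, t_b, t_send_b, t_c))
    a_done = send_a_done = b_done = send_b_done = c_done = 0.0
    all_a_done = float(t_a)   # device 1 finishes every forward calculation first
    for _ in range(g):
        a_done += a                                       # calculation A_i on device 1
        send_a_done = max(a_done, send_a_done) + sa       # transfer of data A_i
        b_done = max(send_a_done, b_done) + b             # calculation B_i on device 2
        send_b_done = max(b_done, send_b_done) + sb       # transfer of data B_i
        c_done = max(send_b_done, c_done, all_a_done) + c  # calculation C_i on device 1
    return c_done


# Hypothetical durations (arbitrary time units) for the whole batch.
durations = dict(t_a=2, t_send_a=4, t_b=2, t_send_b=4, t_c=2)
ungrouped = iteration_time(1, **durations)
grouped = iteration_time(2, **durations)
print(ungrouped, grouped)
```

  • With these assumed durations the ungrouped iteration takes 14 time units and the two-group iteration takes 9, because calculation A2 overlaps the transfer of data A1 and calculation B1 overlaps the transfer of data A2.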
  • multiple data in the training process are divided into multiple groups of data, and the training device processes the multiple groups of data in batches through the model and transmits the multiple parts of data obtained by processing the multiple groups of data in batches to other training devices to perform model training.
  • while the training device transmits the data obtained by processing the previous group of data to other training devices, it can still continue to process the data of the subsequent group, thereby hiding the model training time in the communication time between the training devices, ultimately shortening the idle waiting time in the entire iteration process and improving the overall training efficiency.
  • the training device can simultaneously process the second set of data while transmitting the first part of the data to other training devices. That is, while transmitting the data obtained by processing the previous set of data to other training devices, the training device can still continue to process the subsequent sets of data.
  • the number of multiple data is related to the batch size of the model (for example, the number of multiple data is the same as the batch size of the model)
  • the number of multiple groups of data is positively correlated with the amount of data to be transmitted, and the number of multiple groups of data is negatively correlated with the communication capability of the first training device and the training duration.
  • the amount of data to be transmitted is the amount of data to be transmitted generated after processing the multiple groups of data
  • the training duration is the duration of the first training device training the model based on the multiple groups of data.
  • the larger the amount of data to be transmitted generated after processing multiple groups of data the greater the number of groups of data, so as to reduce the amount of data to be transmitted generated by processing each group of data and avoid excessive amount of data to be transmitted generated after processing each group of data.
  • the stronger the communication capability of the first training device the more data the first training device can transmit to the second training device per unit time, and the fewer the number of groups of data can be divided into.
  • the longer the time it takes for the first training device to train the model based on multiple groups of data the slower the speed at which the first training device processes multiple groups of data to generate data to be transmitted, and the fewer the number of groups of data can be divided into.
  • the first training device divides a batch of data, and the number of groups of the obtained multiple groups of data can be expressed by the following formula: number of groups = Batch_size*Seq_length*hidden_size*32/BandWith/train_time.
  • Batch_size represents the batch size
  • Seq_length represents the size of one dimension of the calculation matrix
  • hidden_size represents the feature dimension of the hidden layer of the model
  • BandWith represents the bandwidth
  • train_time represents the training time of the model.
  • Batch_size*Seq_length*hidden_size*32 represents the amount of data to be transmitted generated by the first training device processing multiple groups of data through the model
  • BandWith represents the amount of data that can be transmitted to the second training device by the first training device per unit time
  • train_time represents the time required for the first training device to process multiple groups of data.
  • Batch_size*Seq_length*hidden_size*32/BandWith represents the time required for the first training device to transmit all data to the second training device; on this basis, Batch_size*Seq_length*hidden_size*32/BandWith/train_time (i.e., the number of groups of multiple groups of data) can be the ratio between the time required for the first training device to transmit all data to the second training device and the time required for the first training device to process multiple groups of data.
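  • The ratio above can be evaluated with a short helper. This is a sketch under assumptions: the function name num_groups is hypothetical, the constant 32 is read here as the number of bits per transmitted value, the bandwidth is taken in bits per second and the training time in seconds, and rounding up and clamping the result to the range from 1 to the batch size are choices made for this example, since the application does not state how a fractional ratio is handled.

```python
import math


def num_groups(batch_size, seq_length, hidden_size, bandwidth, train_time,
               bits_per_value=32):
    """Number of groups as the ratio of total transmission time to training time."""
    data_bits = batch_size * seq_length * hidden_size * bits_per_value
    transmit_time = data_bits / bandwidth    # time to transmit all data
    ratio = transmit_time / train_time       # the quantity given by the formula
    # Rounding up and clamping are assumptions of this sketch.
    return max(1, min(batch_size, math.ceil(ratio)))


# Hypothetical values: 1 Gbit/s link versus 10 Gbit/s link, 0.1 s of training time.
slow_link = num_groups(100, 128, 768, bandwidth=1e9, train_time=0.1)
fast_link = num_groups(100, 128, 768, bandwidth=1e10, train_time=0.1)
print(slow_link, fast_link)
```

  • The weaker link yields more groups than the stronger one, in line with the negative correlation between the number of groups and the communication capability described above.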
  • the operations performed during the training of the model by the first training device include first-category operations and second-category operations.
  • the first-category operations are used to generate multiple parts of data transmitted to the second training device based on multiple sets of data, that is, the first-category operations are operations that rely on multiple sets of data to generate data to be transmitted.
  • the second-category operations are used to generate data that is only processed by the first training device, that is, whether the second-category operations are executed or not does not affect the generation of data to be transmitted.
  • the first-category operations include operations for processing multiple sets of data based on the model (that is, the forward calculation performed on model 1 in Figure 4A), and the second-category operations include operations for performing reverse gradient calculations on the model based on data obtained from the second training device (that is, the reverse gradient calculations performed on model 1 in Figure 4A).
  • the execution priority of the first type of operation is higher than the execution priority of the second type of operation. That is, during the training process, when there are both first type of operations and second type of operations that can be executed at the same time, the first training device will prioritize the execution of the first type of operation. After all the first type of operations are executed, the first training device will execute the second type of operation.
  • the first type of operation may refer to the forward calculation performed by the first training device on the model 1 based on multiple sets of data, wherein the forward calculation performed on the model 1 generates data that needs to be transmitted to the second training device.
  • the second type of operation may refer to the reverse gradient calculation performed by the first training device on the model 1 based on the data returned by the second training device, wherein the reverse gradient calculation performed on the model 1 generates gradient data of the model 1, and the gradient data of the model 1 is data that is further processed only by the first training device to update the weights of the model 1.
  • the training device gives priority to executing the first type of operations so as to continuously generate data to be transmitted, and avoid the training device being in a communication idle state as much as possible, so as to ensure that the time for communication and training to be carried out in parallel is as long as possible, thereby shortening the idle waiting time in the entire iterative process and improving the overall training efficiency.
  • the first type of operation may include a first sub-type of operation and a second sub-type of operation
  • the operation result of the first sub-type of operation is used to obtain the input of the second sub-type of operation
  • the operation result of the second sub-type of operation is used to transmit to the second training device.
  • the operation result obtained by the first training device performing the second sub-type of operation is the data that needs to be transmitted to the second training device
  • the operation result obtained by the first training device performing the first sub-type of operation is used as the input of the second sub-type of operation, that is, the execution of the second sub-type of operation depends on the first sub-type of operation.
  • only after the first sub-type of operation is executed can the data used as the input of the second sub-type of operation be obtained, and only then can the second sub-type of operation be executed.
  • the execution of the second sub-category operation that directly generates the data to be transmitted is dependent on the execution of the first sub-category operation, so the first sub-category operation also belongs to the first category operation, that is, the first sub-category operation can also be understood as being used to generate the data to be transmitted to the second training device. Therefore, for the first sub-category operation, the execution priority of the first sub-category operation is also higher than the execution priority of the second category operation.
  • within the first type of operation, the execution priority of the second sub-type of operation is higher than that of the first sub-type of operation.
  • the first training device needs to first execute the first sub-category operation and obtain the operation result before it can continue to execute the second sub-category operation based on the obtained operation result. That is, the second sub-category operation can only be executed after the operation conditions of the second sub-category operation are met. However, after the operation conditions of the second sub-category operation are met, the first training device will give priority to executing the second sub-category operation so as to generate data to be transmitted to the second training device as soon as possible.
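  • The priority rules described above can be sketched as a ready queue ordered by operation type. This is a minimal illustration, assuming that an operation is pushed only once its inputs are available; the class name ReadyQueue, the priority labels, and the operation names are hypothetical.

```python
import heapq

# Lower number means higher execution priority.
PRIORITY = {
    "first/sub2": 0,   # directly produces data to be transmitted
    "first/sub1": 1,   # produces the input of a second sub-type operation
    "second": 2,       # produces data used only by the first training device
}


class ReadyQueue:
    """Holds operations whose inputs are available and pops the highest priority one."""

    def __init__(self):
        self._heap = []
        self._counter = 0   # tie-breaker keeps FIFO order within one priority level

    def push(self, kind, name):
        heapq.heappush(self._heap, (PRIORITY[kind], self._counter, name))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


ready = ReadyQueue()
ready.push("second", "backward(group 1)")
ready.push("first/sub1", "encode(group 2)")
ready.push("first/sub2", "emit(group 1)")
ready.push("first/sub1", "encode(group 3)")
order = [ready.pop() for _ in range(len(ready))]
print(order)
```

  • The second sub-type operation runs first, the remaining first-type operations follow in arrival order, and the second-type operation (the reverse gradient calculation) runs last, so data to be transmitted is produced as early as possible.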
  • the model deployed by the first training device includes, for example, a first sub-model and a second sub-model.
  • the first type of operation includes an operation for processing multiple groups of data based on the first sub-model, and an operation for processing the second sub-model based on data obtained from the second training device. For example, after the first training device processes multiple groups of data based on the first sub-model, it generates data to be transmitted to the second training device; and, after the first training device processes the second sub-model based on the data obtained from the second training device, it continues to generate data to be transmitted to the second training device.
  • the first type of operation includes two types of operations (i.e., operations for processing multiple groups of data based on the first sub-model and operations for processing the second sub-model based on data obtained from the second training device), and these two types of operations can generate different types of data, respectively, and the generated data all need to be transmitted to the second training device.
  • the input of the second sub-model may also include the operation results of processing multiple groups of data based on the first sub-model. That is, the operation related to the second sub-model is actually the operation of processing the second sub-model based on the operation results of the first sub-model and the data obtained from the second training device.
  • the first type of operation includes the operation of processing multiple groups of data based on the first sub-model, and the operation of processing the second sub-model based on the operation results of the first sub-model and the data obtained from the second training device.
  • the second type of operation may include an operation of performing reverse gradient calculation on the first sub-model.
  • the model deployed by the first training device does not include a sub-model.
  • the first training device is, for example, training device 1 in Figure 4A
  • the model deployed by the first training device is, for example, model 1 in Figure 4A
  • the second training device is, for example, training device 2 in Figure 4A
  • the model deployed by the second training device is, for example, model 2 in Figure 4A.
  • the first type of operation in the training process may be, for example, the forward calculation performed on the model 1 in FIG. 4A
  • the second type of operation may be, for example, the reverse gradient calculation performed on the model 1 in FIG. 4A.
  • the model deployed by the first training device includes two sub-models, and both sub-models generate data that needs to be transmitted to the second training device.
  • Figure 5 is a schematic diagram of a training process provided by an embodiment of the present application. As shown in Figure 5, the training device 1 is deployed with model A1 and model A2, and the training device 2 is deployed with model B.
  • training device 1 performs forward calculation on model A1 based on the local training data set, and transmits the obtained calculation result 1 to training device 2.
  • Training device 2 performs forward calculation on model B based on calculation result 1 sent by training device 1, obtains calculation result 2, and returns the obtained calculation result 2 to training device 1.
  • Training device 1 performs forward calculation on model A2 based on calculation result 2 returned by training device 2, and performs reverse gradient calculation on model A2 based on the final calculation result to obtain gradient data 1.
  • training device 1 transmits gradient data 1 to training device 2, and training device 2 performs reverse gradient calculation on model B based on gradient data 1. After completing reverse gradient calculation on model B, training device 2 transmits the obtained gradient data 2 to training device 1, and training device 1 performs reverse gradient calculation on model A1 based on gradient data 2.
  • the above training process is for a set of data initially input by the training device 1. Since the training device 1 divides multiple data of the same batch into multiple groups of data, in the actual training process, for one batch of data, the training device 1 and the training device 2 need to perform the calculations in the above process multiple times.
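  • The Figure 5 flow for a single group of data can be traced with a minimal Python sketch. It is only an illustration under stated assumptions: models A1, B and A2 are replaced by hypothetical scalar multipliers (`a1`, `b`, `a2`), the loss is assumed to be a squared error, and the "transmission" between the two training devices is just a variable hand-off.

```python
def figure_5_pass(x, target, a1, b, a2):
    """One pass of the Figure 5 flow for a single group, with scalar stand-in models."""
    # Training device 1: forward on model A1 -> calculation result 1 (sent to device 2).
    result_1 = a1 * x
    # Training device 2: forward on model B -> calculation result 2 (returned to device 1).
    result_2 = b * result_1
    # Training device 1: forward on model A2, then reverse gradient on A2.
    y = a2 * result_2
    d_y = y - target                      # derivative of the assumed loss 0.5 * (y - target) ** 2
    grad_a2 = d_y * result_2
    gradient_data_1 = d_y * a2            # sent to device 2
    # Training device 2: reverse gradient on model B.
    grad_b = gradient_data_1 * result_1
    gradient_data_2 = gradient_data_1 * b # sent to device 1
    # Training device 1: reverse gradient on model A1.
    grad_a1 = gradient_data_2 * x
    return {"a1": grad_a1, "b": grad_b, "a2": grad_a2}


grads = figure_5_pass(x=2.0, target=1.0, a1=0.5, b=3.0, a2=1.0)
print(grads)
```

  • In this sketch the two values that cross the device boundary in the backward direction are `gradient_data_1` and `gradient_data_2`, which correspond to gradient data 1 and gradient data 2 in the description.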
  • the first training device mentioned above is, for example, the training device 1 in Figure 5
  • the second training device is, for example, the training device 2 in Figure 5.
  • the first type of operation mentioned above includes, for example, the forward calculation performed by the training device 1 on the model A1, and the forward calculation and the reverse gradient calculation performed by the training device 1 on the model A2.
  • the first sub-category of operation is, for example, the forward calculation performed by the training device 1 on the model A2, that is, the forward calculation does not directly generate data that needs to be transmitted to the training device 2 but generates data as input for the reverse gradient calculation;
  • the second sub-category of operation is, for example, the forward calculation performed by the training device 1 on the model A1, and the reverse gradient calculation performed by the training device 1 on the model A2.
  • the second type of operation mentioned above is, for example, the reverse gradient calculation performed by the training device 1 on the model A1.
  • the model deployed by the first training device includes two sub-models, and only one sub-model generates data that needs to be transmitted to the second training device.
  • Figure 6 is a schematic diagram of another training process provided by an embodiment of the present application. As shown in Figure 6, models C1 and C2 are deployed in the training device 3, and model D is deployed in the training device 4.
  • the training device 3 performs forward calculation on the model C1 based on the local training data set to obtain the calculation result 3.
  • the training device 4 performs forward calculation on the model D based on the local training data set to obtain the calculation result 4, and transmits the obtained calculation result 4 to the training device 3.
  • After obtaining calculation results 3 and 4, training device 3 performs forward calculation on model C2 based on calculation results 3 and 4, and performs reverse gradient calculation on model C2 based on the result obtained by forward calculation to obtain gradient data. Then, training device 3 transmits gradient data to training device 4, so that training device 4 performs reverse gradient calculation on model D based on the received gradient data. In addition, training device 3 also continues to perform reverse gradient calculation on model C1 based on the gradient data obtained by performing reverse gradient calculation on model C2, thereby obtaining gradient data of model C1.
  • the above training process is for a set of data initially input by the training device 3 and the training device 4.
  • the training device 3 and the training device 4 will divide multiple data of the same batch into multiple groups of data. Therefore, for a batch of data, the training device 3 and the training device 4 need to perform the calculations in the above process multiple times.
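  • The Figure 6 flow differs from Figure 5 in that both training devices start from their own local data and only training device 3 holds the top model. A minimal sketch for one group of data follows; the scalar models `c1`, `c2`, `d`, the summation of calculation results 3 and 4 inside C2, and the squared-error loss are all assumptions made for illustration.

```python
def figure_6_pass(x3, x4, target, c1, c2, d):
    """One pass of the Figure 6 flow for a single group, with scalar stand-in models."""
    # Training device 3: forward on model C1 -> calculation result 3 (kept locally).
    result_3 = c1 * x3
    # Training device 4: forward on model D -> calculation result 4 (sent to device 3).
    result_4 = d * x4
    # Training device 3: forward on model C2, then reverse gradient on C2.
    y = c2 * (result_3 + result_4)        # assumed way of combining the two results
    d_y = y - target                      # derivative of the assumed loss 0.5 * (y - target) ** 2
    grad_c2 = d_y * (result_3 + result_4)
    gradient_data = d_y * c2              # sent to device 4 and reused locally for C1
    # Training device 4: reverse gradient on model D.
    grad_d = gradient_data * x4
    # Training device 3: reverse gradient on model C1.
    grad_c1 = gradient_data * x3
    return {"c1": grad_c1, "c2": grad_c2, "d": grad_d}


grads = figure_6_pass(x3=1.0, x4=2.0, target=2.0, c1=2.0, c2=1.0, d=0.5)
print(grads)
```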
  • the first training device mentioned above is, for example, the training device 3 in Figure 6, and the second training device is, for example, the training device 4 in Figure 6.
  • the first type of operation mentioned above includes, for example, the forward calculation performed by the training device 3 on the model C1, and the forward calculation and the reverse gradient calculation performed by the training device 3 on the model C2.
  • the first sub-category of operation is, for example, the forward calculation performed by the training device 3 on the model C2, that is, the forward calculation does not directly generate data that needs to be transmitted to the training device 4 but generates data as input for the reverse gradient calculation;
  • the second sub-category of operation is, for example, the forward calculation performed by the training device 3 on the model C1, and the reverse gradient calculation performed by the training device 3 on the model C2.
  • the second type of operation mentioned above is, for example, the reverse gradient calculation performed by the training device 3 on the model C1.
  • the first training device is, for example, the training device 4 in Figure 6
  • the second training device is, for example, the training device 3 in Figure 6
  • the first type of operation is, for example, the forward calculation performed by the training device 4 on the model D
  • the second subtype of operation is, for example, the reverse gradient calculation performed by the training device 4 on the model D
  • the second type of operation is, for example, the reverse gradient calculation performed by the training device 4 on the model D.
  • the first training device includes one or two sub-models, respectively, and the following are described in detail.
  • the model deployed by the first training device may also be other types of models, and the first type of operation and the second type of operation performed by the first training device may also be determined based on the model deployed by the first training device. This embodiment does not limit the specific form of the model deployed by the first training device.
  • the execution priority of the first type of operation is higher than the execution priority of the second type of operation. Therefore, when the execution requirements of the first type of operation are met, the first training device will always give priority to executing the first type of operation.
  • this embodiment proposes to switch from executing the first type of operation to executing the second type of operation under certain conditions to reduce memory pressure.
  • the first training device caches the data generated by executing the first type of operation into the first queue.
  • the first queue is a data structure in the memory for storing data, and the first queue is specifically used to cache data to be transmitted to the second training device. Based on the first queue, the data in the first queue can be transmitted to the second training device in an orderly manner, that is, the data that enters the first queue first is first transmitted to the second training device.
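  • The first-in, first-out behaviour of the first queue can be illustrated with Python's `collections.deque`. This is a sketch only: the names `cache_result` and `transmit_next` are hypothetical, and the `send` callback stands in for whatever transport actually carries data to the second training device.

```python
from collections import deque

first_queue = deque()  # stands in for the in-memory "first queue"


def cache_result(result):
    """Cache data produced by a first-type operation."""
    first_queue.append(result)


def transmit_next(send):
    """Send the oldest cached item, if any; return True when something was sent."""
    if first_queue:
        send(first_queue.popleft())
        return True
    return False


sent = []
for group_order in (1, 2, 3):
    cache_result(f"result-{group_order}")
while transmit_next(sent.append):
    pass
print(sent)
```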
  • the first training device stops executing the first type of operation and executes the second type of operation. That is, when there is a large amount of data in the first queue, the first training device no longer prioritizes executing the first type of operation, but switches to executing the second type of operation.
  • the size of the first threshold can be determined according to the memory size in the first training device, which is not specifically limited here.
  • the first training device can choose to switch to performing the second type of operation to avoid generating too much data to be transmitted and accumulating in the memory.
  • the first training device stops executing the second type of operation and continues to execute the first type of operation.
  • the first training device may continue to preferentially execute the first type of operation to continuously generate data to be transmitted to the second training device, so as to ensure the continuity of data communication.
  • the data used to support the execution of the second type of operation may refer to data used as input to the second type of operation, i.e., input data of the second type of operation.
  • the second type of operation may refer to the reverse gradient calculation of model 1 performed by training device 1 based on the data returned by training device 2. Then, the data used to support the execution of the second type of operation is the data returned by training device 2.
  • the process of the first training device performing the second type of operation is actually the process of processing the input data of the second type of operation. After the first training device has processed the input data of the second type of operation, the data used to support the execution of the second type of operation has been processed, and the first training device cannot continue to perform the second type of operation, so the first training device switches to continue to perform the first type of operation.
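  • The switching rule between the two types of operations can be summarized as a small decision function. The sketch below is one possible reading of the text, not the claimed implementation: the function name, its parameters, and the choice of a strict "greater than" comparison against the first threshold are assumptions.

```python
def next_operation(queued_items, first_threshold, first_type_left, second_type_inputs):
    """Pick the next operation for the first training device.

    queued_items       -- amount of data waiting in the first queue
    first_type_left    -- number of first-type operations still to run
    second_type_inputs -- number of inputs available for second-type operations
    """
    # Too much untransmitted data: switch to second-type work if any input exists.
    if queued_items > first_threshold and second_type_inputs > 0:
        return "second"
    # Otherwise first-type operations keep their priority.
    if first_type_left > 0:
        return "first"
    # No first-type work left: drain the remaining second-type inputs.
    if second_type_inputs > 0:
        return "second"
    return "wait"


print(next_operation(1, 4, 3, 2))  # queue is short, so first-type work continues
print(next_operation(5, 4, 3, 2))  # queue is over the threshold, so switch
print(next_operation(5, 4, 3, 0))  # no second-type input left, so back to first-type
```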
  • the above introduces the model deployed by the first training device and the first and second types of operations performed based on the model.
  • the specific process of the training device performing the first and second types of operations based on different priorities will be described in detail below with specific examples.
  • FIG. 7 is a schematic diagram of a training architecture provided in an embodiment of the present application.
  • the two training devices participating in the training can be divided into a leader node (i.e., the leader node in FIG. 7) and a follower node (i.e., the follower node in FIG. 7) according to their roles in the training task.
  • the leader node may refer to a training device that initiates a training task
  • the follower node refers to a training device that receives a training task and participates in the training.
  • the training device 3 may be a leader node
  • the training device 4 may be a follower node.
  • As shown in Figure 7, for the leader node and the follower node, the functional components used to implement model training are the same, and the difference lies in the different training tasks. Taking the leader node as an example, the leader node includes workers and trainers.
  • Task generation refers to negotiating with the follower node before generating a training task to determine the training data set and the nodes to be trained in the training task.
  • Data reading refers to reading the training data set required for training during model training.
  • Parallel task scheduling consists of two parts, namely multi-queue caching and greedy mechanism control.
  • Multi-queue caching means that during the training process, different types of operations and data obtained from communication are cached through queues to determine the execution of operations based on the amount of data in each queue; greedy mechanism control means that during the training process, the execution of operations is determined based on the amount of data in each queue, so as to give priority to operations with high priority as much as possible.
  • trainers are responsible for performing forward calculations, reverse calculations, and optimizer updates (i.e. updating model weights) on the model.
  • Figure 8 is a flowchart of parallel task scheduling performed by the training device 3, provided in an embodiment of the present application. As shown in Figure 8, the process of parallel task scheduling performed by the training device 3 includes the following steps 801-810.
  • Step 801 divide a batch of data into multiple groups of data.
  • the training device 3 first obtains a batch of data based on the batch size of the model, and divides the batch of data into multiple groups of data, and each group of data can be numbered in order.
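  • Step 801 can be sketched as a small helper that splits one batch into numbered groups. The helper name `split_batch` and the policy of spreading any remainder over the first groups are assumptions; the text only requires that each group contains at least one data item and that the groups can be numbered in order.

```python
def split_batch(batch, num_groups):
    """Split a batch into `num_groups` consecutive groups keyed by group order (1-based)."""
    if not 1 <= num_groups <= len(batch):
        raise ValueError("each group must contain at least one data item")
    size, extra = divmod(len(batch), num_groups)
    groups, start = {}, 0
    for order in range(1, num_groups + 1):
        # The first `extra` groups take one additional item so nothing is dropped.
        end = start + size + (1 if order <= extra else 0)
        groups[order] = batch[start:end]
        start = end
    return groups


groups = split_batch(list(range(10)), 4)
print(groups)
```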
  • Step 802 determining that the data currently to be processed by the model is a group of data with a group order of 1.
  • Step 803 determine whether queue 1 is empty.
  • the queue 1 is used to cache the data transmitted from the training device 4 to the training device 3. That is, the data cached in the queue 1 is used as input data for the training device 3 to perform forward calculation on the model C2.
  • Step 804 If queue 1 is empty, determine whether the forward calculation of model C1 has been completed.
  • the forward calculation of model C1 refers to the calculation of model C1 by the training device 3 based on the divided set of data.
  • the training device 3 needs to perform multiple forward calculations on model C1, and each forward calculation is based on the divided set of data to calculate model C1. Therefore, the number of times the training device 3 performs forward calculations on model C1 is the same as the number of sets of data.
  • if the training device 3 has performed forward calculations on the model C1 based on all of the multiple groups of data, the forward calculation of model C1 has been completed; if the number of times the training device 3 has performed forward calculations on the model C1 is less than the number of groups of data, the forward calculation of model C1 has not been completed.
  • Step 805 if the forward calculation of model C1 is not completed, then the forward calculation is performed on model C1, and the calculation result obtained by performing the forward calculation on model C1 is cached in queue 2.
  • the training device 3 may obtain a corresponding set of data from multiple sets of data based on the currently determined set order, and perform forward calculation on the model C1 based on the obtained set of data.
  • Step 806 After performing forward calculation on the model C1 based on a group of data, the group order corresponding to the data to be processed is increased by 1.
  • the training device 3 can select the next group of data to perform calculations when performing forward calculations on the model C1 next time.
  • Step 807 If the forward calculation of model C1 is completed, determine whether queue 3 is empty.
  • queue 3 is used to cache the calculation results of performing reverse gradient calculation on model C2. That is, the data cached in queue 3 is used as input data for performing reverse gradient calculation on model C1.
  • Step 808 if queue 3 is not empty, reverse gradient calculation is performed on model C1 based on the data in queue 3.
  • if queue 3 is empty, return to step 803 to determine whether there is data transmitted by training device 4 in queue 1, that is, continue to wait for training device 4 to transmit data.
  • Step 809 If queue 1 is not empty, the data in queue 1 and queue 2 are dequeued at the same time.
  • When queue 1 is not empty, it means that training device 4 has transmitted data to training device 3, so training device 3 can extract data from queue 1 and queue 2, that is, extract data required to perform forward calculation on model C2.
  • Step 810 based on the data dequeued from queue 1 and queue 2, forward calculation and reverse gradient calculation are performed on model C2, and the calculation result obtained by reverse gradient calculation is put into queue 3.
  • the training device 3 updates the model C2 based on the obtained multiple gradients.
  • After the training device 3 has performed all reverse gradient calculations on the model C1, the training device 3 updates the model C1 based on the multiple gradients obtained.
  • since the calculation result obtained by performing forward calculation on model C1 is the basis of all other operations, the training device 3 gives priority to performing forward calculation on model C1; secondly, since the calculation results obtained by the forward calculation and reverse gradient calculation performed by the training device 3 on model C2 need to be transmitted to the training device 4, after the training device 3 is able to perform the calculation on model C2, the training device 3 gives priority to performing forward calculation and reverse gradient calculation on model C2; finally, when the training device 3 is unable to perform the above-mentioned operations, the training device 3 performs reverse gradient calculation on model C1.
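  • The decision flow of steps 803-810 can be replayed with a small tick-based simulation. This is a sketch under several assumptions: each operation takes one tick, `arrival_ticks` is a hypothetical list of the ticks at which results from training device 4 land in queue 1, and the C2 branch is only taken when queue 2 also holds a matching C1 result (a guard the flowchart does not spell out).

```python
from collections import deque


def run_device_3(num_groups, arrival_ticks):
    """Return the order of operations chosen by training device 3 for one batch."""
    if len(arrival_ticks) != num_groups:
        raise ValueError("one arrival from training device 4 is expected per group")
    queue_1, queue_2, queue_3 = deque(), deque(), deque()
    arrivals = deque(sorted(arrival_ticks))
    next_group, c1_backward_done, tick, trace = 1, 0, 0, []
    while c1_backward_done < num_groups:
        tick += 1
        while arrivals and arrivals[0] <= tick:      # data from device 4 enters queue 1
            arrivals.popleft()
            queue_1.append("calculation result 4")
        if queue_1 and queue_2:                      # steps 803, 809, 810
            queue_1.popleft()
            group = queue_2.popleft()
            queue_3.append(group)                    # C2 forward + reverse gradient
            trace.append(f"C2:{group}")
        elif next_group <= num_groups:               # steps 804-806
            queue_2.append(next_group)               # C1 forward
            trace.append(f"C1f:{next_group}")
            next_group += 1
        elif queue_3:                                # steps 807-808
            trace.append(f"C1b:{queue_3.popleft()}") # C1 reverse gradient
            c1_backward_done += 1
        else:
            trace.append("wait")                     # back to step 803
    return trace


trace = run_device_3(num_groups=3, arrival_ticks=[2, 3, 7])
print(trace)
```

  • In this run the reverse gradient calculation on C1 for group 1 is executed before the C2 calculation for group 3, because the third result from training device 4 has not yet arrived; that is the lowest-priority operation filling otherwise idle time.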
  • Figure 9 is a schematic diagram of a process of parallel task scheduling performed by a training device 4 provided in an embodiment of the present application. As shown in Figure 9, the process of parallel task scheduling performed by the training device 4 includes the following steps 901-908.
  • Step 901 divide a batch of data into N groups of data.
  • Step 902 determine that the data currently to be processed by the model is a group of data with a group order of 1.
  • steps 901-902 in this embodiment are similar to the above-mentioned steps 801-802. For details, please refer to steps 801-802, which will not be repeated here.
  • Step 903 determining whether the data volume of queue 4 is greater than a first threshold, or whether the forward calculation of model D is completed.
  • queue 4 is used to cache the calculation results obtained by the training device 4 performing forward calculation on model D. If the amount of data in queue 4 is greater than the first threshold, it means that the speed at which the training device 4 performs forward calculation to generate data to be transmitted is much faster than the speed at which the training device 4 transmits data to the training device 3.
  • the completion of the forward calculation of model D may mean that the training device 4 has performed forward calculation on model D based on the N groups of data obtained by division.
  • Step 904 if the amount of data in queue 4 is not greater than the first threshold and the forward calculation of model D has not been completed, forward calculation is performed on model D and the calculation result is cached in queue 4, so that the calculation result in queue 4 can be transmitted to the training device 3.
  • the training device 4 starts to transmit the calculation results in the queue 4 to the training device 3.
  • the training device can still perform other operations.
  • Step 905 After performing a forward calculation on the model D based on a group of data, the group number corresponding to the data to be processed is increased by 1.
  • After executing step 905, continue to execute the above step 903.
  • Step 906 if the amount of data in queue 4 is greater than the first threshold, or the forward calculation of model D is completed, reverse gradient calculation is performed on model D based on the data in queue 5.
  • the amount of data in queue 4 is greater than the first threshold, it means that the speed at which the training device 4 performs forward calculations to generate data to be transmitted is much faster than the speed at which the training device 4 transmits data to the training device 3. Therefore, the forward calculation of model D can be suspended, and the reverse gradient calculation of model D can be executed.
  • Step 907 determine whether the reverse gradient calculation is completed.
  • Step 908 Update the weight of model D based on the calculated multiple gradients.
  • the weight of the model D is updated based on the calculated multiple gradients, thereby completing one iteration training of the model D. If the reverse gradient calculation is not completed, continue to execute the above step 903.
  • the training device 4 gives priority to performing the forward calculation on the model D.
  • the forward calculation on the model D can be suspended, and the reverse gradient calculation on the model D can be performed to reduce the memory pressure.
  • the reverse gradient calculation is performed on the model D and the weight of the model D is updated based on the calculated gradient.
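  • The effect of the first threshold in steps 903-906 can be observed in a tick-based simulation of training device 4. The link model is an assumption made for illustration: one cached result leaves queue 4 every `send_every` ticks, and the matching gradient from training device 3 is assumed to arrive in queue 5 immediately.

```python
from collections import deque


def run_device_4(num_groups, first_threshold, send_every):
    """Return (trace, peak length of queue 4) for one batch on training device 4."""
    queue_4, queue_5 = deque(), deque()   # outgoing forward results / incoming gradients
    next_group, backward_done, tick, peak, trace = 1, 0, 0, 0, []
    while backward_done < num_groups:     # step 907: stop once every gradient is computed
        tick += 1
        forward_done = next_group > num_groups
        if len(queue_4) > first_threshold or forward_done:   # step 903
            if queue_5:                                       # step 906
                queue_5.popleft()
                backward_done += 1
                trace.append("B")
            else:
                trace.append("-")         # no gradient has arrived yet
        else:                                                 # steps 904-905
            queue_4.append(next_group)
            next_group += 1
            trace.append("F")
        peak = max(peak, len(queue_4))
        if tick % send_every == 0 and queue_4:                # assumed link model
            queue_5.append(queue_4.popleft())
    return "".join(trace), peak


trace, peak = run_device_4(num_groups=4, first_threshold=1, send_every=2)
print(trace, peak)
```

  • With `first_threshold=1` the queue never holds more than two results in this run, whereas raising the threshold to 10 lets all forward calculations run back to back and the peak grows to three.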
  • the model training device provided in an embodiment of the present application may be a first training device, which includes: an acquisition module 1001 for acquiring a training data set, where the training data set includes multiple data; and a processing module 1002 for dividing the multiple data into multiple groups of data, where each of the multiple groups of data includes at least one data, and the data volume of each group of data is related to the communication capability of the first training device; the processing module 1002 is also used to process the multiple groups of data in batches based on the model deployed in the first training device, so as to train the model, and to transmit the data obtained by processing the multiple groups of data to the second training device in batches, where the second training device and the first training device jointly participate in the training of the model.
  • the operations performed during the training model include first-category operations and second-category operations, where the first-category operations are used to generate data to be transmitted to a second training device based on multiple sets of data, and the second-category operations are used to generate data to be processed only by the first training device, and the execution priority of the first-category operations is higher than the execution priority of the second-category operations.
  • the first type of operation includes a first sub-type of operation and a second sub-type of operation
  • the operation result of the first sub-type of operation is used to obtain the input of the second sub-type of operation
  • the operation result of the second sub-type of operation is used to be transmitted to the second training device.
  • the execution priority of the second subclass operation is higher than that of the first subclass operation.
  • processing module 1002 is further configured to:
  • the first type of operation is stopped and the second type of operation is performed.
  • processing module 1002 is further configured to:
  • the training in which the first training device participates is federated learning.
  • the first type of operation includes operations for processing multiple groups of data based on the model
  • the second type of operation includes operations for performing reverse gradient calculations on the model based on data obtained from the second training device.
  • the model includes a first sub-model and a second sub-model
  • the first type of operation includes an operation for processing multiple groups of data based on the first sub-model, and an operation for processing the second sub-model based on data obtained from the second training device;
  • the second type of operation includes operations for performing reverse gradient calculation on the first sub-model.
  • the number of multiple data is related to the batch size of the model
  • the target gradient is related to multiple gradients obtained based on multiple groups of data, and the target gradient is used by the first training device to update the model.
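  • One common way to relate a target gradient to the per-group gradients is a size-weighted average, which reproduces the gradient of the whole batch. The sketch below assumes that rule, together with a hypothetical scalar model `w * x` and a mean squared-error loss; the text itself only states that the target gradient is related to the multiple gradients.

```python
def group_gradient(w, group):
    """Mean gradient of 0.5 * (w * x - t) ** 2 with respect to w over one group."""
    return sum((w * x - t) * x for x, t in group) / len(group)


def target_gradient(w, groups):
    """Size-weighted average of the per-group gradients (assumed combination rule)."""
    total = sum(len(g) for g in groups)
    return sum(group_gradient(w, g) * len(g) for g in groups) / total


batch = [(1.0, 2.0), (2.0, 2.0), (3.0, 5.0), (4.0, 6.0)]   # (input, label) pairs
groups = [batch[:2], batch[2:]]
w = 1.0

print(target_gradient(w, groups), group_gradient(w, batch))
```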
  • the number of the multiple data is related to the batch size of the model, the number of the multiple data groups is positively correlated with the amount of data to be transmitted, and the number of the multiple data groups is negatively correlated with the communication capability of the first training device and the training duration;
  • the amount of data to be transmitted is the amount of data to be transmitted generated after processing multiple groups of data
  • the training duration is the duration of the first training device training the model based on the multiple groups of data.
  • FIG 11 is a schematic diagram of the structure of an execution device provided in an embodiment of the present application.
  • the execution device 1100 can be specifically manifested as a mobile phone, a tablet, a laptop computer, an intelligent wearable device, a server, etc., which is not limited here.
  • the execution device 1100 includes: a receiver 1101, a transmitter 1102, a processor 1103 and a memory 1104 (wherein the number of processors 1103 in the execution device 1100 can be one or more, and one processor is taken as an example in Figure 11), wherein the processor 1103 may include an application processor 11031 and a communication processor 11032.
  • the receiver 1101, the transmitter 1102, the processor 1103 and the memory 1104 may be connected via a bus or other means.
  • the memory 1104 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1103. A portion of the memory 1104 may also include a non-volatile random access memory (NVRAM).
  • the memory 1104 stores processor and operation instructions, executable modules or data structures, or subsets thereof, or extended sets thereof, wherein the operation instructions may include various operation instructions for implementing various operations.
  • the processor 1103 controls the operation of the execution device.
  • the various components of the execution device are coupled together through a bus system, wherein the bus system may include a power bus, a control bus, and a status signal bus in addition to a data bus.
  • various buses are referred to as bus systems in the figures.
  • the method disclosed in the above embodiment of the present application can be applied to the processor 1103, or implemented by the processor 1103.
  • the processor 1103 can be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by the hardware integrated logic circuit in the processor 1103 or the instruction in the form of software.
  • the above processor 1103 can be a general processor, a digital signal processor (digital signal processing, DSP), a microprocessor or a microcontroller, and can further include an application specific integrated circuit (application specific integrated circuit, ASIC), a field programmable gate array (field-programmable gate array, FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components.
  • the processor 1103 can implement or execute the methods, steps and logic block diagrams disclosed in the embodiment of the present application.
  • the general processor can be a microprocessor or the processor can also be any conventional processor, etc.
  • the steps of the method disclosed in the embodiment of the present application can be directly embodied as a hardware decoding processor to execute, or a combination of hardware and software modules in the decoding processor to execute.
  • the software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, or an electrically erasable programmable memory, a register, etc.
  • the storage medium is located in the memory 1104, and the processor 1103 reads the information in the memory 1104 and completes the steps of the above method in combination with its hardware.
  • the receiver 1101 can be used to receive input digital or character information and generate signal input related to the relevant settings and function control of the execution device.
  • the transmitter 1102 can be used to output digital or character information through the first interface; the transmitter 1102 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 1102 can also include a display device such as a display screen.
  • the electronic device provided in the embodiment of the present application may specifically be a chip, and the chip includes: a processing unit and a communication unit, the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or a circuit, etc.
  • the processing unit may execute the computer execution instructions stored in the storage unit, so that the chip in the execution device executes the method for selecting the model hyperparameters described in the above embodiment, or so that the chip in the training device executes the method for selecting the model hyperparameters described in the above embodiment.
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM), etc.
  • FIG. 12 is a schematic diagram of a structure of a chip provided in an embodiment of the present application.
  • the chip can be a neural network processor NPU 1200.
  • NPU 1200 is mounted on the host CPU (Host CPU) as a coprocessor, and tasks are assigned by the Host CPU.
  • the core part of the NPU is the operation circuit 1203, which is controlled by the controller 1204 to extract matrix data from the memory and perform multiplication operations.
  • the operation circuit 1203 includes multiple processing units (Process Engine, PE) inside.
  • the operation circuit 1203 is a two-dimensional systolic array.
  • the operation circuit 1203 can also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
  • the operation circuit 1203 is a general-purpose matrix processor.
  • the operation circuit takes the corresponding data of matrix B from the weight memory 1202 and caches it on each PE in the operation circuit.
  • the operation circuit takes the matrix A data from the input memory 1201 and performs matrix operation with matrix B, and the partial result or final result of the matrix is stored in the accumulator 1208.
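  • The role of the accumulator can be illustrated arithmetically: a matrix product is built up from partial results, one slice of the inner dimension at a time. The sketch below shows only that arithmetic in plain Python; the function name and the `tile` parameter are hypothetical, and it does not model the systolic array or the timing of the operation circuit.

```python
def matmul_accumulate(a, b, tile=2):
    """Multiply a (rows x inner) by b (inner x cols), accumulating partial results."""
    rows, inner, cols = len(a), len(b), len(b[0])
    acc = [[0] * cols for _ in range(rows)]            # plays the role of the accumulator
    for k0 in range(0, inner, tile):                   # one partial product per slice
        k1 = min(k0 + tile, inner)
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += sum(a[i][k] * b[k][j] for k in range(k0, k1))
    return acc


matrix_a = [[1, 2, 3], [4, 5, 6]]       # input data
matrix_b = [[1, 0], [0, 1], [2, 2]]     # weight data
product = matmul_accumulate(matrix_a, matrix_b)
print(product)
```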
  • the unified memory 1206 is used to store input data and output data.
  • the weight data is directly transferred to the weight memory 1202 through the direct memory access controller (DMAC) 1205.
  • the input data is also transferred to the unified memory 1206 through the DMAC.
  • The bus interface unit (BIU) 1210 is used for interaction among the AXI bus, the DMAC, and the instruction fetch buffer (IFB) 1209.
  • The bus interface unit 1210 is used by the instruction fetch buffer 1209 to obtain instructions from the external memory, and is also used by the direct memory access controller 1205 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1206 or to transfer weight data to the weight memory 1202 or to transfer input data to the input memory 1201.
  • the vector calculation unit 1207 includes multiple operation processing units, and further processes the output of the operation circuit 1203 when necessary, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. It is mainly used for non-convolutional/fully connected layer network calculations in neural networks, such as Batch Normalization, pixel-level summation, upsampling of feature planes, etc.
  • the vector calculation unit 1207 can store the processed output vector to the unified memory 1206.
  • The vector calculation unit 1207 can apply a linear or nonlinear function to the output of the operation circuit 1203, for example linear interpolation of the feature plane extracted by a convolution layer, or, as another example, a function applied to a vector of accumulated values to generate activation values.
  • the vector calculation unit 1207 generates a normalized value, a pixel-level summed value, or both.
  • the processed output vector can be used as an activation input to the operation circuit 1203, for example, for use in a subsequent layer in a neural network.
  • An instruction fetch buffer 1209 connected to the controller 1204 is used to store instructions used by the controller 1204.
  • The unified memory 1206, input memory 1201, weight memory 1202, and instruction fetch buffer 1209 are all on-chip memories. External memories are private to the NPU hardware architecture.
  • the processor mentioned in any of the above places may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above program.
  • Figure 13 is a schematic diagram of the structure of a computer-readable storage medium provided in an embodiment of the present application.
  • the present application also provides a computer-readable storage medium.
  • the method disclosed in Figure 3 above can be implemented as computer program instructions encoded in a machine-readable format on a computer-readable storage medium or encoded on other non-transitory media or products.
  • FIG. 13 schematically illustrates a conceptual partial view of an example computer-readable storage medium including a computer program for executing a computer process on a computing device, arranged in accordance with at least some embodiments presented herein.
  • computer readable storage medium 1300 is provided using signal bearing medium 1301.
  • Signal bearing medium 1301 may include one or more program instructions 1302, which when executed by one or more processors may provide the functionality or portions of the functionality described above with respect to FIG.
  • the signal bearing medium 1301 may include a computer readable medium 1303 such as, but not limited to, a hard drive, a compact disk (CD), a digital video disk (DVD), a digital tape, a memory, a ROM or RAM, and the like.
  • the signal bearing medium 1301 may include a computer recordable medium 1304, such as, but not limited to, a memory, a read/write (R/W) CD, a R/W DVD, etc.
  • the signal bearing medium 1301 may include a communication medium 1305, such as, but not limited to, a digital and/or analog communication medium (e.g., a fiber optic cable, a waveguide, a wired communication link, a wireless communication link, etc.).
  • For example, the signal bearing medium 1301 may be conveyed by a wireless form of the communication medium 1305 (e.g., a wireless communication medium complying with the IEEE 802.11 standard or another transmission protocol).
  • the one or more program instructions 1302 may be, for example, computer executable instructions or logic implementation instructions.
  • A computing device may be configured to provide various operations, functions, or actions in response to the program instructions 1302 communicated to it via one or more of the computer readable medium 1303, the computer recordable medium 1304, and/or the communication medium 1305.
  • the device embodiments described above are merely schematic, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • The technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes a number of instructions that enable a computer device (which can be a personal computer, a training device, a network device, or the like) to execute the methods of the embodiments of the present application.
  • all or part of the embodiments may be implemented by software, hardware, firmware or any combination thereof.
  • all or part of the embodiments may be implemented in the form of a computer program product.
  • A computer program product consists of one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • Computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • For example, computer instructions can be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired manner (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, or microwave).
  • The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a training device or a data center, that integrates one or more available media. An available medium can be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)).

Abstract

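A model training method, applied in the field of artificial intelligence. In the method, multiple pieces of data used during training are divided into multiple groups, and the amount of data in each group is related to the communication capability of the training device. The training device processes the groups in batches through the model and transmits the multiple portions of data obtained from processing the groups to other training devices in batches, so as to train the model. In this way, while the training device is transmitting to other training devices the data obtained from processing earlier groups, it can continue to process later groups, so that model training time is hidden within the communication time between training devices. This ultimately shortens the idle waiting time over the whole iterative training process and improves overall training efficiency.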

Description

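A model training method and related apparatus
This application claims priority to Chinese Patent Application No. 202211372502.8, filed with the China National Intellectual Property Administration on November 3, 2022 and entitled "A model training method and related apparatus", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of artificial intelligence (AI) technologies, and in particular to a model training method and related apparatus.
Background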
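As the digitalization of human society accelerates, large amounts of data are generated. Machine learning can automatically mine the information contained in data, so machine learning models trained on large amounts of data are already used in many scenarios, such as face recognition, speech translation, and computer-aided medical diagnosis. In practice, the accuracy and generalization ability of a machine learning model are crucial, and both depend on training the model with large amounts of data.
Because of data privacy and security constraints such as laws and regulations, trade secrets, and personal privacy, multiple data owners often cannot exchange data directly. As a result, their data cannot be pooled to train a machine learning model, which limits further improvement of the model's capability. Federated learning was created to solve this problem.
Federated learning is a distributed machine learning technique. Its core idea is to train a model in a distributed manner across multiple data sources that each hold local data, and to build a global model over the combined data by exchanging only intermediate results, without exchanging the local data, thereby balancing data privacy protection against shared computation on data.
However, the devices participating in federated learning are located in different places and usually need to exchange large amounts of data over a network while performing federated learning. Limited by the communication capability of the network, when there is a lot of data to exchange, the time the devices spend communicating often exceeds the time they spend training the model, so the overall latency of federated learning is long.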
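Summary
This application provides a model training method that can shorten the idle waiting time over the whole iterative training process and improve overall training efficiency.
A first aspect of this application provides a model training method, applied in the field of artificial intelligence. The method includes: a first training device obtains a training data set that includes multiple pieces of data. The multiple pieces of data obtained by the first training device may be one batch of data, that is, the number of pieces of data equals the batch size of the model. The batch size indicates the number of data samples used to train the model at the same time, that is, the number of data samples the model needs in one training iteration.
Then, the first training device divides the multiple pieces of data into multiple groups. Each group includes at least one piece of data, and the amount of data in each group is related to the communication capability of the first training device. For example, the amount of data in each group is positively correlated with the communication capability of the first training device: the stronger the communication capability, the larger each group.
Finally, based on the model deployed in the first training device, the groups are processed in batches, one group at a time, so as to train the model, and the data obtained from processing the groups is transmitted in batches to a second training device, which participates in training the model together with the first training device. Specifically, processing the groups through the model yields multiple portions of data, and each group corresponds to exactly one portion. Because the groups are processed in batches, each portion is likewise transmitted to the second training device in batches.
In this solution, the multiple pieces of data used during training are divided into multiple groups, and the amount of data in each group is related to the communication capability of the training device. The training device processes the groups in batches through the model and transmits the resulting portions of data to other training devices in batches, so as to train the model. In this way, while the training device is transmitting the data obtained from processing earlier groups, it can continue to process later groups, so that model training time is hidden within the communication time between training devices. This ultimately shortens the idle waiting time over the whole iterative training process and improves overall training efficiency.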
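In a possible implementation, the operations performed while training the model include first-class operations and second-class operations. A first-class operation generates, based on the groups of data, the portions of data to be transmitted to the second training device; that is, a first-class operation is one that relies on the groups of data to produce data to be transmitted. A second-class operation generates data that is processed only by the first training device; that is, whether a second-class operation is executed does not affect the generation of data to be transmitted.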
此外,第一类运算的执行优先级高于第二类运算的执行优先级。也就是说,在训练过程中,当同时存在均能够执行的第一类运算和第二类运算时,第一训练装置优先执行第一类运算。在所有的第一类运算执行完毕后,第一训练装置再执行第二类运算。
本方案中,对于会影响待传输数据生成的第一类运算以及不会影响待传输数据生成的第二类运算,训练装置优先执行第一类运算,以便于能够持续生成待传输数据,尽可能地避免训练装置处于通信空闲状态,以保证通信和训练并行的时间尽可能地长,从而缩短整个迭代过程中的空闲等待时长,提高整体的训练效率。
在一种可能的实现方式中,第一类运算包括第一子类运算和第二子类运算,第一子类运算的运算结果用于得到第二子类运算的输入,第二子类运算的运算结果用于向第二训练装置传输。也就是说,第一训练装置执行第二子类运算所得到的运算结果即为需要向第二训练装置传输的数据,而第一训练装置执行第一子类运算所得到的运算结果是用于作为第二子类运算的输入,即第二子类运算的执行依赖于第一子类运算。在实际训练过程中,只有先执行第一子类运算后,才能够得到作为第二子类运算的输入的数据,进而才能执行第二子类运算。
在一种可能的实现方式中,第二子类运算的执行优先级高于第一子类运算。
简单来说,由于第二子类运算的执行依赖于第一子类运算,因此在训练过程中,第一训练装置需要先执行第一子类运算,并得到运算结果后,才能够基于所得到的运算结果继续执行第二子类运算。即,第二子类运算的运算条件满足后,才能够执行第二子类运算。但是,在第二子类运算的运算条件满足后,第一训练装置则优先执行第二子类运算,以便于尽快生成待传输至第二训练装置的数据。
在一种可能的实现方式中,在训练模型的过程中,第一训练装置将执行第一类运算所产生的数据缓存至第一队列中,第一队列用于缓存待传输至第二训练装置的数据。在第一队列中数据的数据量大于或等于第一阈值的情况下,第一训练装置则停止执行第一类运算,并执行第二类运算。
具体来说,在第一队列中的数据较多的情况下,代表第一训练装置向第二训练装置发送数据的速度远跟不上第一训练装置执行第一类运算而产生数据的速度,因此第一训练装置继续优先执行第一类运算也不会提高整体的训练效率。在这种情况下,为了避免内存开销过大,第一训练装置则可以是选择转至执行第二类运算,以免产生过多的待传输数据积压在内存中。
在一种可能的实现方式中,在第一队列中的数据小于第一阈值的情况下,或在用于支持执行第二类运算的数据已处理完毕的情况下,第一训练装置则停止执行第二类运算,并继续执行第一类运算。其中,用于支持执行第二类运算的数据可以是指作为第二类运算的输入的数据,即第二类运算的输入数据。
也就是说,在第一队列中的数据被消耗至一定数量的情况下,为了避免第一队列的数据被消耗完毕而导致出现通信空闲的现象,第一训练装置可以是继续优先执行第一类运算,以持续产生待传输至第二训练装置的数据,保证数据通信的持续性。并且,在第一训练装置处理完毕第二类运算的输入数据之后,用于支持执行第二类运算的数据则已处理完毕,第一训练装置无法继续执行第二类运算,因此第一训练装置转至继续执行第一类运算。
在一种可能的实现方式中,第一训练装置所参与的训练为联邦学习,例如横向联邦学习、纵向联邦学习或联邦迁移学习。
在一种可能的实现方式中,第一类运算包括基于模型对多组数据进行处理的运算,第二类运算包括基于从第二训练装置获取的数据对模型进行反向梯度计算的运算。
在一种可能的实现方式中,模型包括第一子模型和第二子模型;第一类运算包括基于第一子模型对多组数据进行处理的运算,以及基于从第二训练装置获取的数据对第二子模型进行处理的运算;第二类运算包括对第一子模型进行反向梯度计算的运算。
在一种可能的实现方式中,多个数据的数量与模型的批次大小相关;在模型的训练过程中,目标梯度与基于多组数据所得到的多个梯度相关,目标梯度用于第一训练装置更新模型。例如,在基于多组数据分别得到多个梯度的情况下,通过求取多个梯度的平均值,得到目标梯度。
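By computing the target gradient from the multiple gradients and then updating the model based on the target gradient, this solution ensures that the model is updated based on a full batch of data, so the accuracy of model training is not affected.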
在一种可能的实现方式中,多个数据的数量与模型的批次大小相关,多组数据的组数与待传输数据量具有正相关的关系,且多组数据的组数与第一训练装置的通信能力以及训练时长具有负相关的关系;其中,待传输数据量为处理多组数据后所生成的待传输数据的数据量,训练时长为第一训练装置基于多组数据训练模型的时长。
也就是说,处理多组数据后所生成的待传输数据的数据量越大,多组数据的组数则越多,以减少处理每组数据所生成的待传输数据,避免处理每组数据后所生成的待传输数据的数据量过多。并且,第一训练装置的通信能力越强,第一训练装置单位时间内能够向第二训练装置传输的数据则越多,多组数据的组数则可以划分得越少。第一训练装置基于多组数据训练模型的时长越长,则代表第一训练装置处理多组数据以生成待传输数据的速度越慢,多组数据的组数则可以划分得越少。
本申请第二方面提供一种模型训练装置,该模型训练装置为第一训练装置,包括:
获取模块,用于获取训练数据集,训练数据集包括多个数据;
处理模块,用于将多个数据划分为多组数据,多组数据中的每组数据包括至少一个数据,且每组数据的数据量与第一训练装置的通信能力相关;
处理模块,还用于基于第一训练装置中所部署的模型,以组为单位分批对多组数据执行处理,以训练模型并分批向第二训练装置传输处理多组数据所得到的数据,第二训练装置与第一训练装置共同参与模型的训练。
在一种可能的实现方式中,训练模型的过程中所执行的运算包括第一类运算和第二类运算,第一类运算用于基于多组数据生成向第二训练装置传输的数据,第二类运算用于生成仅由第一训练装置处理的数据,第一类运算的执行优先级高于第二类运算的执行优先级。
在一种可能的实现方式中,第一类运算包括第一子类运算和第二子类运算,第一子类运算的运算结果用于得到第二子类运算的输入,第二子类运算的运算结果用于向第二训练装置传输。
在一种可能的实现方式中,第二子类运算的执行优先级高于第一子类运算。
在一种可能的实现方式中,处理模块,还用于:
将执行第一类运算所产生的数据缓存至第一队列中,第一队列用于缓存待传输至第二训练装置的数据;
在第一队列中数据的数据量大于或等于第一阈值的情况下,停止执行第一类运算,并执行第二类运算。
在一种可能的实现方式中,处理模块,还用于:
在第一队列中的数据小于第一阈值的情况下,或在用于支持执行第二类运算的数据已处理完毕的情况下,停止执行第二类运算,并继续执行第一类运算。
在一种可能的实现方式中,第一训练装置所参与的训练为联邦学习。
在一种可能的实现方式中,第一类运算包括基于模型对多组数据进行处理的运算,第二类运算包括基于从第二训练装置获取的数据对模型进行反向梯度计算的运算。
在一种可能的实现方式中,模型包括第一子模型和第二子模型;
第一类运算包括基于第一子模型对多组数据进行处理的运算,以及基于从第二训练装置获取的数据对第二子模型进行处理的运算;
第二类运算包括对第一子模型进行反向梯度计算的运算。
在一种可能的实现方式中,多个数据的数量与模型的批次大小相关;
在模型的训练过程中,目标梯度与基于多组数据所得到的多个梯度相关,目标梯度用于第一训练装置更新模型。
在一种可能的实现方式中,多个数据的数量与模型的批次大小相关,多组数据的组数与待传输数据量具有正相关的关系,且多组数据的组数与第一训练装置的通信能力以及训练时长具有负相关的关系;
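Here, the amount of data to be transmitted is the amount of to-be-transmitted data generated after the groups of data are processed, and the training duration is the time the first training device takes to train the model based on the groups of data.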
本申请第三方面提供一种模型训练装置,可以包括处理器,处理器和存储器耦合,存储器存储有程序指令,当存储器存储的程序指令被处理器执行时实现上述第一方面或第一方面任一实现方式所述的方法。对于处理器执行第一方面的各个可能实现方式中的步骤,具体均可以参阅第一方面,此处不再赘述。
在一种可能的实现方式中,该模型训练装置还包括通信接口,该模型训练装置通过通信接口向其他的模型训练装置传输数据,或者是接收其他的模型训练装置所传输的数据。具体地,该模型训练装置可以是基于通信接口,通过远程直接数据存取(Remote Direct Memory Access,RDMA)技术来向其他的模型训练装置传输数据,在此并不限定模型训练装置之间传输数据的具体方式。
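RDMA transfers data over the network directly into a computer's memory, that is, it quickly moves data from one system into the memory of a remote system without affecting the operating system, and therefore does not occupy the computer's processing resources. Thus, when model training apparatuses transfer data to each other using RDMA, a model training apparatus can store the data to be transmitted in memory so that the data can be transferred over the network directly into the memory of another model training apparatus, which reduces the consumption of processing resources on both apparatuses.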
本申请第四方面提供了一种模型训练系统,包括至少两个模型训练装置,该至少两个模型训练装置共同参与模型的训练,且至少两个模型训练装置中的任意一个模型训练装置采用上述第一方面任一实现方式所述的方法与其他的模型训练装置进行交互并执行模型的训练。
本申请第五方面提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,使得计算机执行上述第一方面任一实现方式所述的方法。
本申请第六方面提供了一种电路系统,所述电路系统包括处理电路,所述处理电路配置为执行上述第一方面任一实现方式所述的方法。
本申请第七方面提供了一种计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面任一实现方式所述的方法。
本申请第八方面提供了一种芯片系统,该芯片系统包括处理器,用于支持服务器或门限值获取装置实现上述第一方面任一实现方式中所涉及的功能,例如,发送或处理上述方法中所涉及的数据和/或信息。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存服务器或通信设备必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包括芯片和其他分立器件。
上述第二方面至第八方面的有益效果可以参考上述第一方面的介绍,在此不再赘述。
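Brief Description of the Drawings
FIG. 1 is a schematic diagram of an application scenario of a model training method according to an embodiment of this application;
FIG. 2 is a schematic structural diagram of an electronic device 101 according to an embodiment of this application;
FIG. 3 is a schematic flowchart of a model training method according to an embodiment of this application;
FIG. 4A is a schematic diagram of multiple training devices performing model training according to an embodiment of this application;
FIG. 4B is a schematic comparison of training durations with and without grouping of the training data according to an embodiment of this application;
FIG. 5 is a schematic diagram of a training process according to an embodiment of this application;
FIG. 6 is a schematic diagram of another training process according to an embodiment of this application;
FIG. 7 is a schematic diagram of a training architecture according to an embodiment of this application;
FIG. 8 is a schematic flowchart of parallel task scheduling performed by training device 3 according to an embodiment of this application;
FIG. 9 is a schematic flowchart of parallel task scheduling performed by training device 4 according to an embodiment of this application;
FIG. 10 is a schematic structural diagram of a model training apparatus according to an embodiment of this application;
FIG. 11 is a schematic structural diagram of an execution device according to an embodiment of this application;
FIG. 12 is a schematic structural diagram of a chip according to an embodiment of this application;
FIG. 13 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of this application.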
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,下面结合附图,对本申请的实施例进行描述。显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。本领域普通技术人员可知,随着新应用场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的描述在适当情况下可以互换,以便使实施例能够以除了在本申请图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或模块的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或模块,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或模块。在本申请中出现的对步骤进行的命名或者编号,并不意味着必须按照命名或者编号所指示的时间/逻辑先后顺序执行方法流程中的步骤,已经命名或者编号的流程步骤可以根据要实现的技术目的变更执行顺序,只要能达到相同或者相类似的技术效果即可。本申请中所出现的单元的划分,是一种逻辑上的划分,实际应用中实现时可以有另外的划分方式,例如多个单元可以结合成或集成在另一个系统中,或一些特征可以忽略,或不执行,另外,所显示的或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,单元之间的间接耦合或通信连接可以是电性或其他类似的形式,本申请中均不作限定。并且,作为分离部件说明的单元或子单元可以是也可以不是物理上的分离,可以是也可以不是物理单元,或者可以分布到多个电路单元中,可以根据实际的需要选择其中的部分或全部单元来实现本申请方案的目的。
为便于理解,以下先介绍本申请实施例所涉及的技术术语。
(1)分布式训练
分布式训练是指采用位于不同地方的多个装置对模型进行训练,进而有效地利用各个装置的计算资源来完成单个装置难以完成的模型训练任务。
(2)联邦学习
联邦学习本质上是一种模型训练方法,能够在保障数据隐私安全及合法合规的基础上,实现数据共享,共同建立模型。联邦学习的核心思想是在多个数据源共同参与模型训练时,不需要进行原始数据流转的前提下,仅通过交互模型中间参数进行模型联合训练,原始数据可以不出本地。这种方式实现数据隐私保护和数据共享分析的平衡,即“数据可用不可见”的数据应用模式。
根据联邦学习所使用数据在各参与方的不同分布情况,联邦学习可以划分为三类:横向联邦学习(Horizontal Federated Learning,HFL)、纵向联邦学习(Vertical Federated Learning,VFL)和联邦迁移学习(Federated Transfer Learning,FTL)。以下将分别介绍这三种类型联邦学习所针对的不同数据分布情况。
(3)横向联邦学习
横向联邦学习是指联邦学习中不同参与方的数据有较大的特征的重叠,但数据样本(即特征所属的样本)的重叠度不高。例如,联邦学习的参与方是两家服务于不同区域市场的银行,他们所服务的客户群体差别较大,但客户的特征可能会因为相似的商业模式而重叠度较高。
(4)纵向联邦学习
纵向联邦学习是指联邦学习中不同参与方的数据样本有较大的重叠,但样本特征的重叠度不高。例如,两家公司(银行和电子商务公司)分别向客户提供不同的服务,且这两间公司拥有客户不同方面的数据,但这两间公司所服务的客户群体有较大的重叠。
一般来说,在目前的联邦学习场景下,纵向联邦学习的应用场景最为广泛,因此纵向联邦学习算法有利于各企业之间建立合作,使用各自的特有数据,共同建立更加强大的模型。
(5)联邦迁移学习
联邦迁移学习是指联邦学习中不同参与方的数据在特征和样本维度重叠度都不是非常高。
(6)神经网络
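A neural network is often simply called a model. It is a mathematical algorithmic model that imitates the behavior of animal neural networks and performs distributed parallel information processing. Depending on the complexity of the system, a neural network processes information by adjusting the interconnections among a large number of internal nodes.
Specifically, a neural network can be composed of neural units. A neural unit can be an operation unit that takes $x_s$ (the input data) and an intercept of 1 as inputs, and the output of the operation unit can be:
$$f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$
where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which introduces nonlinearity into the neural network and converts the input signal of the neural unit into an output signal. The output signal of the activation function can serve as the input to the next convolutional layer, and the activation function can be a sigmoid function.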
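The output of a single neural unit described above can be sketched in a few lines of Python. This is only an illustration: the function names `sigmoid` and `neural_unit_output` are hypothetical and do not come from the application, and sigmoid is used only because the text names it as one possible activation function.

```python
import math


def sigmoid(z: float) -> float:
    """Sigmoid activation, one possible choice for the activation function f."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit_output(xs, weights, bias, activation=sigmoid):
    """Return f(sum_s W_s * x_s + b) for one neural unit (hypothetical helper)."""
    weighted_sum = sum(w * x for w, x in zip(weights, xs))
    return activation(weighted_sum + bias)
```

For instance, with inputs `[1.0, 2.0]`, weights `[0.5, -0.25]`, and zero bias, the weighted sum is 0, so the sigmoid output is 0.5.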
总的来说,神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
(7)损失函数
在训练神经网络的过程中,因为希望神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么神经网络的训练就变成了尽可能缩小这个loss的过程。
总的来说,损失函数是用于度量预测值与实际值之间的差异,以作为模型性能参考。损失函数的值越小,代表模型的预测输出和期望输出(也称为实际值)之间的差值就越小,也就说明模型的性能越好。模型训练的过程,就是不断地通过训练数据进行预测,不断调整预测输出与期望输出之间的差异,使得损失函数的值不断变小的过程。
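As a small illustration of a loss function that measures the difference between predicted and target values, the sketch below computes the mean squared error. The application does not name a specific loss function, so the choice of mean squared error and the helper name `mean_squared_error` are assumptions made for this example.

```python
def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets.

    Mean squared error is an assumed example; the source text does not
    prescribe any particular loss function.
    """
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal in length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared) / len(squared)
```

A smaller returned value means the predictions are closer to the targets, which matches the description that training aims to reduce the loss.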
(8)梯度(gradient)
梯度是一个向量,表示某一个函数在某一点处的方向导数沿着该方向取得最大值,即函数在该点处沿着梯度的方向变化最快,变化率最大。
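(9) Gradient descent
Gradient descent is a method for minimizing an objective function. In general, to find a local minimum of a function with gradient descent, one iteratively steps a specified distance from the current point in the direction opposite to the gradient at that point.
Gradient descent is commonly used to find the minimum of a loss function. By the principle of gradient descent, the loss function decreases fastest along the direction opposite to the gradient, so an extremum can be found most quickly. When the gradient vector is 0, the loss function has reached a local minimum and the model accuracy has reached a local maximum.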
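The iterative search described above can be sketched as follows. This is a minimal one-dimensional illustration under assumed values: the helper name `gradient_descent`, the learning rate of 0.1, the 100 steps, and the example function f(x) = (x - 3)^2 are all chosen here for demonstration and are not part of the application.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        # Move a fixed-size step in the direction opposite to the gradient.
        x = x - learning_rate * grad(x)
    return x


# Example: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With these assumed settings the iterate approaches x = 3, the point where the gradient is 0.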
(10)反向传播法
基于神经网络按层深入,层层嵌套的特点,对神经网络的目标函数计算梯度的时候,需要用反向传播的方式由深到浅倒着计算以及更新参数。因此,反向传播法是梯度下降法在神经网络上的具体实现方式。神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的模型中参数的大小,使得模型的误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的模型中参数,从而使误差损失收敛。反向传播法是以误差损失为主导的反向传播运动,旨在得到最优的模型参数,例如权重矩阵。
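(11) Hyperparameters
A machine learning model generally has two kinds of parameters. One kind is learned and estimated from data and is called model parameters, that is, the parameters of the model itself, such as the weighting coefficient (slope) and bias term (intercept) of a linear regression line. The other kind consists of tuning parameters of the machine learning algorithm, which must be set manually and are called hyperparameters. For example, hyperparameters can include the batch size, the learning rate α in gradient descent, and the number of iterations (epochs).
(12) Batch size
The batch size is the number of data samples passed to the program at one time to train the model. For example, suppose the training data set has 1000 pieces of data. If batch_size=100, the program first trains the model with the first 100 pieces of data in the training data set, that is, data 1-100. After training on data 1-100 is complete (that is, after the model weights have been updated), data 101-200 is used to train the model, and so on, until the tenth round has used up all 1000 pieces of data in the training set. Data 1-100 belong to one batch, data 101-200 belong to another batch, and so on, so the training data set contains 10 batches in total.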
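The batch example above (1000 pieces of data, batch_size=100) can be reproduced with a short sketch. The helper name `split_into_batches` is hypothetical, and the samples are represented simply by their index numbers 1 to 1000.

```python
def split_into_batches(samples, batch_size):
    """Split samples into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    samples = list(samples)
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


# Samples are numbered 1..1000 to mirror the example in the text.
batches = split_into_batches(range(1, 1001), batch_size=100)
```

Here `batches` holds 10 batches; the first covers data 1-100 and the second starts at data 101.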
在模型训练中,设置批次大小的意义是为了同时基于多个样本数据对模型进行训练,一方面能够有效地利用GPU、NPU等硬件装置的并行性,同时并行地对多个样本数据进行处理,以提高模型训练效率;另一方面则能够基于多个样本数据来确定模型的平均梯度值,提高模型的训练精度,避免基于单个样本数据来确定模型梯度值时容易造成训练过程中的梯度值波动。
目前,联邦学习技术的提出,是为了应对数据孤岛的问题。由于隐私保护和商业保护,在大部分商业场景下,不同企业或平台之间所拥有的数据是无法直接交换的。
例如,在广告场景下,广告主为投放广告的一方,广告平台为曝光广告的一方;广告平台上拥有用户特征、点击曝光行为等数据;广告主则拥有用户下载、激活、购买等转化行为数据。基于用户隐私保护以及商业保护等因素,广告主与广告平台之间的数据难以直接交换,导致有效数据分散在两方,造成数据孤岛现象。基于联邦学习,广告平台和广告主在无需共享数据资源,数据不出本地的情况下进行联合训练,实现模型的联合构建。
然而,由于参与联邦学习的多个设备分布于不同的地方,多个设备在执行联邦学习的过程中,往往需要通过网络来交换大量的数据。受限于网络的通信能力,在多个设备之间待交换的数据较多的情况下,设备间交换数据的通信时长往往会大于设备训练模型的时长,进而导致联邦学习的整体时延较长。
有鉴于此,本申请实施例提供了一种模型训练方法,将训练过程中同一批次的多个数据划分为多组数据,由训练装置通过模型依次对多组数据进行处理,以执行模型的训练;并且,在模型训练过程中,训练装置优先执行生成待传输数据的运算,进而实现将模型训练时间隐藏于训练装置之间的通信时间中,且尽可能地避免训练装置处于通信空闲状态,最终缩短整个迭代过程中的空闲等待时长,提高整体的训练效率。
请参阅图1,图1为本申请实施例提供的一种模型训练方法的应用场景示意图。如图1所示,训练装置10和训练装置20为参与模型分布式训练的两个训练装置。并且,训练装置10可以从本地的数据库1中获取用于训练模型的训练数据;训练装置20则可以是从本地的数据库2中获取用于训练模型的训练数据。并且,训练装置10与训练装置20之间无法直接交换数据库1和数据库2中数据。
此外,训练装置10上部署有模型1,训练装置20上部署有模型2。训练装置10基于从数据库1中获取到的训练数据对模型1进行训练,训练装置20基于从数据库2中获取到的训练数据对模型2进行训练。并且,在训练装置10和训练装置20进行模型训练的过程中,训练装置10和训练装置20会彼此交换模型的中间参数(例如通过本地模型处理训练数据所得到的数据),以便于双方能够相互配合完成本地模型的训练。
基于本申请实施例提供的模型训练方法,对于参与分布式训练的任意一个训练装置(例如图1所示的训练装置10),训练装置可以是将一个批次的多个训练数据划分为多组数据,并通过模型以组为单位分批对多组数据进行处理,以逐步生成多部分待传输至其他训练装置的数据。然后,训练装置在分批处理多组数据的同时,向其他训练装置传输通过处理数据所生成的一部分或多部分数据,以实现数据处理和数据传输同步执行。在获取到来自于其他训练装置的数据后,训练装置继续对所获取到的数据进行处理,以完成模型的训练。并且,在训练装置对模型进行训练的过程中,训练装置优先执行生成待传输数据的运算,以避免训练装置处于通信空闲状态。
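It should be noted that, in the application scenario shown in FIG. 1, the embodiments of this application use two training devices participating in distributed training as an example. In practice, two or more training devices may participate in distributed training, and the number of participating training devices is not limited here.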
具体地,本申请实施例所提供的模型训练方法可以应用于电子设备上,或者是电子设备上的芯片系统,例如电子设备上的图形处理器(graphics processing unit,GPU)或网络处理器(Neural-network Processing Unit,NPU)。示例性地,该电子设备例如可以是服务器、智能手机(mobile phone)、个人电脑(personal computer,PC)、笔记本电脑、平板电脑、智慧电视、移动互联网设备(mobile internet device,MID)、可穿戴设备,虚拟现实(virtual reality,VR)设备、增强现实(augmented reality,AR)设备、工业控制(industrial control)中的无线电子设备、无人驾驶(self driving)中的无线电子设备、远程手术(remote medical surgery)中的无线电子设备、智能电网(smart grid)中的无线电子设备、运输安全(transportation safety)中的无线电子设备、智慧城市(smart city)中的无线电子设备、智慧家庭(smart home)中的无线电子设备等。为了便于叙述,以下将以本申请实施例提供的方法应用于服务器上为例,对本申请实施例所提供的方法进行介绍。
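Refer to FIG. 2, which is a schematic structural diagram of an electronic device 101 according to an embodiment of this application. As shown in FIG. 2, the electronic device 101 includes a processor 103 coupled to a system bus 105. The processor 103 may be one or more processors, each of which may include one or more processor cores. A display adapter (video adapter) 107 can drive a display 109, and the display 109 is coupled to the system bus 105. The system bus 105 is coupled to an input/output (I/O) bus through a bus bridge 111. An I/O interface 115 is coupled to the I/O bus. The I/O interface 115 communicates with various I/O devices, such as an input device 117 (for example, a touchscreen), an external memory 121 (for example, a hard disk, floppy disk, optical disc, or USB flash drive), a multimedia interface, a transceiver 123 (which can send and/or receive radio communication signals), a camera 155 (which can capture still and moving digital video images), and an external USB port 125. Optionally, the interface connected to the I/O interface 115 may be a USB interface.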
其中,处理器103可以是任何传统处理器,包括精简指令集计算(reduced instruction set Computing,RISC)处理器、复杂指令集计算(complex instruction set computing,CISC)处理器或上述的组合。可选地,处理器可以是诸如ASIC的专用装置。
电子设备101可以通过网络接口129和软件部署服务器149通信。示例性的,网络接口129是硬件网络接口,比如,网卡。网络127可以是外部网络,比如因特网,也可以是内部网络,比如以太网或者虚拟私人网络(virtual private network,VPN)。可选地,网络127还可以是无线网络,比如WiFi网络,蜂窝网络等。
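A hard disk drive interface 131 is coupled to the system bus 105. The hard disk drive interface 131 is connected to a hard disk drive 133. An internal memory 135 is coupled to the system bus 105. Data running in the internal memory 135 may include the operating system (OS) 137, application programs 143, and a schedule of the electronic device 101.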
操作系统包括Shell 139和内核(kernel)141。Shell 139是介于使用者和操作系统的内核间的一个接口。shell是操作系统最外面的一层。shell管理使用者与操作系统之间的交互:等待使用者的输入,向操作系统解释使用者的输入,并且处理各种各样的操作系统的输出结果。
内核141由操作系统中用于管理存储器、文件、外设和系统资源的那些部分组成。内核141直接与硬件交互,操作系统内核通常运行进程,并提供进程间的通信,提供CPU时间片管理、中断、内存管理和IO管理等等。
示例性地,在电子设备101为智能手机的情况下,应用程序143包括即时通讯相关的程序。在一个实施例中,在需要执行应用程序143时,电子设备101可以从软件部署服务器149下载应用程序143。
请参阅图3,图3为本申请实施例提供的一种模型训练方法的流程示意图。如图3所示,本申请实施例提供的模型训练方法应用于第一训练装置,且该方法包括以下的步骤301-303。
步骤301,获取训练数据集,训练数据集包括多个数据。
本实施例中,第一训练装置是参与分布式训练的一个训练装置,第一训练装置例如为上述的电子设备,或电子设备中的芯片。
在对模型执行训练前,第一训练装置可以获取到本地的训练数据集,该训练数据集用于对模型执行训练。此外,基于模型预先所设定的超参数,可以获取到批次大小(batch size),其中批次大小指示了用于同时对模型进行训练的数据样本的数据量大小,即模型在一次迭代训练过程中需要使用到的数据样本的数据量大小。
因此,基于模型预先所设定的批次大小,能够将训练数据集中的所有数据分成多个批次,每个批次的数据量相同,且每个批次的数据量是由批次大小所确定。例如,假设训练数据集包括10000个图像,批次大小为100,则训练数据集可以分为100个批次,每个批次包括100个图像。例如,第一个批次包括第1-100个图像,第二个批次包括第101-200个图像,以此类推。
可选的,第一训练装置所获取到的多个数据为一个批次的数据,即多个数据的数量与模型的批次大小相同。
步骤302,将多个数据划分为多组数据,多组数据中的每组数据包括至少一个数据,且多组数据的组数与第一训练装置的通信能力相关。
本步骤中,针对于多个数据,第一训练装置可以将该多个数据继续进行划分,得到多组数据。例如,假设该多个数据为一个批次的数据,且该多个数据共包括100个数据,第一训练装置可以将这100个数据划分为5组数据,每组数据包括20个数据。
其中,针对多组数据,每组数据的数据量是与第一训练装置的通信能力相关。例如,每组数据的数据量与第一训练装置的通信能力具有正相关的关系。即,第一训练装置的通信能力越强,每组数据的数据量则越大;第一训练装置的通信能力越差,每组数据的数据量则越小。
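It can be understood that the weaker the communication capability of the first training device, the less data it can transmit to other training devices per unit time, and the longer it takes to transmit a given amount of data to other training devices. Therefore, the weaker the communication capability of the first training device, the more groups the data is divided into, so that training time is hidden within communication time as far as possible and the overall training latency is reduced.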
步骤303,基于第一训练装置中所部署的模型,以组为单位分批对多组数据执行处理,以训练模型并分批向第二训练装置传输处理多组数据所得到的多部分数据,第二训练装置与第一训练装置共同参与模型的训练。
在对第一训练装置中所部署的模型进行训练的过程中,以组为单位分批将多组数据的每组数据输入至模型中,从而实现分批对多组数据执行处理。并且,在基于模型依次处理多组数据的过程中,每处理一组数据都会产生一部分数据,且产生的这一部分数据需要向第二训练装置传输,以便于第二训练装置配合完成模型的训练。也就是说,通过模型对多组数据执行处理会得到多部分数据,每组数据对应于唯一的一部分数据。并且,由于本步骤中是分批对多组数据执行处理,因此本步骤中多部分数据中的每一部分数据也是分批产生并且分批向第二训练装置传输的。
在模型的训练过程中,第一训练装置基于模型分批对多组数据执行处理后,继续将处理多组数据所得到的多部分数据分批向第二训练装置传输,以便于第二训练装置配合第一训练装置完成第一训练装置中模型的训练。其中,第二训练装置与第一训练装置共同参与模型的训练。第二训练装置例如包括一个或多个训练装置。第一训练装置和第二训练装置共同参与的训练例如为上述的联邦学习,比如横向联邦学习、纵向联邦学习或联邦迁移学习,在此不做具体限定。
例如,第一训练装置在基于模型对多组数据执行处理,并向第二训练装置传输处理多组数据所得到的多部分数据之后,第二训练装置则基于第二训练装置中的模型对获取到的多部分数据进行处理,并将处理结果返回给第一训练装置。最终,第一训练装置基于第二训练装置所返回的处理结果对第一训练装置中的模型进行权重更新,实现模型的一次迭代训练。
总的来说,通过将多个数据分成多组数据,并通过模型对多组数据分批处理,能够将处理多组数据所得到的多部分数据分批传输给第二训练装置,使得第二训练装置能够尽早地基于所获取到的数据进行计算,从而将第二训练装置的训练时间隐藏于通信时间内,降低模型的整体训练时长,提高模型的训练效率。
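Optionally, the number of pieces of data obtained by the first training device is related to the batch size of the model; for example, it equals the batch size. During training, a target gradient is related to the multiple gradients obtained from the groups of data, and the first training device uses the target gradient to update the model. For example, when one gradient is obtained from each group of data, the target gradient is obtained by averaging these gradients.
It can be understood that, in ordinary training, a training device usually trains the model on all the data in a batch at once, obtains one gradient, and updates the model based on that gradient. In this solution, the data in a batch is divided into groups and the model is trained on the groups one after another, so multiple gradients related to the groups are obtained. By computing the target gradient from these gradients and then updating the model based on the target gradient, this solution ensures that the model is updated based on a full batch of data, so the accuracy of model training is not affected.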
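The averaging example can be checked numerically with a small sketch. It assumes a one-parameter linear model with a mean-squared-error loss and groups of equal size, none of which is specified by the application; the names `mse_gradient` and `target_gradient` are hypothetical. Under these assumptions, averaging the per-group gradients gives the same value as the gradient over the whole batch.

```python
def mse_gradient(w, samples):
    """Gradient of mean((w * x - y)^2) with respect to w over (x, y) samples."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)


def target_gradient(w, groups):
    """Average the per-group gradients to obtain the target gradient."""
    group_gradients = [mse_gradient(w, group) for group in groups]
    return sum(group_gradients) / len(group_gradients)


# One batch of four (x, y) samples, split into two equal groups.
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
groups = [batch[:2], batch[2:]]
```

If the groups had different sizes, a plain average would no longer equal the full-batch gradient, and a size-weighted average would be needed; that variation is outside what the text describes.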
示例性地,请参阅图4A,图4A为本申请实施例提供的一种多个训练装置执行模型训练的示意图。如图4A所示,训练装置1中部署有模型1,训练装置2中部署有模型2。其中,训练装置1例如为本申请实施例的第一训练装置,训练装置2例如为本申请实施例的第二训练装置。
对于训练装置1中的模型1而言,在模型1的训练过程中,先往模型1中输入训练数据,由训练装置1基于模型1执行正向计算,即基于模型1对输入的训练数据进行处理,得到需要传输至训练装置2的数据。然后,训练装置1向训练装置2传输对模型1执行正向计算所得到的数据。训练装置2在接收到训练装置1所传输的数据后,基于本地的训练数据以及训练装置1所传输的数据,对模型2进行正向计算,并基于模型2正向计算的结果对模型2进行反向梯度计算,得到模型2对应的梯度数据。然后,训练装置2将进行反向梯度计算所得到的梯度数据传输给训练装置1,由训练装置1基于接收到的数据继续对模型1进行反向梯度计算。最后,训练装置1基于反向梯度计算所得到的梯度数据对模型1中的权重进行更新,进而实现模型1的一次迭代训练。
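The exchange between training device 1 and training device 2 in FIG. 4A can be illustrated with a scalar toy example. Everything concrete here is an assumption for illustration: model 1 is taken to be h = w1 * x, model 2 is taken to be y_hat = w2 * h with a squared-error loss, the label y is assumed to be held by training device 2, and the function names are hypothetical.

```python
def device1_forward(w1, x):
    """Training device 1: forward pass of model 1; the result is sent to device 2."""
    return w1 * x


def device2_forward_backward(w2, h, y):
    """Training device 2: forward and backward pass of model 2.

    Returns the gradient for w2 (kept locally) and the gradient with
    respect to h (sent back to training device 1).
    """
    y_hat = w2 * h
    d_loss_d_yhat = 2.0 * (y_hat - y)  # derivative of (y_hat - y)^2
    grad_w2 = d_loss_d_yhat * h
    grad_h = d_loss_d_yhat * w2
    return grad_w2, grad_h


def device1_backward(grad_h, x):
    """Training device 1: continue backpropagation through model 1."""
    return grad_h * x


x, y, w1, w2 = 2.0, 1.0, 0.5, 3.0
h = device1_forward(w1, x)                            # transmitted to device 2
grad_w2, grad_h = device2_forward_backward(w2, h, y)  # grad_h transmitted back
grad_w1 = device1_backward(grad_h, x)
```

Only `h` and `grad_h` cross the boundary between the two devices in this sketch; the raw input `x` and the label `y` stay local.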
请参阅图4B,图4B为本申请实施例提供的一种在对训练数据进行分组以及不分组情况下的训练时长对比示意图。其中,图4B所示的两种情况下的训练时长对比均是基于图4A所示的模型训练流程。如图4B所示,在数据不分组的情况下,训练装置1在训练模型1时不对输入的训练数据进行分组,即同时基于一个批次的训练数据对模型1进行训练。具体地,训练装置1先基于模型1对同一批次的多个数据进行正向计算(即图4B中的计算A),得到数据A;然后,训练装置1再向训练装置2传输数据A。在训练装置2接收到训练装置1所传输的数据A后,训练装置2基于数据A对模型2执行正向计算以及反向梯度计算(即图4B中的计算B),得到数据B;然后,训练装置2再向训练装置1传输数据B。在训练装置1接收到训练装置2所传输的数据B后,训练装置1对模型1执行反向梯度计算(即图4B中的计算C),最终基于所得到的梯度更新模型中的权重,完成模型的一次迭代训练。
由图4B中可以看出,在输入数据不分组的情况下,训练装置1和训练装置2中所有的计算以及数据传输均是串行执行的,且数据传输时间较长,训练装置1和训练装置2容易处于较长时间的空闲等待状态,导致模型的整体训练时延较长。
在数据分组的情况下,训练装置1在训练模型1时对输入的训练数据进行分组,即将一个批次的训练数据分为两组来对模型1进行训练。具体地,训练装置1先将输入的一批次数据分为两组,并基于模型1对第一组的多个数据进行正向计算(即图4B中的计算A1),得到数据A1。然后,训练装置1向训练装置2传输数据A1;在训练装置1向训练装置2传输数据A1的同时,训练装置1继续基于模型1对第二组的多个数据进行正向计算(即图4B中的计算A2),得到数据A2。其中,由于数据传输时长大于模型1的计算时长,因此训练装置1执行计算A2所得到的数据A2是在等待数据A1传输完毕后再继续传输。
在训练装置2接收到训练装置1所传输的数据A1后,训练装置2基于数据A1对模型2执行正向计算以及反向梯度计算(即图4B中的计算B1),得到数据B1;然后,训练装置2再向训练装置1传输数据B1。此外,在训练装置2执行完计算B1之后,当训练装置2接收到训练装置1所传输的数据A2时,训练装置2继续基于数据A2对模型2执行正向计算以及反向梯度计算(即图4B中的计算B2),得到数据B2,并向训练装置1传输数据B2。
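After training device 1 receives data B1 from training device 2, training device 1 performs the backward gradient computation on model 1 based on data B1 (computation C1 in FIG. 4B); after training device 1 receives data B2 from training device 2, training device 1 performs the backward gradient computation on model 1 based on data B2 (computation C2 in FIG. 4B). Finally, training device 1 updates the weights of the model based on the gradients obtained from computations C1 and C2, completing one training iteration.
As FIG. 4B shows, when the input data of a batch is divided into groups, some of the computation and data transmission in training device 1 and training device 2 can run in parallel. For example, computation A2 in training device 1 can run in parallel with the transmission of data A1, and the transmission of data A2 from training device 1 can run in parallel with computation B1 in training device 2. One party's training time and communication time thus overlap with the other party's training time, that is, the other party's training time is hidden within the communication time. This shortens the idle waiting time over the whole training iteration, saves a great deal of training time, and effectively improves the training efficiency of the model.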
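The comparison in FIG. 4B can be approximated with a simple timeline model. This is a rough sketch under several assumptions that are not stated in the application: every duration scales linearly with group size, the groups are equal in size, the two transmission directions do not block each other, and training device 1 runs all forward computations before any backward computation. The function name `iteration_time` and the sample durations are hypothetical.

```python
def iteration_time(groups, fwd, send, remote, back_send, bwd):
    """Estimate the duration of one training iteration.

    fwd, send, remote, back_send and bwd are full-batch durations for
    computation A, transmission of A, computation B, transmission of B,
    and computation C; each group takes 1/groups of that duration.
    """
    if groups < 1:
        raise ValueError("groups must be at least 1")
    a, s, r, t, c = (d / groups for d in (fwd, send, remote, back_send, bwd))
    link_out_free = device2_free = link_back_free = 0.0
    arrivals = []
    for g in range(groups):
        fwd_done = (g + 1) * a                    # forward passes run back to back
        sent = max(fwd_done, link_out_free) + s   # wait for the outgoing link
        link_out_free = sent
        remote_done = max(sent, device2_free) + r
        device2_free = remote_done
        returned = max(remote_done, link_back_free) + t
        link_back_free = returned
        arrivals.append(returned)
    device1_free = groups * a                     # all forward passes finished
    for arrival in arrivals:
        device1_free = max(device1_free, arrival) + c
    return device1_free


serial = iteration_time(1, fwd=2, send=6, remote=2, back_send=6, bwd=2)
grouped = iteration_time(2, fwd=2, send=6, remote=2, back_send=6, bwd=2)
```

With these sample durations the ungrouped iteration takes 18 time units and the two-group iteration takes 12, which mirrors the qualitative saving shown in FIG. 4B.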
本实施例中,将训练过程中多个数据划分为多组数据,由训练装置通过模型分批对多组数据进行处理并将处理多组数据所得到的多部分数据分批向其他训练装置传输,以执行模型的训练。这样,训练装置在向其他训练装置传输处理前面组次数据所得到的数据时,仍能够继续对后面组次的数据进行处理,进而实现将模型训练时间隐藏于训练装置之间的通信时间中,最终缩短整个迭代过程中的空闲等待时长,提高整体的训练效率。
例如,训练装置在处理完毕第一组数据,并得到第一部分数据之后,训练装置在向其他训练装置传输第一部分数据时,可以同时处理第二组数据,即实现向其他训练装置传输处理前面组次数据所得到的数据时,仍能够继续对后面组次的数据进行处理。
可选的,在多个数据的数量与模型的批次大小相关的情况下(例如多个数据的数量与模型的批次大小相同),对于多组数据而言,多组数据的组数与待传输数据量具有正相关的关系,且多组数据的组数与第一训练装置的通信能力以及训练时长具有负相关的关系。其中,待传输数据量为处理多组数据后所生成的待传输数据的数据量,训练时长为第一训练装置基于多组数据训练模型的时长。
也就是说,处理多组数据后所生成的待传输数据的数据量越大,多组数据的组数则越多,以减少处理每组数据所生成的待传输数据,避免处理每组数据后所生成的待传输数据的数据量过多。并且,第一训练装置的通信能力越强,第一训练装置单位时间内能够向第二训练装置传输的数据则越多,多组数据的组数则可以划分得越少。第一训练装置基于多组数据训练模型的时长越长,则代表第一训练装置处理多组数据以生成待传输数据的速度越慢,多组数据的组数则可以划分得越少。
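For example, the number of groups obtained when the first training device divides one batch of data can be expressed by the following formula:
Batch_size*Seq_length*hidden_size*32/BandWith/train_time
where Batch_size is the batch size, Seq_length is the size of one dimension of the computation matrix, hidden_size is the feature dimension of the model's hidden layer, BandWith is the bandwidth, and train_time is the training duration of the model. Batch_size*Seq_length*hidden_size*32 is the amount of to-be-transmitted data produced when the first training device processes the groups through the model, BandWith is the amount of data the first training device can transmit to the second training device per unit time, and train_time is the time the first training device needs to process the groups. Therefore, Batch_size*Seq_length*hidden_size*32/BandWith is the time the first training device needs to transmit all the data to the second training device, and Batch_size*Seq_length*hidden_size*32/BandWith/train_time (the number of groups) is the ratio of that transmission time to the time the first training device needs to process the groups.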
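The formula can be evaluated with a short helper. Several details below are assumptions added for the sketch rather than statements from the application: the constant 32 is treated as the number of bits per transmitted value, the bandwidth is therefore expressed in bits per second, the ratio is rounded up to a whole number, and the result is clamped to at least one group. The name `num_groups` and the sample values are hypothetical.

```python
import math


def num_groups(batch_size, seq_length, hidden_size, bandwidth, train_time,
               bits_per_value=32):
    """Ratio of total transmission time to training time, as a group count.

    bits_per_value=32 mirrors the constant in the formula; reading it as
    bits per value is an assumption. Rounding up and the minimum of one
    group are also assumptions made for this sketch.
    """
    data_amount = batch_size * seq_length * hidden_size * bits_per_value
    transfer_time = data_amount / bandwidth
    return max(1, math.ceil(transfer_time / train_time))


# Hypothetical numbers: 1e9 bits per second of bandwidth, 0.1 s of training.
groups = num_groups(batch_size=100, seq_length=128, hidden_size=1024,
                    bandwidth=1e9, train_time=0.1)
```

With these sample values the transmission time is about 0.42 s, roughly 4.2 times the training time, so the sketch returns 5 groups.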
可选的,对于第一训练装置而言,第一训练装置训练模型的过程中所执行的运算包括第一类运算和第二类运算。其中,第一类运算用于基于多组数据生成向第二训练装置传输的多部分数据,即第一类运算是依赖于多组数据来生成待传输数据的运算。第二类运算用于生成仅由第一训练装置处理的数据,即第二类运算的执行与否并不影响待传输数据的生成。例如,第一类运算包括基于模型对多组数据进行处理的运算(即图4A的对模型1所进行的正向计算),第二类运算包括基于从第二训练装置获取的数据对模型进行反向梯度计算的运算(即图4A的对模型1所进行的反向梯度计算)。
此外,第一类运算的执行优先级高于第二类运算的执行优先级。也就是说,在训练过程中,当同时存在均能够执行的第一类运算和第二类运算时,第一训练装置优先执行第一类运算。在所有的第一类运算执行完毕后,第一训练装置再执行第二类运算。
示例性地,在上述图4A所示的模型1中,第一类运算可以是指第一训练装置基于多组数据对模型1所执行的正向计算,其中对模型1所执行的正向计算会生成需要传输至第二训练装置的数据。第二类运算则可以是指第一训练装置基于第二训练装置返回的数据对模型1所执行的反向梯度计算,其中对模型1所执行的反向梯度计算会生成模型1的梯度数据,该模型1的梯度数据是仅由第一训练装置进一步处理以更新模型1权重的数据。
也就是说,在本实施例中,对于会影响待传输数据生成的第一类运算以及不会影响待传输数据生成的第二类运算,训练装置优先执行第一类运算,以便于能够持续生成待传输数据,尽可能地避免训练装置处于通信空闲状态,以保证通信和训练并行的时间尽可能地长,从而缩短整个迭代过程中的空闲等待时长,提高整体的训练效率。
进一步地,在一些可能的实施例中,第一类运算可以包括第一子类运算和第二子类运算,第一子类运算的运算结果用于得到第二子类运算的输入,第二子类运算的运算结果用于向第二训练装置传输。也就是说,第一训练装置执行第二子类运算所得到的运算结果即为需要向第二训练装置传输的数据,而第一训练装置执行第一子类运算所得到的运算结果是用于作为第二子类运算的输入,即第二子类运算的执行依赖于第一子类运算。在实际训练过程中,只有先执行第一子类运算后,才能够得到作为第二子类运算的输入的数据,进而才能执行第二子类运算。
需要说明的是,尽管第一子类运算并不直接生成待传输的数据,但是直接生成待传输的数据的第二子类运算的执行却是依赖于第一子运算的执行,因此第一子类运算也属于是第一类运算,即第一子类运算也可以理解为用于生成待传输至第二训练装置的数据。因此,对于第一子类运算而言,第一子类运算的执行优先级同样高于第二类运算的执行优先级。
可选的,对于同属于第一类运算的第一子类运算和第二子类运算而言,第二子类运算的执行优先级高于第一子类运算。
可以理解的是,由于第二子类运算的执行依赖于第一子类运算,因此在训练过程中,第一训练装置需要先执行第一子类运算,并得到运算结果后,才能够基于所得到的运算结果继续执行第二子类运算。即,第二子类运算的运算条件满足后,才能够执行第二子类运算。但是,在第二子类运算的运算条件满足后,第一训练装置则优先执行第二子类运算,以便于尽快生成待传输至第二训练装置的数据。
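The priority order among the operation types (second-subclass first, then first-subclass, then second-class) can be sketched as a selection over the operations that are currently ready to run. The labels and the helper name `pick_next` are hypothetical; the application describes the ordering but not a concrete data structure.

```python
# Lower number means higher priority; the labels are hypothetical names.
PRIORITY = {"second_subclass": 0, "first_subclass": 1, "second_class": 2}


def pick_next(ready_operations):
    """Return the ready (kind, name) operation with the highest priority.

    Only operations whose inputs are already available should be passed in,
    so a second-subclass operation appears here only after the first-subclass
    operation it depends on has finished.
    """
    if not ready_operations:
        return None
    return min(ready_operations, key=lambda op: PRIORITY[op[0]])
```

For example, if a backward pass of model A1 (second-class), a forward pass of model A2 (first-subclass), and a backward pass of model A2 (second-subclass) are all ready, the backward pass of model A2 is chosen.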
示例性地,第一训练装置所部署的模型例如包括第一子模型和第二子模型。第一类运算包括基于第一子模型对多组数据进行处理的运算,以及基于从第二训练装置获取的数据对第二子模型进行处理的运算。例如,第一训练装置在基于第一子模型对多组数据进行处理后,生成待传输至第二训练装置的数据;并且,第一训练装置基于从第二训练装置获取的数据对第二子模型进行处理后,继续生成待传输至第二训练装置的数据。即,第一类运算中包括两种运算(即基于第一子模型对多组数据进行处理的运算以及基于从第二训练装置获取的数据对第二子模型进行处理的运算),这两种运算分别能够生成不同类型的数据,且所生成的这些数据都是需要向第二训练装置传输的。
可选的,第二子模型的输入还可以包括基于第一子模型对多组数据进行处理的运算结果。即,第二子模型相关的运算实际上是基于第一子模型的运算结果以及从第二训练装置获取的数据对第二子模型进行处理的运算。那么,第一类运算包括基于第一子模型对多组数据进行处理的运算,以及基于第一子模型的运算结果和从第二训练装置获取的数据对第二子模型进行处理的运算。
第二类运算则可以是包括对第一子模型进行反向梯度计算的运算。
为了便于理解,以下将结合具体例子详细介绍第一训练装置所部署的模型以及基于模型所执行的第一类运算和第二类运算。
(1)第一训练装置所部署的模型不包括子模型。
示例性地,请参阅上述的图4A,第一训练装置例如为图4A中的训练装置1,第一训练装置所部署的模型例如为图4A中的模型1;第二训练装置例如为图4A中的训练装置2,第二训练装置所部署的模型例如为图4A中的模型2。
在这种情况下,训练过程中的第一类运算例如可以为图4A中对模型1所执行的正向计算,第二类运算例如可以为图4A中对模型1所执行的反向梯度计算。
(2)第一训练装置所部署的模型包括两个子模型,且两个子模型均会产生需要传输至第二训练装置的数据。
示例性地,请参阅图5,图5为本申请实施例提供的一种训练过程的示意图。如图5所示,训练装置1中部署有模型A1和模型A2,训练装置2中部署有模型B。
在训练过程中,训练装置1基于本地的训练数据集对模型A1执行正向计算,并将得到的计算结果1传输至训练装置2。训练装置2基于训练装置1所发送的计算结果1对模型B执行正向计算,得到计算结果2,并向训练装置1返回所得到的计算结果2。训练装置1基于训练装置2所返回的计算结果2对模型A2执行正向计算,并基于最终所得到的计算结果对模型A2执行反向梯度计算,得到梯度数据1。
其次,训练装置1向训练装置2传输梯度数据1,由训练装置2基于梯度数据1对模型B执行反向梯度计算。在对模型B完成反向梯度计算后,训练装置2向训练装置1传输所得到的梯度数据2,由训练装置1基于梯度数据2对模型A1执行反向梯度计算。
需要说明的是,以上的训练流程是针对于训练装置1初始所输入的一组数据而言的。由于训练装置1将同一批次的多个数据分成了多组数据,因此在实际训练过程中,针对一个批次的数据,训练装置1和训练装置2需要执行多次上述流程中的计算。
具体地,上述的第一训练装置例如为图5中的训练装置1,第二训练装置例如为图5中的训练装置2。上述的第一类运算例如包括训练装置1对模型A1所执行的正向计算、训练装置1对模型A2所执行的正向计算以及反向梯度计算。进一步地,在第一类运算中,第一子类运算例如为训练装置1对模型A2所执行的正向计算,即该正向计算并不直接产生需要向训练装置2传输的数据但会产生作为反向梯度计算的输入的数据;第二子类运算例如为训练装置1对模型A1所执行的正向计算、训练装置1对模型A2所执行的反向梯度计算。上述的第二类运算例如为训练装置1对模型A1所执行的反向梯度计算。
(3)第一训练装置所部署的模型包括两个子模型,且仅有一个子模型会产生需要传输至第二训练装置的数据。
示例性地,请参阅图6,图6为本申请实施例提供的另一种训练过程的示意图。如图6所示,训练装置3中部署有模型C1和C2,训练装置4中部署有D。
在训练过程中,训练装置3基于本地的训练数据集对模型C1执行正向计算,得到计算结果3。训练装置4基于本地的训练数据集对模型D执行正向计算,得到计算结果4,并向训练装置3传输所得到的计算结果4。
在获取到计算结果3和计算结果4之后,训练装置3则基于计算结果3和计算结果4对模型C2执行正向计算,并基于正向计算所得到的结果对模型C2执行反向梯度计算,得到梯度数据。然后,训练装置3向训练装置4传输梯度数据,以使得训练装置4基于接收到的梯度数据对模型D执行反向梯度计算。此外,训练装置3还基于通过对模型C2执行反向梯度计算所得到的梯度数据继续对模型C1执行反向梯度计算,从而得到模型C1的梯度数据。
类似地,以上的训练流程是针对于训练装置3和训练装置4初始所输入的一组数据而言的。在实际训练过程中,训练装置3和训练装置4均会将同一批次的多个数据分成多组数据,因此针对一个批次的数据,训练装置3和训练装置4需要执行多次上述流程中的计算。
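Specifically, the first training device is, for example, training device 3 in FIG. 6, and the second training device is, for example, training device 4 in FIG. 6. The first-class operations include, for example, the forward computation that training device 3 performs on model C1, and the forward computation and backward gradient computation that training device 3 performs on model C2. Within the first-class operations, the first-subclass operation is, for example, the forward computation that training device 3 performs on model C2; this forward computation does not directly produce data to be transmitted to training device 4, but it produces the input to the backward gradient computation. The second-subclass operations are, for example, the forward computation that training device 3 performs on model C1 and the backward gradient computation that training device 3 performs on model C2. The second-class operation is, for example, the backward gradient computation that training device 3 performs on model C1.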
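Alternatively, the first training device is, for example, training device 4 in FIG. 6, and the second training device is, for example, training device 3 in FIG. 6. The first-class operation is, for example, the forward computation that training device 4 performs on model D; because this forward computation directly produces the data transmitted to training device 3, it is a second-subclass operation. The second-class operation is, for example, the backward gradient computation that training device 4 performs on model D.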
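The three examples above, in which the first training device deploys a single model or two sub-models, describe in detail the first-class and second-class operations that the first training device performs during training. In practice, the model deployed on the first training device can also be of another type, and the first-class and second-class operations can be determined according to the deployed model; this embodiment does not limit the specific form of the model deployed on the first training device.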
在上述的实施例中,介绍了第一类运算的执行优先级高于第二类运算的执行优先级。因此,在第一类运算的执行满足的情况下,第一训练装置总会优先执行第一类运算。
然而,在一些情况下,如果第一类运算的执行速度远高于第一训练装置的通信速度,那么将会使得第一训练装置执行第一类运算所产生的大量待传输数据缓存在内存中以等待传输,且随着第一类运算的不断执行,内存中所缓存的数据会越来越多,进而造成内存压力过大。基于此,本实施例中提出在满足一定的条件下,从执行第一类运算转至执行第二类运算,以减轻内存压力。
示例性地,在第一训练装置训练模型的过程中,第一训练装置将执行第一类运算所产生的数据缓存至第一队列中。其中,第一队列是内存中用于存储数据的一种数据结构,第一队列具体用于缓存待传输至第二训练装置的数据。基于第一队列,第一队列中的数据能够被有序地传输至第二训练装置,即先进入到第一队列中的数据则先被传输至第二训练装置。
在第一队列中数据的数据量大于或等于第一阈值的情况下,第一训练装置停止执行第一类运算,并执行第二类运算。即,在第一队列中的数据较多的情况下,第一训练装置不再优先执行第一类运算,而是转至执行第二类运算。其中,第一阈值的大小可以是根据第一训练装置中的内存大小来确定,在此并不做具体限定。
可以理解的是,在第一队列中的数据较多的情况下,代表第一训练装置向第二训练装置发送数据的速度远跟不上第一训练装置执行第一类运算而产生数据的速度,因此第一训练装置继续优先执行第一类运算也不会提高整体的训练效率。在这种情况下,为了避免内存开销过大,第一训练装置则可以是选择转至执行第二类运算,以免产生过多的待传输数据积压在内存中。
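Further, when the amount of data in the first queue is less than the first threshold, or when the data that supports execution of the second-type operations has all been processed, the first training device stops executing second-type operations and resumes executing first-type operations.
In other words, once the data in the first queue has been consumed down to a certain amount, the first training device may resume prioritizing first-type operations so that the queue does not run empty and leave the communication link idle. This keeps producing data to be transmitted to the second training device and keeps communication continuous.
The data that supports execution of the second-type operations is the input data of the second-type operations. For example, in FIG. 4A, for training device 1, the second-type operation may be the backward gradient computation on model 1 that training device 1 performs based on the data returned by training device 2. The data that supports execution of the second-type operations is then the data returned by training device 2.
Executing second-type operations is in effect the process of consuming their input data. Once the first training device has processed all of that input data, it cannot continue executing second-type operations, so it switches back to first-type operations.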
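The switching rule between first-type and second-type operations can be illustrated with a minimal Python sketch. It is not the implementation described in this application: the names `choose_operation` and `simulate`, the tick-based link model, and the assumption that the second training device replies immediately are all hypothetical and chosen only to make the rule observable.

```python
from collections import deque


def choose_operation(queued: int, first_threshold: int,
                     first_pending: bool, second_ready: bool) -> str:
    """Return which class of operation to run next: "first", "second" or "wait"."""
    backlog = queued >= first_threshold
    # First-type work runs unless the first queue is backed up AND there is
    # second-type input available to work on instead.
    if first_pending and (not backlog or not second_ready):
        return "first"
    if second_ready:
        return "second"
    return "wait"


def simulate(num_groups: int, first_threshold: int, send_every: int) -> list:
    """Trace the operations chosen while a slow link drains the first queue."""
    first_queue = deque()   # data waiting to be sent to the second device
    returned = deque()      # data sent back by the second device
    produced = consumed = step = 0
    trace = []
    while consumed < num_groups:
        step += 1
        # Assumption: the link sends one queued item every `send_every` steps
        # and the second device's reply arrives immediately.
        if first_queue and step % send_every == 0:
            returned.append(first_queue.popleft())
        op = choose_operation(len(first_queue), first_threshold,
                              produced < num_groups, bool(returned))
        if op == "first":
            produced += 1
            first_queue.append(produced)
        elif op == "second":
            consumed += 1
            returned.popleft()
        trace.append(op)
    return trace


trace = simulate(num_groups=4, first_threshold=2, send_every=3)
```

In this trace the device runs three first-type operations, switches to a second-type operation once the first queue reaches the threshold, and returns to first-type work when the second-type input runs out.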
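The above describes the model deployed on the first training device and the first-type and second-type operations performed on it. For ease of understanding, the following uses specific examples to describe in detail how a training device executes first-type and second-type operations according to their different priorities.
For example, refer to FIG. 7, which is a schematic diagram of a training architecture according to an embodiment of this application. As shown in FIG. 7, the two training devices participating in training can be divided by role in the training task into a leader node and a follower node. The leader node is the training device that initiates the training task, and the follower node is the training device that receives the training task and participates in training. Taking the training process in FIG. 6 as an example, training device 3 may be the leader node and training device 4 the follower node.
As shown in FIG. 7, the leader node and the follower node contain the same functional components for model training and differ only in their training tasks. Taking the leader node as an example, it includes a worker and trainers.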
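The worker is responsible for task generation, data reading, parallel task scheduling, and exchanging data with the follower node through a communication module. Task generation means negotiating with the follower node before a training task is generated, to determine configuration information such as the training dataset, the model to be trained, the model's hyperparameters, and the encryption algorithm used during data transmission. Data reading means reading the training dataset needed during model training. Parallel task scheduling has two parts: multi-queue caching and greedy control. Multi-queue caching means that, during training, the data obtained from different types of operations and from communication is buffered in queues, so that the amount of data in each queue can be used to decide which operations to execute. Greedy control means deciding, during training, which operations to execute based on the amount of data in each queue, so that higher-priority operations are executed first whenever possible.
The trainers are responsible for performing forward computation, backward computation, and optimizer updates (that is, updating the model weights) on the model.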
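For ease of understanding, the following describes the parallel task scheduling of a training device in detail with reference to the training process in FIG. 6. Refer to FIG. 8, which is a schematic flowchart of parallel task scheduling by training device 3 according to an embodiment of this application. As shown in FIG. 8, the scheduling process includes the following steps 801 to 810.
Step 801: Divide the data of one batch into multiple groups of data.
Specifically, training device 3 first obtains one batch of data based on the batch size of the model and divides that batch into multiple groups, which may be numbered in order.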
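As a small illustration of step 801, the sketch below splits one batch into consecutive, numbered groups. The function name `split_into_groups` and the fixed `group_size` parameter are hypothetical; the application only states that the batch is divided into groups that may be numbered in order.

```python
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_groups(batch: Sequence[T], group_size: int) -> List[List[T]]:
    """Split one batch into consecutive groups of at most `group_size` items."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [list(batch[i:i + group_size])
            for i in range(0, len(batch), group_size)]


# A batch of 10 samples split into groups of 4; the last group is smaller.
groups = split_into_groups(list(range(10)), group_size=4)

# Groups are numbered from 1, matching the group index used in step 802.
numbered: Dict[int, List[int]] = dict(enumerate(groups, start=1))
```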
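Step 802: Determine that the data the model currently needs to process is the group with group index 1.
Step 803: Determine whether queue 1 is empty.
Queue 1 buffers the data that training device 4 transmits to training device 3. That is, the data buffered in queue 1 serves as input data for the forward computation that training device 3 performs on model C2.
Step 804: If queue 1 is empty, determine whether the forward computation of model C1 has been completed.
The forward computation of model C1 refers to the computation that training device 3 performs on model C1 based on one group of the divided data. During training, training device 3 needs to perform forward computation on model C1 multiple times, each time based on one group. The number of forward computations on model C1 therefore equals the number of groups.
If training device 3 has performed forward computation on model C1 for every group, the forward computation of model C1 is complete. If the number of forward computations performed on model C1 is less than the number of groups, the forward computation of model C1 is not complete.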
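Step 805: If the forward computation of model C1 is not complete, perform forward computation on model C1 and buffer the result in queue 2.
Specifically, training device 3 may fetch the group corresponding to the currently determined group index and perform forward computation on model C1 based on that group.
Step 806: After performing forward computation on model C1 based on one group of data, increment the group index of the data to be processed by 1.
After the group index is incremented, the next forward computation on model C1 selects the next group of data.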
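Step 807: If the forward computation of model C1 is complete, determine whether queue 3 is empty.
Queue 3 buffers the results of the backward gradient computation on model C2. That is, the data buffered in queue 3 serves as input data for the backward gradient computation on model C1.
Step 808: If queue 3 is not empty, perform backward gradient computation on model C1 based on the data in queue 3.
In addition, when queue 1 is empty, the forward computation of model C1 is complete, and queue 3 is empty, training device 3 has no computation it can execute. Training device 3 therefore returns to step 803 to check whether queue 1 now holds data transmitted by training device 4, that is, it keeps waiting for training device 4 to transmit data.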
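Step 809: If queue 1 is not empty, dequeue data from queue 1 and queue 2 at the same time.
A non-empty queue 1 means that training device 4 has transmitted data to training device 3, so training device 3 can take data from queue 1 and queue 2, that is, the data needed for the forward computation on model C2.
Step 810: Perform forward computation and backward gradient computation on model C2 based on the data dequeued from queue 1 and queue 2, and enqueue the result of the backward gradient computation into queue 3.
Finally, after training device 3 has performed all the backward gradient computations on model C2, it updates model C2 based on the multiple gradients obtained.
Similarly, after training device 3 has performed all the backward gradient computations on model C1, it updates model C1 based on the multiple gradients obtained.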
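In summary, in the flow shown in FIG. 8, because the result of the forward computation on model C1 is the basis of all other operations, training device 3 gives priority to the forward computation on model C1. Next, because the results of the forward computation and backward gradient computation on model C2 need to be transmitted to training device 4, training device 3 gives priority to these computations on model C2 as soon as it is able to perform them. Finally, when training device 3 cannot perform any of the above operations, it performs the backward gradient computation on model C1.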
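The loop of steps 803 to 810 can be sketched as a small simulation. This is only an illustration under stated assumptions: the function `run_device3`, the tick-based `arrivals` schedule, the string placeholders for tensors, and the `max_ticks` guard are hypothetical. The sketch also requires queue 2 to be non-empty before dequeuing in step 809, a safeguard the flow itself does not spell out.

```python
from collections import deque
from typing import Dict, List


def run_device3(num_groups: int, arrivals: Dict[int, int],
                max_ticks: int = 1000) -> List[str]:
    """Simulate the FIG. 8 loop on training device 3.

    `arrivals` maps a tick number to how many results arrive from device 4.
    """
    queue1: deque = deque()   # data received from training device 4
    queue2: deque = deque()   # results of C1's forward computation
    queue3: deque = deque()   # gradients from C2's backward computation
    group = 1                 # index of the next group for C1's forward pass
    c1_backward_done = 0
    log: List[str] = []
    tick = 0
    while c1_backward_done < num_groups and tick < max_ticks:
        tick += 1
        for _ in range(arrivals.get(tick, 0)):
            queue1.append("D-out")
        if queue1 and queue2:                  # steps 809 and 810
            queue1.popleft()
            c1_out = queue2.popleft()
            queue3.append(f"grad-for-{c1_out}")
            log.append("C2 forward+backward")
        elif group <= num_groups:              # steps 804 to 806
            queue2.append(f"C1-out-{group}")
            group += 1
            log.append("C1 forward")
        elif queue3:                           # steps 807 and 808
            queue3.popleft()
            c1_backward_done += 1
            log.append("C1 backward")
        else:                                  # nothing runnable: back to 803
            log.append("wait")
    return log


log = run_device3(num_groups=2, arrivals={2: 1, 5: 1})
```

With two groups and results from training device 4 arriving at ticks 2 and 5, the log interleaves the C1 forward passes with C2 work as soon as data is available and leaves the C1 backward passes for the otherwise idle slots.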
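Refer to FIG. 9, which is a schematic flowchart of parallel task scheduling by training device 4 according to an embodiment of this application. As shown in FIG. 9, the scheduling process of training device 4 includes the following steps 901 to 908.
Step 901: Divide the data of one batch into N groups of data.
Step 902: Determine that the data the model currently needs to process is the group with group index 1.
Steps 901 and 902 in this embodiment are similar to steps 801 and 802 above. For details, refer to steps 801 and 802; they are not repeated here.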
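Step 903: Determine whether the amount of data in queue 4 is greater than the first threshold, or whether the forward computation on model D has been completed.
Queue 4 buffers the results of the forward computation that training device 4 performs on model D. If the amount of data in queue 4 is greater than the first threshold, training device 4 is producing to-be-transmitted data through forward computation much faster than it can transmit data to training device 3. The forward computation on model D being complete means that training device 4 has performed forward computation on model D for all N groups.
Step 904: If the amount of data in queue 4 is not greater than the first threshold and the forward computation on model D is not complete, perform forward computation on model D and buffer the result in queue 4, so that the results in queue 4 are transmitted to training device 3.
Note that once a result is buffered in queue 4, training device 4 starts transmitting the results in queue 4 to training device 3. While this transmission is in progress, training device 4 can still perform other operations.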
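Step 905: After performing one forward computation on model D based on one group of data, increment the group index of the data to be processed by 1.
After step 905, return to step 903.
Step 906: If the amount of data in queue 4 is greater than the first threshold, or the forward computation on model D is complete, perform backward gradient computation on model D based on the data in queue 5.
If the amount of data in queue 4 is greater than the first threshold, training device 4 is producing to-be-transmitted data through forward computation much faster than it can transmit data to training device 3, so the forward computation on model D can be paused in favor of the backward gradient computation on model D.
In addition, after the forward computation on model D is complete, backward gradient computation on model D may be performed based on the data in queue 5 so that model D can subsequently be updated.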
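Step 907: Determine whether the backward gradient computation is complete.
Step 908: Update the weights of model D based on the multiple gradients obtained.
If the backward gradient computation is complete, the weights of model D are updated based on the multiple gradients obtained, completing one training iteration of model D. If it is not complete, return to step 903.
In summary, in the flow shown in FIG. 9, because the results of the forward computation on model D need to be transmitted to training device 3, training device 4 gives priority to the forward computation on model D. Next, when the queue holds a large amount of to-be-transmitted data, the forward computation on model D can be paused in favor of the backward gradient computation on model D to relieve memory pressure. Finally, after the forward computation on model D is complete, training device 4 performs the backward gradient computation on model D and updates the weights of model D based on the computed gradients.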
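The weight update in step 908 combines the gradients computed for all groups of a batch. The application does not state how they are combined, so the sketch below makes an assumption: a size-weighted mean followed by a plain gradient-descent step. The names `accumulate_gradients` and `sgd_step` are hypothetical.

```python
from typing import List


def accumulate_gradients(group_gradients: List[List[float]],
                         group_sizes: List[int]) -> List[float]:
    """Combine per-group gradients into one gradient by a size-weighted mean."""
    total = sum(group_sizes)
    target = [0.0] * len(group_gradients[0])
    for grad, size in zip(group_gradients, group_sizes):
        for i, value in enumerate(grad):
            target[i] += value * size / total
    return target


def sgd_step(weights: List[float], gradient: List[float],
             learning_rate: float) -> List[float]:
    """Apply one gradient-descent update to the weights."""
    return [w - learning_rate * g for w, g in zip(weights, gradient)]


# Two groups of equal size, each contributing one gradient vector.
target_gradient = accumulate_gradients([[1.0, 2.0], [3.0, 4.0]], [1, 1])
new_weights = sgd_step([1.0, 1.0], target_gradient, learning_rate=0.5)
```

Weighting by group size keeps the combined gradient equal to the gradient of the whole batch when the per-group gradients are themselves per-sample means, which is why the last, possibly smaller, group does not distort the update.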
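The method provided in the embodiments of this application is described in detail above. The following describes the devices provided in the embodiments of this application for performing the method.
Refer to FIG. 10, which is a schematic structural diagram of a model training apparatus according to an embodiment of this application. As shown in FIG. 10, the model training apparatus may be the first training device and includes: an obtaining module 1001, configured to obtain a training dataset that includes multiple pieces of data; and a processing module 1002, configured to divide the multiple pieces of data into multiple groups of data, where each group includes at least one piece of data and the amount of data in each group is related to the communication capability of the first training device. The processing module 1002 is further configured to process the multiple groups of data group by group, in batches, based on the model deployed on the first training device, so as to train the model and transmit to a second training device, in batches, the data obtained by processing the multiple groups. The second training device participates in training the model together with the first training device.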
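In a possible implementation, the operations performed while training the model include first-type operations and second-type operations. The first-type operations generate, based on the multiple groups of data, data to be transmitted to the second training device; the second-type operations generate data that is processed only by the first training device. The first-type operations have a higher execution priority than the second-type operations.
In a possible implementation, the first-type operations include first-subtype operations and second-subtype operations. The result of a first-subtype operation is used to obtain the input of a second-subtype operation, and the result of a second-subtype operation is transmitted to the second training device.
In a possible implementation, the second-subtype operations have a higher execution priority than the first-subtype operations.
In a possible implementation, the processing module 1002 is further configured to:
buffer the data produced by first-type operations in a first queue, where the first queue buffers data to be transmitted to the second training device; and
when the amount of data in the first queue is greater than or equal to a first threshold, stop executing first-type operations and execute second-type operations.
In a possible implementation, the processing module 1002 is further configured to:
when the amount of data in the first queue is less than the first threshold, or when the data that supports execution of the second-type operations has all been processed, stop executing second-type operations and resume executing first-type operations.
In a possible implementation, the training in which the first training device participates is federated learning.
In a possible implementation, the first-type operations include operations that process the multiple groups of data based on the model, and the second-type operations include operations that perform backward gradient computation on the model based on data obtained from the second training device.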
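In a possible implementation, the model includes a first sub-model and a second sub-model;
the first-type operations include operations that process the multiple groups of data based on the first sub-model, and operations that process the second sub-model based on data obtained from the second training device; and
the second-type operations include operations that perform backward gradient computation on the first sub-model.
In a possible implementation, the number of the multiple pieces of data is related to the batch size of the model;
during training of the model, a target gradient is related to the multiple gradients obtained based on the multiple groups of data, and the target gradient is used by the first training device to update the model.
In a possible implementation, the number of the multiple pieces of data is related to the batch size of the model, the number of groups is positively correlated with the amount of data to be transmitted, and the number of groups is negatively correlated with the communication capability of the first training device and with the training duration;
where the amount of data to be transmitted is the amount of to-be-transmitted data generated after the multiple groups of data are processed, and the training duration is the time the first training device takes to train the model based on the multiple groups of data.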
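The application states only the direction of these correlations, not a formula. As a purely hypothetical illustration, the sketch below picks one simple expression that has the stated signs: the group count grows with the amount of data to transmit and shrinks as the link bandwidth or the training duration grows. The function name and all parameters are assumptions.

```python
import math


def estimate_group_count(pending_bytes: float,
                         bandwidth_bytes_per_s: float,
                         training_seconds: float) -> int:
    """Hypothetical heuristic: more groups for more data, fewer groups for a
    faster link or a longer training duration. Always returns at least 1."""
    if bandwidth_bytes_per_s <= 0 or training_seconds <= 0:
        raise ValueError("bandwidth and training duration must be positive")
    ratio = pending_bytes / (bandwidth_bytes_per_s * training_seconds)
    return max(1, math.ceil(ratio))


baseline = estimate_group_count(1000, 50, 4)
more_data = estimate_group_count(2000, 50, 4)
faster_link = estimate_group_count(1000, 100, 4)
longer_training = estimate_group_count(1000, 50, 8)
```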
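Refer to FIG. 11, which is a schematic structural diagram of an execution device according to an embodiment of this application. The execution device 1100 may take the form of a mobile phone, a tablet, a laptop, a smart wearable device, a server, or the like, which is not limited here. Specifically, the execution device 1100 includes a receiver 1101, a transmitter 1102, a processor 1103, and a memory 1104 (the execution device 1100 may have one or more processors 1103; FIG. 11 uses one processor as an example). The processor 1103 may include an application processor 11031 and a communication processor 11032. In some embodiments of this application, the receiver 1101, the transmitter 1102, the processor 1103, and the memory 1104 may be connected by a bus or in another manner.
The memory 1104 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1103. A part of the memory 1104 may further include a non-volatile random access memory (NVRAM). The memory 1104 stores processor and operation instructions, executable modules, or data structures, or a subset or an extended set thereof, where the operation instructions may include various operation instructions for implementing various operations.
The processor 1103 controls the operation of the execution device. In a specific application, the components of the execution device are coupled together through a bus system, which may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus. For clarity, however, the various buses are all referred to as the bus system in the figure.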
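The method disclosed in the foregoing embodiments of this application may be applied to, or implemented by, the processor 1103. The processor 1103 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the foregoing method may be completed by integrated logic circuits of hardware in the processor 1103 or by instructions in the form of software. The processor 1103 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 1103 can implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed with reference to the embodiments of this application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1104; the processor 1103 reads the information in the memory 1104 and completes the steps of the foregoing method in combination with its hardware.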
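The receiver 1101 may be configured to receive input digit or character information and to generate signal input related to the settings and function control of the execution device. The transmitter 1102 may be configured to output digit or character information through a first interface; the transmitter 1102 may further be configured to send an instruction to a disk group through the first interface to modify data in the disk group; and the transmitter 1102 may further include a display device such as a display screen.
The electronic device provided in the embodiments of this application may specifically be a chip. The chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit can execute computer-executable instructions stored in a storage unit, so that the chip in the execution device performs the model training method described in the foregoing embodiments, or so that the chip in the training device performs the model training method described in the foregoing embodiments. Optionally, the storage unit is a storage unit inside the chip, such as a register or a cache. The storage unit may alternatively be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).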
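Specifically, refer to FIG. 12, which is a schematic structural diagram of a chip according to an embodiment of this application. The chip may take the form of a neural network processing unit NPU 1200. The NPU 1200 is mounted on a host CPU as a coprocessor, and the host CPU assigns tasks to it. The core of the NPU is the operation circuit 1203; a controller 1204 controls the operation circuit 1203 to fetch matrix data from memory and perform multiplication.
In some implementations, the operation circuit 1203 contains multiple processing engines (PEs). In some implementations, the operation circuit 1203 is a two-dimensional systolic array. The operation circuit 1203 may alternatively be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1203 is a general-purpose matrix processor.
For example, assume an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data of matrix B from the weight memory 1202 and buffers it on each PE of the operation circuit. The operation circuit then fetches the data of matrix A from the input memory 1201, performs a matrix operation with matrix B, and stores the partial or final results of the resulting matrix in an accumulator 1208.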
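The role of the accumulator can be illustrated in software. The sketch below is not a model of the NPU hardware: it simply computes C = A x B in slices along the shared dimension and adds each slice's partial products into an accumulator, so the accumulator holds partial results until the last slice is processed. The function name `matmul_accumulate` and the `tile` parameter are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def matmul_accumulate(a: Matrix, b: Matrix, tile: int = 2) -> Matrix:
    """Multiply a (rows x inner) by b (inner x cols), accumulating partial sums."""
    rows, inner, cols = len(a), len(b), len(b[0])
    accumulator = [[0] * cols for _ in range(rows)]
    # Walk the shared dimension in slices of `tile` elements.
    for k0 in range(0, inner, tile):
        k1 = min(k0 + tile, inner)
        for i in range(rows):
            for j in range(cols):
                # Partial result for this slice, added to what is already stored.
                accumulator[i][j] += sum(a[i][k] * b[k][j] for k in range(k0, k1))
    return accumulator


input_a = [[1, 2, 3], [4, 5, 6]]
weight_b = [[7, 8], [9, 10], [11, 12]]
output_c = matmul_accumulate(input_a, weight_b, tile=2)
```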
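The unified memory 1206 stores input data and output data. Weight data is moved to the weight memory 1202 through a direct memory access controller (DMAC) 1205. Input data is also moved to the unified memory 1206 through the DMAC.
The bus interface unit (BIU) 1210 handles the interaction between the AXI bus and both the DMAC and the instruction fetch buffer (IFB) 1209.
The bus interface unit 1210 is used by the instruction fetch buffer 1209 to obtain instructions from an external memory, and by the direct memory access controller 1205 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to move input data from the external DDR memory to the unified memory 1206, to move weight data to the weight memory 1202, or to move input data to the input memory 1201.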
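The vector computation unit 1207 includes multiple operation processing units and, when needed, further processes the output of the operation circuit 1203, for example with vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. It is mainly used for network computation in layers other than convolutional and fully connected layers in a neural network, such as batch normalization, pixel-wise summation, and upsampling of feature planes.
In some implementations, the vector computation unit 1207 can store the processed output vector in the unified memory 1206. For example, the vector computation unit 1207 may apply a linear or nonlinear function to the output of the operation circuit 1203, for example linearly interpolating the feature planes extracted by a convolutional layer, or applying the function to a vector of accumulated values to generate activation values. In some implementations, the vector computation unit 1207 generates normalized values, pixel-wise summed values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 1203, for example for use in a subsequent layer of the neural network.
The instruction fetch buffer 1209 connected to the controller 1204 stores the instructions used by the controller 1204.
The unified memory 1206, the input memory 1201, the weight memory 1202, and the instruction fetch buffer 1209 are all on-chip memories. The external memory is private to this NPU hardware architecture.
Any processor mentioned above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling execution of the foregoing programs.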
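Refer to FIG. 13, which is a schematic structural diagram of a computer-readable storage medium according to an embodiment of this application. This application further provides a computer-readable storage medium. In some embodiments, the method disclosed in FIG. 3 above may be implemented as computer program instructions encoded in a machine-readable format on a computer-readable storage medium, or encoded on another non-transitory medium or article of manufacture.
FIG. 13 schematically shows a conceptual partial view of an example computer-readable storage medium arranged according to at least some of the embodiments presented here. The example computer-readable storage medium includes a computer program for executing a computer process on a computing device.
In one embodiment, the computer-readable storage medium 1300 is provided using a signal-bearing medium 1301. The signal-bearing medium 1301 may include one or more program instructions 1302 that, when run by one or more processors, can provide all or part of the functions described above with respect to FIG. 5.
In some examples, the signal-bearing medium 1301 may include a computer-readable medium 1303, such as, but not limited to, a hard disk drive, a compact disc (CD), a digital video disc (DVD), a digital tape, a memory, a ROM, or a RAM.
In some implementations, the signal-bearing medium 1301 may include a computer-recordable medium 1304, such as, but not limited to, a memory, a read/write (R/W) CD, or an R/W DVD. In some implementations, the signal-bearing medium 1301 may include a communication medium 1305, such as, but not limited to, a digital and/or analog communication medium (for example, an optical fiber cable, a waveguide, a wired communication link, or a wireless communication link). Thus, for example, the signal-bearing medium 1301 may be conveyed by a wireless form of the communication medium 1305 (for example, a wireless communication medium complying with the IEEE 802.11 standard or another transmission protocol).
The one or more program instructions 1302 may be, for example, computer-executable instructions or logic-implementing instructions. In some examples, a computing device may be configured to provide various operations, functions, or actions in response to the program instructions 1302 conveyed to it through one or more of the computer-readable medium 1303, the computer-recordable medium 1304, and/or the communication medium 1305.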
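It should also be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the accompanying drawings of the apparatus embodiments provided in this application, a connection between modules indicates that they have a communication connection, which may be implemented as one or more communication buses or signal lines.
From the description of the foregoing implementations, a person skilled in the art can clearly understand that this application may be implemented by software plus the necessary general-purpose hardware, and of course also by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, and dedicated components. In general, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can also vary, for example analog circuits, digital circuits, or dedicated circuits. For this application, however, a software implementation is the preferred implementation in most cases. Based on this understanding, the technical solutions of this application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions that cause a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods in the embodiments of this application.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device such as a training device or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state disk (SSD)).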

Claims (26)

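  1. A model training method, applied to a first training device, the method comprising:
    obtaining a training dataset, wherein the training dataset comprises multiple pieces of data;
    dividing the multiple pieces of data into multiple groups of data, wherein each of the multiple groups comprises at least one piece of data, and the amount of data in each group is related to a communication capability of the first training device; and
    processing the multiple groups of data group by group, in batches, based on a model deployed on the first training device, so as to train the model and transmit to a second training device, in batches, the data obtained by processing the multiple groups of data, wherein the second training device participates in training the model together with the first training device.
  2. The method according to claim 1, wherein the operations performed while training the model comprise first-type operations and second-type operations, the first-type operations are used to generate, based on the multiple groups of data, data to be transmitted to the second training device, the second-type operations are used to generate data that is processed only by the first training device, and the first-type operations have a higher execution priority than the second-type operations.
  3. The method according to claim 2, wherein the first-type operations comprise first-subtype operations and second-subtype operations, a result of the first-subtype operations is used to obtain an input of the second-subtype operations, and a result of the second-subtype operations is transmitted to the second training device.
  4. The method according to claim 3, wherein the second-subtype operations have a higher execution priority than the first-subtype operations.
  5. The method according to any one of claims 1 to 4, wherein, while the model is being trained, the method further comprises:
    buffering the data produced by the first-type operations in a first queue, wherein the first queue is used to buffer data to be transmitted to the second training device; and
    when the amount of data in the first queue is greater than or equal to a first threshold, stopping execution of the first-type operations and executing the second-type operations.
  6. The method according to claim 5, wherein the method further comprises:
    when the amount of data in the first queue is less than the first threshold, or when the data that supports execution of the second-type operations has all been processed, stopping execution of the second-type operations and resuming execution of the first-type operations.
  7. The method according to any one of claims 1 to 6, wherein the training in which the first training device participates is federated learning.
  8. The method according to any one of claims 1 to 7, wherein the first-type operations comprise operations that process the multiple groups of data based on the model, and the second-type operations comprise operations that perform backward gradient computation on the model based on data obtained from the second training device.
  9. The method according to any one of claims 1 to 7, wherein the model comprises a first sub-model and a second sub-model;
    the first-type operations comprise operations that process the multiple groups of data based on the first sub-model, and operations that process the second sub-model based on data obtained from the second training device; and
    the second-type operations comprise operations that perform backward gradient computation on the first sub-model.
  10. The method according to any one of claims 1 to 9, wherein the number of the multiple pieces of data is related to a batch size of the model; and
    during training of the model, a target gradient is related to multiple gradients obtained based on the multiple groups of data, and the target gradient is used by the first training device to update the model.
  11. The method according to any one of claims 1 to 10, wherein the number of the multiple pieces of data is related to the batch size of the model, the number of groups is positively correlated with an amount of data to be transmitted, and the number of groups is negatively correlated with the communication capability of the first training device and with a training duration;
    wherein the amount of data to be transmitted is the amount of to-be-transmitted data generated after the multiple groups of data are processed, and the training duration is the time the first training device takes to train the model based on the multiple groups of data.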
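  12. A model training apparatus, wherein the model training apparatus is a first training device and comprises:
    an obtaining module, configured to obtain a training dataset, wherein the training dataset comprises multiple pieces of data; and
    a processing module, configured to divide the multiple pieces of data into multiple groups of data, wherein each of the multiple groups comprises at least one piece of data, and the amount of data in each group is related to a communication capability of the first training device;
    wherein the processing module is further configured to process the multiple groups of data group by group, in batches, based on a model deployed on the first training device, so as to train the model and transmit to a second training device, in batches, the data obtained by processing the multiple groups of data, and the second training device participates in training the model together with the first training device.
  13. The apparatus according to claim 12, wherein the operations performed while training the model comprise first-type operations and second-type operations, the first-type operations are used to generate, based on the multiple groups of data, data to be transmitted to the second training device, the second-type operations are used to generate data that is processed only by the first training device, and the first-type operations have a higher execution priority than the second-type operations.
  14. The apparatus according to claim 13, wherein the first-type operations comprise first-subtype operations and second-subtype operations, a result of the first-subtype operations is used to obtain an input of the second-subtype operations, and a result of the second-subtype operations is transmitted to the second training device.
  15. The apparatus according to claim 14, wherein the second-subtype operations have a higher execution priority than the first-subtype operations.
  16. The apparatus according to any one of claims 12 to 15, wherein the processing module is further configured to:
    buffer the data produced by the first-type operations in a first queue, wherein the first queue is used to buffer data to be transmitted to the second training device; and
    when the amount of data in the first queue is greater than or equal to a first threshold, stop executing the first-type operations and execute the second-type operations.
  17. The apparatus according to claim 16, wherein the processing module is further configured to:
    when the amount of data in the first queue is less than the first threshold, or when the data that supports execution of the second-type operations has all been processed, stop executing the second-type operations and resume executing the first-type operations.
  18. The apparatus according to any one of claims 12 to 17, wherein the training in which the first training device participates is federated learning.
  19. The apparatus according to any one of claims 12 to 18, wherein the first-type operations comprise operations that process the multiple groups of data based on the model, and the second-type operations comprise operations that perform backward gradient computation on the model based on data obtained from the second training device.
  20. The apparatus according to any one of claims 12 to 18, wherein the model comprises a first sub-model and a second sub-model;
    the first-type operations comprise operations that process the multiple groups of data based on the first sub-model, and operations that process the second sub-model based on data obtained from the second training device; and
    the second-type operations comprise operations that perform backward gradient computation on the first sub-model.
  21. The apparatus according to any one of claims 12 to 20, wherein the number of the multiple pieces of data is related to a batch size of the model; and
    during training of the model, a target gradient is related to multiple gradients obtained based on the multiple groups of data, and the target gradient is used by the first training device to update the model.
  22. The apparatus according to any one of claims 12 to 21, wherein the number of the multiple pieces of data is related to the batch size of the model, the number of groups is positively correlated with an amount of data to be transmitted, and the number of groups is negatively correlated with the communication capability of the first training device and with a training duration;
    wherein the amount of data to be transmitted is the amount of to-be-transmitted data generated after the multiple groups of data are processed, and the training duration is the time the first training device takes to train the model based on the multiple groups of data.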
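  23. A model training apparatus, comprising a memory and a processor, wherein the memory stores code, the processor is configured to execute the code, and when the code is executed, the apparatus performs the method according to any one of claims 1 to 11.
  24. A model training system, comprising at least two model training apparatuses, wherein the at least two model training apparatuses jointly participate in training a model, and any one of the at least two model training apparatuses uses the method according to any one of claims 1 to 11 to interact with the other model training apparatuses and train the model.
  25. A computer storage medium, wherein the computer storage medium stores instructions, and when the instructions are executed by a computer, the computer is caused to implement the method according to any one of claims 1 to 11.
  26. A computer program product, wherein the computer program product stores instructions, and when the instructions are executed by a computer, the computer is caused to implement the method according to any one of claims 1 to 11.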
PCT/CN2023/129042 2022-11-03 2023-11-01 一种模型训练方法及相关装置 WO2024094058A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211372502.8A CN118036776A (zh) 2022-11-03 2022-11-03 一种模型训练方法及相关装置
CN202211372502.8 2022-11-03

Publications (1)

Publication Number Publication Date
WO2024094058A1 true WO2024094058A1 (zh) 2024-05-10

Family

ID=90929727

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/129042 WO2024094058A1 (zh) 2022-11-03 2023-11-01 一种模型训练方法及相关装置

Country Status (2)

Country Link
CN (1) CN118036776A (zh)
WO (1) WO2024094058A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118396048A (zh) * 2024-06-28 2024-07-26 山东海量信息技术研究院 分布式训练系统、方法及设备、介质和计算机程序产品

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170308789A1 (en) * 2014-09-12 2017-10-26 Microsoft Technology Licensing, Llc Computing system for training neural networks
CN111931950A (zh) * 2020-09-28 2020-11-13 支付宝(杭州)信息技术有限公司 一种基于联邦学习进行模型参数更新的方法及系统
US20200389545A1 (en) * 2019-06-05 2020-12-10 Siemens Healthcare Gmbh Methods and systems for control of the transmission of medical image data packets via a network
WO2022105714A1 (zh) * 2020-11-23 2022-05-27 华为技术有限公司 数据处理方法、机器学习的训练方法及相关装置、设备
CN114861217A (zh) * 2022-03-25 2022-08-05 支付宝(杭州)信息技术有限公司 一种多方联合训练中的数据同步方法及装置
CN114897067A (zh) * 2022-04-28 2022-08-12 北京百度网讯科技有限公司 基于联邦学习的决策模型训练方法、装置和联邦学习系统

Also Published As

Publication number Publication date
CN118036776A (zh) 2024-05-14

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23884984

Country of ref document: EP

Kind code of ref document: A1