WO2023024920A1 - Model training method, system, cluster, and medium - Google Patents
Model training method, system, cluster, and medium
- Publication number
- WO2023024920A1 (PCT/CN2022/111734)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- model
- training
- parameters
- trained
- output
- Prior art date
Classifications
- G06N20/00—Machine learning
- G06N3/09—Supervised learning
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/045—Combinations of networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Definitions
- the present application relates to the technical field of artificial intelligence (AI), and in particular to a model training method, a model training system, a cluster of computing devices, a computer-readable storage medium, and a computer program product.
- AI tasks refer to tasks completed using the capabilities of AI models.
- AI tasks can include natural language processing (NLP) tasks such as language translation and intelligent question answering, or computer vision (CV) tasks such as target detection and image classification.
- New AI models are usually proposed by experts in the AI field for specific AI tasks, and these AI models have achieved good results in the above-mentioned specific AI tasks. Therefore, many researchers try to introduce these new AI models into other AI tasks.
- the transformer model is a deep learning model that weights various parts of the input data based on the attention mechanism.
- the transformer model has achieved remarkable results in many NLP tasks.
- Many researchers have tried to introduce the transformer model into CV tasks, such as image classification tasks, target detection tasks, and so on.
- For an AI model such as a transformer model to be applied to a new task, it usually needs to be pre-trained on a large data set first, which makes the training process very time-consuming. For example, some AI models may require thousands of days of training, which is difficult to reconcile with business needs.
- This application provides an AI model training method, which uses the output obtained when a second model, complementary to a first model, performs inference on the training data as a supervisory signal for training the first model. This promotes the accelerated convergence of the first model without requiring pre-training on a large-scale data set, shortening training time and improving training efficiency.
- the present application also provides a model training system, a computing device cluster, a computer-readable storage medium, and a computer program product corresponding to the above method.
- the present application provides an AI model training method.
- the method can be performed by a model training system.
- the model training system may be a software system for training an AI model, and the computing device or computing device cluster executes the AI model training method by running the program code of the software system.
- The model training system may also be a hardware system for training AI models. The following description takes the case where the model training system is a software system as an example.
- Specifically, the model training system determines a first model to be trained and a second model to be trained, where the first model and the second model are two heterogeneous AI models. The system then inputs training data into the first model and the second model, obtains a first output after the first model performs inference on the training data and a second output after the second model performs inference on the training data, and then uses the second output as the supervisory signal of the first model, iteratively updating the model parameters of the first model in combination with the first output until the first model satisfies a first preset condition.
- In this way, the model training system uses the second output, obtained by inferring the training data with the second model whose performance is complementary to that of the first model, to add an additional supervisory signal to the training of the first model. This promotes learning of the first model from the complementary second model, so that the first model converges faster without pre-training on a large-scale data set, which greatly shortens training time, improves the efficiency of training the first model, and meets business needs.
- The model training system may also use the first output as the supervisory signal of the second model, and iteratively update the model parameters of the second model in combination with the second output until the second model satisfies a second preset condition.
- In this way, the model training system uses the first output, obtained by inferring the training data with the first model whose performance is complementary to that of the second model, to add an additional supervisory signal to the training of the second model. This promotes learning of the second model from the complementary first model, so that the second model converges faster without pre-training on a large-scale data set, which greatly shortens training time, improves the efficiency of training the second model, and meets business needs.
- the first output includes at least one of a first feature extracted by the first model from the training data and a first probability distribution inferred based on the first feature
- The second output includes at least one of a second feature extracted from the training data by the second model and a second probability distribution inferred based on the second feature.
- Using the second output as the supervisory signal of the first model and iteratively updating the model parameters of the first model in combination with the first output can be realized in the following manner: determining a first contrastive loss according to the first feature and the second feature, and/or determining a first relative entropy loss according to the first probability distribution and the second probability distribution; and then iteratively updating the model parameters of the first model according to at least one of the first contrastive loss and the first relative entropy loss.
- the model training system can not only enable the AI model to learn how to distinguish different categories, but also enable the AI model to refer to the probability estimate (or probability distribution) of another AI model to improve its generalization ability.
- In some implementations, when iteratively updating the model parameters of the first model, the model training system first iteratively updates the model parameters of the first model according to the gradient of the first contrastive loss and the gradient of the first relative entropy loss. When the difference between the supervised loss of the first model and the supervised loss of the second model is less than a first preset threshold, it stops iteratively updating the model parameters of the first model according to the gradient of the first contrastive loss.
- In this way, the model training system restricts gradient backflow, for example restricting the backflow of the contrastive-loss gradient into the first model, which prevents the model with poorer performance from misleading the model with better performance and causing it to converge in the wrong direction, thereby promoting efficient convergence of the first model.
- Similarly, the model training system may first iteratively update the model parameters of the second model according to the gradient of the second contrastive loss and the gradient of the second relative entropy loss. When the difference between the supervised loss of the second model and the supervised loss of the first model is less than a second preset threshold, it stops iteratively updating the model parameters of the second model according to the gradient of the second relative entropy loss.
- In this way, the model training system restricts gradient backflow, for example limiting the backflow of the relative-entropy-loss gradient into the second model, which prevents the model with poorer performance from misleading the model with better performance and causing it to converge in the wrong direction, thereby promoting efficient convergence of the second model.
- The branch training the first model and the branch training the second model may differ in learning speed, data utilization efficiency, and upper limit of representation ability. The model training system can adjust the training strategy so that, at different stages of training, the branch with the better training effect (for example, faster convergence and higher precision) acts as the teacher (that is, the role providing supervisory signals) and promotes the learning of the branch with the poorer training effect.
- the two branches can be partners and learn from each other.
- the roles of the branches can be reversed. That is to say, two heterogeneous AI models can independently select corresponding roles during the training process to achieve mutual promotion and improve training efficiency.
- the first model is a transformer model
- the second model is a convolutional neural network model.
- The performances of the transformer model and the convolutional neural network model are complementary. Therefore, the model training system can train the transformer model and the convolutional neural network model in a complementary learning manner to improve training efficiency.
- The model training system may determine the first model to be trained and the second model to be trained according to the user's selection through the user interface, or may determine the first model to be trained and the second model to be trained according to the type of AI task set by the user.
- The model training system supports adaptively determining the first model to be trained and the second model to be trained according to the type of AI task, which improves the automation of AI model training. The model training system also supports human intervention, for example manually selecting the first model to be trained and the second model to be trained, to realize interactive training.
- the model training system may receive training parameters configured by the user through the user interface, and may also determine the training parameters according to the type of AI task set by the user and the first model and the second model. In this way, the model training system can support adaptive determination of training parameters, thereby realizing a fully automatic AI model training solution. In addition, the model training system also supports manual intervention to configure training parameters to meet personalized business needs.
- The model training system may output at least one of the trained first model and the trained second model, so as to perform inference through at least one of the trained first model and the trained second model. That is to say, the model training system can implement joint training and detachable inference (for example, using only one of the AI models for inference), thereby improving the flexibility of deploying AI models and reducing the difficulty of deployment.
- the training parameters include one or more of training rounds, optimizer type, learning rate update strategy, model parameter initialization method, and training strategy.
- the model training system can iteratively update the model parameters of the first model according to the above training parameters, so as to improve the training efficiency of the first model.
- the present application provides a model training system.
- the system includes:
- An interaction unit configured to determine a first model to be trained and a second model to be trained, where the first model and the second model are two heterogeneous AI models;
- a training unit configured to input training data into the first model and the second model, and obtain a first output after the first model performs inference on the training data and a second output after the second model performs inference on the training data;
- the training unit is further configured to use the second output as the supervisory signal of the first model, and iteratively update the model parameters of the first model in combination with the first output until the first model satisfies the first preset conditions.
- the training unit is also used for:
- the first output includes at least one of a first feature extracted by the first model from the training data and a first probability distribution inferred based on the first feature
- the second output includes at least one of a second feature extracted from the training data by the second model and a second probability distribution inferred based on the second feature
- the training unit is specifically used for:
- Model parameters of the first model are iteratively updated according to at least one of the first comparative loss and the first relative entropy loss.
- the training unit is specifically used for:
- the first model is a transformer model
- the second model is a convolutional neural network model
- the interaction unit is specifically configured to:
- the first model to be trained and the second model to be trained are determined according to the type of AI task set by the user.
- the interaction unit is also used for:
- the training parameters are determined according to the type of the AI task set by the user and the first model and the second model.
- the training parameters include one or more of training rounds, optimizer type, learning rate update strategy, model parameter initialization method, and training strategy.
- the present application provides a computing device cluster, where the computing device cluster includes at least one computing device.
- At least one computing device includes at least one processor and at least one memory.
- the processor and the memory communicate with each other.
- the at least one processor is configured to execute instructions stored in the at least one memory, so that the cluster of computing devices executes the method described in the first aspect or any implementation manner of the first aspect.
- The present application provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and the instructions instruct a computing device or a cluster of computing devices to execute the method described in the first aspect or any implementation manner of the first aspect.
- The present application provides a computer program product containing instructions which, when run on a computing device or a computing device cluster, cause the computing device or computing device cluster to perform the method described in the first aspect or any implementation manner of the first aspect.
- The implementation manners provided in the above aspects of the present application may be further combined to provide more implementation manners.
- FIG. 1 is a system architecture diagram of a model training system provided by an embodiment of the present application
- FIG. 2 is a schematic diagram of a model selection interface provided by an embodiment of the present application.
- FIG. 3 is a schematic diagram of a training parameter configuration interface provided by an embodiment of the present application.
- FIG. 4 is a schematic diagram of a deployment environment of a model training system provided by an embodiment of the present application.
- FIG. 5 is a flow chart of a model training method provided in an embodiment of the present application.
- FIG. 6 is a schematic flow chart of a model training method provided in an embodiment of the present application.
- FIG. 7 is a schematic diagram of a model training process provided by an embodiment of the present application.
- FIG. 8 is a schematic structural diagram of a computing device cluster provided by an embodiment of the present application.
- first and second in the embodiments of the present application are used for description purposes only, and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of indicated technical features. Thus, a feature defined as “first” and “second” may explicitly or implicitly include one or more of these features.
- AI tasks refer to tasks completed using the capabilities of AI models.
- AI tasks can be divided into natural language processing (natural language processing, NLP) tasks, computer vision (computer vision, CV) tasks, automatic speech recognition (automatic speech recognition, ASR) tasks and other different types.
- An AI model refers to an algorithm model developed and trained by AI technologies such as machine learning to achieve specific AI tasks.
- the AI model is also referred to as a "model" for short.
- Different types of AI tasks can be completed by their corresponding AI models.
- NLP tasks such as language translation or intelligent question answering can be completed through the transformer model.
- CV tasks such as image classification or target detection can be completed by a convolutional neural network (CNN) model.
- the transformer model has achieved remarkable results in many NLP tasks, and many researchers have tried to introduce the transformer model into CV tasks.
- However, the number of words in the vocabulary of an NLP task is limited, while the patterns of input images in CV tasks are practically unlimited. Therefore, when the transformer model is introduced into other tasks such as CV tasks, it needs to be pre-trained on a larger data set, and the whole training process takes a long time. For example, introducing some AI models into other AI tasks may require thousands of days of training, which is difficult to reconcile with business needs.
- an embodiment of the present application provides an AI model training method.
- the method can be performed by a model training system.
- the model training system can be a software system for training an AI model, and the software system can be deployed in a cluster of computing devices.
- the computing device cluster executes the AI model training method of the embodiment of the present application by running the program code of the above software system.
- the model training system may also be a hardware system, and the hardware system executes the AI model training method of the embodiment of the present application when running.
- the model training system can determine the first model to be trained and the second model to be trained, wherein the first model and the second model are two heterogeneous AI models, that is, the first model and the second model are AI models of different structural types, for example, one AI model can be a transformer model, and the other AI model can be a CNN model. Due to heterogeneity, the performance of the two AI models is usually complementary. To this end, the model training system can jointly train the first model and the second model through complementary learning.
- The process of jointly training the first model and the second model by the model training system is as follows: input the training data into the first model and the second model, obtain the first output after the first model performs inference on the training data and the second output after the second model performs inference on the training data, then use the second output as the supervisory signal of the first model and iteratively update the model parameters of the first model in combination with the first output until the first model meets the first preset condition.
- In this way, the model training system uses the second output, obtained by inferring the training data with the second model, to add an additional supervisory signal to the training of the first model and to promote the first model to learn from the second model that is complementary to it. The first model thus converges faster without pre-training on a large-scale data set, which greatly shortens training time, improves the efficiency of training the first model, and meets business needs.
- the model training system 100 includes an interaction unit 102 and a training unit 104 .
- the interaction unit 102 can interact with the user through a browser (browser) or a client (client).
- the interaction unit 102 is configured to determine a first model to be trained and a second model to be trained, where the first model and the second model are two heterogeneous AI models.
- The training unit 104 is used to input the training data into the first model and the second model, obtain the first output after the first model performs inference on the training data and the second output after the second model performs inference on the training data, and then use the second output as the supervisory signal of the first model, iteratively updating the model parameters of the first model in combination with the first output until the first model satisfies a first preset condition.
- the training unit 104 is further configured to use the first output as the supervisory signal of the second model, and iteratively update the model parameters of the second model in combination with the second output, until the second model satisfies the second preset condition.
- the interaction unit 102 may interact with the user through a browser or a client, so as to determine the first model to be trained and the second model to be trained. For example, the interaction unit 102 may determine the first model to be trained and the second model to be trained according to the user's selection through the user interface. For another example, the interaction unit 102 may automatically determine the first model to be trained and the second model to be trained according to the type of AI task set by the user.
- The following takes the case in which the interaction unit 102 determines the first model to be trained and the second model to be trained according to the user's selection through the user interface as an example for illustration.
- the user interface includes a model selection interface.
- the model selection interface may be a graphical user interface (graphical user interface, GUI) or a command user interface (command user interface, CUI).
- the model selection interface is used as an example for illustration.
- the interaction unit 102 may provide the client or the browser with page elements of the model selection interface in response to a request from the client or the browser, so that the client or the browser renders the model selection interface according to the page elements.
- the model selection interface 200 carries model selection controls, such as a first model selection control 202 and a second model selection control 204 .
- a selectable model list may be presented to the user in the interface.
- the selectable model list includes at least one model, and each model includes at least one instance.
- The user can select an instance of one model from the selectable model list as the first model, and select an instance of another model from the selectable model list as the second model.
- the first model may be an instance of the transformer model
- the second model may be an instance of the CNN model.
- the model selection interface 200 also carries an OK control 206 and a Cancel control 208 . Wherein, the confirm control 206 is used to confirm the user's model selection operation, and the cancel control 208 is used to cancel the user's model selection operation.
- the instance of the model in the selectable model list may be built in the model training system, or may be pre-uploaded by the user.
- the user may also upload an instance of the AI model in real time, so that the interaction unit 102 determines the multiple instances of the AI model uploaded by the user as the first model to be trained and the second model to be trained.
- The selectable model list may include a custom option. When the user selects this option, the process of uploading an instance of an AI model may be triggered, and the interaction unit 102 may determine the instances of the AI models uploaded by the user in real time as the first model to be trained and the second model to be trained.
- the training unit 104 may perform model training according to training parameters.
- the training parameters may be manually configured by the user, or automatically determined or adaptively adjusted by the training unit 104 .
- the training parameters may include one or more of training rounds, optimizer type, learning rate update strategy, model parameter initialization method and training strategy.
- Training rounds refers to the number of training epochs. One epoch means that each sample in the training set participates in model training once.
- the optimizer refers to the algorithm used to update the model parameters.
- the optimizer type can include different types such as gradient descent, momentum optimization, and adaptive learning rate optimization.
- Gradient descent can be further subdivided into batch gradient descent (BGD), stochastic gradient descent (SGD), or mini-batch gradient descent.
- Momentum optimization includes standard momentum optimization and Nesterov accelerated gradient (NAG) optimization.
- Adaptive learning rate optimization includes AdaGrad, RMSProp, Adam or AdaDelta, etc.
- the learning rate refers to the control factor of the update range of the model parameters, which can usually be set to 0.01, 0.001, or 0.0001, etc.
- the learning rate update strategy can be piecewise constant decay, exponential decay, cosine decay or reciprocal decay, etc.
- the model parameter initialization method includes using a pre-trained model to perform model parameter initialization. In some embodiments, the model parameter initialization method may also include Gaussian distribution initialization and the like.
- the training strategy refers to the strategy used to train the model. Training strategies can be divided into single-stage training strategies and multi-stage training strategies. When the optimizer type is gradient descent, the training strategy can also include the gradient return method of each training stage.
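- As an illustration only, training parameters of this kind could be captured in a simple configuration structure; the field names and values below are assumptions chosen to mirror the parameters described above, not definitions from the patent:

```python
# Illustrative training-parameter configuration (all names/values are assumptions).
training_params = {
    "epochs": 100,                       # training rounds
    "optimizer": "adam",                 # optimizer type (e.g. SGD, momentum, Adam)
    "learning_rate": 0.001,              # initial learning rate
    "lr_schedule": "exponential_decay",  # learning rate update strategy
    "init": "pretrained",                # model parameter initialization method
    "strategy": "multi_stage",           # training strategy (single- or multi-stage)
}
```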
- the following is an example of a user manually configuring training parameters through a user interface.
- the user interface includes a training parameter configuration interface, and the training parameter configuration interface may be a GUI or a CUI.
- the GUI for configuring the training parameters is used as an example for illustration.
- the training parameter configuration interface 300 carries a training round configuration control 302, an optimizer type configuration control 304, a learning rate update strategy configuration control 306, a model initialization mode configuration control 308 and training Policy configuration controls 310 .
- the training round configuration control 302 supports the user to configure the training rounds by directly inputting a numerical value or adding or subtracting a numerical value.
- the user can directly input a value of 100 through the training round configuration control 302, thereby configuring the training rounds to be 100 rounds .
- the optimizer type configuration control 304, the learning rate update strategy configuration control 306, the model parameter initialization method configuration control 308 and the training strategy configuration control 310 support the user to configure the corresponding parameters through the drop-down selection method.
- the user can configure the optimizer type as Adam, the learning rate update strategy as exponential decay, the model parameter initialization method as initialization based on the pre-trained model, and the training strategy as a three-stage training strategy.
- the training parameter configuration interface 300 also carries an OK control 312 and a Cancel control 314 .
- When the OK control 312 is triggered, the browser or the client may submit the aforementioned training parameters configured by the user to the model training system 100. When the cancel control 314 is triggered, the user's configuration of the training parameters is cancelled.
- FIG. 3 is an example illustrating that the user uniformly configures the training parameters for the first model and the second model.
- the user may also separately configure the training parameters for the first model and the second model.
- the training parameters may also be automatically determined according to the type of the AI task set by the user and the first model and the second model.
- Specifically, the model training system 100 can maintain a mapping relationship between the type of AI task, the first model, the second model, and the training parameters, and the training parameters can then be determined based on this mapping relationship.
- Fig. 1 is only a schematic division manner of the model training system 100, and in other possible implementation manners of the embodiment of the present application, the model training system 100 may also be divided in other manners. This embodiment of the present application does not limit it.
- the model training system 100 can be deployed in various ways. In some possible implementation manners, the model training system 100 may be deployed centrally in a cloud environment, an edge environment, or a terminal, or distributed in different environments in a cloud environment, an edge environment, or a terminal.
- the cloud environment indicates the cluster of central computing equipment owned by the cloud service provider and used to provide computing, storage, and communication resources.
- the cluster of central computing devices includes one or more central computing devices, which may be, for example, central servers.
- the edge environment indicates an edge computing device cluster that is geographically close to the end device (that is, the end-side device) and is used to provide computing, storage, and communication resources.
- the edge computing device cluster includes one or more edge computing devices.
- the edge computing device may be, for example, an edge server or a computing box.
- Terminals include but are not limited to user terminals such as desktop computers, laptop computers, and smart phones.
- model training system 100 is centrally deployed in a cloud environment to provide users with cloud services for training AI models for illustration.
- the model training system 100 is centrally deployed in a cloud environment, for example, deployed in a central server of the cloud environment. In this way, the model training system 100 can provide cloud services for training AI models for users to use.
- the model training system 100 deployed in the cloud environment can provide an application programming interface (application programming interface, API) of the cloud service.
- API application programming interface
- a browser or a client can call this API to enter the model selection interface 200 .
- the user can select an instance of the AI model through the model selection interface 200, and the model training system 100 determines the first model to be trained and the second model to be trained according to the user's selection.
- the browser or client can enter the training parameter configuration interface 300 .
- Users can configure training parameters such as training rounds, optimizer type, learning rate update strategy, model parameter initialization method, and training strategy through the controls carried by the training parameter configuration interface 300 .
- the model training system 100 performs joint training on the first model and the second model according to the above training parameters configured by the user.
- Specifically, the model training system 100 in the cloud environment inputs the training data into the first model and the second model, obtains the first output after the first model performs inference on the training data and the second output after the second model performs inference on the training data, then uses the second output as the supervisory signal of the first model and iteratively updates the model parameters of the first model in combination with the first output until the first model satisfies the first preset condition.
- When iteratively updating the model parameters of the first model, the model training system 100 can, according to the configured training parameters, update the parameters of the first model using gradient descent and update the learning rate using exponential decay.
- the method includes:
- the model training system 100 determines a first model to be trained and a second model to be trained.
- the first model and the second model are two heterogeneous AI models.
- heterogeneous means that the structure types of AI models are different.
- An AI model is usually formed by connecting multiple neurons (cells). Therefore, the structure type of the AI model can be determined according to the structure type of the neurons.
- When the structure types of the neurons are different, the structure types of the AI models formed based on those neurons may also be different.
- the performances of the two heterogeneous AI models can be complementary. Among them, performance can be measured by different indicators.
- the metrics can be, for example, accuracy, inference time, and the like.
- Complementary performances of two heterogeneous AI models may be that the performance of the first model on the first index is better than that of the second model on the first index, and the performance of the second model on the second index is better than that of the first model.
- the inference time of an AI model with a low number of parameters is shorter than that of an AI model with a high number of parameters, and the accuracy of an AI model with a high number of parameters is higher than that of an AI model with a low number of parameters.
- the first model and the second model may be different models in a transformer model, a CNN model, and a recurrent neural network (recurrent neural network, RNN) model.
- the first model may be a transformer model
- the second model may be a CNN model.
- the model training system 100 can determine the first model to be trained and the second model to be trained in various ways. The different implementation methods are introduced respectively below.
- the model training system 100 determines the first model to be trained and the second model to be trained according to the user's selection through the user interface. Specifically, the model training system 100 may return a page element in response to a client or browser request, so that the client or browser presents the model selection interface 200 to the user based on the page element.
- the user can select instances of AI models of different structure types through the model selection interface 200, such as selecting instances of any two models in a transformer model, a CNN model, and a recurrent neural network (recurrent neural network, RNN) model, and the model training system 100 can The instances of the model selected by the user are determined as the first model to be trained and the second model to be trained.
- the model training system 100 may determine the instance of the transformer model as the first model to be trained, and determine the instance of the CNN model as the second model to be trained.
- the model training system 100 obtains the task type, and determines the models matching the task type as the first model to be trained and the second model to be trained according to the mapping relationship between the task type and the AI model.
- For example, the model training system 100 can determine, according to the mapping relationship between the task type and the AI model, that the AI models matching the image classification task include a transformer model and a CNN model, so an instance of the transformer model and an instance of the CNN model can be determined as the first model to be trained and the second model to be trained.
- the model training system 100 may determine a first model to be trained and a second model to be trained from multiple AI models matching the task type according to business requirements.
- Business requirements may include requirements for model performance, requirements for model size, and so on.
- model performance can be characterized by indicators such as accuracy, inference time, and inference speed.
- For example, the model training system 100 can, according to the requirement for model size, determine a vision transformer base model with 16×16 patches (vision transformer base/16, ViT-B/16) as the first model to be trained, and determine a 50-layer residual network model (residual network-50, ResNet-50) as the second model to be trained.
- the model training system 100 may also determine ViT-B/16 as the first model to be trained and ResNet-50 as the second model to be trained based on the user's selection.
- ResNet is an example of a CNN model; ResNet alleviates the vanishing-gradient and exploding-gradient problems of deep CNN models through shortcut (skip) connections.
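- As an illustration only, a heterogeneous model pair such as ViT-B/16 and ResNet-50 might be instantiated as follows; the use of the timm and torchvision libraries and the model names are assumptions, since the patent does not prescribe any particular framework:

```python
import timm
import torchvision

# First model to be trained: a vision transformer instance (assumed timm model name).
first_model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=1000)

# Second model to be trained: a 50-layer residual network instance.
second_model = torchvision.models.resnet50(num_classes=1000)
```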
- the model training system 100 inputs training data into the first model and the second model, and obtains a first output after the first model infers the training data, and a second output after the second model infers the training data.
- Specifically, the model training system 100 can obtain a training data set, divide the training data in the training data set into several batches (for example, according to a preset batch size), and then input the training data in batches into the first model and the second model to obtain a first output after the first model performs inference on the training data and a second output after the second model performs inference on the training data.
- the first output after the first model infers the training data includes at least one of a first feature extracted by the first model from the training data and a first probability distribution inferred based on the first feature.
- the second output after the second model performs inference on the training data includes at least one of a second feature extracted by the second model from the training data and a second probability distribution inferred based on the second feature.
- In other embodiments, the model training system 100 may not batch the training data in the training data set, but instead input the training data into the first model and the second model one by one, and obtain the first output after the first model performs inference on the training data and the second output after the second model performs inference on the training data. That is, the model training system 100 may train the AI model in an offline training manner or an online training manner, which is not limited in this embodiment of the present application.
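- A minimal sketch of this inference step, assuming PyTorch-style models that each return the extracted features together with class logits (the exact model interfaces are assumptions, not specified by the patent):

```python
import torch

def forward_both(first_model, second_model, batch: torch.Tensor):
    """Run one batch of training data through both heterogeneous models.

    Each model is assumed to return (features, logits); the probability
    distributions are obtained from the logits with softmax.
    """
    z1, logits1 = first_model(batch)    # first feature / first classifier output
    z2, logits2 = second_model(batch)   # second feature / second classifier output
    p1 = torch.softmax(logits1, dim=1)  # first probability distribution
    p2 = torch.softmax(logits2, dim=1)  # second probability distribution
    return (z1, p1, logits1), (z2, p2, logits2)
```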
- the model training system 100 uses the second output as the supervisory signal of the first model, and iteratively updates the model parameters of the first model in combination with the first output, until the first model satisfies the first preset condition.
- the second output of the second model after performing inference on the training data may be used as a supervisory signal of the first model for supervised training of the first model.
- Specifically, the model training system 100 can supervise the training of the first model as follows: it determines the first contrastive loss according to the first feature extracted from the training data by the first model and the second feature extracted from the training data by the second model, determines the first relative entropy loss according to the first probability distribution and the second probability distribution, and then iteratively updates the model parameters of the first model according to at least one of the first contrastive loss and the first relative entropy loss.
- Contrastive loss is mainly used to characterize the loss of the same training data after dimensionality reduction processing (such as feature extraction) by different AI models.
- The contrastive loss can be obtained from the first feature extracted from the training data by the first model and the second feature extracted from the training data by the second model, for example according to the distance between the first feature and the second feature.
- The model training system 100 can determine the contrastive loss of the first model and the second model by formula (1):
- L_cont represents the contrastive loss
- N is the number of training data in a batch
- z represents a feature; for example, z_i^1 and z_i^2 represent the first feature obtained by the first model extracting features from the i-th training data and the second feature obtained by the second model extracting features from the i-th training data, respectively.
- i and j can take any integer from 1 to N (including both endpoints of 1 and N).
- Features can be represented in the form of feature vectors or feature matrices.
- P represents the logistic regression (softmax) probability of the similarity of features.
- the similarity of features can be characterized by the distance of feature vectors, for example, by the cosine distance of feature vectors.
- The logistic regression probability of the similarity between the first feature and the second feature is usually not equal to the logistic regression probability of the similarity between the second feature and the first feature; for example, P(z_i^1, z_i^2) is generally not equal to P(z_i^2, z_i^1).
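- A minimal sketch of a contrastive loss of this kind, assuming cosine similarity between normalized features and a temperature-scaled softmax; formula (1) itself is not reproduced in this text, so the exact form used in the patent may differ:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss between the two models' features for one batch of N samples.

    For the i-th sample, z2[i] is treated as the positive for z1[i] and the other
    N-1 features in z2 act as negatives; P is the softmax probability of the
    feature similarity, which is not symmetric in its two arguments.
    """
    z1 = F.normalize(z1, dim=1)               # cosine similarity via unit-length vectors
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # (N, N) pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)   # -1/N * sum_i log P(z1_i, z2_i)
```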
- Relative entropy loss is also known as KL divergence (Kullback-Leibler divergence, KLD).
- the relative entropy loss may be the loss generated by the same training data being classified by the classifiers of the first model and the second model.
- Relative entropy can be determined from different probability distributions. The following is an example of relative entropy loss in an image classification task.
- the model training system 100 can determine the relative entropy loss of the first model and the second model by formula (2):
- N represents the number of training data in a batch
- P_1(i) represents the probability distribution produced by the first model when classifying the i-th training data, that is, the first probability distribution
- P_2(i) represents the probability distribution produced by the second model when classifying the i-th training data, that is, the second probability distribution
- P_1(i) and P_2(i) are discrete distributions.
- The relative entropy loss (KL divergence) is not symmetric: the relative entropy from distribution P_1 to distribution P_2 is usually not equal to the relative entropy from distribution P_2 to distribution P_1, that is, D_KL(P_1‖P_2) ≠ D_KL(P_2‖P_1).
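- A minimal sketch of such a relative entropy loss, with the complementary model's distribution detached so that no gradient flows back into it; the chosen direction of the divergence is an assumption, and the exact form of formula (2) may differ:

```python
import torch
import torch.nn.functional as F

def relative_entropy_loss(logits_student: torch.Tensor, logits_teacher: torch.Tensor) -> torch.Tensor:
    """Batch-averaged KL divergence between the two models' class distributions.

    logits_student: classifier outputs of the model being updated.
    logits_teacher: classifier outputs of the complementary model, used only as
                    a supervisory signal (hence detach()).
    """
    log_p_student = F.log_softmax(logits_student, dim=1)
    p_teacher = F.softmax(logits_teacher, dim=1).detach()
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```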
- In this embodiment, the model training system 100 can determine the first contrastive loss according to the first feature and the second feature in combination with the above formula (1), and determine the first relative entropy loss according to the first probability distribution and the second probability distribution in combination with the above formula (2). The model training system 100 can then iteratively update the model parameters of the first model according to at least one of the gradient of the first contrastive loss and the gradient of the first relative entropy loss.
- the model parameters refer to parameters that can be learned through training data.
- the model parameters of the first model may include weight w and bias b of neurons.
- the model training system 100 when iteratively updates the model parameters of the first model, it can iteratively update the model parameters of the first model according to the pre-configured training parameters.
- the training parameters include the optimizer type, which can be different types such as gradient descent and momentum optimization, and the gradient descent further includes batch gradient descent, stochastic gradient descent or mini-batch gradient descent.
- the model training system 100 can iteratively update the model parameters of the first model according to a pre-configured optimizer type, for example, the model training system 100 can iteratively update the model parameters of the first model through gradient descent.
- the pre-configured training parameters also include a learning rate update policy.
- The model training system 100 can update the learning rate according to the learning rate update policy, for example, update the learning rate according to exponential decay. The model training system 100 then iteratively updates the model parameters of the first model based on the gradients and the updated learning rate.
- the first preset condition can be set according to business requirements.
- the first preset condition may be set to be that the performance of the first model reaches the preset performance.
- performance can be measured by indicators such as accuracy and inference time.
- the first preset condition may be set as the loss value of the first model tends to converge, or the loss value of the first model is smaller than a preset value.
- the performance of the first model can be determined by the performance of the first model on the test dataset.
- the datasets for training AI models include training datasets, validation datasets, and test datasets.
- the training data set is used to learn model parameters, such as learning the weight of neurons in the first model, and further, learning the bias of neurons in the first model.
- the validation data set is used to select the hyperparameters of the first model, such as the number of model layers, the number of neurons, and the learning rate.
- the test dataset is used to evaluate the performance of the model. The test dataset does not participate in the process of determining model parameters nor in the process of selecting hyperparameters. In order to ensure the evaluation accuracy, the test data in the test data set is usually used once.
- Specifically, the model training system 100 can input the test data in the test data set into the first model, and evaluate the performance of the first model according to the output of the first model after inference on the test data and the labels of the test data. If the performance of the trained first model reaches the preset performance, the model training system 100 can output the trained first model; otherwise, the model training system 100 can return to model selection or training parameter configuration for model optimization until the performance of the trained first model reaches the preset performance.
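- A minimal sketch of checking whether the first preset condition is met, here taking a target accuracy on the test data set as the condition; the threshold and the evaluation procedure are illustrative assumptions:

```python
import torch

@torch.no_grad()
def meets_preset_condition(model, test_loader, target_accuracy: float = 0.9) -> bool:
    """Evaluate the trained model on the test data set against a target accuracy."""
    model.eval()
    correct, total = 0, 0
    for images, labels in test_loader:
        _, logits = model(images)             # model assumed to return (features, logits)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return correct / total >= target_accuracy
```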
- the model training system 100 uses the first output as the supervisory signal of the second model, and iteratively updates the model parameters of the second model in combination with the second output until the second model satisfies the second preset condition.
- the model training system 100 may also perform supervised training on the second model according to the first output.
- the first output includes at least one of a first feature extracted from the training data by the first model and a first probability distribution inferred based on the first feature.
- the second output includes at least one of a second feature extracted from the training data by the second model and a second probability distribution inferred based on the second feature.
- the model training system 100 may determine the second comparative loss according to the second output and the first output, and determine the second relative entropy loss according to the second probability distribution and the first probability distribution.
- the model training system 100 may iteratively update the model parameters of the second model according to at least one of the second comparison loss and the second relative entropy loss until the second model satisfies the second preset condition.
- the calculation method of the second comparative loss may refer to the above formula (1)
- the calculation method of the second relative entropy loss may refer to the above formula (2), which will not be repeated in this embodiment.
- When iteratively updating the model parameters of the second model, the model training system 100 can do so according to the training parameters preset for the second model.
- the training parameter may include an optimizer type, and the model training system 100 may iteratively update the parameters of the second model according to the optimizer type.
- the optimizer type may be stochastic gradient descent, and the model training system 100 may iteratively update the parameters of the second model by means of stochastic gradient descent.
- Training parameters can also include a learning rate update strategy.
- the model training system 100 can update the learning rate according to the learning rate updating strategy.
- The model training system 100 can then iteratively update the model parameters of the second model based on at least one of the gradient of the second contrastive loss and the gradient of the second relative entropy loss, together with the updated learning rate.
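- Putting the pieces above together, one joint training step might look roughly as follows; the equal loss weighting, the per-model optimizers, and the helper functions (forward_both, contrastive_loss, relative_entropy_loss from the earlier sketches) are assumptions rather than the patent's prescribed procedure:

```python
import torch.nn.functional as F

def joint_training_step(first_model, second_model, opt1, opt2, images, labels):
    """One iteration of complementary joint training for the two heterogeneous models."""
    (z1, _, logits1), (z2, _, logits2) = forward_both(first_model, second_model, images)

    # Supervised (cross-entropy) losses against the labels.
    sup1 = F.cross_entropy(logits1, labels)
    sup2 = F.cross_entropy(logits2, labels)

    # Complementary supervisory signals: contrastive loss on features and relative
    # entropy loss on probability distributions.  The other model's outputs are
    # detached, so each loss only sends gradients into the model being updated.
    loss1 = sup1 + contrastive_loss(z1, z2.detach()) + relative_entropy_loss(logits1, logits2)
    loss2 = sup2 + contrastive_loss(z2, z1.detach()) + relative_entropy_loss(logits2, logits1)

    opt1.zero_grad()
    opt2.zero_grad()
    loss1.backward()
    loss2.backward()
    opt1.step()
    opt2.step()
    return sup1.item(), sup2.item()
```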
- the second preset condition can be set according to business requirements.
- the second preset condition may be set to be that the performance of the second model reaches the preset performance.
- performance can be measured by indicators such as accuracy and inference time.
- the second preset condition may be set as the loss value of the second model tends to converge, or the loss value of the second model is smaller than the preset value.
- the embodiment of the present application provides an AI model training method.
- In the method, the model training system 100 uses the second output, obtained by inferring the training data with the second model, to add an additional supervisory signal to the training of the first model, so as to promote the first model to learn from the second model that is complementary to it. This accelerates the convergence of the first model, so that targeted training can be achieved without pre-training on a large-scale data set, which greatly shortens training time, improves the efficiency of training the first model, and meets business needs.
- Similarly, the model training system 100 can use the first output, obtained by inferring the training data with the first model, to add an additional supervisory signal to the training of the second model, so as to promote the second model to learn from the first model that is complementary to it, so that the second model converges faster without pre-training on a large-scale data set, which greatly shortens training time, improves the efficiency of training the second model, and meets business needs.
- as training proceeds, the performance of the first model and the performance of the second model may change.
- for example, the performance of the first model may change from being lower than that of the second model to being higher than that of the second model.
- if the model parameters of the first model were still updated based on both the gradient of the first contrastive loss and the gradient of the first relative entropy loss, the second model could mislead the first model and harm its training.
- for this reason, the model training system 100 may also iteratively update the model parameters of the first model with restricted gradient backflow.
- restricted gradient backflow (gradient-limited reflow) means that only part of the gradient is propagated back to update the model parameters, for example only the gradient of the contrastive loss, or only the gradient of the relative entropy loss.
- in practice, the model training system 100 can adopt restricted gradient backflow to update the model parameters of the first model when the performance of the first model is significantly higher than that of the second model.
- the performance of the first model, such as its accuracy, can also be characterized by the supervision loss of the first model.
- the supervision loss is also called the cross-entropy loss and can be calculated by formula (3), whose standard form is $L_{sup} = -\frac{1}{n}\sum_{i=1}^{n} p(x_i)\log q(x_i)$.
- here, $x_i$ denotes the i-th training data sample, n denotes the number of training data in a batch, $p(x_i)$ denotes the true probability distribution given by the labels, and $q(x_i)$ denotes the predicted probability distribution, for example the first probability distribution inferred by the first model.
- in general, the smaller the supervision loss of the first model, the closer its inference results are to the labels and the higher its accuracy; the larger the supervision loss, the further its inference results are from the labels and the lower its accuracy.
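For a classification task with hard labels, formula (3) reduces to the usual cross entropy; a minimal sketch, assuming PyTorch logits and integer class labels:

```python
import torch
import torch.nn.functional as F

def supervision_loss(logits, labels):
    # logits: [n, num_classes] raw model outputs for a batch; labels: [n] ground-truth class indices
    # equivalent to -(1/n) * sum_i log q(x_i)[label_i], i.e. formula (3) with one-hot p(x_i)
    return F.cross_entropy(logits, labels)
```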
- the process of training the first model by the model training system 100 may include the following steps:
- S5062: the model training system 100 iteratively updates the model parameters of the first model according to the gradient of the first contrastive loss and the gradient of the first relative entropy loss.
- specifically, at the initial stage of training, when the performance of the first model and the second model is complementary, the model training system 100 can propagate back both the gradient of the first contrastive loss and the gradient of the first relative entropy loss, and iteratively update the model parameters of the first model based on both gradients.
- S5064: when the difference between the supervision loss of the first model and the supervision loss of the second model is smaller than a first preset threshold, the model training system 100 stops iteratively updating the model parameters of the first model according to the gradient of the first contrastive loss.
- specifically, the model training system 100 may determine the supervision loss of the first model and the supervision loss of the second model respectively with reference to formula (3) above.
- when the difference between the two supervision losses is smaller than the first preset threshold, the supervision loss of the first model is significantly smaller than that of the second model; the model training system 100 may then trigger restricted gradient backflow, for example only propagating back the gradient of the first relative entropy loss, and stop updating the first model according to the gradient of the first contrastive loss.
- it should be noted that S5064 is illustrated with the model training system 100 propagating back the gradient of the first relative entropy loss; in other possible implementations, the model training system 100 may instead propagate back the gradient of the first contrastive loss, so as to iteratively update the model parameters of the first model according to that gradient.
- similarly, when the model training system 100 also uses the output of the first model as a supervisory signal to train the second model, it can, once the trigger condition for restricted gradient backflow is met, propagate back only part of the gradient (for example, the gradient of the second relative entropy loss) and iteratively update the model parameters of the second model according to that partial gradient.
- by setting the above losses, the model training system 100 not only enables an AI model to learn how to distinguish different categories, but also lets the AI model refer to the probability estimates of another AI model to improve its own generalization ability.
- moreover, by restricting gradient backflow, for example restricting the backflow of the contrastive-loss gradient to the first model, or restricting the backflow of the relative-entropy-loss gradient to the second model, a poorly performing model can be prevented from misleading a better-performing model and driving it to converge in the wrong direction, thereby promoting efficient convergence of the first model and the second model.
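A sketch of how restricted gradient backflow for the first model could be realized, assuming PyTorch, where `detach()` cuts the contrastive-loss gradient path once the supervision-loss condition is met; the threshold is an assumed hyperparameter rather than a value from this application.

```python
def first_model_loss(first_contrastive_loss, first_kl_loss, sup_loss_1, sup_loss_2, threshold):
    # When the first model's supervision loss is clearly lower than the second model's
    # (difference below the first preset threshold), trigger restricted backflow:
    # the contrastive loss is detached so only the relative-entropy gradient updates the first model.
    if sup_loss_1 - sup_loss_2 < threshold:
        return first_contrastive_loss.detach() + first_kl_loss   # step S5064
    # Otherwise both gradients flow back to the first model (step S5062).
    return first_contrastive_loss + first_kl_loss
```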
- in addition, because of differences in model structure, the branch training the first model and the branch training the second model may differ in learning speed, data utilization efficiency, and the upper limit of representation capability.
- the model training system 100 can therefore adjust the training strategy so that, at different stages of training, the branch with the better training effect (for example, faster convergence and higher accuracy) plays the teacher role (that is, the role of providing the supervisory signal) and promotes learning of the branch with the poorer training effect.
- when the training effects are close, the two branches can act as partners and learn from each other.
- as training progresses, the roles of the branches can be swapped; that is, the two heterogeneous AI models can autonomously select their roles during training so as to promote each other, which improves training efficiency.
- referring to the schematic flowchart of the AI model training method shown in FIG. 6, the model training system 100 acquires a plurality of AI models to be trained, specifically an instance of a CNN model and an instance of a transformer model, also called the CNN branch and the transformer branch.
- each branch includes a backbone network and a classifier; the backbone network is used to extract feature vectors from the input image, and the classifier is used to classify the image based on the feature vectors.
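A minimal sketch of such a branch, assuming PyTorch; the backbone class, feature dimension, and classifier head are illustrative assumptions rather than the concrete networks of this example.

```python
import torch.nn as nn

class Branch(nn.Module):
    """One branch: a backbone that extracts a feature vector plus a classifier head."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone          # e.g., a CNN or a vision transformer backbone
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, images):
        features = self.backbone(images)  # assumed to return [N, feat_dim] feature vectors
        logits = self.classifier(features)
        return features, logits
```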
- in one training stage, the CNN model and the transformer model can act as teacher model (that is, the model providing the supervisory signal) and student model (that is, the model learning from the supervisory signal) for each other.
- the model training system 100 can determine the contrastive loss according to the features extracted from the training data (such as input images) by the CNN model and the features extracted from the training data by the transformer model, and can determine the relative entropy loss according to the probability distribution over categories obtained by the CNN model classifying the input image and the probability distribution over categories obtained by the transformer model classifying the input image.
- as shown by the dotted line pointing to the transformer branch in FIG. 6, the gradient of the contrastive loss can flow back to the transformer model, and the model training system 100 can update the model parameters of the transformer model according to that gradient.
- as shown by the dotted line pointing to the CNN branch in FIG. 6, the gradient of the relative entropy loss (KL divergence) can flow back to the CNN model, and the model training system 100 can update the model parameters of the CNN model according to that gradient.
- in another training stage, when the supervision loss of the transformer model is much smaller than that of the CNN model, the gradient of the contrastive loss can stop flowing back to the transformer model, and the model training system 100 updates the model parameters of the CNN model according to the gradient of the relative entropy loss.
- conversely, when the supervision loss of the transformer model is much larger than that of the CNN model, the gradient of the relative entropy loss can stop flowing back to the CNN model, and the model training system 100 updates the model parameters of the transformer model according to the gradient of the contrastive loss.
- it should be noted that the contrastive loss is usually dual (symmetric); therefore, its gradient can also flow back to the second model, for example to the CNN model. That is, the model training system 100 can update the model parameters of the CNN model according to both the gradient of the contrastive loss and the gradient of the relative entropy loss.
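Putting the pieces together, the following sketch shows one joint training step over the two branches in the stage where the contrastive gradient flows back to the transformer branch and the relative-entropy gradient flows back to the CNN branch; it reuses the loss helpers sketched earlier and is only an illustration of the described flow, not the exact procedure of this application.

```python
def joint_training_step(cnn_branch, trf_branch, images, labels, cnn_opt, trf_opt):
    f_cnn, logits_cnn = cnn_branch(images)
    f_trf, logits_trf = trf_branch(images)

    # transformer branch: label supervision plus the contrastive signal from the (detached) CNN features
    loss_trf = supervision_loss(logits_trf, labels) + contrastive_loss(f_trf, f_cnn.detach())
    # CNN branch: label supervision plus KL(P_transformer || P_cnn), with the transformer detached
    loss_cnn = supervision_loss(logits_cnn, labels) + relative_entropy_loss(logits_trf.detach(), logits_cnn)

    trf_opt.zero_grad()
    loss_trf.backward()
    trf_opt.step()

    cnn_opt.zero_grad()
    loss_cnn.backward()
    cnn_opt.step()
    return loss_trf.item(), loss_cnn.item()
```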
- the embodiment of the present application also verifies, on multiple data sets, the performance of the AI models trained by the AI model training method of the present application; see Table 1 for details.
- Table 1 shows the accuracy, on the ImageNet, Real, V2, CIFAR10, CIFAR100, Flowers, and Stanford Cars data sets, of the two models output by the joint training of this embodiment as well as of two independently trained models. It should be noted that this accuracy is the accuracy of the top-ranked category when the model predicts the category of the input image, that is, the Top1 accuracy.
- as can be seen from Table 1, the accuracy of the jointly trained CNN model (for example, the jointly trained ResNet-50 in Table 1) and of the jointly trained transformer model (for example, the jointly trained ViT-Base in Table 1) is improved compared with the independently trained CNN model (ResNet-50 in Table 1) and the independently trained transformer model (ViT-Base in Table 1), and the improvement is especially significant on the V2 data set.
- in addition, compared with the independently trained ResNet-50 and ViT-Base, the jointly trained ResNet-50 and ViT-Base converge faster.
- referring to the training-progress diagram of each model shown in FIG. 7, the jointly trained ResNet-50 and ViT-Base usually converge within 20 training rounds, while the independently trained ResNet-50 and ViT-Base usually converge only after 20 rounds.
- it can be seen that heterogeneous AI models learning from each other and being trained jointly can effectively shorten the training time and improve training efficiency.
- in this example, the model training system 100 adds a learning objective similar to that of contrastive learning, using the features learned by one AI model to add an additional supervisory signal to the training of the other AI model; each AI model can update its model parameters in a targeted way based on this supervisory signal, so accelerated convergence can be achieved.
- because the two heterogeneous AI models are naturally heterogeneous and differ in representation capability, common problems in contrastive learning such as model collapse and degenerate solutions can be effectively prevented.
- moreover, this method does not require manually designing heuristic structural operators to promote model convergence or improve model performance; it keeps the characteristics of each model's original structure as much as possible and reduces modifications to structural details, which improves the flexibility and scalability of the model training system 100 and gives it good generality.
- referring to the schematic structural diagram of the model training system 100 shown in FIG. 1, the system 100 includes:
- An interaction unit 102 configured to determine a first model to be trained and a second model to be trained, where the first model and the second model are two heterogeneous AI models;
- a training unit 104 configured to input the training data into the first model and the second model, and to obtain a first output produced by the first model performing inference on the training data and a second output produced by the second model performing inference on the training data;
- the training unit 104 is further configured to use the second output as the supervisory signal of the first model, and to iteratively update the model parameters of the first model in combination with the first output until the first model satisfies a first preset condition.
- in some possible implementations, the training unit 104 is further configured to: use the first output as the supervisory signal of the second model, and iteratively update the model parameters of the second model in combination with the second output until the second model satisfies a second preset condition.
- in some possible implementations, the first output includes at least one of a first feature extracted by the first model from the training data and a first probability distribution inferred based on the first feature, and the second output includes at least one of a second feature extracted by the second model from the training data and a second probability distribution inferred based on the second feature;
- the training unit 104 is specifically configured to: determine a first contrastive loss according to the first feature and the second feature, and/or determine a first relative entropy loss according to the first probability distribution and the second probability distribution; and iteratively update the model parameters of the first model according to at least one of the first contrastive loss and the first relative entropy loss.
- in some possible implementations, the training unit 104 is specifically configured to: iteratively update the model parameters of the first model according to the gradient of the first contrastive loss and the gradient of the first relative entropy loss; and, when the difference between the supervision loss of the first model and the supervision loss of the second model is smaller than a first preset threshold, stop iteratively updating the model parameters of the first model according to the gradient of the first contrastive loss.
- in some possible implementations, the first model is a transformer model and the second model is a convolutional neural network model.
- the interaction unit 102 is specifically configured to: determine the first model to be trained and the second model to be trained according to a selection made by the user through a user interface; or determine the first model to be trained and the second model to be trained according to the type of AI task set by the user.
- the interaction unit 102 is further configured to: receive training parameters configured by the user through the user interface; and/or determine the training parameters according to the type of AI task set by the user together with the first model and the second model.
- the training parameters include one or more of training rounds, optimizer type, learning rate update strategy, model parameter initialization method, and training strategy.
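As an illustration, such training parameters could be grouped into a simple configuration object; the field names and default values below are assumptions for the sketch only.

```python
from dataclasses import dataclass

@dataclass
class TrainingParams:
    epochs: int = 100                       # training rounds
    optimizer_type: str = "sgd"             # e.g., "sgd", "adam"
    lr_schedule: str = "exponential_decay"  # learning rate update strategy
    init_method: str = "pretrained"         # model parameter initialization method
    strategy: str = "three_stage"           # training strategy, e.g., staged gradient backflow

params = TrainingParams()
```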
- the model training system 100 according to this embodiment of the present application may correspondingly perform the methods described in the embodiments of the present application, and the above and other operations and/or functions of the modules/units of the model training system 100 respectively implement the corresponding procedures of the methods in the embodiment shown in FIG. 5; for brevity, details are not repeated here.
- the embodiment of the present application also provides a computing device cluster.
- the computing device cluster may be a computing device cluster formed by at least one computing device in a cloud environment, an edge environment, or a terminal device.
- the computing device cluster is specifically used to implement the functions of the model training system 100 in the embodiment shown in FIG. 1 .
- FIG. 8 provides a schematic structural diagram of a computing device cluster.
- the computing device cluster 80 includes multiple computing devices 800 , and the computing device 800 includes a bus 801 , a processor 802 , a communication interface 803 and a memory 804 .
- the processor 802 , the memory 804 and the communication interface 803 communicate through the bus 801 .
- the bus 801 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus or the like.
- the bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in FIG. 8 , but it does not mean that there is only one bus or one type of bus.
- the processor 802 may be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), a digital signal processor (DSP), and the like.
- the communication interface 803 is used for communicating with the outside.
- for example, the communication interface 803 can be used to receive the first model and the second model selected by the user through the user interface and the training parameters configured by the user, or the communication interface 803 can be used to output the trained first model and/or the trained second model, and so on.
- the memory 804 may include a volatile memory (volatile memory), such as a random access memory (random access memory, RAM).
- the memory 804 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
- executable code is stored in the memory 804, and the processor 802 executes the executable code to perform the aforementioned AI model training method.
- the embodiment of the present application also provides a computer-readable storage medium.
- the computer-readable storage medium may be any available medium that a computing device can store data on, or a data storage device, such as a data center, containing one or more available media.
- the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media (eg, solid state hard disk), etc.
- the computer-readable storage medium includes instructions, and the instructions instruct a computing device to execute the above AI model training method.
- the embodiment of the present application also provides a computer program product.
- the computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on the computing device, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computing device, or data center to another website, computing device, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Medical Informatics (AREA)
- Image Analysis (AREA)
Abstract
一种人工智能(AI)模型训练方法,包括:确定待训练的第一模型和待训练的第二模型(S502),第一模型和第二模型为异构的两种AI模型,将训练数据输入第一模型和第二模型,获得第一模型对训练数据进行推理后的第一输出,以及第二模型对训练数据进行推理后的第二输出(S504),然后以第二输出为第一模型的监督信号,结合第一输出迭代更新第一模型的模型参数,直至第一模型满足第一预设条件(S506)。该方法利用与第一模型互补的第二模型对训练数据进行推理后的输出作为监督信号,训练第一模型,促进第一模型加速收敛,无需在大规模数据集上预训练,缩短了训练时间,提高了训练效率。
Description
本申请要求于2021年08月24日提交中国国家知识产权局、申请号为202110977567.4、发明名称为“模型训练方法、系统、集群及介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
本申请涉及人工智能(artificial intelligence,AI)技术领域,尤其涉及一种模型训练方法、模型训练系统以及计算设备集群、计算机可读存储介质、计算机程序产品。
随着AI技术的不断发展,很多新的AI模型也随之产生。其中,AI模型是指通过机器学习等AI技术开发和训练得到的用于实现特定AI任务的算法模型。AI任务是指利用AI模型的功能完成的任务。其中,AI任务可以包括语言翻译、智能问答等自然语言处理(natural language processing,NLP)任务,或者目标检测、图像分类等计算机视觉(computer vision,CV)任务。
新的AI模型通常是AI领域的专家针对特定的AI任务而提出的,并且这些AI模型在上述特定的AI任务取得了较好的效果。因此,很多研究者尝试将这些新的AI模型引入其他的AI任务。以转换器(transformer)模型为例,transformer模型是一种基于注意力机制对输入数据的各个部分进行加权的深度学习模型。该transformer模型在很多NLP任务中均获得了显著的效果,很多研究者尝试将transformer模型引入CV任务,例如图像分类任务、目标检测任务等等。
然而,将AI模型(例如是transformer模型)引入新的AI任务时,通常需要先在较大的数据集上进行预训练,由此导致整个训练过程需要花费较长时间,例如一些AI模型可能需要训练数千天,难以满足业务的需求。
发明内容
本申请提供了一种AI模型训练方法,该方法利用与第一模型互补的第二模型对训练数据进行推理后的输出作为监督信号,训练第一模型,促进第一模型加速收敛,无需在大规模数据集上预训练,缩短了训练时间,提高了训练效率。本申请还提供了上述方法对应的模型训练系统、计算设备集群、计算机可读存储介质以及计算机程序产品。
第一方面,本申请提供了一种AI模型训练方法。该方法可以由模型训练系统执行。该模型训练系统可以是用于训练AI模型的软件系统,计算设备或计算设备集群通过运行该软件系统的程序代码,以执行AI模型训练方法。该模型训练系统也可以是用于训练AI模型的硬件系统。下文以该模型训练系统为软件系统进行示例说明。
具体地,模型训练系统确定待训练的第一模型和待训练的第二模型,该第一模型和第二模型为异构的两种AI模型,然后将训练数据输入所述第一模型和所述第二模型,获得所 述第一模型对所述训练数据进行推理后的第一输出,以及所述第二模型对所述训练数据进行推理后的第二输出,接着以所述第二输出为所述第一模型的监督信号,结合所述第一输出迭代更新所述第一模型的模型参数,直至所述第一模型满足第一预设条件。
该方法中,模型训练系统利用与第一模型性能互补的第二模型对训练数据进行推理后的第二输出,为第一模型的训练加入额外的监督信号,促进第一模型向与该第一模型互补的第二模型学习,使得第一模型可以加速收敛,无需在大规模的数据集上进行预训练,大幅缩短了训练时间,提高了第一模型训练的效率,满足了业务的需求。
在一些可能的实现方式中,模型训练系统还可以以所述第一输出为所述第二模型的监督信号,结合所述第二输出迭代更新所述第二模型的模型参数,直至所述第二模型满足第二预设条件。
如此,模型训练系统利用与第二模型性能互补的第一模型对训练数据进行推理后的第一输出,为第二模型的训练加入额外的监督信号,促进第二模型向与该第二模型互补的第一模型学习,使得第二模型可以加速收敛,无需在大规模的数据集上进行预训练,大幅缩短了训练时间,提高了第二模型训练的效率,满足了业务的需求。
在一些可能的实现方式中,所述第一输出包括所述第一模型从所述训练数据中提取的第一特征和基于所述第一特征推理的第一概率分布中的至少一个,所述第二输出包括所述第二模型从所述训练数据中提取的第二特征和基于所述第二特征推理的第二概率分布中的至少一个。
模型训练系统以第二输出为第一模型的监督信号,结合第一输出迭代更新所述第一模型的模型参数,可以通过如下方式实现:根据所述第一特征和所述第二特征确定第一对比损失,和/或者,根据所述第一概率分布和所述第二概率分布确定第一相对熵损失;然后根据所述第一对比损失和所述第一相对熵损失中的至少一个,迭代更新所述第一模型的模型参数。
基于上述对比损失和/或相对熵损失进行梯度回流,模型训练系统不仅可以使得AI模型学习到如何区分不同的类别,还能够使AI模型参考另一个AI模型的概率估计(或称作概率分布)来提升自身的泛化能力。
在一些可能的实现方式中,模型训练系统在迭代更新所述第一模型的模型参数时,可以先根据所述第一对比损失的梯度和所述第一相对熵损失的梯度迭代更新所述第一模型的模型参数。当所述第一模型的监督损失与所述第二模型的监督损失的差值小于第一预设阈值时,停止执行根据所述第一对比损失的梯度迭代更新所述第一模型的模型参数。
该方法中,模型训练系统通过对梯度回流进行限制,例如限制对比损失的梯度回流至第一模型,可以避免性能较差的模型对性能较好的模型产生误导,导致模型朝着错误的方向收敛,由此可以促进第一模型高效收敛。
在一些可能的实现方式中,模型训练系统在迭代更新所述第二模型的模型参数时,可以先根据所述第二对比损失的梯度和所述第二相对熵损失的梯度迭代更新所述第二模型的模型参数。当所述第二模型的监督损失与所述第一模型的监督损失的差值小于第二预设阈值时,停止执行根据所述第二相对熵损失的梯度迭代更新所述第二模型的模型参数。
模型训练系统通过对梯度回流进行限制,例如限制相对熵损失的梯度回流至第二模型, 可以避免性能较差的模型对性能较好的模型产生误导,导致模型朝着错误的方向收敛,由此可以促进第二模型高效收敛。
在一些可能的实现方式中,由于模型结构的差异,训练第一模型的分支和训练第二模型的分支的学习速度、数据利用效率及表征能力的上限可以是不同的,模型训练系统可以调整训练策略,实现在训练的不同阶段,由训练效果好(如收敛快、精度高)的分支充当老师的角色(即提供监督信号的角色),促进训练效果较差的分支进行学习。在训练效果接近的情况下,两个分支可以互为合作伙伴,相互学习。随着训练的递进,分支的角色可以发生互换。也即异构的两个AI模型在训练过程中可以自主地选择相应角色达到互相促进的目的,提高了训练效率。
在一些可能的实现方式中,所述第一模型为转换器模型,所述第二模型为卷积神经网络模型。转换器模型和卷积神经网络模型的性能互补,因此,模型训练系统可以采用互补学习的方式训练转换器模型和卷积神经网络模型,提高训练效率。
在一些可能的实现方式中,模型训练系统可以根据用户通过用户界面的选择,确定所述待训练的第一模型和所述待训练的第二模型,或者是根据用户设置的AI任务的类型确定所述待训练的第一模型和所述待训练的第二模型。
该方法中,模型训练系统支持根据AI任务的类型自适应地确定待训练的第一模型和待训练的第二模型,提升了AI模型训练的自动化程度,并且,模型训练系统也支持人为干预,例如人工选择待训练的第一模型和待训练的第二模型,实现交互式训练。
在一些可能的实现方式中,模型训练系统可以接收用户通过用户界面配置的训练参数,也可以根据用户设置的AI任务的类型以及所述第一模型、所述第二模型,确定训练参数。如此,模型训练系统可以支持自适应确定训练参数,进而实现全自动的AI模型训练方案,此外,模型训练系统也支持人工干预的方式配置训练参数,满足了个性化的业务需求。
在一些可能的实现方式中,模型训练系统可以输出已训练的第一模型和已训练的第二模型中的至少一个,以通过已训练的第一模型和已训练的第二模型中的至少一个进行推理。也即模型训练系统可以实现联合训练及可拆卸推理(例如使用其中一个AI模型进行推理),由此提升了部署AI模型的灵活性,降低AI模型部署的难度。
在一些可能的实现方式中,所述训练参数包括训练轮次、优化器类型、学习率更新策略、模型参数初始化方式和训练策略中的一种或多种。模型训练系统可以按照上述训练参数,迭代更新第一模型的模型参数,以提升第一模型的训练效率。
第二方面,本申请提供了一种模型训练系统。所述系统包括:
交互单元,用于确定待训练的第一模型和待训练的第二模型,所述第一模型和所述第二模型为异构的两种AI模型;
训练单元,用于将训练数据输入所述第一模型和所述第二模型,获得所述第一模型对所述训练数据进行推理后的第一输出,以及所述第二模型对所述训练数据进行推理后的第二输出;
所述训练单元,还用于以所述第二输出为所述第一模型的监督信号,结合所述第一输出迭代更新所述第一模型的模型参数,直至所述第一模型满足第一预设条件。
在一些可能的实现方式中,所述训练单元还用于:
以所述第一输出为所述第二模型的监督信号,结合所述第二输出迭代更新所述第二模型的模型参数,直至所述第二模型满足第二预设条件。
在一些可能的实现方式中,所述第一输出包括所述第一模型从所述训练数据中提取的第一特征和基于所述第一特征推理的第一概率分布中的至少一个,所述第二输出包括所述第二模型从所述训练数据中提取的第二特征和基于所述第二特征推理的第二概率分布中的至少一个;
所述训练单元具体用于:
根据所述第一特征和所述第二特征确定第一对比损失,和/或者,根据所述第一概率分布和所述第二概率分布确定第一相对熵损失;
根据所述第一对比损失和所述第一相对熵损失中的至少一个,迭代更新所述第一模型的模型参数。
在一些可能的实现方式中,所述训练单元具体用于:
根据所述第一对比损失的梯度和所述第一相对熵损失的梯度迭代更新所述第一模型的模型参数;
当所述第一模型的监督损失与所述第二模型的监督损失的差值小于第一预设阈值时,停止执行根据所述第一对比损失的梯度迭代更新所述第一模型的模型参数。
在一些可能的实现方式中,所述第一模型为转换器模型,所述第二模型为卷积神经网络模型。
在一些可能的实现方式中,所述交互单元具体用于:
根据用户通过用户界面的选择,确定所述待训练的第一模型和所述待训练的第二模型;或者,
根据用户设置的AI任务的类型确定所述待训练的第一模型和所述待训练的第二模型。
在一些可能的实现方式中,所述交互单元还用于:
接收用户通过用户界面配置的训练参数;和/或,
根据用户设置的AI任务的类型以及所述第一模型、所述第二模型,确定训练参数。
在一些可能的实现方式中,所述训练参数包括训练轮次、优化器类型、学习率更新策略、模型参数初始化方式和训练策略中的一种或多种。
第三方面,本申请提供一种计算设备集群,所述计算设备集群包括至少一台计算设备。至少一台计算设备包括至少一个处理器和至少一个存储器。所述处理器、所述存储器进行相互的通信。所述至少一个处理器用于执行所述至少一个存储器中存储的指令,以使得计算设备集群执行如第一方面或第一方面的任一种实现方式所述的方法。
第四方面,本申请提供一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,所述指令指示计算设备或计算设备集群执行上述第一方面或第一方面的任一种实现方式所述的方法。
第五方面,本申请提供了一种包含指令的计算机程序产品,当其在计算设备或计算设备集群上运行时,使得计算设备或计算设备集群执行上述第一方面或第一方面的任一种实现方式所述的方法。本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。
为了更清楚地说明本申请实施例的技术方法,下面将对实施例中所需使用的附图作以简单地介绍。
图1为本申请实施例提供的一种模型训练系统的系统架构图;
图2为本申请实施例提供的一种模型选择界面的示意图;
图3为本申请实施例提供的一种训练参数配置界面的示意图;
图4为本申请实施例提供的一种模型训练系统的部署环境示意图;
图5为本申请实施例提供的一种模型训练方法的流程图;
图6为本申请实施例提供的一种模型训练方法的流程示意图;
图7为本申请实施例提供的一种模型训练进程示意图;
图8为本申请实施例提供的一种计算设备集群的结构示意图。
本申请实施例中的术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。
为了便于理解本申请实施例,首先,对本申请涉及的部分术语进行解释说明。
AI任务是指利用AI模型的功能完成的任务。AI任务可以分为自然语言处理(natural language processing,NLP)任务、计算机视觉(computer vision,CV)任务、自动语音识别(automatic speech recognition,ASR)任务等不同类型。
AI模型是指通过机器学习等AI技术开发和训练得到的用于实现特定AI任务的算法模型。本申请实施例中也将AI模型简称为“模型”。不同类型的AI任务可以通过各自对应的AI模型完成。例如,语言翻译或者智能问答等NLP任务可以通过transformer模型完成。又例如,图像分类或目标检测等CV任务可以通过卷积神经网络(convolutional neural network,CNN)模型完成。
由于一些AI模型在特定的AI任务上取得较好的效果,很多研究者尝试将这些AI模型引入其他AI任务。例如transformer模型在很多NLP任务中均获得了显著的效果,很多研究者尝试将transformer模型引入CV任务。将transformer模型引入CV任务时,通常需要将图像进行序列化。以图像分类任务为例,先对输入图像进行分块,提取每个分块的特征表示以实现输入图像的序列化,然后将分块的特征表示输入到transformer模型对输入图像分类。
然而,NLP任务中词表包括的词的数量是有限的,CV任务中输入图像的模式通常有无限可能,如此导致transformer模型引入到CV任务等其他任务中时,需要在较大的数据集上进行预训练,进而导致整个训练过程需要花费较长时间。例如一些AI模型引入到其他AI任务可能需要训练数千天,难以满足业务需求。
有鉴于此,本申请实施例提供了一种AI模型训练方法。该方法可以由模型训练系统执行。模型训练系统可以是用于训练AI模型的软件系统,该软件系统可以部署在计算设备集 群中。计算设备集群通过运行上述软件系统的程序代码,从而执行本申请实施例的AI模型训练方法。在一些实施例中,模型训练系统也可以是硬件系统,该硬件系统运行时执行本申请实施例的AI模型训练方法。
具体地,模型训练系统可以确定待训练的第一模型和待训练的第二模型,其中,第一模型和第二模型为异构的两种AI模型,也即第一模型和第二模型为不同结构类型的AI模型,例如一个AI模型可以为transformer模型,另一个AI模型可以为CNN模型。由于异构的两种AI模型的性能通常是互补的。为此,模型训练系统可以通过互补学习的方式对第一模型和第二模型进行联合训练。
其中,模型训练系统对第一模型和第二模型进行联合训练的过程为,将训练数据输入第一模型和第二模型,获得第一模型对训练数据进行推理后的第一输出,以及第二模型对训练数据进行推理后的第二输出,然后以第二输出为第一模型的监督信号,结合第一输出迭代更新第一模型的模型参数,直至第一模型满足第一预设条件。
在该方法中,模型训练系统利用第二模型对训练数据进行推理后的第二输出为第一模型的训练加入额外的监督信号,促进第一模型向与该第一模型互补的第二模型学习,使得第一模型可以加速收敛,无需在大规模的数据集上进行预训练,大幅缩短了训练时间,提高了第一模型训练的效率,满足了业务的需求。
为了使得本申请的技术方案更加清楚、易于理解,下面先对模型训练系统的架构进行介绍。
参见图1所示的模型训练系统的架构图,模型训练系统100包括交互单元102和训练单元104。其中,交互单元102可以通过浏览器(browser)或客户端(client)与用户交互。
具体地,交互单元102用于确定待训练的第一模型和待训练的第二模型,该第一模型和第二模型为异构的两种AI模型。训练单元104用于将训练数据输入第一模型和第二模型,获得第一模型进行推理后的第一输出,以及第二模型对训练数据进行推理后的第二输出,然后以第二输出为所述第一模型的监督信号,结合所述第一输出迭代更新所述第一模型的模型参数,直至所述第一模型满足第一预设条件。进一步地,训练单元104还用于以所述第一输出为第二模型的监督信号,结合第二输出迭代更新第二模型的模型参数,直至第二模型满足第二预设条件。
在一些可能的实现方式中,交互单元102可以通过浏览器或客户端与用户交互,从而确定待训练的第一模型和待训练的第二模型。例如,交互单元102可以根据用户通过用户界面的选择,确定待训练的第一模型和待训练的第二模型。又例如,交互单元102可以根据用户设置的AI任务的类型自动地确定待训练的第一模型和待训练的第二模型。
下面以交互单元102根据用户通过用户界面的选择,确定待训练的第一模型和待训练的第二模型进行示例说明。其中,用户界面包括模型选择界面。该模型选择界面可以是图形用户界面(graphical user interface,GUI)或者是命令用户界面(command user interface,CUI)。本实施例以模型选择界面为GUI进行示例说明。交互单元102可以响应于客户端或浏览器的请求,向客户端或浏览器提供模型选择界面的页面元素,以使客户端或浏览器根据该页面元素渲染模型选择界面。
参见图2所示的模型选择界面的示意图,模型选择界面200承载有模型选择控件,例如是第一模型选择控件202和第二模型选择控件204。当模型选择控件被触发时,可以在该界面中向用户呈现可选择模型列表,可选择模型列表中包括至少一种模型,每种模型包括至少一个实例,用户可以从可选择模型列表中选择一种模型的一个实例作为第一模型,以及从可选择模型列表中选择另一种模型的一个实例作为第二模型。在该示例中,第一模型可以为transformer模型的一个实例,第二模型可以为CNN模型的一个实例。模型选择界面200还承载有确定控件206和取消控件208。其中,确定控件206用于确定用户的模型选择操作,取消控件208用于取消用户的模型选择操作。
其中,可选择模型列表中模型的实例可以是模型训练系统内置的,也可以是用户预先上传的。在一些可能的实现方式中,用户也可以实时上传AI模型的实例,以便于交互单元102将用户上传的多个AI模型的实例确定为待训练的第一模型和待训练的第二模型。具体地,可选择模型列表中可以包括自定义选项,当用户选择该选项,可以触发上传AI模型的实例的流程,交互单元102可以将用户实时上传的AI模型的实例确定为待训练的第一模型和待训练的第二模型。
训练单元104在进行模型训练时,可以按照训练参数进行模型训练。该训练参数可以是用户手动配置的,也可以是训练单元104自动确定或自适应调整的。训练参数可以包括训练轮次、优化器类型、学习率更新策略、模型参数初始化方式和训练策略中的一种或多种。
训练轮次是指训练期数或训练轮数。一期也即一个时期(epoch)是指训练集中的每个样本参与模型训练一次。优化器是指用于更新模型参数的算法,基于此,优化器类型可以包括梯度下降、动量优化、自适应学习率优化等不同类型。其中,梯度下降还可以进一步细分为批量梯度下降(batch gradient descent,BGD)、随机梯度下降(stochastic gradient descent)或者小批量梯度下降(mini-batch gradient descent)。动量优化包括标准动量(momentum)优化如或者是牛顿加速梯度(nesterov accelerated gradient,NAG)优化。自适应学习率优化包括AdaGrad、RMSProp、Adam或者AdaDelta等等。
学习率是指模型参数更新幅度的控制因子,通常可以设置为0.01、0.001或者是0.0001等等。学习率更新策略可以为分段常数衰减、指数衰减、余弦衰减或者倒数衰减等。模型参数初始化方式包括使用预训练模型进行模型参数初始化,在一些实施例中,模型参数初始化方式还可以包括高斯分布初始化等。训练策略是指训练模型采用的策略。训练策略可以分为单阶段训练策略和多阶段训练策略。当优化器类型为梯度下降时,训练策略还可以包括各个训练阶段的梯度回流方式。
下面以用户通过用户界面手动配置训练参数进行示例说明。用户界面包括训练参数配置界面,该训练参数配置界面可以是GUI,也可以是CUI。本申请实施例以训练参数配置界面为GUI进行示例说明。
参见图3所示的训练参数配置界面的示意图,训练参数配置界面300承载有训练轮次配置控件302、优化器类型配置控件304、学习率更新策略配置控件306、模型初始化方式配置控件308和训练策略配置控件310。
其中,训练轮次配置控件302支持用户通过直接输入数值的方式或者加减数值的方式 配置训练轮次,例如用户可以通过训练轮次配置控件302直接输入数值100,从而配置训练轮次为100轮。优化器类型配置控件304、学习率更新策略配置控件306、模型参数初始化方式配置控件308和训练策略配置控件310支持用户通过下拉选择方式进行相应的参数配置。在该示例中,用户可以配置优化器类型为Adam,学习率更新策略为指数衰减,模型参数初始化方式为根据预训练模型进行初始化,训练策略为三阶段训练策略。
训练参数配置界面300还承载有确定控件312和取消控件314。当确定控件312被触发时,浏览器或客户端可以将用户配置的上述训练参数提交至模型训练系统100。当取消控件314被触发时,则用户对训练参数的配置被取消。
需要说明的是,图3是以用户对第一模型和第二模型统一配置训练参数进行示例说明,在一些可能的实现方式中,用户也可以对第一模型和第二模型分别配置训练参数。
在一些可能的实现方式中,训练参数也可以根据用户设置的AI任务的类型以及第一模型和第二模型自动确定。具体地,模型训练系统100可以维护AI任务的类型、第一模型、第二模型的映射关系,当模型训练系统100确定AI任务的任务类型以及待训练的第一模型、待训练的第二模型后,可以基于上述映射关系确定训练参数。
图1仅仅是模型训练系统100的一种示意性划分方式,在本申请实施例其他可能的实现方式中,模型训练系统100还可以按照其他方式进行划分。本申请实施例对此不作限定。
模型训练系统100可以具有多种部署方式。在一些可能的实现方式中,模型训练系统100可以集中部署在云环境、边缘环境或终端,也可以分布式部署在云环境、边缘环境或终端中的不同环境。
云环境指示云服务提供商拥有的,用于提供计算、存储、通信资源的中心计算设备集群。中心计算设备集群包括一个或多个中心计算设备,该中心计算设备例如可以是中心服务器。边缘环境指示在地理位置上距离端设备(即端侧设备)较近的,用于提供计算、存储、通信资源的边缘计算设备集群。边缘计算设备集群包括一个或多个边缘计算设备。该边缘计算设备例如可以是边缘服务器或者计算盒子等。终端包括但不限于台式机、笔记本电脑、智能手机等用户终端。
下面以模型训练系统100集中式地部署在云环境,向用户提供训练AI模型的云服务进行示例说明。
参见图4所示的模型训练系统100的部署环境示意图,如图4所示,模型训练系统100集中部署在云环境中,例如是部署在云环境的一个中心服务器中。如此,模型训练系统100可以提供训练AI模型的云服务,以供用户使用。
具体地,部署在云环境中的模型训练系统100可以对外提供云服务的应用程序编程接口(application programming interface,API)。浏览器或客户端可以调用该API,以进入模型选择界面200。用户可以通过该模型选择界面200选择AI模型的实例,模型训练系统100根据用户的选择,确定待训练的第一模型和待训练的第二模型。其中,用户提交选择的AI模型的实例后,浏览器或客户端可以进入训练参数配置界面300。用户可以通过训练参数配置界面300承载的控件配置训练轮次、优化器类型、学习率更新策略、模型参数初始化方式和训练策略等训练参数。模型训练系统100根据用户配置的上述训练参数,对第一模型和第二模型进行联合训练。
具体地,云环境中的模型训练系统100可以根据将训练数据输入第一模型和第二模型,获得第一模型对训练数据进行推理后的第一输出,以及第二模型对训练数据进行推理后的第二输出,以第二输出为第一模型的监督信号,结合第一输出迭代更新第一模型的模型参数,直至第一模型满足第一预设条件。其中,模型训练系统100在迭代更新第一模型的模型参数时,可以根据配置的训练参数,采用梯度下降法迭代更新第一模型的参数,以及采用指数衰减方式更新学习率。
接下来,从模型训练系统100的角度,对本申请实施例提供的AI模型训练方法进行介绍。
参见图5所示的AI模型训练方法的流程图,该方法包括:
S502:模型训练系统100确定待训练的第一模型和待训练的第二模型。
第一模型和第二模型为异构的两种AI模型。其中,异构是指AI模型的结构类型不同。AI模型通常是由多个神经元(cell)连接形成,因此,AI模型的结构类型可以根据神经元的结构类型确定。当神经元的结构类型不同时,基于该神经元形成的AI模型的结构类型可以是不同的。
在一些可能的实现方式中,异构的两种AI模型的性能可以是互补的。其中,性能可以通过不同指标衡量。该指标例如可以是精度、推理时间等。异构的两种AI模型的性能互补可以是第一模型在第一指标的表现优于第二模型在第一指标的表现,第二模型在第二指标的表现优于第一模型。例如,低参数量的AI模型的推理时间短于高参数量的AI模型的推理时间,高参数量的AI模型的精度高于低参数量的AI模型的精度。
基于此,第一模型和第二模型可以是transformer模型、CNN模型、循环神经网络(recurrent neural network,RNN)模型中的不同模型。例如,第一模型可以是transformer模型,第二模型可以是CNN模型。
模型训练系统100可以通过多种方式确定待训练的第一模型和待训练的第二模型。下面分别对不同实现方式进行介绍。
第一种实现方式,模型训练系统100根据用户通过用户界面的选择,确定待训练的第一模型和待训练的第二模型。具体地,模型训练系统100可以响应于客户端或浏览器的请求,返回页面元素,以使客户端或浏览器基于该页面元素,向用户呈现模型选择界面200。用户可以通过模型选择界面200选择不同结构类型的AI模型的实例,例如选择transformer模型、CNN模型、循环神经网络(recurrent neural network,RNN)模型中任意两种模型的实例,模型训练系统100可以将用户选择的模型的实例确定为待训练的第一模型和待训练的第二模型。在一些实施例中,模型训练系统100可以确定transformer模型的实例为待训练的第一模型,确定CNN模型的实例为待训练的第二模型。
第二种实现方式,模型训练系统100获取任务类型,根据任务类型与AI模型的映射关系,确定与该任务类型匹配的模型为待训练的第一模型和待训练的第二模型。例如,任务类型为图像分类时,模型训练系统100可以根据任务类型与AI模型的映射关系,确定该图像分类任务匹配的AI模型包括transformer模型和CNN模型,因而可以将transformer模型的实例和CNN模型的实例确定为待训练的第一模型和待训练的第二模型。
其中,与任务类型匹配的AI模型包括多个。模型训练系统100可以根据业务需求从多个与任务类型匹配的AI模型中确定待训练的第一模型和待训练的第二模型。业务需求可以包括对模型性能的需求、模型大小的需求等。其中,模型性能可以通过精度、推理时间、推理速度等指标表征。
例如,模型训练系统100可以根据对模型大小的需求确定16层transformer模型如16层视觉转换器基础模型(vision transformer base/16,ViT-B/16)为待训练的第一模型,确定50层残差网络模型(residual network-50,ResNet-50)为待训练的第二模型。当然,模型训练系统100也可以基于用户的选择,确定ViT-B/16为待训练的第一模型,以及确定ResNet-50为待训练的第二模型。其中,ResNet为CNN模型的一个示例,ResNet通过短路连接解决深度CNN模型中梯度消失或梯度爆炸的问题。
S504:模型训练系统100将训练数据输入第一模型和第二模型,获得第一模型对训练数据进行推理后的第一输出,以及第二模型对训练数据进行推理后的第二输出。
具体地,模型训练系统100可以获取训练数据集,然后将训练数据集中的训练数据分成若干批,例如是按照预先设置的批大小(batch size)分成若干批,接着将训练数据分批输入第一模型和第二模型,获得第一模型对训练数据进行推理后的第一输出和第二模型对训练数据进行推理后的第二输出。
其中,第一模型对训练数据进行推理后的第一输出包括第一模型从训练数据中提取的第一特征和基于第一特征推理的第一概率分布中的至少一个。类似地,第二模型对训练数据进行推理后的第二输出包括第二模型从训练数据中提取的第二特征和基于第二特征推理的第二概率分布中的至少一个。
需要说明的是,模型训练系统100也可以不对训练数据集中的训练数据进行分批,而是将训练数据集中的训练数据逐个输入第一模型和第二模型,获得第一模型对训练数据进行推理后的第一输出,以及第二模型对训练数据进行推理后的第二输出。也即,模型训练系统100可以采用离线训练方式,或者在线训练方式训练AI模型,本申请实施例对此不作限定。
S506:模型训练系统100以第二输出为第一模型的监督信号,结合第一输出迭代更新第一模型的模型参数,直至第一模型满足第一预设条件。
在本实施例中,第二模型对训练数据进行推理后的第二输出可以作为第一模型的监督信号,用于监督训练第一模型。模型训练系统100监督训练第一模型的过程可以为,模型训练系统100根据第一模型从训练数据中提取的第一特征和第二模型从训练数据中提取的第二特征确定第一对比损失,以及根据第一概率分布和第二概率分布确定第一相对熵损失,然后根据第一对比损失和第一相对熵损失中的至少一个,迭代更新第一模型的模型参数。
对比损失主要用于表征同一训练数据经过不同AI模型进行降维处理(例如是特征提取)后产生的损失。对比损失可以根据第一模型对训练数据进行特征提取得到的第一特征以及第二模型对训练数据进行特征提取得到第二特征得到,例如是根据第一特征和第二特征的距离得到。
在一些实施例中，模型训练系统100可以通过公式（1）确定第一模型和第二模型的对比损失，其一种标准形式为：$L_{cont} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log P(z_i^1, z_i^2) + \log P(z_i^2, z_i^1)\right]$ （1）
其中，$L_{cont}$表征对比损失，N为一个批次中训练数据的数量，z表征特征，例如$z_i^1$和$z_i^2$分别表征第一模型对第i个训练数据进行特征提取所得的第一特征和第二模型对第i个训练数据进行特征提取所得的第二特征。类似地，$z_j^1$和$z_j^2$分别表征第一模型对第j个训练数据进行特征提取所得的第一特征和第二模型对第j个训练数据进行特征提取所得的第二特征。i和j可以取值为1至N的任意整数（包括1和N两个端点）。特征可以通过特征向量或特征矩阵等形式表征。$P$表征特征的相似度的逻辑回归（softmax）概率。其中，特征的相似度可以通过特征向量的距离表征，例如通过特征向量的余弦距离进行表征。另外，第一特征和第二特征的相似度的逻辑回归概率与第二特征和第一特征的相似度的逻辑回归概率通常是不相等的，例如$P(z_i^1, z_i^2)\neq P(z_i^2, z_i^1)$。
基于上述公式(1)可知,当一个批次中的训练数据比较相似,而第一特征和第二特征在特征空间的距离较大,则说明当前的模型性能不好,可以加大对比损失。类似地,当一个批次中的训练数据完全不相似,而第一特征和第二特征在特征空间的距离反而较小,则对比损失会加大。通过设置上述对比损失,可以实现在提取到不合适的特征时进行惩罚,反向促进AI模型(例如是第一模型)提取合适的特征。
相对熵损失,也称作KL散度(Kullback-Leibler divergence,KLD),是对不同概率分布的非对称性的度量,主要用于表征不同模型对同一训练数据进行预测产生的损失。对于图像分类任务而言,相对熵损失可以是同一训练数据经过第一模型和第二模型的分类器进行分类所产生的损失。相对熵可以根据不同概率分布确定。下面以图像分类任务中的相对熵损失进行示例说明。
在一些实施例中，模型训练系统100可以通过公式（2）确定第一模型和第二模型的相对熵损失，其一种标准形式为：$L_{KL} = \frac{1}{N}\sum_{i=1}^{N} D_{KL}\big(P_1(i)\,\|\,P_2(i)\big)$ （2）
其中，N表示一个批次中训练数据的数量，$P_1(i)$表示第一模型对第i个训练数据分类的概率分布，也即第一概率分布，$P_2(i)$表示第二模型对第i个训练数据分类的概率分布，也即第二概率分布。其中，$P_1(i)$、$P_2(i)$为离散的。
基于上述公式（2）可知，$P_1(i)>P_2(i)$时，相对熵损失将会增加，并且，$P_1(i)$越大，相对熵损失增加幅度越大。通过设置上述相对熵损失，可以实现在第二模型分类到不准确的类别时进行惩罚。
需要说明的是，相对熵损失（KL散度）不具有对称性，从分布$P_1$到分布$P_2$的相对熵损失通常并不等于从分布$P_2$到分布$P_1$的相对熵损失，也即$D_{KL}(P_1\|P_2)\neq D_{KL}(P_2\|P_1)$。
模型训练系统100可以根据第一特征和第二特征,结合上述公式(1)确定第一对比损失,以及根据第一概率分布和第二概率分布,结合上述公式(2)确定第一相对熵损失,然后模型训练系统100可以根据第一对比损失的梯度和第一相对熵损失的梯度中的至少一个,迭代更新第一模型的模型参数。该模型参数是指通过训练数据能够学习到的参数。例如,第一模型为深度学习模型时,第一模型的模型参数可以包括神经元的权重w和偏置b。
其中,模型训练系统100在迭代更新第一模型的模型参数时,可以根据预先配置的训练参数迭代更新第一模型的模型参数。其中,训练参数包括优化器类型,该优化器类型可 以是梯度下降、动量优化等不同类型,梯度下降进一步包括批量梯度下降、随机梯度下降或者小批量梯度下降。模型训练系统100可以根据预先配置的优化器类型,迭代更新第一模型的模型参数,例如模型训练系统100可以通过梯度下降迭代更新第一模型的模型参数。
预先配置的训练参数还包括学习率更新策略,相应地,模型管理系统100可以根据该学习率更新策略更新学习率,例如可以按照指数衰减更新学习率。当模型管理系统100迭代更新第一模型的模型参数时,可以根据梯度(具体是第一对比损失的梯度和第一相对熵损失的梯度中的至少一个)和更新后的学习率迭代更新第一模型的模型参数。
第一预设条件可以根据业务需求进行设置。例如,第一预设条件可以设置为第一模型的性能达到预设性能。其中,性能可以通过精度、推理时间等指标进行衡量。又例如,第一预设条件可以设置为第一模型的损失值趋于收敛,或者第一模型的损失值小于预设值。
第一模型的性能可以通过第一模型在测试数据集的表现确定。训练AI模型的数据集包括训练数据集、验证数据集和测试数据集。其中,训练数据集用于学习模型参数,如学习第一模型中神经元的权重,进一步地,还可以学习第一模型中神经元的偏置。验证数据集用于选择第一模型的超参数,如模型层数、神经元数量、学习率等。测试数据集用于评价模型的性能。测试数据集既不参与确定模型参数的过程,也不参与选择超参数的过程。为了保障评价准确度,测试数据集中的测试数据通常使用一次。基于此,模型训练系统100可以将测试数据集中的测试数据输入第一模型,根据第一模型对测试数据进行推理后的输出以及测试数据的标签对第一模型的性能进行评价。如果已训练的第一模型的性能达到预设性能,则模型训练系统100可以输出已训练的第一模型,否则模型训练系统100可以退回模型选择或训练参数配置以进行模型优化,直至已训练的第一模型的性能达到预设性能。
S508:模型训练系统100以第一输出为第二模型的监督信号,结合第二输出迭代更新第二模型的模型参数,直至第二模型满足第二预设条件。
具体地,模型训练系统100还可以根据第一输出对第二模型进行监督训练。其中,第一输出包括第一模型从训练数据中提取的第一特征和基于第一特征推理的第一概率分布中的至少一个。第二输出包括第二模型从训练数据中提取的第二特征和基于第二特征推理的第二概率分布中的至少一个。模型训练系统100可以根据第二输出和第一输出确定第二对比损失,根据第二概率分布和第一概率分布确定第二相对熵损失。接着,模型训练系统100可以根据第二对比损失和第二相对熵损失中的至少一个,迭代更新第二模型的模型参数,直至第二模型满足第二预设条件。
其中,第二对比损失的计算方式可以参考上述公式(1),第二相对熵损失的计算方式可以参考上述公式(2),本实施例在此不再赘述。
进一步地,模型训练系统100在迭代更新第二模型的模型参数时,可以按照预先设置的针对第二模型的训练参数,迭代更新第二模型的模型参数。其中,训练参数可以包括优化器类型,模型训练系统100可以按照该优化器类型,迭代更新第二模型的参数。例如,优化器类型可以为随机梯度下降,则模型训练系统100可以通过随机梯度下降方式,迭代更新第二模型的参数。训练参数还可以包括学习率更新策略。模型训练系统100可以按照学习率更新策略更新学习率,相应地,模型训练系统100可以基于第二对比损失的梯度和第二相对熵损失的梯度中的至少一个,以及更新后的学习率,迭代更新第二模型的模型参 数。
与第一预设条件类似,第二预设条件可以根据业务需求进行设置。例如,第二预设条件可以设置为第二模型的性能达到预设性能。其中,性能可以通过精度、推理时间等指标进行衡量。又例如,第二预设条件可以设置为第二模型的损失值趋于收敛,或者第二模型的损失值小于预设值。
需要说明的是,上述S508为可选步骤,执行本申请实施例的AI模型训练方法也可以不执行上述S508。
基于上述内容描述，本申请实施例提供了一种AI模型训练方法。该方法中，模型训练系统100利用第二模型对训练数据进行推理后的第二输出为第一模型的训练加入额外的监督信号，促进第一模型向与该第一模型互补的第二模型学习，使得第一模型可以加速收敛，由此可以实现针对性地训练，无需在大规模的数据集上进行预训练，大幅缩短了训练时间，提高了第一模型训练的效率，满足了业务的需求。
并且,模型训练系统100还可以利用第一模型对训练数据进行推理后的第一输出为第二模型的训练加入额外的监督信号,促进第二模型向与该第二模型互补的第一模型学习,使得第二模型可以加速收敛,无需在大规模的数据集上进行预训练,大幅缩短了训练时间,提高了第二模型训练的效率,满足了业务的需求。
随着训练过程的进行,第一模型的性能、第二模型的性能可以发生变化。例如,第一模型的性能可以由低于第二模型的性能变化为高于第二模型的性能,如果仍基于第一对比损失的梯度和第一相对熵损失的梯度,迭代更新第一模型的模型参数,可以导致第二模型对第一模型产生误导,影响第一模型的训练。基于此,模型训练系统100还可以采用梯度受限回流方式,迭代更新第一模型的模型参数。
其中,梯度受限回流是指对部分梯度进行回流,以迭代更新模型参数。例如,回流对比损失的梯度,或者回流相对熵损失的梯度,以迭代更新模型参数。在实际应用时,模型训练系统100可以在第一模型的性能显著高于第二模型的性能时,采用梯度受限回流方式,迭代更新第一模型的模型参数。
其中，第一模型的性能如精度也可以通过第一模型的监督损失表征。监督损失也称作交叉熵损失（cross entropy loss）。监督损失可以通过公式（3）计算得到，其标准形式为：$L_{sup} = -\frac{1}{n}\sum_{i=1}^{n} p(x_i)\log q(x_i)$ （3）
其中，$x_i$表示第i个训练数据，n表示一批训练数据中训练数据的数据量，$p(x_i)$表示真实概率分布，$q(x_i)$表示预测概率分布，例如是第一模型推理的第一概率分布。通常情况下，第一模型的监督损失越小，表明第一模型的推理结果与标签越接近，第一模型的精度越高；第一模型的监督损失越大，表明第一模型的推理结果与标签越不接近，第一模型的精度越低。
基于此,模型训练系统100训练第一模型的过程可以包括如下步骤:
S5062:模型训练系统100根据所述第一对比损失的梯度和所述第一相对熵损失的梯度迭代更新所述第一模型的模型参数。
具体地,在训练的起始阶段,第一模型和第二模型的性能互补,模型训练系统100可以将第一对比损失的梯度以及第一相对熵损失的梯度均进行回流,以便基于第一对比损失的梯度和第一相对熵损失的梯度,迭代更新第一模型的模型参数。
S5064:当所述第一模型的监督损失与所述第二模型的监督损失的差值小于第一预设阈值时,模型训练系统100停止执行根据所述第一对比损失的梯度迭代更新所述第一模型的模型参数。
具体地,模型训练系统100可以参照上述公式(3)分别确定第一模型的监督损失和第二模型的监督损失。当第一模型的监督损失和第二模型的监督损失的差值小于第一预设阈值时,表明第一模型的监督损失显著小于第二模型的监督损失。基于此,模型训练系统100可以触发梯度受限回流,例如仅回流第一相对熵损失的梯度。模型训练系统100停止执行根据第一对比损失的梯度迭代更新所述第一模型的模型参数。
需要说明的是,S5064是以模型训练系统100回流第一相对熵损失的梯度进行示例说明,在本申请实施例其他可能的实现方式中,模型训练系统100也可以回流第一对比损失的梯度,以便根据第一对比损失的梯度迭代更新第一模型的模型参数。
类似地,当模型训练系统100还利用第一模型的输出作为监督信号,训练第二模型时,也可以在满足梯度受限回流的触发条件时,仅回流部分梯度(例如是第二相对熵损失的梯度),根据部分梯度迭代更新第二模型的模型参数。
通过设置上述损失,模型训练系统100不仅可以使得AI模型学习到如何区分不同的类别,还能够使AI模型参考另一个AI模型的概率估计来提升自身泛化能力。而且,通过对梯度回流进行限制,例如限制对比损失的梯度回流至第一模型,或者限制相对熵损失的梯度回流至第二模型,可以避免性能较差的模型对性能较好的模型产生误导,导致模型朝着错误的方向收敛,由此可以促进第一模型和第二模型高效收敛。
此外,由于模型结构的差异,训练第一模型的分支和训练第二模型的分支的学习速度、数据利用效率及表征能力的上限可以是不同的,模型训练系统100可以调整训练策略,实现在训练的不同阶段,由训练效果好(如收敛快、精度高)的分支充当老师的角色(即提供监督信号的角色),促进训练效果较差的分支进行学习。在训练效果接近的情况下,两个分支可以互为合作伙伴,相互学习。随着训练的递进,分支的角色可以发生互换。也即异构的两个AI模型在训练过程中可以自主地选择相应角色达到互相促进的目的,提高了训练效率。
接下来,结合一个实例对本申请实施例的AI模型训练方法进行说明。
参见图6所示的AI模型训练方法的流程示意图,如图6所示,模型训练系统100获取多个待训练的AI模型,具体为CNN模型的实例和transformer模型的实例,也称作CNN branch(分支)和transformer branch。其中,每个分支包括骨干网络和分类器,骨干网络用于从输入图像中提取特征向量,分类器用于基于特征向量进行图像分类。
在一个训练阶段,CNN模型和transformer模型可以互为老师模型(例如是提供监督信号的模型)和学生模型(例如是基于监督信号进行学习的模型)。模型训练系统100根据CNN模型从训练数据(例如是输入图像)中提取的特征和transformer模型从训练数据中提 取的特征可以确定对比损失。模型训练系统100根据CNN模型对输入图像分类所得的各类别的概率分布以及transformer模型自身对输入图像分类所得的各类别的概率分布可以确定相对熵损失。如图6中指向transformer分支的虚线所示,对比损失的梯度可以回流至transformer模型,模型训练系统100可以根据该对比损失的梯度更新transformer模型的模型参数。如图6中指向CNN分支的虚线所示,相对熵损失(KL散度)的梯度可以回流至CNN模型,模型训练系统100可以根据相对熵损失的梯度更新CNN模型的模型参数。
在另一个训练阶段,当transformer模型的监督损失(通常采用交叉熵损失)远小于CNN模型的监督损失时,对比损失的梯度可以停止回流至transformer模型。模型训练系统100可以根据相对熵损失的梯度更新CNN模型的模型参数。当transformer模型的监督损失远大于CNN模型的监督损失时,相对熵损失的梯度可以停止回流至CNN模型。模型训练系统100可以根据该对比损失的梯度更新transformer模型的模型参数。
需要说明的是,对比损失通常是对偶的,因此,对比损失的梯度也可以回流至第二模型,例如回流至CNN模型。也即,模型训练系统100可以根据对比损失的梯度以及相对熵损失的梯度更新CNN模型的模型参数。
本申请实施例还在多个数据集上对通过本申请的AI模型训练方法训练得到的AI模型的性能进行验证,具体参见下表:
表1模型在多个数据集上的精度
其中,表1示出了本申请实施例联合训练所输出的两个模型,以及独立训练的两个模型在ImageNet、Real、V2、CIFAR 10、CIFAR100、Flowers以及stanford Cars等数据集上的精度。需要说明的是,该精度是模型预测输入图像类别时排序第一的类别的精度,即Top1的精度。由表1可知,本申请实施例联合训练的CNN模型(例如是表1中联合训练的ResNet-50)和联合训练的transformer模型(例如是表1中联合训练的ViT-Base)的精度,相较于独立训练的CNN模型(例如是表1中的ResNet-50)和独立训练的transformer模型(例如是表1中的ViT-Base)有所提升,尤其是在V2数据集上提升较为显著。
此外,相比于独立训练的ResNet-50、ViT-Base,联合训练的ResNet-50和ViT-Base能够更快收敛。参见图7所示的各模型的训练进程示意图,联合训练的ResNet-50和ViT-Base通常在20轮以内可以趋于收敛,而独立训练的ResNet-50和ViT-Base通常在20轮以后趋于收敛。由此可见,异构的AI模型互相学习联合训练,可以有效地缩短训练时间,提高训 练效率。
在该示例中,模型训练系统100加入类似对比学习方式的学习目标,利用从一个AI模型学习到的特征为另一个AI模型的训练加入额外的监督信号,AI模型可以基于该监督信号针对性地更新模型参数,因此,可以实现加速收敛。由于两个异构的AI模型天然的异构特点以及表征能力上的差异,可以有效防止对比学习中常见的模型坍塌和退化解等问题的发生。
并且,该方法无需人为设计启发式的结构算子促进模型收敛、提升模型性能,尽可能保持模型原有结构的特征,减少结构细节上的修改,从而提升模型训练系统100的弹性、扩展性,具有较好的通用性。
上文结合图1至图7对本申请实施例提供的AI模型训练方法进行了详细介绍,下面将结合附图对本申请实施例提供的模型训练系统进行介绍。
参见图1所示的模型训练系统100的结构示意图,该系统100包括:
交互单元102,用于确定待训练的第一模型和待训练的第二模型,所述第一模型和所述第二模型为异构的两种AI模型;
训练单元104,用于将训练数据输入所述第一模型和所述第二模型,获得所述第一模型对所述训练数据进行推理后的第一输出,以及所述第二模型对所述训练数据进行推理后的第二输出;
所述训练单元104,还用于以所述第二输出为所述第一模型的监督信号,结合所述第一输出迭代更新所述第一模型的模型参数,直至所述第一模型满足第一预设条件。
在一些可能的实现方式中,所述训练单元104还用于:
以所述第一输出为所述第二模型的监督信号,结合所述第二输出迭代更新所述第二模型的模型参数,直至所述第二模型满足第二预设条件。
在一些可能的实现方式中,所述第一输出包括所述第一模型从所述训练数据中提取的第一特征和基于所述第一特征推理的第一概率分布中的至少一个,所述第二输出包括所述第二模型从所述训练数据中提取的第二特征和基于所述第二特征推理的第二概率分布中的至少一个;
所述训练单元104具体用于:
根据所述第一特征和所述第二特征确定第一对比损失,和/或者,根据所述第一概率分布和所述第二概率分布确定第一相对熵损失;
根据所述第一对比损失和所述第一相对熵损失中的至少一个,迭代更新所述第一模型的模型参数。
在一些可能的实现方式中,所述训练单元104具体用于:
根据所述第一对比损失的梯度和所述第一相对熵损失的梯度迭代更新所述第一模型的模型参数;
当所述第一模型的监督损失与所述第二模型的监督损失的差值小于第一预设阈值时,停止执行根据所述第一对比损失的梯度迭代更新所述第一模型的模型参数。
在一些可能的实现方式中,所述第一模型为转换器模型,所述第二模型为卷积神经网 络模型。
在一些可能的实现方式中,所述交互单元102具体用于:
根据用户通过用户界面的选择,确定所述待训练的第一模型和所述待训练的第二模型;或者,
根据用户设置的AI任务的类型确定所述待训练的第一模型和所述待训练的第二模型。
在一些可能的实现方式中,所述交互单元102还用于:
接收用户通过用户界面配置的训练参数;和/或,
根据用户设置的AI任务的类型以及所述第一模型、所述第二模型,确定训练参数。
在一些可能的实现方式中,所述训练参数包括训练轮次、优化器类型、学习率更新策略、模型参数初始化方式和训练策略中的一种或多种。
根据本申请实施例的模型训练系统100可对应于执行本申请实施例中描述的方法,并且模型训练系统100的各个模块/单元的上述和其它操作和/或功能分别为了实现图5所示实施例中的各个方法的相应流程,为了简洁,在此不再赘述。
本申请实施例还提供一种计算设备集群。该计算设备集群可以是云环境、边缘环境或者终端设备中的至少一台计算设备形成的计算设备集群。该计算设备集群具体用于实现如图1所示实施例中模型训练系统100的功能。
图8提供了一种计算设备集群的结构示意图,如图8所示,计算设备集群80包括多台计算设备800,计算设备800包括总线801、处理器802、通信接口803和存储器804。处理器802、存储器804和通信接口803之间通过总线801通信。
总线801可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图8中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
处理器802可以为中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)、微处理器(micro processor,MP)或者数字信号处理器(digital signal processor,DSP)等处理器中的任意一种或多种。
通信接口803用于与外部通信。例如,通信接口803可以用于接收用户通过用户界面选择的第一模型和第二模型,接收用户配置的训练参数,或者通信接口803用于输出已训练的第一模型和/或已训练的第二模型等等。
存储器804可以包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM)。存储器804还可以包括非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM),快闪存储器,硬盘驱动器(hard disk drive,HDD)或固态驱动器(solid state drive,SSD)。
存储器804中存储有可执行代码,处理器802执行该可执行代码以执行前述AI模型训练方法。
具体地,在实现图1所示实施例的情况下,且图1实施例中所描述的模型训练系统100的各部分如交互单元102、训练单元104的功能为通过软件实现的情况下,执行图1中功 能所需的软件或程序代码可以存储在计算设备集群80中的至少一个存储器804中。至少一个处理器802执行存储器804中存储的程序代码,以使得计算设备集群800执行前述AI模型训练方法。
本申请实施例还提供了一种计算机可读存储介质。所述计算机可读存储介质可以是计算设备能够存储的任何可用介质或者是包含一个或多个可用介质的数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘)等。该计算机可读存储介质包括指令,所述指令指示计算设备执行上述AI模型训练方法。
本申请实施例还提供了一种计算机程序产品。所述计算机程序产品包括一个或多个计算机指令。在计算设备上加载和执行所述计算机指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算设备或数据中心进行传输。所述计算机程序产品可以为一个软件安装包,在需要使用前述AI模型训练方法的任一方法的情况下,可以下载该计算机程序产品并在计算设备上执行该计算机程序产品。
上述各个附图对应的流程或结构的描述各有侧重,某个流程或结构中没有详述的部分,可以参见其他流程或结构的相关描述。
Claims (19)
- 一种人工智能AI模型训练方法,其特征在于,所述方法包括:确定待训练的第一模型和待训练的第二模型,所述第一模型和所述第二模型为异构的两种AI模型;将训练数据输入所述第一模型和所述第二模型,获得所述第一模型对所述训练数据进行推理后的第一输出,以及所述第二模型对所述训练数据进行推理后的第二输出;以所述第二输出为所述第一模型的监督信号,结合所述第一输出迭代更新所述第一模型的模型参数,直至所述第一模型满足第一预设条件。
- 根据权利要求1所述的方法,其特征在于,所述方法还包括:以所述第一输出为所述第二模型的监督信号,结合所述第二输出迭代更新所述第二模型的模型参数,直至所述第二模型满足第二预设条件。
- 根据权利要求1或2所述的方法,其特征在于,所述第一输出包括所述第一模型从所述训练数据中提取的第一特征和基于所述第一特征推理的第一概率分布中的至少一个,所述第二输出包括所述第二模型从所述训练数据中提取的第二特征和基于所述第二特征推理的第二概率分布中的至少一个;所述以所述第二输出为所述第一模型的监督信号,结合所述第一输出迭代更新所述第一模型的模型参数,包括:根据所述第一特征和所述第二特征确定第一对比损失,和/或者,根据所述第一概率分布和所述第二概率分布确定第一相对熵损失;根据所述第一对比损失和所述第一相对熵损失中的至少一个,迭代更新所述第一模型的模型参数。
- 根据权利要求3所述的方法,其特征在于,所述根据所述第一对比损失和所述第一相对熵损失中的至少一个,迭代更新所述第一模型的模型参数,包括:根据所述第一对比损失的梯度和所述第一相对熵损失的梯度迭代更新所述第一模型的模型参数;当所述第一模型的监督损失与所述第二模型的监督损失的差值小于第一预设阈值时,停止执行根据所述第一对比损失的梯度迭代更新所述第一模型的模型参数。
- 根据权利要求1至4任一项所述的方法,其特征在于,所述第一模型为转换器模型,所述第二模型为卷积神经网络模型。
- 根据权利要求1至5任一项所述的方法,其特征在于,所述确定待训练的第一模型和待训练的第二模型,包括:根据用户通过用户界面的选择,确定所述待训练的第一模型和所述待训练的第二模型;或者,根据用户设置的AI任务的类型确定所述待训练的第一模型和所述待训练的第二模型。
- 根据权利要求1至6任一项所述的方法,其特征在于,所述方法还包括:接收用户通过用户界面配置的训练参数;和/或,根据用户设置的AI任务的类型以及所述第一模型、所述第二模型,确定训练参数。
- 根据权利要求7所述的方法,其特征在于,所述训练参数包括训练轮次、优化器类 型、学习率更新策略、模型参数初始化方式和训练策略中的一种或多种。
- 一种模型训练系统,其特征在于,所述系统包括:交互单元,用于确定待训练的第一模型和待训练的第二模型,所述第一模型和所述第二模型为异构的两种AI模型;训练单元,用于将训练数据输入所述第一模型和所述第二模型,获得所述第一模型对所述训练数据进行推理后的第一输出,以及所述第二模型对所述训练数据进行推理后的第二输出;所述训练单元,还用于以所述第二输出为所述第一模型的监督信号,结合所述第一输出迭代更新所述第一模型的模型参数,直至所述第一模型满足第一预设条件。
- 根据权利要求9所述的系统,其特征在于,所述训练单元还用于:以所述第一输出为所述第二模型的监督信号,结合所述第二输出迭代更新所述第二模型的模型参数,直至所述第二模型满足第二预设条件。
- 根据权利要求9或10所述的系统,其特征在于,所述第一输出包括所述第一模型从所述训练数据中提取的第一特征和基于所述第一特征推理的第一概率分布中的至少一个,所述第二输出包括所述第二模型从所述训练数据中提取的第二特征和基于所述第二特征推理的第二概率分布中的至少一个;所述训练单元具体用于:根据所述第一特征和所述第二特征确定第一对比损失,和/或者,根据所述第一概率分布和所述第二概率分布确定第一相对熵损失;根据所述第一对比损失和所述第一相对熵损失中的至少一个,迭代更新所述第一模型的模型参数。
- 根据权利要求11所述的系统,其特征在于,所述训练单元具体用于:根据所述第一对比损失的梯度和所述第一相对熵损失的梯度迭代更新所述第一模型的模型参数;当所述第一模型的监督损失与所述第二模型的监督损失的差值小于第一预设阈值时,停止执行根据所述第一对比损失的梯度迭代更新所述第一模型的模型参数。
- 根据权利要求9至12任一项所述的系统,其特征在于,所述第一模型为转换器模型,所述第二模型为卷积神经网络模型。
- 根据权利要求9至13任一项所述的系统,其特征在于,所述交互单元具体用于:根据用户通过用户界面的选择,确定所述待训练的第一模型和所述待训练的第二模型;或者,根据用户设置的AI任务的类型确定所述待训练的第一模型和所述待训练的第二模型。
- 根据权利要求9至14任一项所述的系统,其特征在于,所述交互单元还用于:接收用户通过用户界面配置的训练参数;和/或,根据用户设置的AI任务的类型以及所述第一模型、所述第二模型,确定训练参数。
- 根据权利要求15所述的系统,其特征在于,所述训练参数包括训练轮次、优化器类型、学习率更新策略、模型参数初始化方式和训练策略中的一种或多种。
- 一种计算设备集群,其特征在于,所述计算设备集群包括至少一台计算设备,所 述至少一台计算设备包括至少一个处理器和至少一个存储器,所述至少一个存储器中存储有计算机可读指令,所述至少一个处理器执行所述计算机可读指令,使得所述计算设备集群执行如权利要求1至8任一项所述的方法。
- 一种计算机可读存储介质,其特征在于,包括计算机可读指令,当所述计算机可读指令在计算设备或计算设备集群上运行时,使得所述计算设备或计算设备集群执行如权利要求1至8任一项所述的方法。
- 一种计算机程序产品,其特征在于,包括计算机可读指令,当所述计算机可读指令在计算设备或计算设备集群上运行时,使得所述计算设备或计算设备集群执行如权利要求1至8任一项所述的方法。
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP22860266.0A EP4386585A1 (en) | 2021-08-24 | 2022-08-11 | Model training method and system, cluster, and medium |
US18/586,050 US20240202535A1 (en) | 2021-08-24 | 2024-02-23 | Model training method, system, cluster, and medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110977567.4 | 2021-08-24 | ||
CN202110977567.4A CN115718869A (zh) | 2021-08-24 | 2021-08-24 | 模型训练方法、系统、集群及介质 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/586,050 Continuation US20240202535A1 (en) | 2021-08-24 | 2024-02-23 | Model training method, system, cluster, and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023024920A1 true WO2023024920A1 (zh) | 2023-03-02 |
Family
ID=85254771
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/111734 WO2023024920A1 (zh) | 2021-08-24 | 2022-08-11 | 模型训练方法、系统、集群及介质 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20240202535A1 (zh) |
EP (1) | EP4386585A1 (zh) |
CN (1) | CN115718869A (zh) |
WO (1) | WO2023024920A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117253044A (zh) * | 2023-10-16 | 2023-12-19 | 安徽农业大学 | 一种基于半监督交互学习的农田遥感图像分割方法 |
CN117371111A (zh) * | 2023-11-21 | 2024-01-09 | 石家庄铁道大学 | 基于深度神经网络和数值仿真的tbm卡机预测系统及方法 |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170243114A1 (en) * | 2016-02-19 | 2017-08-24 | International Business Machines Corporation | Adaptation of model for recognition processing |
CN110162766A (zh) * | 2018-02-12 | 2019-08-23 | 深圳市腾讯计算机系统有限公司 | 词向量更新方法和装置 |
CN111291755A (zh) * | 2020-02-13 | 2020-06-16 | 腾讯科技(深圳)有限公司 | 对象检测模型训练及对象检测方法、装置、计算机设备和存储介质 |
CN111522958A (zh) * | 2020-05-28 | 2020-08-11 | 泰康保险集团股份有限公司 | 文本分类方法和装置 |
CN111569429A (zh) * | 2020-05-11 | 2020-08-25 | 超参数科技(深圳)有限公司 | 模型训练方法、模型使用方法、计算机设备及存储介质 |
CN111967343A (zh) * | 2020-07-27 | 2020-11-20 | 广东工业大学 | 基于简单神经网络和极端梯度提升模型融合的检测方法 |
CN112733550A (zh) * | 2020-12-31 | 2021-04-30 | 科大讯飞股份有限公司 | 基于知识蒸馏的语言模型训练方法、文本分类方法及装置 |
- 2021-08-24: CN CN202110977567.4A patent/CN115718869A/zh active Pending
- 2022-08-11: WO PCT/CN2022/111734 patent/WO2023024920A1/zh active Application Filing
- 2022-08-11: EP EP22860266.0A patent/EP4386585A1/en active Pending
- 2024-02-23: US US18/586,050 patent/US20240202535A1/en active Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170243114A1 (en) * | 2016-02-19 | 2017-08-24 | International Business Machines Corporation | Adaptation of model for recognition processing |
CN110162766A (zh) * | 2018-02-12 | 2019-08-23 | 深圳市腾讯计算机系统有限公司 | 词向量更新方法和装置 |
CN111291755A (zh) * | 2020-02-13 | 2020-06-16 | 腾讯科技(深圳)有限公司 | 对象检测模型训练及对象检测方法、装置、计算机设备和存储介质 |
CN111569429A (zh) * | 2020-05-11 | 2020-08-25 | 超参数科技(深圳)有限公司 | 模型训练方法、模型使用方法、计算机设备及存储介质 |
CN111522958A (zh) * | 2020-05-28 | 2020-08-11 | 泰康保险集团股份有限公司 | 文本分类方法和装置 |
CN111967343A (zh) * | 2020-07-27 | 2020-11-20 | 广东工业大学 | 基于简单神经网络和极端梯度提升模型融合的检测方法 |
CN112733550A (zh) * | 2020-12-31 | 2021-04-30 | 科大讯飞股份有限公司 | 基于知识蒸馏的语言模型训练方法、文本分类方法及装置 |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117253044A (zh) * | 2023-10-16 | 2023-12-19 | 安徽农业大学 | 一种基于半监督交互学习的农田遥感图像分割方法 |
CN117253044B (zh) * | 2023-10-16 | 2024-05-24 | 安徽农业大学 | 一种基于半监督交互学习的农田遥感图像分割方法 |
CN117371111A (zh) * | 2023-11-21 | 2024-01-09 | 石家庄铁道大学 | 基于深度神经网络和数值仿真的tbm卡机预测系统及方法 |
CN117371111B (zh) * | 2023-11-21 | 2024-06-18 | 石家庄铁道大学 | 基于深度神经网络和数值仿真的tbm卡机预测系统及方法 |
Also Published As
Publication number | Publication date |
---|---|
CN115718869A (zh) | 2023-02-28 |
US20240202535A1 (en) | 2024-06-20 |
EP4386585A1 (en) | 2024-06-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2023024920A1 (zh) | 模型训练方法、系统、集群及介质 | |
CN109408731B (zh) | 一种多目标推荐方法、多目标推荐模型生成方法以及装置 | |
CN113516250B (zh) | 一种联邦学习方法、装置、设备以及存储介质 | |
US20190279088A1 (en) | Training method, apparatus, chip, and system for neural network model | |
WO2021254114A1 (zh) | 构建多任务学习模型的方法、装置、电子设备及存储介质 | |
WO2020224297A1 (zh) | 计算机执行的集成模型的确定方法及装置 | |
CN114616577A (zh) | 识别最佳权重以改进机器学习技术中的预测准确度 | |
CN111989696A (zh) | 具有顺序学习任务的域中的可扩展持续学习的神经网络 | |
CN113361680A (zh) | 一种神经网络架构搜索方法、装置、设备及介质 | |
JP7498248B2 (ja) | コンテンツ推薦とソートモデルトレーニング方法、装置、機器、記憶媒体及びコンピュータプログラム | |
CN116523079A (zh) | 一种基于强化学习联邦学习优化方法及系统 | |
WO2020026741A1 (ja) | 情報処理方法、情報処理装置及び情報処理プログラム | |
US11941867B2 (en) | Neural network training using the soft nearest neighbor loss | |
WO2023051369A1 (zh) | 一种神经网络的获取方法、数据处理方法以及相关设备 | |
CN113743991A (zh) | 生命周期价值预测方法及装置 | |
CN114972877B (zh) | 一种图像分类模型训练方法、装置及电子设备 | |
US20240152809A1 (en) | Efficient machine learning model architecture selection | |
CN113657538B (zh) | 模型训练、数据分类方法、装置、设备、存储介质及产品 | |
CN114091652A (zh) | 脉冲神经网络模型训练方法、处理芯片以及电子设备 | |
US20240119266A1 (en) | Method for Constructing AI Integrated Model, and AI Integrated Model Inference Method and Apparatus | |
WO2024114659A1 (zh) | 一种摘要生成方法及其相关设备 | |
US20220269835A1 (en) | Resource prediction system for executing machine learning models | |
CN110489435B (zh) | 基于人工智能的数据处理方法、装置、及电子设备 | |
CN115599918B (zh) | 一种基于图增强的互学习文本分类方法及系统 | |
WO2023160309A1 (zh) | 一种联邦学习方法以及相关设备 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | WWE | Wipo information: entry into national phase | Ref document number: 2022860266, Country of ref document: EP
 | NENP | Non-entry into the national phase | Ref country code: DE
 | ENP | Entry into the national phase | Ref document number: 2022860266, Country of ref document: EP, Effective date: 20240313