CN114611672A - Model training method, face recognition method and device - Google Patents


Info

Publication number
CN114611672A
CN114611672A (application CN202210261833.8A)
Authority
CN
China
Prior art keywords
model
assistant
training
initial
teacher
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210261833.8A
Other languages
Chinese (zh)
Inventor
许剑清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210261833.8A priority Critical patent/CN114611672A/en
Publication of CN114611672A publication Critical patent/CN114611672A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the present application discloses a model training method, a face recognition method and a face recognition device. The model training method comprises: training an initial teacher model to obtain a trained teacher model; performing first knowledge distillation training on an initial assistant model according to the teacher model to obtain a trained assistant model, wherein the parameter quantity of the initial assistant model is smaller than that of the initial teacher model; and performing second knowledge distillation training on an initial student model according to the teacher model and the assistant model to obtain a trained student model, wherein the parameter quantity of the initial student model is smaller than that of the initial assistant model. By combining the teacher model and the assistant model, a student model with higher accuracy can be obtained.

Description

Model training method, face recognition method and device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a model training method, a face recognition method and a face recognition device.
Background
Knowledge distillation is a method that guides the training of a student network by constructing a teacher model and a student model, and transferring the learned behavior of the teacher model, which has a complex network structure, a large parameter quantity, and strong learning ability, to the student model, which has a simple network structure, a small parameter quantity, and weak learning ability.
However, when the difference in parameter quantity between the teacher model and the student model is large, forcibly fitting the features of the student model to those of the teacher model causes severe overfitting of the student model and degrades the knowledge distillation effect.
Disclosure of Invention
In order to solve the technical problem, embodiments of the present application provide a model training method, a face recognition method and a face recognition device.
According to an aspect of an embodiment of the present application, there is provided a model training method, including: training the initial teacher model to obtain a trained teacher model; performing first knowledge distillation training on the initial assistant model according to the teacher model to obtain a trained assistant model; wherein the parameter quantity of the initial assistant model is less than the parameter quantity of the initial teacher model; performing second knowledge distillation training on the initial student model according to the teacher model and the assistant model to obtain a trained student model; wherein the parameter quantity of the initial student model is smaller than the parameter quantity of the initial assistant model.
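The three training stages of the method above can be outlined as follows. This is an illustrative sketch only: `train_teacher` and `distill` are hypothetical placeholders standing in for the procedures the embodiments describe, not an implementation from the patent.

```python
def train_pipeline(initial_teacher, initial_assistant, initial_student,
                   train_teacher, distill):
    """Three-stage training from the method claim:
    (1) train the teacher model;
    (2) first knowledge distillation: teacher -> assistant;
    (3) second knowledge distillation: (teacher + assistant) -> student.
    The parameter quantities satisfy student < assistant < teacher."""
    teacher = train_teacher(initial_teacher)
    assistant = distill(initial_assistant, [teacher])           # first distillation
    student = distill(initial_student, [teacher, assistant])    # second distillation
    return teacher, assistant, student
```

A trivial usage, with string stand-ins for models, shows the dataflow: the student's distillation is supervised by two models, the assistant's by one.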
According to an aspect of an embodiment of the present application, there is provided a face recognition method, including: collecting a face image to be recognized; inputting a face image to be recognized into a face recognition model; the face recognition model is obtained by training according to the model training method, and corresponds to a student model in the model training method; and acquiring a face recognition result output by the face recognition model.
According to an aspect of an embodiment of the present application, there is provided a model training apparatus including: the teacher model training module is configured to train the initial teacher model to obtain a trained teacher model; the assistant model training module is configured to perform first knowledge distillation training on the initial assistant model according to the teacher model to obtain a trained assistant model; wherein the parameter quantity of the initial assistant model is less than the parameter quantity of the initial teacher model; the student model training module is configured to perform second knowledge distillation training on the initial student model according to the teacher model and the assistant model to obtain a trained student model; wherein the parameter quantity of the initial student model is smaller than the parameter quantity of the initial assistant model.
According to an aspect of an embodiment of the present application, there is provided a face recognition apparatus, including: the image acquisition module is configured to acquire a face image to be recognized; the recognition module is configured to input a face image to be recognized into the face recognition model; the face recognition model is obtained by training according to the model training method, and corresponds to a student model in the model training method; and the result acquisition module is configured to acquire a face recognition result output by the face recognition model.
According to an aspect of an embodiment of the present application, there is provided an electronic device including: a memory storing computer readable instructions; and a processor reading the computer readable instructions stored in the memory to perform the model training method or the face recognition method as described above.
According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored thereon computer-readable instructions, which, when executed by a processor of a computer, cause the computer to perform a model training method or a face recognition method as described above.
According to an aspect of an embodiment of the present application, there is also provided a computer program product, including a computer program, which when executed by a processor, implements the steps in the model training method or the face recognition method as described above.
In the technical solution provided by the embodiments of the present application, the initial assistant model first undergoes first knowledge distillation training by the teacher model to obtain the trained assistant model; the teacher model and the assistant model are then combined to perform second knowledge distillation training on the initial student model to obtain the trained student model. The assistant model, whose parameter quantity is closer to that of the student model than the teacher model's, provides a parameter transition, which avoids the severe overfitting of the student model caused by an excessive difference in parameter quantity between the teacher model and the student model. Meanwhile, the second knowledge distillation training process is still supervised by the teacher model with its stronger learning ability, so that the finally obtained student model has higher accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 is a schematic illustration of an implementation environment to which the present application relates;
FIG. 2 is a flow chart of a model training method shown in an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of a training teacher model shown in an exemplary embodiment of the present application;
FIG. 4 is a flow chart illustrating an initial student model training process in accordance with an exemplary embodiment of the present application;
FIG. 5 is a flowchart of step S420 of FIG. 4 in an exemplary embodiment;
FIG. 6 is a schematic diagram of training an initial student model shown in an exemplary embodiment of the present application;
FIG. 7 is a flow chart of step S220 of FIG. 2 in an exemplary embodiment;
FIG. 8 is a schematic illustration of a first knowledge distillation training shown in an exemplary embodiment of the present application;
FIG. 9 is a flowchart of step S230 of FIG. 2 in an exemplary embodiment;
FIG. 10 is a schematic illustration of a second knowledge distillation training shown in an exemplary embodiment of the present application;
FIG. 11 is a flowchart of step S920 of FIG. 9 in an exemplary embodiment;
FIG. 12 is a flow chart illustrating a face recognition method according to an exemplary embodiment of the present application;
FIG. 13 is a schematic diagram illustrating the acquisition of a face recognition model in accordance with an exemplary embodiment of the present application;
FIG. 14 is a schematic diagram of a model deployment shown in an exemplary embodiment of the present application;
FIG. 15 is a block diagram of a model training apparatus shown in an exemplary embodiment of the present application;
FIG. 16 is a block diagram of a face recognition apparatus shown in an exemplary embodiment of the present application;
FIG. 17 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Reference to "a plurality" in this application means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The following briefly describes possible techniques that may be used in embodiments of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Computer Vision (CV) technology is a science that studies how to make machines "see"; it uses cameras and computers, in place of human eyes, to identify and measure targets and perform other machine vision tasks, and further performs image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
In recent years, the development of machine learning has shown three major trends: model structures are increasingly complex, model hierarchies keep deepening, and massive data sets keep growing. However, as the demand for edge computing with neural network models on mobile terminals and embedded platforms rises, and because the resources of edge-side computing platforms are limited, neural network models are required to be as small as possible while remaining computationally efficient. Therefore, various model compression methods have been proposed in recent years in academia and industry, such as model pruning, low-rank decomposition, and low-precision quantization of model parameters. Hinton et al. proposed the knowledge distillation method in "Distilling the Knowledge in a Neural Network" in 2014: a large-scale neural network model trained on a large-scale training data set is regarded as the teacher model (teacher network), a small-scale neural network model is regarded as the student model (student network), and the student model is trained by combining the probability distribution vectors output by the teacher model with the manual labels of the training set, so as to overcome the difficulty of training a small-scale neural network model on a large-scale data set. After training, the student model can achieve test results on classification tasks close to or even exceeding those of the teacher model. This method can be regarded as a means of knowledge transfer: knowledge is transferred from the teacher model to the student model through training. After the transfer is completed, the large and heavy teacher model is replaced by the small and agile student model for the application task, which greatly facilitates the deployment of neural network models on edge-side platforms.
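The distillation objective described above, combining the teacher's softened probability distribution with the training set's hard labels, can be sketched as follows. This is a minimal NumPy illustration of the Hinton-style loss, with generic parameter names (`T`, `alpha`) not taken from the patent:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; a higher T yields softer probabilities."""
    z = np.asarray(z, dtype=float) / T
    z -= z.max()                     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    """Weighted sum of (a) cross-entropy between the student's and the
    teacher's temperature-softened distributions and (b) ordinary
    cross-entropy against the hard label; the T**2 factor keeps the soft
    term's gradient scale comparable across temperatures."""
    p_teacher = softmax(teacher_logits, T)
    p_student_soft = softmax(student_logits, T)
    soft_ce = -np.sum(p_teacher * np.log(p_student_soft + 1e-12))
    hard_ce = -np.log(softmax(student_logits)[label] + 1e-12)
    return alpha * (T ** 2) * soft_ce + (1 - alpha) * hard_ce
```

A student whose logits match the teacher's incurs a lower loss than one whose logits disagree, which is what drives the knowledge transfer.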
However, when the difference in parameter quantity between the teacher model and the student model is large, directly using the teacher model to distill the student model forcibly fits the features of the student model to those of the teacher model, which causes severe overfitting and can even make the student model's training collapse.
Based on this, in order to ensure the training effect of the student model when the parameter difference between the teacher model and the student model is large, embodiments of the present application provide a model training method, a face recognition device, a computer-readable storage medium, and an electronic device.
The model training method provided in the embodiments of the present application is described below, where fig. 1 is a schematic diagram of an implementation environment related to the model training method in the present application. As shown in fig. 1, the implementation environment includes a terminal 110 and a server 120, and the terminal 110 and the server 120 may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
The terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal 110 may be generally referred to as one of a plurality of terminals, and the embodiment is only illustrated by the terminal 110. Those skilled in the art will appreciate that the number of terminals described above may be greater or fewer. For example, the number of the terminals may be only one, or several tens or hundreds, or more, and in this case, the implementation environment of the model training method further includes other terminals. The number of terminals and the type of the device are not limited in the embodiments of the present application.
The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The server 120 is used for providing background services for the application programs executed by the terminal 110.
Optionally, the wireless or wired networks described above use standard communication techniques and/or protocols. The Network is typically the Internet, but may be any Network including, but not limited to, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wireline or wireless Network, a private Network, or any combination of virtual private networks. In some embodiments, data exchanged over a network is represented using techniques and/or formats including Hypertext Mark-up Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
Optionally, the server 120 undertakes primary model training work and the terminal 110 undertakes secondary model training work; alternatively, the server 120 undertakes the secondary model training work and the terminal 110 undertakes the primary model training work; alternatively, the server 120 or the terminal 110, respectively, may undertake the model training work separately.
Illustratively, the terminal 110 sends a model training instruction to the server 120; the server 120 receives the instruction and trains the initial teacher model to obtain a trained teacher model; the server 120 performs first knowledge distillation training on the initial assistant model according to the teacher model to obtain a trained assistant model; and the server 120 performs second knowledge distillation training on the initial student model according to the teacher model and the assistant model to obtain a trained student model.
The model training method provided by the embodiment of the application can be widely applied to training of the model based on the neural network.
Referring to fig. 2, fig. 2 is a flowchart illustrating a model training method according to an exemplary embodiment of the present application. The model training method can be applied to the implementation environment shown in fig. 1, and is specifically executed by the server 120 in the implementation environment. It should be understood that the method may be applied to other exemplary implementation environments and is specifically executed by devices in other implementation environments, and the embodiment does not limit the implementation environment to which the method is applied.
The following describes the model training method proposed in the embodiment of the present application in detail with a server as a specific implementation subject.
As shown in fig. 2, in an exemplary embodiment, the model training method at least includes steps S210 to S230, which are described in detail as follows:
Step S210, training the initial teacher model to obtain a trained teacher model.
The initial teacher model is a teacher training model whose model parameters have been initialized; the teacher training model is the teacher model during training.
In one embodiment, referring to FIG. 3, FIG. 3 is a schematic diagram of training an initial teacher model. As shown in FIG. 3, training data is acquired and input into the initial teacher model for data feature extraction to obtain training data features; classification is then performed on the training data features to obtain the teacher model prediction result; loss information of the initial teacher model is then calculated from the prediction result, and whether the teacher model training completion condition is met is judged from this loss information. If not, the model parameters of the initial teacher model are updated and the previous steps are repeated for iterative training; if so, the current initial teacher model is taken as the trained teacher model.
In the embodiment of the application, according to an application scenario of the model, a corresponding training data set can be obtained, the training data set includes a plurality of marked training data, the training data in the training data set is input into the initial teacher model in batches according to batch size, and a teacher model with a complex structure can be obtained through training.
It should be noted that the type of training data differs according to the application scenario of the model. Optionally, the training data is image data labeled with a category, such as a dog picture labeled as a dog, an automobile picture labeled as an automobile, or an advertisement picture labeled with the characters it contains. Optionally, the training data is text data labeled with corresponding content, such as English text labeled with its corresponding Chinese translation, German text labeled with its corresponding English translation, or Chinese text labeled with its corresponding English translation. Optionally, the training data is labeled audio data, video data, and so on, which is not limited in the embodiments of the present application.
And inputting the training data into the feature extraction network of the initial teacher model to obtain the training data features output by the feature extraction network of the initial teacher model. The feature extraction network may include convolution calculation processing, pooling calculation processing, nonlinear activation function calculation processing, and the like, and of course, other processing steps may also be available, which is not limited in this embodiment of the present application. For example, after performing convolution calculation on input training data, the feature extraction network introduces a nonlinear factor by using an activation function, extracts a nonlinear feature, and obtains a training data feature of the training data, where the activation function may be, for example, a ReLu function. The server inputs training data in the training data set into an initial teacher model, extracts training data features corresponding to the training data through a feature extraction network in the initial teacher model to obtain feature vectors of the training data corresponding to the training data, and then calculates the training data features through a classification module in the initial teacher model to obtain training data prediction results corresponding to the training data.
Illustratively, during the training of the initial teacher model, cluster center vectors are obtained from the training data features, and different cluster center vectors correspond to different training data categories. For example, the teacher model includes N cluster center vectors corresponding one-to-one to the N categories contained in the training data used to train the teacher model; that is, each category of the training data corresponds to one cluster center vector, which represents the overall features of the training data of that category. The cluster center vector for each category can be determined from the feature vectors of its training data: when a category contains a single training sample, the server takes the feature vector of that sample as the category's cluster center vector; when it contains multiple samples, the server takes the mean of their feature vectors as the cluster center vector. The cluster center vectors obtained for the training data set can be represented as a d × m vector matrix, where d is the feature dimension of a single training sample and m is the number of training data categories.
Therefore, the training data type corresponding to the current training data is obtained by performing matrix operation on the training data features corresponding to the current training data and the clustering center vector existing in the initial teacher model, and the training data type is used as the initial teacher model prediction result.
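The cluster-center construction and the matrix operation described above can be illustrated as follows. The patent does not specify the similarity measure used in the matrix operation, so cosine similarity against the d × m center matrix is assumed here:

```python
import numpy as np

def class_centers(features, labels, num_classes):
    """Build the d x m cluster-center matrix: the center of each class is
    the mean of the feature vectors of that class's training samples."""
    d = features.shape[1]
    centers = np.zeros((d, num_classes))
    for c in range(num_classes):
        centers[:, c] = features[labels == c].mean(axis=0)
    return centers

def predict(feature, centers):
    """Classify one feature vector by matrix multiplication with the
    (column-normalised) center matrix and return the most similar class."""
    f = feature / np.linalg.norm(feature)
    c = centers / np.linalg.norm(centers, axis=0, keepdims=True)
    return int(np.argmax(f @ c))
```

With two well-separated toy classes, a new sample near a class's samples is assigned to that class's cluster center.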
Further, a corresponding sample label is obtained from the annotation of each training sample, and the sample label is used as the target output of the initial teacher model. The difference between the prediction result of the initial teacher model and the sample label is calculated, and whether the training end condition is met is judged from this difference. The training end condition may be that the number of training iterations reaches a preset iteration threshold, or that the loss value of the teacher model training loss function falls below a preset loss threshold. Illustratively, the server constructs the teacher model training loss function from the difference between the prediction result and the sample label, reversely updates the parameters of the initial teacher model with a gradient descent algorithm based on this loss function, and returns to the step of acquiring training data from the training data set to continue training until the training end condition is met. The teacher model training loss function may be a Softmax function, a Contrastive Loss function, a Triplet Loss function, a Center Loss function, a Margin-based function, or the like, and the gradient descent algorithm may be SGD (stochastic gradient descent), stochastic gradient descent with a momentum term, an adaptive gradient algorithm, adaptive moment estimation (Adam), or the like.
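The reverse parameter update named above can be illustrated with the momentum variant of SGD. This is a generic sketch of the update rule, not the patent's training code; the learning rate and momentum values are arbitrary:

```python
def sgd_momentum_step(w, grad, v, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update: v <- momentum*v - lr*grad; w <- w + v."""
    v = momentum * v - lr * grad
    return w + v, v

# Minimise the toy loss L(w) = w**2 (gradient 2*w) starting from w = 5.0;
# repeated updates drive w toward the minimiser at 0.
w, v = 5.0, 0.0
for _ in range(100):
    w, v = sgd_momentum_step(w, 2 * w, v)
```

The same step applies elementwise to weight tensors in a real network, with `grad` supplied by backpropagation.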
Step S220, performing first knowledge distillation training on the initial assistant model according to the teacher model to obtain a trained assistant model; wherein the parameter quantity of the initial assistant model is smaller than the parameter quantity of the initial teacher model.
It should be noted that the initial assistant model refers to an assistant training model with initialized model parameters, and the assistant training model is the assistant model during training. The parameter quantity of a model refers to the number of parameters the model contains: the smaller the parameter quantity, the smaller the volume of the model, and conversely, the larger the parameter quantity, the larger the volume of the model. The parameter quantity can be obtained by counting the undetermined parameters of the model. Taking a convolutional neural network as an example, assume the size of a convolution kernel is k × k, the number of input channels is i, the number of output channels is o, and the number of offset (bias) terms is y (with one bias per output channel, y = o); the number of parameters required for one convolution operation is then: k × k × i × o + y.
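The parameter-count rule can be expressed as a small helper (a sketch; the assumption that the offset term y equals one bias per output channel, i.e. y = o, is ours):

```python
def conv_param_count(k, c_in, c_out, bias=True):
    """Number of learnable parameters in one k x k convolution layer.

    Weights: k * k * c_in * c_out.  Offset term: assumed to be one bias
    value per output channel, so y = c_out when bias is used.
    """
    params = k * k * c_in * c_out
    if bias:
        params += c_out
    return params
```

For example, a 3 × 3 convolution from 64 to 128 channels has 3·3·64·128 + 128 = 73,856 parameters, which is how the "volume" of teacher, assistant and student models can be compared.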
The first knowledge distillation training aims to guide the assistant model by using the output of the teacher model, and the output distribution of the assistant model is the same as or similar to that of the teacher model.
In the embodiment of the application, the assistant model is trained by using the teacher model, so that the output of the trained assistant model is the same as or similar to the output of the teacher model; that is, through the first knowledge distillation training, the data processing capacity of the assistant model approaches that of the teacher model, while the structure of the assistant model remains simpler than that of the teacher model. For example, after the first knowledge distillation training, when training data A is input into the teacher model and the assistant model respectively, the teacher model and the assistant model both output result A.
Step S230, performing second knowledge distillation training on the initial student model according to the teacher model and the assistant model to obtain a trained student model; wherein the parameter quantity of the initial student model is smaller than the parameter quantity of the initial assistant model.
It should be noted that the initial student model refers to a student training model with initialized model parameters, and the student training model is a student model during training.
The second knowledge distillation training aims to guide the student models by using the output of the teacher model and the output of the assistant model, so that the output distribution of the student models is the same as or similar to the output distribution of the teacher model and the output distribution of the assistant model.
In the embodiment of the application, the teacher model and the assistant model are used to train the student model, so that the output of the trained student model is the same as or similar to the outputs of the teacher model and the assistant model; that is, through the second knowledge distillation training, the data processing capacity of the student model approaches that of the teacher model and the assistant model, while the structure of the student model remains simpler than both. For example, after the second knowledge distillation training, when training data A is input to the teacher model, the assistant model and the student model respectively, all three models output result A.
When the parameter quantity difference between the teacher model and the student model is large, the large parameter quantity of the teacher model easily causes the student model to overfit during learning, which reduces the accuracy of the student model. Therefore, to improve the knowledge distillation effect of the student model, the model training method provided by the embodiment of the application introduces an assistant model whose parameter quantity is smaller than that of the teacher model but larger than that of the student model. The assistant model, being closer in parameter quantity to the student model than the teacher model is, provides a parameter-quantity transition and avoids the serious overfitting of the student model caused by an excessive parameter quantity gap between the teacher model and the student model. Meanwhile, since the data processing capacity of the teacher model is higher than that of the assistant model, the initial student model can be trained by combining the teacher model and the assistant model, so that the obtained student model has better data processing capacity.
In some embodiments, the student model of the present application is a classification model for identifying the category of input data. For example, the student model is an image classification model: an image to be classified is input into the image classification model, and the image classification model outputs the class to which the image belongs. As another example, the student model is a text emotion classification model: a sentence to be classified is input into the model, and the model outputs the category to which the sentence belongs. On the basis of the above embodiment, the trained teacher model has more accurate clustering center vectors, so the initial student model is first trained according to these clustering center vectors in the teacher model before the second knowledge distillation training is performed, as detailed in the following steps.
Referring to fig. 4, in the above exemplary embodiment, before performing the second knowledge distillation training on the initial student models according to the teacher model and the assistant models, steps S410 to S430 are further included, which are described in detail as follows:
and S410, inputting the training data into the uninitialized student model to obtain a feature extraction result output by the uninitialized student model.
It should be noted that the training data carries a sample label, and the sample label is used to indicate the real category information to which the training data belongs. An uninitialized student model refers to an initial student training model with uninitialized model parameters, which is the initial student model at the time of training. The feature extraction result refers to a result obtained by vectorizing input training data, and the uninitialized student model obtains the feature extraction result of the training data by vectorizing the training data.
Vectorization processing refers to representing the training data to be classified by feature vectors. The way of vectorization is determined by the actual data type of the training data. When the training data are images, the server obtains the feature vectors of the training data through the feature extraction module of the uninitialized student model. When the training data is text, the server obtains the feature vectors of the training data through at least one of a word vector (Word2Vec) algorithm, a word embedding algorithm and a one-hot algorithm.
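One-hot encoding, the simplest of the text-vectorization options listed above, can be sketched as follows (an illustrative helper, not from the patent):

```python
def one_hot(token, vocabulary):
    """Return the one-hot feature vector for `token` over `vocabulary`.

    Exactly one component is 1.0 (at the token's index); all others are 0.0.
    """
    vec = [0.0] * len(vocabulary)
    vec[vocabulary.index(token)] = 1.0
    return vec
```

For example, with vocabulary ["a", "b", "c"], the token "b" becomes [0.0, 1.0, 0.0]. Word2Vec and word-embedding approaches replace this sparse vector with a dense learned one.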
And step S420, obtaining an uninitialized student model prediction result according to the feature extraction result and the plurality of clustering center vectors.
It should be noted that the uninitialized student model prediction result is used to characterize a result obtained by classifying the input training data by the uninitialized student model.
And the server transfers the plurality of clustering center vectors in the classification module of the teacher model to the classification module of the uninitialized student model, and obtains the training data category corresponding to the feature extraction result according to the plurality of clustering center vectors in the classification module of the uninitialized student model and the feature extraction result, so as to obtain the prediction result of the uninitialized student model.
The plurality of clustering center vectors in the classification module of the teacher model are obtained in the training process of the teacher model, and the classification module of the teacher model is used for predicting the class of the training data.
And step S430, adjusting model parameters of the uninitialized student model according to the difference between the prediction result of the uninitialized student model and the sample label of the training data to obtain an initial student model.
And obtaining a corresponding sample label through labeling of each training data, using the sample label of the training data as the target output of the uninitialized student model, calculating the difference between the prediction result of the uninitialized student model and the sample label of the training data, and judging whether the training end condition of the uninitialized student model is met according to this difference. The training end condition of the uninitialized student model comprises any one of the following: the number of training iterations reaches an iteration threshold; the loss function converges; the loss function value is smaller than a loss threshold. The iteration threshold and the loss threshold are set empirically or flexibly adjusted according to the application scenario, which is not limited in the embodiment of the present application.
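The three alternative end conditions can be sketched as a single check (illustrative only; the convergence tolerance is an assumption, since the text does not define "converges" numerically):

```python
def training_finished(iteration, loss_history, max_iters, loss_threshold,
                      converge_tol=1e-9):
    """Return True when any one of the three end conditions from the text holds:
    (1) iteration count reached, (2) loss below threshold, (3) loss converged."""
    if iteration >= max_iters:                      # condition 1
        return True
    if loss_history and loss_history[-1] < loss_threshold:   # condition 2
        return True
    if (len(loss_history) >= 2 and                  # condition 3: convergence,
            abs(loss_history[-1] - loss_history[-2]) < converge_tol):
        return True
    return False
```

In the training loop, this check is evaluated after each iteration; if it returns False, the model parameters are updated and training continues.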
Illustratively, the loss function value is calculated by calculating the difference between the uninitialized student model prediction result and the sample label of the training data by the uninitialized student model loss function. The uninitialized student model loss function may be a triplet loss function (triple loss function), or may also be other loss functions such as a cross entropy loss function. And then, judging whether a preset condition for terminating the model training is met or not according to the error between the prediction result of the uninitialized student model and the sample label of the training data, if so, taking the current uninitialized student model as an initial student model, if not, adjusting the model parameters of the uninitialized student model, then, repeatedly executing the steps until the training termination condition is met, and taking the uninitialized student model obtained when the training termination condition is met as the initial student model.
In the embodiment of the application, besides a plurality of clustering center vectors in the classification module, the uninitialized student model further comprises other modules (such as a feature extraction module) which need to update parameters, and the parameters of the uninitialized student model are reversely updated by using the loss function.
Optionally, in the embodiment of the present application, step S420 may also be implemented by using steps S510 to S520 shown in fig. 5, details of which are as follows:
and step S510, respectively calculating the similarity between the feature extraction result and each cluster center vector.
And the server acquires the similarity between the feature extraction result and each cluster center vector through a classification module of the uninitialized student model. The similarity between the feature extraction result and the clustering center vector is used for measuring the similarity between the training data to be classified and the training data category corresponding to the clustering center vector. The greater the similarity between the feature extraction result and the clustering center vector is, the greater the similarity between the training data and the training data category corresponding to the clustering center vector as the classification result is; the smaller the similarity between the feature extraction result and the clustering center vector is, the smaller the similarity between the training data and the training data category corresponding to the clustering center vector as the classification result is.
The similarity between the feature extraction result and the cluster center vector can be represented by at least one of the following items: euclidean distance between the feature extraction result and the clustering center vector, cosine similarity between the feature extraction result and the clustering center vector, Manhattan distance between the feature extraction result and the clustering center vector, Chebyshev distance between the feature extraction result and the clustering center vector and the like. In the embodiment of the present application, only the similarity between the feature extraction result and the cluster center vector is expressed by using the cosine similarity therebetween as an example for explanation.
Optionally, the cosine similarity cos(θ) between the feature extraction result X and the clustering center vector Y is calculated by the following formula 1:

Formula 1:

cos(θ) = ( Σ_{i=1}^{n} X_i · Y_i ) / ( √(Σ_{i=1}^{n} X_i²) · √(Σ_{i=1}^{n} Y_i²) )

wherein X_i represents the feature of the i-th dimension of the feature extraction result, Y_i represents the feature of the i-th dimension of the clustering center vector, and n is the number of dimensions included in the feature extraction result or the clustering center vector.
The closer the cosine similarity between the feature extraction result and the clustering center vector is to 1, the greater the similarity between the feature extraction result and the clustering center vector is; the closer the cosine similarity between the feature extraction result and the clustering center vector is to-1, the smaller the similarity between the feature extraction result and the clustering center vector is.
And S520, determining the training data type of the training data corresponding to the feature extraction result according to the similarity, and taking the training data type as the prediction result of the uninitialized student model.
The feature extraction result and each cluster center vector correspond to a similarity. Optionally, each similarity may be ranked to obtain a cluster center vector with the highest similarity corresponding to the feature extraction result according to the ranking result, and then a training data category corresponding to the cluster center vector is used as a training data category to which training data corresponding to the feature extraction result belongs, that is, the training data category is used as an uninitialized student model prediction result.
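The combination of formula 1 and the category selection described above can be sketched in pure Python (an illustration, not the patent's implementation): compute the cosine similarity between the feature extraction result and each cluster center vector, then take the most similar center's category as the prediction.

```python
import math

def cosine_similarity(x, y):
    """Formula 1: cos(theta) = sum(x_i * y_i) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def predict_category(feature, centers):
    """Return the index of the cluster center most similar to `feature`;
    that index stands for the predicted training data category."""
    sims = [cosine_similarity(feature, c) for c in centers]
    return max(range(len(centers)), key=lambda idx: sims[idx])
```

Values near 1 indicate high similarity and values near -1 indicate low similarity, matching the interpretation given above.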
In one embodiment, referring to fig. 6, fig. 6 is a schematic diagram of training an uninitialized student model. As shown in fig. 6, training data is obtained, the training data is input into an uninitialized student model to perform data feature extraction processing to obtain training data features, classification processing is performed according to the training data features to obtain an uninitialized student model prediction result, loss information of the uninitialized student model is calculated according to the uninitialized student model prediction result, whether an uninitialized student model training end condition is met is judged according to the loss information of the uninitialized student model, if not, model parameters of the uninitialized student model are updated, then the steps are repeated to perform iterative training, and if so, the current uninitialized student model is used as an initial student model. The clustering center vector used for classification processing of the uninitialized student model is migrated from the teacher model.
In the training process, the plurality of clustering center vectors in the classification module of the uninitialized student model only provide classification calculation, do not participate in the updating process of the model parameters, and only update other parameters of the uninitialized student model. After the uninitialized student models are trained according to the plurality of clustering center vectors, the obtained initial student models preliminarily obtain the data processing capacity of the teacher model, the training efficiency of the student models can be improved, and the follow-up second knowledge distillation training of the student models is facilitated.
Optionally, in the embodiment of the present application, step S220 may be implemented by steps S710 to S730 shown in fig. 7, and details are as follows:
and step S710, inputting the training data into the initial assistant model and the teacher model respectively to obtain an assistant model prediction result output by the initial assistant model and a teacher model prediction result output by the teacher model.
It should be noted that the teacher model is a deep network with high precision and large parameter quantity, and the parameter quantity of the initial assistant model is smaller than that of the teacher model. The assistant model prediction result output by the initial assistant model is an output result obtained after the training data is processed by the initial assistant model, and the teacher model prediction result output by the teacher model is an output result obtained after the training data is processed by the teacher model. For example, when the initial assistant model and the teacher model need to perform data classification processing on the training data, the assistant model prediction result and the teacher model prediction result are data classification results of the initial assistant model and the teacher model on the training data, respectively.
And S720, calculating first knowledge distillation loss information according to the assistant model prediction result and the teacher model prediction result.
It should be noted that the first knowledge distillation loss information is used to characterize the difference between the assistant model prediction result and the teacher model prediction result, that is, to characterize the difference between the output distribution of the initial assistant model and the output distribution of the teacher model.
The server can calculate the error between the assistant model prediction result and the teacher model prediction result by using a preset loss function to obtain first knowledge distillation loss information. For example, the first knowledge distillation loss information may be calculated using equation 2 shown below.
Formula 2:

L_1 = (1/N) Σ_i KL( p(y_t^i) ‖ p(y_a^i) )

wherein L_1 refers to the first knowledge distillation loss information, KL refers to the KL divergence loss function, y_a^i refers to the assistant model prediction result for the i-th training sample, y_t^i refers to the teacher model prediction result for the i-th training sample, N refers to the amount of training data, p(y_t^i) represents the probability distribution of the teacher model prediction result, and p(y_a^i) represents the probability distribution of the assistant model prediction result. The KL divergence needs to satisfy: the output distribution obtained by the initial assistant model resembles that obtained by the teacher model.
For example, the first knowledge distillation loss information may also be obtained through one or both of a distance constraint and an angle constraint between the assistant model prediction result and the teacher model prediction result, wherein the distance and angle constraints need to be satisfied: for the same batch of training data, the distances and angles between the features extracted by the initial assistant model and those extracted by the teacher model should be similar. For example, the distance may be a Euclidean distance or a cosine distance, but may also be another distance (e.g., the L1 distance).
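The KL-divergence form of the first knowledge distillation loss can be sketched as follows (an illustrative pure-Python rendering of formula 2; converting logits to distributions via softmax and averaging over the batch are our assumptions, not spelled out by the patent):

```python
import math

def softmax(logits):
    """Convert raw model outputs into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def first_distillation_loss(teacher_logits_batch, assistant_logits_batch):
    """Formula 2 sketch: mean KL between teacher and assistant distributions."""
    losses = [kl_divergence(softmax(t), softmax(a))
              for t, a in zip(teacher_logits_batch, assistant_logits_batch)]
    return sum(losses) / len(losses)
```

The loss is zero exactly when the assistant's output distribution matches the teacher's, which is the training target of the first knowledge distillation.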
And step S730, adjusting model parameters of the initial assistant model according to the first knowledge distillation loss information to obtain the assistant model.
And the server judges whether the first knowledge distillation training completion condition is met or not according to the first knowledge distillation loss information, if the first knowledge distillation training completion condition is not met, the model parameters in the initial assistant model are reversely updated by using the first knowledge distillation loss information, and the steps S710 to S720 are iterated until the first knowledge distillation training completion condition is met, and the initial assistant model when the first knowledge distillation training completion condition is met is used as the assistant model.
In some embodiments, the present embodiment may implement step S720 by using the following steps: calculating the error between the assistant model prediction result and the teacher model prediction result to obtain result loss information; calculating errors between the prediction results of the assistant model and sample labels corresponding to the training data to obtain label loss information; and calculating to obtain first knowledge distillation loss information according to the result loss information and the label loss information.
And the label loss information is used for representing the error between the assistant model prediction result and the sample label corresponding to the training data.
For example, the server may calculate the error between the assistant model prediction result and the teacher model prediction result using the KL divergence loss function to obtain the result loss information, and then calculate the error between the assistant model prediction result and the sample label corresponding to the training data using a loss function suited to the application scenario to obtain the label loss information. For example, in an image classification and recognition scenario, a classification loss function may be used to calculate the error between the assistant model prediction result and the sample label; in an image segmentation application, a cross entropy loss function may be used instead. The teacher model can also be used to predict the training data to obtain the probability value of each class, with the probability values used as the soft label of the training data and the KL divergence loss function calculated against this soft label.
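A hedged sketch of combining the result loss and the label loss (the weighted-sum form and the weight `alpha` are assumptions for illustration — the text only says the first knowledge distillation loss is calculated from both):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def result_loss(assistant_logits, teacher_logits):
    """KL divergence between the teacher 'soft label' distribution and the
    assistant's output distribution."""
    p, q = softmax(teacher_logits), softmax(assistant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def label_loss(assistant_logits, label_index):
    """Cross entropy against the hard sample label."""
    return -math.log(softmax(assistant_logits)[label_index])

def combined_distillation_loss(assistant_logits, teacher_logits,
                               label_index, alpha=0.5):
    """Weighted sum of result loss and label loss (weighting is an assumption)."""
    return (alpha * result_loss(assistant_logits, teacher_logits)
            + (1 - alpha) * label_loss(assistant_logits, label_index))
```

When the assistant exactly matches the teacher, the result loss vanishes and only the label loss drives training.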
Referring to fig. 8, fig. 8 is a schematic diagram of the first knowledge distillation training. As shown in fig. 8, training data is obtained and respectively input into the initial assistant model and the teacher model for data processing to obtain the processing results output by the two models; first knowledge distillation loss information is then calculated according to the outputs of the initial assistant model and the teacher model; whether the first knowledge distillation training completion condition is met is judged according to the first knowledge distillation loss information; if not, the model parameters of the initial assistant model are updated and the previous steps are repeated for iterative training; if so, the current initial assistant model is used as the assistant model.
In some embodiments, the present application may implement step S230 by using steps S910 to S930 shown in fig. 9, so as to further match the output distribution of the initial student model with the output distributions of the teacher model and the assistant model.
Step S910, the training data are respectively input into the initial student model, the assistant model and the teacher model, and an initial student model prediction result output by the initial student model, an assistant model prediction result output by the assistant model and a teacher model prediction result output by the teacher model are obtained.
It should be noted that the teacher model is a deep network with high precision and large parameter quantity, the parameter quantity of the assistant model is smaller than the parameter quantity of the teacher model, and the parameter quantity of the initial student model is smaller than the parameter quantity of the assistant model. The initial student model prediction result output by the initial student model is an output result obtained after the initial student model processes the training data, the assistant model prediction result output by the assistant model is an output result obtained after the assistant model processes the training data, and the teacher model prediction result output by the teacher model is an output result obtained after the teacher model processes the training data. For example, when the initial student model, the assistant model, and the teacher model need to perform data classification processing on the training data, the initial student model prediction result, the assistant model prediction result, and the teacher model prediction result are data classification results of the initial student model, the assistant model, and the teacher model on the training data, respectively.
And step S920, calculating second knowledge distillation loss information according to the initial student model prediction result, the assistant model prediction result and the teacher model prediction result.
It should be noted that the second knowledge distillation loss information is used to represent the difference between the initial student model prediction result and the assistant model prediction result and the teacher model prediction result, that is, to represent the difference between the output distribution of the initial student model and the output distribution of the teacher model and the assistant model.
For example, in order to make the output distribution of the initial student model the same as or similar to the output distributions of the assistant model and the teacher model, the second knowledge distillation loss information may be obtained through one or more of a KL divergence, a distance constraint, and an angle constraint between the initial student model prediction result and the assistant model prediction result and the teacher model prediction result. The KL divergence needs to satisfy: the output distributions obtained by the assistant model and the initial student model are similar. The distance and angle constraints need to be satisfied: for the same batch of training data, the distances and angles between the features extracted by the assistant model and those extracted by the initial student model should be similar; for example, the distance may be a Euclidean distance or a cosine distance, or another distance (e.g., the L1 distance).
And step S930, adjusting model parameters of the initial student model according to the second knowledge distillation loss information to obtain the student model.
The second knowledge distillation aims to guide the initial student model by using the output of the teacher model and the output of the assistant model, so that the output distribution of the initial student model is the same as or similar to the output distribution of the teacher model and the output distribution of the assistant model.
And the server judges whether the second knowledge distillation training completion condition is met according to the second knowledge distillation loss information; if not, the model parameters in the initial student model are reversely updated by using the second knowledge distillation loss information, and steps S910 to S920 are iterated until the second knowledge distillation training completion condition is met, and the initial student model at that point is used as the student model.
Referring to fig. 10, fig. 10 is a schematic diagram of the second knowledge distillation training. As shown in fig. 10, training data is acquired and input into the initial student model, the assistant model and the teacher model; second knowledge distillation loss information is calculated according to the outputs of the three models; whether the second knowledge distillation training completion condition is met is judged according to the second knowledge distillation loss information; if not, the model parameters of the initial student model are updated and the previous steps are repeated for iterative training; if so, the current initial student model is used as the student model.
Optionally, in the embodiment of the present application, step S920 may also be implemented by using step S1110 to step S1130 shown in fig. 11, where details are as follows:
and step S1110, calculating an error between the initial student model prediction result and the assistant model prediction result to obtain assistant model loss information.
The assistant model loss information is used for representing the difference between the initial student model prediction result and the assistant model prediction result, namely representing the difference between the output distribution of the initial student model and the output distribution of the assistant model.
For example, the server may calculate an error between the initial student model prediction result and the assistant model prediction result using a preset loss function, and obtain assistant model loss information. For example, the assistant model loss information can be calculated using equation 3 shown below.
Formula 3:

L_a = (1/N) Σ_i KL( p(y_a^i) ‖ p(y_s^i) )

wherein L_a refers to the assistant model loss information, KL refers to the KL divergence loss function, y_a^i refers to the assistant model prediction result for the i-th training sample, y_s^i refers to the initial student model prediction result for the i-th training sample, N refers to the amount of training data, p(y_a^i) represents the probability distribution of the assistant model prediction result, and p(y_s^i) represents the probability distribution of the initial student model prediction result.
And step S1120, calculating an error between the initial student model prediction result and the teacher model prediction result to obtain teacher model loss information.
The teacher model loss information is used for representing the difference between the initial student model prediction result and the teacher model prediction result, namely representing the difference between the output distribution of the initial student model and the output distribution of the teacher model.
For example, the server may calculate an error between the initial student model prediction result and the teacher model prediction result by using a preset loss function to obtain teacher model loss information. For example, teacher model loss information may be calculated using equation 4 shown below.
Formula 4:

L_t = (1/N) Σ_i KL( p(y_t^i) ‖ p(y_s^i) )

wherein L_t refers to the teacher model loss information, KL refers to the KL divergence loss function, y_t^i refers to the teacher model prediction result for the i-th training sample, y_s^i refers to the initial student model prediction result for the i-th training sample, N refers to the amount of training data, p(y_t^i) represents the probability distribution of the teacher model prediction result, and p(y_s^i) represents the probability distribution of the initial student model prediction result.
Step S1130, obtaining second knowledge distillation loss information according to the assistant model loss information and the teacher model loss information.
The server can use preset weight coefficients to perform weighted calculation on the assistant model loss information and the teacher model loss information to obtain second knowledge distillation loss information. For example, the second knowledge distillation loss information may be calculated using equation 5 shown below.
Formula 5: L_2 = β·L_a + γ·L_t
Wherein L_2 refers to the second knowledge distillation loss information, L_a refers to the assistant model loss information, L_t refers to the teacher model loss information, β refers to the weight coefficient of the assistant model loss information, and γ refers to the weight coefficient of the teacher model loss information.
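Assuming the probability distributions are already available, the second knowledge distillation loss of Formulas 3–5 can be sketched as follows (the weight values β = γ = 0.5 are illustrative assumptions, not values fixed by this application):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) over one pair of probability distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def second_distillation_loss(p_student, p_assistant, p_teacher, beta=0.5, gamma=0.5):
    l_a = kl_divergence(p_assistant, p_student)  # Formula 3: assistant model loss
    l_t = kl_divergence(p_teacher, p_student)    # Formula 4: teacher model loss
    return beta * l_a + gamma * l_t              # Formula 5: L_2 = beta*L_a + gamma*L_t
```

Adjusting the initial student model's parameters to minimize L_2 pulls its output distribution toward both the assistant model's and the teacher model's distributions at once.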
In an alternative implementation manner, the model training method provided in the embodiments of the present application can be used in a face recognition scenario, which is described below with reference to the scenario of applying a trained face recognition model. As shown in fig. 12, the face recognition method includes steps S1210 to S1230.
Step S1210, collecting a face image to be recognized.
It should be noted that the face image to be recognized is an image on which face recognition is to be performed. The face image to be recognized may include one or at least two faces to be recognized, and the server can identify each of them based on the face image to be recognized.
It should be noted that the specific implementation manners of the present application involve data such as the face image to be recognized. When the above embodiments of the present application are applied to a specific product or technology, user permission or consent needs to be obtained, and the collection, use and processing of the related data need to comply with the relevant laws, regulations and standards of the relevant countries and regions.
Illustratively, the terminal 110 is a client deployed with application software having a face recognition function, and the server 120 is deployed with a trained face recognition model to recognize faces on the terminal side. For example, in the field of financial payment, a user can perform operations requiring identity authentication, such as transferring money, making payments or modifying account information, through a smart phone, and identity authentication can be realized by recognizing the face image to be recognized of the user. In this process, the terminal 110 uploads the face image to be recognized to the server 120, or the server 120 directly retrieves the face image to be recognized from the database, and then the trained face recognition model is adopted to recognize the face image to be recognized so as to obtain a face recognition result. The server 120 may feed back the face recognition result to the terminal 110, or may store the face recognition result locally for other service applications or processes.
For example, the terminal may acquire the face image to be recognized of a real scene through a built-in camera, or through an external camera associated with the terminal. For example, the terminal may be connected to an image acquisition device through a connection line or a network; the image acquisition device acquires the face image to be recognized of the real scene through its camera and transmits the acquired face image to be recognized to the terminal. The camera may be a monocular camera, a binocular camera, a depth camera, a three-dimensional (3D) camera, or the like. The terminal can collect the face image to be recognized of a living body in the real scene, and can also collect an existing image containing a face in the real scene, such as a scanned identity document.
Step S1220, inputting a face image to be recognized into a face recognition model; the face recognition model is obtained by training according to the model training method, and corresponds to the student model in the model training method.
And carrying out face recognition on the face image to be recognized through the trained face recognition model to obtain a face recognition result corresponding to the face image to be recognized. As shown in fig. 13, the process of obtaining the face recognition model includes a model training phase and a model deployment phase. The model training phase comprises: carrying out first knowledge distillation training on the initial assistant model through the trained teacher model to obtain a trained assistant model; carrying out preliminary optimization training on the uninitialized student model through a plurality of cluster center vectors contained in the teacher model to obtain an initial student model; and then carrying out second knowledge distillation training on the initial student model by combining the teacher model and the assistant model to obtain a trained student model, namely the face recognition model. The model deployment phase is configured to perform combined deployment on the relevant modules obtained in the model training phase to obtain a complete face recognition model. For example, as shown in fig. 14, a picture acquisition input module, an image feature extraction module and an image classification module are integrated: the picture acquisition input module acquires the face image to be recognized, the image feature extraction module performs feature extraction on the face image to be recognized to obtain face image features, and the image classification module performs face recognition according to the face image features.
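The deployment-stage composition in fig. 14 amounts to chaining the modules in sequence. A minimal sketch (the module interfaces are assumptions for illustration, not the patented deployment):

```python
def recognize_face(image, extract_features, classify):
    # Fig. 14 pipeline: picture acquisition input -> image feature
    # extraction -> image classification. The two callables stand in for
    # the trained feature-extraction and classification modules.
    features = extract_features(image)
    return classify(features)
```

In practice `extract_features` would be the distilled student backbone and `classify` the matching/classification head; here both are passed in as plain callables so the composition itself is visible.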
The face recognition model is obtained by performing knowledge migration from a teacher model and an assistant model with stronger expression capability, and the feature distribution extracted by the face recognition model has higher similarity with the feature distributions of the teacher model and the assistant model, so that the face recognition model has higher recognition accuracy.
Step S1230, obtaining a face recognition result output by the face recognition model.
The server obtains the face image characteristics corresponding to the face image to be recognized through the trained face recognition model, and obtains the face recognition result corresponding to the face image to be recognized based on the face image characteristics.
In an embodiment, the face recognition result may be an identity corresponding to a face in the face image to be recognized. For example, in a one-to-many identity recognition scenario, the server obtains the face image features corresponding to the face image to be recognized through the trained face recognition model, matches the face image features with at least one reference image feature, and uses the identity corresponding to the reference image feature with the largest matching degree as the identity of the face in the face image to be recognized. In other embodiments, the face recognition result may be an identity authentication result corresponding to a face in the face image to be recognized. For example, in a one-to-one identity authentication scenario, the server acquires the face image features corresponding to the face image to be recognized through the trained face recognition model, acquires the reference image features corresponding to the identity to be verified, and determines that the face image to be recognized passes identity authentication when the similarity between the face image features and the reference image features exceeds a threshold value.
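Both matching modes described above reduce to feature-similarity comparison. A minimal sketch using cosine similarity (the choice of metric and the threshold value 0.8 are assumptions for illustration; the application does not fix either):

```python
import math

def cosine_similarity(a, b):
    # Similarity between two feature vectors, in [-1, 1].
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def identify(query_features, reference_gallery):
    # One-to-many identity recognition: return the identity whose
    # reference image features match the query features best.
    return max(reference_gallery,
               key=lambda name: cosine_similarity(query_features, reference_gallery[name]))

def verify(query_features, reference_features, threshold=0.8):
    # One-to-one identity authentication: pass when the similarity
    # between query and reference features exceeds the threshold.
    return cosine_similarity(query_features, reference_features) >= threshold
```

The gallery is a mapping from identity to reference feature vector; `identify` returns the argmax identity, while `verify` makes a binary pass/fail decision against a single reference.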
For the training method of the face recognition model, reference may be made to the above embodiments, which are not described herein again.
Therefore, in the technical scheme provided by the embodiments of the present application, the teacher model is used for conducting first knowledge distillation training on the initial assistant model to obtain the trained assistant model, and then the teacher model and the assistant model are combined to conduct second knowledge distillation training on the initial student model to obtain the trained student model. Parameter transition is performed through the assistant model, whose parameter quantity is closer to that of the student model than the teacher model's is, which avoids serious overfitting of the student model caused by an excessive parameter quantity difference between the teacher model and the student model. Meanwhile, the teacher model with stronger learning capability supervises the second knowledge distillation training process, so that the finally obtained student model has higher accuracy.
FIG. 15 is a block diagram of a model training apparatus shown in an exemplary embodiment of the present application. The model training apparatus may be applied to the implementation environment shown in fig. 1. The model training device may also be applied to other exemplary implementation environments, and is specifically configured in other devices, and this embodiment does not limit the implementation environment to which the device is applied.
As shown in fig. 15, the exemplary model training apparatus 1500 includes: a teacher model training module 1510, an assistant model training module 1520, and a student model training module 1530. Specifically, the method comprises the following steps:
the teacher model training module 1510 is configured to train the initial teacher model to obtain a trained teacher model.
An assistant model training module 1520 configured to perform a first knowledge distillation training on the initial assistant model according to the teacher model to obtain a trained assistant model; wherein the parameter quantity of the initial assistant model is smaller than the parameter quantity of the initial teacher model.
The student model training module 1530 is configured to perform second knowledge distillation training on the initial student model according to the teacher model and the assistant model to obtain a trained student model; wherein the parameter quantity of the initial student model is smaller than the parameter quantity of the initial assistant model.
In the exemplary model training apparatus, in order to improve the knowledge distillation effect of the student models, assistant models are introduced, and since the parameter quantity of the assistant models is smaller than the parameter quantity of the teacher model but larger than the parameter quantity of the student models, parameter quantity transition can be performed by the assistant models which are closer to the parameter quantity of the student models than the teacher model, so that the situation that the student models are over-fitted seriously due to too large parameter quantity difference between the teacher model and the student models is avoided. Meanwhile, the data processing capacity of the teacher model is higher than that of the assistant model, so that the initial student model can be trained by combining the teacher model and the assistant model, and the obtained student model has better data processing capacity.
On the basis of the above exemplary embodiment, the model training apparatus 1500 further includes a first feature extraction module, a first prediction module, and a first model parameter adjustment module. Specifically, the method comprises the following steps:
and the first feature extraction module is configured to input the training data into the uninitialized student model to obtain a feature extraction result output by the uninitialized student model.
And the first prediction module is configured to obtain an uninitialized student model prediction result according to the feature extraction result and the plurality of clustering center vectors.
And the first model parameter adjusting module is configured to adjust the model parameters of the uninitialized student model according to the difference between the prediction result of the uninitialized student model and the sample label of the training data to obtain the initial student model.
In this exemplary model training device, the relatively accurate cluster center vectors obtained during teacher model training are directly migrated to the uninitialized student model, and the uninitialized student model that receives the cluster center vectors is trained according to the training data, so as to preliminarily obtain an initial student model whose data processing capability differs little from that of the teacher model. This accelerates the subsequent second knowledge distillation training of the initial student model, reduces the knowledge migration difficulty of the second knowledge distillation training, and can further improve the accuracy of the student model.
On the basis of the above exemplary embodiment, the first prediction module further includes a similarity calculation module and a training data category acquisition module. Specifically, the method comprises the following steps:
and the similarity calculation module is configured to calculate the similarity between the feature extraction result and each cluster center vector respectively.
And the training data category acquisition module is configured to determine a training data category to which training data corresponding to the feature extraction result belongs according to the similarity, and take the training data category as an uninitialized student model prediction result.
In the exemplary model training device, the similarity between the feature extraction result output by the uninitialized student model and each cluster center vector is calculated, so as to determine, according to the similarity, the training data class of the training data input into the uninitialized student model. Since the cluster center vectors are parameters migrated from the teacher model, the accuracy of identifying the training data class from the feature extraction result is improved.
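A minimal sketch of this similarity-based class assignment (cosine similarity is an assumed choice of metric; the application does not prescribe a particular one):

```python
import math

def cosine_similarity(a, b):
    # Similarity between a feature vector and a cluster center vector.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def predict_training_data_class(feature, cluster_centers):
    # cluster_centers holds one center vector per training-data class,
    # migrated from the trained teacher model. The predicted class is
    # the index of the most similar center vector.
    similarities = [cosine_similarity(feature, center) for center in cluster_centers]
    return max(range(len(similarities)), key=lambda i: similarities[i])
```

The returned index serves as the uninitialized student model's prediction result, which is then compared against the sample label to drive the preliminary optimization.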
Based on the above exemplary embodiment, the assistant model training module 1520 further includes a second prediction module, a first knowledge distillation loss information calculation module, and an assistant model parameter adjustment module. Specifically, the method comprises the following steps:
and the second prediction module is configured to input the training data into the initial assistant model and the teacher model respectively to obtain an assistant model prediction result output by the initial assistant model and a teacher model prediction result output by the teacher model.
And the first knowledge distillation loss information calculation module is configured to calculate first knowledge distillation loss information according to the assistant model prediction result and the teacher model prediction result.
And the assistant model parameter adjusting module is configured to adjust the model parameters of the initial assistant model according to the first knowledge distillation loss information so as to obtain the assistant model.
In the exemplary model training device, knowledge distillation training is performed on the initial assistant model according to the teacher model prediction result output by the teacher model, so that the data processing capacity of the teacher model is migrated to the assistant model with smaller parameters, and then the condition of overfitting of the student model due to large parameter difference between the teacher model and the student model can be avoided on the basis of the assistant model with smaller parameters, so that the accuracy of the student model is improved.
On the basis of the above exemplary embodiment, the first knowledge distillation loss information calculation module further includes a result loss information acquisition module, a tag loss information acquisition module, and a comprehensive calculation module. Specifically, the method comprises the following steps:
and the result loss information acquisition module is configured to calculate an error between the assistant model prediction result and the teacher model prediction result to obtain result loss information.
And the label loss information acquisition module is configured to calculate an error between the prediction result of the assistant model and a sample label corresponding to the training data to obtain label loss information.
And the first comprehensive calculation module is configured to calculate to obtain first knowledge distillation loss information according to the result loss information and the label loss information.
In the exemplary model training device, label loss information is obtained according to the assistant model prediction result and the sample label corresponding to the training data, result loss information is obtained according to the assistant model prediction result and the teacher model prediction result, and then first knowledge distillation loss information is obtained according to the label loss information and the result loss information, so that the prediction result of the teacher model on the training data and the real result of the training data are considered at the same time, and the accuracy of the assistant model is improved.
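The first knowledge distillation loss thus combines a soft-target term and a hard-label term. A hedged sketch (using KL divergence for the result loss, cross-entropy for the label loss, and an equal weighting — all three choices are illustrative assumptions, not values fixed by this application):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # Result loss: divergence between teacher and assistant predictions.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_entropy(pred_probs, label_index, eps=1e-12):
    # Label loss: negative log-probability assigned to the sample label.
    return -math.log(pred_probs[label_index] + eps)

def first_distillation_loss(p_assistant, p_teacher, label_index, weight=0.5):
    result_loss = kl_divergence(p_teacher, p_assistant)   # assistant vs. teacher prediction
    label_loss = cross_entropy(p_assistant, label_index)  # assistant vs. sample label
    return weight * result_loss + (1.0 - weight) * label_loss
```

Because both the teacher's prediction and the ground-truth label contribute, the assistant model is pulled toward the teacher's output distribution without drifting from the real labels of the training data.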
On the basis of the above exemplary embodiment, the student model training module 1530 further includes a third prediction module, a second knowledge distillation loss information calculation module, and a student model parameter adjustment module. Specifically, the method comprises the following steps:
and the third prediction module is configured to input the training data into the initial student model, the assistant model and the teacher model respectively to obtain an initial student model prediction result output by the initial student model, an assistant model prediction result output by the assistant model and a teacher model prediction result output by the teacher model.
And the second knowledge distillation loss information calculation module is configured to calculate second knowledge distillation loss information according to the initial student model prediction result, the assistant model prediction result and the teacher model prediction result.
And the student model parameter adjusting module is configured to adjust the model parameters of the initial student model according to the second knowledge distillation loss information so as to obtain the student model.
In the exemplary model training device, knowledge distillation training is performed on an initial student model according to a teacher model prediction result output by a teacher model and an assistant model prediction result output by an assistant model, so that the situation that the student model is over-fitted due to a large parameter difference between the teacher model and the student model is avoided through the assistant model with a smaller parameter, and a second knowledge distillation training process is supervised through the teacher model with a stronger learning capacity, so that the finally obtained student model has higher accuracy.
On the basis of the above exemplary embodiment, the second knowledge distillation loss information calculation module further includes an assistant model loss information acquisition module, a teacher model loss information acquisition module, and a second comprehensive calculation module.
Specifically, the method comprises the following steps:
and the assistant model loss information acquisition module is configured to calculate an error between the initial student model prediction result and the assistant model prediction result to obtain assistant model loss information.
And the teacher model loss information acquisition module is configured to calculate an error between the initial student model prediction result and the teacher model prediction result to obtain teacher model loss information.
And the second comprehensive calculation module is configured to obtain second knowledge distillation loss information according to the assistant model loss information and the teacher model loss information.
In the exemplary model training apparatus, by performing the calculation of the second knowledge distillation loss information on the initial student model in combination with the assistant model loss information and the teacher model loss information, the output distributions of the assistant model and the teacher model can be combined, so that the obtained second knowledge distillation loss information is more accurate.
It should be noted that the model training apparatus provided in the foregoing embodiment and the model training method provided in the foregoing embodiment belong to the same concept, and specific ways for each module and unit to perform operations have been described in detail in the method embodiment, and are not described herein again. In practical applications, the model training device provided in the above embodiment may distribute the functions through different function modules according to needs, that is, the internal structure of the device is divided into different function modules to complete all or part of the functions described above, which is not limited herein.
Fig. 16 is a block diagram of a face recognition apparatus according to an exemplary embodiment of the present application. The face recognition apparatus can be applied to the implementation environment shown in fig. 1. The face recognition apparatus may also be applied to other exemplary implementation environments, and is specifically configured in other devices, and the embodiment does not limit the implementation environment to which the apparatus is applied.
As shown in fig. 16, the exemplary face recognition apparatus 1600 includes: an image acquisition module 1610, a recognition module 1620, and a result acquisition module 1630. Specifically, the method comprises the following steps:
an image acquisition module 1610 configured to acquire a face image;
a recognition module 1620 configured to input the face image to a face recognition model; the face recognition model is obtained by training according to the model training method, and corresponds to a student model in the model training method;
the result obtaining module 1630 is configured to obtain a face recognition result output by the face recognition model.
In the exemplary face recognition device, the face recognition model is obtained based on the above model training method; the face recognition model has data processing capability similar to or the same as that of the teacher model, and has a smaller parameter quantity, which makes it convenient to deploy.
It should be noted that the face recognition apparatus provided in the foregoing embodiment and the face recognition method provided in the foregoing embodiment belong to the same concept, and the specific ways in which the modules and units perform operations have been described in detail in the method embodiment, and are not described herein again. In practical applications, the face recognition apparatus provided in the above embodiment may distribute the above functions to different function modules as needed, that is, the internal structure of the apparatus is divided into different function modules to complete all or part of the functions described above, which is not limited herein.
An embodiment of the present application further provides an electronic device, including: one or more processors; a storage device, configured to store one or more programs, which when executed by the one or more processors, cause the electronic device to implement the model training method provided in the above-described embodiments.
FIG. 17 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application. It should be noted that the computer system 1700 of the electronic device shown in fig. 17 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments.
As shown in fig. 17, a computer system 1700 includes a Central Processing Unit (CPU)1701 that can perform various appropriate actions and processes, such as executing the methods described in the above embodiments, according to a program stored in a Read-Only Memory (ROM) 1702 or a program loaded from a storage portion 1708 into a Random Access Memory (RAM) 1703. In the RAM 1703, various programs and data necessary for system operation are also stored. The CPU 1701, ROM 1702, and RAM 1703 are connected to each other through a bus 1704. An Input/Output (I/O) interface 1705 is also connected to the bus 1704.
The following components are connected to the I/O interface 1705: an input portion 1706 including a keyboard, a mouse, and the like; an output section 1707 including a Display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 1708 including a hard disk and the like; and a communication section 1709 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1709 performs communication processing via a network such as the internet. A driver 1710 is also connected to the I/O interface 1705 as necessary. A removable medium 1711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1710 as necessary, so that a computer program read out therefrom is mounted into the storage portion 1708 as necessary.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising a computer program for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 1709, and/or installed from the removable media 1711. When the computer program is executed by a Central Processing Unit (CPU)1701, various functions defined in the system of the present application are executed.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer-readable signal medium may comprise a propagated data signal with a computer-readable computer program embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
Yet another aspect of the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a model training method as described above. The computer-readable storage medium may be included in the electronic device described in the above embodiment, or may exist alone without being assembled into the electronic device.
Another aspect of the application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the model training method provided in the above embodiments.
The above description is only a preferred exemplary embodiment of the present application, and is not intended to limit the embodiments of the present application, and those skilled in the art can easily make various changes and modifications according to the main concept and spirit of the present application, so that the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of model training, the method comprising:
training the initial teacher model to obtain a trained teacher model;
carrying out first knowledge distillation training on the initial assistant model according to the teacher model to obtain a trained assistant model; wherein the parameter quantity of the initial assistant model is smaller than the parameter quantity of the initial teacher model;
performing second knowledge distillation training on the initial student model according to the teacher model and the assistant model to obtain a trained student model; wherein the parameter quantities of the initial student model are smaller than the parameter quantities of the initial assistant model.
2. The method of claim 1, wherein the teacher model includes a plurality of cluster center vectors, different cluster center vectors corresponding to different classes of training data; prior to the second knowledge distillation training of the initial student model according to the teacher model and the assistant model, the method further comprises:
inputting training data into an uninitialized student model to obtain a feature extraction result output by the uninitialized student model;
obtaining an uninitialized student model prediction result according to the feature extraction result and the plurality of cluster center vectors;
and adjusting the model parameters of the uninitialized student model according to the difference between the prediction result of the uninitialized student model and the sample label of the training data to obtain an initial student model.
3. The method of claim 2, wherein obtaining uninitialized student model prediction results from the feature extraction results and the plurality of cluster center vectors comprises:
calculating the similarity between the feature extraction result and each cluster center vector respectively;
and determining, according to the similarities, the training data category to which the training data corresponding to the feature extraction result belongs, and taking the training data category as the prediction result of the uninitialized student model.
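One reading of claims 2-3: the uninitialized student borrows the teacher's class centres, and its prediction is the class whose centre is most similar to the extracted feature. A minimal sketch follows; cosine similarity is an assumption (the claims do not fix a similarity measure), and all values are illustrative.

```python
import numpy as np

def predict_with_centers(features, centers):
    # Cosine similarity between each feature vector and every cluster
    # centre; the most similar centre gives the predicted class.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sims = f @ c.T                       # shape: (batch, num_classes)
    return sims.argmax(axis=1)

# Three class centres in a 4-dimensional feature space (illustrative values).
centers = np.eye(3, 4)
features = np.array([[0.1, 0.2, 0.9, 0.0],    # most similar to centre 2
                     [0.8, 0.1, 0.0, 0.1]])   # most similar to centre 0
pred = predict_with_centers(features, centers)
```

Reusing the teacher's centres means the student's feature space is aligned with the teacher's from the start, which is what makes the later distillation comparison meaningful.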
4. The method of claim 1, wherein the first knowledge distillation training of the initial assistant model based on the teacher model to obtain a trained assistant model comprises:
respectively inputting training data into the initial assistant model and the teacher model to obtain an assistant model prediction result output by the initial assistant model and a teacher model prediction result output by the teacher model;
calculating first knowledge distillation loss information according to the assistant model prediction result and the teacher model prediction result;
adjusting model parameters of the initial assistant model according to the first knowledge distillation loss information to obtain the assistant model.
5. The method of claim 4, wherein calculating the first knowledge distillation loss information according to the assistant model prediction result and the teacher model prediction result comprises:
calculating an error between the assistant model prediction result and the teacher model prediction result to obtain result loss information;
calculating the error between the prediction result of the assistant model and the sample label corresponding to the training data to obtain label loss information;
and calculating the first knowledge distillation loss information according to the result loss information and the label loss information.
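Claim 5 combines a result loss (assistant vs. teacher predictions) with a label loss (assistant vs. hard sample labels). The sketch below uses KL divergence and cross-entropy, common choices where the claims only say "error"; `alpha` and `temperature` are illustrative hyper-parameters, not from the patent.

```python
import numpy as np

def softmax(z, t=1.0):
    z = z / t
    z = z - z.max(axis=1, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def first_distillation_loss(assistant_logits, teacher_logits, labels,
                            alpha=0.5, temperature=4.0):
    # Result loss: KL divergence between the teacher's and the assistant's
    # temperature-softened predictions.
    p_t = softmax(teacher_logits, temperature)
    p_a = softmax(assistant_logits, temperature)
    result_loss = np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_a)), axis=1))
    # Label loss: cross-entropy of the assistant against the hard labels.
    p = softmax(assistant_logits)
    label_loss = -np.mean(np.log(p[np.arange(len(labels)), labels]))
    # Weighted combination of the two terms.
    return alpha * result_loss + (1.0 - alpha) * label_loss

logits = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
labels = np.array([0, 1])
loss = first_distillation_loss(logits, logits, labels)  # KL term vanishes here
```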
6. The method of claim 1, wherein performing the second knowledge distillation training on the initial student model according to the teacher model and the assistant model to obtain a trained student model comprises:
respectively inputting training data into the initial student model, the assistant model and the teacher model to obtain an initial student model prediction result output by the initial student model, an assistant model prediction result output by the assistant model and a teacher model prediction result output by the teacher model;
calculating second knowledge distillation loss information according to the initial student model prediction result, the assistant model prediction result and the teacher model prediction result;
and adjusting the model parameters of the initial student model according to the second knowledge distillation loss information to obtain the student model.
7. The method of claim 6, wherein calculating the second knowledge distillation loss information according to the initial student model prediction result, the assistant model prediction result and the teacher model prediction result comprises:
calculating an error between the initial student model prediction result and the assistant model prediction result to obtain assistant model loss information;
calculating an error between the initial student model prediction result and the teacher model prediction result to obtain teacher model loss information;
and obtaining the second knowledge distillation loss information according to the assistant model loss information and the teacher model loss information.
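Claim 7's second distillation loss sums an assistant term and a teacher term, both measuring how far the student's predictions sit from a larger model's. The sketch again uses KL divergence over temperature-softened logits as the "error"; `beta` and `temperature` are assumptions for illustration.

```python
import numpy as np

def softmax(z, t=1.0):
    z = z / t
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def second_distillation_loss(student_logits, assistant_logits, teacher_logits,
                             beta=0.5, temperature=4.0):
    def kl(target_logits, pred_logits):
        p = softmax(target_logits, temperature)
        q = softmax(pred_logits, temperature)
        return np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1))
    assistant_loss = kl(assistant_logits, student_logits)  # student vs. assistant
    teacher_loss = kl(teacher_logits, student_logits)      # student vs. teacher
    return beta * assistant_loss + (1.0 - beta) * teacher_loss

teacher = np.array([[2.0, 0.5, 0.1]])
assistant = np.array([[1.8, 0.6, 0.2]])
student = np.array([[1.0, 0.9, 0.4]])
loss = second_distillation_loss(student, assistant, teacher)
```

Weighting the two terms lets the student learn mostly from the nearby assistant while still receiving a direct signal from the stronger teacher.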
8. A face recognition method, comprising:
collecting a face image to be recognized;
inputting the face image to be recognized into a face recognition model; the face recognition model is obtained by training according to the model training method of any one of claims 1 to 7, and the face recognition model corresponds to a student model in the model training method;
and acquiring a face recognition result output by the face recognition model.
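Claim 8 leaves the form of the "face recognition result" open. One common deployment compares the student model's face embedding against a gallery of enrolled identities; the `recognize` helper, the gallery layout and the 0.6 threshold below are all hypothetical, and the toy vectors stand in for embeddings a real model would produce from images.

```python
import numpy as np

def recognize(embedding, gallery, threshold=0.6):
    # Compare the query embedding with every enrolled identity by cosine
    # similarity; return the best match above the threshold, else None.
    names = list(gallery)
    mat = np.stack([gallery[n] for n in names])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    q = embedding / np.linalg.norm(embedding)
    sims = mat @ q
    best = int(sims.argmax())
    return names[best] if sims[best] >= threshold else None

# Toy 3-dimensional "embeddings" for two enrolled identities.
gallery = {"alice": np.array([1.0, 0.0, 0.0]),
           "bob": np.array([0.0, 1.0, 0.0])}
match = recognize(np.array([0.9, 0.1, 0.0]), gallery)   # matches "alice"
```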
9. A model training apparatus, the apparatus comprising:
the teacher model training module is configured to train the initial teacher model to obtain a trained teacher model;
the assistant model training module is configured to perform first knowledge distillation training on the initial assistant model according to the teacher model to obtain a trained assistant model; wherein the parameter quantity of the initial assistant model is smaller than that of the initial teacher model;
the student model training module is configured to perform second knowledge distillation training on the initial student model according to the teacher model and the assistant model to obtain a trained student model; wherein the parameter quantity of the initial student model is smaller than that of the initial assistant model.
10. An apparatus for face recognition, the apparatus comprising:
the image acquisition module is configured to acquire a face image to be recognized;
the recognition module is configured to input the face image to be recognized into a face recognition model; the face recognition model is obtained by training according to the model training method of any one of claims 1 to 7, and the face recognition model corresponds to a student model in the model training method;
and the result acquisition module is configured to acquire a face recognition result output by the face recognition model.
CN202210261833.8A 2022-03-16 2022-03-16 Model training method, face recognition method and device Pending CN114611672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210261833.8A CN114611672A (en) 2022-03-16 2022-03-16 Model training method, face recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210261833.8A CN114611672A (en) 2022-03-16 2022-03-16 Model training method, face recognition method and device

Publications (1)

Publication Number Publication Date
CN114611672A (en) 2022-06-10

Family

ID=81862489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210261833.8A Pending CN114611672A (en) 2022-03-16 2022-03-16 Model training method, face recognition method and device

Country Status (1)

Country Link
CN (1) CN114611672A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180268292A1 (en) * 2017-03-17 2018-09-20 Nec Laboratories America, Inc. Learning efficient object detection models with knowledge distillation
CN110837761A (en) * 2018-08-17 2020-02-25 北京市商汤科技开发有限公司 Multi-model knowledge distillation method and device, electronic equipment and storage medium
CN111259738A (en) * 2020-01-08 2020-06-09 科大讯飞股份有限公司 Face recognition model construction method, face recognition method and related device
WO2021243473A1 (en) * 2020-06-05 2021-12-09 Huawei Technologies Co., Ltd. Improved knowledge distillation by utilizing backward pass knowledge in neural networks
CN111709409A (en) * 2020-08-20 2020-09-25 腾讯科技(深圳)有限公司 Face living body detection method, device, equipment and medium
CN113743514A (en) * 2021-09-08 2021-12-03 庆阳瑞华能源有限公司 Knowledge distillation-based target detection method and target detection terminal
CN114170655A (en) * 2021-11-29 2022-03-11 西安电子科技大学 Knowledge distillation-based face counterfeiting cue migration method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GEOFFREY HINTON: "Distilling the Knowledge in a Neural Network", MACHINE LEARNING, 9 March 2015 (2015-03-09) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115099988A (en) * 2022-06-28 2022-09-23 腾讯科技(深圳)有限公司 Model training method, data processing method, device and computer medium
CN115700845A (en) * 2022-11-15 2023-02-07 智慧眼科技股份有限公司 Face recognition model training method, face recognition device and related equipment
CN115700845B (en) * 2022-11-15 2023-08-11 智慧眼科技股份有限公司 Face recognition model training method, face recognition device and related equipment
CN117521848A (en) * 2023-11-10 2024-02-06 中国科学院空天信息创新研究院 Remote sensing basic model light-weight method and device for resource-constrained scene
CN117521848B (en) * 2023-11-10 2024-05-28 中国科学院空天信息创新研究院 Remote sensing basic model light-weight method and device for resource-constrained scene

Similar Documents

Publication Publication Date Title
CN111444340B (en) Text classification method, device, equipment and storage medium
CN111797893B (en) Neural network training method, image classification system and related equipment
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
CN110659723B (en) Data processing method and device based on artificial intelligence, medium and electronic equipment
CN111737476A (en) Text processing method and device, computer readable storage medium and electronic equipment
CN114611672A (en) Model training method, face recognition method and device
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
CN111241989A (en) Image recognition method and device and electronic equipment
CN111582409A (en) Training method of image label classification network, image label classification method and device
CN111553419B (en) Image identification method, device, equipment and readable storage medium
CN113344206A (en) Knowledge distillation method, device and equipment integrating channel and relation feature learning
CN114298122B (en) Data classification method, apparatus, device, storage medium and computer program product
CN112395979A (en) Image-based health state identification method, device, equipment and storage medium
CN115050064A (en) Face living body detection method, device, equipment and medium
CN111931628B (en) Training method and device of face recognition model and related equipment
CN113011568A (en) Model training method, data processing method and equipment
CN114612902A (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN114170484B (en) Picture attribute prediction method and device, electronic equipment and storage medium
CN116994021A (en) Image detection method, device, computer readable medium and electronic equipment
CN111797204A (en) Text matching method and device, computer equipment and storage medium
CN115880702A (en) Data processing method, device, equipment, program product and storage medium
CN115115910A (en) Training method, using method, device, equipment and medium of image processing model
CN116050508B (en) Neural network training method and device
CN117009532B (en) Semantic type recognition method and device, computer readable medium and electronic equipment
CN117649234A (en) Abnormal object identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination