CN111611852A - Method, device and equipment for training expression recognition model - Google Patents

Method, device and equipment for training expression recognition model

Info

Publication number
CN111611852A
Authority
CN
China
Prior art keywords
images
image
expression
model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010281184.9A
Other languages
Chinese (zh)
Inventor
罗达新
万单盼
刘毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202010281184.9A
Publication of CN111611852A
Current legal status: Withdrawn

Classifications

    • G06V 40/168: Feature extraction; Face representation (human faces)
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045: Combinations of networks (neural network architectures)
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 40/172: Classification, e.g. identification (human faces)
    • G06V 40/174: Facial expression recognition
    • G06V 40/176: Dynamic expression

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

A training method, device, and equipment for an expression recognition model, relating to the technical field of artificial intelligence, which can improve the stability and accuracy of the recognition results of an expression recognition model. The method includes: generating, from M first images, N second images in which the face undergoes a slight change but keeps the same expression as the original first images; inputting the M first images and the N second images into a pre-training model respectively to obtain first expression prediction results of the M first images and second expression prediction results of the N second images; calculating, for each of the M first images, the distance between its first expression prediction result and the second expression prediction result of the corresponding second image; and, if the distances between the first expression prediction results of P images in the M first images and the second expression prediction results of the corresponding second images are not larger than a threshold value, training the pre-training model by using the first expression prediction results of the P images to obtain an expression recognition model.

Description

Method, device and equipment for training expression recognition model
Technical Field
The application relates to the field of artificial intelligence, in particular to a method, a device and equipment for training an expression recognition model.
Background
Facial expression recognition is a basic problem in computer vision, has broad application prospects in the current market, and has attracted considerable attention. In application scenarios involving human-computer interaction, such as nursing and online courses, facial expression recognition can be used to track the user's state in real time, enabling timely feedback and effective evaluation of service quality, and therefore plays a very important role.
However, deep-learning-based approaches tend to face challenges in robustness and stability. For example, a smiling expression is still a smiling expression after a slight change such as a slight drop of the mouth corner or a slight rise of the eye corner, i.e., the expressions before and after the slight change should be recognized by the algorithm as the same expression. However, deep learning algorithms often fail to achieve this goal: small changes of the face such as these have a large influence on the recognition result, so the accuracy of the facial expression recognition algorithm is low.
Disclosure of Invention
The method, the device and the equipment for training the expression recognition model can improve the stability and accuracy of the recognition result of the expression recognition model.
In order to achieve the above object, the embodiments of the present application provide the following technical solutions:
in a first aspect, a method for training an expression recognition model is provided, including: generating N second images according to M first images containing human faces, wherein each image in the M first images corresponds to one or more of the N second images; the face of each second image in the N second images is changed compared with the face of the corresponding first image, and the face of each second image in the N second images and the face of the corresponding first image belong to the same expression; respectively inputting the M first images and the N second images into a pre-training model to obtain M first expression prediction results of the M first images and N second expression prediction results of the N second images; respectively calculating the distance between a first expression prediction result of each image in the M first images and a second expression prediction result of the corresponding second image; if the distance between the first expression prediction results of the P images in the M first images and the second expression prediction results of the corresponding second images is smaller than or equal to a threshold value, training a pre-training model by using the first expression prediction results of the P images in the M first images to obtain a target model; wherein M, N, P is a positive integer and P is less than or equal to M.
That is, from a training image containing a human face, referred to as the first image, a second image is generated in which the face is slightly changed and the expression is of the same type as that of the original first image. Then, the first image and the corresponding second image are input into a pre-training model to obtain the prediction results of the two. Next, according to the distance between the two prediction results, first images whose prediction results fluctuate strongly are identified and regarded as unreliable, and their prediction results are regarded as unreliable prediction results. A first image whose prediction result fluctuates strongly means that its prediction result differs greatly from the prediction result of the sample generated from it. Conversely, a first image whose prediction result fluctuates little is a reliable first image, and its prediction result is a reliable prediction result. The pre-training model is then further trained using the reliable prediction results to obtain the expression recognition model.
Therefore, the expression recognition model obtained by the training method of the embodiment of the present application does not show large fluctuations in its recognition result when the face changes slightly: two images that differ only by a slight change of the face are recognized as the same type of expression. That is, the stability of the expression recognition model is improved, and the accuracy of its recognition is improved.
In a possible implementation manner, the threshold is an average value or a median value of distances between the first expression prediction result of each of the M first images and the second expression prediction result of the corresponding second image.
In one possible implementation, the distance is a euclidean distance or a divergence.
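For illustration only, the sample-selection step of the first aspect might be sketched roughly as follows in Python (PyTorch). The function and tensor names are invented for this sketch; the Euclidean distance, the batch mean as the threshold, and the use of the first expression prediction results as hard pseudo-labels are assumptions drawn from the possible implementations above, not a definitive implementation of the claimed method.

    import torch
    import torch.nn.functional as F

    def train_one_batch(pretrained_model, optimizer, first_images, second_images):
        # first_images:  (M, C, H, W) original training images containing faces
        # second_images: (M, C, H, W) generated images; one per first image here for simplicity
        logits_first = pretrained_model(first_images)
        logits_second = pretrained_model(second_images)

        # First / second expression prediction results as probability vectors.
        pred_first = F.softmax(logits_first, dim=1)
        pred_second = F.softmax(logits_second, dim=1)

        # Distance between the prediction result of each first image and that of its
        # corresponding second image; Euclidean distance is one of the options named above.
        dist = torch.norm(pred_first - pred_second, dim=1)

        # Threshold: the mean of the distances in the batch (a median would also fit).
        threshold = dist.mean()
        reliable = dist <= threshold          # the P reliable first images
        if reliable.sum() == 0:
            return

        # Train with the first expression prediction results of the P reliable images;
        # turning them into hard pseudo-labels is an assumption of this sketch.
        pseudo_labels = pred_first[reliable].detach().argmax(dim=1)
        loss = F.cross_entropy(logits_first[reliable], pseudo_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In this sketch the threshold is recomputed for every batch, which matches the per-batch description below; a threshold shared across batches would be an equally valid reading.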
In one possible implementation, generating N second images from M first images containing faces includes: inputting the M first images containing human faces into a generation model to obtain the N second images; the generation model is an auto-encoder model with a ResBlock structure, and/or the generation model is an auto-encoder model in which the decoder has more neural network layers than the encoder.
It should be noted that, if a generation model obtained by training a conventional self-encoder model is used directly, the images it generates are blurred. Therefore, in order to improve the ability of the generation model to reconstruct a new image from the input image and to improve the sharpness of the reconstructed image, the embodiments of the present application optimize the network structure of the conventional self-encoder model.
In one example, the convolutional layers in a conventional self-encoder model (including the encoder module and the decoder module) can be replaced by ResBlock structures, which increases the complexity of the model and improves the sharpness of the images it generates. In another example, one or more identical convolutional layers may be added after each convolutional layer of the decoder module in the conventional self-encoder model. In yet another example, the convolutional layers in the conventional self-encoder model (including the encoder module and the decoder module) may be replaced by ResBlock structures, and then one or more identical ResBlock structures may be added after each ResBlock structure in the decoder module.
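A minimal sketch of such an optimized self-encoder, assuming concrete channel counts, kernel sizes, and a noise level that the text does not specify; the convolutional layers are replaced by ResBlock structures and the decoder carries extra ResBlocks so that it has more layers than the encoder:

    import torch
    import torch.nn as nn

    class ResBlock(nn.Module):
        # Residual block used in place of a plain convolutional layer.
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels))

        def forward(self, x):
            return torch.relu(x + self.body(x))

    class AuxGenerator(nn.Module):
        # Encoder -> noise -> decoder; the decoder has more layers than the encoder.
        # Channel counts, kernel sizes and the noise level are illustrative assumptions.
        def __init__(self, channels=64, noise_std=0.1):
            super().__init__()
            self.noise_std = noise_std
            self.encoder = nn.Sequential(
                nn.Conv2d(3, channels, 4, stride=2, padding=1), ResBlock(channels),
                nn.Conv2d(channels, channels * 2, 4, stride=2, padding=1), ResBlock(channels * 2))
            self.decoder = nn.Sequential(
                ResBlock(channels * 2), ResBlock(channels * 2),   # extra ResBlocks per stage
                nn.ConvTranspose2d(channels * 2, channels, 4, stride=2, padding=1),
                ResBlock(channels), ResBlock(channels),
                nn.ConvTranspose2d(channels, 3, 4, stride=2, padding=1), nn.Sigmoid())

        def forward(self, x):
            z = self.encoder(x)
            # Small perturbation of the encoded representation (here a feature map rather
            # than a flat vector, which is a simplification of this sketch).
            z = z + self.noise_std * torch.randn_like(z)
            return self.decoder(z)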
In a possible implementation, the method further includes: in the process of training the generative model, the two-norm loss and the perceptual loss are used as constraints.
For example, in the process of training the generation model, in order to make the image output by the generation model consistent with the originally input image sample, the two-norm loss may first be adopted as a constraint, so as to quickly make the output image of the generation model close to the input image; however, this may cause the output image to be blurred. Therefore, after a reasonably good generation model has been obtained through training, the perceptual loss may be adopted as a constraint to improve the quality of the images generated by the generation model. Meanwhile, a discriminator is added for adversarial training, which helps to improve the realism of the images generated by the generation model.
In one example, when the perceptual loss is calculated, an expression recognition model may first be trained using the prior art, and the perceptual loss is then calculated with this pre-trained expression recognition model, so that the output image and the input image of the generation model belong to the same type of expression and the loss is not affected by expression-unrelated factors.
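The constraints described above could be combined roughly as in the following sketch; the staged switch from the two-norm loss to the perceptual loss, the feature extractor taken from the pre-trained expression recognition model, and the binary cross-entropy adversarial term are assumptions of this sketch:

    import torch
    import torch.nn.functional as F

    def generator_loss(generator, discriminator, expr_features, x, stage):
        # x: batch of input face images; expr_features: feature extractor taken from a
        # pre-trained expression recognition model; stage: "l2" early in training, then "perceptual".
        y = generator(x)

        if stage == "l2":
            # Two-norm constraint: quickly pulls the output towards the input, but tends to blur.
            recon = F.mse_loss(y, x)
        else:
            # Perceptual constraint: compare features from the pre-trained expression
            # recognition model, so input and output stay the same type of expression.
            recon = F.mse_loss(expr_features(y), expr_features(x).detach())

        # Adversarial term: the discriminator should judge the generated image as real.
        d_out = discriminator(y)
        adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
        return recon + adv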
In a second aspect, a method for facial expression recognition is provided, including: receiving an input image of an expression to be recognized, performing expression recognition on the image of the expression to be recognized by using an expression recognition model, and outputting an expression category corresponding to the image of the expression to be recognized; the expression recognition model is obtained by iterative training according to a first image which is input in batches and contains a human face and a pre-training model; wherein each batch comprises M first images; in the training process of each batch, generating N second images according to M first images, wherein each image in the M first images corresponds to one or more of the N second images; the face of each second image in the N second images is changed compared with the face of the corresponding first image, and the face of each second image in the N second images and the face of the corresponding first image belong to the same expression; respectively inputting the M first images and the N second images into a pre-training model to obtain M first expression prediction results of the M first images and N second expression prediction results of the N second images; respectively calculating the distance between a first expression prediction result of each image in the M first images and a second expression prediction result of the corresponding second image; if the distance between the first expression prediction result of the P images in the M first images and the second expression prediction result of the corresponding second image is less than or equal to the threshold, the pre-training model is trained by using the first expression prediction results of the P images in the M first images, wherein M, N, P is a positive integer, and P is less than or equal to M.
In a possible implementation manner, the threshold is an average value or a median value of distances between the first expression prediction result of each of the M first images and the second expression prediction result of the corresponding second image.
In one possible implementation, the distance is a euclidean distance or a divergence.
In one possible implementation, generating N second images from M first images includes: inputting the M first images into a generation model to obtain the N second images; the generation model is an auto-encoder model with a ResBlock structure, and/or the generation model is an auto-encoder model in which the decoder has more neural network layers than the encoder.
In a possible implementation, the method further includes: in the process of training the generative model, the two-norm loss and the perceptual loss are used as constraints.
The third aspect provides a training device for an expression recognition model, which comprises a generating unit, a calculating unit and a training unit; the generating unit is used for generating N second images according to M first images containing human faces, wherein each image in the M first images corresponds to one or more of the N second images; the face of each second image in the N second images is changed compared with the face of the corresponding first image, and the face of each second image in the N second images and the face of the corresponding first image belong to the same expression; the computing unit is used for respectively inputting the M first images and the N second images into a pre-training model to obtain M first expression prediction results of the M first images and N second expression prediction results of the N second images; the computing unit is further used for respectively computing the distance between the first expression prediction result of each image in the M first images and the second expression prediction result of the corresponding second image; the training unit is used for training a pre-training model by using the first expression prediction results of the P images in the M first images to obtain a target model if the distance between the first expression prediction results of the P images in the M first images and the second expression prediction results of the corresponding second images is smaller than or equal to a threshold value; wherein M, N, P is a positive integer and P is less than or equal to M.
In a possible implementation manner, the threshold is an average value or a median value of distances between the first expression prediction result of each of the M first images and the second expression prediction result of the corresponding second image.
In one possible implementation, the distance is a euclidean distance or a divergence.
In one possible implementation, generating N second images from M first images containing faces includes: inputting the M first images containing human faces into a generation model to obtain the N second images; the generation model is an auto-encoder model with a ResBlock structure, and/or the generation model is an auto-encoder model in which the decoder has more neural network layers than the encoder.
In a possible implementation manner, the training unit is further configured to use the two-norm loss and the perceptual loss for constraint in the process of training the generative model.
The fourth aspect is an expression recognition device, which comprises a receiving unit, a calculating unit and an output unit, wherein the receiving unit is used for receiving an input image of an expression to be recognized; the computing unit is used for carrying out expression recognition on the image of the expression to be recognized received by the receiving unit by utilizing the expression recognition model; the output unit is used for outputting the expression category corresponding to the image of the expression to be recognized, which is recognized by the calculation unit; the expression recognition model is obtained by iterative training according to a first image which is input in batches and contains a human face and a pre-training model; wherein each batch comprises M first images; in the training process of each batch, generating N second images according to M first images, wherein each image in the M first images corresponds to one or more of the N second images; the face of each second image in the N second images is changed compared with the face of the corresponding first image, and the face of each second image in the N second images and the face of the corresponding first image belong to the same expression; respectively inputting the M first images and the N second images into a pre-training model to obtain M first expression prediction results of the M first images and N second expression prediction results of the N second images; respectively calculating the distance between a first expression prediction result of each image in the M first images and a second expression prediction result of the corresponding second image; if the distance between the first expression prediction result of the P images in the M first images and the second expression prediction result of the corresponding second image is less than or equal to the threshold, the pre-training model is trained by using the first expression prediction results of the P images in the M first images, wherein M, N, P is a positive integer, and P is less than or equal to M.
In a possible implementation manner, the threshold is an average value or a median value of distances between the first expression prediction result of each of the M first images and the second expression prediction result of the corresponding second image.
In one possible implementation, the distance is a euclidean distance or a divergence.
In one possible implementation, generating N second images from M first images includes: inputting the M first images into a generation model to obtain the N second images; the generation model is an auto-encoder model with a ResBlock structure, and/or the generation model is an auto-encoder model in which the decoder has more neural network layers than the encoder.
In a possible implementation manner, the computing unit is further configured to perform constraint using the two-norm loss and the perceptual loss in the process of training the generative model.
A fifth aspect provides a computer readable storage medium comprising computer instructions which, when run on a mobile terminal, cause the mobile terminal to perform the method as described in the above aspect and any one of its possible implementations.
A sixth aspect provides a computer program product for causing a computer to perform the method as described in the above aspects and any one of the possible implementations when the computer program product runs on the computer.
A seventh aspect provides a server or a terminal, including: a processor and a memory coupled to the processor, the memory for storing computer program code, the computer program code comprising computer instructions that, when read by the processor from the memory, cause the server or terminal to perform the method as described in the above aspects and any possible implementation thereof.
In an eighth aspect, a chip system is provided, which includes a processor, and when the processor executes the instructions, the processor executes the method as described in the above aspects and any one of the possible implementations.
Drawings
Fig. 1 is a schematic structural diagram of a data processing system according to an embodiment of the present application;
FIG. 2A is a block diagram of another data processing system according to an embodiment of the present application;
fig. 2B is a schematic network structure diagram of an expression recognition model according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a chip system according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a training apparatus provided in an embodiment of the present application;
fig. 5 is a schematic flowchart of a training method for an expression recognition model according to an embodiment of the present disclosure;
FIG. 6A is a schematic diagram of a network structure of a prior-art auto-encoder;
fig. 6B is a schematic diagram of a network structure of an optimized auto-encoder according to an embodiment of the present application;
fig. 7 is a schematic network structure diagram of an expression recognition model according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a sample selection model/algorithm provided in an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an exercise device according to an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of an identification device according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a chip system according to an embodiment of the present disclosure.
Detailed Description
In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
Herein, the term "and/or" is only one kind of association relationship describing an associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
The terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present application, "a plurality" means two or more unless otherwise specified.
For ease of understanding, the following description will first discuss relevant terms and concepts of a neural network to which embodiments of the present application may relate.
(1) Neural network
The neural network may be composed of neural units. A neural unit may refer to an operation unit that takes $x_s$ and an intercept of 1 as inputs, and whose output may be:

$$h_{W,b}(x) = f\left(W^{T} x\right) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining together a plurality of such single neural units, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.
(2) Deep neural network
Deep Neural Networks (DNNs), also called multi-layer neural networks, can be understood as neural networks with multiple hidden layers. According to the positions of the different layers, the layers inside a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected, that is, any neuron at the ith layer is necessarily connected to any neuron at the (i+1)th layer.
Although a DNN appears complex, the work of each layer is not complex; each layer simply performs the following linear relational expression:

$$\vec{y} = \alpha\left(W \vec{x} + \vec{b}\right)$$

where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is an offset vector, $W$ is a weight matrix (also called coefficients), and $\alpha(\cdot)$ is an activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, the numbers of coefficients $W$ and offset vectors $\vec{b}$ are also large. These parameters are defined in the DNN as follows, taking the coefficient $W$ as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$. The superscript 3 represents the layer in which the coefficient $W$ is located, while the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the kth neuron at layer $L-1$ to the jth neuron at layer $L$ is defined as $W^{L}_{jk}$. Note that the input layer has no $W$ parameter. In deep neural networks, more hidden layers enable the network to better depict complex situations in the real world. Theoretically, a model with more parameters has higher complexity and a larger "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is thus the process of learning the weight matrices, and its final goal is to obtain the weight matrices (formed by the vectors $W$ of many layers) of all layers of the trained deep neural network.
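As a small illustration of the per-layer operation described above (with arbitrary, assumed layer sizes and ReLU chosen as the activation), the computation can be written in Python with NumPy:

    import numpy as np

    def alpha(v):                      # activation function; ReLU chosen arbitrarily here
        return np.maximum(v, 0.0)

    rng = np.random.default_rng(0)     # a three-layer DNN with assumed sizes 4 -> 5 -> 3
    W2 = rng.standard_normal((5, 4))   # W2[j, k] corresponds to W^2_{jk}
    b2 = np.zeros(5)
    W3 = rng.standard_normal((3, 5))   # W3[1, 3] corresponds to W^3_{24} (0-based indexing)
    b3 = np.zeros(3)

    x = rng.standard_normal(4)         # input vector (the input layer has no W parameter)
    h = alpha(W2 @ x + b2)             # y = alpha(W x + b) for the hidden layer
    y = alpha(W3 @ h + b3)             # and again for the output layer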
(3) Convolutional neural network
Convolutional neural networks (CNN) are a type of deep neural network with a convolutional structure. A convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter, and the convolution process may be viewed as convolving an input image or a convolved feature plane (feature map) with a trainable filter. A convolutional layer is a neuron layer that performs convolution processing on the input signal in a convolutional neural network. In a convolutional layer of a convolutional neural network, one neuron may be connected to only a portion of the neurons of the neighboring layer. A convolutional layer usually contains several feature planes, and each feature plane may be composed of several neural units arranged in a rectangle. The neural units of the same feature plane share weights, and the shared weights are the convolution kernel. Weight sharing can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of the other parts, which means that image information learned in one part can also be used in another part, so the same learned image information can be used for all positions on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information; generally, the more convolution kernels there are, the richer the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
(4) Loss function
In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is really desired to be predicted, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the really desired target value (of course, there is usually an initialization process before the first update, i.e., parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to lower it, and the adjustment continues until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is done with loss functions or objective functions, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
(5) Back propagation algorithm
A neural network can adopt a back propagation (BP) algorithm during training to adjust the values of the parameters in the initial neural network model, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the input signal is propagated forward until the output produces an error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss, and aims to obtain the optimal parameters of the neural network model, such as the weight matrices.
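For illustration, a toy loss-function-plus-back-propagation update in Python; the model, data, learning rate, and number of iterations are arbitrary assumptions of this sketch:

    import torch
    import torch.nn as nn

    # Toy model and data; all sizes and the learning rate are arbitrary assumptions.
    model = nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(16, 4)
    target = torch.randn(16, 1)

    loss_fn = nn.MSELoss()                 # loss function: measures predicted-vs-target difference
    for _ in range(100):
        loss = loss_fn(model(x), target)   # forward pass produces the error loss
        optimizer.zero_grad()
        loss.backward()                    # back-propagate the error loss information
        optimizer.step()                   # update parameters so that the loss converges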
(6) Facial expression recognition technology
Facial expression recognition technology is a branch of facial recognition technology. With the development of artificial intelligence and computer technology, expression recognition technology can determine the psychological mood of a recognized subject by extracting a specific expression state from a static image or a dynamic video sequence. Enabling a computer to understand and recognize facial expressions in this way can fundamentally change the relationship between humans and computers, allow computers to better serve people, and achieve better human-computer interaction.
Fig. 1 is a schematic structural diagram of a data processing system 100 according to an embodiment of the present disclosure. The data processing system 100 comprises at least one terminal 11 and at least one server 12. The terminal 11 and the server 12 establish a communication connection through one or more networks. The network may be a Local Area Network (LAN) or a Wide Area Network (WAN), such as the internet. The network may be implemented using any known network communication protocol, which may be any of a variety of wired or wireless communication protocols, such as Ethernet, Universal Serial Bus (USB), FireWire, any cellular communication protocol (e.g., 3G/4G/5G), Bluetooth, Wireless Fidelity (Wi-Fi), NFC, or any other suitable communication protocol.
The server 12 may be a device or a server having a data processing function, such as a cloud server, a web server, an application server, or a management server. The terminal 11 may be, for example, a mobile phone, a tablet computer, a Personal Computer (PC), a Personal Digital Assistant (PDA), a smart watch, a netbook, a wearable electronic device, an Augmented Reality (AR) device, a Virtual Reality (VR) device, an in-vehicle device, a smart car, a smart speaker, a robot, or the like.
In some embodiments of the present application, the server 12 obtains a large amount of training data including facial expressions in advance, and obtains an expression recognition model by training in a machine learning/deep learning manner based on this training data. The expression recognition model can be used to predict the expression of a face image to obtain the expression classification result corresponding to the face image, i.e., to realize expression recognition on the face image. The expression classification result is, for example, "anger", "happy", "sadness", "surprise", "disgust", "fear", and so on. Then, the server 12 performs corresponding operations according to the recognized expression classification result. For example, the server 12 may recommend different content to the user according to the recognized expression classification result. For another example, the server 12 may send the recognized expression classification result to the terminal 11, and the terminal 11 presents the result to the user, or the terminal 11 recommends different content to the user according to the recognized expression classification result.
The expression recognition model can be applied to fields such as psychology, intelligent robots, intelligent monitoring, virtual reality, and composite animation. For example, in the retail industry, the expression of a customer can be identified through an expression recognition model to learn the customer's preference for goods, or suitable advertisements can be intelligently recommended through expression recognition, so as to realize accurate marketing. For another example, in games, the expression recognition model can recognize the user's expressions during play to enrich the human-computer interaction experience. For another example, the method can also be applied to teaching: online education can monitor how students follow a class by recognizing the students' expressions through the expression recognition model. For another example, in traffic safety, the state of the driver can be recognized through the expression recognition model, which can effectively reduce the occurrence of fatigue driving and the like.
The training method provided by the embodiments of the present application is applied precisely to this training process of the expression recognition model. That is, in this embodiment, the server 12 may perform the training method of the embodiments of the present application.
In other embodiments of the present application, after the server 12 obtains the expression recognition model through training, the server may also send the expression recognition model to the terminal 11. In this way, after receiving the image of the expression to be recognized input by the user or captured by the terminal 11, the terminal may directly input the image of the expression to be recognized into the expression recognition model for processing, so as to obtain the expression classification corresponding to the image.
In this embodiment, the server 12 may likewise perform the training method of the embodiments of the present application.
In still other embodiments of the present application, the entire data processing system 100 may not include server 12. That is, the terminal 11 may acquire a large amount of training data including facial expressions, and train out an expression recognition model based on the training data. Then, after receiving the image of the expression to be recognized input by the user or shot by the terminal 11, the terminal may directly input the image of the expression to be recognized into the expression recognition model for processing, so as to obtain the expression classification corresponding to the image.
In this embodiment, the terminal 11 may execute the training method of the embodiment of the present application.
Referring to FIG. 2A, another data processing system 200 according to an embodiment of the present application is shown. The data processing system 200 includes an execution device 210, a training device 220, a database 230, a client device 240, a data storage system 250, a data collection device 260, and the like. It is to be understood that the illustrated architecture of the embodiments of the present application does not constitute a specific limitation on the data processing system 200. In other embodiments of the present application, data processing system 200 may include more or fewer devices than shown, or some devices may be combined, some devices may be split, etc.
Wherein the data acquisition device 260 is used to acquire training data. For example, in the embodiment of the present application, the training data may be a training image for training an expression recognition model, and may be, for example, a photo or a video including a human face.
After the training data is collected, the data collection device 260 stores the training data in the database 230, and the training device 220 trains the prediction model/rule 201 based on the training data maintained in the database 230. Specifically, the training device 220 denotes an acquired training image containing a human face as a first image and generates a second image in which the face changes slightly. Then, the first image and the corresponding second image are input into a pre-training model to obtain the prediction results of the two. Next, according to the distance between the two prediction results, first images whose prediction results fluctuate strongly are identified and regarded as unreliable, and their prediction results are regarded as unreliable prediction results. A first image whose prediction result fluctuates strongly means that its prediction result differs greatly from the prediction result of the sample generated from it. Conversely, a first image whose prediction result fluctuates little is a reliable first image, and its prediction result is a reliable prediction result. The pre-training model is then further trained using the reliable prediction results to obtain the prediction model/rule 201. In the embodiment of the present application, the prediction model/rule 201 can be used to identify the expression of a face in an image and is also referred to as an expression recognition model. How the training device 220 derives the prediction model/rule 201 based on the training data will be described in more detail later in connection with fig. 5.
Therefore, the prediction model/rule 201 obtained by the training method provided in the embodiment of the present application does not show large fluctuations in its recognition result when the face changes slightly: two images that differ only by a slight change of the face are recognized as the same type of expression. That is, the stability of the prediction model/rule 201 is improved, and the accuracy of its recognition is improved.
As shown in fig. 2B, the prediction model/rule 201 may be a CNN model. The CNN model includes an input layer, intermediate layers, and an output layer. The input layer can process multidimensional data; taking image processing as an example, the input layer may receive the pixel values of an image (a three-dimensional array), that is, the two-dimensional pixel plane and the values of the RGB channels. The intermediate layers include one or more convolutional layers and one or more pooling layers, where one or more convolutional layers are usually followed by a pooling layer. Optionally, the intermediate layers may include one or more fully-connected layers. The structure and working principle of the output layer are the same as the output of a traditional feedforward neural network. For example, for a convolutional neural network used for facial expression classification, the output layer outputs a classification label using a logistic function or a normalized exponential function (softmax function), such as: "anger", "happy", "sadness", "surprise", "disgust", "fear", and the like.
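A minimal PyTorch sketch of such a CNN classifier; the number of layers, channel widths, and the 48x48 input resolution are assumptions for illustration and are not taken from the patent:

    import torch
    import torch.nn as nn

    class ExpressionCNN(nn.Module):
        def __init__(self, num_classes=6):    # e.g. anger, happy, sadness, surprise, disgust, fear
            super().__init__()
            self.features = nn.Sequential(    # intermediate layers: convolution + pooling
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2))
            self.classifier = nn.Sequential(  # fully-connected layers + output layer
                nn.Flatten(), nn.Linear(128 * 6 * 6, 256), nn.ReLU(inplace=True),
                nn.Linear(256, num_classes))

        def forward(self, x):                 # x: (batch, 3, 48, 48) face crops
            return self.classifier(self.features(x))

    # Class probabilities via the normalized exponential (softmax) function.
    model = ExpressionCNN()
    probs = torch.softmax(model(torch.randn(1, 3, 48, 48)), dim=1)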
With reference to fig. 2A, after the prediction model/rule 201 is obtained through training, when the execution device 210 receives an image of an expression to be recognized that is input by the client device 240, the image may undergo relevant preprocessing (which may be performed by the preprocessing module 213 and/or the preprocessing module 214) and then be input into the prediction model/rule 201 for processing, so as to obtain the expression classification corresponding to the image.
It should be noted that, in practical applications, the training data maintained in the database 230 may not necessarily all come from the collection of the data collection device 260, and may also be received from other devices.
It should be noted that, the training device 220 does not necessarily perform the training of the prediction model/rule 201 based on the training data maintained by the database 230, and may also obtain the training data from the cloud or other places for performing the model training, and the above description should not be taken as a limitation to the embodiments of the present application. It should also be noted that at least a portion of the training data maintained in the database 230 may also be used to perform the process of the device 210 processing the image of the expression to be recognized.
In fig. 2A, the execution device 210 is configured with an input/output (I/O) interface 212 for data interaction with external devices. A user may input data to the I/O interface 212 through the client device 240; the input data may include an image of an expression to be recognized.
The preprocessing module 213 and/or the preprocessing module 214 are configured to perform preprocessing according to the input data received by the I/O interface 212, and in this embodiment of the application, the input data may be processed directly by the computing module 211 without the preprocessing module 213 and the preprocessing module 214 (or without one of them). The preprocessing module 213 or the preprocessing module 214 may preprocess all input data, or may preprocess a part of input data.
It should be noted that the preprocessing module 213 and/or the preprocessing module 214 may also be trained in the training device 220. The calculation module 211 may be configured to perform calculations and other related processing on input data from the preprocessing module 213 or the I/O interface 212 according to the prediction model/rule 201 described above.
In the process that the execution device 210 preprocesses the input data or in the process that the calculation module 211 of the execution device 210 executes the calculation or other related processes, the execution device 210 may call the data, the code, and the like in the data storage system 250 for corresponding processes, and may store the data, the instruction, and the like obtained by corresponding processes in the data storage system 250.
Finally, the I/O interface 212 feeds back the processing result (i.e., the classification result of the expression) to the client device 240. Alternatively, the I/O interface 212 feeds back the processing result (i.e., the classification result of the expression) to other devices for corresponding processing. This is not limited in the embodiments of the present application.
In the case shown in fig. 2A, the user may manually give input data (e.g., an image of an expression to be recognized), which may be operated through an interface provided by the I/O interface 212. Alternatively, the client device 240 may automatically send input data (e.g., an image of the expression to be recognized) to the I/O interface 212, and if the client device 240 is required to automatically send the input data to obtain authorization from the user, the user may set corresponding rights in the client device 240. The user can view the result output by the execution device 210 at the client device 240, and the specific presentation form can be display, sound, action, and the like. The client device 240 may also serve as a data collection terminal for collecting input data to the I/O interface 212 and output results from the I/O interface 212 as new training data and storing the new training data in the database 230. Of course, the input data inputted into the I/O interface 212 and the output result outputted from the I/O interface 212 as shown in the figure may be directly stored into the database 230 as new training data by the I/O interface 212 without being collected by the client device 240.
It should be noted that fig. 2A is only a schematic diagram of a system architecture provided in the embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation. For example, in FIG. 2A, the data storage system 250 is an external memory with respect to the execution device 210, in other cases, the data storage system 250 may be disposed in the execution device 210. For another example, the training device 220 and the performing device 210 may be different devices or the same device.
In some examples, client device 240 may be terminal 11 shown in fig. 1; the performing device 210 and the training device 220 may be the server 12 shown in fig. 1. In other examples, the client device 240 and the execution device 210 may be the terminal 11 shown in fig. 1, and the training device 220 may be the server 12 shown in fig. 1. In still other examples, the client device 240, the performance device 210, and the training device 220 may be the terminal 11 shown in FIG. 1.
Fig. 3 is a hardware structure of a chip system provided in an embodiment of the present application, where the chip system includes a neural network processor 300. The chip may be disposed in the execution device 210 shown in fig. 2A to complete the calculation work of the calculation module 211. The chip may also be disposed in the training device 220 as shown in fig. 2A to complete the training work of the training device 220 and output the predictive model/rule 201.
The neural network processor NPU300 is mounted as a coprocessor on a main Central Processing Unit (CPU) (host CPU), and tasks are distributed by the main CPU. The core portion of the NPU300 is an arithmetic circuit 303, and the controller 304 controls the arithmetic circuit 303 to extract data in a memory (weight memory or input memory) and perform an operation.
In some implementations, the arithmetic circuitry 303 includes a plurality of processing units (PEs) internally. In some implementations, the operational circuitry 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 303 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 303 fetches the data corresponding to the matrix B from the weight memory 302 and buffers the data on each PE in the arithmetic circuit 303. The arithmetic circuit 303 takes the matrix a data from the input memory 301 and performs matrix operation with the matrix B, and a partial result or a final result of the obtained matrix is stored in an accumulator 308 (accumulator).
The vector calculation unit 307 may further process the output of the operation circuit 303, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 307 may be used for network calculation of a non-convolution/non-FC layer in a neural network, such as pooling (Pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector calculation unit 307 can store the processed output vector to the unified memory 306. For example, the vector calculation unit 307 may apply a non-linear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate the activation value.
In some implementations, the vector calculation unit 307 generates normalized values, combined values, or both.
In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuitry 303, e.g., for use in subsequent layers in a neural network.
The unified memory 306 is used to store input data as well as output data. A storage unit access controller 305 (direct memory access controller, DMAC) transfers input data in the external memory to the input memory 301 and/or the unified memory 306, transfers weight data in the external memory to the weight memory 302, and transfers data in the unified memory 306 to the external memory.
A bus interface unit 310 (BIU) is used to implement interaction among the main CPU, the DMAC, and the instruction fetch buffer 309 through a bus.
An instruction fetch buffer 309 connected to the controller 304 is used to store instructions used by the controller 304. The controller 304 is configured to invoke the instructions cached in the instruction fetch buffer 309, so as to control the working process of the operation accelerator.
Generally, the unified memory 306, the input memory 301, the weight memory 302, and the instruction fetch memory 309 are On-Chip (On-Chip) memories, the external memory is a memory outside the NPU, and the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a High Bandwidth Memory (HBM), or other readable and writable memories.
Among them, the operations of the layers in the prediction model/rule 201 shown in fig. 2B may be performed by the operation circuit 303 or the vector calculation unit 307.
Optionally, the method provided in each embodiment of the present application may be processed by a CPU, or may be processed by the CPU and a GPU together, or may use other processors suitable for neural network computing instead of the GPU, which is not limited in the present application.
At present, in the task of facial expression recognition, the collection and labeling of training data are subjective, and the training data also suffer from problems such as data imbalance and small data size. These problems cause the recognition results of a facial expression recognition model trained in this way to fluctuate considerably. That is, two images that differ only by a slight change of the face are recognized as different expressions by an expression recognition model trained with existing methods, whereas in reality the two images should be recognized as the same type of expression.
Therefore, the embodiments of the present application provide a training method for an expression recognition model. As shown in fig. 4, in the embodiment of the present application, a training image containing a human face is first denoted as a first image, and the first image is input into an auxiliary generation model to generate a second image in which the face is slightly changed. Then, the first image and the corresponding second image are input into a pre-training model (i.e., a prediction model) to obtain the prediction results of the first image and the second image. The prediction results of the first image and the second image are then input into a sample selection model/algorithm, which calculates the distance between the two prediction results. Next, according to the distance between the two prediction results, first images whose prediction results fluctuate strongly are identified and regarded as unreliable, and their prediction results are regarded as unreliable prediction results. A first image whose prediction result fluctuates strongly means that its prediction result differs greatly from the prediction result of the sample generated from it. Conversely, a first image whose prediction result fluctuates little is a reliable first image, and its prediction result is a reliable prediction result. The pre-training model is then further trained using the reliable prediction results to obtain an optimized prediction model. The optimized prediction model does not show large fluctuations in its recognition result when the face changes slightly: two images that differ only by a slight change of the face are recognized as the same type of expression. That is, the training method provided by the embodiments of the present application can improve the stability of the expression recognition model and the accuracy of its recognition.
The following describes in detail a training method of an expression recognition model provided in an embodiment of the present application with reference to the accompanying drawings.
As shown in fig. 5, a schematic flow chart of a training method for an expression recognition model provided in an embodiment of the present application specifically includes:
s501, generating N second images according to the M first images containing the human faces.
Each of the M first images corresponds to one or more of the N second images, the face of each of the N second images changes compared with the face of the corresponding first image, and the face of each of the N second images and the face of the corresponding first image belong to the same type of expression.
In some embodiments, the training device 220 acquires a large number of first images containing faces as training data for the expression recognition model. The acquired first images are input into the auxiliary generation model in batches, and one or more second images are generated for each first image. The auxiliary generation model is used to fine-tune the facial expression in the first image to obtain the one or more second images corresponding to the first image. For example, if the expression of the face in a certain first image is a smile, the facial expression can be finely adjusted, for example, by slightly lowering the corner of the mouth or slightly raising the corner of the eye, to generate a second image corresponding to the first image in which the face is still smiling. That is, the second image generated from the first image is of the same type of expression as the first image. The description here takes inputting M first images per batch as an example. It should be noted that the number M of first images input per batch may be a constant value or a variable value; that is, the number of first images input in each batch may be the same or different, which is not limited in the embodiments of the present application.
In a specific implementation, the auxiliary generation model may be obtained by training a generative adversarial network (GAN) model in advance. The generator of the generative adversarial network may be built from an Auto-encoder (AE), a Variational Auto-encoder (VAE), a Denoising Auto-encoder (DAE), a Contractive Auto-encoder (CAE), a Sparse Auto-encoder (SAE), or the like.
Here, the method of obtaining the auxiliary generation model by training a self-encoder is described as an example. When training the auxiliary generation model, the self-encoder model used includes a generator (Generator) and a discriminator (Discriminator). The generator comprises an encoder module, a noise module, and a decoder module. The encoder module performs dimensionality reduction on the input training image to obtain a low-dimensional vector. The noise module then applies noise to the low-dimensional vector, acting as a small perturbation that alters the low-dimensional vector. The decoder module then generates, from the noisy low-dimensional vector, an image whose dimensions are consistent with those of the input training image. The generated image is input to the discriminator, which judges whether the generated image is a real image. For example, if the input training image is an image containing a human face, the generated image should also be a human face image; if the generated image is not a normal human face image, the discriminator judges that the generated image is unreal.
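Purely as an illustration of the generator/discriminator structure described above, the following PyTorch-style sketch assumes 64 x 64 RGB inputs; the layer sizes, latent dimension, and noise level are arbitrary assumptions, not values taken from this application.

import torch
import torch.nn as nn

class Generator(nn.Module):
    """Encoder module -> noise module -> decoder module (illustrative sketch)."""
    def __init__(self, latent_dim=128, noise_std=0.1):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Sequential(          # reduces the input image to a low-dimensional vector
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(          # maps the (noisy) vector back to the input image size
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        z = z + self.noise_std * torch.randn_like(z)   # noise module: small perturbation of the latent vector
        return self.decoder(z)

class Discriminator(nn.Module):
    """Judges whether a generated face image looks real (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 1),
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))      # probability that the input image is real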
It should be noted that if an auxiliary generation model obtained by training a conventional self-encoder model is used directly, the images generated by the auxiliary generation model are blurred. Therefore, in order to improve the ability of the auxiliary generation model to reconstruct a new image from an input image and to improve the definition of the reconstructed image, the embodiment of the present application optimizes the network structure of the conventional self-encoder model used.
In one example, convolutional layers in a conventional self-encoder model (including an encoder module and a decoder module) can be replaced by a ResBlock structure, so that the complexity of the model is increased and the definition of images generated by the model is improved. In another example, one or more layers of the same convolutional layer may be added after the convolutional layer of each layer of the decoder module in the conventional self-encoder model. In yet another example, the convolutional layer in the conventional self-encoder model (including the encoder module and the decoder module) may be replaced with a ResBlock structure, and then one or more layers of the same ResBlock structure may be added after the ResBlock structure of each layer in the decoder module.
For example, fig. 6A shows a conventional self-encoder model. It can be seen that in this model the encoder module is symmetric to the decoder module; that is, the encoder module and the decoder module include the same number of neural network layers. Fig. 6B shows the optimized self-encoder model provided in the embodiment of the present application. It can be seen that, in the decoder module, one or more identical ResBlock structures are added after each ResBlock layer. Therefore, the encoder module and the decoder module in the optimized self-encoder model are no longer symmetric; that is, the number of neural network layers in the decoder module is greater than the number of neural network layers in the encoder module.
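As one possible realization of this asymmetric structure (a sketch under assumed layer counts, not a definitive implementation), each decoder stage can stack one or more extra ResBlocks after its first ResBlock, while each encoder stage keeps a single block:

import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block used in place of a plain convolutional layer (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.body(x))

def make_decoder_stage(channels, extra_blocks=1):
    """Decoder stage: upsampling, one ResBlock, plus `extra_blocks` identical ResBlocks,
    so the decoder module ends up with more layers than the encoder module."""
    layers = [nn.Upsample(scale_factor=2, mode='nearest'), ResBlock(channels)]
    layers += [ResBlock(channels) for _ in range(extra_blocks)]
    return nn.Sequential(*layers)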
In addition, in the process of training the auxiliary generation model, in order to keep the size of the image output by the auxiliary generation model consistent with the size of the originally input image sample, a two-norm loss may first be used as a constraint to quickly make the input image and the output image of the auxiliary generation model close to each other; however, the output image may then be blurred. Therefore, after a reasonably good auxiliary generation model is obtained through training, a perceptual loss may be used as a constraint to improve the quality of the images generated by the auxiliary generation model. At the same time, a discriminator is added for adversarial training to improve the realism of the images generated by the auxiliary generation model.
The two-norm loss can be calculated using equation (1), as follows:

L_{AE} = \sum_{l=1}^{L} \alpha_l \left\| z_l - \hat{z}_l \right\|_2^2        (1)

In formula (1), z_l is the feature value output by the l-th neural network layer in the encoder module, \hat{z}_l is the corresponding feature value obtained for the output image of the auxiliary generation model, \alpha_l is the weight corresponding to the l-th layer and may be set by a user, by default, or as an empirical value, and L is the number of neural network layers contained in the encoder module.
When the perceptual loss is calculated, an expression recognition model may first be trained in advance using the prior art, and the pre-trained expression recognition model is then used to compute the perceptual loss, so that the output image and the input image of the auxiliary generation model are of the same expression type and the loss is not affected by factors irrelevant to the expression. The perceptual loss can be calculated using equation (2), as follows:

L_{Rec} = \sum_{u} \left( \lambda_{pixel} \left\| I_u - D(E(I_u)) \right\|_2^2 + \lambda_{perc} \left\| F(I_u) - F(D(E(I_u))) \right\|_2^2 \right)        (2)

In formula (2), I_u is the pixel values of the u-th input image, E(·) is the function corresponding to the encoder module in the auxiliary generation model, and D(·) is the function corresponding to the decoder module in the auxiliary generation model. The term \left\| I_u - D(E(I_u)) \right\|_2^2 is the two-norm of the pixel values of the input image and the output image of the auxiliary generation model. F(·) is the function corresponding to the pre-trained expression recognition model, and \left\| F(I_u) - F(D(E(I_u))) \right\|_2^2 is the two-norm of the prediction results obtained after the input image and the output image are respectively fed into the pre-trained expression recognition model. \lambda_{pixel} and \lambda_{perc} are the weights corresponding to the two two-norm terms and may be set by a user, by default, or as empirical values.
When the discriminator is added for adversarial training, the discriminator loss can be calculated using equation (3), as follows:

L_{Adv} = \sum_{u} \log C\big( D(E(I_u)) \big)        (3)

In formula (3), C(·) is the function corresponding to the discriminator and L_{Adv} is the discriminator loss. The larger the value of the discriminator loss, the more realistic the image output by the auxiliary generation model; the smaller the value of the discriminator loss, the less realistic the image output by the auxiliary generation model.
Optionally, after a better auxiliary generation model is obtained through training, the auxiliary generation model may be constrained by using the two-norm loss, the perceptual loss, and the discriminator loss together. The total loss can then be calculated using equation (4), as follows:

L_{total} = \lambda_{AE} L_{AE} + \lambda_{Adv} L_{Adv} + \lambda_{Rec} L_{Rec}        (4)

In formula (4), \lambda_{AE}, \lambda_{Adv}, and \lambda_{Rec} are the weights corresponding to the two-norm loss, the discriminator loss, and the perceptual loss, respectively, and may be set by a user, by default, or as empirical values.
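For illustration only, the following sketch combines the constraints of formulas (1)-(4) in code. The formulations above are reconstructions; here the two-norm term is simplified to a pixel-level mean squared error, the adversarial term is written as the usual generator objective to be minimized, and all weights are placeholder assumptions.

import torch
import torch.nn.functional as F

def generator_total_loss(x, generator, discriminator, expr_model,
                         lambda_ae=1.0, lambda_adv=0.1, lambda_rec=1.0,
                         lambda_pixel=1.0, lambda_perc=0.1):
    """Combine two-norm, perceptual, and adversarial constraints (illustrative sketch)."""
    x_hat = generator(x)                               # image reconstructed by the auxiliary generation model

    # Two-norm (reconstruction) constraint between input and output images.
    l_ae = F.mse_loss(x_hat, x)

    # Perceptual constraint: the pre-trained expression model should predict the same expression.
    with torch.no_grad():
        p_real = expr_model(x)
    p_fake = expr_model(x_hat)
    l_rec = lambda_pixel * F.mse_loss(x_hat, x) + lambda_perc * F.mse_loss(p_fake, p_real)

    # Adversarial constraint: push the discriminator to rate generated faces as real.
    l_adv = -torch.log(discriminator(x_hat) + 1e-8).mean()

    return lambda_ae * l_ae + lambda_adv * l_adv + lambda_rec * l_rec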
The auxiliary generation model obtained by training with the above method can generate, from the existing M first images, N second images in which the expressions are slightly changed, where M and N are positive integers.
Optionally, in other embodiments, before generating the N second images according to the M first images, the training device 220 may further preprocess the acquired first images to reduce the influence of expression-irrelevant variations in the first images, such as different backgrounds, lighting, and head poses, on the trained expression recognition model. For example, face detection may be performed on a first image containing a facial expression, and irrelevant regions such as the background may then be removed. For another example, illumination normalization may be performed on the first image containing the facial expression; illumination normalization methods include, but are not limited to, one or more of isotropic diffusion-based normalization, discrete cosine transform (DCT)-based normalization, and difference of Gaussians (DoG). For another example, head pose normalization may also be performed on the first image containing the facial expression: a 3D texture reference model may be generated according to the key points of the face in the first image, and the reference model may be used to obtain a face image of the first image in a certain orientation, for example, a frontal face image.
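The following is a minimal preprocessing sketch using OpenCV; the Haar-cascade face detector, the DoG parameters, and the output size are assumptions chosen for the example and are not part of this application.

import cv2
import numpy as np

def preprocess_face(image_bgr, sigma_small=1.0, sigma_large=2.0, size=64):
    """Crop the detected face region and apply DoG-based illumination normalization (sketch)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                   # no face found: skip this sample
    x, y, w, h = faces[0]                             # keep the first detected face, drop the background
    face = gray[y:y + h, x:x + w].astype(np.float32) / 255.0

    # Difference of Gaussians: subtract a strongly blurred copy from a lightly blurred one.
    dog = cv2.GaussianBlur(face, (0, 0), sigma_small) - cv2.GaussianBlur(face, (0, 0), sigma_large)
    dog = (dog - dog.mean()) / (dog.std() + 1e-6)     # zero-mean, unit-variance normalization

    return cv2.resize(dog, (size, size))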
S502, inputting the M first images and the N second images into a pre-training model respectively to obtain M first prediction results of the M first images and N second prediction results of the N second images.
In some embodiments, the pre-training model is, for example, an expression recognition model. An expression recognition model may be trained in advance using the prior art, or the expression recognition model that was pre-trained when training the auxiliary generation model may be used directly. The expression recognition model may have the convolutional neural network structure shown in fig. 2B. Further, as shown in fig. 7, the expression recognition model includes a feature extractor and a classifier. The feature extractor is used to extract feature values of the input image, where the feature values are high-dimensional tensors. The classifier is used to calculate, from the feature values of the input image, a probability vector whose dimensionality equals the number of categories to be distinguished; this probability vector is the prediction result of the expression recognition model.
Therefore, M first prediction results of the input M first images and N second prediction results of the input N second images are obtained.
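As a sketch only (the layer configuration and the seven expression categories are assumptions, not details of this application), an expression recognition model split into a feature extractor and a classifier whose output is a probability vector over the categories could look as follows:

import torch.nn as nn

class ExpressionRecognizer(nn.Module):
    """Feature extractor + classifier producing a probability vector (illustrative sketch)."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.feature_extractor = nn.Sequential(        # outputs a high-dimensional feature tensor
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Sequential(               # maps features to per-category probabilities
            nn.Linear(64, num_classes), nn.Softmax(dim=1),
        )

    def forward(self, x):
        return self.classifier(self.feature_extractor(x))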
And S503, respectively calculating the distance between the first prediction result of each image in the M first images and the second prediction result of the corresponding second image.
In some embodiments, after obtaining the M first prediction results of the M first images and the N second prediction results of the N second images, the M first prediction results and the N second prediction results are input into a sample selection model/algorithm.
As shown in fig. 8, the sample selection model/algorithm may calculate the distance between the first prediction result of each image in the M first images of the batch and the second prediction result of the corresponding second image. The distance may be, for example, a Euclidean distance or a divergence.
For example, taking the divergence as the distance, the distance between the first image and the corresponding second image can be calculated using formula (5), as follows:

d = Div\big( F(I_o),\, F(I_{tr}) \big)        (5)

In formula (5), I_o is the pixel values of the first image, I_{tr} is the pixel values of the second image corresponding to the first image, and F(·) is the function corresponding to the pre-training model. Here I_{tr} = D(E(I_o) + N), where E(·) and D(·) are the functions corresponding to the encoder module and the decoder module of the auxiliary generation model, and N is the noise applied to the low-dimensional vector. If the KL divergence is used as the divergence, then

d = KL\big( F(I_o)\, \|\, F(I_{tr}) \big) = \sum_{c} F(I_o)_c \log \frac{F(I_o)_c}{F(I_{tr})_c}
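A small sketch of formula (5) with the KL divergence as the distance; the epsilon is an assumption added only for numerical stability, and the inputs are assumed to be probability vectors produced by the pre-training model.

import torch

def kl_distance(p_first, p_second, eps=1e-8):
    """KL divergence between the prediction result of a first image and that of its second image.
    p_first, p_second: tensors of shape (batch, num_classes) containing probabilities."""
    p = p_first.clamp(min=eps)
    q = p_second.clamp(min=eps)
    return (p * (p.log() - q.log())).sum(dim=1)   # one distance value per (first image, second image) pair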
S504, if the distance between the first prediction results of the P images in the M first images and the second prediction results of the corresponding second images is smaller than or equal to a threshold value, the pre-training model is trained by using the first prediction results of the P images in the M first images to obtain a target model.
First, it should be noted that the first image and the corresponding second image are two images whose facial expressions differ only slightly, and they should belong to the same expression category; that is, the first image and its corresponding second image are expected to be identified as the same expression category. The foregoing description of the process of training the auxiliary generation model has already explained how the perceptual loss is used so that the first image input to the auxiliary generation model and the second image output by it are of the same expression category, and the details are not repeated here.
If the distance between the first prediction result of the P images in the M first images and the second prediction result of the corresponding second image is smaller than or equal to the threshold value, the expression category which indicates that the P images and the second images corresponding to the P images can be identified as the same expression category with the high probability is the same as the expected prediction result. That is, the prediction results of the expression recognition model for the P images are accurate and reliable, and then the loss may be calculated using the prediction results of the P images, or the loss of the prediction results of the P images may be used to train the expression recognition model in a backward propagation manner.
Conversely, the distances between the first prediction results of the other M-P images in the M first images and the second prediction results of the corresponding second images are larger than the threshold, which indicates that the M-P images and their corresponding second images are, with high probability, recognized as different expression categories, contrary to the expected prediction result. That is, the prediction results of the expression recognition model for the M-P images may be incorrect and unreliable, so the loss is not calculated using the prediction results of the M-P images, and the loss of the prediction results of the M-P images is not back-propagated to train the expression recognition model.
The threshold may be set as an empirical value, or may be calculated by using a corresponding formula. For example, the threshold may be set as an average or median of the distances between all the first images of the batch and their corresponding second images.
For example, equation (6) may be used to calculate the average of the distances between all the first images of the batch and their corresponding second images, as follows:

thr = \frac{1}{N_{batch}} \sum_{i=1}^{N_{batch}} d_i        (6)

In formula (6), d_i is the distance between the first prediction result and the second prediction result of the i-th (first image, second image) pair, and N_{batch} is the number of pairs formed by the M first images of the batch and their corresponding second images. For example, if each of the M first images corresponds to n second images, then N_{batch} is equal to n × M; if n is 1, each of the M first images corresponds to only one second image.
For example, an image A is input to the auxiliary generation model, and an image B in which the face is slightly changed is obtained. Image A and image B are expected to be recognized as the same expression category. If, after image A and image B are input into the pre-trained expression recognition model, the prediction result of image A is different from that of image B, the prediction result of image A is considered to be wrong and unreliable, and the prediction result of image A will not participate in the training process of the expression recognition model. For another example, an image C is input to the auxiliary generation model, and an image D in which the face is slightly changed is obtained. Image C and image D are expected to be recognized as the same expression category. If, after image C and image D are input into the pre-trained expression recognition model, the prediction result of image C is the same as that of image D, the prediction result of image C is considered to be correct and reliable, and the prediction result of image C will participate in the training process of the expression recognition model.
In a specific implementation, reliable prediction results can be screened out of the M first prediction results by means of a mask. With continued reference to fig. 8, a mask may be constructed based on the distances between the first prediction results of the first images and the second prediction results of their corresponding second images, together with the threshold. In the mask, positions where the distance between the first prediction result of a first image and the second prediction result of its corresponding second image is greater than the threshold are set to "0", and positions where the distance is less than or equal to the threshold are set to "1". In one example, the total loss of the prediction results of the first images of the batch may be calculated first and then multiplied by the mask to obtain the masked loss, i.e., the loss of the screened first prediction results; the weights of the pre-training model are then updated using a back-propagation algorithm to complete the training iteration for the first images of the batch. In another example, the prediction results of the first images of the batch may be multiplied by the mask to obtain the prediction results of the screened first images, and the loss is then calculated using these prediction results; the weights of the pre-training model are then updated using a back-propagation algorithm to complete the training iteration for the first images of the batch.
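A minimal sketch of the mask-based screening in the first example above (the function name and the cross-entropy loss are assumptions): the mask zeroes the loss contribution of first images whose distance exceeds the threshold, and only the remaining loss is back-propagated.

import torch
import torch.nn.functional as F

def masked_batch_loss(logits_first, labels, distances, threshold):
    """Loss over reliable first images only (illustrative sketch).
    distances: per-sample distance between the first and second prediction results."""
    mask = (distances <= threshold).float()            # 1 = reliable sample, 0 = discarded sample
    per_sample = F.cross_entropy(logits_first, labels, reduction='none')
    return (per_sample * mask).sum() / mask.sum().clamp(min=1.0)

# The threshold can be the batch average of the distances, as in formula (6):
# threshold = distances.mean()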
By analogy, the same method is adopted to complete the processing of the first images of other batches, and multiple training iterations are carried out to finally obtain a target model, namely the optimized expression recognition model.
Therefore, the training method provided in the embodiment of the present application can automatically generate images with slightly changed expressions based on the training images, and then automatically eliminate unreliable prediction results according to the distance between the prediction result of a training image and the prediction result of the corresponding generated image. Because the loss is calculated using the reliable prediction results when the prediction model is trained iteratively, this helps to improve the stability and accuracy of the prediction results of the prediction model.
An embodiment of the present application further provides a training apparatus, as shown in fig. 9, the training apparatus includes a generating unit 901, a calculating unit 902, and a training unit 903. The generating unit 901 is configured to generate N second images according to M first images including faces, where each of the M first images corresponds to one or more of the N second images; the face of each second image in the N second images is changed compared with the face of the corresponding first image, and the face of each second image in the N second images and the face of the corresponding first image belong to the same expression; a calculating unit 902, configured to input the M first images and the N second images into a pre-training model respectively, so as to obtain M first expression prediction results of the M first images and N second expression prediction results of the N second images; the computing unit is further used for respectively computing the distance between the first expression prediction result of each image in the M first images and the second expression prediction result of the corresponding second image; a training unit 903, configured to train a pre-training model using the first expression prediction results of the P images in the M first images to obtain a target model if a distance between the first expression prediction result of the P images in the M first images and the second expression prediction result of the corresponding second image is less than or equal to a threshold; wherein M, N, P is a positive integer and P is less than or equal to M.
All relevant contents of each step related to the above method embodiment may be referred to the functional description of the corresponding functional module, and are not described herein again.
The application embodiment further provides an identification device, as shown in fig. 10, the identification device includes a receiving unit 1001, a calculating unit 1002, and an outputting unit 1003. The receiving unit 1001 is configured to receive an input image of an expression to be recognized; a calculating unit 1002, configured to perform expression recognition on the image of the expression to be recognized, received by the receiving unit 1001, by using an expression recognition model; an output unit 1003, configured to output an expression category corresponding to the image of the expression to be recognized, which is recognized by the calculation unit 1002; the expression recognition model is obtained by iterative training according to a first image which is input in batches and contains a human face and a pre-training model; wherein each batch comprises M first images; in the training process of each batch, generating N second images according to M first images, wherein each image in the M first images corresponds to one or more of the N second images; the face of each second image in the N second images is changed compared with the face of the corresponding first image, and the face of each second image in the N second images and the face of the corresponding first image belong to the same expression; respectively inputting the M first images and the N second images into a pre-training model to obtain M first expression prediction results of the M first images and N second expression prediction results of the N second images; respectively calculating the distance between a first expression prediction result of each image in the M first images and a second expression prediction result of the corresponding second image; if the distance between the first expression prediction result of the P images in the M first images and the second expression prediction result of the corresponding second image is less than or equal to the threshold, the pre-training model is trained by using the first expression prediction results of the P images in the M first images, wherein M, N, P is a positive integer, and P is less than or equal to M.
All relevant contents of each step related to the above method embodiment may be referred to the functional description of the corresponding functional module, and are not described herein again.
Embodiments of the present application further provide a chip system, as shown in fig. 11, where the chip system includes at least one processor 1101 and at least one interface circuit 1102. The processor 1101 and the interface circuit 1102 may be interconnected by wires. For example, interface circuit 1102 may be used to receive signals from other devices, such as a memory of mobile terminal 110. As another example, the interface circuit 1102 may be used to send signals to other devices (e.g., the processor 1101). Illustratively, the interface circuit 1102 may read instructions stored in the memory and send the instructions to the processor 1101. The instructions, when executed by the processor 1101, may cause the mobile terminal to perform the various steps performed by the mobile terminal 110 (e.g., a handset) in the embodiments described above. Of course, the chip system may further include other discrete devices, which is not specifically limited in this embodiment of the present application.
Embodiments of the present application further provide a computer-readable storage medium, which includes computer instructions, and when the computer instructions are executed on a server or a terminal, the server or the terminal is caused to execute any one of the methods in the foregoing embodiments.
The embodiments of the present application also provide a computer program product, which when run on a computer, causes the computer to execute any one of the methods in the above embodiments.
It is to be understood that the above-mentioned terminal and the like include hardware structures and/or software modules corresponding to the respective functions for realizing the above-mentioned functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is performed as hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
In the embodiment of the present application, the terminal and the like may be divided into functional modules according to the method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, the division of the modules in the embodiment of the present invention is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions. For the specific working processes of the system, the apparatus and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.
Each functional unit in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or make a contribution to the prior art, or all or part of the technical solutions may be implemented in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: flash memory, removable hard drive, read only memory, random access memory, magnetic or optical disk, and the like.
The above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (22)

1. A method for training an expression recognition model, the method comprising:
generating N second images according to M first images containing human faces, wherein each image in the M first images corresponds to one or more second images; the face of each second image in the N second images is changed compared with the face of the corresponding first image, and the face of each second image in the N second images and the face of the corresponding first image belong to the same expression;
inputting the M first images and the N second images into a pre-training model respectively to obtain M first expression prediction results of the M first images and N second expression prediction results of the N second images;
respectively calculating the distance between a first expression prediction result of each image in the M first images and a second expression prediction result of the corresponding second image;
if the distance between a first expression prediction result of P images in the M first images and a second expression prediction result of the corresponding second image is smaller than or equal to a threshold value, training the pre-training model by using the first expression prediction results of the P images in the M first images to obtain a target model;
wherein M, N, P is a positive integer and P is less than or equal to M.
2. The method according to claim 1, wherein the threshold is an average value or a median value of distances between the first expression prediction result of each of the M first images and the corresponding second expression prediction result of the second image.
3. The method of claim 1 or 2, wherein the distance is a euclidean distance or a divergence.
4. The method according to any one of claims 1-3, wherein generating N second images from M first images containing faces comprises:
inputting the M first images containing the human faces into a generation model to obtain N second images;
the generation model is a self-encoder model with a ResBlock structure, and/or the generation model is a self-encoder model with a decoder having a neural network layer number larger than that of an encoder.
5. The method of claim 4, further comprising:
in the process of training the generative model, two-norm loss and perception loss are used for constraint.
6. A method of expression recognition, the method comprising:
receiving an input image of an expression to be recognized, performing expression recognition on the image of the expression to be recognized by using an expression recognition model, and outputting an expression category corresponding to the image of the expression to be recognized;
the expression recognition model is obtained by iterative training according to a first image which is input in batches and contains a human face and a pre-training model; wherein each batch comprises M first images;
in the training process of each batch, generating N second images according to the M first images, wherein each image in the M first images corresponds to one or more of the N second images; the face of each second image in the N second images is changed compared with the face of the corresponding first image, and the face of each second image in the N second images and the face of the corresponding first image belong to the same expression; inputting the M first images and the N second images into the pre-training model respectively to obtain M first expression prediction results of the M first images and N second expression prediction results of the N second images; respectively calculating the distance between a first expression prediction result of each image in the M first images and a second expression prediction result of the corresponding second image; if the distance between a first expression prediction result of P images in the M first images and a second expression prediction result of the corresponding second image is less than or equal to a threshold, training the pre-training model using the first expression prediction results of the P images in the M first images, wherein M, N, P is a positive integer and P is less than or equal to M.
7. The method according to claim 6, wherein the threshold is an average value or a median value of distances between the first expression prediction result of each of the M first images and the corresponding second expression prediction result of the second image.
8. The method of claim 6 or 7, wherein the distance is a Euclidean distance or a divergence.
9. The method according to any one of claims 6-8, wherein said generating N second images from said M first images comprises:
inputting the M first images into a generation model to obtain N second images;
the generation model is a self-encoder model with a ResBlock structure, and/or the generation model is a self-encoder model with a decoder having a neural network layer number larger than that of an encoder.
10. The method of claim 9, further comprising:
in the process of training the generative model, two-norm loss and perception loss are used for constraint.
11. The training device for the expression recognition model is characterized by comprising a generating unit, a calculating unit and a training unit;
the generating unit is used for generating N second images according to M first images containing human faces, wherein each image in the M first images corresponds to one or more second images; the face of each second image in the N second images is changed compared with the face of the corresponding first image, and the face of each second image in the N second images and the face of the corresponding first image belong to the same expression;
the computing unit is configured to input the M first images and the N second images into a pre-training model respectively to obtain M first expression prediction results of the M first images and N second expression prediction results of the N second images;
the computing unit is further configured to compute distances between first expression prediction results of each of the M first images and corresponding second expression prediction results of the second image;
the training unit is configured to train the pre-training model by using the first expression prediction results of the P images in the M first images to obtain a target model if a distance between a first expression prediction result of the P images in the M first images and a second expression prediction result of the corresponding second image is smaller than or equal to a threshold;
wherein M, N, P is a positive integer and P is less than or equal to M.
12. The apparatus according to claim 11, wherein the threshold is an average or median of distances between the first expression prediction result of each of the M first images and the corresponding second expression prediction result of the second image.
13. The apparatus of claim 11 or 12, wherein the distance is a euclidean distance or a divergence.
14. The apparatus according to any one of claims 11-13, wherein said generating N second images from M first images containing faces comprises:
inputting the M first images containing the human faces into a generation model to obtain N second images;
the generation model is a self-encoder model with a ResBlock structure, and/or the generation model is a self-encoder model with a decoder having a neural network layer number larger than that of an encoder.
15. The apparatus of claim 14,
the training unit is further configured to use the two-norm loss and the perception loss for constraint in the process of training the generative model.
16. The device for recognizing the expression is characterized by comprising a receiving unit, a calculating unit and an output unit,
the receiving unit is used for receiving an input image of an expression to be recognized;
the computing unit is used for performing expression recognition on the image of the expression to be recognized received by the receiving unit by using an expression recognition model;
the output unit is used for outputting the expression category corresponding to the image of the expression to be recognized, which is recognized by the calculation unit;
the expression recognition model is obtained by iterative training according to a first image which is input in batches and contains a human face and a pre-training model; wherein each batch comprises M first images;
in the training process of each batch, generating N second images according to the M first images, wherein each image in the M first images corresponds to one or more of the N second images; the face of each second image in the N second images is changed compared with the face of the corresponding first image, and the face of each second image in the N second images and the face of the corresponding first image belong to the same expression; inputting the M first images and the N second images into the pre-training model respectively to obtain M first expression prediction results of the M first images and N second expression prediction results of the N second images; respectively calculating the distance between a first expression prediction result of each image in the M first images and a second expression prediction result of the corresponding second image; if the distance between a first expression prediction result of P images in the M first images and a second expression prediction result of the corresponding second image is less than or equal to a threshold, training the pre-training model using the first expression prediction results of the P images in the M first images, wherein M, N, P is a positive integer and P is less than or equal to M.
17. The apparatus according to claim 16, wherein the threshold is an average or median of distances between the first expression prediction result of each of the M first images and the corresponding second expression prediction result of the second image.
18. The apparatus of claim 16 or 17, wherein the distance is a euclidean distance or a divergence.
19. The apparatus according to any one of claims 16-18, wherein said generating N second images from said M first images comprises:
inputting the M first images into a generation model to obtain N second images;
the generation model is a self-encoder model with a ResBlock structure, and/or the generation model is a self-encoder model with a decoder having a neural network layer number larger than that of an encoder.
20. The apparatus of claim 19,
the calculation unit is further configured to perform constraint by using a two-norm loss and a perceptual loss in the process of training the generative model.
21. A computer-readable storage medium, characterized by comprising computer instructions which, when run on a terminal, cause the terminal to perform a method of training an expression recognition model according to any one of claims 1-5, or a method of expression recognition according to any one of claims 6-10.
22. A chip system, comprising one or more processors which, when executing instructions, perform a method of training an expression recognition model according to any one of claims 1 to 5, or a method of expression recognition according to any one of claims 6 to 10.
CN202010281184.9A 2020-04-10 2020-04-10 Method, device and equipment for training expression recognition model Withdrawn CN111611852A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010281184.9A CN111611852A (en) 2020-04-10 2020-04-10 Method, device and equipment for training expression recognition model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010281184.9A CN111611852A (en) 2020-04-10 2020-04-10 Method, device and equipment for training expression recognition model

Publications (1)

Publication Number Publication Date
CN111611852A true CN111611852A (en) 2020-09-01

Family

ID=72205577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010281184.9A Withdrawn CN111611852A (en) 2020-04-10 2020-04-10 Method, device and equipment for training expression recognition model

Country Status (1)

Country Link
CN (1) CN111611852A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327212A (en) * 2021-08-03 2021-08-31 北京奇艺世纪科技有限公司 Face driving method, face driving model training device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108510012B (en) Target rapid detection method based on multi-scale feature map
US20220092351A1 (en) Image classification method, neural network training method, and apparatus
WO2021043168A1 (en) Person re-identification network training method and person re-identification method and apparatus
CN110532871B (en) Image processing method and device
CN111667399B (en) Training method of style migration model, video style migration method and device
US20220215227A1 (en) Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium
CN110222717B (en) Image processing method and device
CN111507378A (en) Method and apparatus for training image processing model
CN112446476A (en) Neural network model compression method, device, storage medium and chip
WO2019227479A1 (en) Method and apparatus for generating face rotation image
WO2022001805A1 (en) Neural network distillation method and device
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN110222718B (en) Image processing method and device
CN111914997B (en) Method for training neural network, image processing method and device
WO2021073311A1 (en) Image recognition method and apparatus, computer-readable storage medium and chip
CN114255361A (en) Neural network model training method, image processing method and device
CN113191489B (en) Training method of binary neural network model, image processing method and device
US20220157046A1 (en) Image Classification Method And Apparatus
CN112529146A (en) Method and device for training neural network model
CN113807183A (en) Model training method and related equipment
WO2021136058A1 (en) Video processing method and device
CN113536970A (en) Training method of video classification model and related device
WO2022156475A1 (en) Neural network model training method and apparatus, and data processing method and apparatus
US20220222934A1 (en) Neural network construction method and apparatus, and image processing method and apparatus
CN116958324A (en) Training method, device, equipment and storage medium of image generation model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200901

WW01 Invention patent application withdrawn after publication