CN115035581A - Facial expression recognition method, terminal device and storage medium

Info

Publication number
CN115035581A
Authority
CN
China
Prior art keywords
image
local
classification probability
expression classification
sub
Prior art date
Legal status
Pending
Application number
CN202210738438.4A
Other languages
Chinese (zh)
Inventor
韦燕华
Current Assignee
Wingtech Communication Co Ltd
Original Assignee
Wingtech Communication Co Ltd
Priority date
Filing date
Publication date
Application filed by Wingtech Communication Co Ltd filed Critical Wingtech Communication Co Ltd
Priority to CN202210738438.4A priority Critical patent/CN115035581A/en
Publication of CN115035581A publication Critical patent/CN115035581A/en
Priority to PCT/CN2022/140931 priority patent/WO2024001095A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/169 Holistic features and representations, i.e. based on the facial image taken as a whole
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a facial expression recognition method, terminal equipment and a storage medium, which are applied to the technical field of image recognition and can solve the problem of how to accurately detect the facial expression of a user. The method comprises the following steps: acquiring a face image; carrying out global feature extraction on the face image to obtain a global feature vector, and determining the global expression classification probability corresponding to the face image according to the global feature vector; extracting local features of the facial image through the trained neural network model to obtain local feature vectors, and determining local expression classification probability corresponding to the facial image according to the local feature vectors; and determining a target expression classification probability corresponding to the facial image according to the global expression classification probability and the local expression classification probability, and determining a facial expression corresponding to the facial image according to the target expression classification probability.

Description

Facial expression recognition method, terminal device and storage medium
Technical Field
The embodiment of the application relates to the technical field of image recognition, in particular to a facial expression recognition method, a terminal device and a storage medium.
Background
Expression recognition is to recognize the facial expression of the current face; different facial expressions convey different emotional states and the user's current physiological and psychological reactions. At present, facial expression recognition methods are mainly based on detection of geometric features or appearance features. Methods based on geometric features are difficult to apply under complex lighting and variable facial movements, while methods based on appearance features adapt poorly to environmental changes and are very sensitive to unbalanced illumination, complex imaging and noise, so that a large amount of texture and edge information in the image is lost, which reduces the accuracy of face recognition. Therefore, how to accurately detect the facial expression of the user has become an urgent problem to be solved.
Disclosure of Invention
The embodiment of the application provides a facial expression recognition method, terminal equipment and a storage medium, which are used for solving the problem of how to accurately detect the facial expression of a user in the prior art.
In a first aspect, a method for facial expression recognition is provided, the method comprising: acquiring a face image;
carrying out global feature extraction on the face image to obtain a global feature vector, and determining the global expression classification probability corresponding to the face image according to the global feature vector;
extracting local features of the face image through the trained neural network model to obtain local feature vectors, and determining local expression classification probability corresponding to the face image according to the local feature vectors;
and determining a target expression classification probability corresponding to the facial image according to the global expression classification probability and the local expression classification probability, and determining a facial expression corresponding to the facial image according to the target expression classification probability.
As an optional implementation manner, in the first aspect of this embodiment of the present application, the trained neural network model includes a first neural network model and a second neural network model, and the extracting local features of the face image through the trained neural network model to obtain local feature vectors includes:
performing super-resolution processing and noise reduction processing on the face image through the first neural network model to obtain a first image;
and performing local feature extraction on the first image through the second neural network model to obtain the local feature vector.
As an optional implementation manner, in the first aspect of the embodiment of the present application, the performing, by the second neural network model, local feature extraction on the first image to obtain the local feature vector includes:
performing local key point detection on the first image to obtain eye key points and mouth key points;
extracting the first image according to the eye key points and the mouth key points to obtain an eye region image and a mouth region image;
and respectively carrying out local feature extraction on the eye region map and the mouth region map through the second neural network model to obtain local feature vectors respectively corresponding to the eye region map and the mouth region map.
As an optional implementation manner, in the first aspect of the embodiment of the present application, the performing, by the second neural network model, local feature extraction on the eye region map and the mouth region map respectively to obtain local feature vectors corresponding to the eye region map and the mouth region map respectively includes:
respectively sliding the eye region map and the mouth region map to a plurality of preset positions according to preset sliding distances through a preset sliding window in the second neural network model, and performing local feature extraction on each preset position to obtain a plurality of local feature vectors respectively corresponding to the eye region map and the mouth region map;
wherein the preset sliding window is sized according to the width and height of the eye region map and the mouth region map, respectively.
As an optional implementation manner, in the first aspect of the embodiments of the present application, the performing local feature extraction at each preset position to obtain a plurality of local feature vectors corresponding to the eye region map and the mouth region map respectively includes:
cutting the eye region image and the mouth region image at each preset position to obtain a plurality of eye feature images and a plurality of mouth feature images;
and performing local feature extraction on each eye feature map and each mouth feature map to obtain a plurality of eye feature vectors corresponding to the eye region maps and a plurality of mouth feature vectors corresponding to the mouth region maps.
As an optional implementation manner, in the first aspect of the embodiment of the present application, the determining, according to the local feature vector, a local expression classification probability corresponding to the face image includes:
correspondingly inputting the eye feature vectors and the mouth feature vectors into a full-connection layer network model respectively to obtain a plurality of expression classification probabilities corresponding to the eye feature vectors and the mouth feature vectors, wherein each expression classification probability corresponds to one eye feature vector and one mouth feature vector;
and averaging the expression classification probabilities to determine the local expression classification probability corresponding to the facial image.
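As an illustrative sketch rather than part of the original disclosure, the sliding-window extraction and probability averaging described in the preceding implementations could look as follows in Python; the window size, stride, seven-category output and the stand-in classifier (which combines the feature extraction and full-connection classification steps) are all assumptions:

```python
import numpy as np

def local_expression_probability(region_map: np.ndarray, classify,
                                 win_h: int, win_w: int, stride: int) -> np.ndarray:
    """Slide a preset window over an eye or mouth region map, classify each
    crop, and average the resulting expression classification probabilities."""
    h, w = region_map.shape[:2]
    per_crop = []
    for top in range(0, h - win_h + 1, stride):
        for left in range(0, w - win_w + 1, stride):
            crop = region_map[top:top + win_h, left:left + win_w]
            per_crop.append(classify(crop))   # probabilities over expression categories
    return np.mean(per_crop, axis=0)          # local expression classification probability

# Hypothetical usage with a stand-in classifier over 7 categories.
uniform = lambda crop: np.full(7, 1.0 / 7)
eye_region_map = np.random.rand(40, 80, 3)
probs = local_expression_probability(eye_region_map, uniform, win_h=20, win_w=40, stride=10)
```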
As an optional implementation manner, in a first aspect of an embodiment of the present application, the performing super-resolution processing and noise reduction processing on the face image through a first neural network model to obtain a first image includes:
enlarging the face image, and cropping the enlarged face image according to a preset direction and a preset size to obtain a plurality of first sub-images;
performing super-resolution processing and noise reduction processing on the plurality of first sub-images respectively through the first neural network model to obtain a plurality of second sub-images, wherein the plurality of second sub-images correspond to the plurality of first sub-images one to one;
and splicing the plurality of second sub-images to obtain the first image.
As an optional implementation manner, in the first aspect of the embodiment of the present application, the stitching the multiple second sub-images to obtain the first image includes:
acquiring a position identifier of each first sub-image in the face image;
according to the position identification, the plurality of second sub-images are respectively spliced to obtain the first image;
the position identification of a target first sub-image in the face image is the same as the position identification of a target second sub-image in the first image, the target first sub-image is any one of the plurality of first sub-images, and the target second sub-image is an image corresponding to the target first sub-image in the plurality of second sub-images.
As an optional implementation manner, in the first aspect of the embodiment of the present application, the determining, according to the global expression classification probability and the local expression classification probability, a target expression classification probability corresponding to the face image includes:
acquiring a first weight corresponding to the global expression classification probability and a second weight corresponding to the local expression classification probability, wherein the sum of the first weight and the second weight is 1;
and determining the target expression classification probability corresponding to the face image according to the global expression classification probability, the first weight, the local expression classification probability and the second weight.
As an optional implementation manner, in the first aspect of this embodiment of the present application, the method further includes:
determining a target rendering map corresponding to the facial expression;
rendering the face image through the target rendering image to obtain a target rendering face image;
outputting the target rendered face image.
In a second aspect, there is provided a facial expression recognition apparatus including: the acquisition module is used for acquiring a face image;
the feature extraction module is used for carrying out global feature extraction on the face image to obtain a global feature vector;
the processing module is used for determining the global expression classification probability corresponding to the face image according to the global feature vector;
the feature extraction module is further used for extracting local features of the face image through the trained neural network model to obtain local feature vectors;
the processing module is further used for determining the local expression classification probability corresponding to the face image according to the local feature vector;
the processing module is further configured to determine a target expression classification probability corresponding to the facial image according to the global expression classification probability and the local expression classification probability, and determine a facial expression corresponding to the facial image according to the target expression classification probability.
In a third aspect, a terminal device is provided, where the terminal device includes:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute the steps of the facial expression recognition method in the first aspect of the embodiment of the present application.
In a fourth aspect, a computer-readable storage medium is provided, which stores a computer program that causes a computer to execute the steps of the facial expression recognition method in the first aspect of the embodiments of the present application. The computer readable storage medium includes a ROM/RAM, a magnetic or optical disk, or the like.
In a fifth aspect, there is provided a computer program product for causing a computer to perform some or all of the steps of any one of the methods of the first aspect when the computer program product is run on the computer.
A sixth aspect provides an application publishing platform for publishing a computer program product, wherein the computer program product, when run on a computer, causes the computer to perform some or all of the steps of any one of the methods of the first aspect.
Compared with the prior art, the embodiment of the application has the following beneficial effects:
according to the embodiment of the application, the global expression classification probability and the local expression classification probability can be respectively calculated through two branch frameworks of the global feature and the local feature, and the global expression classification probability and the local expression classification probability are fused to determine the facial expression, so that the influence of environmental factors on the global feature and the local feature respectively can be effectively reduced, and the accuracy of facial expression detection is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the embodiments will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can also be obtained by those skilled in the art from these drawings without creative efforts.
Fig. 1A is a scene schematic diagram of a facial expression recognition method according to an embodiment of the present application;
fig. 1B is a first flowchart illustrating a facial expression recognition method according to an embodiment of the present application;
fig. 1C is a schematic face diagram of a method for identifying facial expressions according to an embodiment of the present application;
fig. 2 is a second flowchart illustrating a facial expression recognition method according to an embodiment of the present application;
fig. 3 is a third schematic flowchart of a facial expression recognition method according to an embodiment of the present application;
fig. 4 is a schematic diagram of a first cropping in a facial expression recognition method according to an embodiment of the present application;
fig. 5 is a schematic diagram of a second cropping in a facial expression recognition method according to an embodiment of the present application;
fig. 6 is a rendering schematic diagram of a facial expression recognition method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a facial expression recognition apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
The terms "first" and "second," and the like, in the description and in the claims of the present application, are used for distinguishing between different objects and not for describing a particular order of the objects. For example, the first neural network model and the second neural network model, etc. are specific sequences for distinguishing different neural network models, rather than for describing the neural network models.
The terms "comprises," "comprising," and "having," and any variations thereof, of the embodiments of the present application, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that in the embodiments of the present application, words such as "exemplary" or "for example" are used to indicate examples, illustrations or explanations. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
The facial expression recognition device related to the embodiment of the present application may be a terminal device, or may also be a functional module and/or a functional entity that are arranged in the terminal device and are capable of implementing the facial expression recognition method, and may specifically be determined according to actual use requirements, which is not limited in the embodiment of the present application. It should be noted that the terminal device may be an electronic device such as a Mobile phone, a tablet Computer, a notebook Computer, a palm Computer, a vehicle-mounted terminal device, a wearable device, an Ultra-Mobile Personal Computer (UMPC), a netbook, or a Personal Digital Assistant (PDA). The wearable device can be an intelligent watch, an intelligent bracelet, a watch phone and the like, and the embodiment of the application is not limited.
As shown in fig. 1A, a scene schematic diagram of a facial expression recognition method disclosed in the embodiment of the present application is shown, and the facial expression recognition method provided in the present application may be applied to an application environment shown in fig. 1A. The facial expression recognition method is applied to a facial expression recognition system. The facial expression recognition system includes a user 11, a terminal device 12, and a server 13. Wherein the terminal device 12 communicates with the server 13 via a network. The terminal device 12 may first obtain a face image of the user 11, then perform global feature extraction on the face image to obtain a global feature vector, and determine a global expression classification probability corresponding to the face image according to the global feature vector; extracting local features of the face image through a neural network model trained by the server 13 to obtain local feature vectors, and determining local expression classification probabilities corresponding to the face image according to the local feature vectors; and finally, determining a target expression classification probability corresponding to the facial image according to the global expression classification probability and the local expression classification probability, and determining a facial expression corresponding to the facial image according to the target expression classification probability. The terminal device 12 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 13 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In the scheme, the server 13 may train a neural network model, and store the trained neural network model in the server 13, and when the terminal device 12 acquires the face image of the user 11, the trained neural network model may be downloaded from the server 13 to perform facial expression recognition on the face image; the trained neural network model may also be trained at terminal device 12 and stored at terminal device 12.
By implementing the method, the terminal device 12 can respectively calculate the global expression classification probability and the local expression classification probability through the two branch architectures of the global feature and the local feature, and fuse the global expression classification probability and the local expression classification probability to determine the facial expression, so that the influence of environmental factors on the global feature and the local feature respectively can be effectively reduced, and the accuracy of facial expression detection is improved.
In one embodiment, as shown in fig. 1B, the present application provides a facial expression recognition method, which may be applied to the terminal device 12 or the server 13 in fig. 1A, and the terminal device 12 is taken as an example for description. The method may comprise the steps of:
101. Acquiring a face image.
In the embodiment of the application, the terminal device can acquire the face image of the user.
It should be noted that the face image is an image including facial features of the user, and the face image may be obtained by the terminal device through shooting by a camera, or may be obtained by the terminal device from a pre-stored picture library.
102. Performing global feature extraction on the face image to obtain a global feature vector, and determining the global expression classification probability corresponding to the face image according to the global feature vector.
In the embodiment of the application, the terminal device may perform global feature extraction on the whole face region in the face image to obtain a global feature vector.
The global features refer to overall attributes of the face image, and common global features include color features, texture features, shape features and the like, such as intensity, histogram and the like; the global features have the characteristics of good invariance, simple calculation, visual representation and the like because of the low-level visual features at the pixel level.
The global feature vector may be expressed by a gray value, a red-green-blue (RGB) value, a hue-saturation-intensity (HSI) value, and the like.
Optionally, the global feature extraction method may include: principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and the like.
The terminal device may determine the global feature vector mainly through PCA, which applies dimensionality reduction. The face image may first be identified to determine the facial feature points it contains, and each facial feature point corresponds to an n-dimensional vector that describes it. A covariance matrix is then computed for the n-dimensional vectors of the facial feature points; the covariance matrix captures the covariance between the dimensions of the vectors corresponding to each feature point, rather than between different feature points. The eigenvalues and eigenvectors of the covariance matrix are calculated, the eigenvalues are sorted from large to small, and the first k eigenvalues are selected; the k eigenvectors corresponding to these k eigenvalues are then used as column vectors to form an eigenvector matrix, and each feature point descriptor is projected onto the selected eigenvectors, so that it changes from an original n-dimensional vector into a k-dimensional vector. Here n and k are integers with n greater than k; for example, in the embodiment of the present application, n may be 128 and k may be 31, that is, the feature points in the face image may be reduced from 128 dimensions to 31 dimensions by PCA.
In the scheme, when global feature extraction is performed on the face image, the feature points in the face image can be subjected to dimensionality reduction through PCA to obtain the global feature vector, which reduces the amount of computation on the feature vectors and achieves the purpose of global feature extraction.
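Purely as an illustration, the PCA-based dimensionality reduction above might look like the following Python sketch; the 128-to-31 reduction follows the example in the text, while the landmark count, function name and use of NumPy are assumptions:

```python
import numpy as np

def pca_reduce(features: np.ndarray, k: int = 31) -> np.ndarray:
    """Reduce n-dimensional feature-point descriptors to k dimensions with PCA.

    features: shape (num_points, n), one n-dimensional vector per facial
    feature point (n = 128 in the example above). Returns shape (num_points, k).
    """
    centered = features - features.mean(axis=0)   # zero-mean each dimension
    cov = np.cov(centered, rowvar=False)          # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues / eigenvectors
    top_k = np.argsort(eigvals)[::-1][:k]         # k largest eigenvalues first
    projection = eigvecs[:, top_k]                # eigenvector matrix (n x k)
    return centered @ projection                  # each point: n-dim -> k-dim

# Hypothetical usage: 68 landmark descriptors reduced from 128 to 31 dimensions.
descriptors = np.random.rand(68, 128)
global_features = pca_reduce(descriptors, k=31)   # shape (68, 31)
```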
Alternatively, the global expression classification probability may be used to represent the probability of the facial expression category corresponding to the global feature vector.
It should be noted that the facial expressions may include various categories, for example, calm, happy, sad, surprised, fear, anger, and disgust. The global feature vector may be subjected to feature classification to determine the probability of the global feature vector corresponding to each facial expression category, and the probability of each facial expression category may be any value in [0, 1]. The highest probability may be used as the global expression classification probability, and the facial expression category corresponding to the highest probability is the facial expression determined by global feature vector classification. As an embodiment, the global expression classification probability may also include a probability corresponding to each facial expression category.
Exemplarily, assuming that the terminal device analyzes the global feature vector, the obtained sad probability is 68%, the angry probability is 32%, the disgust probability is 26%, the fear probability is 33%, the calm probability is 8%, the happy probability is 2%, and the surprised probability is 16%; it can be seen that the probability of sadness is highest compared to the probabilities of other expressions, so the global expression classification probability is 68% of the sadness corresponding probability, and sadness can be regarded as a facial expression determined by global feature vector classification.
103. Extracting local features of the face image through the trained neural network model to obtain local feature vectors, and determining the local expression classification probability corresponding to the face image according to the local feature vectors.
In the embodiment of the application, the terminal device may pre-train the neural network model, and perform local feature extraction on the local region in the face image through the trained neural network model to obtain the local feature vector.
The local features are extracted from local regions of the image, and include edges, corners, lines, curves, regions with special attributes, and the like, and the common local features include two major description modes of corner classes and region classes. The local image features have the characteristics of abundant quantity in the image, small correlation degree among the features, no influence on detection and matching of other features due to disappearance of partial features under the shielding condition and the like.
Optionally, since the eyes and mouth generally change more than the other facial features when the user's facial expression changes, the terminal device may extract local feature vectors of the user's eyes from the eye region map of the face image and local feature vectors of the mouth from the mouth region map of the face image. The eye region map refers to an image including the user's eyes, eyebrows, and nose bridge region, and may be divided into a left eye region map and a right eye region map; the mouth region map refers to an image including the user's mouth and nostril regions.
Illustratively, as shown in fig. 1C, a in fig. 1C is a face image of a user, b in fig. 1C and C in fig. 1C are eye region maps of the user, where b in fig. 1C is a right eye region map, C in fig. 1C is a left eye region map, and d in fig. 1C is a mouth region map of the user.
The local feature vector may be expressed by a gray value, a red-green-blue (RGB) value, a hue-saturation-intensity (HSI) value, and the like.
Alternatively, the local expression classification probability may be used to represent the probability of the facial expression class corresponding to the local feature vector.
Exemplarily, assume that the terminal device analyzes the mouth local feature vector and obtains a happy probability of 68%, a surprised probability of 32%, a calm probability of 26%, a fear probability of 13%, an anger probability of 8%, a sadness probability of 2%, and a disgust probability of 16%. It can be seen that the probability of happiness is the highest compared with the probabilities of the other expressions, so the local expression classification probability is the 68% corresponding to happiness, and happiness can be regarded as the facial expression determined by local feature vector classification.
104. Determining a target expression classification probability corresponding to the face image according to the global expression classification probability and the local expression classification probability, and determining the facial expression corresponding to the face image according to the target expression classification probability.
In the embodiment of the application, the terminal device may fuse the global expression classification probability and the local expression classification probability to determine a target expression classification probability, where the target expression classification probability may be used to represent the current facial expression of the user.
It should be noted that, since the global expression classification probability and the local expression classification probability may both include probabilities corresponding to a plurality of facial expression categories, in the process of fusing the global expression classification probability and the local expression classification probability, the probabilities corresponding to each facial expression category need to be fused to obtain a target expression classification probability, and correspondingly, the target expression classification probability may also include a probability corresponding to each facial expression category.
For example, assume that the target expression classification probability obtained by the terminal device is: the surprise probability is 56%, the fear probability is 40%, the anger probability is 38%, the disgust probability is 26%, the sadness probability is 22%, the calm probability is 2%, and the happy probability is 13%, so that the surprise probability is the highest in comparison with the probability of other expressions, the target expression classification probability is the surprise corresponding probability 56%, and surprise can be taken as the facial expression determined by fusing the global expression classification probability and the local expression classification probability, namely the facial expression determined by jointly classifying the global feature vector and the local feature vector.
In the embodiment of the application, the terminal device can respectively calculate the global expression classification probability and the local expression classification probability through two branch architectures of the global feature and the local feature, and fuse the global expression classification probability and the local expression classification probability to determine the facial expression, so that the influence of environmental factors on the global feature and the local feature respectively can be effectively reduced, and the accuracy of facial expression detection is improved.
In another embodiment, as shown in fig. 2, the present application provides a facial expression recognition method, which may be applied to the terminal device 12 or the server 13 in fig. 1A, and the terminal device 12 is taken as an example for description. The method may further comprise the steps of:
201. Acquiring a face image.
202. Performing global feature extraction on the face image to obtain a global feature vector.
In the embodiment of the present application, for the description of steps 201 to 202, please refer to the detailed description of steps 101 to 102 in the first embodiment, which is not repeated herein.
203. Performing feature classification on the global feature vector through a full-connection layer network model to determine the global expression classification probability corresponding to the face image.
In this embodiment of the application, the terminal device may input the global feature vector into the full-connection layer network model, so that the full-connection layer network model performs feature classification on the global feature vector, thereby determining the global expression classification probability corresponding to the face image.
It should be noted that the full-connection layer network model includes a full-connection layer and an activation function. The activation function improves the nonlinear expression capability of the full-connection layer network model, and each full-connection layer is a tiled structure composed of many neurons. The main function of the full-connection layer is classification, that is, classifying the global feature vector to determine the corresponding global expression classification probability.
Optionally, the full-connection layer network may be trained on samples in advance, that is, a large number of face images and the labeled facial expression category corresponding to each face image may be used for training, so that the full-connection layer network learns the correspondence between face images and facial expression categories. However, because a face image contains many features, it may correspond to more than one facial expression category, so each facial expression category is assigned a classification probability; the higher the classification probability, the more likely the corresponding facial expression category is the facial expression of the face image. Therefore, when the full-connection layer network is used, only the face image needs to be input, and the classification probability of each facial expression category corresponding to the face image can be obtained.
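A minimal sketch of such a full-connection layer network model with an activation function is shown below; it is not taken from the disclosure, and the input dimension (31, matching the PCA example), hidden width, use of PyTorch and per-category sigmoid output are assumptions (a sigmoid is used so the category probabilities lie in [0, 1] without needing to sum to 1, as in the examples above):

```python
import torch
import torch.nn as nn

class ExpressionClassifier(nn.Module):
    """Full-connection layers plus an activation function; the output is a
    classification probability in [0, 1] for each expression category."""
    def __init__(self, in_dim: int = 31, num_classes: int = 7):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, 64),
            nn.ReLU(),                      # improves nonlinear expression capability
            nn.Linear(64, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.layers(x))

# Hypothetical usage: classify one 31-dimensional global feature vector into
# 7 categories (calm, happy, sad, surprised, fear, anger, disgust).
probs = ExpressionClassifier()(torch.rand(1, 31))   # 7 per-category probabilities
```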
204. Performing super-resolution processing and noise reduction processing on the face image through the first neural network model to obtain a first image.
In this embodiment of the application, the terminal device may input the face image into the first neural network model, so that the first neural network model performs super-resolution processing and noise reduction processing on the face image, thereby obtaining the first image. The trained neural network model includes a first neural network model and a second neural network model.
The super-resolution processing is a processing mode that increases the resolution of a picture, converting a low-definition small-size picture into a high-definition large-size picture. There are various specific methods for super-resolution processing, such as interpolation, reconstruction, and model training. Interpolation is common and intuitive: the pixel value of the pixel point in the new image, after super-resolution processing, that corresponds to a target pixel point in the original image can be determined according to the pixel value of the target pixel point and the pixel values of several surrounding pixel points. The noise reduction processing is a common processing mode in current image processing; generally, a filter can be used directly to denoise the image, so as to remove the influence of equipment factors, environmental interference, and the like.
It should be noted that the first neural network model is obtained by training on a large number of face test images in advance. In the training process, corresponding model parameters can be generated according to the preset correspondence between low-resolution small-size images and high-resolution large-size images, so as to obtain a convolutional neural network that realizes super-resolution and noise reduction at the same time. The first neural network model may have a structure of convolution blocks connected in sequence according to a preset number, each convolution block including a convolution layer and an activation layer connected in sequence, where the activation function of the activation layer may be one or more of a Sigmoid function, a ReLU function, a Leaky ReLU function, a Tanh function, and a softmax function.
Optionally, after the face image is input into the first neural network model, the first neural network model may process the pixel points in the face image through convolution functions and activation functions to obtain a clearer first image with higher resolution, and update the correspondence between low-definition small-size images and high-definition large-size images in the first neural network model in real time. This improves both the accuracy of the first neural network model in processing face images and its convergence rate, and further improves the efficiency of image processing.
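One possible, purely illustrative structure for the first neural network model (sequential convolution blocks, each a convolution layer followed by an activation layer, with an upsampling head for super-resolution) is sketched below; the block count, channel width and 2x scale factor are assumptions:

```python
import torch
import torch.nn as nn

class SRDenoiseNet(nn.Module):
    """Sketch of the first neural network model: convolution blocks connected
    in sequence, ending with a pixel-shuffle step for 2x super-resolution."""
    def __init__(self, num_blocks: int = 4, channels: int = 32, scale: int = 2):
        super().__init__()
        blocks = [nn.Conv2d(3, channels, kernel_size=3, padding=1), nn.ReLU()]
        for _ in range(num_blocks - 1):
            blocks += [nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU()]
        self.blocks = nn.Sequential(*blocks)
        # Upsampling head: produce scale^2 * 3 channels, then rearrange into a larger image.
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.upsample(self.blocks(x))

face = torch.rand(1, 3, 56, 56)       # low-resolution face image (or sub-image)
first_image = SRDenoiseNet()(face)    # shape (1, 3, 112, 112)
```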
205. Performing local feature extraction on the first image through a second neural network model to obtain a local feature vector.
In this embodiment, the terminal device may input the first image to the second neural network model, so that the second neural network model performs local feature extraction on the first image, thereby obtaining a local feature vector.
Optionally, the second neural network model may be a convolutional neural network model, and the second neural network model may be trained through a large number of sample images and feature vectors corresponding to the sample images, so that when the second neural network model is used, local feature vectors corresponding to the first image may be quickly extracted according to the input first image.
Further, the training mode of the second neural network model may include: acquiring a plurality of sample images and corresponding sample categories; performing feature extraction on the plurality of sample images through a second neural network model to be trained to obtain reference features of the sample images corresponding to the second neural network model; determining a loss value between the reference feature and the corresponding sample class; and adjusting the model parameters in the corresponding second neural network model according to the loss value until the determined loss value reaches the training stopping condition.
The reference features are sample image features obtained after feature extraction is carried out on the sample images by the second neural network model to be trained. As the number of training passes of the second neural network model increases, the reference features also change. The second neural network model can adopt a network obtained by training and learning with methods such as deep learning and neural network.
It should be noted that the training stopping condition is that the loss value of the reference feature in each sample image and the corresponding known sample class reaches a preset range, that is, the prediction accuracy of each sample image reaches the preset range.
Specifically, the terminal device obtains a plurality of sample images and corresponding sample categories, and respectively extracts image features of each sample image through a second neural network model running on the terminal device to obtain reference features of the corresponding sample images; the reference features are related to the expression classification probability corresponding to the second neural network model, and the features belonging to the corresponding expression classification probability can be well represented. Further, the terminal determines the loss values of the reference features and the known sample types by adopting a loss function, adjusts model parameters in the second neural network model according to the loss values, and stops training of the second neural network model until the loss values are in accordance with a preset range. The loss function may be a mean square error loss function, a mean absolute value loss function, a cross entropy loss function, or the like.
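The training procedure above (feature extraction, loss against the known sample class, parameter adjustment until the loss value reaches the preset range) can be illustrated with the following sketch; the cross-entropy loss, Adam optimizer, epoch count and loss threshold are assumptions chosen from the options the text allows, and the model and dataloader are supplied by the caller:

```python
import torch
import torch.nn as nn

def train_second_model(model: nn.Module, dataloader, epochs: int = 10,
                       loss_threshold: float = 0.05) -> nn.Module:
    """Sketch of the training loop: extract features / class scores from the
    sample images, compute the loss against the known sample classes, adjust
    the model parameters, and stop once the loss is within the preset range."""
    criterion = nn.CrossEntropyLoss()          # could also be MSE or mean-absolute-error loss
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for sample_images, sample_classes in dataloader:
            logits = model(sample_images)      # class scores derived from the reference features
            loss = criterion(logits, sample_classes)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < loss_threshold:   # loss within preset range: stop training
                return model
    return model
```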
206. Determining the local expression classification probability corresponding to the face image according to the local feature vector.
Optionally, when determining the local expression classification probability corresponding to the face image according to the local feature vector, the terminal device may input the local feature vector into the full-connection layer network model, so that the full-connection layer network model performs feature classification on the local feature vector, thereby determining the local expression classification probability corresponding to the face image.
Note that, the full-connection layer network model is the same as the full-connection layer network model in step 203, and is used to classify the feature vectors to obtain the expression classification probability.
207. Acquiring a first weight corresponding to the global expression classification probability and a second weight corresponding to the local expression classification probability.
In the embodiment of the application, after determining the global expression classification probability and the local expression classification probability, the terminal device may obtain a first weight corresponding to the global expression classification probability and a second weight corresponding to the local expression classification probability. It should be noted that the first weight corresponding to the global expression classification probability and the second weight corresponding to the local expression classification probability may be determined empirically or may be determined according to historical training data.
Wherein the sum of the first weight and the second weight is 1.
Optionally, the terminal device may pre-construct a mapping relationship between the first weight and the global expression classification probability and a mapping relationship between the second weight and the local expression classification probability, and store the mapping relationships in a database of the terminal device or the server, so that after the terminal device obtains the global expression classification probability and the local expression classification probability, the terminal device may determine the corresponding first weight and second weight from the database.
For example, if the global expression classification probability corresponding to the global feature vector contributes more to the accuracy of facial expression recognition, the first weight is greater than the second weight; the first weight may be 0.7 and the second weight may be 0.3. If the local expression classification probability corresponding to the local feature vector contributes more to the accuracy of facial expression recognition, the second weight is greater than the first weight; the first weight may be 0.26 and the second weight may be 0.74, but this is not limited thereto.
208. Determining the target expression classification probability corresponding to the face image according to the global expression classification probability, the first weight, the local expression classification probability and the second weight.
In the embodiment of the application, in the process of fusing the global expression classification probability and the local expression classification probability, the terminal device combines the respective weights of the global expression classification probability and the local expression classification probability.
It should be noted that the terminal device may determine the target expression classification probability according to a first formula: P = wP_G + (1 - w)P_C, where P is the target expression classification probability, w is the first weight corresponding to the global expression classification probability, (1 - w) is the second weight corresponding to the local expression classification probability, P_G is the global expression classification probability, and P_C is the local expression classification probability.
Optionally, if the global expression classification probability and the local expression classification probability both include a probability corresponding to each facial expression category, then when they are substituted into the first formula, the probabilities corresponding to each facial expression category need to be substituted separately, and a target expression classification probability is calculated for each facial expression category; finally, the highest of the target expression classification probabilities corresponding to the facial expression categories is selected as the target expression classification probability corresponding to the face image.
Illustratively, assume the global expression classification probabilities include: a sadness probability of 99%, an anger probability of 32%, a disgust probability of 26%, and a fear probability of 33%; and the local expression classification probabilities include: a sadness probability of 95%, an anger probability of 41%, a disgust probability of 33%, and a fear probability of 29%. Then the global expression classification probability of 99% for sadness and the corresponding local expression classification probability of 95% can be substituted into the first formula, together with the other category pairs, to obtain a target expression classification probability of 99% for sadness, 35% for anger, 28% for disgust, and 31% for fear; finally, the highest of these four target expression classification probabilities, 99%, is determined as the target expression classification probability corresponding to the face image.
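The first formula and the numerical example above can be reproduced with a short sketch; the weight w = 0.7 is an assumption (the text does not state which weight produced the example values, so the fused numbers below only approximate them):

```python
import numpy as np

def fuse(p_global: np.ndarray, p_local: np.ndarray, w: float) -> np.ndarray:
    """Target expression classification probability: P = w * P_G + (1 - w) * P_C."""
    return w * p_global + (1.0 - w) * p_local

categories = ["sad", "angry", "disgust", "fear"]
p_global = np.array([0.99, 0.32, 0.26, 0.33])   # global expression classification probabilities
p_local = np.array([0.95, 0.41, 0.33, 0.29])    # local expression classification probabilities
target = fuse(p_global, p_local, w=0.7)          # roughly [0.98, 0.35, 0.28, 0.32]
facial_expression = categories[int(np.argmax(target))]   # "sad": highest fused probability
```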
209. Determining the facial expression corresponding to the face image according to the target expression classification probability.
In the embodiment of the present application, for the description of step 209, please refer to the detailed description of step 104 in the first embodiment, which is not repeated herein.
In the embodiment of the application, the global expression classification probability and the local expression classification probability can be calculated separately through the two branch architectures of the global feature and the local feature, and fused with their respective weights to determine the facial expression, while a neural network model is used to extract the image features. This effectively improves the resolution of the face image and the robustness to the environment, so that the global features and the local features can be extracted more accurately and the facial expression can be better expressed. As a result, the classification accuracy is improved, the respective influence of environmental factors on the global features and the local features is effectively reduced, and the accuracy of facial expression detection is improved.
In one embodiment, as shown in fig. 3, the present application provides a facial expression recognition method, which may be applied to the terminal device 12 or the server 13 in fig. 1A, and is described by taking the terminal device 12 as an example. The method may further comprise the steps of:
301. Acquiring an initial image through a camera.
In the embodiment of the application, the terminal device can shoot through a camera arranged on the terminal device to obtain an initial image.
302. Identifying the initial image to obtain a face region map.
In this embodiment of the application, the terminal device may perform face recognition on the initial image, determine a face region map in the initial image, and determine that a portion other than the face region map is a background region map.
Optionally, in the process of performing face recognition on the initial image, facial feature points in the initial image may be detected, and if a certain number of facial feature points are detected in the initial image, it may be determined that a facial region map exists in the initial image.
303. Cropping the initial image according to the face region map to obtain a face image including the face region map.
In the embodiment of the application, because the face region map may only occupy a part of the initial image, the terminal device may crop the initial image, and remove all background region maps except the face region map to obtain a face image including the face region map.
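Steps 301 to 303 might be sketched as follows; the patent does not name a specific face detector, so the use of OpenCV's Haar cascade here is purely an assumption for illustration:

```python
import cv2

def acquire_face_image(initial_image_path: str):
    """Read the initial image, locate the face region map, and crop away the
    background region map, returning the face image (or None if no face)."""
    initial_image = cv2.imread(initial_image_path)           # 301: initial image
    gray = cv2.cvtColor(initial_image, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)  # 302: detect face
    if len(faces) == 0:
        return None                                           # no facial feature region found
    x, y, w, h = faces[0]
    return initial_image[y:y + h, x:x + w]                    # 303: crop to the face region map
```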
304. Performing global feature extraction on the face image to obtain a global feature vector.
305. Performing feature classification on the global feature vector through a full-connection layer network model to determine the global expression classification probability corresponding to the face image.
In the embodiment of the present application, for the description of steps 304 to 305, please refer to the detailed description of steps 202 to 203 in the second embodiment, which is not repeated herein.
306. Enlarging the face image, and cropping the enlarged face image according to a preset direction and a preset size to obtain a plurality of first sub-images.
In the embodiment of the application, when cropping the face image, the terminal device may first enlarge the face image, and then crop the enlarged face image according to the preset direction and the preset size to obtain a plurality of first sub-images.
The human face image is composed of a plurality of pixel points, so that the preset size is the number of the preset pixel points, and the preset direction is the fixed cutting direction.
Optionally, the cropping direction is the order in which the face image is cropped, and may be a left-to-right parallel boundary traversal cropping, a top-to-bottom parallel boundary traversal cropping, or a diagonal traversal cropping from the top left to the bottom right.
It should be noted that, when the face image is cropped, the cropping may be performed in an overlapping manner, in which one pixel point may belong to a plurality of first sub-images, or in a non-overlapping manner, in which one pixel point belongs to only one first sub-image.
For example, as shown in fig. 4 and fig. 5, it is assumed that the face image is an image of 12 × 12 pixel points, each grid represents one pixel point, the preset size is 4 × 4, and the preset direction is from left to right and from top to bottom.
When the terminal device performs overlapped cropping, fig. 4 shows only part of the cropping steps. Cropping starts from the top left corner of the face image with the preset size of 4 × 4, and the window moves with a step size of 2 pixel points: after the crop at the top left corner shown by a in fig. 4 yields a first sub-image shown by h in fig. 4, the window moves two pixel points to the right as shown by b in fig. 4 to yield another first sub-image, and cropping continues to move rightward in this way; after the crop at the top right corner shown by c in fig. 4, the window moves two pixel points downward and the rightward-and-downward cropping continues row by row; after the crop at the bottom left corner shown by e in fig. 4, only rightward movement remains, and the final crop is performed at the bottom right corner shown by f in fig. 4. In this way, 25 first sub-images shown by h in fig. 4 are obtained.
When the terminal device performs non-overlapped cropping, fig. 5 shows all the cropping steps, and the window moves by the preset size of 4 pixel points each time. Cropping starts at the top left corner of the face image as shown by a in fig. 5 to yield a first sub-image shown by j in fig. 5; the window then moves four pixel points to the right as shown by b in fig. 5 to yield another first sub-image, and again at the top right corner as shown by c in fig. 5. The window then moves four pixel points downward to the second row, where crops are performed as shown by d, e and f in fig. 5, and then to the third row, where crops are performed at the bottom left corner shown by g in fig. 5, then as shown by h in fig. 5, and finally at the bottom right corner shown by i in fig. 5. In this way, 9 first sub-images shown by j in fig. 5 are obtained.
In the embodiment of the application, before the face image is cropped according to the preset direction and the preset size, the face image may first be amplified by a preset multiple to obtain an enlarged image, and the enlarged image is then cropped according to the preset direction and the preset size.
Because the face image obtained by removing the background region map from the initial image may be small, directly cropping it into a plurality of first sub-images would make the first sub-images even smaller and inconvenient for subsequent image processing; the terminal device therefore first enlarges the face image by the preset multiple and then performs the cropping.
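A minimal sketch of step 306, assuming a square preset size and a left-to-right, top-to-bottom cropping direction; the step argument distinguishes the overlapped cropping of fig. 4 (step smaller than the preset size) from the non-overlapped cropping of fig. 5 (step equal to the preset size), and the grid-style position identifiers recorded here are reused by the stitching of step 308:

import cv2

def crop_into_first_sub_images(face_image, scale=2, patch=4, step=4):
    # Enlarge the face image by a preset multiple before cropping (step 306).
    enlarged = cv2.resize(face_image, None, fx=scale, fy=scale,
                          interpolation=cv2.INTER_CUBIC)
    h, w = enlarged.shape[:2]
    first_sub_images, positions = [], []
    # Traverse from left to right and from top to bottom; step < patch gives
    # the overlapped cropping of fig. 4, step == patch the non-overlapped one.
    for row, top in enumerate(range(0, h - patch + 1, step)):
        for col, left in enumerate(range(0, w - patch + 1, step)):
            first_sub_images.append(enlarged[top:top + patch, left:left + patch])
            positions.append((row + 1, col + 1))  # position identifier
    return first_sub_images, positions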
307. Performing super-resolution processing and noise reduction processing on the plurality of first sub-images respectively through the first neural network model to obtain a plurality of second sub-images.
In the embodiment of the application, because the terminal device cuts the face image to obtain a plurality of first sub-images, the terminal device may input each first sub-image into the first neural network model, so that the first neural network model performs super-resolution processing and noise reduction processing on each first sub-image to obtain a plurality of second sub-images.
After each first sub-image is input into the first neural network model, one second sub-image can be obtained, namely, the plurality of second sub-images correspond to the plurality of first sub-images in a one-to-one mode.
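The embodiment does not disclose the architecture of the first neural network model, so the following PyTorch sketch is only a hypothetical stand-in: a small convolutional network that upsamples each first sub-image (super-resolution) and applies a final smoothing convolution (noise reduction), patch by patch:

import numpy as np
import torch
import torch.nn as nn

class FirstNetworkSketch(nn.Module):
    # Hypothetical stand-in for the first neural network model: a 2x
    # super-resolution branch followed by a final denoising convolution.
    def __init__(self, channels=3, upscale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels * upscale ** 2, 3, padding=1),
            nn.PixelShuffle(upscale),                     # super-resolution
            nn.Conv2d(channels, channels, 3, padding=1))  # noise reduction

    def forward(self, x):
        return self.body(x)

def enhance_first_sub_images(first_sub_images, model=None):
    # Each first sub-image yields exactly one second sub-image (one-to-one).
    model = (model or FirstNetworkSketch()).eval()
    second_sub_images = []
    with torch.no_grad():
        for patch in first_sub_images:  # patch: HxWx3 uint8 array
            x = torch.from_numpy(np.ascontiguousarray(patch))
            x = x.permute(2, 0, 1).float().unsqueeze(0) / 255.0
            out = model(x).squeeze(0).clamp(0, 1)
            second_sub_images.append(
                (out.permute(1, 2, 0).numpy() * 255).astype(np.uint8))
    return second_sub_images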
308. Splicing the plurality of second sub-images to obtain a first image.
In this embodiment, after the terminal device obtains the plurality of second sub-images, the terminal device may stitch the plurality of second sub-images together again to obtain the first image.
Optionally, splicing the plurality of second sub-images to obtain the first image may specifically include: acquiring a position identifier of each first sub-image in the face image; and splicing the plurality of second sub-images according to the position identifiers to obtain the first image.
The position identifier of a first sub-image in the face image can be used for representing the position of the image area corresponding to that first sub-image in the face image. The position identifier of the target first sub-image in the face image is the same as the position identifier of the target second sub-image in the first image, where the target first sub-image is any one of the plurality of first sub-images and the target second sub-image is the image corresponding to the target first sub-image among the plurality of second sub-images.
In this implementation, each first sub-image is subjected to super-resolution processing and noise reduction processing by the first neural network model to obtain one second sub-image, so the position of a first sub-image and the position of its second sub-image should correspond to each other. When the terminal device crops the face image, it can simultaneously record the position identifier of the target first sub-image in the face image; after obtaining the target second sub-image corresponding to the target first sub-image through the first neural network model, the terminal device can place the target second sub-image into the first image according to that position identifier, so as to obtain the first image.
It should be noted that the position identifier of the target first sub-image in the face image may be the position coordinate of a certain pixel point of the target first sub-image in the face image, or the average value of the position coordinates of the pixel points of the target first sub-image in the face image, which is not limited in the embodiment of the present application.
For example, as shown in fig. 5, since the terminal device performs non-overlapping cropping, each area of the preset size can be regarded as a whole, and the position identifiers corresponding to the first sub-images obtained at a, b, c, d, e, f, g, h and i in fig. 5 may be (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2) and (3, 3), respectively.
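A sketch of step 308 for the non-overlapped case of fig. 5, assuming each position identifier is the (row, column) grid index recorded at cropping time and that all second sub-images share the same size:

import numpy as np

def stitch_first_image(second_sub_images, positions):
    # positions[k] is the (row, column) identifier recorded when the k-th
    # first sub-image was cropped; its second sub-image is placed at the
    # same grid location in the first image.
    ph, pw = second_sub_images[0].shape[:2]
    rows = max(r for r, _ in positions)
    cols = max(c for _, c in positions)
    first_image = np.zeros((rows * ph, cols * pw) + second_sub_images[0].shape[2:],
                           dtype=second_sub_images[0].dtype)
    for patch, (r, c) in zip(second_sub_images, positions):
        first_image[(r - 1) * ph:r * ph, (c - 1) * pw:c * pw] = patch
    return first_image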
309. Performing local key point detection on the first image to obtain eye key points and mouth key points.
In the embodiment of the application, the terminal device may identify key points in the first image; since the eyes and the mouth generally change more than the other facial organs when the facial expression of the user changes, the terminal device may identify only the eye key points and the mouth key points.
Optionally, the terminal device may detect the first image through a face 68-key-point algorithm, which is a commonly used algorithm for identifying facial key points and generally determines 68 key points in a face image. Through the face 68-key-point algorithm, a plurality of key points may be marked in the first image, and these key points may describe the eye contour, eyebrow contour, nose contour, mouth contour and face contour of the user.
310. Extracting the first image according to the eye key points and the mouth key points to obtain an eye region map and a mouth region map.
In this embodiment of the application, after the terminal device acquires the eye key points and the mouth key points, the local regions respectively represented by the eye key points and the mouth key points in the first image may be extracted according to the local contours described by the eye key points and the mouth key points, so as to obtain an eye region map and a mouth region map.
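Steps 309 and 310 can be sketched with dlib's 68-point landmark predictor; the index ranges used below (36 to 47 for the eyes, 48 to 67 for the mouth) follow the common 68-point convention, and the predictor model file and margin value are assumptions:

import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed file

def extract_eye_and_mouth_maps(first_image, margin=5):
    rects = detector(first_image, 1)
    shape = predictor(first_image, rects[0])
    points = np.array([(p.x, p.y) for p in shape.parts()])

    def region(indices):
        # Bounding box of the selected key points, padded by a small margin.
        xs, ys = points[indices, 0], points[indices, 1]
        return first_image[max(ys.min() - margin, 0):ys.max() + margin,
                           max(xs.min() - margin, 0):xs.max() + margin]

    eye_region_map = region(np.arange(36, 48))    # eye key points
    mouth_region_map = region(np.arange(48, 68))  # mouth key points
    return eye_region_map, mouth_region_map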
311. Performing local feature extraction on the eye region map and the mouth region map respectively through a second neural network model to obtain local feature vectors respectively corresponding to the eye region map and the mouth region map.
In the embodiment of the present application, since the terminal device determines the eye region map and the mouth region map, the eye region map and the mouth region map may both be input into the second neural network model, and the local feature extraction is performed on the eye region map and the mouth region map by using the second neural network model, so as to obtain local feature vectors corresponding to the eye region map and the mouth region map, respectively.
Optionally, performing local feature extraction on the eye region map and the mouth region map respectively through the second neural network model to obtain local feature vectors respectively corresponding to the eye region map and the mouth region map may specifically include: sliding a preset sliding window in the second neural network model to a plurality of preset positions on the eye region map and the mouth region map respectively according to a preset sliding distance, and extracting local features at each preset position to obtain a plurality of local feature vectors respectively corresponding to the eye region map and the mouth region map.
Wherein the preset sliding window is sized according to the width and height of the eye area map and the mouth area map, respectively.
In this implementation manner, after the terminal device inputs both the eye region map and the mouth region map into the second neural network model, a preset sliding window may be set in the second neural network model. The preset sliding window may be a convolution kernel of the second neural network model, and it slides over the eye region map and the mouth region map respectively according to a preset sliding distance and extracts local features at each position. The preset sliding distance is the step length by which the preset sliding window slides: the larger the preset sliding distance, the smaller the computation workload of the second neural network model; the smaller the preset sliding distance, the smaller the computation error of the second neural network model and the higher the accuracy. The preset sliding distance may therefore be determined empirically or according to historical training data.
It should be noted that the terminal device may determine the size of the preset sliding window according to a second formula: RC = (a × w) × (a × h), where RC is the size of the preset sliding window, w is the width of the eye region map or the mouth region map, h is the corresponding height, and a is a preset window threshold that is greater than 0 and less than or equal to 1.
Optionally, as can be seen from the second formula, the size of the preset sliding window is determined by the width and height of the eye region map and the mouth region map; since the sizes of the eye region map and the mouth region map may differ, the preset sliding windows used for extracting local features from them may also differ in size.
Optionally, the preset window threshold may be determined empirically or based on historical training data: the larger the preset window threshold, the smaller the computation workload of the second neural network model; the smaller the preset window threshold, the smaller the computation error of the second neural network model and the higher the accuracy.
Further, performing local feature extraction at each preset position to obtain a plurality of local feature vectors corresponding to the eye region map and the mouth region map may specifically include: cropping the eye region map and the mouth region map at each preset position respectively to obtain a plurality of eye feature maps and a plurality of mouth feature maps; and performing local feature extraction on each eye feature map and each mouth feature map to obtain a plurality of eye feature vectors corresponding to the eye region map and a plurality of mouth feature vectors corresponding to the mouth region map.
In this implementation, the second neural network model uses the same feature extraction method for the eye region map and the mouth region map; the eye region map is taken as an example. When the preset sliding window slides to a preset position in the eye region map, the eye region map is cropped according to the size of the preset sliding window to obtain one eye feature map, so a plurality of eye feature maps can be obtained after the sliding is completed, and similarly a plurality of mouth feature maps can be obtained. The terminal device then extracts local features from the eye feature maps and the mouth feature maps, thereby obtaining a plurality of eye feature vectors corresponding to the eye region map and a plurality of mouth feature vectors corresponding to the mouth region map.
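The following sketch ties the second formula to the sliding-window extraction of this implementation: the window is (a × w) pixels wide and (a × h) pixels high, slides by a preset distance, and every cropped feature map is fed to a stand-in for the second neural network model whose architecture is assumed, not disclosed by the embodiment:

import numpy as np
import torch
import torch.nn as nn

class SecondNetworkSketch(nn.Module):
    # Hypothetical second neural network model: maps a cropped feature map
    # to a fixed-length local feature vector.
    def __init__(self, channels=3, feature_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4))
        self.fc = nn.Linear(32 * 4 * 4, feature_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

def local_feature_vectors(region_map, model, a=0.5, slide=8):
    # Second formula: the preset sliding window is (a * w) wide and (a * h)
    # high, with 0 < a <= 1, so RC = (a * w) x (a * h).
    h, w = region_map.shape[:2]
    win_h, win_w = max(int(a * h), 1), max(int(a * w), 1)
    vectors = []
    with torch.no_grad():
        for top in range(0, h - win_h + 1, slide):
            for left in range(0, w - win_w + 1, slide):
                crop = np.ascontiguousarray(
                    region_map[top:top + win_h, left:left + win_w])
                x = torch.from_numpy(crop).permute(2, 0, 1).float().unsqueeze(0) / 255.0
                vectors.append(model(x).squeeze(0))
    return vectors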
312. Determining the local expression classification probability corresponding to the face image according to the local feature vectors respectively corresponding to the eye region map and the mouth region map.
Optionally, since the terminal device obtains a plurality of eye feature vectors corresponding to the eye region map and a plurality of mouth feature vectors corresponding to the mouth region map, the eye feature vectors and the mouth feature vectors may be correspondingly input into the full-connection layer network model: in each input, one eye feature vector and one mouth feature vector are input into the full-connection layer network model to obtain one expression classification probability, so that inputting the eye feature vectors and the mouth feature vectors in sequence yields a plurality of expression classification probabilities. The terminal device then averages the plurality of expression classification probabilities, and the average value is the local expression classification probability.
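A sketch of step 312 under the assumption that each eye feature vector is paired with one mouth feature vector, the pair is concatenated and passed through a fully connected layer with softmax, and the resulting expression classification probabilities are averaged; the layer sizes and class count are assumptions:

import torch
import torch.nn as nn

def local_expression_probability(eye_vectors, mouth_vectors,
                                 feature_dim=128, num_classes=7):
    # Hypothetical full-connection layer network model: one linear layer over
    # the concatenation of one eye feature vector and one mouth feature vector.
    fc = nn.Linear(2 * feature_dim, num_classes)
    probs = []
    with torch.no_grad():
        for eye_vec, mouth_vec in zip(eye_vectors, mouth_vectors):
            logits = fc(torch.cat([eye_vec, mouth_vec]))
            probs.append(torch.softmax(logits, dim=0))
    # The local expression classification probability is the average of the
    # per-pair expression classification probabilities.
    return torch.stack(probs).mean(dim=0)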
313. Acquiring a first weight corresponding to the global expression classification probability and a second weight corresponding to the local expression classification probability.
314. Determining the target expression classification probability corresponding to the face image according to the global expression classification probability, the first weight, the local expression classification probability and the second weight.
315. Determining the facial expression corresponding to the face image according to the target expression classification probability.
In the embodiment of the present application, for the descriptions of steps 312 to 315, please refer to the detailed descriptions of steps 206 to 209 in the second embodiment, which is not repeated herein.
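Steps 313 to 315 reduce to a weighted sum followed by an argmax; a one-function sketch, assuming the two weights sum to 1 as required and using a hypothetical weight split and label set:

import torch

def fuse_and_classify(global_prob, local_prob, w1=0.6, w2=0.4,
                      labels=("angry", "disgust", "fear", "happy",
                              "sad", "surprise", "neutral")):
    # Target expression classification probability = w1 * global + w2 * local,
    # with w1 + w2 = 1; the facial expression is the class with the highest
    # target probability. The weight split and label set are assumptions.
    assert abs(w1 + w2 - 1.0) < 1e-6
    target_prob = w1 * global_prob + w2 * local_prob
    return labels[int(torch.argmax(target_prob))], target_prob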
Note that the following steps can be applied to the terminal device 12 in fig. 1A.
316. Determining a target rendering map corresponding to the facial expression.
In the embodiment of the application, after the terminal device determines the facial expression, a target rendering map corresponding to the facial expression may be determined from a plurality of pre-stored rendering maps.
Optionally, the target rendering map may be displayed in the shooting preview screen, or in the shooting interaction area. The target rendering map may be an image corresponding to the facial expression; for example, when the facial expression is crying, the target rendering map may be a network emoticon associated with crying, or a contrasting network emoticon such as laughing. The target rendering map may also be a partial map of a face corresponding to the facial expression, such as an eye rendering image or a mouth rendering image corresponding to the facial expression.
Optionally, the terminal device may determine multiple rendering maps corresponding to the facial expression and display them on a display screen of the terminal device; the user may select any one of the rendering maps, and the terminal device, in response to the user's selection operation, takes the selected rendering map as the target rendering map.
317. Rendering the face image through the target rendering map to obtain a target rendering face image.
In the embodiment of the application, the terminal device may synthesize the target rendering map with the face image, that is, render the face image through the target rendering map, so as to obtain the target rendering face image.
Optionally, rendering the face image through the target rendering map to obtain the target rendering face image may include, but is not limited to, the following implementation manners:
Implementation mode one: performing face recognition on the target rendering map, determining the five-sense-organ region maps in the target rendering map that correspond to the eye region map and the mouth region map in the face image, and replacing those five-sense-organ region maps in the target rendering map with the eye region map and the mouth region map of the face image to obtain the target rendering face image.
Illustratively, as shown in fig. 6, the target rendering map 61 includes a right-eye rendering map 611, a left-eye rendering map 612 and a mouth rendering map 613, where the right-eye rendering map 611 corresponds to the right-eye area map b in fig. 1C, the left-eye rendering map 612 corresponds to the left-eye area map c in fig. 1C, and the mouth rendering map 613 corresponds to the mouth area map d in fig. 1C. The terminal device may replace the right-eye rendering map 611 in the target rendering map 61 with the right-eye area map b in fig. 1C, replace the left-eye rendering map 612 with the left-eye area map c in fig. 1C, and replace the mouth rendering map 613 with the mouth area map d in fig. 1C, to obtain the target rendering face image 62.
Implementation mode two: performing face recognition on the target rendering map, determining the five-sense-organ region maps in the target rendering map that correspond to the eye region map and the mouth region map in the face image, and replacing the eye region map and the mouth region map in the face image with those five-sense-organ region maps from the target rendering map to obtain the target rendering face image.
In the two implementation modes, either the five-sense-organ region maps in the target rendering map are replaced with the eye region map and the mouth region map of the face image, or the eye region map and the mouth region map of the face image are replaced with the five-sense-organ region maps of the target rendering map; both achieve the effect of rendering the face image through the target rendering map, enrich the application scenarios of facial expression recognition, and effectively add interest.
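Implementation mode one can be sketched as a simple pixel-region replacement, assuming the five-sense-organ regions in the target rendering map have already been located as bounding boxes by face recognition and that the face image's eye and mouth region maps are resized to fit them; the dictionary keys and helper below are hypothetical:

import cv2

def render_mode_one(target_rendering, face_regions, rendering_boxes):
    # face_regions: {"left_eye": ..., "right_eye": ..., "mouth": ...} region
    # maps cut from the face image; rendering_boxes: matching (x, y, w, h)
    # boxes that face recognition found in the target rendering map
    # (both inputs are assumed to be given).
    rendered = target_rendering.copy()
    for name, (x, y, w, h) in rendering_boxes.items():
        patch = cv2.resize(face_regions[name], (w, h))
        rendered[y:y + h, x:x + w] = patch  # replace the five-sense-organ region
    return rendered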
318. Outputting the target rendered face image.
In this embodiment, the terminal device may display the target rendering face image on a display screen of the terminal device.
In the embodiment of the application, the face image can be cropped into a plurality of first sub-images, which are processed separately and then spliced into the first image; compared with processing the whole face image, processing the plurality of first sub-images makes the super-resolution processing and the noise reduction processing more refined. In addition, through the two branch frameworks of the global feature and the local feature, the global expression classification probability and the local expression classification probability are calculated separately and fused with their respective weights to determine the facial expression, and the image features are extracted in combination with a neural network model; this can effectively improve the resolution of the face image and the robustness to the environment, extract the global feature and the local feature more accurately, express the facial expression better, improve the classification accuracy, effectively reduce the influence of environmental factors on the global feature and the local feature, and improve the accuracy of facial expression detection.
As shown in fig. 7, an embodiment of the present application provides a facial expression recognition apparatus, including:
an obtaining module 701, configured to obtain a face image;
the feature extraction module 702 is configured to perform global feature extraction on the face image to obtain a global feature vector;
the processing module 703 is configured to determine a global expression classification probability corresponding to the face image according to the global feature vector;
the feature extraction module 702 is further configured to extract local features of the face image through the trained neural network model to obtain local feature vectors;
the processing module 703 is further configured to determine a local expression classification probability corresponding to the face image according to the local feature vector;
the processing module 703 is further configured to determine a target expression classification probability corresponding to the facial image according to the global expression classification probability and the local expression classification probability, and determine a facial expression corresponding to the facial image according to the target expression classification probability.
Optionally, the processing module 703 is specifically configured to perform super-resolution processing and noise reduction processing on the face image through the first neural network model to obtain a first image;
the feature extraction module 702 is specifically configured to perform local feature extraction on the first image through the second neural network model to obtain a local feature vector.
Optionally, the processing module 703 is specifically configured to perform local key point detection on the first image to obtain an eye key point and a mouth key point;
the processing module 703 is specifically configured to extract the first image according to the eye key points and the mouth key points to obtain an eye region map and a mouth region map;
the feature extraction module 702 is specifically configured to perform local feature extraction on the eye region map and the mouth region map through the second neural network model, so as to obtain local feature vectors corresponding to the eye region map and the mouth region map respectively.
Optionally, the feature extraction module 702 is specifically configured to slide to a plurality of preset positions on the eye region map and the mouth region map according to a preset sliding distance through a preset sliding window in the second neural network model, and perform local feature extraction at each preset position to obtain a plurality of local feature vectors corresponding to the eye region map and the mouth region map;
wherein the preset sliding window is sized according to the width and height of the eye area map and the mouth area map, respectively.
Optionally, the processing module 703 is specifically configured to respectively cut the eye region map and the mouth region map at each preset position to obtain a plurality of eye feature maps and a plurality of mouth feature maps;
the feature extraction module 702 is specifically configured to perform local feature extraction on each eye feature map and each mouth feature map to obtain a plurality of eye feature vectors corresponding to the plurality of eye region maps and a plurality of mouth feature vectors corresponding to the plurality of mouth region maps.
Optionally, the processing module 703 is specifically configured to correspondingly input the plurality of eye feature vectors and the plurality of mouth feature vectors into the full-connection layer network model respectively to obtain a plurality of expression classification probabilities corresponding to the plurality of eye feature vectors and the plurality of mouth feature vectors, where each expression classification probability corresponds to one eye feature vector and one mouth feature vector;
the processing module 703 is specifically configured to average the multiple expression classification probabilities to determine a local expression classification probability corresponding to the facial image.
Optionally, the processing module 703 is specifically configured to amplify the face image, and cut the amplified face image according to a preset direction and a preset size to obtain a plurality of first sub-images;
the processing module 703 is specifically configured to perform super-resolution processing and noise reduction processing on the plurality of first sub-images through the first neural network model, so as to obtain a plurality of second sub-images, where the plurality of second sub-images correspond to the plurality of first sub-images one to one;
the processing module 703 is specifically configured to splice the multiple second sub-images to obtain the first image.
Optionally, the obtaining module 701 is specifically configured to obtain a position identifier of each first sub-image in the face image;
the processing module 703 is specifically configured to splice the plurality of second sub-images according to the position identifiers, so as to obtain a first image;
the position identification of the target first sub-image in the face image is the same as the position identification of the target second sub-image in the first image, the target first sub-image is any one of the plurality of first sub-images, and the target second sub-image is an image corresponding to the target first sub-image in the plurality of second sub-images.
Optionally, the obtaining module 701 is specifically configured to obtain a first weight corresponding to the global expression classification probability and a second weight corresponding to the local expression classification probability, where a sum of the first weight and the second weight is 1;
the processing module 703 is specifically configured to determine a target expression classification probability corresponding to the face image according to the global expression classification probability, the first weight, the local expression classification probability, and the second weight.
Optionally, the processing module 703 is further configured to determine a target rendering corresponding to the facial expression;
the processing module 703 is further configured to render the face image through the target rendering image to obtain a target rendering face image;
the processing module 703 is further configured to output a target rendering face image.
In the embodiment of the present application, each module may implement the steps of the facial expression recognition method provided in the above method embodiment, and may achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
As shown in fig. 8, an embodiment of the present application further provides a terminal device, where the terminal device may include:
a memory 801 in which executable program code is stored;
a processor 802 coupled with the memory 801;
the processor 802 calls the executable program code stored in the memory 801 to execute the steps of the facial expression recognition method in the above-described embodiments of the methods.
Embodiments of the present application provide a computer-readable storage medium storing a computer program, where the computer program causes a computer to execute some or all of the steps of the method in the above method embodiments.
Embodiments of the present application also provide a computer program product, wherein when the computer program product runs on a computer, the computer is caused to execute some or all of the steps of the method as in the above method embodiments.
Embodiments of the present application further provide an application publishing platform, wherein the application publishing platform is configured to publish a computer program product, wherein when the computer program product runs on a computer, the computer is caused to perform some or all of the steps of the method as in the above method embodiments.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Those skilled in the art should also appreciate that the embodiments described in this specification are exemplary embodiments in nature, and that acts and modules are not necessarily required to practice the invention.
In various embodiments of the present application, it should be understood that the sequence numbers of the above-mentioned processes do not imply a necessary order of execution, and the order of execution of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated units, if implemented as software functional units and sold or used as stand-alone products, may be stored in a computer-accessible memory. Based on such understanding, the part of the technical solution of the present application that in essence contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like, and may specifically be a processor in the computer device) to execute part or all of the steps of the above-described methods of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the steps in the methods of the embodiments described above may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium. The storage medium includes a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, a magnetic disk memory, a tape memory, or any other computer-readable medium that can be used to carry or store data.

Claims (12)

1. A method for facial expression recognition, the method comprising:
acquiring a face image;
carrying out global feature extraction on the face image to obtain a global feature vector, and determining the global expression classification probability corresponding to the face image according to the global feature vector;
extracting local features of the facial image through the trained neural network model to obtain local feature vectors, and determining local expression classification probability corresponding to the facial image according to the local feature vectors;
and determining a target expression classification probability corresponding to the facial image according to the global expression classification probability and the local expression classification probability, and determining a facial expression corresponding to the facial image according to the target expression classification probability.
2. The method of claim 1, wherein the trained neural network model comprises a first neural network model and a second neural network model, and the extracting local features of the face image through the trained neural network model to obtain local feature vectors comprises:
performing super-resolution processing and noise reduction processing on the face image through the first neural network model to obtain a first image;
and performing local feature extraction on the first image through the second neural network model to obtain the local feature vector.
3. The method of claim 2, wherein the performing local feature extraction on the first image through the second neural network model to obtain the local feature vector comprises:
performing local key point detection on the first image to obtain eye key points and mouth key points;
extracting the first image according to the eye key points and the mouth key points to obtain an eye region map and a mouth region map;
and respectively performing local feature extraction on the eye region map and the mouth region map through the second neural network model to obtain local feature vectors respectively corresponding to the eye region map and the mouth region map.
4. The method according to claim 3, wherein the performing, by the second neural network model, local feature extraction on the eye region map and the mouth region map respectively to obtain local feature vectors corresponding to the eye region map and the mouth region map respectively comprises:
sliding a preset sliding window in the second neural network model to a plurality of preset positions on the eye region map and the mouth region map respectively according to a preset sliding distance, and performing local feature extraction at each preset position to obtain a plurality of local feature vectors respectively corresponding to the eye region map and the mouth region map;
wherein the preset sliding window is sized according to the width and height of the eye region map and the mouth region map, respectively.
5. The method according to claim 4, wherein the performing local feature extraction at each of the preset positions to obtain a plurality of local feature vectors corresponding to the eye region map and the mouth region map respectively comprises:
cutting the eye region image and the mouth region image at each preset position to obtain a plurality of eye feature images and a plurality of mouth feature images;
and performing local feature extraction on each eye feature map and each mouth feature map to obtain a plurality of eye feature vectors corresponding to the eye region maps and a plurality of mouth feature vectors corresponding to the mouth region maps.
6. The method of claim 5, wherein the determining the local expression classification probability corresponding to the facial image according to the local feature vector comprises:
correspondingly inputting the eye feature vectors and the mouth feature vectors into a full-connection layer network model respectively to obtain a plurality of expression classification probabilities corresponding to the eye feature vectors and the mouth feature vectors, wherein each expression classification probability corresponds to one eye feature vector and one mouth feature vector;
and averaging the expression classification probabilities to determine the local expression classification probability corresponding to the facial image.
7. The method of claim 2, wherein performing super-resolution processing and noise reduction processing on the face image through the first neural network model to obtain a first image comprises:
amplifying the face image, and cutting the amplified face image according to a preset direction and a preset size to obtain a plurality of first sub-images;
performing super-resolution processing and noise reduction processing on the plurality of first sub-images through the first neural network model to obtain a plurality of second sub-images, wherein the plurality of second sub-images correspond to the plurality of first sub-images one to one;
and splicing the plurality of second sub-images to obtain the first image.
8. The method of claim 7, wherein said stitching the plurality of second sub-images to obtain the first image comprises:
acquiring a position identifier of each first sub-image in the face image;
splicing the plurality of second sub-images respectively according to the position identifiers to obtain the first image;
the position identification of a target first sub-image in the face image is the same as the position identification of a target second sub-image in the first image, the target first sub-image is any one of the plurality of first sub-images, and the target second sub-image is an image corresponding to the target first sub-image in the plurality of second sub-images.
9. The method according to any one of claims 1 to 8, wherein the determining a target expression classification probability corresponding to the facial image according to the global expression classification probability and the local expression classification probability comprises:
acquiring a first weight corresponding to the global expression classification probability and a second weight corresponding to the local expression classification probability, wherein the sum of the first weight and the second weight is 1;
and determining the target expression classification probability corresponding to the face image according to the global expression classification probability, the first weight, the local expression classification probability and the second weight.
10. The method according to any one of claims 1 to 8, further comprising:
determining a target rendering map corresponding to the facial expression;
rendering the face image through the target rendering image to obtain a target rendering face image;
outputting the target rendered face image.
11. A terminal device, comprising:
a memory storing executable program code;
and a processor coupled to the memory;
the processor calls the executable program code stored in the memory for performing the steps of the facial expression recognition method according to any one of claims 1 to 10.
12. A computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the facial expression recognition method according to any one of claims 1 to 10.
CN202210738438.4A 2022-06-27 2022-06-27 Facial expression recognition method, terminal device and storage medium Pending CN115035581A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210738438.4A CN115035581A (en) 2022-06-27 2022-06-27 Facial expression recognition method, terminal device and storage medium
PCT/CN2022/140931 WO2024001095A1 (en) 2022-06-27 2022-12-22 Facial expression recognition method, terminal device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210738438.4A CN115035581A (en) 2022-06-27 2022-06-27 Facial expression recognition method, terminal device and storage medium

Publications (1)

Publication Number Publication Date
CN115035581A true CN115035581A (en) 2022-09-09

Family

ID=83126260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210738438.4A Pending CN115035581A (en) 2022-06-27 2022-06-27 Facial expression recognition method, terminal device and storage medium

Country Status (2)

Country Link
CN (1) CN115035581A (en)
WO (1) WO2024001095A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116128734A (en) * 2023-04-17 2023-05-16 湖南大学 Image stitching method, device, equipment and medium based on deep learning
CN117315749A (en) * 2023-09-25 2023-12-29 惠州市沃生照明有限公司 Intelligent light regulation and control method and system for desk lamp
WO2024001095A1 (en) * 2022-06-27 2024-01-04 闻泰通讯股份有限公司 Facial expression recognition method, terminal device and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102629321B (en) * 2012-03-29 2014-03-26 天津理工大学 Facial expression recognition method based on evidence theory
CN105095827B (en) * 2014-04-18 2019-05-17 汉王科技股份有限公司 Facial expression recognition device and method
US10679042B2 (en) * 2018-10-09 2020-06-09 Irene Rogan Shaffer Method and apparatus to accurately interpret facial expressions in American Sign Language
CN109934173B (en) * 2019-03-14 2023-11-21 腾讯科技(深圳)有限公司 Expression recognition method and device and electronic equipment
CN110580461A (en) * 2019-08-29 2019-12-17 桂林电子科技大学 Facial expression recognition algorithm combined with multilevel convolution characteristic pyramid
CN111144348A (en) * 2019-12-30 2020-05-12 腾讯科技(深圳)有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN115035581A (en) * 2022-06-27 2022-09-09 闻泰通讯股份有限公司 Facial expression recognition method, terminal device and storage medium


Also Published As

Publication number Publication date
WO2024001095A1 (en) 2024-01-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination